[PATCH 00/14] better handling of checksum errors/bitrot

Linux bcachefs list
 help / color / mirror / Atom feed

* [PATCH 00/14] better handling of checksum errors/bitrot
@ 2025-03-11 20:15 Kent Overstreet
  2025-03-11 20:15 ` [PATCH 01/14] bcachefs: Convert read path to standard error codes Kent Overstreet
                   ` (14 more replies)
  0 siblings, 15 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs, linux-block; +Cc: Kent Overstreet, Roland Vet, linux-fsdevel

Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
we've got an extent that can't be read, due to checksum error/bitrot.

This took some doing to fix properly, because

- We don't want to just delete such data (replace it with
  KEY_TYPE_error); we never want to delete anything except when
  explicitly told to by the user, and while we don't yet have an API for
  "read this file even though there's errors, just give me what you
  have" we definitely will in the future.

- Not being able to move data is a non-option: that would block copygc,
  device removal, etc.

- And, moving said extent requires giving it a new checksum - strictly
  required if the move has to fragment it, teaching the write/move path
  about extents with bad checksums is unpalateable, and anyways we'd
  like to be able to guard against more bitrot, if we can.

So that means:

- Extents need a poison bit: "reads return errors, even though it now
  has a good checksum" - this was added in a separate patch queued up
  for 6.15.

  It's an incompat feature because it's a new extent field, and old
  versions can't parse extents with unknown field types, since they
  won't know their sizes - meaning users will have to explicitly do an
  incompat upgrade to make use of this stuff.

- The read path needs to do additional retries after checksum errors
  before giving up and marking it poisoned, so that we don't
  accidentally convert a transient error to permanent corruption.

- The read path gets a whole bunch of work to plumb precise modern error
  codes around, so that e.g. the retry path, the data move path, and the
  "mark extent poisoned" path all know exactly what's going on.

- Read path is responsible for marking extents poisoned after sufficient
  retry attempts (controlled by a new filesystem option)

- Data move path is allowed to move extents after a read error, if it's
  a checksum error (giving it a new checksum) if it's been poisoned
  (i.e. the extent flags feature is enabled).

Code should be more or less finalized - still have more tests for corner
cases to write, but "write corrupt data and then tell rebalance to move
it to another device" works as expected.

TODO:

- NVME has a "read recovery level" attribute that controlls how hard the
  erasure coding algorithms work - we want that plumbed.

  Before we give up and move data that we know is bad, we need to try
  _as hard as possible_ to get a successful read.

Code currently lives in
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-testing

Kent Overstreet (14):
  bcachefs: Convert read path to standard error codes
  bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
  bcachefs: Read error message now indicates if it was for an internal
    move
  bcachefs: BCH_ERR_data_read_buffer_too_small
  bcachefs: Return errors to top level bch2_rbio_retry()
  bcachefs: Print message on successful read retry
  bcachefs: Don't create bch_io_failures unless it's needed
  bcachefs: Checksum errors get additional retries
  bcachefs: __bch2_read() now takes a btree_trans
  bcachefs: Poison extents that can't be read due to checksum errors
  bcachefs: Data move can read from poisoned extents
  bcachefs: Debug params for data corruption injection
  block: Allow REQ_FUA|REQ_READ
  bcachefs: Read retries are after checksum errors now REQ_FUA

 block/blk-core.c              |  19 +-
 fs/bcachefs/bcachefs_format.h |   2 +
 fs/bcachefs/btree_io.c        |   2 +-
 fs/bcachefs/errcode.h         |  19 +-
 fs/bcachefs/extents.c         | 157 +++++++++-------
 fs/bcachefs/extents.h         |   7 +-
 fs/bcachefs/extents_types.h   |  11 +-
 fs/bcachefs/io_read.c         | 325 +++++++++++++++++++++++++---------
 fs/bcachefs/io_read.h         |  21 +--
 fs/bcachefs/io_write.c        |  24 +++
 fs/bcachefs/move.c            |  26 ++-
 fs/bcachefs/opts.h            |   5 +
 fs/bcachefs/super-io.c        |   4 +
 fs/bcachefs/util.c            |  21 +++
 fs/bcachefs/util.h            |  12 ++
 15 files changed, 473 insertions(+), 182 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 01/14] bcachefs: Convert read path to standard error codes
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 02/14] bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path Kent Overstreet
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard
error codes that describe precisely which error occured.

This is going to be used for the data move path, to move but poison
extents with checksum errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/errcode.h | 13 ++++++
 fs/bcachefs/io_read.c | 93 ++++++++++++++++++++++++-------------------
 fs/bcachefs/io_read.h |  4 +-
 3 files changed, 67 insertions(+), 43 deletions(-)

diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h
index e14e0d1cc93d..5050d978624b 100644
--- a/fs/bcachefs/errcode.h
+++ b/fs/bcachefs/errcode.h
@@ -282,6 +282,19 @@
 	x(EIO,				EIO_fault_injected)			\
 	x(EIO,				ec_block_read)				\
 	x(EIO,				ec_block_write)				\
+	x(EIO,				data_read)				\
+	x(BCH_ERR_data_read,		data_read_retry)			\
+	x(BCH_ERR_data_read_retry,	data_read_retry_avoid)			\
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_device_offline)		\
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_io_err)			\
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_ec_reconstruct_err)	\
+	x(BCH_ERR_data_read_retry_avoid,data_read_retry_csum_err)		\
+	x(BCH_ERR_data_read_retry,	data_read_retry_csum_err_maybe_userspace)\
+	x(BCH_ERR_data_read,		data_read_decompress_err)		\
+	x(BCH_ERR_data_read,		data_read_decrypt_err)			\
+	x(BCH_ERR_data_read,		data_read_ptr_stale_race)		\
+	x(BCH_ERR_data_read_retry,	data_read_ptr_stale_retry)		\
+	x(BCH_ERR_data_read,		data_read_no_encryption_key)		\
 	x(BCH_ERR_btree_node_read_err,	btree_node_read_err_fixable)		\
 	x(BCH_ERR_btree_node_read_err,	btree_node_read_err_want_retry)		\
 	x(BCH_ERR_btree_node_read_err,	btree_node_read_err_must_retry)		\
diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 97df07add2cc..72a0a6f643d8 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -341,10 +341,6 @@ static void bch2_read_err_msg(struct bch_fs *c, struct printbuf *out,
 	bch2_trans_run(c, bch2_read_err_msg_trans(trans, out, rbio, read_pos));
 }
 
-#define READ_RETRY_AVOID	1
-#define READ_RETRY		2
-#define READ_ERR		3
-
 enum rbio_context {
 	RBIO_CONTEXT_NULL,
 	RBIO_CONTEXT_HIGHPRI,
@@ -446,7 +442,7 @@ static noinline void bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_
 err:
 	bch2_trans_iter_exit(trans, &iter);
 
-	if (ret == READ_RETRY)
+	if (bch2_err_matches(ret, BCH_ERR_data_read_retry))
 		goto retry;
 	if (ret)
 		rbio->bio.bi_status = BLK_STS_IOERR;
@@ -473,11 +469,13 @@ static void bch2_rbio_retry(struct work_struct *work)
 	this_cpu_add(c->counters[BCH_COUNTER_io_read_retry],
 		     bvec_iter_sectors(rbio->bvec_iter));
 
-	if (rbio->retry == READ_RETRY_AVOID)
+	if (bch2_err_matches(rbio->ret, BCH_ERR_data_read_retry_avoid))
 		bch2_mark_io_failure(&failed, &rbio->pick);
 
-	if (!rbio->split)
-		rbio->bio.bi_status = 0;
+	if (!rbio->split) {
+		rbio->bio.bi_status	= 0;
+		rbio->ret		= 0;
+	}
 
 	rbio = bch2_rbio_free(rbio);
 
@@ -492,23 +490,29 @@ static void bch2_rbio_retry(struct work_struct *work)
 		__bch2_read(c, rbio, iter, inum, &failed, flags);
 }
 
-static void bch2_rbio_error(struct bch_read_bio *rbio, int retry,
-			    blk_status_t error)
+static void bch2_rbio_error(struct bch_read_bio *rbio,
+			    int ret, blk_status_t blk_error)
 {
-	rbio->retry = retry;
-	rbio->saw_error = true;
+	BUG_ON(ret >= 0);
+
+	rbio->ret		= ret;
+	rbio->bio.bi_status	= blk_error;
+
+	bch2_rbio_parent(rbio)->saw_error = true;
 
 	if (rbio->flags & BCH_READ_in_retry)
 		return;
 
-	if (retry == READ_ERR) {
+	if (bch2_err_matches(ret, BCH_ERR_data_read_retry)) {
+		bch2_rbio_punt(rbio, bch2_rbio_retry,
+			       RBIO_CONTEXT_UNBOUND, system_unbound_wq);
+	} else {
 		rbio = bch2_rbio_free(rbio);
 
-		rbio->bio.bi_status = error;
+		rbio->ret		= ret;
+		rbio->bio.bi_status	= blk_error;
+
 		bch2_rbio_done(rbio);
-	} else {
-		bch2_rbio_punt(rbio, bch2_rbio_retry,
-			       RBIO_CONTEXT_UNBOUND, system_unbound_wq);
 	}
 }
 
@@ -530,7 +534,7 @@ static void bch2_read_io_err(struct work_struct *work)
 		bch_err_ratelimited(c, "%s", buf.buf);
 
 	printbuf_exit(&buf);
-	bch2_rbio_error(rbio, READ_RETRY_AVOID, bio->bi_status);
+	bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_io_err, bio->bi_status);
 }
 
 static int __bch2_rbio_narrow_crcs(struct btree_trans *trans,
@@ -617,7 +621,7 @@ static void bch2_read_csum_err(struct work_struct *work)
 	else
 		bch_err_ratelimited(c, "%s", buf.buf);
 
-	bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR);
+	bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_csum_err, BLK_STS_IOERR);
 	printbuf_exit(&buf);
 }
 
@@ -637,7 +641,7 @@ static void bch2_read_decompress_err(struct work_struct *work)
 	else
 		bch_err_ratelimited(c, "%s", buf.buf);
 
-	bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR);
+	bch2_rbio_error(rbio, -BCH_ERR_data_read_decompress_err, BLK_STS_IOERR);
 	printbuf_exit(&buf);
 }
 
@@ -657,7 +661,7 @@ static void bch2_read_decrypt_err(struct work_struct *work)
 	else
 		bch_err_ratelimited(c, "%s", buf.buf);
 
-	bch2_rbio_error(rbio, READ_ERR, BLK_STS_IOERR);
+	bch2_rbio_error(rbio, -BCH_ERR_data_read_decrypt_err, BLK_STS_IOERR);
 	printbuf_exit(&buf);
 }
 
@@ -698,7 +702,8 @@ static void __bch2_read_endio(struct work_struct *work)
 	 */
 	if (!csum_good && !rbio->bounce && (rbio->flags & BCH_READ_user_mapped)) {
 		rbio->flags |= BCH_READ_must_bounce;
-		bch2_rbio_error(rbio, READ_RETRY, BLK_STS_IOERR);
+		bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_csum_err_maybe_userspace,
+				BLK_STS_IOERR);
 		goto out;
 	}
 
@@ -812,9 +817,9 @@ static void bch2_read_endio(struct bio *bio)
 		trace_and_count(c, io_read_reuse_race, &rbio->bio);
 
 		if (rbio->flags & BCH_READ_retry_if_stale)
-			bch2_rbio_error(rbio, READ_RETRY, BLK_STS_AGAIN);
+			bch2_rbio_error(rbio, -BCH_ERR_data_read_ptr_stale_retry, BLK_STS_AGAIN);
 		else
-			bch2_rbio_error(rbio, READ_ERR, BLK_STS_AGAIN);
+			bch2_rbio_error(rbio, -BCH_ERR_data_read_ptr_stale_race, BLK_STS_AGAIN);
 		return;
 	}
 
@@ -887,7 +892,7 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	struct bch_read_bio *rbio = NULL;
 	bool bounce = false, read_full = false, narrow_crcs = false;
 	struct bpos data_pos = bkey_start_pos(k.k);
-	int pick_ret;
+	int ret = 0;
 
 	if (bkey_extent_is_inline_data(k.k)) {
 		unsigned bytes = min_t(unsigned, iter.bi_size,
@@ -903,16 +908,16 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 		goto out_read_done;
 	}
 retry_pick:
-	pick_ret = bch2_bkey_pick_read_device(c, k, failed, &pick, dev);
+	ret = bch2_bkey_pick_read_device(c, k, failed, &pick, dev);
 
 	/* hole or reservation - just zero fill: */
-	if (!pick_ret)
+	if (!ret)
 		goto hole;
 
-	if (unlikely(pick_ret < 0)) {
+	if (unlikely(ret < 0)) {
 		struct printbuf buf = PRINTBUF;
 		bch2_read_err_msg_trans(trans, &buf, orig, read_pos);
-		prt_printf(&buf, "no device to read from: %s\n  ", bch2_err_str(pick_ret));
+		prt_printf(&buf, "%s\n  ", bch2_err_str(ret));
 		bch2_bkey_val_to_text(&buf, c, k);
 
 		bch_err_ratelimited(c, "%s", buf.buf);
@@ -928,6 +933,7 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 
 		bch_err_ratelimited(c, "%s", buf.buf);
 		printbuf_exit(&buf);
+		ret = -BCH_ERR_data_read_no_encryption_key;
 		goto err;
 	}
 
@@ -1063,7 +1069,7 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	rbio->have_ioref	= ca != NULL;
 	rbio->narrow_crcs	= narrow_crcs;
 	rbio->hole		= 0;
-	rbio->retry		= 0;
+	rbio->ret		= 0;
 	rbio->context		= 0;
 	rbio->pick		= pick;
 	rbio->subvol		= orig->subvol;
@@ -1118,7 +1124,9 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 			bch_err_ratelimited(c, "%s", buf.buf);
 			printbuf_exit(&buf);
 
-			bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR);
+			bch2_rbio_error(rbio,
+					-BCH_ERR_data_read_retry_device_offline,
+					BLK_STS_IOERR);
 			goto out;
 		}
 
@@ -1144,7 +1152,8 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	} else {
 		/* Attempting reconstruct read: */
 		if (bch2_ec_read_extent(trans, rbio, k)) {
-			bch2_rbio_error(rbio, READ_RETRY_AVOID, BLK_STS_IOERR);
+			bch2_rbio_error(rbio, -BCH_ERR_data_read_retry_ec_reconstruct_err,
+					BLK_STS_IOERR);
 			goto out;
 		}
 
@@ -1162,13 +1171,11 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 		rbio->context = RBIO_CONTEXT_UNBOUND;
 		bch2_read_endio(&rbio->bio);
 
-		ret = rbio->retry;
+		ret = rbio->ret;
 		rbio = bch2_rbio_free(rbio);
 
-		if (ret == READ_RETRY_AVOID) {
+		if (bch2_err_matches(ret, BCH_ERR_data_read_retry_avoid))
 			bch2_mark_io_failure(failed, &pick);
-			ret = READ_RETRY;
-		}
 
 		if (!ret)
 			goto out_read_done;
@@ -1178,9 +1185,10 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 
 err:
 	if (flags & BCH_READ_in_retry)
-		return READ_ERR;
+		return ret;
 
-	orig->bio.bi_status = BLK_STS_IOERR;
+	orig->bio.bi_status	= BLK_STS_IOERR;
+	orig->ret		= ret;
 	goto out_read_done;
 
 hole:
@@ -1277,8 +1285,7 @@ void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 err:
 		if (ret &&
 		    !bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
-		    ret != READ_RETRY &&
-		    ret != READ_RETRY_AVOID)
+		    !bch2_err_matches(ret, BCH_ERR_data_read_retry))
 			break;
 	}
 
@@ -1289,11 +1296,13 @@ void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 		lockrestart_do(trans,
 			bch2_inum_offset_err_msg_trans(trans, &buf, inum,
 						       bvec_iter.bi_sector << 9));
-		prt_printf(&buf, "read error %i from btree lookup", ret);
+		prt_printf(&buf, "read error %s from btree lookup", bch2_err_str(ret));
 		bch_err_ratelimited(c, "%s", buf.buf);
 		printbuf_exit(&buf);
 
-		rbio->bio.bi_status = BLK_STS_IOERR;
+		rbio->bio.bi_status	= BLK_STS_IOERR;
+		rbio->ret		= ret;
+
 		bch2_rbio_done(rbio);
 	}
 
diff --git a/fs/bcachefs/io_read.h b/fs/bcachefs/io_read.h
index 73275da5d2c4..dd6694c2cf3f 100644
--- a/fs/bcachefs/io_read.h
+++ b/fs/bcachefs/io_read.h
@@ -42,11 +42,11 @@ struct bch_read_bio {
 				narrow_crcs:1,
 				hole:1,
 				saw_error:1,
-				retry:2,
 				context:2;
 	};
 	u16			_state;
 	};
+	s16			ret;
 
 	struct extent_ptr_decoded pick;
 
@@ -166,6 +166,7 @@ static inline struct bch_read_bio *rbio_init_fragment(struct bio *bio,
 
 	rbio->c		= orig->c;
 	rbio->_state	= 0;
+	rbio->ret	= 0;
 	rbio->split	= true;
 	rbio->parent	= orig;
 	rbio->opts	= orig->opts;
@@ -182,6 +183,7 @@ static inline struct bch_read_bio *rbio_init(struct bio *bio,
 	rbio->start_time	= local_clock();
 	rbio->c			= c;
 	rbio->_state		= 0;
+	rbio->ret	= 0;
 	rbio->opts		= opts;
 	rbio->bio.bi_end_io	= end_io;
 	return rbio;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 02/14] bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
  2025-03-11 20:15 ` [PATCH 01/14] bcachefs: Convert read path to standard error codes Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 03/14] bcachefs: Read error message now indicates if it was for an internal move Kent Overstreet
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

When we do a read to a buffer that's mapped into userspace, it's
possible to get a spurious checksum error if userspace was modified the
buffer at the same time.

When we retry those, they have to be bounced before we know definitively
whether we're reading corrupt data.

But the retry path propagates read flags differently, so needs special
handling.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 72a0a6f643d8..af963999e1a7 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -1283,6 +1283,9 @@ void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 		swap(bvec_iter.bi_size, bytes);
 		bio_advance_iter(&rbio->bio, &bvec_iter, bytes);
 err:
+		if (ret == -BCH_ERR_data_read_retry_csum_err_maybe_userspace)
+			flags |= BCH_READ_must_bounce;
+
 		if (ret &&
 		    !bch2_err_matches(ret, BCH_ERR_transaction_restart) &&
 		    !bch2_err_matches(ret, BCH_ERR_data_read_retry))
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 03/14] bcachefs: Read error message now indicates if it was for an internal move
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
  2025-03-11 20:15 ` [PATCH 01/14] bcachefs: Convert read path to standard error codes Kent Overstreet
  2025-03-11 20:15 ` [PATCH 02/14] bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 04/14] bcachefs: BCH_ERR_data_read_buffer_too_small Kent Overstreet
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index af963999e1a7..03baab542abb 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -329,10 +329,17 @@ static struct bch_read_bio *promote_alloc(struct btree_trans *trans,
 static int bch2_read_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
 				   struct bch_read_bio *rbio, struct bpos read_pos)
 {
-	return lockrestart_do(trans,
+	int ret = lockrestart_do(trans,
 		bch2_inum_offset_err_msg_trans(trans, out,
 				(subvol_inum) { rbio->subvol, read_pos.inode },
 				read_pos.offset << 9));
+	if (ret)
+		return ret;
+
+	if (rbio->flags & BCH_READ_data_update)
+		prt_str(out, "(internal move) ");
+
+	return 0;
 }
 
 static void bch2_read_err_msg(struct bch_fs *c, struct printbuf *out,
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 04/14] bcachefs: BCH_ERR_data_read_buffer_too_small
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (2 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 03/14] bcachefs: Read error message now indicates if it was for an internal move Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 05/14] bcachefs: Return errors to top level bch2_rbio_retry() Kent Overstreet
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Now that the read path uses proper error codes, we can get rid of the
weird rbio->hole signalling to the move path that the read didn't
happen.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/errcode.h | 2 ++
 fs/bcachefs/io_read.c | 9 ++++-----
 fs/bcachefs/io_read.h | 1 -
 fs/bcachefs/move.c    | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h
index 5050d978624b..afa16d58041e 100644
--- a/fs/bcachefs/errcode.h
+++ b/fs/bcachefs/errcode.h
@@ -295,6 +295,8 @@
 	x(BCH_ERR_data_read,		data_read_ptr_stale_race)		\
 	x(BCH_ERR_data_read_retry,	data_read_ptr_stale_retry)		\
 	x(BCH_ERR_data_read,		data_read_no_encryption_key)		\
+	x(BCH_ERR_data_read,		data_read_buffer_too_small)		\
+	x(BCH_ERR_data_read,		data_read_key_overwritten)		\
 	x(BCH_ERR_btree_node_read_err,	btree_node_read_err_fixable)		\
 	x(BCH_ERR_btree_node_read_err,	btree_node_read_err_want_retry)		\
 	x(BCH_ERR_btree_node_read_err,	btree_node_read_err_must_retry)		\
diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 03baab542abb..8f514d7e4021 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -437,7 +437,7 @@ static noinline void bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_
 
 	if (!bkey_and_val_eq(k, bkey_i_to_s_c(u->k.k))) {
 		/* extent we wanted to read no longer exists: */
-		rbio->hole = true;
+		rbio->ret = -BCH_ERR_data_read_key_overwritten;
 		goto err;
 	}
 
@@ -992,10 +992,10 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 		 */
 		struct data_update *u = container_of(orig, struct data_update, rbio);
 		if (pick.crc.compressed_size > u->op.wbio.bio.bi_iter.bi_size) {
-			BUG();
 			if (ca)
 				percpu_ref_put(&ca->io_ref);
-			goto hole;
+			rbio->ret = -BCH_ERR_data_read_buffer_too_small;
+			goto out_read_done;
 		}
 
 		iter.bi_size	= pick.crc.compressed_size << 9;
@@ -1075,7 +1075,6 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	rbio->flags		= flags;
 	rbio->have_ioref	= ca != NULL;
 	rbio->narrow_crcs	= narrow_crcs;
-	rbio->hole		= 0;
 	rbio->ret		= 0;
 	rbio->context		= 0;
 	rbio->pick		= pick;
@@ -1207,7 +1206,7 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	 * to read no longer exists we have to signal that:
 	 */
 	if (flags & BCH_READ_data_update)
-		orig->hole = true;
+		orig->ret = -BCH_ERR_data_read_key_overwritten;
 
 	zero_fill_bio_iter(&orig->bio, iter);
 out_read_done:
diff --git a/fs/bcachefs/io_read.h b/fs/bcachefs/io_read.h
index dd6694c2cf3f..b29fedd85a9c 100644
--- a/fs/bcachefs/io_read.h
+++ b/fs/bcachefs/io_read.h
@@ -40,7 +40,6 @@ struct bch_read_bio {
 				split:1,
 				have_ioref:1,
 				narrow_crcs:1,
-				hole:1,
 				saw_error:1,
 				context:2;
 	};
diff --git a/fs/bcachefs/move.c b/fs/bcachefs/move.c
index 0787d04a5fc3..8cfd842a86cf 100644
--- a/fs/bcachefs/move.c
+++ b/fs/bcachefs/move.c
@@ -125,8 +125,8 @@ static void move_write(struct moving_io *io)
 				     &ctxt->stats->sectors_error_corrected);
 	}
 
-	if (unlikely(io->write.rbio.bio.bi_status ||
-		     io->write.rbio.hole ||
+	if (unlikely(io->write.rbio.ret ||
+		     io->write.rbio.bio.bi_status ||
 		     io->write.data_opts.scrub)) {
 		move_free(io);
 		return;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 05/14] bcachefs: Return errors to top level bch2_rbio_retry()
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (3 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 04/14] bcachefs: BCH_ERR_data_read_buffer_too_small Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 06/14] bcachefs: Print message on successful read retry Kent Overstreet
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Next patch will be adding an additional retry loop for checksum errors,
so that we can rule out transient errors before marking an extent as
poisoned.

Prerequisite to this is returning errors to bch2_rbio_retry(); this will
also let us add a "successful retry" message.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 41 ++++++++++++++++++++++++++---------------
 fs/bcachefs/io_read.h |  4 ++--
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 8f514d7e4021..8764e88ef574 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -416,7 +416,7 @@ static void bch2_rbio_done(struct bch_read_bio *rbio)
 	bio_endio(&rbio->bio);
 }
 
-static noinline void bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_bio *rbio,
+static noinline int bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_bio *rbio,
 				     struct bvec_iter bvec_iter,
 				     struct bch_io_failures *failed,
 				     unsigned flags)
@@ -451,12 +451,16 @@ static noinline void bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_
 
 	if (bch2_err_matches(ret, BCH_ERR_data_read_retry))
 		goto retry;
-	if (ret)
-		rbio->bio.bi_status = BLK_STS_IOERR;
+
+	if (ret) {
+		rbio->bio.bi_status	= BLK_STS_IOERR;
+		rbio->ret		= ret;
+	}
 
 	BUG_ON(atomic_read(&rbio->bio.__bi_remaining) != 1);
-	bch2_rbio_done(rbio);
 	bch2_trans_put(trans);
+
+	return ret;
 }
 
 static void bch2_rbio_retry(struct work_struct *work)
@@ -491,10 +495,16 @@ static void bch2_rbio_retry(struct work_struct *work)
 	flags &= ~BCH_READ_last_fragment;
 	flags |= BCH_READ_must_clone;
 
-	if (flags & BCH_READ_data_update)
-		bch2_read_retry_nodecode(c, rbio, iter, &failed, flags);
-	else
-		__bch2_read(c, rbio, iter, inum, &failed, flags);
+	int ret = flags & BCH_READ_data_update
+		? bch2_read_retry_nodecode(c, rbio, iter, &failed, flags)
+		: __bch2_read(c, rbio, iter, inum, &failed, flags);
+
+	if (ret) {
+		rbio->ret = ret;
+		rbio->bio.bi_status = BLK_STS_IOERR;
+	}
+
+	bch2_rbio_done(rbio);
 }
 
 static void bch2_rbio_error(struct bch_read_bio *rbio,
@@ -1183,9 +1193,6 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 		if (bch2_err_matches(ret, BCH_ERR_data_read_retry_avoid))
 			bch2_mark_io_failure(failed, &pick);
 
-		if (!ret)
-			goto out_read_done;
-
 		return ret;
 	}
 
@@ -1210,12 +1217,13 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 
 	zero_fill_bio_iter(&orig->bio, iter);
 out_read_done:
-	if (flags & BCH_READ_last_fragment)
+	if ((flags & BCH_READ_last_fragment) &&
+	    !(flags & BCH_READ_in_retry))
 		bch2_rbio_done(orig);
 	return 0;
 }
 
-void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
+int __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 		 struct bvec_iter bvec_iter, subvol_inum inum,
 		 struct bch_io_failures *failed, unsigned flags)
 {
@@ -1305,18 +1313,21 @@ void __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 		lockrestart_do(trans,
 			bch2_inum_offset_err_msg_trans(trans, &buf, inum,
 						       bvec_iter.bi_sector << 9));
-		prt_printf(&buf, "read error %s from btree lookup", bch2_err_str(ret));
+		prt_printf(&buf, "read error: %s", bch2_err_str(ret));
 		bch_err_ratelimited(c, "%s", buf.buf);
 		printbuf_exit(&buf);
 
 		rbio->bio.bi_status	= BLK_STS_IOERR;
 		rbio->ret		= ret;
 
-		bch2_rbio_done(rbio);
+		if (!(flags & BCH_READ_in_retry))
+			bch2_rbio_done(rbio);
 	}
 
 	bch2_trans_put(trans);
 	bch2_bkey_buf_exit(&sk, c);
+
+	return ret;
 }
 
 void bch2_fs_io_read_exit(struct bch_fs *c)
diff --git a/fs/bcachefs/io_read.h b/fs/bcachefs/io_read.h
index b29fedd85a9c..6e5ea5b45a54 100644
--- a/fs/bcachefs/io_read.h
+++ b/fs/bcachefs/io_read.h
@@ -140,8 +140,8 @@ static inline void bch2_read_extent(struct btree_trans *trans,
 			   data_btree, k, offset_into_extent, NULL, flags, -1);
 }
 
-void __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter,
-		 subvol_inum, struct bch_io_failures *, unsigned flags);
+int __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter,
+		subvol_inum, struct bch_io_failures *, unsigned flags);
 
 static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 			     subvol_inum inum)
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 06/14] bcachefs: Print message on successful read retry
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (4 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 05/14] bcachefs: Return errors to top level bch2_rbio_retry() Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 07/14] bcachefs: Don't create bch_io_failures unless it's needed Kent Overstreet
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Users have been asking for this, and now that errors are returned to the
top level read retry path - we can.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 8764e88ef574..61fb3e558450 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -488,6 +488,9 @@ static void bch2_rbio_retry(struct work_struct *work)
 		rbio->ret		= 0;
 	}
 
+	unsigned subvol		= rbio->subvol;
+	struct bpos read_pos	= rbio->read_pos;
+
 	rbio = bch2_rbio_free(rbio);
 
 	flags |= BCH_READ_in_retry;
@@ -502,6 +505,19 @@ static void bch2_rbio_retry(struct work_struct *work)
 	if (ret) {
 		rbio->ret = ret;
 		rbio->bio.bi_status = BLK_STS_IOERR;
+	} else {
+		struct printbuf buf = PRINTBUF;
+
+		bch2_trans_do(c,
+			bch2_inum_offset_err_msg_trans(trans, &buf,
+					(subvol_inum) { subvol, read_pos.inode },
+					read_pos.offset << 9));
+		if (rbio->flags & BCH_READ_data_update)
+			prt_str(&buf, "(internal move) ");
+		prt_str(&buf, "successful retry");
+
+		bch_err_ratelimited(c, "%s", buf.buf);
+		printbuf_exit(&buf);
 	}
 
 	bch2_rbio_done(rbio);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 07/14] bcachefs: Don't create bch_io_failures unless it's needed
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (5 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 06/14] bcachefs: Print message on successful read retry Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 08/14] bcachefs: Checksum errors get additional retries Kent Overstreet
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Only needed in retry path, no point in wasting stack space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/bcachefs/io_read.h b/fs/bcachefs/io_read.h
index 6e5ea5b45a54..42a22985d789 100644
--- a/fs/bcachefs/io_read.h
+++ b/fs/bcachefs/io_read.h
@@ -146,13 +146,11 @@ int __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter,
 static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 			     subvol_inum inum)
 {
-	struct bch_io_failures failed = { .nr = 0 };
-
 	BUG_ON(rbio->_state);
 
 	rbio->subvol = inum.subvol;
 
-	__bch2_read(c, rbio, rbio->bio.bi_iter, inum, &failed,
+	__bch2_read(c, rbio, rbio->bio.bi_iter, inum, NULL,
 		    BCH_READ_retry_if_stale|
 		    BCH_READ_may_promote|
 		    BCH_READ_user_mapped);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 08/14] bcachefs: Checksum errors get additional retries
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (6 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 07/14] bcachefs: Don't create bch_io_failures unless it's needed Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 09/14] bcachefs: __bch2_read() now takes a btree_trans Kent Overstreet
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

It's possible for checksum errors to be transient - e.g. flakey
controller or cable, thus we need additional retries (besides retrying
from different replicas) before we can definitely return an error.

This is particularly important for the next patch, which will allow the
data move path to move extents with checksum errors - we don't want to
accidentally introduce bitrot due to a transient error!

- bch2_bkey_pick_read_device() is substantially reworked, and
  bch2_dev_io_failures is expanded to record more information about the
  type of failure (i.e. number of checksum errors).

  It now returns an error code that describes more precisely the reason
  for the failure - checksum error, io error, or offline device, instead
  of the previous generic "insufficient devices". This is important for
  the next patches that add poisoning, as we only want to poison extents
  when we've got real checksum errors (or perhaps IO errors?) - not
  because a device was offline.

- Add a new option and superblock field for the number of checksum
  retries.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/bcachefs_format.h |   2 +
 fs/bcachefs/btree_io.c        |   2 +-
 fs/bcachefs/errcode.h         |   4 +-
 fs/bcachefs/extents.c         | 155 ++++++++++++++++++++--------------
 fs/bcachefs/extents.h         |   7 +-
 fs/bcachefs/extents_types.h   |  11 +--
 fs/bcachefs/io_read.c         |  10 ++-
 fs/bcachefs/opts.h            |   5 ++
 fs/bcachefs/super-io.c        |   4 +
 9 files changed, 121 insertions(+), 79 deletions(-)

diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index 7a5b0d211a82..e96d87767020 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -842,6 +842,7 @@ LE64_BITMASK(BCH_SB_SHARD_INUMS,	struct bch_sb, flags[3], 28, 29);
 LE64_BITMASK(BCH_SB_INODES_USE_KEY_CACHE,struct bch_sb, flags[3], 29, 30);
 LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DELAY,struct bch_sb, flags[3], 30, 62);
 LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DISABLED,struct bch_sb, flags[3], 62, 63);
+/* one free bit */
 LE64_BITMASK(BCH_SB_JOURNAL_RECLAIM_DELAY,struct bch_sb, flags[4], 0, 32);
 LE64_BITMASK(BCH_SB_JOURNAL_TRANSACTION_NAMES,struct bch_sb, flags[4], 32, 33);
 LE64_BITMASK(BCH_SB_NOCOW,		struct bch_sb, flags[4], 33, 34);
@@ -861,6 +862,7 @@ LE64_BITMASK(BCH_SB_VERSION_INCOMPAT_ALLOWED,
 					struct bch_sb, flags[5], 48, 64);
 LE64_BITMASK(BCH_SB_SHARD_INUMS_NBITS,	struct bch_sb, flags[6],  0,  4);
 LE64_BITMASK(BCH_SB_WRITE_ERROR_TIMEOUT,struct bch_sb, flags[6],  4, 14);
+LE64_BITMASK(BCH_SB_CSUM_ERR_RETRY_NR,	struct bch_sb, flags[6], 14, 20);
 
 static inline __u64 BCH_SB_COMPRESSION_TYPE(const struct bch_sb *sb)
 {
diff --git a/fs/bcachefs/btree_io.c b/fs/bcachefs/btree_io.c
index cd792fee7ee3..6abc9f17ea3c 100644
--- a/fs/bcachefs/btree_io.c
+++ b/fs/bcachefs/btree_io.c
@@ -1355,7 +1355,7 @@ static void btree_node_read_work(struct work_struct *work)
 			percpu_ref_put(&ca->io_ref);
 		rb->have_ioref = false;
 
-		bch2_mark_io_failure(&failed, &rb->pick);
+		bch2_mark_io_failure(&failed, &rb->pick, false);
 
 		can_retry = bch2_bkey_pick_read_device(c,
 				bkey_i_to_s_c(&b->key),
diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h
index afa16d58041e..493cae4efc37 100644
--- a/fs/bcachefs/errcode.h
+++ b/fs/bcachefs/errcode.h
@@ -273,7 +273,6 @@
 	x(EIO,				stripe_reconstruct)			\
 	x(EIO,				key_type_error)				\
 	x(EIO,				extent_poisened)			\
-	x(EIO,				no_device_to_read_from)			\
 	x(EIO,				missing_indirect_extent)		\
 	x(EIO,				invalidate_stripe_to_dev)		\
 	x(EIO,				no_encryption_key)			\
@@ -283,6 +282,9 @@
 	x(EIO,				ec_block_read)				\
 	x(EIO,				ec_block_write)				\
 	x(EIO,				data_read)				\
+	x(BCH_ERR_data_read,		no_device_to_read_from)			\
+	x(BCH_ERR_data_read,		data_read_io_err)			\
+	x(BCH_ERR_data_read,		data_read_csum_err)			\
 	x(BCH_ERR_data_read,		data_read_retry)			\
 	x(BCH_ERR_data_read_retry,	data_read_retry_avoid)			\
 	x(BCH_ERR_data_read_retry_avoid,data_read_retry_device_offline)		\
diff --git a/fs/bcachefs/extents.c b/fs/bcachefs/extents.c
index d5f3457179c5..1df0e034e988 100644
--- a/fs/bcachefs/extents.c
+++ b/fs/bcachefs/extents.c
@@ -58,7 +58,8 @@ struct bch_dev_io_failures *bch2_dev_io_failures(struct bch_io_failures *f,
 }
 
 void bch2_mark_io_failure(struct bch_io_failures *failed,
-			  struct extent_ptr_decoded *p)
+			  struct extent_ptr_decoded *p,
+			  bool csum_error)
 {
 	struct bch_dev_io_failures *f = bch2_dev_io_failures(failed, p->ptr.dev);
 
@@ -66,17 +67,16 @@ void bch2_mark_io_failure(struct bch_io_failures *failed,
 		BUG_ON(failed->nr >= ARRAY_SIZE(failed->devs));
 
 		f = &failed->devs[failed->nr++];
-		f->dev		= p->ptr.dev;
-		f->idx		= p->idx;
-		f->nr_failed	= 1;
-		f->nr_retries	= 0;
-	} else if (p->idx != f->idx) {
-		f->idx		= p->idx;
-		f->nr_failed	= 1;
-		f->nr_retries	= 0;
-	} else {
-		f->nr_failed++;
+		memset(f, 0, sizeof(*f));
+		f->dev = p->ptr.dev;
 	}
+
+	if (p->do_ec_reconstruct)
+		f->failed_ec = true;
+	else if (!csum_error)
+		f->failed_io = true;
+	else
+		f->failed_csum_nr++;
 }
 
 static inline u64 dev_latency(struct bch_dev *ca)
@@ -96,35 +96,37 @@ static inline bool ptr_better(struct bch_fs *c,
 			      const struct extent_ptr_decoded p1,
 			      const struct extent_ptr_decoded p2)
 {
-	if (likely(!p1.idx && !p2.idx)) {
-		struct bch_dev *ca1 = bch2_dev_rcu(c, p1.ptr.dev);
-		struct bch_dev *ca2 = bch2_dev_rcu(c, p2.ptr.dev);
+	struct bch_dev *ca1 = bch2_dev_rcu(c, p1.ptr.dev);
+	struct bch_dev *ca2 = bch2_dev_rcu(c, p2.ptr.dev);
 
-		int failed_delta = dev_failed(ca1) - dev_failed(ca2);
+	int failed_delta = dev_failed(ca1) - dev_failed(ca2);
+	if (unlikely(failed_delta))
+		return failed_delta < 0;
 
-		if (failed_delta)
-			return failed_delta < 0;
+	if (unlikely(bch2_force_reconstruct_read))
+		return p1.do_ec_reconstruct > p2.do_ec_reconstruct;
 
-		u64 l1 = dev_latency(ca1);
-		u64 l2 = dev_latency(ca2);
+	if (unlikely(p1.do_ec_reconstruct || p2.do_ec_reconstruct))
+		return p1.do_ec_reconstruct < p2.do_ec_reconstruct;
 
-		/*
-		 * Square the latencies, to bias more in favor of the faster
-		 * device - we never want to stop issuing reads to the slower
-		 * device altogether, so that we can update our latency numbers:
-		 */
-		l1 *= l1;
-		l2 *= l2;
+	int crc_retry_delta = (int) p1.crc_retry_nr - (int) p2.crc_retry_nr;
+	if (unlikely(crc_retry_delta))
+		return crc_retry_delta < 0;
 
-		/* Pick at random, biased in favor of the faster device: */
+	u64 l1 = dev_latency(ca1);
+	u64 l2 = dev_latency(ca2);
 
-		return bch2_rand_range(l1 + l2) > l1;
-	}
+	/*
+	 * Square the latencies, to bias more in favor of the faster
+	 * device - we never want to stop issuing reads to the slower
+	 * device altogether, so that we can update our latency numbers:
+	 */
+	l1 *= l1;
+	l2 *= l2;
 
-	if (bch2_force_reconstruct_read)
-		return p1.idx > p2.idx;
+	/* Pick at random, biased in favor of the faster device: */
 
-	return p1.idx < p2.idx;
+	return bch2_rand_range(l1 + l2) > l1;
 }
 
 /*
@@ -133,73 +135,96 @@ static inline bool ptr_better(struct bch_fs *c,
  * other devices, it will still pick a pointer from avoid.
  */
 int bch2_bkey_pick_read_device(struct bch_fs *c, struct bkey_s_c k,
-			      struct bch_io_failures *failed,
-			      struct extent_ptr_decoded *pick,
-			      int dev)
+			       struct bch_io_failures *failed,
+			       struct extent_ptr_decoded *pick,
+			       int dev)
 {
-	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
-	const union bch_extent_entry *entry;
-	struct extent_ptr_decoded p;
-	struct bch_dev_io_failures *f;
-	int ret = 0;
+	bool have_csum_errors = false, have_io_errors = false, have_missing_devs = false;
+	bool have_dirty_ptrs = false, have_pick = false;
 
 	if (k.k->type == KEY_TYPE_error)
 		return -BCH_ERR_key_type_error;
 
+	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
+
 	if (bch2_bkey_extent_ptrs_flags(ptrs) & BIT_ULL(BCH_EXTENT_FLAG_poisoned))
 		return -BCH_ERR_extent_poisened;
 
 	rcu_read_lock();
+	const union bch_extent_entry *entry;
+	struct extent_ptr_decoded p;
+
 	bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
+		have_dirty_ptrs |= !p.ptr.cached;
+
 		/*
 		 * Unwritten extent: no need to actually read, treat it as a
 		 * hole and return 0s:
 		 */
 		if (p.ptr.unwritten) {
-			ret = 0;
-			break;
+			rcu_read_unlock();
+			return 0;
 		}
 
 		/* Are we being asked to read from a specific device? */
 		if (dev >= 0 && p.ptr.dev != dev)
 			continue;
 
-		/*
-		 * If there are any dirty pointers it's an error if we can't
-		 * read:
-		 */
-		if (!ret && !p.ptr.cached)
-			ret = -BCH_ERR_no_device_to_read_from;
-
 		struct bch_dev *ca = bch2_dev_rcu(c, p.ptr.dev);
 
 		if (p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr)))
 			continue;
 
-		f = failed ? bch2_dev_io_failures(failed, p.ptr.dev) : NULL;
-		if (f)
-			p.idx = f->nr_failed < f->nr_retries
-				? f->idx
-				: f->idx + 1;
+		struct bch_dev_io_failures *f =
+			unlikely(failed) ? bch2_dev_io_failures(failed, p.ptr.dev) : NULL;
+		if (unlikely(f)) {
+			p.crc_retry_nr	   = f->failed_csum_nr;
+			p.has_ec	  &= ~f->failed_ec;
 
-		if (!p.idx && (!ca || !bch2_dev_is_online(ca)))
-			p.idx++;
+			if (ca && ca->mi.state != BCH_MEMBER_STATE_failed) {
+				have_io_errors	|= f->failed_io;
+				have_io_errors	|= f->failed_ec;
+			}
+			have_csum_errors	|= !!f->failed_csum_nr;
 
-		if (!p.idx && p.has_ec && bch2_force_reconstruct_read)
-			p.idx++;
+			if (p.has_ec && (f->failed_io || f->failed_csum_nr))
+				p.do_ec_reconstruct = true;
+			else if (f->failed_io ||
+				 f->failed_csum_nr > c->opts.checksum_err_retry_nr)
+				continue;
+		}
 
-		if (p.idx > (unsigned) p.has_ec)
-			continue;
+		have_missing_devs |= ca && !bch2_dev_is_online(ca);
 
-		if (ret > 0 && !ptr_better(c, p, *pick))
-			continue;
+		if (!ca || !bch2_dev_is_online(ca)) {
+			if (!p.has_ec)
+				continue;
+			p.do_ec_reconstruct = true;
+		}
+
+		if (bch2_force_reconstruct_read && p.has_ec)
+			p.do_ec_reconstruct = true;
 
-		*pick = p;
-		ret = 1;
+		if (!have_pick || ptr_better(c, p, *pick)) {
+			*pick = p;
+			have_pick = true;
+		}
 	}
 	rcu_read_unlock();
 
-	return ret;
+	if (have_pick)
+		return 1;
+	if (!have_dirty_ptrs)
+		return 0;
+	if (have_missing_devs)
+		return -BCH_ERR_no_device_to_read_from;
+	if (have_csum_errors)
+		return -BCH_ERR_data_read_csum_err;
+	if (have_io_errors)
+		return -BCH_ERR_data_read_io_err;
+
+	WARN_ONCE(1, "unhandled error case in %s\n", __func__);
+	return -EINVAL;
 }
 
 /* KEY_TYPE_btree_ptr: */
diff --git a/fs/bcachefs/extents.h b/fs/bcachefs/extents.h
index c50c4f353bab..e78a39e7e18f 100644
--- a/fs/bcachefs/extents.h
+++ b/fs/bcachefs/extents.h
@@ -320,8 +320,9 @@ static inline struct bkey_ptrs bch2_bkey_ptrs(struct bkey_s k)
 ({									\
 	__label__ out;							\
 									\
-	(_ptr).idx	= 0;						\
-	(_ptr).has_ec	= false;					\
+	(_ptr).has_ec			= false;			\
+	(_ptr).do_ec_reconstruct	= false;			\
+	(_ptr).crc_retry_nr		= 0;				\
 									\
 	__bkey_extent_entry_for_each_from(_entry, _end, _entry)		\
 		switch (__extent_entry_type(_entry)) {			\
@@ -401,7 +402,7 @@ out:									\
 struct bch_dev_io_failures *bch2_dev_io_failures(struct bch_io_failures *,
 						 unsigned);
 void bch2_mark_io_failure(struct bch_io_failures *,
-			  struct extent_ptr_decoded *);
+			  struct extent_ptr_decoded *, bool);
 int bch2_bkey_pick_read_device(struct bch_fs *, struct bkey_s_c,
 			       struct bch_io_failures *,
 			       struct extent_ptr_decoded *, int);
diff --git a/fs/bcachefs/extents_types.h b/fs/bcachefs/extents_types.h
index 43d6c341ecca..e51529dca4c2 100644
--- a/fs/bcachefs/extents_types.h
+++ b/fs/bcachefs/extents_types.h
@@ -20,8 +20,9 @@ struct bch_extent_crc_unpacked {
 };
 
 struct extent_ptr_decoded {
-	unsigned			idx;
 	bool				has_ec;
+	bool				do_ec_reconstruct;
+	u8				crc_retry_nr;
 	struct bch_extent_crc_unpacked	crc;
 	struct bch_extent_ptr		ptr;
 	struct bch_extent_stripe_ptr	ec;
@@ -31,10 +32,10 @@ struct bch_io_failures {
 	u8			nr;
 	struct bch_dev_io_failures {
 		u8		dev;
-		u8		idx;
-		u8		nr_failed;
-		u8		nr_retries;
-	}			devs[BCH_REPLICAS_MAX];
+		unsigned	failed_csum_nr:6,
+				failed_io:1,
+				failed_ec:1;
+	}			devs[BCH_REPLICAS_MAX + 1];
 };
 
 #endif /* _BCACHEFS_EXTENTS_TYPES_H */
diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 61fb3e558450..40a23e779e6c 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -481,7 +481,8 @@ static void bch2_rbio_retry(struct work_struct *work)
 		     bvec_iter_sectors(rbio->bvec_iter));
 
 	if (bch2_err_matches(rbio->ret, BCH_ERR_data_read_retry_avoid))
-		bch2_mark_io_failure(&failed, &rbio->pick);
+		bch2_mark_io_failure(&failed, &rbio->pick,
+				     rbio->ret == -BCH_ERR_data_read_retry_csum_err);
 
 	if (!rbio->split) {
 		rbio->bio.bi_status	= 0;
@@ -983,7 +984,7 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	    ca &&
 	    unlikely(dev_ptr_stale(ca, &pick.ptr))) {
 		read_from_stale_dirty_pointer(trans, ca, k, pick.ptr);
-		bch2_mark_io_failure(failed, &pick);
+		bch2_mark_io_failure(failed, &pick, false);
 		percpu_ref_put(&ca->io_ref);
 		goto retry_pick;
 	}
@@ -1146,7 +1147,7 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	else
 		bch2_trans_unlock_long(trans);
 
-	if (!rbio->pick.idx) {
+	if (likely(!rbio->pick.do_ec_reconstruct)) {
 		if (unlikely(!rbio->have_ioref)) {
 			struct printbuf buf = PRINTBUF;
 			bch2_read_err_msg_trans(trans, &buf, rbio, read_pos);
@@ -1207,7 +1208,8 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 		rbio = bch2_rbio_free(rbio);
 
 		if (bch2_err_matches(ret, BCH_ERR_data_read_retry_avoid))
-			bch2_mark_io_failure(failed, &pick);
+			bch2_mark_io_failure(failed, &pick,
+					ret == -BCH_ERR_data_read_retry_csum_err);
 
 		return ret;
 	}
diff --git a/fs/bcachefs/opts.h b/fs/bcachefs/opts.h
index afb89d318d24..baa9c11acb1a 100644
--- a/fs/bcachefs/opts.h
+++ b/fs/bcachefs/opts.h
@@ -186,6 +186,11 @@ enum fsck_err_opts {
 	  OPT_STR(__bch2_csum_opts),					\
 	  BCH_SB_DATA_CSUM_TYPE,	BCH_CSUM_OPT_crc32c,		\
 	  NULL,		NULL)						\
+	x(checksum_err_retry_nr,	u8,				\
+	  OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME,			\
+	  OPT_UINT(0, 32),						\
+	  BCH_SB_CSUM_ERR_RETRY_NR,	3,				\
+	  NULL,		NULL)						\
 	x(compression,			u8,				\
 	  OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME,		\
 	  OPT_FN(bch2_opt_compression),					\
diff --git a/fs/bcachefs/super-io.c b/fs/bcachefs/super-io.c
index ee32d043414a..f87e3bf33ec0 100644
--- a/fs/bcachefs/super-io.c
+++ b/fs/bcachefs/super-io.c
@@ -457,6 +457,10 @@ static int bch2_sb_validate(struct bch_sb_handle *disk_sb,
 
 		if (!BCH_SB_WRITE_ERROR_TIMEOUT(sb))
 			SET_BCH_SB_WRITE_ERROR_TIMEOUT(sb, 30);
+
+		if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_extent_flags &&
+		    !BCH_SB_CSUM_ERR_RETRY_NR(sb))
+			SET_BCH_SB_CSUM_ERR_RETRY_NR(sb, 3);
 	}
 
 #ifdef __KERNEL__
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 09/14] bcachefs: __bch2_read() now takes a btree_trans
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (7 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 08/14] bcachefs: Checksum errors get additional retries Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 10/14] bcachefs: Poison extents that can't be read due to checksum errors Kent Overstreet
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Next patch will be checking if the extent we're reading from matches the
IO failure we saw before marking the failure.

For this to work, __bch2_read() needs to take the same transaction
context that bch2_rbio_retry() uses to do that check.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 90 ++++++++++++++++++++++++++++++++++---------
 fs/bcachefs/io_read.h | 12 +++---
 2 files changed, 79 insertions(+), 23 deletions(-)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 40a23e779e6c..b5bcd08fc983 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -416,13 +416,53 @@ static void bch2_rbio_done(struct bch_read_bio *rbio)
 	bio_endio(&rbio->bio);
 }
 
-static noinline int bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_bio *rbio,
-				     struct bvec_iter bvec_iter,
-				     struct bch_io_failures *failed,
-				     unsigned flags)
+static struct bkey_s_c get_rbio_extent(struct btree_trans *trans,
+				       struct bch_read_bio *rbio,
+				       struct btree_iter *iter)
+{
+	if (rbio->flags & BCH_READ_data_update) {
+		struct data_update *u = container_of(rbio, struct data_update, rbio);
+
+		return bch2_bkey_get_iter(trans, iter,
+					  u->btree_id, bkey_start_pos(&u->k.k->k), 0);
+	} else {
+		struct bpos pos = rbio->read_pos;
+		int ret = bch2_subvolume_get_snapshot(trans, rbio->subvol, &pos.snapshot);
+		if (ret)
+			return bkey_s_c_err(ret);
+
+		return bch2_bkey_get_iter(trans, iter, BTREE_ID_extents, pos, 0);
+	}
+}
+
+static void mark_io_failure_if_current_extent_matches(struct btree_trans *trans,
+						      struct bch_read_bio *rbio,
+						      struct bch_io_failures *failed)
+{
+	struct btree_iter iter = {};
+	struct bkey_s_c k;
+	int ret = lockrestart_do(trans,
+				 bkey_err(k = get_rbio_extent(trans, rbio, &iter)));
+
+	if (!ret) {
+		struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
+
+		bkey_for_each_ptr(ptrs, ptr)
+			if (bch2_extent_ptr_eq(*ptr, rbio->pick.ptr))
+				bch2_mark_io_failure(failed, &rbio->pick,
+					rbio->ret == -BCH_ERR_data_read_retry_csum_err);
+	}
+
+	bch2_trans_iter_exit(trans, &iter);
+}
+
+static noinline int bch2_read_retry_nodecode(struct btree_trans *trans,
+					struct bch_read_bio *rbio,
+					struct bvec_iter bvec_iter,
+					struct bch_io_failures *failed,
+					unsigned flags)
 {
 	struct data_update *u = container_of(rbio, struct data_update, rbio);
-	struct btree_trans *trans = bch2_trans_get(c);
 retry:
 	bch2_trans_begin(trans);
 
@@ -458,8 +498,6 @@ static noinline int bch2_read_retry_nodecode(struct bch_fs *c, struct bch_read_b
 	}
 
 	BUG_ON(atomic_read(&rbio->bio.__bi_remaining) != 1);
-	bch2_trans_put(trans);
-
 	return ret;
 }
 
@@ -475,14 +513,14 @@ static void bch2_rbio_retry(struct work_struct *work)
 		.inum	= rbio->read_pos.inode,
 	};
 	struct bch_io_failures failed = { .nr = 0 };
+	struct btree_trans *trans = bch2_trans_get(c);
 
 	trace_io_read_retry(&rbio->bio);
 	this_cpu_add(c->counters[BCH_COUNTER_io_read_retry],
 		     bvec_iter_sectors(rbio->bvec_iter));
 
 	if (bch2_err_matches(rbio->ret, BCH_ERR_data_read_retry_avoid))
-		bch2_mark_io_failure(&failed, &rbio->pick,
-				     rbio->ret == -BCH_ERR_data_read_retry_csum_err);
+		mark_io_failure_if_current_extent_matches(trans, rbio, &failed);
 
 	if (!rbio->split) {
 		rbio->bio.bi_status	= 0;
@@ -500,8 +538,8 @@ static void bch2_rbio_retry(struct work_struct *work)
 	flags |= BCH_READ_must_clone;
 
 	int ret = flags & BCH_READ_data_update
-		? bch2_read_retry_nodecode(c, rbio, iter, &failed, flags)
-		: __bch2_read(c, rbio, iter, inum, &failed, flags);
+		? bch2_read_retry_nodecode(trans, rbio, iter, &failed, flags)
+		: __bch2_read(trans, rbio, iter, inum, &failed, flags);
 
 	if (ret) {
 		rbio->ret = ret;
@@ -509,7 +547,7 @@ static void bch2_rbio_retry(struct work_struct *work)
 	} else {
 		struct printbuf buf = PRINTBUF;
 
-		bch2_trans_do(c,
+		lockrestart_do(trans,
 			bch2_inum_offset_err_msg_trans(trans, &buf,
 					(subvol_inum) { subvol, read_pos.inode },
 					read_pos.offset << 9));
@@ -522,6 +560,7 @@ static void bch2_rbio_retry(struct work_struct *work)
 	}
 
 	bch2_rbio_done(rbio);
+	bch2_trans_put(trans);
 }
 
 static void bch2_rbio_error(struct bch_read_bio *rbio,
@@ -1241,11 +1280,11 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	return 0;
 }
 
-int __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
-		 struct bvec_iter bvec_iter, subvol_inum inum,
-		 struct bch_io_failures *failed, unsigned flags)
+int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio,
+		struct bvec_iter bvec_iter, subvol_inum inum,
+		struct bch_io_failures *failed, unsigned flags)
 {
-	struct btree_trans *trans = bch2_trans_get(c);
+	struct bch_fs *c = trans->c;
 	struct btree_iter iter;
 	struct bkey_buf sk;
 	struct bkey_s_c k;
@@ -1278,6 +1317,23 @@ int __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 		if (ret)
 			goto err;
 
+		if (unlikely(flags & BCH_READ_in_retry)) {
+			struct data_update *u = flags & BCH_READ_data_update
+				? container_of(rbio, struct data_update, rbio)
+				: NULL;
+
+			if (u &&
+			    !bkey_and_val_eq(k, bkey_i_to_s_c(u->k.k))) {
+				/* extent we wanted to read no longer exists: */
+				ret = -BCH_ERR_data_read_key_overwritten;
+				goto err;
+			}
+
+			if (!bkey_deleted(&sk.k->k) &&
+			    !bkey_and_val_eq(k, bkey_i_to_s_c(sk.k)))
+				failed->nr = 0;
+		}
+
 		s64 offset_into_extent = iter.pos.offset -
 			bkey_start_offset(k.k);
 		unsigned sectors = k.k->size - offset_into_extent;
@@ -1342,9 +1398,7 @@ int __bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 			bch2_rbio_done(rbio);
 	}
 
-	bch2_trans_put(trans);
 	bch2_bkey_buf_exit(&sk, c);
-
 	return ret;
 }
 
diff --git a/fs/bcachefs/io_read.h b/fs/bcachefs/io_read.h
index 42a22985d789..edcf50a4418c 100644
--- a/fs/bcachefs/io_read.h
+++ b/fs/bcachefs/io_read.h
@@ -3,6 +3,7 @@
 #define _BCACHEFS_IO_READ_H
 
 #include "bkey_buf.h"
+#include "btree_iter.h"
 #include "reflink.h"
 
 struct bch_read_bio {
@@ -140,7 +141,7 @@ static inline void bch2_read_extent(struct btree_trans *trans,
 			   data_btree, k, offset_into_extent, NULL, flags, -1);
 }
 
-int __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter,
+int __bch2_read(struct btree_trans *, struct bch_read_bio *, struct bvec_iter,
 		subvol_inum, struct bch_io_failures *, unsigned flags);
 
 static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
@@ -150,10 +151,11 @@ static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
 
 	rbio->subvol = inum.subvol;
 
-	__bch2_read(c, rbio, rbio->bio.bi_iter, inum, NULL,
-		    BCH_READ_retry_if_stale|
-		    BCH_READ_may_promote|
-		    BCH_READ_user_mapped);
+	bch2_trans_run(c,
+		__bch2_read(trans, rbio, rbio->bio.bi_iter, inum, NULL,
+			    BCH_READ_retry_if_stale|
+			    BCH_READ_may_promote|
+			    BCH_READ_user_mapped));
 }
 
 static inline struct bch_read_bio *rbio_init_fragment(struct bio *bio,
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 10/14] bcachefs: Poison extents that can't be read due to checksum errors
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (8 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 09/14] bcachefs: __bch2_read() now takes a btree_trans Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 11/14] bcachefs: Data move can read from poisoned extents Kent Overstreet
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Copygc needs to be able to move extents that have bitrotted. We don't
want to delete them - in the future we'll have an API for "read me the
data even if there's checksum errors", and in general we don't want to
delete anything unless the user asks us to.

That will require writing it with a new checksum, which means we can't
forget that there was a checksum error so we return the correct error to
userspace.

Rebalance also wants to skip bad extents; we can now use the poison flag
for that.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 60 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 55 insertions(+), 5 deletions(-)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index b5bcd08fc983..4fd5a4e646e0 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -456,6 +456,47 @@ static void mark_io_failure_if_current_extent_matches(struct btree_trans *trans,
 	bch2_trans_iter_exit(trans, &iter);
 }
 
+static noinline int maybe_poison_extent(struct btree_trans *trans, struct bch_read_bio *rbio,
+					enum btree_id btree, struct bkey_s_c read_k)
+{
+	struct bch_fs *c = trans->c;
+
+	u64 flags = bch2_bkey_extent_flags(read_k);
+	if (flags & BIT_ULL(BCH_EXTENT_FLAG_poisoned))
+		return 0;
+
+	struct btree_iter iter;
+	struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, btree, bkey_start_pos(read_k.k),
+					       BTREE_ITER_intent);
+	int ret = bkey_err(k);
+	if (ret)
+		return ret;
+
+	if (!bkey_and_val_eq(k, read_k))
+		goto out;
+
+	struct bkey_i *new = bch2_trans_kmalloc(trans,
+					bkey_bytes(k.k) + sizeof(struct bch_extent_flags));
+	ret =   PTR_ERR_OR_ZERO(new) ?:
+		(bkey_reassemble(new, k), 0) ?:
+		bch2_bkey_extent_flags_set(c, new, flags|BIT_ULL(BCH_EXTENT_FLAG_poisoned)) ?:
+		bch2_trans_update(trans, &iter, new, BTREE_UPDATE_internal_snapshot_node) ?:
+		bch2_trans_commit(trans, NULL, NULL, 0);
+
+	if (!ret && (rbio->flags & BCH_READ_data_update)) {
+		/*
+		 * Propagate key change back to data update path, in particular
+		 * so it knows the extent has been poisoned and it's safe to
+		 * change the checksum
+		 */
+		struct data_update *u = container_of(rbio, struct data_update, rbio);
+		bch2_bkey_buf_copy(&u->k, c, new);
+	}
+out:
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
 static noinline int bch2_read_retry_nodecode(struct btree_trans *trans,
 					struct bch_read_bio *rbio,
 					struct bvec_iter bvec_iter,
@@ -489,7 +530,8 @@ static noinline int bch2_read_retry_nodecode(struct btree_trans *trans,
 err:
 	bch2_trans_iter_exit(trans, &iter);
 
-	if (bch2_err_matches(ret, BCH_ERR_data_read_retry))
+	if (bch2_err_matches(ret, BCH_ERR_transaction_restart) ||
+	    bch2_err_matches(ret, BCH_ERR_data_read_retry))
 		goto retry;
 
 	if (ret) {
@@ -988,6 +1030,14 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 		goto hole;
 
 	if (unlikely(ret < 0)) {
+		if (ret == -BCH_ERR_data_read_csum_err) {
+			int ret2 = maybe_poison_extent(trans, orig, data_btree, k);
+			if (ret2) {
+				ret = ret2;
+				goto err;
+			}
+		}
+
 		struct printbuf buf = PRINTBUF;
 		bch2_read_err_msg_trans(trans, &buf, orig, read_pos);
 		prt_printf(&buf, "%s\n  ", bch2_err_str(ret));
@@ -1288,6 +1338,7 @@ int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio,
 	struct btree_iter iter;
 	struct bkey_buf sk;
 	struct bkey_s_c k;
+	enum btree_id data_btree;
 	int ret;
 
 	BUG_ON(flags & BCH_READ_data_update);
@@ -1298,7 +1349,7 @@ int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio,
 			     BTREE_ITER_slots);
 
 	while (1) {
-		enum btree_id data_btree = BTREE_ID_extents;
+		data_btree = BTREE_ID_extents;
 
 		bch2_trans_begin(trans);
 
@@ -1380,9 +1431,7 @@ int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio,
 			break;
 	}
 
-	bch2_trans_iter_exit(trans, &iter);
-
-	if (ret) {
+	if (unlikely(ret)) {
 		struct printbuf buf = PRINTBUF;
 		lockrestart_do(trans,
 			bch2_inum_offset_err_msg_trans(trans, &buf, inum,
@@ -1398,6 +1447,7 @@ int __bch2_read(struct btree_trans *trans, struct bch_read_bio *rbio,
 			bch2_rbio_done(rbio);
 	}
 
+	bch2_trans_iter_exit(trans, &iter);
 	bch2_bkey_buf_exit(&sk, c);
 	return ret;
 }
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 11/14] bcachefs: Data move can read from poisoned extents
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (9 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 10/14] bcachefs: Poison extents that can't be read due to checksum errors Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 12/14] bcachefs: Debug params for data corruption injection Kent Overstreet
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Now, if an extent is poisoned we can move it even if there was a
checksum error. We'll have to give it a new checksum, but the poison bit
means that userspace will still see the appropriate error when they try
to read it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/extents.c |  6 +-----
 fs/bcachefs/io_read.c |  4 ++++
 fs/bcachefs/move.c    | 26 ++++++++++++++++++++------
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/fs/bcachefs/extents.c b/fs/bcachefs/extents.c
index 1df0e034e988..03c260f78406 100644
--- a/fs/bcachefs/extents.c
+++ b/fs/bcachefs/extents.c
@@ -145,12 +145,8 @@ int bch2_bkey_pick_read_device(struct bch_fs *c, struct bkey_s_c k,
 	if (k.k->type == KEY_TYPE_error)
 		return -BCH_ERR_key_type_error;
 
-	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
-
-	if (bch2_bkey_extent_ptrs_flags(ptrs) & BIT_ULL(BCH_EXTENT_FLAG_poisoned))
-		return -BCH_ERR_extent_poisened;
-
 	rcu_read_lock();
+	struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
 	const union bch_extent_entry *entry;
 	struct extent_ptr_decoded p;
 
diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 4fd5a4e646e0..1ff4edac7e81 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -1022,6 +1022,10 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 			     bvec_iter_sectors(iter));
 		goto out_read_done;
 	}
+
+	if ((bch2_bkey_extent_flags(k) & BIT_ULL(BCH_EXTENT_FLAG_poisoned)) &&
+	    !(flags & BCH_READ_data_update))
+		return -BCH_ERR_extent_poisened;
 retry_pick:
 	ret = bch2_bkey_pick_read_device(c, k, failed, &pick, dev);
 
diff --git a/fs/bcachefs/move.c b/fs/bcachefs/move.c
index 8cfd842a86cf..1e8e51fdad4b 100644
--- a/fs/bcachefs/move.c
+++ b/fs/bcachefs/move.c
@@ -114,26 +114,40 @@ static void move_write_done(struct bch_write_op *op)
 
 static void move_write(struct moving_io *io)
 {
+	struct bch_fs *c = io->write.op.c;
 	struct moving_context *ctxt = io->write.ctxt;
+	struct bch_read_bio *rbio = &io->write.rbio;
 
 	if (ctxt->stats) {
-		if (io->write.rbio.bio.bi_status)
+		if (rbio->bio.bi_status)
 			atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9,
 				     &ctxt->stats->sectors_error_uncorrected);
-		else if (io->write.rbio.saw_error)
+		else if (rbio->saw_error)
 			atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9,
 				     &ctxt->stats->sectors_error_corrected);
 	}
 
-	if (unlikely(io->write.rbio.ret ||
-		     io->write.rbio.bio.bi_status ||
-		     io->write.data_opts.scrub)) {
+	/*
+	 * If the extent has been bitrotted, we're going to have to give it a
+	 * new checksum in order to move it - but the poison bit will ensure
+	 * that userspace still gets the appropriate error.
+	 */
+	if (unlikely(rbio->ret == -BCH_ERR_data_read_csum_err &&
+		     (bch2_bkey_extent_flags(bkey_i_to_s_c(io->write.k.k)) & BIT_ULL(BCH_EXTENT_FLAG_poisoned)))) {
+		struct bch_extent_crc_unpacked crc = rbio->pick.crc;
+		struct nonce nonce = extent_nonce(rbio->version, crc);
+
+		rbio->pick.crc.csum	= bch2_checksum_bio(c, rbio->pick.crc.csum_type,
+							    nonce, &rbio->bio);
+		rbio->ret		= 0;
+	}
+
+	if (unlikely(rbio->ret || io->write.data_opts.scrub)) {
 		move_free(io);
 		return;
 	}
 
 	if (trace_io_move_write_enabled()) {
-		struct bch_fs *c = io->write.op.c;
 		struct printbuf buf = PRINTBUF;
 
 		bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(io->write.k.k));
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 12/14] bcachefs: Debug params for data corruption injection
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (10 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 11/14] bcachefs: Data move can read from poisoned extents Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-11 20:15 ` [PATCH 13/14] block: Allow REQ_FUA|REQ_READ Kent Overstreet
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

dm-flakey is busted, and this is simpler anyways - this lets us test the
checksum error retry ptahs

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c  |  8 ++++++++
 fs/bcachefs/io_write.c | 24 ++++++++++++++++++++++++
 fs/bcachefs/util.c     | 21 +++++++++++++++++++++
 fs/bcachefs/util.h     | 12 ++++++++++++
 4 files changed, 65 insertions(+)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 1ff4edac7e81..69c1422685e0 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -27,6 +27,12 @@
 
 #include <linux/sched/mm.h>
 
+#ifdef CONFIG_BCACHEFS_DEBUG
+static unsigned bch2_read_corrupt_ratio;
+module_param_named(read_corrupt_ratio, bch2_read_corrupt_ratio, uint, 0644);
+MODULE_PARM_DESC(read_corrupt_ratio, "");
+#endif
+
 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
 
 static bool bch2_target_congested(struct bch_fs *c, u16 target)
@@ -807,6 +813,8 @@ static void __bch2_read_endio(struct work_struct *work)
 		src->bi_iter			= rbio->bvec_iter;
 	}
 
+	bch2_maybe_corrupt_bio(src, bch2_read_corrupt_ratio);
+
 	csum = bch2_checksum_bio(c, crc.csum_type, nonce, src);
 	bool csum_good = !bch2_crc_cmp(csum, rbio->pick.crc.csum) || c->opts.no_data_io;
 
diff --git a/fs/bcachefs/io_write.c b/fs/bcachefs/io_write.c
index dbfcb28f003d..48befbae0226 100644
--- a/fs/bcachefs/io_write.c
+++ b/fs/bcachefs/io_write.c
@@ -34,6 +34,12 @@
 #include <linux/random.h>
 #include <linux/sched/mm.h>
 
+#ifdef CONFIG_BCACHEFS_DEBUG
+static unsigned bch2_write_corrupt_ratio;
+module_param_named(write_corrupt_ratio, bch2_write_corrupt_ratio, uint, 0644);
+MODULE_PARM_DESC(write_corrupt_ratio, "");
+#endif
+
 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
 
 static inline void bch2_congested_acct(struct bch_dev *ca, u64 io_latency,
@@ -1005,6 +1011,15 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
 		bounce = true;
 	}
 
+#ifdef CONFIG_BCACHEFS_DEBUG
+	unsigned write_corrupt_ratio = READ_ONCE(bch2_write_corrupt_ratio);
+	if (!bounce && write_corrupt_ratio) {
+		dst = bch2_write_bio_alloc(c, wp, src,
+					   &page_alloc_failed,
+					   ec_buf);
+		bounce = true;
+	}
+#endif
 	saved_iter = dst->bi_iter;
 
 	do {
@@ -1114,6 +1129,14 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
 
 		init_append_extent(op, wp, version, crc);
 
+#ifdef CONFIG_BCACHEFS_DEBUG
+		if (write_corrupt_ratio) {
+			swap(dst->bi_iter.bi_size, dst_len);
+			bch2_maybe_corrupt_bio(dst, write_corrupt_ratio);
+			swap(dst->bi_iter.bi_size, dst_len);
+		}
+#endif
+
 		if (dst != src)
 			bio_advance(dst, dst_len);
 		bio_advance(src, src_len);
@@ -1394,6 +1417,7 @@ static void bch2_nocow_write(struct bch_write_op *op)
 		bio->bi_private	= &op->cl;
 		bio->bi_opf |= REQ_OP_WRITE;
 		closure_get(&op->cl);
+
 		bch2_submit_wbio_replicas(to_wbio(bio), c, BCH_DATA_user,
 					  op->insert_keys.top, true);
 
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index 50a90e48f6dd..7623cf6f75ae 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -698,6 +698,27 @@ void memcpy_from_bio(void *dst, struct bio *src, struct bvec_iter src_iter)
 	}
 }
 
+#ifdef CONFIG_BCACHEFS_DEBUG
+void bch2_corrupt_bio(struct bio *bio)
+{
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	unsigned offset = get_random_u32_below(bio->bi_iter.bi_size / sizeof(u64));
+
+	bio_for_each_segment(bv, bio, iter) {
+		unsigned u64s = bv.bv_len / sizeof(u64);
+
+		if (offset < u64s) {
+			u64 *segment = bvec_kmap_local(&bv);
+			segment[offset] = get_random_u64();
+			kunmap_local(segment);
+			return;
+		}
+		offset -= u64s;
+	}
+}
+#endif
+
 #if 0
 void eytzinger1_test(void)
 {
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index e7c3541b38f3..fd74fafac297 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -406,6 +406,18 @@ size_t bch2_rand_range(size_t);
 void memcpy_to_bio(struct bio *, struct bvec_iter, const void *);
 void memcpy_from_bio(void *, struct bio *, struct bvec_iter);
 
+#ifdef CONFIG_BCACHEFS_DEBUG
+void bch2_corrupt_bio(struct bio *);
+
+static inline void bch2_maybe_corrupt_bio(struct bio *bio, unsigned ratio)
+{
+	if (ratio && !get_random_u32_below(ratio))
+		bch2_corrupt_bio(bio);
+}
+#else
+#define bch2_maybe_corrupt_bio(...)	do {} while (0)
+#endif
+
 static inline void memcpy_u64s_small(void *dst, const void *src,
 				     unsigned u64s)
 {
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (11 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 12/14] bcachefs: Debug params for data corruption injection Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-15 16:47   ` Jens Axboe
  2025-03-11 20:15 ` [PATCH 14/14] bcachefs: Read retries are after checksum errors now REQ_FUA Kent Overstreet
  2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel
  14 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet, Jens Axboe, linux-block

REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
the same as writes.

This is useful for when the filesystem gets a checksum error, it's
possible that a bit was flipped in the controller cache, and when we
retry we want to retry the entire IO, not just from cache.

The nvme driver already passes through REQ_FUA for reads, not just
writes, so disabling the warning is sufficient to start using it, and
bcachefs is implementing additional retries for checksum errors so can
immediately use it.

Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/blk-core.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d6c4fa3943b5..7b1103eb877d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -793,20 +793,21 @@ void submit_bio_noacct(struct bio *bio)
 			goto end_io;
 	}
 
+	if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) &&
+			 bio_op(bio) != REQ_OP_WRITE &&
+			 bio_op(bio) != REQ_OP_ZONE_APPEND))
+		goto end_io;
+
 	/*
 	 * Filter flush bio's early so that bio based drivers without flush
 	 * support don't have to worry about them.
 	 */
-	if (op_is_flush(bio->bi_opf)) {
-		if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
-				 bio_op(bio) != REQ_OP_ZONE_APPEND))
+	if (op_is_flush(bio->bi_opf) &&
+	    !bdev_write_cache(bdev)) {
+		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
+		if (!bio_sectors(bio)) {
+			status = BLK_STS_OK;
 			goto end_io;
-		if (!bdev_write_cache(bdev)) {
-			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
-			if (!bio_sectors(bio)) {
-				status = BLK_STS_OK;
-				goto end_io;
-			}
 		}
 	}
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 14/14] bcachefs: Read retries are after checksum errors now REQ_FUA
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (12 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 13/14] block: Allow REQ_FUA|REQ_READ Kent Overstreet
@ 2025-03-11 20:15 ` Kent Overstreet
  2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel
  14 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet, Benjamin LaHaise

REQ_FUA means "skip the drive cache", and it can be used with reads to.
If there was a checksum error, we want to retry the whole read path, not
read it from cache again.

Suggested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/io_read.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 69c1422685e0..d90e32fb6893 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -1217,6 +1217,10 @@ int __bch2_read_extent(struct btree_trans *trans, struct bch_read_bio *orig,
 	rbio->bio.bi_iter.bi_sector = pick.ptr.offset;
 	rbio->bio.bi_end_io	= bch2_read_endio;
 
+	/* XXX: also nvme read recovery level */
+	if (unlikely(failed && bch2_dev_io_failures(failed, pick.ptr.dev)))
+		rbio->bio.bi_opf |= REQ_FUA;
+
 	if (rbio->bounce)
 		trace_and_count(c, io_read_bounce, &rbio->bio);
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-11 20:15 ` [PATCH 13/14] block: Allow REQ_FUA|REQ_READ Kent Overstreet
@ 2025-03-15 16:47   ` Jens Axboe
  2025-03-15 17:01     ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2025-03-15 16:47 UTC (permalink / raw)
  To: Kent Overstreet, linux-bcachefs; +Cc: linux-block

On 3/11/25 2:15 PM, Kent Overstreet wrote:
> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> the same as writes.
> 
> This is useful for when the filesystem gets a checksum error, it's
> possible that a bit was flipped in the controller cache, and when we
> retry we want to retry the entire IO, not just from cache.
> 
> The nvme driver already passes through REQ_FUA for reads, not just
> writes, so disabling the warning is sufficient to start using it, and
> bcachefs is implementing additional retries for checksum errors so can
> immediately use it.

This one got effectively nak'ed by various folks here:

> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/

yet it's part of this series and in linux-next? Hmm?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 16:47   ` Jens Axboe
@ 2025-03-15 17:01     ` Kent Overstreet
  2025-03-15 17:03       ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-15 17:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
> On 3/11/25 2:15 PM, Kent Overstreet wrote:
> > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> > the same as writes.
> > 
> > This is useful for when the filesystem gets a checksum error, it's
> > possible that a bit was flipped in the controller cache, and when we
> > retry we want to retry the entire IO, not just from cache.
> > 
> > The nvme driver already passes through REQ_FUA for reads, not just
> > writes, so disabling the warning is sufficient to start using it, and
> > bcachefs is implementing additional retries for checksum errors so can
> > immediately use it.
> 
> This one got effectively nak'ed by various folks here:
> 
> > Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
> 
> yet it's part of this series and in linux-next? Hmm?

As I explained in that thread, they were only thinking about the caching
of writes.

That's not what we're concerned about; when we retry a read due to a
checksum error we do not want the previous _read_ cached.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 17:01     ` Kent Overstreet
@ 2025-03-15 17:03       ` Jens Axboe
  2025-03-15 17:27         ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2025-03-15 17:03 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-block

On 3/15/25 11:01 AM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
>> On 3/11/25 2:15 PM, Kent Overstreet wrote:
>>> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
>>> the same as writes.
>>>
>>> This is useful for when the filesystem gets a checksum error, it's
>>> possible that a bit was flipped in the controller cache, and when we
>>> retry we want to retry the entire IO, not just from cache.
>>>
>>> The nvme driver already passes through REQ_FUA for reads, not just
>>> writes, so disabling the warning is sufficient to start using it, and
>>> bcachefs is implementing additional retries for checksum errors so can
>>> immediately use it.
>>
>> This one got effectively nak'ed by various folks here:
>>
>>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>>
>> yet it's part of this series and in linux-next? Hmm?
> 
> As I explained in that thread, they were only thinking about the caching
> of writes.
> 
> That's not what we're concerned about; when we retry a read due to a
> checksum error we do not want the previous _read_ cached.

Please follow the usual procedure of getting the patch acked/reviewed on
the block list, and go through the usual trees. Until that happens, this
patch should not be in your tree, not should it be staged in linux-next.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 17:03       ` Jens Axboe
@ 2025-03-15 17:27         ` Kent Overstreet
  2025-03-15 17:43           ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-15 17:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
> On 3/15/25 11:01 AM, Kent Overstreet wrote:
> > On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
> >> On 3/11/25 2:15 PM, Kent Overstreet wrote:
> >>> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> >>> the same as writes.
> >>>
> >>> This is useful for when the filesystem gets a checksum error, it's
> >>> possible that a bit was flipped in the controller cache, and when we
> >>> retry we want to retry the entire IO, not just from cache.
> >>>
> >>> The nvme driver already passes through REQ_FUA for reads, not just
> >>> writes, so disabling the warning is sufficient to start using it, and
> >>> bcachefs is implementing additional retries for checksum errors so can
> >>> immediately use it.
> >>
> >> This one got effectively nak'ed by various folks here:
> >>
> >>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
> >>
> >> yet it's part of this series and in linux-next? Hmm?
> > 
> > As I explained in that thread, they were only thinking about the caching
> > of writes.
> > 
> > That's not what we're concerned about; when we retry a read due to a
> > checksum error we do not want the previous _read_ cached.
> 
> Please follow the usual procedure of getting the patch acked/reviewed on
> the block list, and go through the usual trees. Until that happens, this
> patch should not be in your tree, not should it be staged in linux-next.

It's been posted to linux-block and sent to your inbox. If you're going
to take it that's fine, otherwise - since this is necessary for handling
bitrotted data correctly and I've got users who've been waiting on this
patch series, and it's just deleting a warning, I'm inclined to just
send it.

I'll make sure he's got the lore links and knows what's going on, but
this isn't a great thing to be delaying on citing process.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 17:27         ` Kent Overstreet
@ 2025-03-15 17:43           ` Jens Axboe
  2025-03-15 18:07             ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2025-03-15 17:43 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-block

On 3/15/25 11:27 AM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
>> On 3/15/25 11:01 AM, Kent Overstreet wrote:
>>> On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
>>>> On 3/11/25 2:15 PM, Kent Overstreet wrote:
>>>>> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
>>>>> the same as writes.
>>>>>
>>>>> This is useful for when the filesystem gets a checksum error, it's
>>>>> possible that a bit was flipped in the controller cache, and when we
>>>>> retry we want to retry the entire IO, not just from cache.
>>>>>
>>>>> The nvme driver already passes through REQ_FUA for reads, not just
>>>>> writes, so disabling the warning is sufficient to start using it, and
>>>>> bcachefs is implementing additional retries for checksum errors so can
>>>>> immediately use it.
>>>>
>>>> This one got effectively nak'ed by various folks here:
>>>>
>>>>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>>>>
>>>> yet it's part of this series and in linux-next? Hmm?
>>>
>>> As I explained in that thread, they were only thinking about the caching
>>> of writes.
>>>
>>> That's not what we're concerned about; when we retry a read due to a
>>> checksum error we do not want the previous _read_ cached.
>>
>> Please follow the usual procedure of getting the patch acked/reviewed on
>> the block list, and go through the usual trees. Until that happens, this
>> patch should not be in your tree, not should it be staged in linux-next.
> 
> It's been posted to linux-block and sent to your inbox. If you're going
> to take it that's fine, otherwise - since this is necessary for handling
> bitrotted data correctly and I've got users who've been waiting on this
> patch series, and it's just deleting a warning, I'm inclined to just
> send it.
> 
> I'll make sure he's got the lore links and knows what's going on, but
> this isn't a great thing to be delaying on citing process.

Kent, you know how this works, there's nothing to argue about here.

Let the block people get it reviewed and staged. You can't just post a
patch, ignore most replies, and then go on to stage it yourself later
that day. It didn't work well in other subsystems, and it won't work
well here either.

EOD on my end.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 17:43           ` Jens Axboe
@ 2025-03-15 18:07             ` Kent Overstreet
  2025-03-15 18:32               ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-15 18:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 11:43:30AM -0600, Jens Axboe wrote:
> On 3/15/25 11:27 AM, Kent Overstreet wrote:
> > On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
> >> On 3/15/25 11:01 AM, Kent Overstreet wrote:
> >>> On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
> >>>> On 3/11/25 2:15 PM, Kent Overstreet wrote:
> >>>>> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> >>>>> the same as writes.
> >>>>>
> >>>>> This is useful for when the filesystem gets a checksum error, it's
> >>>>> possible that a bit was flipped in the controller cache, and when we
> >>>>> retry we want to retry the entire IO, not just from cache.
> >>>>>
> >>>>> The nvme driver already passes through REQ_FUA for reads, not just
> >>>>> writes, so disabling the warning is sufficient to start using it, and
> >>>>> bcachefs is implementing additional retries for checksum errors so can
> >>>>> immediately use it.
> >>>>
> >>>> This one got effectively nak'ed by various folks here:
> >>>>
> >>>>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
> >>>>
> >>>> yet it's part of this series and in linux-next? Hmm?
> >>>
> >>> As I explained in that thread, they were only thinking about the caching
> >>> of writes.
> >>>
> >>> That's not what we're concerned about; when we retry a read due to a
> >>> checksum error we do not want the previous _read_ cached.
> >>
> >> Please follow the usual procedure of getting the patch acked/reviewed on
> >> the block list, and go through the usual trees. Until that happens, this
> >> patch should not be in your tree, not should it be staged in linux-next.
> > 
> > It's been posted to linux-block and sent to your inbox. If you're going
> > to take it that's fine, otherwise - since this is necessary for handling
> > bitrotted data correctly and I've got users who've been waiting on this
> > patch series, and it's just deleting a warning, I'm inclined to just
> > send it.
> > 
> > I'll make sure he's got the lore links and knows what's going on, but
> > this isn't a great thing to be delaying on citing process.
> 
> Kent, you know how this works, there's nothing to argue about here.
> 
> Let the block people get it reviewed and staged. You can't just post a
> patch, ignore most replies, and then go on to stage it yourself later
> that day. It didn't work well in other subsystems, and it won't work
> well here either.

Jens, those replies weren't ignored, the concerns were answered. Never
have I seen that considered "effectively nacked".

And as far as "how things work around here" - let's not even go there,
lest we have to cover your issues with process.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 18:07             ` Kent Overstreet
@ 2025-03-15 18:32               ` Jens Axboe
  2025-03-15 18:41                 ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2025-03-15 18:32 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Linus Torvalds

On 3/15/25 12:07 PM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 11:43:30AM -0600, Jens Axboe wrote:
>> On 3/15/25 11:27 AM, Kent Overstreet wrote:
>>> On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
>>>> On 3/15/25 11:01 AM, Kent Overstreet wrote:
>>>>> On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
>>>>>> On 3/11/25 2:15 PM, Kent Overstreet wrote:
>>>>>>> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
>>>>>>> the same as writes.
>>>>>>>
>>>>>>> This is useful for when the filesystem gets a checksum error, it's
>>>>>>> possible that a bit was flipped in the controller cache, and when we
>>>>>>> retry we want to retry the entire IO, not just from cache.
>>>>>>>
>>>>>>> The nvme driver already passes through REQ_FUA for reads, not just
>>>>>>> writes, so disabling the warning is sufficient to start using it, and
>>>>>>> bcachefs is implementing additional retries for checksum errors so can
>>>>>>> immediately use it.
>>>>>>
>>>>>> This one got effectively nak'ed by various folks here:
>>>>>>
>>>>>>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>>>>>>
>>>>>> yet it's part of this series and in linux-next? Hmm?
>>>>>
>>>>> As I explained in that thread, they were only thinking about the caching
>>>>> of writes.
>>>>>
>>>>> That's not what we're concerned about; when we retry a read due to a
>>>>> checksum error we do not want the previous _read_ cached.
>>>>
>>>> Please follow the usual procedure of getting the patch acked/reviewed on
>>>> the block list, and go through the usual trees. Until that happens, this
>>>> patch should not be in your tree, not should it be staged in linux-next.
>>>
>>> It's been posted to linux-block and sent to your inbox. If you're going
>>> to take it that's fine, otherwise - since this is necessary for handling
>>> bitrotted data correctly and I've got users who've been waiting on this
>>> patch series, and it's just deleting a warning, I'm inclined to just
>>> send it.
>>>
>>> I'll make sure he's got the lore links and knows what's going on, but
>>> this isn't a great thing to be delaying on citing process.
>>
>> Kent, you know how this works, there's nothing to argue about here.
>>
>> Let the block people get it reviewed and staged. You can't just post a
>> patch, ignore most replies, and then go on to stage it yourself later
>> that day. It didn't work well in other subsystems, and it won't work
>> well here either.
> 
> Jens, those replies weren't ignored, the concerns were answered. Never
> have I seen that considered "effectively nacked".

Kent, let me make this really simple for you, since you either still
don't understand how the development process works, or is under some
assumption that it doesn't apply to you - get some of the decades of
storage experience in that first thread to sign off on the patch and use
case. The patch doesn't go upstream before that happens, simple as that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 18:32               ` Jens Axboe
@ 2025-03-15 18:41                 ` Kent Overstreet
  2025-03-17  6:00                   ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-15 18:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-bcachefs, linux-block, Linus Torvalds

On Sat, Mar 15, 2025 at 12:32:49PM -0600, Jens Axboe wrote:
> Kent, let me make this really simple for you, since you either still
> don't understand how the development process works, or is under some
> assumption that it doesn't apply to you - get some of the decades of
> storage experience in that first thread to sign off on the patch and use
> case. The patch doesn't go upstream before that happens, simple as that.

No, you're the one who's consistently been under the assumption that it
doesn't apply to you.

You've got a history of applying patches to code that I maintain without
even CCing the list, let alone myself, and introducing silent data
corruption bugs into code that I wrote, without CCs, which I then had to
debug, and then putting up absolutely ridiculous fights to get fixed.

That's the sort of thing that process is supposed to avoid. To be clear,
you and Christoph are the two reasons I've had to harp on process in the
past - everyone else in the kernel has been fine.

As I said before, I'm not trying to bypass you without communicating -
but this has gone completely off the rails.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-15 18:41                 ` Kent Overstreet
@ 2025-03-17  6:00                   ` Christoph Hellwig
  2025-03-17 12:15                     ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-17  6:00 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> That's the sort of thing that process is supposed to avoid. To be clear,
> you and Christoph are the two reasons I've had to harp on process in the
> past - everyone else in the kernel has been fine.

Kent, stop those attacks.  Maybe your patches are just simply not
correct?  Sometimes that might be on the eye of the beholder when it
is about code structure, but in may cases they simply are objectively
broken.

In this case you very clearly misunderstood how the FUA bit works on
reads (and clearly documented that misunderstanding in the commit log).
You've had multiple people explain to you how it is wrong, and the
maintainer reject it for that reason.  What other amount of arguments
to you want?  If you need another one: FUA is not just use NVMe and
SCSI where it is supported on reads even if not in the way you think,
but also for all kinds of other protocols.  Even if we ended up thinking
REQ_FUA on reads is a good idea (and so far no one but you does) you'd
need to audit all these to check if the behavior is valid, and most
likely add a per-driver opt-in.  But so far you've not even tried to
make the use case for your feature.

> As I said before, I'm not trying to bypass you without communicating -
> but this has gone completely off the rails.

You have a very clear pattern where you assume you're perfect, everyone
else is stupid and thus must be overridden.  Most people (including me
recently) have mostly given up arguing with your because it's so
fruitless.  Maybe we shouldn't have because that just seem to reaffirm
you in your personal universe.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17  6:00                   ` Christoph Hellwig
@ 2025-03-17 12:15                     ` Kent Overstreet
  2025-03-17 14:13                       ` Keith Busch
  2025-03-18  6:01                       ` Christoph Hellwig
  0 siblings, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-17 12:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > That's the sort of thing that process is supposed to avoid. To be clear,
> > you and Christoph are the two reasons I've had to harp on process in the
> > past - everyone else in the kernel has been fine.

...

> In this case you very clearly misunderstood how the FUA bit works on
> reads (and clearly documented that misunderstanding in the commit log).

FUA reads are clearly documented in the NVME spec.

As for the rest, I'm juts going to ignore the ranting.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 12:15                     ` Kent Overstreet
@ 2025-03-17 14:13                       ` Keith Busch
  2025-03-17 14:49                         ` Kent Overstreet
  2025-03-17 15:30                         ` Martin K. Petersen
  2025-03-18  6:01                       ` Christoph Hellwig
  1 sibling, 2 replies; 63+ messages in thread
From: Keith Busch @ 2025-03-17 14:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > > That's the sort of thing that process is supposed to avoid. To be clear,
> > > you and Christoph are the two reasons I've had to harp on process in the
> > > past - everyone else in the kernel has been fine.
> 
> ...
> 
> > In this case you very clearly misunderstood how the FUA bit works on
> > reads (and clearly documented that misunderstanding in the commit log).
> 
> FUA reads are clearly documented in the NVME spec.

NVMe's Read with FUA documentation is a pretty short:

  Force Unit Access (FUA): If this bit is set to `1´, then for data and
  metadata, if any, associated with logical blocks specified by the Read
  command, the controller shall:

    1) commit that data and metadata, if any, to non-volatile medium; and
    2) read the data and metadata, if any, from non-volatile medium.

So this aligns with ATA and SCSI.

The concern about what you're doing is with respect to "1". Avoiding that
requires all writes also set FUA, or you sent a flush command before doing any
reads. Otherwise how would you know whether or not read data came from a
previous cached write?

I'm actually more curious how you came to use this mechanism for recovery.
Have you actually seen this approach fix a real problem? In my experience, a
Read FUA is used to ensure what you're reading has been persistently committed.
It can be used as an optimization to flush a specific LBA range, but I'd not
seen this used as a bit error recovery mechanism.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 14:13                       ` Keith Busch
@ 2025-03-17 14:49                         ` Kent Overstreet
  2025-03-17 15:15                           ` Keith Busch
  2025-03-17 15:30                         ` Martin K. Petersen
  1 sibling, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-17 14:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Mon, Mar 17, 2025 at 08:13:59AM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> > On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > > > That's the sort of thing that process is supposed to avoid. To be clear,
> > > > you and Christoph are the two reasons I've had to harp on process in the
> > > > past - everyone else in the kernel has been fine.
> > 
> > ...
> > 
> > > In this case you very clearly misunderstood how the FUA bit works on
> > > reads (and clearly documented that misunderstanding in the commit log).
> > 
> > FUA reads are clearly documented in the NVME spec.
> 
> NVMe's Read with FUA documentation is a pretty short:
> 
>   Force Unit Access (FUA): If this bit is set to `1´, then for data and
>   metadata, if any, associated with logical blocks specified by the Read
>   command, the controller shall:
> 
>     1) commit that data and metadata, if any, to non-volatile medium; and
>     2) read the data and metadata, if any, from non-volatile medium.
> 
> So this aligns with ATA and SCSI.
> 
> The concern about what you're doing is with respect to "1". Avoiding that
> requires all writes also set FUA, or you sent a flush command before doing any
> reads. Otherwise how would you know whether or not read data came from a
> previous cached write?

Line 1 is not the relevant concern here. That would be relevant if we
were trying to verify the integrity of writes end-to-end immediately
after they'd been completed - and that would be a lovely thing to do in
the spirit of "true end to end data integrity", but we're not ready to
contemplate such things due to the obvious performance implications.

The scenario here is: we go to read old data, we get a bit error, and
then we want to retry the read to verify that it wasn't a transient
error before declaring it dead (marking it poisoned).

In this scenario, "dirty data in the writeback cache" is something that
would only ever happen if userspace is doing very funny things with
O_DIRECT, and the drive flushing that data on FUA read is totally fine.
It's an irrelevant corner case.

But we do absolutely require checking for transient read errors, and we
will miss things and corrupt data unnecessarily without FUA reads that
bypass the controller cache.

> I'm actually more curious how you came to use this mechanism for recovery.
> Have you actually seen this approach fix a real problem? In my experience, a
> Read FUA is used to ensure what you're reading has been persistently committed.
> It can be used as an optimization to flush a specific LBA range, but I'd not
> seen this used as a bit error recovery mechanism.

It's not data recovery - in the conventional sense.

Typically, on conventional filesystems, this doesn't come up because a:
most filesystems don't even do data checksumming, and even then on a
filesystem with data checksumming this won't automatically come up -
because like other other transient errors, it's generally perfectly
acceptable to return the error and let the upper layer retry later; the
filesystem generaly doesn't have to care if the error is transient or
not.

Where this becomes a filesystem concern is when those errors start
driving state changes within the filesystem, i.e. when we need to track
and know about them: in that case, we absolutely do need to try hard to
distinguish between transient and permanent errors.

Specifically, bcachefs needs to be able to move data around - even if
it's bitrotted and generating checksum errors. There's a couple
different reasons we may need to move data:

- rebalance, this is the task that does arbitrary data processing in the
  background based on user policy (moving it to a slower device,
  compressing it or recompressing it with a more expensive algorithm)

- copygc, which compacts fragmented data in mostly-empty buckets and is
  required for the allocator and the whole system to be able to make
  forward progress

- device evacuate/removal

Rebalance is optional (and still needs to bitrotted extents to be
tracked so that it doesn't spin on them, retrying them endlessly), but
copygc and evacuate are not.

And the only way for these processes to handle bitrotted extents
(besides throwing them out, which obviously we're not going to do) is

- generate a new checksum
- mark them as poisoned, so that reads to userspace still return the
  appropriate error (unless it's a read done with a "I know there's
  errors, give me what you still have" API).

But the critical thing here is, marking an extent as poisoned will turn
a transient corruption into a permanent error unless we've made damn
sure there is no transient error with additional FUA retries.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 14:49                         ` Kent Overstreet
@ 2025-03-17 15:15                           ` Keith Busch
  2025-03-17 15:22                             ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Keith Busch @ 2025-03-17 15:15 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Mon, Mar 17, 2025 at 10:49:55AM -0400, Kent Overstreet wrote:
> But we do absolutely require checking for transient read errors, and we
> will miss things and corrupt data unnecessarily without FUA reads that
> bypass the controller cache.

Read FUA doesn't "bypass the controller cache". This is a "flush cache
first, then read the result" operation. That description seems
consistent for nvme, ata, and scsi.

You are only interested in the case where the read doesn't overlap with
any dirty cache, so the "flush" part isn't a factor. Okay fine, but I'm
still curious if you have actually seen a case where read data was
corrupt in cache but correct with FUA?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 15:15                           ` Keith Busch
@ 2025-03-17 15:22                             ` Kent Overstreet
  0 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-17 15:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Mon, Mar 17, 2025 at 09:15:04AM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 10:49:55AM -0400, Kent Overstreet wrote:
> > But we do absolutely require checking for transient read errors, and we
> > will miss things and corrupt data unnecessarily without FUA reads that
> > bypass the controller cache.
> 
> Read FUA doesn't "bypass the controller cache". This is a "flush cache
> first, then read the result" operation. That description seems
> consistent for nvme, ata, and scsi.

Quoting from your previous email,

> 2) read the data and metadata, if any, from non-volatile medium.

That's pretty specific.
 
> You are only interested in the case where the read doesn't overlap with
> any dirty cache, so the "flush" part isn't a factor. Okay fine, but I'm
> still curious if you have actually seen a case where read data was
> corrupt in cache but correct with FUA?

No.

It's completely typical and expected to code defensively against errors
which we have not observed but know are possible.

Is bit corruption possible in the device cache? You betcha, it's
hardware.

Are we going to have to test for correct implementation of READ|FUA in
the future? Given all the devices that don't even get flushes right,
again yes.

Are we still going to go ahead and do it? Yes, and I'd even say it's a
prerequisite for complaining to manufacturers that don't get this right.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 14:13                       ` Keith Busch
  2025-03-17 14:49                         ` Kent Overstreet
@ 2025-03-17 15:30                         ` Martin K. Petersen
  2025-03-17 15:43                           ` Kent Overstreet
  2025-03-17 17:32                           ` Keith Busch
  1 sibling, 2 replies; 63+ messages in thread
From: Martin K. Petersen @ 2025-03-17 15:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

Keith,

> In my experience, a Read FUA is used to ensure what you're reading has
> been persistently committed.

Exactly, that is the intended semantic.

Kent is trying to address a scenario in which data in the cache is
imperfect and the data on media is reliable. SCSI did not consider such
a scenario because the entire access model would be fundamentally broken
if cache contents couldn't be trusted.

FUA was defined to handle the exact opposite scenario: Data in cache is
reliable by definition and the media inherently unreliable.
Consequently, what all the standards describe is a flush-then-read
semantic, not an invalidate-cache-then-read ditto. The purpose of FUA is
to validate endurance of future reads.

SCSI has a different per-command flag for cache management. However, it
is only a hint and therefore does not offer the guarantee Kent is
looking for.

At least for SCSI, given how FUA is usually implemented, I consider it
quite unlikely that two read operations back to back would somehow cause
different data to be transferred. Regardless of which flags you use.

Also, and obviously this is also highly implementation-dependent, but I
don't think one should underestimate the amount of checking done along
the entire data path inside the device, including DRAM to the transport
interface ASIC. Data is usually validated by the ASIC on the way out and
from there on all modern transports have some form of checking. Having
corrupted data placed in host memory without an associated error
condition is not a common scenario. Bit flips in host memory are much
more common.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 15:30                         ` Martin K. Petersen
@ 2025-03-17 15:43                           ` Kent Overstreet
  2025-03-17 17:57                             ` Martin K. Petersen
  2025-03-17 17:32                           ` Keith Busch
  1 sibling, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-17 15:43 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 11:30:13AM -0400, Martin K. Petersen wrote:
> 
> Keith,
> 
> > In my experience, a Read FUA is used to ensure what you're reading has
> > been persistently committed.
> 
> Exactly, that is the intended semantic.
> 
> Kent is trying to address a scenario in which data in the cache is
> imperfect and the data on media is reliable. SCSI did not consider such
> a scenario because the entire access model would be fundamentally broken
> if cache contents couldn't be trusted.
> 
> FUA was defined to handle the exact opposite scenario: Data in cache is
> reliable by definition and the media inherently unreliable.
> Consequently, what all the standards describe is a flush-then-read
> semantic, not an invalidate-cache-then-read ditto. The purpose of FUA is
> to validate endurance of future reads.
> 
> SCSI has a different per-command flag for cache management. However, it
> is only a hint and therefore does not offer the guarantee Kent is
> looking for.
> 
> At least for SCSI, given how FUA is usually implemented, I consider it
> quite unlikely that two read operations back to back would somehow cause
> different data to be transferred. Regardless of which flags you use.

Based on what, exactly?

We're already seeing bit corruption caught - and corrupted with scrub,
when users aren't running with replicas=1 (which is fine for data that's
backed up elsewhere, of course).

Then it's just a question of where in the lower layers it comes from.

We _know_ devices are not perfect, and your claim that "it's quite
unlikely that two reads back to back would return different data"
amounts to claiming that there are no bugs in a good chunk of the IO
path and all that is implemented perfectly.

But that's just bullshit.

When we do a retry after observing bit corruption, we need to retry the
full read, not just from device cache, and the spec is quite specific on
that point.

Now, if you're claiming that a FUA read will behave differently, please
do state clearly what you think will happen with clean cached data in
the cache. But that'll amount to a clear spec violation...

> Also, and obviously this is also highly implementation-dependent, but I
> don't think one should underestimate the amount of checking done along
> the entire data path inside the device, including DRAM to the transport
> interface ASIC. Data is usually validated by the ASIC on the way out and
> from there on all modern transports have some form of checking. Having
> corrupted data placed in host memory without an associated error
> condition is not a common scenario. Bit flips in host memory are much
> more common.

Yeah, I'm not in the business of trusting the lower layers like this.
Trust, but verify :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 15:30                         ` Martin K. Petersen
  2025-03-17 15:43                           ` Kent Overstreet
@ 2025-03-17 17:32                           ` Keith Busch
  2025-03-18  6:19                             ` Christoph Hellwig
  1 sibling, 1 reply; 63+ messages in thread
From: Keith Busch @ 2025-03-17 17:32 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 11:30:13AM -0400, Martin K. Petersen wrote:
> Kent is trying to address a scenario in which data in the cache is
> imperfect and the data on media is reliable. SCSI did not consider such
> a scenario because the entire access model would be fundamentally broken
> if cache contents couldn't be trusted.

I'm not even sure it makes sense to suspect the controller side, but the
host side is considered reliable? If the media is in fact perfect, and
if non-FUA read returns data that fails the checksum, wouldn't it be
just as likely that the transport or host side is the problem? Every
NVMe SSD I've developered couldn't efficiently fill the PCIe tx-buffer
directly from media (excluding Optane; RIP), so a read with or without
FUA would stage the data in the very same controller-side DRAM.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 15:43                           ` Kent Overstreet
@ 2025-03-17 17:57                             ` Martin K. Petersen
  2025-03-17 18:21                               ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Martin K. Petersen @ 2025-03-17 17:57 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K. Petersen, Keith Busch, Christoph Hellwig, Jens Axboe,
	linux-bcachefs, linux-block, Linus Torvalds

Kent,

>> At least for SCSI, given how FUA is usually implemented, I consider
>> it quite unlikely that two read operations back to back would somehow
>> cause different data to be transferred. Regardless of which flags you
>> use.
>
> Based on what, exactly?

Based on the fact that many devices will either blindly flush on FUA or
they'll do the equivalent of a media verify operation. In neither case
will you get different data returned. The emphasis for FUA is on media
durability, not caching.

In most implementations the cache isn't an optional memory buffer thingy
that can be sidestepped. It is the only access mechanism that exists
between the media and the host interface. Working memory if you will. So
bypassing the device cache is not really a good way to think about it.

The purpose of FUA is to ensure durability for future reads, it is a
media management flag. As such, any effect FUA may have on the device
cache is incidental.

For SCSI there is a different flag to specify caching behavior. That
flag is orthogonal to FUA and did not get carried over to NVMe.

> We _know_ devices are not perfect, and your claim that "it's quite
> unlikely that two reads back to back would return different data"
> amounts to claiming that there are no bugs in a good chunk of the IO
> path and all that is implemented perfectly.

I'm not saying that devices are perfect or that the standards make
sense. I'm just saying that your desired behavior does not match the
reality of how a large number of these devices are actually implemented.

The specs are largely written by device vendors and therefore
deliberately ambiguous. Many of the explicit cache management bits and
bobs have been removed from SCSI or are defined as hints because device
vendors don't want the OS to interfere with how they manage resources,
including caching. I get what your objective is. I just don't think FUA
offers sufficient guarantees in that department.

Also, given the amount of hardware checking done at the device level, my
experience tells me that you are way more likely to have undetected
corruption problems on the host side than inside the storage device. In
general storage devices implement very extensive checking on both
control and data paths. And they will return an error if there is a
mismatch (as opposed to returning random data).

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 17:57                             ` Martin K. Petersen
@ 2025-03-17 18:21                               ` Kent Overstreet
  2025-03-17 19:24                                 ` Keith Busch
  2025-03-18  6:11                                 ` Christoph Hellwig
  0 siblings, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-17 18:21 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> 
> Kent,
> 
> >> At least for SCSI, given how FUA is usually implemented, I consider
> >> it quite unlikely that two read operations back to back would somehow
> >> cause different data to be transferred. Regardless of which flags you
> >> use.
> >
> > Based on what, exactly?
> 
> Based on the fact that many devices will either blindly flush on FUA or
> they'll do the equivalent of a media verify operation. In neither case
> will you get different data returned. The emphasis for FUA is on media
> durability, not caching.
> 
> In most implementations the cache isn't an optional memory buffer thingy
> that can be sidestepped. It is the only access mechanism that exists
> between the media and the host interface. Working memory if you will. So
> bypassing the device cache is not really a good way to think about it.
> 
> The purpose of FUA is to ensure durability for future reads, it is a
> media management flag. As such, any effect FUA may have on the device
> cache is incidental.
> 
> For SCSI there is a different flag to specify caching behavior. That
> flag is orthogonal to FUA and did not get carried over to NVMe.
> 
> > We _know_ devices are not perfect, and your claim that "it's quite
> > unlikely that two reads back to back would return different data"
> > amounts to claiming that there are no bugs in a good chunk of the IO
> > path and all that is implemented perfectly.
> 
> I'm not saying that devices are perfect or that the standards make
> sense. I'm just saying that your desired behavior does not match the
> reality of how a large number of these devices are actually implemented.
> 
> The specs are largely written by device vendors and therefore
> deliberately ambiguous. Many of the explicit cache management bits and
> bobs have been removed from SCSI or are defined as hints because device
> vendors don't want the OS to interfere with how they manage resources,
> including caching. I get what your objective is. I just don't think FUA
> offers sufficient guarantees in that department.

If you're saying this is going to be a work in progress to get the
behaviour we need in this scenario - yes, absolutely.

Beyond making sure that retries go to the physical media, there's "retry
level" in the NVME spec which needs to be plumbed, and that one will be
particularly useful in multi device scenarios. (Crank retry level up
or down based on whether we can retry from different devices).

But we've got to start somewhere, and given that the spec says "bypass
the cache" - that looks like the place to start. If devices don't
support the behaviour we want today, then nudging the drive
manufacturers to support it is infinitely saner than getting a whole
nother bit plumbed through the NVME standard, especially given that the
letter of the spec does describe exactly what we want.

So: I understand your concerns, but they're out of scope for the moment.
As long as nothing actively breaks when we feed it READ|FUA, it's
totally fine.

Later on I can imagine us adding some basic sanity tests to alert if
READ|FUA is behaving as expected; it's pretty easy to test latency of
random vs. reads to the same location vs. FUA reads to the same
location.

So yes: there will be more work to do in this area.

> Also, given the amount of hardware checking done at the device level, my
> experience tells me that you are way more likely to have undetected
> corruption problems on the host side than inside the storage device. In
> general storage devices implement very extensive checking on both
> control and data paths. And they will return an error if there is a
> mismatch (as opposed to returning random data).

Bugs host side are very much a concern, yes (we can only do so much
against memory stompers, and the RMW that buffered IO does is a giant
hole), but at the filesystem level there are techniques for avoiding
corruption that are not available at the block level.  The main one
being, every pointer carries the checksum that validates the data it
points to - the ZFS model.

So we do that, and additionally we're _very_ careful when moving around
data to carry around existing checksums, and validate the old after
generating the new (or sum up new checksums to validate them against the
old directly) when moving data around.

IOW - in my world, it's the filesystem's job to verify and authenticate
everything.

And layers above bcachefs do their own verification and authentication
as well - nixos package builders and git being the big ones that have
caught bugs for us.

It all matters...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 18:21                               ` Kent Overstreet
@ 2025-03-17 19:24                                 ` Keith Busch
  2025-03-17 19:40                                   ` Kent Overstreet
  2025-03-18  6:11                                 ` Christoph Hellwig
  1 sibling, 1 reply; 63+ messages in thread
From: Keith Busch @ 2025-03-17 19:24 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote:
> On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> > I'm not saying that devices are perfect or that the standards make
> > sense. I'm just saying that your desired behavior does not match the
> > reality of how a large number of these devices are actually implemented.
> > 
> > The specs are largely written by device vendors and therefore
> > deliberately ambiguous. Many of the explicit cache management bits and
> > bobs have been removed from SCSI or are defined as hints because device
> > vendors don't want the OS to interfere with how they manage resources,
> > including caching. I get what your objective is. I just don't think FUA
> > offers sufficient guarantees in that department.
> 
> If you're saying this is going to be a work in progress to get the
> behaviour we need in this scenario - yes, absolutely.
> 
> Beyond making sure that retries go to the physical media, there's "retry
> level" in the NVME spec which needs to be plumbed, and that one will be
> particularly useful in multi device scenarios. (Crank retry level up
> or down based on whether we can retry from different devices).

I saw you mention the RRL mechanism in another patch, and it really
piqued my interest. How are you intending to use this? In NVMe, this is
controlled via an admin "Set Feature" command, which is absolutley not
available to a block device, much less a file system. That command queue
is only accesible to the driver and to user space admin, and is
definitely not a per-io feature.

> But we've got to start somewhere, and given that the spec says "bypass
> the cache" - that looks like the place to start. 

This is a bit dangerous to assume. I don't find anywhere in any nvme
specifications (also checked T10 SBC) with text saying anything similiar
to "bypass" in relation to the cache for FUA reads. I am reasonably
confident some vendors, especially ones developing active-active
controllers, will fight you to the their win on the spec committee for
this if you want to take it up in those forums.

> If devices don't support the behaviour we want today, then nudging the
> drive manufacturers to support it is infinitely saner than getting a
> whole nother bit plumbed through the NVME standard, especially given
> that the letter of the spec does describe exactly what we want.

I my experience, the storage standards committees are more aligned to
accomodate appliance vendors than anything Linux specific. Your desired
read behavior would almost certainly be a new TPAR in NVMe to get spec
defined behavior. It's not impossible, but I'll just say it is an uphill
battle and the end result may or may not look like what you have in
mind.

In summary, what we have by the specs from READ FUA:

 Flush and Read

What (I think) you want:

 Invalidate and Read

It sounds like you are trying to say that your scenario doesn't care
about the "Flush" so you get to use the existing semantics as the
"Invalidate" case, and I really don't think you get that guarantee from
any spec.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 19:24                                 ` Keith Busch
@ 2025-03-17 19:40                                   ` Kent Overstreet
  2025-03-17 20:39                                     ` Keith Busch
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-17 19:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote:
> > On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> > > I'm not saying that devices are perfect or that the standards make
> > > sense. I'm just saying that your desired behavior does not match the
> > > reality of how a large number of these devices are actually implemented.
> > > 
> > > The specs are largely written by device vendors and therefore
> > > deliberately ambiguous. Many of the explicit cache management bits and
> > > bobs have been removed from SCSI or are defined as hints because device
> > > vendors don't want the OS to interfere with how they manage resources,
> > > including caching. I get what your objective is. I just don't think FUA
> > > offers sufficient guarantees in that department.
> > 
> > If you're saying this is going to be a work in progress to get the
> > behaviour we need in this scenario - yes, absolutely.
> > 
> > Beyond making sure that retries go to the physical media, there's "retry
> > level" in the NVME spec which needs to be plumbed, and that one will be
> > particularly useful in multi device scenarios. (Crank retry level up
> > or down based on whether we can retry from different devices).
> 
> I saw you mention the RRL mechanism in another patch, and it really
> piqued my interest. How are you intending to use this? In NVMe, this is
> controlled via an admin "Set Feature" command, which is absolutley not
> available to a block device, much less a file system. That command queue
> is only accesible to the driver and to user space admin, and is
> definitely not a per-io feature.

Oof, that's going to be a giant hassle then. That is something we
definitely want, but it may be something for well down the line then.

My more immediate priority is going to be finishing ZNS support, since
that will no doubt inform anything we do in that area.
  
> > But we've got to start somewhere, and given that the spec says "bypass
> > the cache" - that looks like the place to start. 
> 
> This is a bit dangerous to assume. I don't find anywhere in any nvme
> specifications (also checked T10 SBC) with text saying anything similiar
> to "bypass" in relation to the cache for FUA reads. I am reasonably
> confident some vendors, especially ones developing active-active
> controllers, will fight you to the their win on the spec committee for
> this if you want to take it up in those forums.

"Read will come direct from media" reads pretty clear to me.

But even if it's not supported the way we want, I'm not seeing anything
dangerous about using it this way. Worst case, our retries aren't as
good as we want them to be, and it'll be an item to work on in the
future.

As long as drives aren't barfing when we give them a read fua (and so
far they haven't when running this code), we're fine for now.

> > If devices don't support the behaviour we want today, then nudging the
> > drive manufacturers to support it is infinitely saner than getting a
> > whole nother bit plumbed through the NVME standard, especially given
> > that the letter of the spec does describe exactly what we want.
> 
> I my experience, the storage standards committees are more aligned to
> accomodate appliance vendors than anything Linux specific. Your desired
> read behavior would almost certainly be a new TPAR in NVMe to get spec
> defined behavior. It's not impossible, but I'll just say it is an uphill
> battle and the end result may or may not look like what you have in
> mind.

I'm not so sure.

If there are users out there depending on a different meaning of read
fua, then yes, absolutely (and it sounds like Martin might have been
alluding to that - but why wouldn't the write have been done fua? I'd
want to hear more about that).

If, OTOH, this is just something that hasn't come up before - the
language in the spec is already there, so once code is out there with
enough users and a demonstrated use case then it might be a pretty
simple nudge - "shoot down this range of the cache, don't just flush it"
is a pretty simple code change, as far as these things go.

> In summary, what we have by the specs from READ FUA:
> 
>  Flush and Read
> 
> What (I think) you want:
> 
>  Invalidate and Read
> 
> It sounds like you are trying to say that your scenario doesn't care
> about the "Flush" so you get to use the existing semantics as the
> "Invalidate" case, and I really don't think you get that guarantee from
> any spec.

Exactly. Previous data being flushed, if it was dirty, is totally fine.

Specs aren't worth much if no one's depending on or testing a given
behaviour, so what the spec strictly guarantees doesn't really matter
here. What matters more is - does the behaviour make sense, will it be
easy enough to implement, and does it conflict with behaviour anyone
else is depneding on.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 19:40                                   ` Kent Overstreet
@ 2025-03-17 20:39                                     ` Keith Busch
  2025-03-17 21:13                                       ` Bart Van Assche
  2025-03-18  0:27                                       ` Kent Overstreet
  0 siblings, 2 replies; 63+ messages in thread
From: Keith Busch @ 2025-03-17 20:39 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote:
> On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote:
> > This is a bit dangerous to assume. I don't find anywhere in any nvme
> > specifications (also checked T10 SBC) with text saying anything similiar
> > to "bypass" in relation to the cache for FUA reads. I am reasonably
> > confident some vendors, especially ones developing active-active
> > controllers, will fight you to the their win on the spec committee for
> > this if you want to take it up in those forums.
> 
> "Read will come direct from media" reads pretty clear to me.
>
> But even if it's not supported the way we want, I'm not seeing anything
> dangerous about using it this way. Worst case, our retries aren't as
> good as we want them to be, and it'll be an item to work on in the
> future.

I don't think you're appreciating the complications that active-active
and multi-host brings to the scenario. Those are why this is not the
forum to solve it. The specs need to be clear on the guarantees, and
what they currently guarnatee might provide some overlap with what
you're seeking in specific scenarios, but I really think (and I believe
Martin agrees) your use is outside its targeted use case.
 
> As long as drives aren't barfing when we give them a read fua (and so
> far they haven't when running this code), we're fine for now.

In this specific regard, I think its safe to assume the devices will
remain operational.

> > > If devices don't support the behaviour we want today, then nudging the
> > > drive manufacturers to support it is infinitely saner than getting a
> > > whole nother bit plumbed through the NVME standard, especially given
> > > that the letter of the spec does describe exactly what we want.
> > 
> > I my experience, the storage standards committees are more aligned to
> > accomodate appliance vendors than anything Linux specific. Your desired
> > read behavior would almost certainly be a new TPAR in NVMe to get spec
> > defined behavior. It's not impossible, but I'll just say it is an uphill
> > battle and the end result may or may not look like what you have in
> > mind.
> 
> I'm not so sure.
> 
> If there are users out there depending on a different meaning of read
> fua, then yes, absolutely (and it sounds like Martin might have been
> alluding to that - but why wouldn't the write have been done fua? I'd
> want to hear more about that)

As I mentioned, READ FUA provides an optimization opportunity that
can be used instead of Write + Flush or WriteFUA when the host isn't
sure about the persistence needs at the time of the initial Write: it
can be used as a checkpoint on a specific block range that you may have
written and overwritten. This kind of "read" command provides a well
defined persistence barrier. Thinking of Read FUA as a barrier is better
aligned with how the standards and device makers intended it to be used.

> If, OTOH, this is just something that hasn't come up before - the
> language in the spec is already there, so once code is out there with
> enough users and a demonstrated use case then it might be a pretty
> simple nudge - "shoot down this range of the cache, don't just flush it"
> is a pretty simple code change, as far as these things go.

So you're telling me you've never written SSD firmware then waited for
the manufacturer to release it to your users? Yes, I jest, and maybe
YMMV; but relying on that process is putting your destiny in the wrong
hands.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 00/14] better handling of checksum errors/bitrot
  2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
                   ` (13 preceding siblings ...)
  2025-03-11 20:15 ` [PATCH 14/14] bcachefs: Read retries are after checksum errors now REQ_FUA Kent Overstreet
@ 2025-03-17 20:55 ` John Stoffel
  2025-03-17 21:12   ` errors compiling bcachefs-tools v1.20.0 on debian 12 John Stoffel
  2025-03-18  1:15   ` [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
  14 siblings, 2 replies; 63+ messages in thread
From: John Stoffel @ 2025-03-17 20:55 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel

>>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:

> Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> we've got an extent that can't be read, due to checksum error/bitrot.

> This took some doing to fix properly, because

> - We don't want to just delete such data (replace it with
>   KEY_TYPE_error); we never want to delete anything except when
>   explicitly told to by the user, and while we don't yet have an API for
>   "read this file even though there's errors, just give me what you
>   have" we definitely will in the future.

So will open() just return an error?  How will this look from
userspace?  

> - Not being able to move data is a non-option: that would block copygc,
>   device removal, etc.

> - And, moving said extent requires giving it a new checksum - strictly
>   required if the move has to fragment it, teaching the write/move path
>   about extents with bad checksums is unpalateable, and anyways we'd
>   like to be able to guard against more bitrot, if we can.

Why does it need a new checksum if there are read errors?  What
happens if there are more read errors?   If I have a file on a
filesystem which is based in spinning rust and I get a single bit
flip, I'm super happy you catch it.  

But now you re-checksum the file, with the read error, and return it?
I'm stupid and just a user/IT guy.  I want notifications, but I don't
want my application to block so I can't kill it, or unmount the
filesystem.  Or continue to use it if I like.  

> So that means:

> - Extents need a poison bit: "reads return errors, even though it now
>   has a good checksum" - this was added in a separate patch queued up
>   for 6.15.

Sorry for being dense, but does a file have one or more extents?  Or
is this at a level below that?  

>   It's an incompat feature because it's a new extent field, and old
>   versions can't parse extents with unknown field types, since they
>   won't know their sizes - meaning users will have to explicitly do an
>   incompat upgrade to make use of this stuff.

> - The read path needs to do additional retries after checksum errors
>   before giving up and marking it poisoned, so that we don't
>   accidentally convert a transient error to permanent corruption.

When doing these retries, is the filesystem locked up or will the
process doing the read() be blocked from being killed?  

> - The read path gets a whole bunch of work to plumb precise modern error
>   codes around, so that e.g. the retry path, the data move path, and the
>   "mark extent poisoned" path all know exactly what's going on.

> - Read path is responsible for marking extents poisoned after sufficient
>   retry attempts (controlled by a new filesystem option)

> - Data move path is allowed to move extents after a read error, if it's
>   a checksum error (giving it a new checksum) if it's been poisoned
>   (i.e. the extent flags feature is enabled).

So if just a single bit flips, the extent gets moved onto better
storage, and the file gets re-checksummed.  But what about if more
bits go bad afterwards?  

> Code should be more or less finalized - still have more tests for corner
> cases to write, but "write corrupt data and then tell rebalance to move
> it to another device" works as expected.

> TODO:

> - NVME has a "read recovery level" attribute that controlls how hard the
>   erasure coding algorithms work - we want that plumbed.

>   Before we give up and move data that we know is bad, we need to try
>   _as hard as possible_ to get a successful read.

What does this mean exactly in terms of end user and SysAdmin point of
view?  Locking up the system (no new writes to the now problematic
storage) that I can't kill would be annoyingly painful.  


Don't get me wrong, I'm happy to see data integrity stuff get added,
I'm just trying to understand how it interacts with userspace.  

John



^ permalink raw reply	[flat|nested] 63+ messages in thread

* errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel
@ 2025-03-17 21:12   ` John Stoffel
  2025-03-17 21:48     ` Malte Schröder
  2025-03-18  1:15   ` [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
  1 sibling, 1 reply; 63+ messages in thread
From: John Stoffel @ 2025-03-17 21:12 UTC (permalink / raw)
  To: linux-bcachefs


Hi guys,
I'm trying to upgrade the bcachefs tools since I tend to use newer
kernels on my Debian 12 system.  The debian packages bcachefs-tools
are ancient.  

So I cloned the repo, installed the pre-reqs and I'm trying to compile
it, but I get the following:


quad:/bcachefs/bcachefs1/bcachefs-tools$ make
    [CC]     c_src/bcachefs.o
In file included from ./libbcachefs/bcachefs.h:202,
                 from c_src/tools-util.h:21,
                 from c_src/cmds.h:10,
                 from c_src/bcachefs.c:26:
include/linux/srcu.h:10:41: error: return type is an incomplete type
   10 | static inline struct urcu_gp_poll_state get_state_synchronize_rcu()
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h: In function ‘get_state_synchronize_rcu’:
include/linux/srcu.h:12:16: warning: implicit declaration of function ‘start_poll_synchronize_rcu’ ]
   12 |         return start_poll_synchronize_rcu();
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h:12:16: warning: ‘return’ with a value, in function returning void [-Wreturn-ty]
   12 |         return start_poll_synchronize_rcu();
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h:10:41: note: declared here
   10 | static inline struct urcu_gp_poll_state get_state_synchronize_rcu()
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h: At top level:
include/linux/srcu.h:25:99: error: parameter 2 (‘cookie’) has incomplete type
   25 | bool poll_state_synchronize_srcu(struct srcu_struct *ssp, struct urcu_gp_poll_state cookie)
      |                                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~

include/linux/srcu.h: In function ‘poll_state_synchronize_srcu’:
include/linux/srcu.h:27:16: warning: implicit declaration of function ‘poll_state_synchronize_rcu’;]
   27 |         return poll_state_synchronize_rcu(cookie);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                poll_state_synchronize_srcu
include/linux/srcu.h: At top level:
include/linux/srcu.h:30:41: error: return type is an incomplete type
   30 | static inline struct urcu_gp_poll_state start_poll_synchronize_srcu(struct srcu_struct *ssp)
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h: In function ‘start_poll_synchronize_srcu’:
include/linux/srcu.h:32:16: warning: ‘return’ with a value, in function returning void [-Wreturn-ty]
   32 |         return start_poll_synchronize_rcu();
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h:30:41: note: declared here
   30 | static inline struct urcu_gp_poll_state start_poll_synchronize_srcu(struct srcu_struct *ssp)
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/srcu.h: At top level:
include/linux/srcu.h:35:41: error: return type is an incomplete type
   35 | static inline struct urcu_gp_poll_state get_state_synchronize_srcu(struct srcu_struct *ssp)
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~
make: *** [Makefile:171: c_src/bcachefs.o] Error 1
quad:/bcachefs/bcachefs1/bcachefs-tools$ git branch
  master  * v1.20.0

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 20:39                                     ` Keith Busch
@ 2025-03-17 21:13                                       ` Bart Van Assche
  2025-03-18  1:06                                         ` Kent Overstreet
  2025-03-18  0:27                                       ` Kent Overstreet
  1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-03-17 21:13 UTC (permalink / raw)
  To: Keith Busch, Kent Overstreet
  Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On 3/17/25 1:39 PM, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote:
>> If, OTOH, this is just something that hasn't come up before - the
>> language in the spec is already there, so once code is out there with
>> enough users and a demonstrated use case then it might be a pretty
>> simple nudge - "shoot down this range of the cache, don't just flush it"
>> is a pretty simple code change, as far as these things go.
> 
> So you're telling me you've never written SSD firmware then waited for
> the manufacturer to release it to your users? Yes, I jest, and maybe
> YMMV; but relying on that process is putting your destiny in the wrong
> hands.

As far as I know in the Linux kernel we only support storage device 
behavior that has been standardized and also workarounds for bugs. I'm
not sure we should support behavior in the Linux kernel that deviates
from existing standards and that also deviates from longstanding and
well-established conventions.

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-17 21:12   ` errors compiling bcachefs-tools v1.20.0 on debian 12 John Stoffel
@ 2025-03-17 21:48     ` Malte Schröder
  2025-03-17 23:10       ` John Stoffel
  2025-03-18 21:04       ` John Stoffel
  0 siblings, 2 replies; 63+ messages in thread
From: Malte Schröder @ 2025-03-17 21:48 UTC (permalink / raw)
  To: John Stoffel, linux-bcachefs

On 17/03/2025 22:12, John Stoffel wrote:
> Hi guys,
> I'm trying to upgrade the bcachefs tools since I tend to use newer
> kernels on my Debian 12 system.  The debian packages bcachefs-tools
> are ancient.  
>
> So I cloned the repo, installed the pre-reqs and I'm trying to compile
> it, but I get the following:
>
>
> quad:/bcachefs/bcachefs1/bcachefs-tools$ make
>     [CC]     c_src/bcachefs.o
> In file included from ./libbcachefs/bcachefs.h:202,
>                  from c_src/tools-util.h:21,
>                  from c_src/cmds.h:10,
>                  from c_src/bcachefs.c:26:
> include/linux/srcu.h:10:41: error: return type is an incomplete type
>    10 | static inline struct urcu_gp_poll_state get_state_synchronize_rcu()
>       |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h: In function ‘get_state_synchronize_rcu’:
> include/linux/srcu.h:12:16: warning: implicit declaration of function ‘start_poll_synchronize_rcu’ ]
>    12 |         return start_poll_synchronize_rcu();
>       |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h:12:16: warning: ‘return’ with a value, in function returning void [-Wreturn-ty]
>    12 |         return start_poll_synchronize_rcu();
>       |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h:10:41: note: declared here
>    10 | static inline struct urcu_gp_poll_state get_state_synchronize_rcu()
>       |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h: At top level:
> include/linux/srcu.h:25:99: error: parameter 2 (‘cookie’) has incomplete type
>    25 | bool poll_state_synchronize_srcu(struct srcu_struct *ssp, struct urcu_gp_poll_state cookie)
>       |                                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
>
> include/linux/srcu.h: In function ‘poll_state_synchronize_srcu’:
> include/linux/srcu.h:27:16: warning: implicit declaration of function ‘poll_state_synchronize_rcu’;]
>    27 |         return poll_state_synchronize_rcu(cookie);
>       |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>       |                poll_state_synchronize_srcu
> include/linux/srcu.h: At top level:
> include/linux/srcu.h:30:41: error: return type is an incomplete type
>    30 | static inline struct urcu_gp_poll_state start_poll_synchronize_srcu(struct srcu_struct *ssp)
>       |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h: In function ‘start_poll_synchronize_srcu’:
> include/linux/srcu.h:32:16: warning: ‘return’ with a value, in function returning void [-Wreturn-ty]
>    32 |         return start_poll_synchronize_rcu();
>       |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h:30:41: note: declared here
>    30 | static inline struct urcu_gp_poll_state start_poll_synchronize_srcu(struct srcu_struct *ssp)
>       |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/srcu.h: At top level:
> include/linux/srcu.h:35:41: error: return type is an incomplete type
>    35 | static inline struct urcu_gp_poll_state get_state_synchronize_srcu(struct srcu_struct *ssp)
>       |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~
> make: *** [Makefile:171: c_src/bcachefs.o] Error 1
> quad:/bcachefs/bcachefs1/bcachefs-tools$ git branch
>   master  * v1.20.0
>
Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
someone pulled it of by forcing some libs from Trixie, but that caused
other things to break.



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-17 21:48     ` Malte Schröder
@ 2025-03-17 23:10       ` John Stoffel
  2025-03-18 21:04       ` John Stoffel
  1 sibling, 0 replies; 63+ messages in thread
From: John Stoffel @ 2025-03-17 23:10 UTC (permalink / raw)
  To: Malte Schröder; +Cc: John Stoffel, linux-bcachefs

>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:

> On 17/03/2025 22:12, John Stoffel wrote:
>> Hi guys,
>> I'm trying to upgrade the bcachefs tools since I tend to use newer
>> kernels on my Debian 12 system.  The debian packages bcachefs-tools
>> are ancient.  
>> 
>> So I cloned the repo, installed the pre-reqs and I'm trying to compile
>> it, but I get the following:
>> 
>> 
>> quad:/bcachefs/bcachefs1/bcachefs-tools$ make
>> [CC]     c_src/bcachefs.o
>> In file included from ./libbcachefs/bcachefs.h:202,
>> from c_src/tools-util.h:21,
>> from c_src/cmds.h:10,
>> from c_src/bcachefs.c:26:
>> include/linux/srcu.h:10:41: error: return type is an incomplete type
>> 10 | static inline struct urcu_gp_poll_state get_state_synchronize_rcu()
>> |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h: In function ‘get_state_synchronize_rcu’:
>> include/linux/srcu.h:12:16: warning: implicit declaration of function ‘start_poll_synchronize_rcu’ ]
>> 12 |         return start_poll_synchronize_rcu();
>> |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h:12:16: warning: ‘return’ with a value, in function returning void [-Wreturn-ty]
>> 12 |         return start_poll_synchronize_rcu();
>> |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h:10:41: note: declared here
>> 10 | static inline struct urcu_gp_poll_state get_state_synchronize_rcu()
>> |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h: At top level:
>> include/linux/srcu.h:25:99: error: parameter 2 (‘cookie’) has incomplete type
>> 25 | bool poll_state_synchronize_srcu(struct srcu_struct *ssp, struct urcu_gp_poll_state cookie)
>> |                                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
>> 
>> include/linux/srcu.h: In function ‘poll_state_synchronize_srcu’:
>> include/linux/srcu.h:27:16: warning: implicit declaration of function ‘poll_state_synchronize_rcu’;]
>> 27 |         return poll_state_synchronize_rcu(cookie);
>> |                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> |                poll_state_synchronize_srcu
>> include/linux/srcu.h: At top level:
>> include/linux/srcu.h:30:41: error: return type is an incomplete type
>> 30 | static inline struct urcu_gp_poll_state start_poll_synchronize_srcu(struct srcu_struct *ssp)
>> |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h: In function ‘start_poll_synchronize_srcu’:
>> include/linux/srcu.h:32:16: warning: ‘return’ with a value, in function returning void [-Wreturn-ty]
>> 32 |         return start_poll_synchronize_rcu();
>> |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h:30:41: note: declared here
>> 30 | static inline struct urcu_gp_poll_state start_poll_synchronize_srcu(struct srcu_struct *ssp)
>> |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/srcu.h: At top level:
>> include/linux/srcu.h:35:41: error: return type is an incomplete type
>> 35 | static inline struct urcu_gp_poll_state get_state_synchronize_srcu(struct srcu_struct *ssp)
>> |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> make: *** [Makefile:171: c_src/bcachefs.o] Error 1
>> quad:/bcachefs/bcachefs1/bcachefs-tools$ git branch
>> master  * v1.20.0
>> 

> Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
> someone pulled it of by forcing some libs from Trixie, but that
> caused other things to break.

Since the docs say I can use 12/Bookworm, I was quite surprised. Can
someone update the requirements for what's currently required for
building?  And maybe the process of taking the built image back to
older releases?  It's kinda annoying that the tools are _so_ bleeding
edge.

John


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 20:39                                     ` Keith Busch
  2025-03-17 21:13                                       ` Bart Van Assche
@ 2025-03-18  0:27                                       ` Kent Overstreet
  1 sibling, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-18  0:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 02:39:07PM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote:
> > On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote:
> > > This is a bit dangerous to assume. I don't find anywhere in any nvme
> > > specifications (also checked T10 SBC) with text saying anything similiar
> > > to "bypass" in relation to the cache for FUA reads. I am reasonably
> > > confident some vendors, especially ones developing active-active
> > > controllers, will fight you to the their win on the spec committee for
> > > this if you want to take it up in those forums.
> > 
> > "Read will come direct from media" reads pretty clear to me.
> >
> > But even if it's not supported the way we want, I'm not seeing anything
> > dangerous about using it this way. Worst case, our retries aren't as
> > good as we want them to be, and it'll be an item to work on in the
> > future.
> 
> I don't think you're appreciating the complications that active-active
> and multi-host brings to the scenario. Those are why this is not the
> forum to solve it. The specs need to be clear on the guarantees, and
> what they currently guarnatee might provide some overlap with what
> you're seeking in specific scenarios, but I really think (and I believe
> Martin agrees) your use is outside its targeted use case.

You do realize this is a single node filesystem we're talking about,
right?

Crazy multi-homing stuff, while cool, has no bearing here.

>  
> > As long as drives aren't barfing when we give them a read fua (and so
> > far they haven't when running this code), we're fine for now.
> 
> In this specific regard, I think its safe to assume the devices will
> remain operational.
> 
> > > > If devices don't support the behaviour we want today, then nudging the
> > > > drive manufacturers to support it is infinitely saner than getting a
> > > > whole nother bit plumbed through the NVME standard, especially given
> > > > that the letter of the spec does describe exactly what we want.
> > > 
> > > I my experience, the storage standards committees are more aligned to
> > > accomodate appliance vendors than anything Linux specific. Your desired
> > > read behavior would almost certainly be a new TPAR in NVMe to get spec
> > > defined behavior. It's not impossible, but I'll just say it is an uphill
> > > battle and the end result may or may not look like what you have in
> > > mind.
> > 
> > I'm not so sure.
> > 
> > If there are users out there depending on a different meaning of read
> > fua, then yes, absolutely (and it sounds like Martin might have been
> > alluding to that - but why wouldn't the write have been done fua? I'd
> > want to hear more about that)
> 
> As I mentioned, READ FUA provides an optimization opportunity that
> can be used instead of Write + Flush or WriteFUA when the host isn't
> sure about the persistence needs at the time of the initial Write: it
> can be used as a checkpoint on a specific block range that you may have
> written and overwritten. This kind of "read" command provides a well
> defined persistence barrier. Thinking of Read FUA as a barrier is better
> aligned with how the standards and device makers intended it to be used.

Yeah, I got that. Again, a neat trick, but who in their right mind would
use that sort of thing when the sane thing to do is just write fua?

So I'm skeptical that that sort of thing is an actual use with any
bearing on the devices any single node filesystem targets.

Now, in crazy enterprise multi homing land, sure.

Now, if you're saying you think the standard should be interpreted in a
way such that read fua does what it seems you and Martin want it to do
in crazy enterprise multi homing land... now _that_ would be a fun
argument to have in a standards committee :)

But I mostly jest, because my suspicion is that there wouldn't be any
real conflict, just a bit of the "I hadn't thought to use it that way"
anxiety.

> > If, OTOH, this is just something that hasn't come up before - the
> > language in the spec is already there, so once code is out there with
> > enough users and a demonstrated use case then it might be a pretty
> > simple nudge - "shoot down this range of the cache, don't just flush it"
> > is a pretty simple code change, as far as these things go.
> 
> So you're telling me you've never written SSD firmware then waited for
> the manufacturer to release it to your users? Yes, I jest, and maybe
> YMMV; but relying on that process is putting your destiny in the wrong
> hands.

Nah, back when I was at an employer that did SSD drivers/firmware, it
was all in house - no waiting to ship required :)

And incidentally, it's been fun watching the "FTL host or device side"
thing go back and forth since then; we had the same back-and-forth
happen just between different generations of the internal stuff being
built at Google that's been happening now with NVME and ZNS.

The appeal of a host side FTL was fairly obvious back then, but the FTL
ended up moving from the host to the device because people wanted to do
complete kernel IO stack bypass. The AIO and DIO code were really bad
back then especially in certain multithreaded scenarios - profiles were
absolutely atrocious, and the performance gains of a host side FTL
only really happen if you can do it in combination with the filesystem,
cutting out a lot of redundancy and improving on GC efficiency (this is
a big one) because in the filesystem, you have a lot more information on
what goes with what, that's lost at the block layer. IO tail latency in
particular improves, especially on loaded machines.

But a filesystem meant for that didn't exist at the time, nor did
hardware with any kind of an open interface...

Since then the kernel IO stack has gotten massively faster, ZNS hardware
exists, and bcachefs was pretty much designed from the start for
directly driving flash. There's about a month of work left, max, for
finish that off and driving hardware I have sitting on my desk.

Which means there should be more interesting things happen in the NVME
transport area in the future. In particular, moving the FTL into the
filesystem ought to allow for much more gracefully degrading failure
modes instead of the whole SSD going offline (and signals to the user
about when flash is going bad! flash ECC algorithms give you this, it's
just not exposed!).

So should be fun times.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 21:13                                       ` Bart Van Assche
@ 2025-03-18  1:06                                         ` Kent Overstreet
  2025-03-18  6:16                                           ` Christoph Hellwig
  2025-03-18 17:49                                           ` Bart Van Assche
  0 siblings, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-18  1:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe,
	linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 02:13:14PM -0700, Bart Van Assche wrote:
> On 3/17/25 1:39 PM, Keith Busch wrote:
> > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote:
> > > If, OTOH, this is just something that hasn't come up before - the
> > > language in the spec is already there, so once code is out there with
> > > enough users and a demonstrated use case then it might be a pretty
> > > simple nudge - "shoot down this range of the cache, don't just flush it"
> > > is a pretty simple code change, as far as these things go.
> > 
> > So you're telling me you've never written SSD firmware then waited for
> > the manufacturer to release it to your users? Yes, I jest, and maybe
> > YMMV; but relying on that process is putting your destiny in the wrong
> > hands.
> 
> As far as I know in the Linux kernel we only support storage device behavior
> that has been standardized and also workarounds for bugs. I'm
> not sure we should support behavior in the Linux kernel that deviates
> from existing standards and that also deviates from longstanding and
> well-established conventions.

What bcachefs is doing is entirely in line with the behaviour the
standard states.

Keith and Martin are describing systems with a different interpretation,
because they want the side effect (flush cache) without the main effect
(possibly evict cache, but read comes from media).

It's an amusing state of affairs, but it'd be easily resolved with an
admin level NVME command to flip a state bit (like the read recovery
level we were also talking about), and anyways multihoming capable NVME
devices are an entirely different market from the conventional stuff.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 00/14] better handling of checksum errors/bitrot
  2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel
  2025-03-17 21:12   ` errors compiling bcachefs-tools v1.20.0 on debian 12 John Stoffel
@ 2025-03-18  1:15   ` Kent Overstreet
  2025-03-18 14:47     ` John Stoffel
  1 sibling, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-18  1:15 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel

On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
> 
> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> > we've got an extent that can't be read, due to checksum error/bitrot.
> 
> > This took some doing to fix properly, because
> 
> > - We don't want to just delete such data (replace it with
> >   KEY_TYPE_error); we never want to delete anything except when
> >   explicitly told to by the user, and while we don't yet have an API for
> >   "read this file even though there's errors, just give me what you
> >   have" we definitely will in the future.
> 
> So will open() just return an error?  How will this look from
> userspace?  

Not the open, the read - the typical case is only a single extent goes
bad; it's like any other IO error.

> > - Not being able to move data is a non-option: that would block copygc,
> >   device removal, etc.
> 
> > - And, moving said extent requires giving it a new checksum - strictly
> >   required if the move has to fragment it, teaching the write/move path
> >   about extents with bad checksums is unpalateable, and anyways we'd
> >   like to be able to guard against more bitrot, if we can.
> 
> Why does it need a new checksum if there are read errors?  What
> happens if there are more read errors?   If I have a file on a
> filesystem which is based in spinning rust and I get a single bit
> flip, I'm super happy you catch it.  

The data move paths very strictly verify checksums as they move data
around so they don't introduce bitrot.

I'm not going to add
	if (!bitrotted_extent) checksum(); else no_checksum()
Eww...


Besides being gross, we also would like to guard against introducing
more bitrot.

> But now you re-checksum the file, with the read error, and return it?
> I'm stupid and just a user/IT guy.  I want notifications, but I don't
> want my application to block so I can't kill it, or unmount the
> filesystem.  Or continue to use it if I like.  

The aforementioned poison bit ensures that you still get the error from
the original checksum error when you read that data - unless you use an
appropriate "give it to me anyways" API.

> > So that means:
> 
> > - Extents need a poison bit: "reads return errors, even though it now
> >   has a good checksum" - this was added in a separate patch queued up
> >   for 6.15.
> 
> Sorry for being dense, but does a file have one or more extents?  Or
> is this at a level below that?  

Files have multiple extents.

An extent is one contiguous range of data, and in bcachefs checksums are
at the extent level, not block, so checksummed (and compressed) extents
are limited to, by default, 128k.

> >   It's an incompat feature because it's a new extent field, and old
> >   versions can't parse extents with unknown field types, since they
> >   won't know their sizes - meaning users will have to explicitly do an
> >   incompat upgrade to make use of this stuff.
> 
> > - The read path needs to do additional retries after checksum errors
> >   before giving up and marking it poisoned, so that we don't
> >   accidentally convert a transient error to permanent corruption.
> 
> When doing these retries, is the filesystem locked up or will the
> process doing the read() be blocked from being killed?  

The process doing the read() can't be killed during this, no. If
requested this could be changed, but keep in mind retries are limited in
number.

Nothing else is "locked up", everything else proceeds as normal.

> > - The read path gets a whole bunch of work to plumb precise modern error
> >   codes around, so that e.g. the retry path, the data move path, and the
> >   "mark extent poisoned" path all know exactly what's going on.
> 
> > - Read path is responsible for marking extents poisoned after sufficient
> >   retry attempts (controlled by a new filesystem option)
> 
> > - Data move path is allowed to move extents after a read error, if it's
> >   a checksum error (giving it a new checksum) if it's been poisoned
> >   (i.e. the extent flags feature is enabled).
> 
> So if just a single bit flips, the extent gets moved onto better
> storage, and the file gets re-checksummed.  But what about if more
> bits go bad afterwards?  

The new checksum means they're detected, and if you have replication
enabled they'll be corrected automatically, like any other IO error.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 12:15                     ` Kent Overstreet
  2025-03-17 14:13                       ` Keith Busch
@ 2025-03-18  6:01                       ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-18  6:01 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > > That's the sort of thing that process is supposed to avoid. To be clear,
> > > you and Christoph are the two reasons I've had to harp on process in the
> > > past - everyone else in the kernel has been fine.
> 
> ...
> 
> > In this case you very clearly misunderstood how the FUA bit works on
> > reads (and clearly documented that misunderstanding in the commit log).
> 
> FUA reads are clearly documented in the NVME spec.

They are, and the semantics are as multiple people pointed out very
different from what you claim they are in the commit message.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 18:21                               ` Kent Overstreet
  2025-03-17 19:24                                 ` Keith Busch
@ 2025-03-18  6:11                                 ` Christoph Hellwig
  2025-03-18 21:33                                   ` Kent Overstreet
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-18  6:11 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Martin K. Petersen, Keith Busch, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote:
> Beyond making sure that retries go to the physical media, there's "retry
> level" in the NVME spec which needs to be plumbed, and that one will be
> particularly useful in multi device scenarios. (Crank retry level up
> or down based on whether we can retry from different devices).

The read recovery level is to reduce the amount or intensity of read
retries, not to increase it so that workloads that have multiple sources
for data aren't stalled by the sometimes extremely long read recovery.
You won't really find much hardware that actually implements it.

As a little background: The read recovery level was added as part of the
predictive latency mode and originally tied to the (now heavily deprecated
and never implemented in scale) NVM sets.  Yours truly successfully
argued that they should not be tied to NVM sets and helped to make them
more generic, but the there was basically no uptake of the read recovery
level, with or without NVM sets.

> But we've got to start somewhere, and given that the spec says "bypass
> the cache"

But as people have been telling you multiple times that is not what any
of the specs says.

> - that looks like the place to start. If devices don't
> support the behaviour we want today, then nudging the drive
> manufacturers to support it is infinitely saner than getting a whole
> nother bit plumbed through the NVME standard, especially given that the
> letter of the spec does describe exactly what we want.

That assumes that anyone but you actually agrees with your definition
of sane behavior.  I don't think you've been overly successful on
selling anyone on that idea so far.  Selling the hardware people on it
will be even harder given that the "cache" really usually is not a
cache but an integral part of the data path.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18  1:06                                         ` Kent Overstreet
@ 2025-03-18  6:16                                           ` Christoph Hellwig
  2025-03-18 17:49                                           ` Bart Van Assche
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-18  6:16 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bart Van Assche, Keith Busch, Martin K. Petersen,
	Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Mon, Mar 17, 2025 at 09:06:13PM -0400, Kent Overstreet wrote:
> What bcachefs is doing is entirely in line with the behaviour the
> standard states.

Setting the FUA bit on any READ command (if supported by the device)
is entirely in line with the behaviour the standards.  It's just not
going to do what you hope.  And while you claim that it helps you
with data recovery, you've not shown either a practical example where
it does, or a theory based on hardware architeture how it could.

> It's an amusing state of affairs, but it'd be easily resolved with an
> admin level NVME command to flip a state bit (like the read recovery
> level we were also talking about), and anyways multihoming capable NVME
> devices are an entirely different market from the conventional stuff.

Please joing the NVMe technical working group and write a proposal.
NVMe requires you to clearly state the benefits of the proposal, which
means you have to actually clearly write that down first.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-17 17:32                           ` Keith Busch
@ 2025-03-18  6:19                             ` Christoph Hellwig
  0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-18  6:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Martin K. Petersen, Kent Overstreet, Christoph Hellwig,
	Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 11:32:21AM -0600, Keith Busch wrote:
> I'm not even sure it makes sense to suspect the controller side, but the
> host side is considered reliable? If the media is in fact perfect, and
> if non-FUA read returns data that fails the checksum, wouldn't it be
> just as likely that the transport or host side is the problem?

From the error numbers we've seen it would in fact be much more likely
to be the host or transport side.

> Every
> NVMe SSD I've developered couldn't efficiently fill the PCIe tx-buffer
> directly from media (excluding Optane; RIP), so a read with or without
> FUA would stage the data in the very same controller-side DRAM.

Exactly.  It's an writeback+invalidate+reread operation.  A great
way to slow down the read, but it's not going to help you with
data recovery.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 00/14] better handling of checksum errors/bitrot
  2025-03-18  1:15   ` [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
@ 2025-03-18 14:47     ` John Stoffel
  2025-03-20 17:15       ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: John Stoffel @ 2025-03-18 14:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: John Stoffel, linux-bcachefs, linux-block, Roland Vet,
	linux-fsdevel

>>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:

> On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
>> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
>> 
>> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
>> > we've got an extent that can't be read, due to checksum error/bitrot.
>> 
>> > This took some doing to fix properly, because
>> 
>> > - We don't want to just delete such data (replace it with
>> >   KEY_TYPE_error); we never want to delete anything except when
>> >   explicitly told to by the user, and while we don't yet have an API for
>> >   "read this file even though there's errors, just give me what you
>> >   have" we definitely will in the future.
>> 
>> So will open() just return an error?  How will this look from
>> userspace?  

> Not the open, the read - the typical case is only a single extent goes
> bad; it's like any other IO error.

Good. But then how would an application know we got a checksum error
for data corruption?  Would I have to update all my code to do another
special call when I get a read/write error to see if it was a
corruption issue?  

>> > - Not being able to move data is a non-option: that would block copygc,
>> >   device removal, etc.
>> 
>> > - And, moving said extent requires giving it a new checksum - strictly
>> >   required if the move has to fragment it, teaching the write/move path
>> >   about extents with bad checksums is unpalateable, and anyways we'd
>> >   like to be able to guard against more bitrot, if we can.
>> 
>> Why does it need a new checksum if there are read errors?  What
>> happens if there are more read errors?   If I have a file on a
>> filesystem which is based in spinning rust and I get a single bit
>> flip, I'm super happy you catch it.  

> The data move paths very strictly verify checksums as they move data
> around so they don't introduce bitrot.

Good.  This is something I really liked as an idea in ZFS, happy to
see it coming to bcachefs as well. 

> I'm not going to add
> 	if (!bitrotted_extent) checksum(); else no_checksum()
> Eww...

LOL!

> Besides being gross, we also would like to guard against introducing
> more bitrot.

>> But now you re-checksum the file, with the read error, and return it?
>> I'm stupid and just a user/IT guy.  I want notifications, but I don't
>> want my application to block so I can't kill it, or unmount the
>> filesystem.  Or continue to use it if I like.  

> The aforementioned poison bit ensures that you still get the error from
> the original checksum error when you read that data - unless you use an
> appropriate "give it to me anyways" API.

So this implies that I need to do extra work to A) know I'm on
bcachefs filesystem, B) that I got a read/write error and I need to do
some more checks to see what the error exactly is.  

And if I want to re-write the file I can either copy it to a new name,
but only when I use the new API to say "give me all the data, even if
you have a checksum error".  

>> > So that means:
>> 
>> > - Extents need a poison bit: "reads return errors, even though it now
>> >   has a good checksum" - this was added in a separate patch queued up
>> >   for 6.15.
>> 
>> Sorry for being dense, but does a file have one or more extents?  Or
>> is this at a level below that?  

> Files have multiple extents.

> An extent is one contiguous range of data, and in bcachefs checksums are
> at the extent level, not block, so checksummed (and compressed) extents
> are limited to, by default, 128k.

>> >   It's an incompat feature because it's a new extent field, and old
>> >   versions can't parse extents with unknown field types, since they
>> >   won't know their sizes - meaning users will have to explicitly do an
>> >   incompat upgrade to make use of this stuff.
>> 
>> > - The read path needs to do additional retries after checksum errors
>> >   before giving up and marking it poisoned, so that we don't
>> >   accidentally convert a transient error to permanent corruption.
>> 
>> When doing these retries, is the filesystem locked up or will the
>> process doing the read() be blocked from being killed?  

> The process doing the read() can't be killed during this, no. If
> requested this could be changed, but keep in mind retries are limited in
> number.

How limited?  And in the worse case, if I have 10 or 100 readers of a
file with checksum errors, now I've blocked all those processes for X
amount of time.  Will this info be logged somewhere so a sysadm could
possibly just do an 'rm' on the file to nuke it, or have some means of
forcing a scrub?  

> Nothing else is "locked up", everything else proceeds as normal.

But is the filesystem able to be unmounted when there's a locked up
process?  I'm just thinking in terms of system shutdowns when you have
failing hardware and want to get things closed as cleanly as possible
since you're going to clone the underlying block device onto new media
ASAP in an offline manner.  

>> > - The read path gets a whole bunch of work to plumb precise modern error
>> >   codes around, so that e.g. the retry path, the data move path, and the
>> >   "mark extent poisoned" path all know exactly what's going on.
>> 
>> > - Read path is responsible for marking extents poisoned after sufficient
>> >   retry attempts (controlled by a new filesystem option)
>> 
>> > - Data move path is allowed to move extents after a read error, if it's
>> >   a checksum error (giving it a new checksum) if it's been poisoned
>> >   (i.e. the extent flags feature is enabled).
>> 
>> So if just a single bit flips, the extent gets moved onto better
>> storage, and the file gets re-checksummed.  But what about if more
>> bits go bad afterwards?  

> The new checksum means they're detected, and if you have replication
> enabled they'll be corrected automatically, like any other IO error.

Nice!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18  1:06                                         ` Kent Overstreet
  2025-03-18  6:16                                           ` Christoph Hellwig
@ 2025-03-18 17:49                                           ` Bart Van Assche
  2025-03-18 18:00                                             ` Kent Overstreet
  1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-03-18 17:49 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe,
	linux-bcachefs, linux-block, Linus Torvalds

On 3/17/25 6:06 PM, Kent Overstreet wrote:
> What bcachefs is doing is entirely in line with the behaviour the
> standard states.

I do not agree with the above.

Kent, do you plan to attend the LSF/MM/BPF summit next week? I'm
wondering whether allowing REQ_FUA|REQ_READ would be a good topic
for that summit.

Bart.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18 17:49                                           ` Bart Van Assche
@ 2025-03-18 18:00                                             ` Kent Overstreet
  2025-03-18 18:10                                               ` Keith Busch
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2025-03-18 18:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe,
	linux-bcachefs, linux-block, Linus Torvalds

On Tue, Mar 18, 2025 at 10:49:25AM -0700, Bart Van Assche wrote:
> On 3/17/25 6:06 PM, Kent Overstreet wrote:
> > What bcachefs is doing is entirely in line with the behaviour the
> > standard states.
> 
> I do not agree with the above.
> 
> Kent, do you plan to attend the LSF/MM/BPF summit next week? I'm
> wondering whether allowing REQ_FUA|REQ_READ would be a good topic
> for that summit.

No, I won't be there this year. And I don't think it'd be the right
forum for arguing over the meaning of an obscure line in the NVME spec,
anyways :)

It's certainly not in dispute that read fua is a documented, legitimate
command, so there's no reason for the block layer to be rejecting it.

Whether it has exactly the behaviour we want isn't a critical issue that
has to be determined right now. The starting point for that will be to
test device behaviour (with some simple performance tests, like I
mentioned), and anyways it's outside the scope of the block layer.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18 18:00                                             ` Kent Overstreet
@ 2025-03-18 18:10                                               ` Keith Busch
  2025-03-18 18:13                                                 ` Kent Overstreet
  2025-03-20  5:40                                                 ` Christoph Hellwig
  0 siblings, 2 replies; 63+ messages in thread
From: Keith Busch @ 2025-03-18 18:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bart Van Assche, Martin K. Petersen, Christoph Hellwig,
	Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Tue, Mar 18, 2025 at 02:00:13PM -0400, Kent Overstreet wrote:
> It's certainly not in dispute that read fua is a documented, legitimate
> command, so there's no reason for the block layer to be rejecting it.
> 
> Whether it has exactly the behaviour we want isn't a critical issue that
> has to be determined right now. The starting point for that will be to
> test device behaviour (with some simple performance tests, like I
> mentioned), and anyways it's outside the scope of the block layer.

Maybe just change the commit log. Read FUA has legit uses for persisting
data as described by the specs. No need to introduce contested behavior
to justify this patch, yah?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18 18:10                                               ` Keith Busch
@ 2025-03-18 18:13                                                 ` Kent Overstreet
  2025-03-20  5:40                                                 ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-18 18:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Martin K. Petersen, Christoph Hellwig,
	Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote:
> On Tue, Mar 18, 2025 at 02:00:13PM -0400, Kent Overstreet wrote:
> > It's certainly not in dispute that read fua is a documented, legitimate
> > command, so there's no reason for the block layer to be rejecting it.
> > 
> > Whether it has exactly the behaviour we want isn't a critical issue that
> > has to be determined right now. The starting point for that will be to
> > test device behaviour (with some simple performance tests, like I
> > mentioned), and anyways it's outside the scope of the block layer.
> 
> Maybe just change the commit log. Read FUA has legit uses for persisting
> data as described by the specs. No need to introduce contested behavior
> to justify this patch, yah?

That's reasonable, I'll do that.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-17 21:48     ` Malte Schröder
  2025-03-17 23:10       ` John Stoffel
@ 2025-03-18 21:04       ` John Stoffel
  2025-03-18 21:32         ` Malte Schröder
  1 sibling, 1 reply; 63+ messages in thread
From: John Stoffel @ 2025-03-18 21:04 UTC (permalink / raw)
  To: Malte Schröder; +Cc: John Stoffel, linux-bcachefs

>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:

> Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
> someone pulled it of by forcing some libs from Trixie, but that caused
> other things to break.

So what is the best Linux distro to use to compile bcachefs-tools?
And can someone upgrade the INSTALL.md doc and the
https://bcachefs.org website to talk about what is the best
development environment now to use as a baseline?  And how to
cross-compile the bcachefs-tools for use on older distros with newer
kernels?  

I'm sure more people would test and play with bcachefs if there was
more docs for ends users to work from.

John

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-18 21:04       ` John Stoffel
@ 2025-03-18 21:32         ` Malte Schröder
  2025-03-19 14:16           ` John Stoffel
  0 siblings, 1 reply; 63+ messages in thread
From: Malte Schröder @ 2025-03-18 21:32 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-bcachefs

On 18/03/2025 22:04, John Stoffel wrote:
>>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:
>> Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
>> someone pulled it of by forcing some libs from Trixie, but that caused
>> other things to break.
> So what is the best Linux distro to use to compile bcachefs-tools?
> And can someone upgrade the INSTALL.md doc and the
> https://bcachefs.org website to talk about what is the best
> development environment now to use as a baseline?  And how to
> cross-compile the bcachefs-tools for use on older distros with newer
> kernels?  
>
> I'm sure more people would test and play with bcachefs if there was
> more docs for ends users to work from.
>
> John

I'd recommend you join #bcache at OFTC. This kind of things has been and
still is discussed there and people have found solutions that worked for
them regarding -tools vs Bookworm. My personal "solution" was to upgrade
to Trixie.

/Malte


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18  6:11                                 ` Christoph Hellwig
@ 2025-03-18 21:33                                   ` Kent Overstreet
  0 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-18 21:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Keith Busch, Jens Axboe, linux-bcachefs,
	linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 11:11:10PM -0700, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote:
> > Beyond making sure that retries go to the physical media, there's "retry
> > level" in the NVME spec which needs to be plumbed, and that one will be
> > particularly useful in multi device scenarios. (Crank retry level up
> > or down based on whether we can retry from different devices).
> 
> The read recovery level is to reduce the amount or intensity of read
> retries, not to increase it so that workloads that have multiple sources
> for data aren't stalled by the sometimes extremely long read recovery.
> You won't really find much hardware that actually implements it.
> 
> As a little background: The read recovery level was added as part of the
> predictive latency mode and originally tied to the (now heavily deprecated
> and never implemented in scale) NVM sets.  Yours truly successfully
> argued that they should not be tied to NVM sets and helped to make them
> more generic, but the there was basically no uptake of the read recovery
> level, with or without NVM sets.

Well, if it can't be set per IO, that makes it fairly useless.

If it _was_ per IO it'd be dead easy to slot into bcachefs, the tracking
of IO/checksum errors in a replicated/erasure coded extent is
sophisticated enough to easily accommodate things like this (mainly you
need to know when submitting - do we have additional retries? and then
when you get an error, you don't want count it if it was "fast fail").

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-18 21:32         ` Malte Schröder
@ 2025-03-19 14:16           ` John Stoffel
  2025-03-24 15:25             ` Krzysztof Hajdamowicz
  0 siblings, 1 reply; 63+ messages in thread
From: John Stoffel @ 2025-03-19 14:16 UTC (permalink / raw)
  To: Malte Schröder; +Cc: John Stoffel, linux-bcachefs

>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:

> On 18/03/2025 22:04, John Stoffel wrote:
>>>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:
>>> Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
>>> someone pulled it of by forcing some libs from Trixie, but that caused
>>> other things to break.
>> So what is the best Linux distro to use to compile bcachefs-tools?
>> And can someone upgrade the INSTALL.md doc and the
>> https://bcachefs.org website to talk about what is the best
>> development environment now to use as a baseline?  And how to
>> cross-compile the bcachefs-tools for use on older distros with newer
>> kernels?  
>> 
>> I'm sure more people would test and play with bcachefs if there was
>> more docs for ends users to work from.
>> 
>> John

> I'd recommend you join #bcache at OFTC. This kind of things has been
> and still is discussed there and people have found solutions that
> worked for them regarding -tools vs Bookworm. My personal "solution"
> was to upgrade to Trixie.

I'm old, but for some reason I've never done IRC.  :-)  Is there a
base OS (Fedora?) that people use for development which I can use to
cross build a package for Debian?  





^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-18 18:10                                               ` Keith Busch
  2025-03-18 18:13                                                 ` Kent Overstreet
@ 2025-03-20  5:40                                                 ` Christoph Hellwig
  2025-03-20 10:28                                                   ` Kent Overstreet
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2025-03-20  5:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kent Overstreet, Bart Van Assche, Martin K. Petersen,
	Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block,
	Linus Torvalds

On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote:
> Maybe just change the commit log. Read FUA has legit uses for persisting
> data as described by the specs. No need to introduce contested behavior
> to justify this patch, yah?

While not having a factually incorrect commit message is a great
start I still don't think we want it.  For one there is no actual
use case for the actual semantics, so why add it?  Second it still
needs all the proper per-driver opt-in as these sematncis are not
defined for all out protocols as I've already mentioned before.

But hey, maybe Kent can actually find other storage or file system
developers to support it, so having an at least technically correct
patch out on the list would be a big start, even if I would not expect
to Jens to take it in a whim.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
  2025-03-20  5:40                                                 ` Christoph Hellwig
@ 2025-03-20 10:28                                                   ` Kent Overstreet
  0 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-20 10:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Bart Van Assche, Martin K. Petersen, Jens Axboe,
	linux-bcachefs, linux-block, Linus Torvalds

On Wed, Mar 19, 2025 at 10:40:53PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote:
> > Maybe just change the commit log. Read FUA has legit uses for persisting
> > data as described by the specs. No need to introduce contested behavior
> > to justify this patch, yah?
> 
> While not having a factually incorrect commit message is a great
> start I still don't think we want it.  For one there is no actual
> use case for the actual semantics, so why add it?  Second it still
> needs all the proper per-driver opt-in as these sematncis are not
> defined for all out protocols as I've already mentioned before.
> 
> But hey, maybe Kent can actually find other storage or file system
> developers to support it, so having an at least technically correct
> patch out on the list would be a big start, even if I would not expect
> to Jens to take it in a whim.

Chistoph,

You're arguing over nothing. Go back and reread, you and I have the same
interpretation of what read fua should do.

You all really are a surly and argumentative lot...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 00/14] better handling of checksum errors/bitrot
  2025-03-18 14:47     ` John Stoffel
@ 2025-03-20 17:15       ` Kent Overstreet
  0 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2025-03-20 17:15 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel

On Tue, Mar 18, 2025 at 10:47:51AM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
> 
> > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
> >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
> >> 
> >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> >> > we've got an extent that can't be read, due to checksum error/bitrot.
> >> 
> >> > This took some doing to fix properly, because
> >> 
> >> > - We don't want to just delete such data (replace it with
> >> >   KEY_TYPE_error); we never want to delete anything except when
> >> >   explicitly told to by the user, and while we don't yet have an API for
> >> >   "read this file even though there's errors, just give me what you
> >> >   have" we definitely will in the future.
> >> 
> >> So will open() just return an error?  How will this look from
> >> userspace?  
> 
> > Not the open, the read - the typical case is only a single extent goes
> > bad; it's like any other IO error.
> 
> Good. But then how would an application know we got a checksum error
> for data corruption?  Would I have to update all my code to do another
> special call when I get a read/write error to see if it was a
> corruption issue?  

We can only return -EIO via the normal IO paths.

> 
> >> > - Not being able to move data is a non-option: that would block copygc,
> >> >   device removal, etc.
> >> 
> >> > - And, moving said extent requires giving it a new checksum - strictly
> >> >   required if the move has to fragment it, teaching the write/move path
> >> >   about extents with bad checksums is unpalateable, and anyways we'd
> >> >   like to be able to guard against more bitrot, if we can.
> >> 
> >> Why does it need a new checksum if there are read errors?  What
> >> happens if there are more read errors?   If I have a file on a
> >> filesystem which is based in spinning rust and I get a single bit
> >> flip, I'm super happy you catch it.  
> 
> > The data move paths very strictly verify checksums as they move data
> > around so they don't introduce bitrot.
> 
> Good.  This is something I really liked as an idea in ZFS, happy to
> see it coming to bcachefs as well. 

Coming? It's been that way since long before bcachefs was merged.

> > The aforementioned poison bit ensures that you still get the error from
> > the original checksum error when you read that data - unless you use an
> > appropriate "give it to me anyways" API.
> 
> So this implies that I need to do extra work to A) know I'm on
> bcachefs filesystem, B) that I got a read/write error and I need to do
> some more checks to see what the error exactly is.  
> 
> And if I want to re-write the file I can either copy it to a new name,
> but only when I use the new API to say "give me all the data, even if
> you have a checksum error".  

This is only if you want an API for "give me possibly corrupted data".

That's pretty niche.

> > The process doing the read() can't be killed during this, no. If
> > requested this could be changed, but keep in mind retries are limited in
> > number.
> 
> How limited?  And in the worse case, if I have 10 or 100 readers of a
> file with checksum errors, now I've blocked all those processes for X
> amount of time.  Will this info be logged somewhere so a sysadm could
> possibly just do an 'rm' on the file to nuke it, or have some means of
> forcing a scrub?  

The poison bit means we won't attempt additional retries on an extent,
once we've reached the (configurable) limit. Future IOs will just return
an error without doing the actual read, since we already know it's bad.

So no, this won't bog down your system.

> > Nothing else is "locked up", everything else proceeds as normal.
> 
> But is the filesystem able to be unmounted when there's a locked up
> process?  I'm just thinking in terms of system shutdowns when you have
> failing hardware and want to get things closed as cleanly as possible
> since you're going to clone the underlying block device onto new media
> ASAP in an offline manner.  

The number of retries defaults to 3, and there's no real reason to make
it more than 5, so this isn't a real change over the current situatino
where drives sometimes take forever on a read as they're going out.

Eventually we could add a configurable hard limit on the amount of time
we spend trying to service an individual read, but that's niche so
further down the road.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-19 14:16           ` John Stoffel
@ 2025-03-24 15:25             ` Krzysztof Hajdamowicz
  2025-03-26 13:45               ` John Stoffel
  0 siblings, 1 reply; 63+ messages in thread
From: Krzysztof Hajdamowicz @ 2025-03-24 15:25 UTC (permalink / raw)
  To: John Stoffel, Malte Schröder; +Cc: linux-bcachefs

W dniu 19.03.2025 o 15:16, John Stoffel pisze:
>>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:
>> On 18/03/2025 22:04, John Stoffel wrote:
>>>>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:
>>>> Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
>>>> someone pulled it of by forcing some libs from Trixie, but that caused
>>>> other things to break.

I am one of the people that butchered bcachefs-tools to work on my 
Bookworm/Proxmox 8 machine.

The recent version of bcachefs-tools require liburcu in version 0.14 and 
higher.
Debian Bookworm uses 0.13, while Debian Trixie has 0.14.
I asked debian-backports team if they could backport it [1].
In response, I was informed that newer package uses 64bit time_t;
This makes it incompatible with Bookworm and a patch would be needed to 
address that.

As my risk factor for this machine is higher than usual, I've created a 
setup with:

  * LXC container initially set up with Bookworm, containing all build
    dependencies.
  * I recompiled liburcu8t64 from Trixie to Bookworm
  * The package was then force-installed - rendering dpkg package
    database with conflicting dependencies.
    apt-cache rdepends shows that xfs and glusterfs depends on working
    liburcu.
    Since I do not use either of them, this is something I can live with.
  * Then I built bcachefs-tools
  * The resulting bcachefs binary and liburcu8t64 packages were copied
    to the target system.
  * Finally, I force-installed them, leaving the dpkg database with
    broken dependencies.

1: https://lists.debian.org/debian-backports/2025/01/msg00003.html
>>> So what is the best Linux distro to use to compile bcachefs-tools?

There is none until you transfer compiled binary with its dependencies.
# ldd bcachefs-tools/bcachefs
     linux-vdso.so.1 (0x0000749f55207000)
     liburcu.so.8 => /lib/x86_64-linux-gnu/liburcu.so.8 (0x0000749f54db5000)
     libblkid.so.1 => /lib/x86_64-linux-gnu/libblkid.so.1 
(0x0000749f54d5e000)
     libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x0000749f54d54000)
     libsodium.so.23 => /lib/x86_64-linux-gnu/libsodium.so.23 
(0x0000749f54cfa000)
     libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x0000749f54cdb000)
     liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x0000749f54cb3000)
     libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x0000749f54bf7000)
     libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x0000749f54bc9000)
     libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 
(0x0000749f54bc2000)
     libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x0000749f54bbd000)
     libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x0000749f54b9d000)
     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000749f549ba000)
     /lib64/ld-linux-x86-64.so.2 (0x0000749f55209000)
     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x0000749f549b5000)

>> I'd recommend you join #bcache at OFTC. This kind of things has been
>> and still is discussed there and people have found solutions that
>> worked for them regarding -tools vs Bookworm. My personal "solution"
>> was to upgrade to Trixie.
> Is there a base OS (Fedora?) that people use for development

As far as I know, most of bcachefs development happens on Nix.
> which I can use to cross build a package for Debian?
I would suggest upgrading to Trixie. I can't as Proxmox is not 
compatible with Trixie yet.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: errors compiling bcachefs-tools v1.20.0 on debian 12
  2025-03-24 15:25             ` Krzysztof Hajdamowicz
@ 2025-03-26 13:45               ` John Stoffel
  0 siblings, 0 replies; 63+ messages in thread
From: John Stoffel @ 2025-03-26 13:45 UTC (permalink / raw)
  To: Krzysztof Hajdamowicz; +Cc: John Stoffel, Malte Schröder, linux-bcachefs

>>>>> "Krzysztof" == Krzysztof Hajdamowicz <krzysztof@hajdamowicz.info> writes:

> W dniu 19.03.2025 o 15:16, John Stoffel pisze:
>>>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:
>>> On 18/03/2025 22:04, John Stoffel wrote:
>>>>>>>>> "Malte" == Malte Schröder <malte.schroeder@tnxip.de> writes:
>>>>> Libs in Debian 12/Bookworm are too old for bcachefs-tools. I think
>>>>> someone pulled it of by forcing some libs from Trixie, but that caused
>>>>> other things to break.

> I am one of the people that butchered bcachefs-tools to work on my 
> Bookworm/Proxmox 8 machine.

> The recent version of bcachefs-tools require liburcu in version 0.14 and 
> higher.
> Debian Bookworm uses 0.13, while Debian Trixie has 0.14.

> I asked debian-backports team if they could backport it [1].  In
> response, I was informed that newer package uses 64bit time_t; This
> makes it incompatible with Bookworm and a patch would be needed to
> address that.

I was able to self-compile liburcu without any problems and I've got
bcachefs-tools working on Debian 12.  So I think the solution might be
to just make liburcu a sub-project for bcachefs-tools if the system
provided library isn't new enough.  

> As my risk factor for this machine is higher than usual, I've created a 
> setup with:

>   * LXC container initially set up with Bookworm, containing all build
>     dependencies.
>   * I recompiled liburcu8t64 from Trixie to Bookworm
>   * The package was then force-installed - rendering dpkg package
>     database with conflicting dependencies.
>     apt-cache rdepends shows that xfs and glusterfs depends on working
>     liburcu.
>     Since I do not use either of them, this is something I can live with.

I just installed liburcu into /usr/local/lib and setup my
LD_LIBRARY_PATH to include it as a quick hack.  I really need to try
and tweak the bcachefs-tools Makefile to hardcode that path into the
binary. 

>   * Then I built bcachefs-tools
>   * The resulting bcachefs binary and liburcu8t64 packages were copied
>     to the target system.
>   * Finally, I force-installed them, leaving the dpkg database with
>     broken dependencies.

> 1: https://lists.debian.org/debian-backports/2025/01/msg00003.html
>>>> So what is the best Linux distro to use to compile bcachefs-tools?

> There is none until you transfer compiled binary with its dependencies.
> # ldd bcachefs-tools/bcachefs
>      linux-vdso.so.1 (0x0000749f55207000)
>      liburcu.so.8 => /lib/x86_64-linux-gnu/liburcu.so.8 (0x0000749f54db5000)
>      libblkid.so.1 => /lib/x86_64-linux-gnu/libblkid.so.1 
> (0x0000749f54d5e000)
>      libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x0000749f54d54000)
>      libsodium.so.23 => /lib/x86_64-linux-gnu/libsodium.so.23 
> (0x0000749f54cfa000)
>      libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x0000749f54cdb000)
>      liblz4.so.1 => /lib/x86_64-linux-gnu/liblz4.so.1 (0x0000749f54cb3000)
>      libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x0000749f54bf7000)
>      libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x0000749f54bc9000)
>      libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 
> (0x0000749f54bc2000)
>      libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x0000749f54bbd000)
>      libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
> (0x0000749f54b9d000)
>      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000749f549ba000)
>      /lib64/ld-linux-x86-64.so.2 (0x0000749f55209000)
>      libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x0000749f549b5000)

>>> I'd recommend you join #bcache at OFTC. This kind of things has been
>>> and still is discussed there and people have found solutions that
>>> worked for them regarding -tools vs Bookworm. My personal "solution"
>>> was to upgrade to Trixie.
>> Is there a base OS (Fedora?) that people use for development

> As far as I know, most of bcachefs development happens on Nix.
>> which I can use to cross build a package for Debian?
> I would suggest upgrading to Trixie. I can't as Proxmox is not 
> compatible with Trixie yet.


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2025-03-26 13:45 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
2025-03-11 20:15 ` [PATCH 01/14] bcachefs: Convert read path to standard error codes Kent Overstreet
2025-03-11 20:15 ` [PATCH 02/14] bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path Kent Overstreet
2025-03-11 20:15 ` [PATCH 03/14] bcachefs: Read error message now indicates if it was for an internal move Kent Overstreet
2025-03-11 20:15 ` [PATCH 04/14] bcachefs: BCH_ERR_data_read_buffer_too_small Kent Overstreet
2025-03-11 20:15 ` [PATCH 05/14] bcachefs: Return errors to top level bch2_rbio_retry() Kent Overstreet
2025-03-11 20:15 ` [PATCH 06/14] bcachefs: Print message on successful read retry Kent Overstreet
2025-03-11 20:15 ` [PATCH 07/14] bcachefs: Don't create bch_io_failures unless it's needed Kent Overstreet
2025-03-11 20:15 ` [PATCH 08/14] bcachefs: Checksum errors get additional retries Kent Overstreet
2025-03-11 20:15 ` [PATCH 09/14] bcachefs: __bch2_read() now takes a btree_trans Kent Overstreet
2025-03-11 20:15 ` [PATCH 10/14] bcachefs: Poison extents that can't be read due to checksum errors Kent Overstreet
2025-03-11 20:15 ` [PATCH 11/14] bcachefs: Data move can read from poisoned extents Kent Overstreet
2025-03-11 20:15 ` [PATCH 12/14] bcachefs: Debug params for data corruption injection Kent Overstreet
2025-03-11 20:15 ` [PATCH 13/14] block: Allow REQ_FUA|REQ_READ Kent Overstreet
2025-03-15 16:47   ` Jens Axboe
2025-03-15 17:01     ` Kent Overstreet
2025-03-15 17:03       ` Jens Axboe
2025-03-15 17:27         ` Kent Overstreet
2025-03-15 17:43           ` Jens Axboe
2025-03-15 18:07             ` Kent Overstreet
2025-03-15 18:32               ` Jens Axboe
2025-03-15 18:41                 ` Kent Overstreet
2025-03-17  6:00                   ` Christoph Hellwig
2025-03-17 12:15                     ` Kent Overstreet
2025-03-17 14:13                       ` Keith Busch
2025-03-17 14:49                         ` Kent Overstreet
2025-03-17 15:15                           ` Keith Busch
2025-03-17 15:22                             ` Kent Overstreet
2025-03-17 15:30                         ` Martin K. Petersen
2025-03-17 15:43                           ` Kent Overstreet
2025-03-17 17:57                             ` Martin K. Petersen
2025-03-17 18:21                               ` Kent Overstreet
2025-03-17 19:24                                 ` Keith Busch
2025-03-17 19:40                                   ` Kent Overstreet
2025-03-17 20:39                                     ` Keith Busch
2025-03-17 21:13                                       ` Bart Van Assche
2025-03-18  1:06                                         ` Kent Overstreet
2025-03-18  6:16                                           ` Christoph Hellwig
2025-03-18 17:49                                           ` Bart Van Assche
2025-03-18 18:00                                             ` Kent Overstreet
2025-03-18 18:10                                               ` Keith Busch
2025-03-18 18:13                                                 ` Kent Overstreet
2025-03-20  5:40                                                 ` Christoph Hellwig
2025-03-20 10:28                                                   ` Kent Overstreet
2025-03-18  0:27                                       ` Kent Overstreet
2025-03-18  6:11                                 ` Christoph Hellwig
2025-03-18 21:33                                   ` Kent Overstreet
2025-03-17 17:32                           ` Keith Busch
2025-03-18  6:19                             ` Christoph Hellwig
2025-03-18  6:01                       ` Christoph Hellwig
2025-03-11 20:15 ` [PATCH 14/14] bcachefs: Read retries are after checksum errors now REQ_FUA Kent Overstreet
2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel
2025-03-17 21:12   ` errors compiling bcachefs-tools v1.20.0 on debian 12 John Stoffel
2025-03-17 21:48     ` Malte Schröder
2025-03-17 23:10       ` John Stoffel
2025-03-18 21:04       ` John Stoffel
2025-03-18 21:32         ` Malte Schröder
2025-03-19 14:16           ` John Stoffel
2025-03-24 15:25             ` Krzysztof Hajdamowicz
2025-03-26 13:45               ` John Stoffel
2025-03-18  1:15   ` [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet
2025-03-18 14:47     ` John Stoffel
2025-03-20 17:15       ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox