* [PATCH 00/14] better handling of checksum errors/bitrot
@ 2025-03-11 20:15 Kent Overstreet
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
To: linux-bcachefs, linux-block; +Cc: Kent Overstreet, Roland Vet, linux-fsdevel
Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
we've got an extent that can't be read due to a checksum error or bitrot.
This took some doing to fix properly, because:
- We don't want to just delete such data (replace it with
  KEY_TYPE_error); we never want to delete anything except when
  explicitly told to by the user, and while we don't yet have an API for
  "read this file even though there are errors, just give me what you
  have", we definitely will in the future.
- Not being able to move data is a non-option: that would block copygc,
device removal, etc.
- And, moving said extent requires giving it a new checksum - strictly
  required if the move has to fragment it, teaching the write/move path
  about extents with bad checksums is unpalatable, and anyway we'd
  like to be able to guard against more bitrot, if we can.
So that means:
- Extents need a poison bit: "reads return errors, even though it now
has a good checksum" - this was added in a separate patch queued up
for 6.15.
It's an incompat feature because it's a new extent field, and old
versions can't parse extents with unknown field types, since they
won't know their sizes - meaning users will have to explicitly do an
incompat upgrade to make use of this stuff.
- The read path needs to do additional retries after checksum errors
before giving up and marking it poisoned, so that we don't
accidentally convert a transient error to permanent corruption.
- The read path gets a whole bunch of work to plumb precise modern error
codes around, so that e.g. the retry path, the data move path, and the
"mark extent poisoned" path all know exactly what's going on.
- Read path is responsible for marking extents poisoned after sufficient
  retry attempts (controlled by a new filesystem option) - see the
  sketch after this list.
- Data move path is allowed to move extents after a read error if it's
  a checksum error (giving them a new checksum) and the extent has been
  poisoned (i.e. the extent flags feature is enabled).
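
For concreteness, here's a minimal sketch of the retry-then-poison flow
the items above describe. Illustrative only: MAX_CSUM_RETRIES,
submit_read() and the poisoned field are hypothetical stand-ins, not the
actual bcachefs option, helper, or extent field:

  #include <stdbool.h>

  #define MAX_CSUM_RETRIES 3	/* stand-in for the new filesystem option */

  enum read_status { READ_OK, READ_CSUM_ERR, READ_POISONED };

  struct extent {
  	bool	poisoned;	/* stand-in for the new extent poison bit */
  };

  /* hypothetical I/O helper; retries pass fua=true so the re-read comes
   * from media rather than the controller cache */
  enum read_status submit_read(struct extent *e, bool fua);

  enum read_status read_extent(struct extent *e)
  {
  	if (e->poisoned)		/* poisoned extents have a good */
  		return READ_POISONED;	/* checksum but still return errors */

  	for (unsigned i = 0; i < MAX_CSUM_RETRIES; i++)
  		if (submit_read(e, i > 0) == READ_OK)
  			return READ_OK;	/* transient: retry succeeded */

  	e->poisoned = true;		/* only now is the error permanent */
  	return READ_CSUM_ERR;
  }

The key property: an extent is never marked poisoned until the retries
(which bypass the controller cache) have also failed, so a transient
error can't be converted into permanent corruption.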
Code should be more or less finalized - still have more tests for corner
cases to write, but "write corrupt data and then tell rebalance to move
it to another device" works as expected.
TODO:
- NVMe has a "read recovery level" attribute that controls how hard the
  erasure coding algorithms work - we want that plumbed.
Before we give up and move data that we know is bad, we need to try
_as hard as possible_ to get a successful read.
Code currently lives in
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-testing
Kent Overstreet (14):
bcachefs: Convert read path to standard error codes
bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
bcachefs: Read error message now indicates if it was for an internal
move
bcachefs: BCH_ERR_data_read_buffer_too_small
bcachefs: Return errors to top level bch2_rbio_retry()
bcachefs: Print message on successful read retry
bcachefs: Don't create bch_io_failures unless it's needed
bcachefs: Checksum errors get additional retries
bcachefs: __bch2_read() now takes a btree_trans
bcachefs: Poison extents that can't be read due to checksum errors
bcachefs: Data move can read from poisoned extents
bcachefs: Debug params for data corruption injection
block: Allow REQ_FUA|REQ_READ
bcachefs: Read retries after checksum errors are now REQ_FUA
block/blk-core.c | 19 +-
fs/bcachefs/bcachefs_format.h | 2 +
fs/bcachefs/btree_io.c | 2 +-
fs/bcachefs/errcode.h | 19 +-
fs/bcachefs/extents.c | 157 +++++++++-------
fs/bcachefs/extents.h | 7 +-
fs/bcachefs/extents_types.h | 11 +-
fs/bcachefs/io_read.c | 325 +++++++++++++++++++++++++---------
fs/bcachefs/io_read.h | 21 +--
fs/bcachefs/io_write.c | 24 +++
fs/bcachefs/move.c | 26 ++-
fs/bcachefs/opts.h | 5 +
fs/bcachefs/super-io.c | 4 +
fs/bcachefs/util.c | 21 +++
fs/bcachefs/util.h | 12 ++
15 files changed, 473 insertions(+), 182 deletions(-)
--
2.47.2
* [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet, Jens Axboe, linux-block

REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
the same as writes.

This is useful for when the filesystem gets a checksum error, it's
possible that a bit was flipped in the controller cache, and when we
retry we want to retry the entire IO, not just from cache.

The nvme driver already passes through REQ_FUA for reads, not just
writes, so disabling the warning is sufficient to start using it, and
bcachefs is implementing additional retries for checksum errors so can
immediately use it.

Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/blk-core.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d6c4fa3943b5..7b1103eb877d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -793,20 +793,21 @@ void submit_bio_noacct(struct bio *bio)
 		goto end_io;
 	}
 
+	if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) &&
+			 bio_op(bio) != REQ_OP_WRITE &&
+			 bio_op(bio) != REQ_OP_ZONE_APPEND))
+		goto end_io;
+
 	/*
 	 * Filter flush bio's early so that bio based drivers without flush
 	 * support don't have to worry about them.
 	 */
-	if (op_is_flush(bio->bi_opf)) {
-		if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
-				 bio_op(bio) != REQ_OP_ZONE_APPEND))
+	if (op_is_flush(bio->bi_opf) &&
+	    !bdev_write_cache(bdev)) {
+		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
+		if (!bio_sectors(bio)) {
+			status = BLK_STS_OK;
 			goto end_io;
-		if (!bdev_write_cache(bdev)) {
-			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
-			if (!bio_sectors(bio)) {
-				status = BLK_STS_OK;
-				goto end_io;
-			}
 		}
 	}
-- 
2.47.2

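For readers following the diff: op_is_flush() treats REQ_FUA and
REQ_PREFLUSH identically, which is why the old code's WARN_ON_ONCE
fired on any non-write bio carrying either flag. The helper is defined
in include/linux/blk_types.h roughly as follows (shown for context; if
the definition has moved or changed in your tree, treat this as
illustrative):

  static inline bool op_is_flush(blk_opf_t op)
  {
  	return op & (REQ_FUA | REQ_PREFLUSH);
  }

The patch narrows the warning to REQ_PREFLUSH alone, so a
REQ_OP_READ | REQ_FUA bio now passes through; the flag-stripping path
for devices without a write cache is otherwise unchanged.
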
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 16:47 UTC (permalink / raw)
To: Kent Overstreet, linux-bcachefs; +Cc: linux-block

On 3/11/25 2:15 PM, Kent Overstreet wrote:
> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> the same as writes.
>
> This is useful for when the filesystem gets a checksum error, it's
> possible that a bit was flipped in the controller cache, and when we
> retry we want to retry the entire IO, not just from cache.
>
> The nvme driver already passes through REQ_FUA for reads, not just
> writes, so disabling the warning is sufficient to start using it, and
> bcachefs is implementing additional retries for checksum errors so can
> immediately use it.

This one got effectively nak'ed by various folks here:

> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/

yet it's part of this series and in linux-next? Hmm?

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 17:01 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
> On 3/11/25 2:15 PM, Kent Overstreet wrote:
> > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> > the same as writes.
> > [...]
>
> This one got effectively nak'ed by various folks here:
>
> > Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>
> yet it's part of this series and in linux-next? Hmm?

As I explained in that thread, they were only thinking about the caching
of writes.

That's not what we're concerned about; when we retry a read due to a
checksum error we do not want the previous _read_ cached.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 17:03 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block

On 3/15/25 11:01 AM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
>> This one got effectively nak'ed by various folks here:
>>
>>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>>
>> yet it's part of this series and in linux-next? Hmm?
>
> As I explained in that thread, they were only thinking about the caching
> of writes.
>
> That's not what we're concerned about; when we retry a read due to a
> checksum error we do not want the previous _read_ cached.

Please follow the usual procedure of getting the patch acked/reviewed on
the block list, and go through the usual trees. Until that happens, this
patch should not be in your tree, nor should it be staged in linux-next.

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 17:27 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
> On 3/15/25 11:01 AM, Kent Overstreet wrote:
> > As I explained in that thread, they were only thinking about the caching
> > of writes.
> >
> > That's not what we're concerned about; when we retry a read due to a
> > checksum error we do not want the previous _read_ cached.
>
> Please follow the usual procedure of getting the patch acked/reviewed on
> the block list, and go through the usual trees. Until that happens, this
> patch should not be in your tree, nor should it be staged in linux-next.

It's been posted to linux-block and sent to your inbox. If you're going
to take it that's fine, otherwise - since this is necessary for handling
bitrotted data correctly and I've got users who've been waiting on this
patch series, and it's just deleting a warning, I'm inclined to just
send it.

I'll make sure he's got the lore links and knows what's going on, but
this isn't a great thing to be delaying by citing process.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 17:43 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block

On 3/15/25 11:27 AM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
>> Please follow the usual procedure of getting the patch acked/reviewed on
>> the block list, and go through the usual trees. Until that happens, this
>> patch should not be in your tree, nor should it be staged in linux-next.
>
> It's been posted to linux-block and sent to your inbox. If you're going
> to take it that's fine, otherwise - since this is necessary for handling
> bitrotted data correctly and I've got users who've been waiting on this
> patch series, and it's just deleting a warning, I'm inclined to just
> send it.
>
> I'll make sure he's got the lore links and knows what's going on, but
> this isn't a great thing to be delaying by citing process.

Kent, you know how this works, there's nothing to argue about here.

Let the block people get it reviewed and staged. You can't just post a
patch, ignore most replies, and then go on to stage it yourself later
that day. It didn't work well in other subsystems, and it won't work
well here either.

EOD on my end.

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 18:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 11:43:30AM -0600, Jens Axboe wrote:
> Kent, you know how this works, there's nothing to argue about here.
>
> Let the block people get it reviewed and staged. You can't just post a
> patch, ignore most replies, and then go on to stage it yourself later
> that day. It didn't work well in other subsystems, and it won't work
> well here either.

Jens, those replies weren't ignored, the concerns were answered. Never
have I seen that considered "effectively nacked".

And as far as "how things work around here" - let's not even go there,
lest we have to cover your issues with process.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 18:32 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Linus Torvalds

On 3/15/25 12:07 PM, Kent Overstreet wrote:
> Jens, those replies weren't ignored, the concerns were answered. Never
> have I seen that considered "effectively nacked".

Kent, let me make this really simple for you, since you either still
don't understand how the development process works, or are under some
assumption that it doesn't apply to you - get some of the decades of
storage experience in that first thread to sign off on the patch and use
case. The patch doesn't go upstream before that happens, simple as that.

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 18:41 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block, Linus Torvalds

On Sat, Mar 15, 2025 at 12:32:49PM -0600, Jens Axboe wrote:
> Kent, let me make this really simple for you, since you either still
> don't understand how the development process works, or are under some
> assumption that it doesn't apply to you - get some of the decades of
> storage experience in that first thread to sign off on the patch and
> use case. The patch doesn't go upstream before that happens, simple as
> that.

No, you're the one who's consistently been under the assumption that it
doesn't apply to you.

You've got a history of applying patches to code that I maintain without
even CCing the list, let alone myself, and introducing silent data
corruption bugs into code that I wrote, without CCs, which I then had to
debug, and then putting up absolutely ridiculous fights to get fixed.

That's the sort of thing that process is supposed to avoid. To be clear,
you and Christoph are the two reasons I've had to harp on process in the
past - everyone else in the kernel has been fine.

As I said before, I'm not trying to bypass you without communicating -
but this has gone completely off the rails.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Christoph Hellwig @ 2025-03-17 6:00 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> That's the sort of thing that process is supposed to avoid. To be clear,
> you and Christoph are the two reasons I've had to harp on process in the
> past - everyone else in the kernel has been fine.

Kent, stop those attacks. Maybe your patches are just simply not
correct? Sometimes that might be in the eye of the beholder when it is
about code structure, but in many cases they simply are objectively
broken.

In this case you very clearly misunderstood how the FUA bit works on
reads (and clearly documented that misunderstanding in the commit log).
You've had multiple people explain to you how it is wrong, and the
maintainer reject it for that reason. What other amount of arguments do
you want?

If you need another one: FUA is not just used on NVMe and SCSI, where it
is supported on reads even if not in the way you think, but also for all
kinds of other protocols. Even if we ended up thinking REQ_FUA on reads
is a good idea (and so far no one but you does) you'd need to audit all
these to check if the behavior is valid, and most likely add a
per-driver opt-in. But so far you've not even tried to make the case for
your feature.

> As I said before, I'm not trying to bypass you without communicating -
> but this has gone completely off the rails.

You have a very clear pattern where you assume you're perfect, everyone
else is stupid and thus must be overridden. Most people (including me
recently) have mostly given up arguing with you because it's so
fruitless. Maybe we shouldn't have, because that just seems to reaffirm
you in your personal universe.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 12:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > That's the sort of thing that process is supposed to avoid. To be clear,
> > you and Christoph are the two reasons I've had to harp on process in the
> > past - everyone else in the kernel has been fine.

...

> In this case you very clearly misunderstood how the FUA bit works on
> reads (and clearly documented that misunderstanding in the commit log).

FUA reads are clearly documented in the NVMe spec.

As for the rest, I'm just going to ignore the ranting.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Keith Busch @ 2025-03-17 14:13 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > In this case you very clearly misunderstood how the FUA bit works on
> > reads (and clearly documented that misunderstanding in the commit log).
>
> FUA reads are clearly documented in the NVMe spec.

NVMe's Read with FUA documentation is pretty short:

  Force Unit Access (FUA): If this bit is set to '1', then for data and
  metadata, if any, associated with logical blocks specified by the Read
  command, the controller shall:

  1) commit that data and metadata, if any, to non-volatile medium; and
  2) read the data and metadata, if any, from non-volatile medium.

So this aligns with ATA and SCSI.

The concern about what you're doing is with respect to "1". Avoiding
that requires all writes also set FUA, or you send a flush command
before doing any reads. Otherwise how would you know whether or not read
data came from a previous cached write?

I'm actually more curious how you came to use this mechanism for
recovery. Have you actually seen this approach fix a real problem? In my
experience, a Read FUA is used to ensure what you're reading has been
persistently committed. It can be used as an optimization to flush a
specific LBA range, but I'd not seen this used as a bit error recovery
mechanism.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 14:49 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 08:13:59AM -0600, Keith Busch wrote:
> NVMe's Read with FUA documentation is pretty short:
>
>   Force Unit Access (FUA): If this bit is set to '1', then for data and
>   metadata, if any, associated with logical blocks specified by the Read
>   command, the controller shall:
>
>   1) commit that data and metadata, if any, to non-volatile medium; and
>   2) read the data and metadata, if any, from non-volatile medium.
>
> So this aligns with ATA and SCSI.
>
> The concern about what you're doing is with respect to "1". Avoiding
> that requires all writes also set FUA, or you send a flush command
> before doing any reads. Otherwise how would you know whether or not
> read data came from a previous cached write?

Line 1 is not the relevant concern here.

That would be relevant if we were trying to verify the integrity of
writes end-to-end immediately after they'd been completed - and that
would be a lovely thing to do in the spirit of "true end to end data
integrity", but we're not ready to contemplate such things due to the
obvious performance implications.

The scenario here is: we go to read old data, we get a bit error, and
then we want to retry the read to verify that it wasn't a transient
error before declaring it dead (marking it poisoned).

In this scenario, "dirty data in the writeback cache" is something that
would only ever happen if userspace is doing very funny things with
O_DIRECT, and the drive flushing that data on FUA read is totally fine.
It's an irrelevant corner case.

But we do absolutely require checking for transient read errors, and we
will miss things and corrupt data unnecessarily without FUA reads that
bypass the controller cache.

> I'm actually more curious how you came to use this mechanism for
> recovery. Have you actually seen this approach fix a real problem? In
> my experience, a Read FUA is used to ensure what you're reading has
> been persistently committed. It can be used as an optimization to
> flush a specific LBA range, but I'd not seen this used as a bit error
> recovery mechanism.

It's not data recovery - in the conventional sense.

Typically, on conventional filesystems, this doesn't come up: most
filesystems don't even do data checksumming, and even on a filesystem
with data checksumming it won't automatically come up - because, like
other transient errors, it's generally perfectly acceptable to return
the error and let the upper layer retry later; the filesystem generally
doesn't have to care if the error is transient or not.

Where this becomes a filesystem concern is when those errors start
driving state changes within the filesystem, i.e. when we need to track
and know about them: in that case, we absolutely do need to try hard to
distinguish between transient and permanent errors.

Specifically, bcachefs needs to be able to move data around - even if
it's bitrotted and generating checksum errors. There's a couple of
different reasons we may need to move data:

- rebalance, the task that does arbitrary data processing in the
  background based on user policy (moving it to a slower device,
  compressing it or recompressing it with a more expensive algorithm)
- copygc, which compacts fragmented data in mostly-empty buckets and is
  required for the allocator and the whole system to be able to make
  forward progress
- device evacuate/removal

Rebalance is optional (and still needs bitrotted extents to be tracked
so that it doesn't spin on them, retrying them endlessly), but copygc
and evacuate are not. And the only way for these processes to handle
bitrotted extents (besides throwing them out, which obviously we're not
going to do) is to:

- generate a new checksum
- mark them as poisoned, so that reads to userspace still return the
  appropriate error (unless it's a read done with an "I know there's
  errors, give me what you still have" API).

But the critical thing here is, marking an extent as poisoned will turn
a transient corruption into a permanent error unless we've made damn
sure there is no transient error, with additional FUA retries.

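To make that concrete, here is a minimal kernel-side sketch of what a
cache-bypassing retry could look like. retry_read_fua() is a
hypothetical helper, not the actual bcachefs read path; it only assumes
the block layer accepts REQ_OP_READ | REQ_FUA bios, which is what the
patch in this series enables:

  #include <linux/bio.h>
  #include <linux/blkdev.h>

  /*
   * Hypothetical helper, not the actual bcachefs read path: re-read one
   * range with FUA set, so the retry is serviced from media rather than
   * the controller cache.
   */
  static int retry_read_fua(struct block_device *bdev, sector_t sector,
  			  struct page *page, unsigned len)
  {
  	struct bio *bio;
  	int ret;

  	bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA, GFP_NOIO);
  	bio->bi_iter.bi_sector = sector;
  	__bio_add_page(bio, page, len, 0);

  	/* synchronous submit; caller re-checksums the buffer on success */
  	ret = submit_bio_wait(bio);
  	bio_put(bio);
  	return ret;
  }

On success the caller would re-verify the checksum; only after the
configured number of such retries fail would the extent be marked
poisoned.
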
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Keith Busch @ 2025-03-17 15:15 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 10:49:55AM -0400, Kent Overstreet wrote:
> But we do absolutely require checking for transient read errors, and we
> will miss things and corrupt data unnecessarily without FUA reads that
> bypass the controller cache.

Read FUA doesn't "bypass the controller cache". This is a "flush cache
first, then read the result" operation. That description seems
consistent for NVMe, ATA, and SCSI.

You are only interested in the case where the read doesn't overlap with
any dirty cache, so the "flush" part isn't a factor. Okay fine, but I'm
still curious if you have actually seen a case where read data was
corrupt in cache but correct with FUA?

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 15:22 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 09:15:04AM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 10:49:55AM -0400, Kent Overstreet wrote:
> > But we do absolutely require checking for transient read errors, and
> > we will miss things and corrupt data unnecessarily without FUA reads
> > that bypass the controller cache.
>
> Read FUA doesn't "bypass the controller cache". This is a "flush cache
> first, then read the result" operation. That description seems
> consistent for NVMe, ATA, and SCSI.

Quoting from your previous email,

> 2) read the data and metadata, if any, from non-volatile medium.

That's pretty specific.

> You are only interested in the case where the read doesn't overlap with
> any dirty cache, so the "flush" part isn't a factor. Okay fine, but I'm
> still curious if you have actually seen a case where read data was
> corrupt in cache but correct with FUA?

No.

It's completely typical and expected to code defensively against errors
which we have not observed but know are possible.

Is bit corruption possible in the device cache? You betcha, it's
hardware.

Are we going to have to test for correct implementation of READ|FUA in
the future? Given all the devices that don't even get flushes right,
again yes.

Are we still going to go ahead and do it? Yes, and I'd even say it's a
prerequisite for complaining to manufacturers that don't get this right.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Martin K. Petersen @ 2025-03-17 15:30 UTC (permalink / raw)
To: Keith Busch; +Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

Keith,

> In my experience, a Read FUA is used to ensure what you're reading has
> been persistently committed.

Exactly, that is the intended semantic.

Kent is trying to address a scenario in which data in the cache is
imperfect and the data on media is reliable. SCSI did not consider such
a scenario because the entire access model would be fundamentally broken
if cache contents couldn't be trusted.

FUA was defined to handle the exact opposite scenario: Data in cache is
reliable by definition and the media inherently unreliable.
Consequently, what all the standards describe is a flush-then-read
semantic, not an invalidate-cache-then-read ditto. The purpose of FUA is
to validate endurance of future reads.

SCSI has a different per-command flag for cache management. However, it
is only a hint and therefore does not offer the guarantee Kent is
looking for.

At least for SCSI, given how FUA is usually implemented, I consider it
quite unlikely that two read operations back to back would somehow cause
different data to be transferred, regardless of which flags you use.

Also, and obviously this is also highly implementation-dependent, but I
don't think one should underestimate the amount of checking done along
the entire data path inside the device, including DRAM to the transport
interface ASIC. Data is usually validated by the ASIC on the way out and
from there on all modern transports have some form of checking. Having
corrupted data placed in host memory without an associated error
condition is not a common scenario. Bit flips in host memory are much
more common.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 15:43 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 11:30:13AM -0400, Martin K. Petersen wrote:
> At least for SCSI, given how FUA is usually implemented, I consider it
> quite unlikely that two read operations back to back would somehow
> cause different data to be transferred, regardless of which flags you
> use.

Based on what, exactly?

We're already seeing bit corruption caught - and corrected by scrub,
when users aren't running with replicas=1 (which is fine for data that's
backed up elsewhere, of course). Then it's just a question of where in
the lower layers it comes from.

We _know_ devices are not perfect, and your claim that "it's quite
unlikely that two reads back to back would return different data"
amounts to claiming that there are no bugs in a good chunk of the IO
path and all that is implemented perfectly. But that's just bullshit.

When we do a retry after observing bit corruption, we need to retry the
full read, not just from device cache, and the spec is quite specific on
that point.

Now, if you're claiming that a FUA read will behave differently, please
do state clearly what you think will happen with clean data in the
cache. But that'll amount to a clear spec violation...

> Also, and obviously this is also highly implementation-dependent, but
> I don't think one should underestimate the amount of checking done
> along the entire data path inside the device, including DRAM to the
> transport interface ASIC. Data is usually validated by the ASIC on the
> way out and from there on all modern transports have some form of
> checking. Having corrupted data placed in host memory without an
> associated error condition is not a common scenario. Bit flips in host
> memory are much more common.

Yeah, I'm not in the business of trusting the lower layers like this.
Trust, but verify :)

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Martin K. Petersen @ 2025-03-17 17:57 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

Kent,

>> At least for SCSI, given how FUA is usually implemented, I consider
>> it quite unlikely that two read operations back to back would somehow
>> cause different data to be transferred, regardless of which flags you
>> use.
>
> Based on what, exactly?

Based on the fact that many devices will either blindly flush on FUA or
they'll do the equivalent of a media verify operation. In neither case
will you get different data returned. The emphasis for FUA is on media
durability, not caching.

In most implementations the cache isn't an optional memory buffer thingy
that can be sidestepped. It is the only access mechanism that exists
between the media and the host interface. Working memory if you will. So
bypassing the device cache is not really a good way to think about it.

The purpose of FUA is to ensure durability for future reads, it is a
media management flag. As such, any effect FUA may have on the device
cache is incidental.

For SCSI there is a different flag to specify caching behavior. That
flag is orthogonal to FUA and did not get carried over to NVMe.

> We _know_ devices are not perfect, and your claim that "it's quite
> unlikely that two reads back to back would return different data"
> amounts to claiming that there are no bugs in a good chunk of the IO
> path and all that is implemented perfectly.

I'm not saying that devices are perfect or that the standards make
sense. I'm just saying that your desired behavior does not match the
reality of how a large number of these devices are actually implemented.

The specs are largely written by device vendors and therefore
deliberately ambiguous. Many of the explicit cache management bits and
bobs have been removed from SCSI or are defined as hints because device
vendors don't want the OS to interfere with how they manage resources,
including caching. I get what your objective is. I just don't think FUA
offers sufficient guarantees in that department.

Also, given the amount of hardware checking done at the device level, my
experience tells me that you are way more likely to have undetected
corruption problems on the host side than inside the storage device. In
general storage devices implement very extensive checking on both
control and data paths. And they will return an error if there is a
mismatch (as opposed to returning random data).

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 18:21 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> Based on the fact that many devices will either blindly flush on FUA or
> they'll do the equivalent of a media verify operation. In neither case
> will you get different data returned. The emphasis for FUA is on media
> durability, not caching.
>
> [...]
>
> The specs are largely written by device vendors and therefore
> deliberately ambiguous. Many of the explicit cache management bits and
> bobs have been removed from SCSI or are defined as hints because device
> vendors don't want the OS to interfere with how they manage resources,
> including caching. I get what your objective is. I just don't think FUA
> offers sufficient guarantees in that department.

If you're saying this is going to be a work in progress to get the
behaviour we need in this scenario - yes, absolutely.

Beyond making sure that retries go to the physical media, there's "retry
level" in the NVMe spec which needs to be plumbed, and that one will be
particularly useful in multi-device scenarios (crank retry level up or
down based on whether we can retry from different devices).

But we've got to start somewhere, and given that the spec says "bypass
the cache" - that looks like the place to start. If devices don't
support the behaviour we want today, then nudging the drive
manufacturers to support it is infinitely saner than getting a whole
other bit plumbed through the NVMe standard, especially given that the
letter of the spec does describe exactly what we want.

So: I understand your concerns, but they're out of scope for the moment.
As long as nothing actively breaks when we feed it READ|FUA, it's
totally fine.

Later on I can imagine us adding some basic sanity tests to alert if
READ|FUA isn't behaving as expected; it's pretty easy to test latency of
random reads vs. reads to the same location vs. FUA reads to the same
location. So yes: there will be more work to do in this area.

> Also, given the amount of hardware checking done at the device level,
> my experience tells me that you are way more likely to have undetected
> corruption problems on the host side than inside the storage device.
> In general storage devices implement very extensive checking on both
> control and data paths. And they will return an error if there is a
> mismatch (as opposed to returning random data).

Bugs host side are very much a concern, yes (we can only do so much
against memory stompers, and the RMW that buffered IO does is a giant
hole), but at the filesystem level there are techniques for avoiding
corruption that are not available at the block level. The main one
being: every pointer carries the checksum that validates the data it
points to - the ZFS model.

So we do that, and additionally we're _very_ careful when moving data
around to carry existing checksums with it, and to validate the old
checksum after generating the new (or sum up new checksums to validate
them against the old directly).

IOW - in my world, it's the filesystem's job to verify and authenticate
everything. And layers above bcachefs do their own verification and
authentication as well - NixOS package builders and git being the big
ones that have caught bugs for us. It all matters...

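As an illustration of that move-time discipline - a minimal sketch only,
with hypothetical names (csum(), struct extent_ptr), not bcachefs's
actual types or API - the point is that the old checksum is re-verified
after the new one is computed, so corruption of the buffer in between is
caught before the new pointer is committed:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* csum(): assumed checksum primitive; type selects the algorithm,
   * since a move may change checksum type (e.g. on recompression) */
  uint32_t csum(int type, const void *data, size_t len);

  struct extent_ptr {		/* hypothetical, not the real bcachefs type */
  	int		csum_type;
  	uint32_t	csum;
  };

  /*
   * Move data while carrying the existing checksum: compute the new
   * checksum first, then re-verify the buffer against the old one, so
   * a memory stomp between the two steps can't be silently laundered
   * into a "valid" new checksum.
   */
  bool move_extent_csum(struct extent_ptr *dst, const struct extent_ptr *src,
  		      const void *data, size_t len)
  {
  	uint32_t new = csum(dst->csum_type, data, len);

  	if (csum(src->csum_type, data, len) != src->csum)
  		return false;	/* buffer changed under us: abort the move */

  	dst->csum = new;	/* safe to commit the new pointer */
  	return true;
  }
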
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Keith Busch @ 2025-03-17 19:24 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote:
> On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> > I'm not saying that devices are perfect or that the standards make
> > sense. I'm just saying that your desired behavior does not match the
> > reality of how a large number of these devices are actually
> > implemented. [...] I get what your objective is. I just don't think
> > FUA offers sufficient guarantees in that department.
>
> If you're saying this is going to be a work in progress to get the
> behaviour we need in this scenario - yes, absolutely.
>
> Beyond making sure that retries go to the physical media, there's
> "retry level" in the NVMe spec which needs to be plumbed, and that one
> will be particularly useful in multi-device scenarios (crank retry
> level up or down based on whether we can retry from different devices).

I saw you mention the RRL mechanism in another patch, and it really
piqued my interest. How are you intending to use this? In NVMe, this is
controlled via an admin "Set Features" command, which is absolutely not
available to a block device, much less a file system. That command queue
is only accessible to the driver and to user space admin, and is
definitely not a per-io feature.

> But we've got to start somewhere, and given that the spec says "bypass
> the cache" - that looks like the place to start.

This is a bit dangerous to assume. I don't find anywhere in any NVMe
specifications (I also checked T10 SBC) text saying anything similar to
"bypass" in relation to the cache for FUA reads. I am reasonably
confident some vendors, especially ones developing active-active
controllers, will fight you on this in the spec committee if you want to
take it up in those forums.

> If devices don't support the behaviour we want today, then nudging the
> drive manufacturers to support it is infinitely saner than getting a
> whole other bit plumbed through the NVMe standard, especially given
> that the letter of the spec does describe exactly what we want.

In my experience, the storage standards committees are more aligned to
accommodate appliance vendors than anything Linux specific. Your desired
read behavior would almost certainly require a new TPAR in NVMe to get
spec-defined behavior. It's not impossible, but I'll just say it is an
uphill battle and the end result may or may not look like what you have
in mind.

In summary, what we have by the specs from READ FUA:

  Flush and Read

What (I think) you want:

  Invalidate and Read

It sounds like you are trying to say that your scenario doesn't care
about the "Flush", so you get to use the existing semantics as the
"Invalidate" case, and I really don't think you get that guarantee from
any spec.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 19:24 ` Keith Busch @ 2025-03-17 19:40 ` Kent Overstreet 2025-03-17 20:39 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Kent Overstreet @ 2025-03-17 19:40 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote: > On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote: > > On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote: > > > I'm not saying that devices are perfect or that the standards make > > > sense. I'm just saying that your desired behavior does not match the > > > reality of how a large number of these devices are actually implemented. > > > > > > The specs are largely written by device vendors and therefore > > > deliberately ambiguous. Many of the explicit cache management bits and > > > bobs have been removed from SCSI or are defined as hints because device > > > vendors don't want the OS to interfere with how they manage resources, > > > including caching. I get what your objective is. I just don't think FUA > > > offers sufficient guarantees in that department. > > > > If you're saying this is going to be a work in progress to get the > > behaviour we need in this scenario - yes, absolutely. > > > > Beyond making sure that retries go to the physical media, there's "retry > > level" in the NVME spec which needs to be plumbed, and that one will be > > particularly useful in multi device scenarios. (Crank retry level up > > or down based on whether we can retry from different devices). > > I saw you mention the RRL mechanism in another patch, and it really > piqued my interest. How are you intending to use this? In NVMe, this is > controlled via an admin "Set Feature" command, which is absolutley not > available to a block device, much less a file system. That command queue > is only accesible to the driver and to user space admin, and is > definitely not a per-io feature. Oof, that's going to be a giant hassle then. That is something we definitely want, but it may be something for well down the line then. My more immediate priority is going to be finishing ZNS support, since that will no doubt inform anything we do in that area. > > But we've got to start somewhere, and given that the spec says "bypass > > the cache" - that looks like the place to start. > > This is a bit dangerous to assume. I don't find anywhere in any nvme > specifications (also checked T10 SBC) with text saying anything similiar > to "bypass" in relation to the cache for FUA reads. I am reasonably > confident some vendors, especially ones developing active-active > controllers, will fight you to the their win on the spec committee for > this if you want to take it up in those forums. "Read will come direct from media" reads pretty clear to me. But even if it's not supported the way we want, I'm not seeing anything dangerous about using it this way. Worst case, our retries aren't as good as we want them to be, and it'll be an item to work on in the future. As long as drives aren't barfing when we give them a read fua (and so far they haven't when running this code), we're fine for now. 
> > If devices don't support the behaviour we want today, then nudging the > > drive manufacturers to support it is infinitely saner than getting a > > whole nother bit plumbed through the NVME standard, especially given > > that the letter of the spec does describe exactly what we want. > > In my experience, the storage standards committees are more aligned to > accommodate appliance vendors than anything Linux specific. Your desired > read behavior would almost certainly be a new TPAR in NVMe to get spec > defined behavior. It's not impossible, but I'll just say it is an uphill > battle and the end result may or may not look like what you have in > mind. I'm not so sure. If there are users out there depending on a different meaning of read fua, then yes, absolutely (and it sounds like Martin might have been alluding to that - but why wouldn't the write have been done fua? I'd want to hear more about that). If, OTOH, this is just something that hasn't come up before - the language in the spec is already there, so once code is out there with enough users and a demonstrated use case then it might be a pretty simple nudge - "shoot down this range of the cache, don't just flush it" is a pretty simple code change, as far as these things go. > In summary, what we have by the specs for READ FUA: > > Flush and Read > > What (I think) you want: > > Invalidate and Read > > It sounds like you are trying to say that your scenario doesn't care > about the "Flush" so you get to use the existing semantics as the > "Invalidate" case, and I really don't think you get that guarantee from > any spec. Exactly. Previous data being flushed, if it was dirty, is totally fine. Specs aren't worth much if no one's depending on or testing a given behaviour, so what the spec strictly guarantees doesn't really matter here. What matters more is - does the behaviour make sense, will it be easy enough to implement, and does it conflict with behaviour anyone else is depending on. ^ permalink raw reply [flat|nested] 42+ messages in thread
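As a concrete illustration of the interface being argued over: once the block layer accepts the flag combination, issuing such a read needs nothing beyond long-standing bio APIs. A minimal sketch (error handling trimmed; whether the device truly serves the data from media is exactly the open question in this thread):

	/*
	 * Minimal sketch: read one page with REQ_FUA set, asking the device
	 * to serve the data from media rather than its cache - to whatever
	 * extent the device honours FUA on reads.  Assumes the block layer
	 * accepts REQ_OP_READ | REQ_FUA, which is what this patch enables.
	 */
	static int fua_read_page(struct block_device *bdev, sector_t sector,
				 struct page *page)
	{
		struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA,
					    GFP_NOIO);
		int ret;

		bio->bi_iter.bi_sector = sector;
		__bio_add_page(bio, page, PAGE_SIZE, 0);

		ret = submit_bio_wait(bio);	/* 0 on success, -Exx on error */
		bio_put(bio);
		return ret;
	}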
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 19:40 ` Kent Overstreet @ 2025-03-17 20:39 ` Keith Busch 2025-03-17 21:13 ` Bart Van Assche 2025-03-18 0:27 ` Kent Overstreet 0 siblings, 2 replies; 42+ messages in thread From: Keith Busch @ 2025-03-17 20:39 UTC (permalink / raw) To: Kent Overstreet Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: > On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote: > > This is a bit dangerous to assume. I don't find anywhere in any nvme > > specifications (also checked T10 SBC) any text saying anything similar > > to "bypass" in relation to the cache for FUA reads. I am reasonably > > confident some vendors, especially ones developing active-active > > controllers, will fight you to their win on the spec committee for > > this if you want to take it up in those forums. > > "Read will come direct from media" reads pretty clear to me. > > But even if it's not supported the way we want, I'm not seeing anything > dangerous about using it this way. Worst case, our retries aren't as > good as we want them to be, and it'll be an item to work on in the > future. I don't think you're appreciating the complications that active-active and multi-host bring to the scenario. Those are why this is not the forum to solve it. The specs need to be clear on the guarantees, and what they currently guarantee might provide some overlap with what you're seeking in specific scenarios, but I really think (and I believe Martin agrees) your use is outside its targeted use case. > As long as drives aren't barfing when we give them a read fua (and so > far they haven't when running this code), we're fine for now. In this specific regard, I think it's safe to assume the devices will remain operational. > > > > If devices don't support the behaviour we want today, then nudging the > > > > drive manufacturers to support it is infinitely saner than getting a > > > > whole nother bit plumbed through the NVME standard, especially given > > > > that the letter of the spec does describe exactly what we want. > > > > > > In my experience, the storage standards committees are more aligned to > > > accommodate appliance vendors than anything Linux specific. Your desired > > > read behavior would almost certainly be a new TPAR in NVMe to get spec > > > defined behavior. It's not impossible, but I'll just say it is an uphill > > > battle and the end result may or may not look like what you have in > > > mind. > > > > I'm not so sure. > > > > If there are users out there depending on a different meaning of read > > fua, then yes, absolutely (and it sounds like Martin might have been > > alluding to that - but why wouldn't the write have been done fua? I'd > > want to hear more about that) As I mentioned, READ FUA provides an optimization opportunity that can be used instead of Write + Flush or WriteFUA when the host isn't sure about the persistence needs at the time of the initial Write: it can be used as a checkpoint on a specific block range that you may have written and overwritten. This kind of "read" command provides a well defined persistence barrier. 
> If, OTOH, this is just something that hasn't come up before - the > language in the spec is already there, so once code is out there with > enough users and a demonstrated use case then it might be a pretty > simple nudge - "shoot down this range of the cache, don't just flush it" > is a pretty simple code change, as far as these things go. So you're telling me you've never written SSD firmware then waited for the manufacturer to release it to your users? Yes, I jest, and maybe YMMV; but relying on that process is putting your destiny in the wrong hands. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 20:39 ` Keith Busch @ 2025-03-17 21:13 ` Bart Van Assche 2025-03-18 1:06 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Bart Van Assche @ 2025-03-17 21:13 UTC (permalink / raw) To: Keith Busch, Kent Overstreet Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On 3/17/25 1:39 PM, Keith Busch wrote: > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: >> If, OTOH, this is just something that hasn't come up before - the >> language in the spec is already there, so once code is out there with >> enough users and a demonstrated use case then it might be a pretty >> simple nudge - "shoot down this range of the cache, don't just flush it" >> is a pretty simple code change, as far as these things go. > > So you're telling me you've never written SSD firmware then waited for > the manufacturer to release it to your users? Yes, I jest, and maybe > YMMV; but relying on that process is putting your destiny in the wrong > hands. As far as I know, in the Linux kernel we only support storage device behavior that has been standardized, plus workarounds for bugs. I'm not sure we should support behavior that deviates from existing standards and from longstanding, well-established conventions. Bart. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 21:13 ` Bart Van Assche @ 2025-03-18 1:06 ` Kent Overstreet 2025-03-18 6:16 ` Christoph Hellwig 2025-03-18 17:49 ` Bart Van Assche 0 siblings, 2 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 1:06 UTC (permalink / raw) To: Bart Van Assche Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 02:13:14PM -0700, Bart Van Assche wrote: > On 3/17/25 1:39 PM, Keith Busch wrote: > > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: > > > If, OTOH, this is just something that hasn't come up before - the > > > language in the spec is already there, so once code is out there with > > > enough users and a demonstrated use case then it might be a pretty > > > simple nudge - "shoot down this range of the cache, don't just flush it" > > > is a pretty simple code change, as far as these things go. > > > > So you're telling me you've never written SSD firmware then waited for > > the manufacturer to release it to your users? Yes, I jest, and maybe > > YMMV; but relying on that process is putting your destiny in the wrong > > hands. > > As far as I know in the Linux kernel we only support storage device behavior > that has been standardized and also workarounds for bugs. I'm > not sure we should support behavior in the Linux kernel that deviates > from existing standards and that also deviates from longstanding and > well-established conventions. What bcachefs is doing is entirely in line with the behaviour the standard states. Keith and Martin are describing systems with a different interpretation, because they want the side effect (flush cache) without the main effect (possibly evict cache, but read comes from media). It's an amusing state of affairs, but it'd be easily resolved with an admin level NVME command to flip a state bit (like the read recovery level we were also talking about), and anyways multihoming capable NVME devices are an entirely different market from the conventional stuff. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 1:06 ` Kent Overstreet @ 2025-03-18 6:16 ` Christoph Hellwig 0 siblings, 0 replies; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:16 UTC (permalink / raw) To: Kent Overstreet Cc: Bart Van Assche, Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 09:06:13PM -0400, Kent Overstreet wrote: > What bcachefs is doing is entirely in line with the behaviour the > standard states. Setting the FUA bit on any READ command (if supported by the device) is entirely in line with the behaviour the standards describe. It's just not going to do what you hope. And while you claim that it helps you with data recovery, you've not shown either a practical example where it does, or a theory based on hardware architecture for how it could. > It's an amusing state of affairs, but it'd be easily resolved with an > admin level NVME command to flip a state bit (like the read recovery > level we were also talking about), and anyways multihoming capable NVME > devices are an entirely different market from the conventional stuff. Please join the NVMe technical working group and write a proposal. NVMe requires you to clearly state the benefits of the proposal, which means you have to actually clearly write that down first. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 1:06 ` Kent Overstreet 2025-03-18 6:16 ` Christoph Hellwig @ 2025-03-18 17:49 ` Bart Van Assche 2025-03-18 18:00 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Bart Van Assche @ 2025-03-18 17:49 UTC (permalink / raw) To: Kent Overstreet Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On 3/17/25 6:06 PM, Kent Overstreet wrote: > What bcachefs is doing is entirely in line with the behaviour the > standard states. I do not agree with the above. Kent, do you plan to attend the LSF/MM/BPF summit next week? I'm wondering whether allowing REQ_FUA|REQ_READ would be a good topic for that summit. Bart. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 17:49 ` Bart Van Assche @ 2025-03-18 18:00 ` Kent Overstreet 2025-03-18 18:10 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 18:00 UTC (permalink / raw) To: Bart Van Assche Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 10:49:25AM -0700, Bart Van Assche wrote: > On 3/17/25 6:06 PM, Kent Overstreet wrote: > > What bcachefs is doing is entirely in line with the behaviour the > > standard states. > > I do not agree with the above. > > Kent, do you plan to attend the LSF/MM/BPF summit next week? I'm > wondering whether allowing REQ_FUA|REQ_READ would be a good topic > for that summit. No, I won't be there this year. And I don't think it'd be the right forum for arguing over the meaning of an obscure line in the NVME spec, anyways :) It's certainly not in dispute that read fua is a documented, legitimate command, so there's no reason for the block layer to be rejecting it. Whether it has exactly the behaviour we want isn't a critical issue that has to be determined right now. The starting point for that will be to test device behaviour (with some simple performance tests, like I mentioned), and anyways it's outside the scope of the block layer. ^ permalink raw reply [flat|nested] 42+ messages in thread
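For reference, the gatekeeping being discussed is a sanity check in submit_bio_noacct(). A simplified sketch of the relaxed check, not the literal upstream diff, might look like the following (REQ_PREFLUSH stays write-only; REQ_FUA additionally becomes acceptable on reads):

	/*
	 * Simplified sketch of the submit_bio_noacct() flush/FUA sanity
	 * check in block/blk-core.c; illustrative, not the actual patch.
	 */
	if (op_is_flush(bio->bi_opf)) {
		if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) &&
				 bio_op(bio) != REQ_OP_WRITE &&
				 bio_op(bio) != REQ_OP_ZONE_APPEND))
			goto end_io;

		if (WARN_ON_ONCE((bio->bi_opf & REQ_FUA) &&
				 bio_op(bio) != REQ_OP_READ &&
				 bio_op(bio) != REQ_OP_WRITE &&
				 bio_op(bio) != REQ_OP_ZONE_APPEND))
			goto end_io;

		/* No volatile write cache: the flags are no-ops. */
		if (!bdev_write_cache(bio->bi_bdev))
			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
	}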
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 18:00 ` Kent Overstreet @ 2025-03-18 18:10 ` Keith Busch 2025-03-18 18:13 ` Kent Overstreet 2025-03-20 5:40 ` Christoph Hellwig 0 siblings, 2 replies; 42+ messages in thread From: Keith Busch @ 2025-03-18 18:10 UTC (permalink / raw) To: Kent Overstreet Cc: Bart Van Assche, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 02:00:13PM -0400, Kent Overstreet wrote: > It's certainly not in dispute that read fua is a documented, legitimate > command, so there's no reason for the block layer to be rejecting it. > > Whether it has exactly the behaviour we want isn't a critical issue that > has to be determined right now. The starting point for that will be to > test device behaviour (with some simple performance tests, like I > mentioned), and anyways it's outside the scope of the block layer. Maybe just change the commit log. Read FUA has legit uses for persisting data as described by the specs. No need to introduce contested behavior to justify this patch, yah? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 18:10 ` Keith Busch @ 2025-03-18 18:13 ` Kent Overstreet 2025-03-20 5:40 ` Christoph Hellwig 1 sibling, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 18:13 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote: > On Tue, Mar 18, 2025 at 02:00:13PM -0400, Kent Overstreet wrote: > > It's certainly not in dispute that read fua is a documented, legitimate > > command, so there's no reason for the block layer to be rejecting it. > > > > Whether it has exactly the behaviour we want isn't a critical issue that > > has to be determined right now. The starting point for that will be to > > test device behaviour (with some simple performance tests, like I > > mentioned), and anyways it's outside the scope of the block layer. > > Maybe just change the commit log. Read FUA has legit uses for persisting > data as described by the specs. No need to introduce contested behavior > to justify this patch, yah? That's reasonable, I'll do that. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 18:10 ` Keith Busch 2025-03-18 18:13 ` Kent Overstreet @ 2025-03-20 5:40 ` Christoph Hellwig 2025-03-20 10:28 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2025-03-20 5:40 UTC (permalink / raw) To: Keith Busch Cc: Kent Overstreet, Bart Van Assche, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote: > Maybe just change the commit log. Read FUA has legit uses for persisting > data as described by the specs. No need to introduce contested behavior > to justify this patch, yah? While not having a factually incorrect commit message is a great start, I still don't think we want it. For one, there is no actual use case for the actual semantics, so why add it? Second, it still needs all the proper per-driver opt-in, as these semantics are not defined for all our protocols, as I've already mentioned before. But hey, maybe Kent can actually find other storage or file system developers to support it, so having an at least technically correct patch out on the list would be a big start, even if I would not expect Jens to take it on a whim. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-20 5:40 ` Christoph Hellwig @ 2025-03-20 10:28 ` Kent Overstreet 0 siblings, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-20 10:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, Bart Van Assche, Martin K. Petersen, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Wed, Mar 19, 2025 at 10:40:53PM -0700, Christoph Hellwig wrote: > On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote: > > Maybe just change the commit log. Read FUA has legit uses for persisting > > data as described by the specs. No need to introduce contested behavior > > to justify this patch, yah? > > While not having a factually incorrect commit message is a great > start, I still don't think we want it. For one, there is no actual > use case for the actual semantics, so why add it? Second, it still > needs all the proper per-driver opt-in, as these semantics are not > defined for all our protocols, as I've already mentioned before. > > But hey, maybe Kent can actually find other storage or file system > developers to support it, so having an at least technically correct > patch out on the list would be a big start, even if I would not expect > Jens to take it on a whim. Christoph, You're arguing over nothing. Go back and reread: you and I have the same interpretation of what read fua should do. You all really are a surly and argumentative lot... ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 20:39 ` Keith Busch 2025-03-17 21:13 ` Bart Van Assche @ 2025-03-18 0:27 ` Kent Overstreet 1 sibling, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 0:27 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 02:39:07PM -0600, Keith Busch wrote: > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: > > On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote: > > > This is a bit dangerous to assume. I don't find anywhere in any nvme > > > specifications (also checked T10 SBC) any text saying anything similar > > > to "bypass" in relation to the cache for FUA reads. I am reasonably > > > confident some vendors, especially ones developing active-active > > > controllers, will fight you to their win on the spec committee for > > > this if you want to take it up in those forums. > > > > "Read will come direct from media" reads pretty clear to me. > > > > But even if it's not supported the way we want, I'm not seeing anything > > dangerous about using it this way. Worst case, our retries aren't as > > good as we want them to be, and it'll be an item to work on in the > > future. > > I don't think you're appreciating the complications that active-active > and multi-host bring to the scenario. Those are why this is not the > forum to solve it. The specs need to be clear on the guarantees, and > what they currently guarantee might provide some overlap with what > you're seeking in specific scenarios, but I really think (and I believe > Martin agrees) your use is outside its targeted use case. You do realize this is a single node filesystem we're talking about, right? Crazy multi-homing stuff, while cool, has no bearing here. > > As long as drives aren't barfing when we give them a read fua (and so > > far they haven't when running this code), we're fine for now. > > In this specific regard, I think it's safe to assume the devices will > remain operational. > > > > > If devices don't support the behaviour we want today, then nudging the > > > > drive manufacturers to support it is infinitely saner than getting a > > > > whole nother bit plumbed through the NVME standard, especially given > > > > that the letter of the spec does describe exactly what we want. > > > > > > In my experience, the storage standards committees are more aligned to > > > accommodate appliance vendors than anything Linux specific. Your desired > > > read behavior would almost certainly be a new TPAR in NVMe to get spec > > > defined behavior. It's not impossible, but I'll just say it is an uphill > > > battle and the end result may or may not look like what you have in > > > mind. > > > > I'm not so sure. > > > > If there are users out there depending on a different meaning of read > > fua, then yes, absolutely (and it sounds like Martin might have been > > alluding to that - but why wouldn't the write have been done fua? I'd > > want to hear more about that) > > As I mentioned, READ FUA provides an optimization opportunity that > can be used instead of Write + Flush or WriteFUA when the host isn't > sure about the persistence needs at the time of the initial Write: it > can be used as a checkpoint on a specific block range that you may have > written and overwritten. This kind of "read" command provides a well > defined persistence barrier. 
> Thinking of Read FUA as a barrier is better > aligned with how the standards and device makers intended it to be used. Yeah, I got that. Again, a neat trick, but who in their right mind would use that sort of thing when the sane thing to do is just write fua? So I'm skeptical that that sort of thing is an actual use case with any bearing on the devices any single node filesystem targets. Now, in crazy enterprise multi homing land, sure. Now, if you're saying you think the standard should be interpreted in a way such that read fua does what it seems you and Martin want it to do in crazy enterprise multi homing land... now _that_ would be a fun argument to have in a standards committee :) But I mostly jest, because my suspicion is that there wouldn't be any real conflict, just a bit of the "I hadn't thought to use it that way" anxiety. > > If, OTOH, this is just something that hasn't come up before - the > > language in the spec is already there, so once code is out there with > > enough users and a demonstrated use case then it might be a pretty > > simple nudge - "shoot down this range of the cache, don't just flush it" > > is a pretty simple code change, as far as these things go. > > So you're telling me you've never written SSD firmware then waited for > the manufacturer to release it to your users? Yes, I jest, and maybe > YMMV; but relying on that process is putting your destiny in the wrong > hands. Nah, back when I was at an employer that did SSD drivers/firmware, it was all in house - no waiting to ship required :) And incidentally, it's been fun watching the "FTL host or device side" thing go back and forth since then; we had the same back-and-forth happen just between different generations of the internal stuff being built at Google that's been happening now with NVME and ZNS. The appeal of a host side FTL was fairly obvious back then, but the FTL ended up moving from the host to the device because people wanted to do complete kernel IO stack bypass. The AIO and DIO code were really bad back then, especially in certain multithreaded scenarios - profiles were absolutely atrocious, and the performance gains of a host side FTL only really happen if you can do it in combination with the filesystem, cutting out a lot of redundancy and improving on GC efficiency (this is a big one), because in the filesystem you have a lot more information on what goes with what, information that's lost at the block layer. IO tail latency in particular improves, especially on loaded machines. But a filesystem meant for that didn't exist at the time, nor did hardware with any kind of an open interface... Since then the kernel IO stack has gotten massively faster, ZNS hardware exists, and bcachefs was pretty much designed from the start for directly driving flash. There's about a month of work left, max, to finish that off and drive hardware I have sitting on my desk. Which means there should be more interesting things happening in the NVME transport area in the future. In particular, moving the FTL into the filesystem ought to allow for much more gracefully degrading failure modes instead of the whole SSD going offline (and signals to the user about when flash is going bad! flash ECC algorithms give you this, it's just not exposed!). So should be fun times. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 18:21 ` Kent Overstreet 2025-03-17 19:24 ` Keith Busch @ 2025-03-18 6:11 ` Christoph Hellwig 2025-03-18 21:33 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:11 UTC (permalink / raw) To: Kent Overstreet Cc: Martin K. Petersen, Keith Busch, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote: > Beyond making sure that retries go to the physical media, there's "retry > level" in the NVME spec which needs to be plumbed, and that one will be > particularly useful in multi device scenarios. (Crank retry level up > or down based on whether we can retry from different devices). The read recovery level is to reduce the amount or intensity of read retries, not to increase it, so that workloads that have multiple sources for data aren't stalled by the sometimes extremely long read recovery. You won't really find much hardware that actually implements it. As a little background: The read recovery level was added as part of the predictive latency mode and originally tied to the (now heavily deprecated and never implemented at scale) NVM sets. Yours truly successfully argued that they should not be tied to NVM sets and helped to make them more generic, but there was basically no uptake of the read recovery level, with or without NVM sets. > But we've got to start somewhere, and given that the spec says "bypass > the cache" But as people have been telling you multiple times, that is not what any of the specs says. > - that looks like the place to start. If devices don't > support the behaviour we want today, then nudging the drive > manufacturers to support it is infinitely saner than getting a whole > nother bit plumbed through the NVME standard, especially given that the > letter of the spec does describe exactly what we want. That assumes that anyone but you actually agrees with your definition of sane behavior. I don't think you've been overly successful at selling anyone on that idea so far. Selling the hardware people on it will be even harder given that the "cache" usually is not really a cache but an integral part of the data path. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 6:11 ` Christoph Hellwig @ 2025-03-18 21:33 ` Kent Overstreet 0 siblings, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 21:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin K. Petersen, Keith Busch, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 11:11:10PM -0700, Christoph Hellwig wrote: > On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote: > > Beyond making sure that retries go to the physical media, there's "retry > > level" in the NVME spec which needs to be plumbed, and that one will be > > particularly useful in multi device scenarios. (Crank retry level up > > or down based on whether we can retry from different devices). > > The read recovery level is to reduce the amount or intensity of read > retries, not to increase it, so that workloads that have multiple sources > for data aren't stalled by the sometimes extremely long read recovery. > You won't really find much hardware that actually implements it. > > As a little background: The read recovery level was added as part of the > predictive latency mode and originally tied to the (now heavily deprecated > and never implemented at scale) NVM sets. Yours truly successfully > argued that they should not be tied to NVM sets and helped to make them > more generic, but there was basically no uptake of the read recovery > level, with or without NVM sets. Well, if it can't be set per IO, that makes it fairly useless. If it _was_ per IO it'd be dead easy to slot into bcachefs; the tracking of IO/checksum errors in a replicated/erasure coded extent is sophisticated enough to easily accommodate things like this (mainly you need to know when submitting - do we have additional retries? and then when you get an error, you don't want to count it if it was "fast fail"). ^ permalink raw reply [flat|nested] 42+ messages in thread
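If the knob ever did become per-I/O, the policy Kent sketches here is simple to state. A purely hypothetical illustration; every name below is invented, since NVMe today only exposes the recovery level as the controller-wide admin feature discussed earlier in the thread:

	/*
	 * Hypothetical: pick a per-read recovery effort based on how many
	 * other replicas remain.  RRL_FAST_FAIL and RRL_MAX_EFFORT stand
	 * in for whatever points on a recovery-level scale would apply;
	 * none of this is a real API today.
	 */
	enum rrl_hint { RRL_MAX_EFFORT, RRL_FAST_FAIL };

	static enum rrl_hint pick_recovery_level(unsigned replicas_remaining)
	{
		/*
		 * Other copies still untried: fail fast and retry from a
		 * different device.  Last copy: try as hard as possible.
		 */
		return replicas_remaining > 1 ? RRL_FAST_FAIL : RRL_MAX_EFFORT;
	}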
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 15:30 ` Martin K. Petersen 2025-03-17 15:43 ` Kent Overstreet @ 2025-03-17 17:32 ` Keith Busch 2025-03-18 6:19 ` Christoph Hellwig 1 sibling, 1 reply; 42+ messages in thread From: Keith Busch @ 2025-03-17 17:32 UTC (permalink / raw) To: Martin K. Petersen Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 11:30:13AM -0400, Martin K. Petersen wrote: > Kent is trying to address a scenario in which data in the cache is > imperfect and the data on media is reliable. SCSI did not consider such > a scenario because the entire access model would be fundamentally broken > if cache contents couldn't be trusted. I'm not even sure it makes sense to suspect the controller side but consider the host side reliable? If the media is in fact perfect, and if a non-FUA read returns data that fails the checksum, wouldn't it be just as likely that the transport or host side is the problem? Every NVMe SSD I've developed couldn't efficiently fill the PCIe tx-buffer directly from media (excluding Optane; RIP), so a read with or without FUA would stage the data in the very same controller-side DRAM. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 17:32 ` Keith Busch @ 2025-03-18 6:19 ` Christoph Hellwig 0 siblings, 0 replies; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:19 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 11:32:21AM -0600, Keith Busch wrote: > I'm not even sure it makes sense to suspect the controller side but > consider the host side reliable? If the media is in fact perfect, and > if a non-FUA read returns data that fails the checksum, wouldn't it be > just as likely that the transport or host side is the problem? From the error numbers we've seen, it would in fact be much more likely to be the host or transport side. > Every > NVMe SSD I've developed couldn't efficiently fill the PCIe tx-buffer > directly from media (excluding Optane; RIP), so a read with or without > FUA would stage the data in the very same controller-side DRAM. Exactly. It's a writeback+invalidate+reread operation. A great way to slow down the read, but it's not going to help you with data recovery. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 12:15 ` Kent Overstreet 2025-03-17 14:13 ` Keith Busch @ 2025-03-18 6:01 ` Christoph Hellwig 1 sibling, 0 replies; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:01 UTC (permalink / raw) To: Kent Overstreet Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote: > On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote: > > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote: > > > That's the sort of thing that process is supposed to avoid. To be clear, > > > you and Christoph are the two reasons I've had to harp on process in the > > > past - everyone else in the kernel has been fine. > > ... > > > In this case you very clearly misunderstood how the FUA bit works on > > reads (and clearly documented that misunderstanding in the commit log). > > FUA reads are clearly documented in the NVME spec. They are, and the semantics are, as multiple people pointed out, very different from what you claim they are in the commit message. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet 2025-03-11 20:15 ` [PATCH 13/14] block: Allow REQ_FUA|REQ_READ Kent Overstreet @ 2025-03-17 20:55 ` John Stoffel 2025-03-18 1:15 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: John Stoffel @ 2025-03-17 20:55 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > we've got an extent that can't be read, due to checksum error/bitrot. > This took some doing to fix properly, because > - We don't want to just delete such data (replace it with > KEY_TYPE_error); we never want to delete anything except when > explicitly told to by the user, and while we don't yet have an API for > "read this file even though there's errors, just give me what you > have" we definitely will in the future. So will open() just return an error? How will this look from userspace? > - Not being able to move data is a non-option: that would block copygc, > device removal, etc. > - And, moving said extent requires giving it a new checksum - strictly > required if the move has to fragment it, teaching the write/move path > about extents with bad checksums is unpalateable, and anyways we'd > like to be able to guard against more bitrot, if we can. Why does it need a new checksum if there are read errors? What happens if there are more read errors? If I have a file on a filesystem which is based on spinning rust and I get a single bit flip, I'm super happy you catch it. But now you re-checksum the file, with the read error, and return it? I'm stupid and just a user/IT guy. I want notifications, but I don't want my application to block so I can't kill it, or unmount the filesystem. Or continue to use it if I like. > So that means: > - Extents need a poison bit: "reads return errors, even though it now > has a good checksum" - this was added in a separate patch queued up > for 6.15. Sorry for being dense, but does a file have one or more extents? Or is this at a level below that? > It's an incompat feature because it's a new extent field, and old > versions can't parse extents with unknown field types, since they > won't know their sizes - meaning users will have to explicitly do an > incompat upgrade to make use of this stuff. > - The read path needs to do additional retries after checksum errors > before giving up and marking it poisoned, so that we don't > accidentally convert a transient error to permanent corruption. When doing these retries, is the filesystem locked up or will the process doing the read() be blocked from being killed? > - The read path gets a whole bunch of work to plumb precise modern error > codes around, so that e.g. the retry path, the data move path, and the > "mark extent poisoned" path all know exactly what's going on. > - Read path is responsible for marking extents poisoned after sufficient > retry attempts (controlled by a new filesystem option) > - Data move path is allowed to move extents after a read error, if it's > a checksum error (giving it a new checksum) if it's been poisoned > (i.e. the extent flags feature is enabled). So if just a single bit flips, the extent gets moved onto better storage, and the file gets re-checksummed. But what about if more bits go bad afterwards? 
> Code should be more or less finalized - still have more tests for corner > cases to write, but "write corrupt data and then tell rebalance to move > it to another device" works as expected. > TODO: > - NVME has a "read recovery level" attribute that controlls how hard the > erasure coding algorithms work - we want that plumbed. > Before we give up and move data that we know is bad, we need to try > _as hard as possible_ to get a successful read. What does this mean exactly from an end-user and sysadmin point of view? Locking up the system (no new writes to the now-problematic storage) in a way I can't kill would be annoyingly painful. Don't get me wrong, I'm happy to see data integrity stuff get added, I'm just trying to understand how it interacts with userspace. John ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel @ 2025-03-18 1:15 ` Kent Overstreet 2025-03-18 14:47 ` John Stoffel 0 siblings, 1 reply; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 1:15 UTC (permalink / raw) To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote: > >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > > > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > > we've got an extent that can't be read, due to checksum error/bitrot. > > > This took some doing to fix properly, because > > > - We don't want to just delete such data (replace it with > > KEY_TYPE_error); we never want to delete anything except when > > explicitly told to by the user, and while we don't yet have an API for > > "read this file even though there's errors, just give me what you > > have" we definitely will in the future. > > So will open() just return an error? How will this look from > userspace? Not the open, the read - the typical case is only a single extent goes bad; it's like any other IO error. > > - Not being able to move data is a non-option: that would block copygc, > > device removal, etc. > > > - And, moving said extent requires giving it a new checksum - strictly > > required if the move has to fragment it, teaching the write/move path > > about extents with bad checksums is unpalateable, and anyways we'd > > like to be able to guard against more bitrot, if we can. > > Why does it need a new checksum if there are read errors? What > happens if there are more read errors? If I have a file on a > filesystem which is based in spinning rust and I get a single bit > flip, I'm super happy you catch it. The data move paths very strictly verify checksums as they move data around so they don't introduce bitrot. I'm not going to add if (!bitrotted_extent) checksum(); else no_checksum() Eww... Besides being gross, we also would like to guard against introducing more bitrot. > But now you re-checksum the file, with the read error, and return it? > I'm stupid and just a user/IT guy. I want notifications, but I don't > want my application to block so I can't kill it, or unmount the > filesystem. Or continue to use it if I like. The aforementioned poison bit ensures that you still get the error from the original checksum error when you read that data - unless you use an appropriate "give it to me anyways" API. > > So that means: > > > - Extents need a poison bit: "reads return errors, even though it now > > has a good checksum" - this was added in a separate patch queued up > > for 6.15. > > Sorry for being dense, but does a file have one or more extents? Or > is this at a level below that? Files have multiple extents. An extent is one contiguous range of data, and in bcachefs checksums are at the extent level, not block, so checksummed (and compressed) extents are limited to, by default, 128k. > > It's an incompat feature because it's a new extent field, and old > > versions can't parse extents with unknown field types, since they > > won't know their sizes - meaning users will have to explicitly do an > > incompat upgrade to make use of this stuff. > > > - The read path needs to do additional retries after checksum errors > > before giving up and marking it poisoned, so that we don't > > accidentally convert a transient error to permanent corruption. 
> > When doing these retries, is the filesystem locked up or will the > process doing the read() be blocked from being killed? The process doing the read() can't be killed during this, no. If requested this could be changed, but keep in mind retries are limited in number. Nothing else is "locked up", everything else proceeds as normal. > > - The read path gets a whole bunch of work to plumb precise modern error > > codes around, so that e.g. the retry path, the data move path, and the > > "mark extent poisoned" path all know exactly what's going on. > > > - Read path is responsible for marking extents poisoned after sufficient > > retry attempts (controlled by a new filesystem option) > > > - Data move path is allowed to move extents after a read error, if it's > > a checksum error (giving it a new checksum) if it's been poisoned > > (i.e. the extent flags feature is enabled). > > So if just a single bit flips, the extent gets moved onto better > storage, and the file gets re-checksummed. But what about if more > bits go bad afterwards? The new checksum means they're detected, and if you have replication enabled they'll be corrected automatically, like any other IO error. ^ permalink raw reply [flat|nested] 42+ messages in thread
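Pulling the pieces of this exchange together, the retry-then-poison policy reduces to something like the sketch below. Every name is hypothetical shorthand rather than actual bcachefs code; the real read path also tracks per-device failures and plumbs precise error codes around:

	/*
	 * Hypothetical sketch: bounded retries on checksum failure, then
	 * mark the extent poisoned so future reads fail immediately
	 * instead of re-hammering the device.  Invented names throughout.
	 */
	static int read_extent_checked(struct extent *e, void *buf,
				       unsigned max_retries /* fs option, default 3 */)
	{
		unsigned attempt = 0;

		if (extent_poisoned(e))
			return -EIO;	/* known bad: fail fast, no media I/O */

		do {
			if (!read_extent_raw(e, buf) && csum_ok(e, buf))
				return 0;
		} while (++attempt < max_retries);

		extent_mark_poisoned(e);  /* persists; a fresh checksum lets the data move */
		return -EIO;
	}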
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-18 1:15 ` Kent Overstreet @ 2025-03-18 14:47 ` John Stoffel 2025-03-20 17:15 ` Kent Overstreet 0 siblings, 1 reply; 42+ messages in thread From: John Stoffel @ 2025-03-18 14:47 UTC (permalink / raw) To: Kent Overstreet Cc: John Stoffel, linux-bcachefs, linux-block, Roland Vet, linux-fsdevel >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote: >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: >> >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if >> > we've got an extent that can't be read, due to checksum error/bitrot. >> >> > This took some doing to fix properly, because >> >> > - We don't want to just delete such data (replace it with >> > KEY_TYPE_error); we never want to delete anything except when >> > explicitly told to by the user, and while we don't yet have an API for >> > "read this file even though there's errors, just give me what you >> > have" we definitely will in the future. >> >> So will open() just return an error? How will this look from >> userspace? > Not the open, the read - the typical case is only a single extent goes > bad; it's like any other IO error. Good. But then how would an application know we got a checksum error for data corruption? Would I have to update all my code to do another special call when I get a read/write error to see if it was a corruption issue? >> > - Not being able to move data is a non-option: that would block copygc, >> > device removal, etc. >> >> > - And, moving said extent requires giving it a new checksum - strictly >> > required if the move has to fragment it, teaching the write/move path >> > about extents with bad checksums is unpalateable, and anyways we'd >> > like to be able to guard against more bitrot, if we can. >> >> Why does it need a new checksum if there are read errors? What >> happens if there are more read errors? If I have a file on a >> filesystem which is based on spinning rust and I get a single bit >> flip, I'm super happy you catch it. > The data move paths very strictly verify checksums as they move data > around so they don't introduce bitrot. Good. This is something I really liked as an idea in ZFS, happy to see it coming to bcachefs as well. > I'm not going to add > if (!bitrotted_extent) checksum(); else no_checksum() > Eww... LOL! > Besides being gross, we also would like to guard against introducing > more bitrot. >> But now you re-checksum the file, with the read error, and return it? >> I'm stupid and just a user/IT guy. I want notifications, but I don't >> want my application to block so I can't kill it, or unmount the >> filesystem. Or continue to use it if I like. > The aforementioned poison bit ensures that you still get the error from > the original checksum error when you read that data - unless you use an > appropriate "give it to me anyways" API. So this implies that I need to do extra work to A) know I'm on a bcachefs filesystem, B) that I got a read/write error and I need to do some more checks to see what exactly the error is. And if I want to re-write the file I can copy it to a new name, but only when I use the new API to say "give me all the data, even if you have a checksum error". >> > So that means: >> >> > - Extents need a poison bit: "reads return errors, even though it now >> > has a good checksum" - this was added in a separate patch queued up >> > for 6.15. 
>> Sorry for being dense, but does a file have one or more extents? Or >> is this at a level below that? > Files have multiple extents. > An extent is one contiguous range of data, and in bcachefs checksums are > at the extent level, not block, so checksummed (and compressed) extents > are limited to, by default, 128k. >> > It's an incompat feature because it's a new extent field, and old >> > versions can't parse extents with unknown field types, since they >> > won't know their sizes - meaning users will have to explicitly do an >> > incompat upgrade to make use of this stuff. >> >> > - The read path needs to do additional retries after checksum errors >> > before giving up and marking it poisoned, so that we don't >> > accidentally convert a transient error to permanent corruption. >> >> When doing these retries, is the filesystem locked up or will the >> process doing the read() be blocked from being killed? > The process doing the read() can't be killed during this, no. If > requested this could be changed, but keep in mind retries are limited in > number. How limited? And in the worst case, if I have 10 or 100 readers of a file with checksum errors, now I've blocked all those processes for X amount of time. Will this info be logged somewhere so a sysadm could possibly just do an 'rm' on the file to nuke it, or have some means of forcing a scrub? > Nothing else is "locked up", everything else proceeds as normal. But is the filesystem able to be unmounted when there's a locked up process? I'm just thinking in terms of system shutdowns when you have failing hardware and want to get things closed as cleanly as possible since you're going to clone the underlying block device onto new media ASAP in an offline manner. >> > - The read path gets a whole bunch of work to plumb precise modern error >> > codes around, so that e.g. the retry path, the data move path, and the >> > "mark extent poisoned" path all know exactly what's going on. >> >> > - Read path is responsible for marking extents poisoned after sufficient >> > retry attempts (controlled by a new filesystem option) >> >> > - Data move path is allowed to move extents after a read error, if it's >> > a checksum error (giving it a new checksum) if it's been poisoned >> > (i.e. the extent flags feature is enabled). >> >> So if just a single bit flips, the extent gets moved onto better >> storage, and the file gets re-checksummed. But what about if more >> bits go bad afterwards? > The new checksum means they're detected, and if you have replication > enabled they'll be corrected automatically, like any other IO error. Nice! ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-18 14:47 ` John Stoffel @ 2025-03-20 17:15 ` Kent Overstreet 0 siblings, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-20 17:15 UTC (permalink / raw) To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel On Tue, Mar 18, 2025 at 10:47:51AM -0400, John Stoffel wrote: > >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > > > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote: > >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > >> > >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > >> > we've got an extent that can't be read, due to checksum error/bitrot. > >> > >> > This took some doing to fix properly, because > >> > >> > - We don't want to just delete such data (replace it with > >> > KEY_TYPE_error); we never want to delete anything except when > >> > explicitly told to by the user, and while we don't yet have an API for > >> > "read this file even though there's errors, just give me what you > >> > have" we definitely will in the future. > >> > >> So will open() just return an error? How will this look from > >> userspace? > > > Not the open, the read - the typical case is only a single extent goes > > bad; it's like any other IO error. > > Good. But then how would an application know we got a checksum error > for data corruption? Would I have to update all my code to do another > special call when I get a read/write error to see if it was a > corruption issue? We can only return -EIO via the normal IO paths. > > >> > - Not being able to move data is a non-option: that would block copygc, > >> > device removal, etc. > >> > >> > - And, moving said extent requires giving it a new checksum - strictly > >> > required if the move has to fragment it, teaching the write/move path > >> > about extents with bad checksums is unpalateable, and anyways we'd > >> > like to be able to guard against more bitrot, if we can. > >> > >> Why does it need a new checksum if there are read errors? What > >> happens if there are more read errors? If I have a file on a > >> filesystem which is based in spinning rust and I get a single bit > >> flip, I'm super happy you catch it. > > > The data move paths very strictly verify checksums as they move data > > around so they don't introduce bitrot. > > Good. This is something I really liked as an idea in ZFS, happy to > see it coming to bcachefs as well. Coming? It's been that way since long before bcachefs was merged. > > The aforementioned poison bit ensures that you still get the error from > > the original checksum error when you read that data - unless you use an > > appropriate "give it to me anyways" API. > > So this implies that I need to do extra work to A) know I'm on > bcachefs filesystem, B) that I got a read/write error and I need to do > some more checks to see what the error exactly is. > > And if I want to re-write the file I can either copy it to a new name, > but only when I use the new API to say "give me all the data, even if > you have a checksum error". This is only if you want an API for "give me possibly corrupted data". That's pretty niche. > > The process doing the read() can't be killed during this, no. If > > requested this could be changed, but keep in mind retries are limited in > > number. > > How limited? 
And in the worst case, if I have 10 or 100 readers of a file with checksum errors, now I've blocked all those processes for X amount of time. Will this info be logged somewhere so a sysadm could possibly just do an 'rm' on the file to nuke it, or have some means of forcing a scrub? The poison bit means we won't attempt additional retries on an extent, once we've reached the (configurable) limit. Future IOs will just return an error without doing the actual read, since we already know it's bad. So no, this won't bog down your system. > > Nothing else is "locked up", everything else proceeds as normal. > > But is the filesystem able to be unmounted when there's a locked up > process? I'm just thinking in terms of system shutdowns when you have > failing hardware and want to get things closed as cleanly as possible > since you're going to clone the underlying block device onto new media > ASAP in an offline manner. The number of retries defaults to 3, and there's no real reason to make it more than 5, so this isn't a real change over the current situation where drives sometimes take forever on a read as they're going out. Eventually we could add a configurable hard limit on the amount of time we spend trying to service an individual read, but that's niche so further down the road. ^ permalink raw reply [flat|nested] 42+ messages in thread