* [PATCH 00/14] better handling of checksum errors/bitrot
@ 2025-03-11 20:15 Kent Overstreet
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
To: linux-bcachefs, linux-block; +Cc: Kent Overstreet, Roland Vet, linux-fsdevel
Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
we've got an extent that can't be read due to a checksum error or bitrot.
This took some doing to fix properly, because:
- We don't want to just delete such data (replace it with
  KEY_TYPE_error); we never want to delete anything except when
  explicitly told to by the user, and while we don't yet have an API for
  "read this file even though there are errors, just give me what you
  have", we definitely will in the future.
- Not being able to move data is a non-option: that would block copygc,
device removal, etc.
- And, moving said extent requires giving it a new checksum - strictly
  required if the move has to fragment it, teaching the write/move path
  about extents with bad checksums is unpalatable, and anyway we'd
  like to be able to guard against more bitrot, if we can.
So that means:
- Extents need a poison bit: "reads return errors, even though it now
has a good checksum" - this was added in a separate patch queued up
for 6.15.
It's an incompat feature because it's a new extent field, and old
versions can't parse extents with unknown field types, since they
won't know their sizes - meaning users will have to explicitly do an
incompat upgrade to make use of this stuff.
- The read path needs to do additional retries after checksum errors
before giving up and marking it poisoned, so that we don't
accidentally convert a transient error to permanent corruption.
- The read path gets a whole bunch of work to plumb precise modern error
codes around, so that e.g. the retry path, the data move path, and the
"mark extent poisoned" path all know exactly what's going on.
- Read path is responsible for marking extents poisoned after sufficient
  retry attempts (controlled by a new filesystem option) - see the
  sketch after this list.
- Data move path is allowed to move extents after a read error if it's
  a checksum error (giving them a new checksum) and the extent has been
  poisoned (i.e. the extent flags feature is enabled).
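
For concreteness, here's a minimal sketch of the retry-then-poison flow
the items above describe. Illustrative only: MAX_CSUM_RETRIES,
submit_read() and the poisoned field are hypothetical stand-ins, not the
actual bcachefs option, helper, or extent field:

  #include <stdbool.h>

  #define MAX_CSUM_RETRIES 3	/* stand-in for the new filesystem option */

  enum read_status { READ_OK, READ_CSUM_ERR, READ_POISONED };

  struct extent {
  	bool	poisoned;	/* stand-in for the new extent poison bit */
  };

  /* hypothetical I/O helper; retries pass fua=true so the re-read comes
   * from media rather than the controller cache */
  enum read_status submit_read(struct extent *e, bool fua);

  enum read_status read_extent(struct extent *e)
  {
  	if (e->poisoned)		/* poisoned extents have a good */
  		return READ_POISONED;	/* checksum but still return errors */

  	for (unsigned i = 0; i < MAX_CSUM_RETRIES; i++)
  		if (submit_read(e, i > 0) == READ_OK)
  			return READ_OK;	/* transient: retry succeeded */

  	e->poisoned = true;		/* only now is the error permanent */
  	return READ_CSUM_ERR;
  }

The key property: an extent is never marked poisoned until the retries
(which bypass the controller cache) have also failed, so a transient
error can't be converted into permanent corruption.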
Code should be more or less finalized - still have more tests for corner
cases to write, but "write corrupt data and then tell rebalance to move
it to another device" works as expected.
TODO:
- NVMe has a "read recovery level" attribute that controls how hard the
  erasure coding algorithms work - we want that plumbed.
Before we give up and move data that we know is bad, we need to try
_as hard as possible_ to get a successful read.
Code currently lives in
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-testing
Kent Overstreet (14):
bcachefs: Convert read path to standard error codes
bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
bcachefs: Read error message now indicates if it was for an internal
move
bcachefs: BCH_ERR_data_read_buffer_too_small
bcachefs: Return errors to top level bch2_rbio_retry()
bcachefs: Print message on successful read retry
bcachefs: Don't create bch_io_failures unless it's needed
bcachefs: Checksum errors get additional retries
bcachefs: __bch2_read() now takes a btree_trans
bcachefs: Poison extents that can't be read due to checksum errors
bcachefs: Data move can read from poisoned extents
bcachefs: Debug params for data corruption injection
block: Allow REQ_FUA|REQ_READ
bcachefs: Read retries after checksum errors are now REQ_FUA
block/blk-core.c | 19 +-
fs/bcachefs/bcachefs_format.h | 2 +
fs/bcachefs/btree_io.c | 2 +-
fs/bcachefs/errcode.h | 19 +-
fs/bcachefs/extents.c | 157 +++++++++-------
fs/bcachefs/extents.h | 7 +-
fs/bcachefs/extents_types.h | 11 +-
fs/bcachefs/io_read.c | 325 +++++++++++++++++++++++++---------
fs/bcachefs/io_read.h | 21 +--
fs/bcachefs/io_write.c | 24 +++
fs/bcachefs/move.c | 26 ++-
fs/bcachefs/opts.h | 5 +
fs/bcachefs/super-io.c | 4 +
fs/bcachefs/util.c | 21 +++
fs/bcachefs/util.h | 12 ++
15 files changed, 473 insertions(+), 182 deletions(-)
--
2.47.2
* [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-11 20:15 UTC (permalink / raw)
To: linux-bcachefs; +Cc: Kent Overstreet, Jens Axboe, linux-block

REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
the same as writes.

This is useful for when the filesystem gets a checksum error, it's
possible that a bit was flipped in the controller cache, and when we
retry we want to retry the entire IO, not just from cache.

The nvme driver already passes through REQ_FUA for reads, not just
writes, so disabling the warning is sufficient to start using it, and
bcachefs is implementing additional retries for checksum errors so can
immediately use it.

Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 block/blk-core.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d6c4fa3943b5..7b1103eb877d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -793,20 +793,21 @@ void submit_bio_noacct(struct bio *bio)
 		goto end_io;
 	}
 
+	if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) &&
+			 bio_op(bio) != REQ_OP_WRITE &&
+			 bio_op(bio) != REQ_OP_ZONE_APPEND))
+		goto end_io;
+
 	/*
 	 * Filter flush bio's early so that bio based drivers without flush
 	 * support don't have to worry about them.
 	 */
-	if (op_is_flush(bio->bi_opf)) {
-		if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE &&
-				 bio_op(bio) != REQ_OP_ZONE_APPEND))
+	if (op_is_flush(bio->bi_opf) &&
+	    !bdev_write_cache(bdev)) {
+		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
+		if (!bio_sectors(bio)) {
+			status = BLK_STS_OK;
 			goto end_io;
-		if (!bdev_write_cache(bdev)) {
-			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
-			if (!bio_sectors(bio)) {
-				status = BLK_STS_OK;
-				goto end_io;
-			}
 		}
 	}
-- 
2.47.2

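For readers following the diff: op_is_flush() treats REQ_FUA and
REQ_PREFLUSH identically, which is why the old code's WARN_ON_ONCE
fired on any non-write bio carrying either flag. The helper is defined
in include/linux/blk_types.h roughly as follows (shown for context; if
the definition has moved or changed in your tree, treat this as
illustrative):

  static inline bool op_is_flush(blk_opf_t op)
  {
  	return op & (REQ_FUA | REQ_PREFLUSH);
  }

The patch narrows the warning to REQ_PREFLUSH alone, so a
REQ_OP_READ | REQ_FUA bio now passes through; the flag-stripping path
for devices without a write cache is otherwise unchanged.
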
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 16:47 UTC (permalink / raw)
To: Kent Overstreet, linux-bcachefs; +Cc: linux-block

On 3/11/25 2:15 PM, Kent Overstreet wrote:
> REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> the same as writes.
>
> This is useful for when the filesystem gets a checksum error, it's
> possible that a bit was flipped in the controller cache, and when we
> retry we want to retry the entire IO, not just from cache.
>
> The nvme driver already passes through REQ_FUA for reads, not just
> writes, so disabling the warning is sufficient to start using it, and
> bcachefs is implementing additional retries for checksum errors so can
> immediately use it.

This one got effectively nak'ed by various folks here:

> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/

yet it's part of this series and in linux-next? Hmm?

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 17:01 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
> On 3/11/25 2:15 PM, Kent Overstreet wrote:
> > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache",
> > the same as writes.
> > [...]
>
> This one got effectively nak'ed by various folks here:
>
> > Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>
> yet it's part of this series and in linux-next? Hmm?

As I explained in that thread, they were only thinking about the caching
of writes.

That's not what we're concerned about; when we retry a read due to a
checksum error we do not want the previous _read_ cached.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 17:03 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block

On 3/15/25 11:01 AM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 10:47:09AM -0600, Jens Axboe wrote:
>> This one got effectively nak'ed by various folks here:
>>
>>> Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
>>
>> yet it's part of this series and in linux-next? Hmm?
>
> As I explained in that thread, they were only thinking about the caching
> of writes.
>
> That's not what we're concerned about; when we retry a read due to a
> checksum error we do not want the previous _read_ cached.

Please follow the usual procedure of getting the patch acked/reviewed on
the block list, and go through the usual trees. Until that happens, this
patch should not be in your tree, nor should it be staged in linux-next.

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 17:27 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
> On 3/15/25 11:01 AM, Kent Overstreet wrote:
> > As I explained in that thread, they were only thinking about the caching
> > of writes.
> >
> > That's not what we're concerned about; when we retry a read due to a
> > checksum error we do not want the previous _read_ cached.
>
> Please follow the usual procedure of getting the patch acked/reviewed on
> the block list, and go through the usual trees. Until that happens, this
> patch should not be in your tree, nor should it be staged in linux-next.

It's been posted to linux-block and sent to your inbox. If you're going
to take it that's fine, otherwise - since this is necessary for handling
bitrotted data correctly and I've got users who've been waiting on this
patch series, and it's just deleting a warning, I'm inclined to just
send it.

I'll make sure he's got the lore links and knows what's going on, but
this isn't a great thing to be delaying by citing process.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 17:43 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block

On 3/15/25 11:27 AM, Kent Overstreet wrote:
> On Sat, Mar 15, 2025 at 11:03:20AM -0600, Jens Axboe wrote:
>> Please follow the usual procedure of getting the patch acked/reviewed on
>> the block list, and go through the usual trees. Until that happens, this
>> patch should not be in your tree, nor should it be staged in linux-next.
>
> It's been posted to linux-block and sent to your inbox. If you're going
> to take it that's fine, otherwise - since this is necessary for handling
> bitrotted data correctly and I've got users who've been waiting on this
> patch series, and it's just deleting a warning, I'm inclined to just
> send it.
>
> I'll make sure he's got the lore links and knows what's going on, but
> this isn't a great thing to be delaying by citing process.

Kent, you know how this works, there's nothing to argue about here.

Let the block people get it reviewed and staged. You can't just post a
patch, ignore most replies, and then go on to stage it yourself later
that day. It didn't work well in other subsystems, and it won't work
well here either.

EOD on my end.

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 18:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block

On Sat, Mar 15, 2025 at 11:43:30AM -0600, Jens Axboe wrote:
> Kent, you know how this works, there's nothing to argue about here.
>
> Let the block people get it reviewed and staged. You can't just post a
> patch, ignore most replies, and then go on to stage it yourself later
> that day. It didn't work well in other subsystems, and it won't work
> well here either.

Jens, those replies weren't ignored, the concerns were answered. Never
have I seen that considered "effectively nacked".

And as far as "how things work around here" - let's not even go there,
lest we have to cover your issues with process.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Jens Axboe @ 2025-03-15 18:32 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Linus Torvalds

On 3/15/25 12:07 PM, Kent Overstreet wrote:
> Jens, those replies weren't ignored, the concerns were answered. Never
> have I seen that considered "effectively nacked".

Kent, let me make this really simple for you, since you either still
don't understand how the development process works, or are under some
assumption that it doesn't apply to you - get some of the decades of
storage experience in that first thread to sign off on the patch and use
case. The patch doesn't go upstream before that happens, simple as that.

-- 
Jens Axboe

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-15 18:41 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-bcachefs, linux-block, Linus Torvalds

On Sat, Mar 15, 2025 at 12:32:49PM -0600, Jens Axboe wrote:
> Kent, let me make this really simple for you, since you either still
> don't understand how the development process works, or are under some
> assumption that it doesn't apply to you - get some of the decades of
> storage experience in that first thread to sign off on the patch and
> use case. The patch doesn't go upstream before that happens, simple as
> that.

No, you're the one who's consistently been under the assumption that it
doesn't apply to you.

You've got a history of applying patches to code that I maintain without
even CCing the list, let alone myself, and introducing silent data
corruption bugs into code that I wrote, without CCs, which I then had to
debug, and then putting up absolutely ridiculous fights to get fixed.

That's the sort of thing that process is supposed to avoid. To be clear,
you and Christoph are the two reasons I've had to harp on process in the
past - everyone else in the kernel has been fine.

As I said before, I'm not trying to bypass you without communicating -
but this has gone completely off the rails.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Christoph Hellwig @ 2025-03-17 6:00 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> That's the sort of thing that process is supposed to avoid. To be clear,
> you and Christoph are the two reasons I've had to harp on process in the
> past - everyone else in the kernel has been fine.

Kent, stop those attacks. Maybe your patches are just simply not
correct? Sometimes that might be in the eye of the beholder when it is
about code structure, but in many cases they simply are objectively
broken.

In this case you very clearly misunderstood how the FUA bit works on
reads (and clearly documented that misunderstanding in the commit log).
You've had multiple people explain to you how it is wrong, and the
maintainer reject it for that reason. What other amount of arguments do
you want?

If you need another one: FUA is not just used on NVMe and SCSI, where it
is supported on reads even if not in the way you think, but also for all
kinds of other protocols. Even if we ended up thinking REQ_FUA on reads
is a good idea (and so far no one but you does) you'd need to audit all
these to check if the behavior is valid, and most likely add a
per-driver opt-in. But so far you've not even tried to make the case for
your feature.

> As I said before, I'm not trying to bypass you without communicating -
> but this has gone completely off the rails.

You have a very clear pattern where you assume you're perfect, everyone
else is stupid and thus must be overridden. Most people (including me
recently) have mostly given up arguing with you because it's so
fruitless. Maybe we shouldn't have, because that just seems to reaffirm
you in your personal universe.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 12:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > That's the sort of thing that process is supposed to avoid. To be clear,
> > you and Christoph are the two reasons I've had to harp on process in the
> > past - everyone else in the kernel has been fine.

...

> In this case you very clearly misunderstood how the FUA bit works on
> reads (and clearly documented that misunderstanding in the commit log).

FUA reads are clearly documented in the NVMe spec.

As for the rest, I'm just going to ignore the ranting.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Keith Busch @ 2025-03-17 14:13 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > In this case you very clearly misunderstood how the FUA bit works on
> > reads (and clearly documented that misunderstanding in the commit log).
>
> FUA reads are clearly documented in the NVMe spec.

NVMe's Read with FUA documentation is pretty short:

  Force Unit Access (FUA): If this bit is set to '1', then for data and
  metadata, if any, associated with logical blocks specified by the Read
  command, the controller shall:

  1) commit that data and metadata, if any, to non-volatile medium; and
  2) read the data and metadata, if any, from non-volatile medium.

So this aligns with ATA and SCSI.

The concern about what you're doing is with respect to "1". Avoiding
that requires all writes also set FUA, or you send a flush command
before doing any reads. Otherwise how would you know whether or not read
data came from a previous cached write?

I'm actually more curious how you came to use this mechanism for
recovery. Have you actually seen this approach fix a real problem? In my
experience, a Read FUA is used to ensure what you're reading has been
persistently committed. It can be used as an optimization to flush a
specific LBA range, but I'd not seen this used as a bit error recovery
mechanism.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 14:49 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 08:13:59AM -0600, Keith Busch wrote:
> NVMe's Read with FUA documentation is pretty short:
>
>   Force Unit Access (FUA): If this bit is set to '1', then for data and
>   metadata, if any, associated with logical blocks specified by the Read
>   command, the controller shall:
>
>   1) commit that data and metadata, if any, to non-volatile medium; and
>   2) read the data and metadata, if any, from non-volatile medium.
>
> So this aligns with ATA and SCSI.
>
> The concern about what you're doing is with respect to "1". Avoiding
> that requires all writes also set FUA, or you send a flush command
> before doing any reads. Otherwise how would you know whether or not
> read data came from a previous cached write?

Line 1 is not the relevant concern here.

That would be relevant if we were trying to verify the integrity of
writes end-to-end immediately after they'd been completed - and that
would be a lovely thing to do in the spirit of "true end to end data
integrity", but we're not ready to contemplate such things due to the
obvious performance implications.

The scenario here is: we go to read old data, we get a bit error, and
then we want to retry the read to verify that it wasn't a transient
error before declaring it dead (marking it poisoned).

In this scenario, "dirty data in the writeback cache" is something that
would only ever happen if userspace is doing very funny things with
O_DIRECT, and the drive flushing that data on FUA read is totally fine.
It's an irrelevant corner case.

But we do absolutely require checking for transient read errors, and we
will miss things and corrupt data unnecessarily without FUA reads that
bypass the controller cache.

> I'm actually more curious how you came to use this mechanism for
> recovery. Have you actually seen this approach fix a real problem? In
> my experience, a Read FUA is used to ensure what you're reading has
> been persistently committed. It can be used as an optimization to
> flush a specific LBA range, but I'd not seen this used as a bit error
> recovery mechanism.

It's not data recovery - in the conventional sense.

Typically, on conventional filesystems, this doesn't come up: most
filesystems don't even do data checksumming, and even on a filesystem
with data checksumming it won't automatically come up - because, like
other transient errors, it's generally perfectly acceptable to return
the error and let the upper layer retry later; the filesystem generally
doesn't have to care if the error is transient or not.

Where this becomes a filesystem concern is when those errors start
driving state changes within the filesystem, i.e. when we need to track
and know about them: in that case, we absolutely do need to try hard to
distinguish between transient and permanent errors.

Specifically, bcachefs needs to be able to move data around - even if
it's bitrotted and generating checksum errors. There's a couple of
different reasons we may need to move data:

- rebalance, the task that does arbitrary data processing in the
  background based on user policy (moving it to a slower device,
  compressing it or recompressing it with a more expensive algorithm)
- copygc, which compacts fragmented data in mostly-empty buckets and is
  required for the allocator and the whole system to be able to make
  forward progress
- device evacuate/removal

Rebalance is optional (and still needs bitrotted extents to be tracked
so that it doesn't spin on them, retrying them endlessly), but copygc
and evacuate are not. And the only way for these processes to handle
bitrotted extents (besides throwing them out, which obviously we're not
going to do) is to:

- generate a new checksum
- mark them as poisoned, so that reads to userspace still return the
  appropriate error (unless it's a read done with an "I know there's
  errors, give me what you still have" API).

But the critical thing here is, marking an extent as poisoned will turn
a transient corruption into a permanent error unless we've made damn
sure there is no transient error, with additional FUA retries.

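To make that concrete, here is a minimal kernel-side sketch of what a
cache-bypassing retry could look like. retry_read_fua() is a
hypothetical helper, not the actual bcachefs read path; it only assumes
the block layer accepts REQ_OP_READ | REQ_FUA bios, which is what the
patch in this series enables:

  #include <linux/bio.h>
  #include <linux/blkdev.h>

  /*
   * Hypothetical helper, not the actual bcachefs read path: re-read one
   * range with FUA set, so the retry is serviced from media rather than
   * the controller cache.
   */
  static int retry_read_fua(struct block_device *bdev, sector_t sector,
  			  struct page *page, unsigned len)
  {
  	struct bio *bio;
  	int ret;

  	bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA, GFP_NOIO);
  	bio->bi_iter.bi_sector = sector;
  	__bio_add_page(bio, page, len, 0);

  	/* synchronous submit; caller re-checksums the buffer on success */
  	ret = submit_bio_wait(bio);
  	bio_put(bio);
  	return ret;
  }

On success the caller would re-verify the checksum; only after the
configured number of such retries fail would the extent be marked
poisoned.
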
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Keith Busch @ 2025-03-17 15:15 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 10:49:55AM -0400, Kent Overstreet wrote:
> But we do absolutely require checking for transient read errors, and we
> will miss things and corrupt data unnecessarily without FUA reads that
> bypass the controller cache.

Read FUA doesn't "bypass the controller cache". This is a "flush cache
first, then read the result" operation. That description seems
consistent for NVMe, ATA, and SCSI.

You are only interested in the case where the read doesn't overlap with
any dirty cache, so the "flush" part isn't a factor. Okay fine, but I'm
still curious if you have actually seen a case where read data was
corrupt in cache but correct with FUA?

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 15:22 UTC (permalink / raw)
To: Keith Busch; +Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 09:15:04AM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 10:49:55AM -0400, Kent Overstreet wrote:
> > But we do absolutely require checking for transient read errors, and
> > we will miss things and corrupt data unnecessarily without FUA reads
> > that bypass the controller cache.
>
> Read FUA doesn't "bypass the controller cache". This is a "flush cache
> first, then read the result" operation. That description seems
> consistent for NVMe, ATA, and SCSI.

Quoting from your previous email,

> 2) read the data and metadata, if any, from non-volatile medium.

That's pretty specific.

> You are only interested in the case where the read doesn't overlap with
> any dirty cache, so the "flush" part isn't a factor. Okay fine, but I'm
> still curious if you have actually seen a case where read data was
> corrupt in cache but correct with FUA?

No.

It's completely typical and expected to code defensively against errors
which we have not observed but know are possible.

Is bit corruption possible in the device cache? You betcha, it's
hardware.

Are we going to have to test for correct implementation of READ|FUA in
the future? Given all the devices that don't even get flushes right,
again yes.

Are we still going to go ahead and do it? Yes, and I'd even say it's a
prerequisite for complaining to manufacturers that don't get this right.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Martin K. Petersen @ 2025-03-17 15:30 UTC (permalink / raw)
To: Keith Busch; +Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

Keith,

> In my experience, a Read FUA is used to ensure what you're reading has
> been persistently committed.

Exactly, that is the intended semantic.

Kent is trying to address a scenario in which data in the cache is
imperfect and the data on media is reliable. SCSI did not consider such
a scenario because the entire access model would be fundamentally broken
if cache contents couldn't be trusted.

FUA was defined to handle the exact opposite scenario: Data in cache is
reliable by definition and the media inherently unreliable.
Consequently, what all the standards describe is a flush-then-read
semantic, not an invalidate-cache-then-read ditto. The purpose of FUA is
to validate endurance of future reads.

SCSI has a different per-command flag for cache management. However, it
is only a hint and therefore does not offer the guarantee Kent is
looking for.

At least for SCSI, given how FUA is usually implemented, I consider it
quite unlikely that two read operations back to back would somehow cause
different data to be transferred, regardless of which flags you use.

Also, and obviously this is also highly implementation-dependent, but I
don't think one should underestimate the amount of checking done along
the entire data path inside the device, including DRAM to the transport
interface ASIC. Data is usually validated by the ASIC on the way out and
from there on all modern transports have some form of checking. Having
corrupted data placed in host memory without an associated error
condition is not a common scenario. Bit flips in host memory are much
more common.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 15:43 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 11:30:13AM -0400, Martin K. Petersen wrote:
> At least for SCSI, given how FUA is usually implemented, I consider it
> quite unlikely that two read operations back to back would somehow
> cause different data to be transferred, regardless of which flags you
> use.

Based on what, exactly?

We're already seeing bit corruption caught - and corrected by scrub,
when users aren't running with replicas=1 (which is fine for data that's
backed up elsewhere, of course). Then it's just a question of where in
the lower layers it comes from.

We _know_ devices are not perfect, and your claim that "it's quite
unlikely that two reads back to back would return different data"
amounts to claiming that there are no bugs in a good chunk of the IO
path and all that is implemented perfectly. But that's just bullshit.

When we do a retry after observing bit corruption, we need to retry the
full read, not just from device cache, and the spec is quite specific on
that point.

Now, if you're claiming that a FUA read will behave differently, please
do state clearly what you think will happen with clean data in the
cache. But that'll amount to a clear spec violation...

> Also, and obviously this is also highly implementation-dependent, but
> I don't think one should underestimate the amount of checking done
> along the entire data path inside the device, including DRAM to the
> transport interface ASIC. Data is usually validated by the ASIC on the
> way out and from there on all modern transports have some form of
> checking. Having corrupted data placed in host memory without an
> associated error condition is not a common scenario. Bit flips in host
> memory are much more common.

Yeah, I'm not in the business of trusting the lower layers like this.
Trust, but verify :)

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Martin K. Petersen @ 2025-03-17 17:57 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

Kent,

>> At least for SCSI, given how FUA is usually implemented, I consider
>> it quite unlikely that two read operations back to back would somehow
>> cause different data to be transferred, regardless of which flags you
>> use.
>
> Based on what, exactly?

Based on the fact that many devices will either blindly flush on FUA or
they'll do the equivalent of a media verify operation. In neither case
will you get different data returned. The emphasis for FUA is on media
durability, not caching.

In most implementations the cache isn't an optional memory buffer thingy
that can be sidestepped. It is the only access mechanism that exists
between the media and the host interface. Working memory if you will. So
bypassing the device cache is not really a good way to think about it.

The purpose of FUA is to ensure durability for future reads, it is a
media management flag. As such, any effect FUA may have on the device
cache is incidental.

For SCSI there is a different flag to specify caching behavior. That
flag is orthogonal to FUA and did not get carried over to NVMe.

> We _know_ devices are not perfect, and your claim that "it's quite
> unlikely that two reads back to back would return different data"
> amounts to claiming that there are no bugs in a good chunk of the IO
> path and all that is implemented perfectly.

I'm not saying that devices are perfect or that the standards make
sense. I'm just saying that your desired behavior does not match the
reality of how a large number of these devices are actually implemented.

The specs are largely written by device vendors and therefore
deliberately ambiguous. Many of the explicit cache management bits and
bobs have been removed from SCSI or are defined as hints because device
vendors don't want the OS to interfere with how they manage resources,
including caching. I get what your objective is. I just don't think FUA
offers sufficient guarantees in that department.

Also, given the amount of hardware checking done at the device level, my
experience tells me that you are way more likely to have undetected
corruption problems on the host side than inside the storage device. In
general storage devices implement very extensive checking on both
control and data paths. And they will return an error if there is a
mismatch (as opposed to returning random data).

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Kent Overstreet @ 2025-03-17 18:21 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Keith Busch, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> Based on the fact that many devices will either blindly flush on FUA or
> they'll do the equivalent of a media verify operation. In neither case
> will you get different data returned. The emphasis for FUA is on media
> durability, not caching.
>
> [...]
>
> The specs are largely written by device vendors and therefore
> deliberately ambiguous. Many of the explicit cache management bits and
> bobs have been removed from SCSI or are defined as hints because device
> vendors don't want the OS to interfere with how they manage resources,
> including caching. I get what your objective is. I just don't think FUA
> offers sufficient guarantees in that department.

If you're saying this is going to be a work in progress to get the
behaviour we need in this scenario - yes, absolutely.

Beyond making sure that retries go to the physical media, there's "retry
level" in the NVMe spec which needs to be plumbed, and that one will be
particularly useful in multi-device scenarios (crank retry level up or
down based on whether we can retry from different devices).

But we've got to start somewhere, and given that the spec says "bypass
the cache" - that looks like the place to start. If devices don't
support the behaviour we want today, then nudging the drive
manufacturers to support it is infinitely saner than getting a whole
other bit plumbed through the NVMe standard, especially given that the
letter of the spec does describe exactly what we want.

So: I understand your concerns, but they're out of scope for the moment.
As long as nothing actively breaks when we feed it READ|FUA, it's
totally fine.

Later on I can imagine us adding some basic sanity tests to alert if
READ|FUA isn't behaving as expected; it's pretty easy to test latency of
random reads vs. reads to the same location vs. FUA reads to the same
location. So yes: there will be more work to do in this area.

> Also, given the amount of hardware checking done at the device level,
> my experience tells me that you are way more likely to have undetected
> corruption problems on the host side than inside the storage device.
> In general storage devices implement very extensive checking on both
> control and data paths. And they will return an error if there is a
> mismatch (as opposed to returning random data).

Bugs host side are very much a concern, yes (we can only do so much
against memory stompers, and the RMW that buffered IO does is a giant
hole), but at the filesystem level there are techniques for avoiding
corruption that are not available at the block level. The main one
being: every pointer carries the checksum that validates the data it
points to - the ZFS model.

So we do that, and additionally we're _very_ careful when moving data
around to carry existing checksums with it, and to validate the old
checksum after generating the new (or sum up new checksums to validate
them against the old directly).

IOW - in my world, it's the filesystem's job to verify and authenticate
everything. And layers above bcachefs do their own verification and
authentication as well - NixOS package builders and git being the big
ones that have caught bugs for us. It all matters...

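As an illustration of that move-time discipline - a minimal sketch only,
with hypothetical names (csum(), struct extent_ptr), not bcachefs's
actual types or API - the point is that the old checksum is re-verified
after the new one is computed, so corruption of the buffer in between is
caught before the new pointer is committed:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* csum(): assumed checksum primitive; type selects the algorithm,
   * since a move may change checksum type (e.g. on recompression) */
  uint32_t csum(int type, const void *data, size_t len);

  struct extent_ptr {		/* hypothetical, not the real bcachefs type */
  	int		csum_type;
  	uint32_t	csum;
  };

  /*
   * Move data while carrying the existing checksum: compute the new
   * checksum first, then re-verify the buffer against the old one, so
   * a memory stomp between the two steps can't be silently laundered
   * into a "valid" new checksum.
   */
  bool move_extent_csum(struct extent_ptr *dst, const struct extent_ptr *src,
  		      const void *data, size_t len)
  {
  	uint32_t new = csum(dst->csum_type, data, len);

  	if (csum(src->csum_type, data, len) != src->csum)
  		return false;	/* buffer changed under us: abort the move */

  	dst->csum = new;	/* safe to commit the new pointer */
  	return true;
  }
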
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ
From: Keith Busch @ 2025-03-17 19:24 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds

On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote:
> On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> > I'm not saying that devices are perfect or that the standards make
> > sense. I'm just saying that your desired behavior does not match the
> > reality of how a large number of these devices are actually
> > implemented. [...] I get what your objective is. I just don't think
> > FUA offers sufficient guarantees in that department.
>
> If you're saying this is going to be a work in progress to get the
> behaviour we need in this scenario - yes, absolutely.
>
> Beyond making sure that retries go to the physical media, there's
> "retry level" in the NVMe spec which needs to be plumbed, and that one
> will be particularly useful in multi-device scenarios (crank retry
> level up or down based on whether we can retry from different devices).

I saw you mention the RRL mechanism in another patch, and it really
piqued my interest. How are you intending to use this? In NVMe, this is
controlled via an admin "Set Features" command, which is absolutely not
available to a block device, much less a file system. That command queue
is only accessible to the driver and to user space admin, and is
definitely not a per-io feature.

> But we've got to start somewhere, and given that the spec says "bypass
> the cache" - that looks like the place to start.

This is a bit dangerous to assume. I don't find anywhere in any NVMe
specifications (I also checked T10 SBC) text saying anything similar to
"bypass" in relation to the cache for FUA reads. I am reasonably
confident some vendors, especially ones developing active-active
controllers, will fight you on this in the spec committee if you want to
take it up in those forums.

> If devices don't support the behaviour we want today, then nudging the
> drive manufacturers to support it is infinitely saner than getting a
> whole other bit plumbed through the NVMe standard, especially given
> that the letter of the spec does describe exactly what we want.

In my experience, the storage standards committees are more aligned to
accommodate appliance vendors than anything Linux specific. Your desired
read behavior would almost certainly require a new TPAR in NVMe to get
spec-defined behavior. It's not impossible, but I'll just say it is an
uphill battle and the end result may or may not look like what you have
in mind.

In summary, what we have by the specs from READ FUA:

  Flush and Read

What (I think) you want:

  Invalidate and Read

It sounds like you are trying to say that your scenario doesn't care
about the "Flush", so you get to use the existing semantics as the
"Invalidate" case, and I really don't think you get that guarantee from
any spec.

* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 19:24 ` Keith Busch @ 2025-03-17 19:40 ` Kent Overstreet 2025-03-17 20:39 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Kent Overstreet @ 2025-03-17 19:40 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote: > On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote: > > On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote: > > > I'm not saying that devices are perfect or that the standards make > > > sense. I'm just saying that your desired behavior does not match the > > > reality of how a large number of these devices are actually implemented. > > > > > > The specs are largely written by device vendors and therefore > > > deliberately ambiguous. Many of the explicit cache management bits and > > > bobs have been removed from SCSI or are defined as hints because device > > > vendors don't want the OS to interfere with how they manage resources, > > > including caching. I get what your objective is. I just don't think FUA > > > offers sufficient guarantees in that department. > > > > If you're saying this is going to be a work in progress to get the > > behaviour we need in this scenario - yes, absolutely. > > > > Beyond making sure that retries go to the physical media, there's "retry > > level" in the NVME spec which needs to be plumbed, and that one will be > > particularly useful in multi device scenarios. (Crank retry level up > > or down based on whether we can retry from different devices). > > I saw you mention the RRL mechanism in another patch, and it really > piqued my interest. How are you intending to use this? In NVMe, this is > controlled via an admin "Set Feature" command, which is absolutley not > available to a block device, much less a file system. That command queue > is only accesible to the driver and to user space admin, and is > definitely not a per-io feature. Oof, that's going to be a giant hassle then. That is something we definitely want, but it may be something for well down the line then. My more immediate priority is going to be finishing ZNS support, since that will no doubt inform anything we do in that area. > > But we've got to start somewhere, and given that the spec says "bypass > > the cache" - that looks like the place to start. > > This is a bit dangerous to assume. I don't find anywhere in any nvme > specifications (also checked T10 SBC) with text saying anything similiar > to "bypass" in relation to the cache for FUA reads. I am reasonably > confident some vendors, especially ones developing active-active > controllers, will fight you to the their win on the spec committee for > this if you want to take it up in those forums. "Read will come direct from media" reads pretty clear to me. But even if it's not supported the way we want, I'm not seeing anything dangerous about using it this way. Worst case, our retries aren't as good as we want them to be, and it'll be an item to work on in the future. As long as drives aren't barfing when we give them a read fua (and so far they haven't when running this code), we're fine for now. 
> > If devices don't support the behaviour we want today, then nudging the > > drive manufacturers to support it is infinitely saner than getting a > > whole nother bit plumbed through the NVME standard, especially given > > that the letter of the spec does describe exactly what we want. > > In my experience, the storage standards committees are more aligned to > accommodate appliance vendors than anything Linux specific. Your desired > read behavior would almost certainly be a new TPAR in NVMe to get spec > defined behavior. It's not impossible, but I'll just say it is an uphill > battle and the end result may or may not look like what you have in > mind. I'm not so sure. If there are users out there depending on a different meaning of read fua, then yes, absolutely (and it sounds like Martin might have been alluding to that - but why wouldn't the write have been done fua? I'd want to hear more about that). If, OTOH, this is just something that hasn't come up before - the language in the spec is already there, so once code is out there with enough users and a demonstrated use case then it might be a pretty simple nudge - "shoot down this range of the cache, don't just flush it" is a pretty simple code change, as far as these things go. > In summary, what we have by the specs for READ FUA: > > Flush and Read > > What (I think) you want: > > Invalidate and Read > > It sounds like you are trying to say that your scenario doesn't care > about the "Flush" so you get to use the existing semantics as the > "Invalidate" case, and I really don't think you get that guarantee from > any spec. Exactly. Previous data being flushed, if it was dirty, is totally fine. Specs aren't worth much if no one's depending on or testing a given behaviour, so what the spec strictly guarantees doesn't really matter here. What matters more is - does the behaviour make sense, will it be easy enough to implement, and does it conflict with behaviour anyone else is depending on. ^ permalink raw reply [flat|nested] 42+ messages in thread
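As a concrete illustration of the interface being argued over: once the block layer accepts the flag combination, issuing such a read needs nothing beyond long-standing bio APIs. A minimal sketch (error handling trimmed; whether the device truly serves the data from media is exactly the open question in this thread):

	/*
	 * Minimal sketch: read one page with REQ_FUA set, asking the device
	 * to serve the data from media rather than its cache - to whatever
	 * extent the device honours FUA on reads.  Assumes the block layer
	 * accepts REQ_OP_READ | REQ_FUA, which is what this patch enables.
	 */
	static int fua_read_page(struct block_device *bdev, sector_t sector,
				 struct page *page)
	{
		struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA,
					    GFP_NOIO);
		int ret;

		bio->bi_iter.bi_sector = sector;
		__bio_add_page(bio, page, PAGE_SIZE, 0);

		ret = submit_bio_wait(bio);	/* 0 on success, -Exx on error */
		bio_put(bio);
		return ret;
	}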
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 19:40 ` Kent Overstreet @ 2025-03-17 20:39 ` Keith Busch 2025-03-17 21:13 ` Bart Van Assche 2025-03-18 0:27 ` Kent Overstreet 0 siblings, 2 replies; 42+ messages in thread From: Keith Busch @ 2025-03-17 20:39 UTC (permalink / raw) To: Kent Overstreet Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: > On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote: > > This is a bit dangerous to assume. I don't find anywhere in any nvme > > specifications (also checked T10 SBC) any text saying anything similar > > to "bypass" in relation to the cache for FUA reads. I am reasonably > > confident some vendors, especially ones developing active-active > > controllers, will fight you to their win on the spec committee for > > this if you want to take it up in those forums. > > "Read will come direct from media" reads pretty clear to me. > > But even if it's not supported the way we want, I'm not seeing anything > dangerous about using it this way. Worst case, our retries aren't as > good as we want them to be, and it'll be an item to work on in the > future. I don't think you're appreciating the complications that active-active and multi-host bring to the scenario. Those are why this is not the forum to solve it. The specs need to be clear on the guarantees, and what they currently guarantee might provide some overlap with what you're seeking in specific scenarios, but I really think (and I believe Martin agrees) your use is outside its targeted use case. > As long as drives aren't barfing when we give them a read fua (and so > far they haven't when running this code), we're fine for now. In this specific regard, I think it's safe to assume the devices will remain operational. > > > > If devices don't support the behaviour we want today, then nudging the > > > > drive manufacturers to support it is infinitely saner than getting a > > > > whole nother bit plumbed through the NVME standard, especially given > > > > that the letter of the spec does describe exactly what we want. > > > > > > In my experience, the storage standards committees are more aligned to > > > accommodate appliance vendors than anything Linux specific. Your desired > > > read behavior would almost certainly be a new TPAR in NVMe to get spec > > > defined behavior. It's not impossible, but I'll just say it is an uphill > > > battle and the end result may or may not look like what you have in > > > mind. > > > > I'm not so sure. > > > > If there are users out there depending on a different meaning of read > > fua, then yes, absolutely (and it sounds like Martin might have been > > alluding to that - but why wouldn't the write have been done fua? I'd > > want to hear more about that) As I mentioned, READ FUA provides an optimization opportunity that can be used instead of Write + Flush or WriteFUA when the host isn't sure about the persistence needs at the time of the initial Write: it can be used as a checkpoint on a specific block range that you may have written and overwritten. This kind of "read" command provides a well defined persistence barrier. 
> If, OTOH, this is just something that hasn't come up before - the > language in the spec is already there, so once code is out there with > enough users and a demonstrated use case then it might be a pretty > simple nudge - "shoot down this range of the cache, don't just flush it" > is a pretty simple code change, as far as these things go. So you're telling me you've never written SSD firmware then waited for the manufacturer to release it to your users? Yes, I jest, and maybe YMMV; but relying on that process is putting your destiny in the wrong hands. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 20:39 ` Keith Busch @ 2025-03-17 21:13 ` Bart Van Assche 2025-03-18 1:06 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Bart Van Assche @ 2025-03-17 21:13 UTC (permalink / raw) To: Keith Busch, Kent Overstreet Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On 3/17/25 1:39 PM, Keith Busch wrote: > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: >> If, OTOH, this is just something that hasn't come up before - the >> language in the spec is already there, so once code is out there with >> enough users and a demonstrated use case then it might be a pretty >> simple nudge - "shoot down this range of the cache, don't just flush it" >> is a pretty simple code change, as far as these things go. > > So you're telling me you've never written SSD firmware then waited for > the manufacturer to release it to your users? Yes, I jest, and maybe > YMMV; but relying on that process is putting your destiny in the wrong > hands. As far as I know, in the Linux kernel we only support storage device behavior that has been standardized, plus workarounds for bugs. I'm not sure we should support behavior that deviates from existing standards and from longstanding, well-established conventions. Bart. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 21:13 ` Bart Van Assche @ 2025-03-18 1:06 ` Kent Overstreet 2025-03-18 6:16 ` Christoph Hellwig 2025-03-18 17:49 ` Bart Van Assche 0 siblings, 2 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 1:06 UTC (permalink / raw) To: Bart Van Assche Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 02:13:14PM -0700, Bart Van Assche wrote: > On 3/17/25 1:39 PM, Keith Busch wrote: > > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: > > > If, OTOH, this is just something that hasn't come up before - the > > > language in the spec is already there, so once code is out there with > > > enough users and a demonstrated use case then it might be a pretty > > > simple nudge - "shoot down this range of the cache, don't just flush it" > > > is a pretty simple code change, as far as these things go. > > > > So you're telling me you've never written SSD firmware then waited for > > the manufacturer to release it to your users? Yes, I jest, and maybe > > YMMV; but relying on that process is putting your destiny in the wrong > > hands. > > As far as I know in the Linux kernel we only support storage device behavior > that has been standardized and also workarounds for bugs. I'm > not sure we should support behavior in the Linux kernel that deviates > from existing standards and that also deviates from longstanding and > well-established conventions. What bcachefs is doing is entirely in line with the behaviour the standard states. Keith and Martin are describing systems with a different interpretation, because they want the side effect (flush cache) without the main effect (possibly evict cache, but read comes from media). It's an amusing state of affairs, but it'd be easily resolved with an admin level NVME command to flip a state bit (like the read recovery level we were also talking about), and anyways multihoming capable NVME devices are an entirely different market from the conventional stuff. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 1:06 ` Kent Overstreet @ 2025-03-18 6:16 ` Christoph Hellwig 0 siblings, 0 replies; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:16 UTC (permalink / raw) To: Kent Overstreet Cc: Bart Van Assche, Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 09:06:13PM -0400, Kent Overstreet wrote: > What bcachefs is doing is entirely in line with the behaviour the > standard states. Setting the FUA bit on any READ command (if supported by the device) is entirely in line with the behaviour the standards describe. It's just not going to do what you hope. And while you claim that it helps you with data recovery, you've not shown either a practical example where it does, or a theory based on hardware architecture for how it could. > It's an amusing state of affairs, but it'd be easily resolved with an > admin level NVME command to flip a state bit (like the read recovery > level we were also talking about), and anyways multihoming capable NVME > devices are an entirely different market from the conventional stuff. Please join the NVMe technical working group and write a proposal. NVMe requires you to clearly state the benefits of the proposal, which means you have to actually clearly write that down first. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 1:06 ` Kent Overstreet 2025-03-18 6:16 ` Christoph Hellwig @ 2025-03-18 17:49 ` Bart Van Assche 2025-03-18 18:00 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Bart Van Assche @ 2025-03-18 17:49 UTC (permalink / raw) To: Kent Overstreet Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On 3/17/25 6:06 PM, Kent Overstreet wrote: > What bcachefs is doing is entirely in line with the behaviour the > standard states. I do not agree with the above. Kent, do you plan to attend the LSF/MM/BPF summit next week? I'm wondering whether allowing REQ_FUA|REQ_READ would be a good topic for that summit. Bart. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 17:49 ` Bart Van Assche @ 2025-03-18 18:00 ` Kent Overstreet 2025-03-18 18:10 ` Keith Busch 0 siblings, 1 reply; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 18:00 UTC (permalink / raw) To: Bart Van Assche Cc: Keith Busch, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 10:49:25AM -0700, Bart Van Assche wrote: > On 3/17/25 6:06 PM, Kent Overstreet wrote: > > What bcachefs is doing is entirely in line with the behaviour the > > standard states. > > I do not agree with the above. > > Kent, do you plan to attend the LSF/MM/BPF summit next week? I'm > wondering whether allowing REQ_FUA|REQ_READ would be a good topic > for that summit. No, I won't be there this year. And I don't think it'd be the right forum for arguing over the meaning of an obscure line in the NVME spec, anyways :) It's certainly not in dispute that read fua is a documented, legitimate command, so there's no reason for the block layer to be rejecting it. Whether it has exactly the behaviour we want isn't a critical issue that has to be determined right now. The starting point for that will be to test device behaviour (with some simple performance tests, like I mentioned), and anyways it's outside the scope of the block layer. ^ permalink raw reply [flat|nested] 42+ messages in thread
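For reference, the gatekeeping being discussed is a sanity check in submit_bio_noacct(). A simplified sketch of the relaxed check, not the literal upstream diff, might look like the following (REQ_PREFLUSH stays write-only; REQ_FUA additionally becomes acceptable on reads):

	/*
	 * Simplified sketch of the submit_bio_noacct() flush/FUA sanity
	 * check in block/blk-core.c; illustrative, not the actual patch.
	 */
	if (op_is_flush(bio->bi_opf)) {
		if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) &&
				 bio_op(bio) != REQ_OP_WRITE &&
				 bio_op(bio) != REQ_OP_ZONE_APPEND))
			goto end_io;

		if (WARN_ON_ONCE((bio->bi_opf & REQ_FUA) &&
				 bio_op(bio) != REQ_OP_READ &&
				 bio_op(bio) != REQ_OP_WRITE &&
				 bio_op(bio) != REQ_OP_ZONE_APPEND))
			goto end_io;

		/* No volatile write cache: the flags are no-ops. */
		if (!bdev_write_cache(bio->bi_bdev))
			bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
	}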
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 18:00 ` Kent Overstreet @ 2025-03-18 18:10 ` Keith Busch 2025-03-18 18:13 ` Kent Overstreet 2025-03-20 5:40 ` Christoph Hellwig 0 siblings, 2 replies; 42+ messages in thread From: Keith Busch @ 2025-03-18 18:10 UTC (permalink / raw) To: Kent Overstreet Cc: Bart Van Assche, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 02:00:13PM -0400, Kent Overstreet wrote: > It's certainly not in dispute that read fua is a documented, legitimate > command, so there's no reason for the block layer to be rejecting it. > > Whether it has exactly the behaviour we want isn't a critical issue that > has to be determined right now. The starting point for that will be to > test device behaviour (with some simple performance tests, like I > mentioned), and anyways it's outside the scope of the block layer. Maybe just change the commit log. Read FUA has legit uses for persisting data as described by the specs. No need to introduce contested behavior to justify this patch, yah? ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 18:10 ` Keith Busch @ 2025-03-18 18:13 ` Kent Overstreet 2025-03-20 5:40 ` Christoph Hellwig 1 sibling, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 18:13 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote: > On Tue, Mar 18, 2025 at 02:00:13PM -0400, Kent Overstreet wrote: > > It's certainly not in dispute that read fua is a documented, legitimate > > command, so there's no reason for the block layer to be rejecting it. > > > > Whether it has exactly the behaviour we want isn't a critical issue that > > has to be determined right now. The starting point for that will be to > > test device behaviour (with some simple performance tests, like I > > mentioned), and anyways it's outside the scope of the block layer. > > Maybe just change the commit log. Read FUA has legit uses for persisting > data as described by the specs. No need to introduce contested behavior > to justify this patch, yah? That's reasonable, I'll do that. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 18:10 ` Keith Busch 2025-03-18 18:13 ` Kent Overstreet @ 2025-03-20 5:40 ` Christoph Hellwig 2025-03-20 10:28 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2025-03-20 5:40 UTC (permalink / raw) To: Keith Busch Cc: Kent Overstreet, Bart Van Assche, Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote: > Maybe just change the commit log. Read FUA has legit uses for persisting > data as described by the specs. No need to introduce contested behavior > to justify this patch, yah? While not having a factually incorrect commit message is a great start, I still don't think we want it. For one, there is no actual use case for the actual semantics, so why add it? Second, it still needs all the proper per-driver opt-in, as these semantics are not defined for all our protocols, as I've already mentioned before. But hey, maybe Kent can actually find other storage or file system developers to support it, so having an at least technically correct patch out on the list would be a big start, even if I would not expect Jens to take it on a whim. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-20 5:40 ` Christoph Hellwig @ 2025-03-20 10:28 ` Kent Overstreet 0 siblings, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-20 10:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, Bart Van Assche, Martin K. Petersen, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Wed, Mar 19, 2025 at 10:40:53PM -0700, Christoph Hellwig wrote: > On Tue, Mar 18, 2025 at 12:10:49PM -0600, Keith Busch wrote: > > Maybe just change the commit log. Read FUA has legit uses for persisting > > data as described by the specs. No need to introduce contested behavior > > to justify this patch, yah? > > While not having a factually incorrect commit message is a great > start, I still don't think we want it. For one, there is no actual > use case for the actual semantics, so why add it? Second, it still > needs all the proper per-driver opt-in, as these semantics are not > defined for all our protocols, as I've already mentioned before. > > But hey, maybe Kent can actually find other storage or file system > developers to support it, so having an at least technically correct > patch out on the list would be a big start, even if I would not expect > Jens to take it on a whim. Christoph, You're arguing over nothing. Go back and reread: you and I have the same interpretation of what read fua should do. You all really are a surly and argumentative lot... ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 20:39 ` Keith Busch 2025-03-17 21:13 ` Bart Van Assche @ 2025-03-18 0:27 ` Kent Overstreet 1 sibling, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 0:27 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 02:39:07PM -0600, Keith Busch wrote: > On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote: > > On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote: > > > This is a bit dangerous to assume. I don't find anywhere in any nvme > > > specifications (also checked T10 SBC) any text saying anything similar > > > to "bypass" in relation to the cache for FUA reads. I am reasonably > > > confident some vendors, especially ones developing active-active > > > controllers, will fight you to their win on the spec committee for > > > this if you want to take it up in those forums. > > > > "Read will come direct from media" reads pretty clear to me. > > > > But even if it's not supported the way we want, I'm not seeing anything > > dangerous about using it this way. Worst case, our retries aren't as > > good as we want them to be, and it'll be an item to work on in the > > future. > > I don't think you're appreciating the complications that active-active > and multi-host bring to the scenario. Those are why this is not the > forum to solve it. The specs need to be clear on the guarantees, and > what they currently guarantee might provide some overlap with what > you're seeking in specific scenarios, but I really think (and I believe > Martin agrees) your use is outside its targeted use case. You do realize this is a single node filesystem we're talking about, right? Crazy multi-homing stuff, while cool, has no bearing here. > > As long as drives aren't barfing when we give them a read fua (and so > > far they haven't when running this code), we're fine for now. > > In this specific regard, I think it's safe to assume the devices will > remain operational. > > > > > If devices don't support the behaviour we want today, then nudging the > > > > drive manufacturers to support it is infinitely saner than getting a > > > > whole nother bit plumbed through the NVME standard, especially given > > > > that the letter of the spec does describe exactly what we want. > > > > > > In my experience, the storage standards committees are more aligned to > > > accommodate appliance vendors than anything Linux specific. Your desired > > > read behavior would almost certainly be a new TPAR in NVMe to get spec > > > defined behavior. It's not impossible, but I'll just say it is an uphill > > > battle and the end result may or may not look like what you have in > > > mind. > > > > I'm not so sure. > > > > If there are users out there depending on a different meaning of read > > fua, then yes, absolutely (and it sounds like Martin might have been > > alluding to that - but why wouldn't the write have been done fua? I'd > > want to hear more about that) > > As I mentioned, READ FUA provides an optimization opportunity that > can be used instead of Write + Flush or WriteFUA when the host isn't > sure about the persistence needs at the time of the initial Write: it > can be used as a checkpoint on a specific block range that you may have > written and overwritten. This kind of "read" command provides a well > defined persistence barrier. 
> Thinking of Read FUA as a barrier is better > aligned with how the standards and device makers intended it to be used. Yeah, I got that. Again, a neat trick, but who in their right mind would use that sort of thing when the sane thing to do is just write fua? So I'm skeptical that that sort of thing is an actual use case with any bearing on the devices any single node filesystem targets. Now, in crazy enterprise multi homing land, sure. Now, if you're saying you think the standard should be interpreted in a way such that read fua does what it seems you and Martin want it to do in crazy enterprise multi homing land... now _that_ would be a fun argument to have in a standards committee :) But I mostly jest, because my suspicion is that there wouldn't be any real conflict, just a bit of the "I hadn't thought to use it that way" anxiety. > > If, OTOH, this is just something that hasn't come up before - the > > language in the spec is already there, so once code is out there with > > enough users and a demonstrated use case then it might be a pretty > > simple nudge - "shoot down this range of the cache, don't just flush it" > > is a pretty simple code change, as far as these things go. > > So you're telling me you've never written SSD firmware then waited for > the manufacturer to release it to your users? Yes, I jest, and maybe > YMMV; but relying on that process is putting your destiny in the wrong > hands. Nah, back when I was at an employer that did SSD drivers/firmware, it was all in house - no waiting to ship required :) And incidentally, it's been fun watching the "FTL host or device side" thing go back and forth since then; we had the same back-and-forth happen just between different generations of the internal stuff being built at Google that's been happening now with NVME and ZNS. The appeal of a host side FTL was fairly obvious back then, but the FTL ended up moving from the host to the device because people wanted to do complete kernel IO stack bypass. The AIO and DIO code were really bad back then, especially in certain multithreaded scenarios - profiles were absolutely atrocious, and the performance gains of a host side FTL only really happen if you can do it in combination with the filesystem, cutting out a lot of redundancy and improving on GC efficiency (this is a big one), because in the filesystem you have a lot more information on what goes with what, information that's lost at the block layer. IO tail latency in particular improves, especially on loaded machines. But a filesystem meant for that didn't exist at the time, nor did hardware with any kind of an open interface... Since then the kernel IO stack has gotten massively faster, ZNS hardware exists, and bcachefs was pretty much designed from the start for directly driving flash. There's about a month of work left, max, to finish that off and drive hardware I have sitting on my desk. Which means there should be more interesting things happening in the NVME transport area in the future. In particular, moving the FTL into the filesystem ought to allow for much more gracefully degrading failure modes instead of the whole SSD going offline (and signals to the user about when flash is going bad! flash ECC algorithms give you this, it's just not exposed!). So should be fun times. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 18:21 ` Kent Overstreet 2025-03-17 19:24 ` Keith Busch @ 2025-03-18 6:11 ` Christoph Hellwig 2025-03-18 21:33 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:11 UTC (permalink / raw) To: Kent Overstreet Cc: Martin K. Petersen, Keith Busch, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote: > Beyond making sure that retries go to the physical media, there's "retry > level" in the NVME spec which needs to be plumbed, and that one will be > particularly useful in multi device scenarios. (Crank retry level up > or down based on whether we can retry from different devices). The read recovery level is to reduce the amount or intensity of read retries, not to increase it, so that workloads that have multiple sources for data aren't stalled by the sometimes extremely long read recovery. You won't really find much hardware that actually implements it. As a little background: The read recovery level was added as part of the predictive latency mode and originally tied to the (now heavily deprecated and never implemented at scale) NVM sets. Yours truly successfully argued that they should not be tied to NVM sets and helped to make them more generic, but there was basically no uptake of the read recovery level, with or without NVM sets. > But we've got to start somewhere, and given that the spec says "bypass > the cache" But as people have been telling you multiple times, that is not what any of the specs says. > - that looks like the place to start. If devices don't > support the behaviour we want today, then nudging the drive > manufacturers to support it is infinitely saner than getting a whole > nother bit plumbed through the NVME standard, especially given that the > letter of the spec does describe exactly what we want. That assumes that anyone but you actually agrees with your definition of sane behavior. I don't think you've been overly successful at selling anyone on that idea so far. Selling the hardware people on it will be even harder given that the "cache" usually is not really a cache but an integral part of the data path. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-18 6:11 ` Christoph Hellwig @ 2025-03-18 21:33 ` Kent Overstreet 0 siblings, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 21:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Martin K. Petersen, Keith Busch, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 11:11:10PM -0700, Christoph Hellwig wrote: > On Mon, Mar 17, 2025 at 02:21:29PM -0400, Kent Overstreet wrote: > > Beyond making sure that retries go to the physical media, there's "retry > > level" in the NVME spec which needs to be plumbed, and that one will be > > particularly useful in multi device scenarios. (Crank retry level up > > or down based on whether we can retry from different devices). > > The read recovery level is to reduce the amount or intensity of read > retries, not to increase it, so that workloads that have multiple sources > for data aren't stalled by the sometimes extremely long read recovery. > You won't really find much hardware that actually implements it. > > As a little background: The read recovery level was added as part of the > predictive latency mode and originally tied to the (now heavily deprecated > and never implemented at scale) NVM sets. Yours truly successfully > argued that they should not be tied to NVM sets and helped to make them > more generic, but there was basically no uptake of the read recovery > level, with or without NVM sets. Well, if it can't be set per IO, that makes it fairly useless. If it _was_ per IO it'd be dead easy to slot into bcachefs; the tracking of IO/checksum errors in a replicated/erasure coded extent is sophisticated enough to easily accommodate things like this (mainly you need to know when submitting - do we have additional retries? and then when you get an error, you don't want to count it if it was "fast fail"). ^ permalink raw reply [flat|nested] 42+ messages in thread
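If the knob ever did become per-I/O, the policy Kent sketches here is simple to state. A purely hypothetical illustration; every name below is invented, since NVMe today only exposes the recovery level as the controller-wide admin feature discussed earlier in the thread:

	/*
	 * Hypothetical: pick a per-read recovery effort based on how many
	 * other replicas remain.  RRL_FAST_FAIL and RRL_MAX_EFFORT stand
	 * in for whatever points on a recovery-level scale would apply;
	 * none of this is a real API today.
	 */
	enum rrl_hint { RRL_MAX_EFFORT, RRL_FAST_FAIL };

	static enum rrl_hint pick_recovery_level(unsigned replicas_remaining)
	{
		/*
		 * Other copies still untried: fail fast and retry from a
		 * different device.  Last copy: try as hard as possible.
		 */
		return replicas_remaining > 1 ? RRL_FAST_FAIL : RRL_MAX_EFFORT;
	}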
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 15:30 ` Martin K. Petersen 2025-03-17 15:43 ` Kent Overstreet @ 2025-03-17 17:32 ` Keith Busch 2025-03-18 6:19 ` Christoph Hellwig 1 sibling, 1 reply; 42+ messages in thread From: Keith Busch @ 2025-03-17 17:32 UTC (permalink / raw) To: Martin K. Petersen Cc: Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 11:30:13AM -0400, Martin K. Petersen wrote: > Kent is trying to address a scenario in which data in the cache is > imperfect and the data on media is reliable. SCSI did not consider such > a scenario because the entire access model would be fundamentally broken > if cache contents couldn't be trusted. I'm not even sure it makes sense to suspect the controller side but consider the host side reliable? If the media is in fact perfect, and if a non-FUA read returns data that fails the checksum, wouldn't it be just as likely that the transport or host side is the problem? Every NVMe SSD I've developed couldn't efficiently fill the PCIe tx-buffer directly from media (excluding Optane; RIP), so a read with or without FUA would stage the data in the very same controller-side DRAM. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 17:32 ` Keith Busch @ 2025-03-18 6:19 ` Christoph Hellwig 0 siblings, 0 replies; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:19 UTC (permalink / raw) To: Keith Busch Cc: Martin K. Petersen, Kent Overstreet, Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 11:32:21AM -0600, Keith Busch wrote: > I'm not even sure it makes sense to suspect the controller side but > consider the host side reliable? If the media is in fact perfect, and > if a non-FUA read returns data that fails the checksum, wouldn't it be > just as likely that the transport or host side is the problem? From the error numbers we've seen, it would in fact be much more likely to be the host or transport side. > Every > NVMe SSD I've developed couldn't efficiently fill the PCIe tx-buffer > directly from media (excluding Optane; RIP), so a read with or without > FUA would stage the data in the very same controller-side DRAM. Exactly. It's a writeback+invalidate+reread operation. A great way to slow down the read, but it's not going to help you with data recovery. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ 2025-03-17 12:15 ` Kent Overstreet 2025-03-17 14:13 ` Keith Busch @ 2025-03-18 6:01 ` Christoph Hellwig 1 sibling, 0 replies; 42+ messages in thread From: Christoph Hellwig @ 2025-03-18 6:01 UTC (permalink / raw) To: Kent Overstreet Cc: Christoph Hellwig, Jens Axboe, linux-bcachefs, linux-block, Linus Torvalds On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote: > On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote: > > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote: > > > That's the sort of thing that process is supposed to avoid. To be clear, > > > you and Christoph are the two reasons I've had to harp on process in the > > > past - everyone else in the kernel has been fine. > > ... > > > In this case you very clearly misunderstood how the FUA bit works on > > reads (and clearly documented that misunderstanding in the commit log). > > FUA reads are clearly documented in the NVME spec. They are, and the semantics are, as multiple people pointed out, very different from what you claim they are in the commit message. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-11 20:15 [PATCH 00/14] better handling of checksum errors/bitrot Kent Overstreet 2025-03-11 20:15 ` [PATCH 13/14] block: Allow REQ_FUA|REQ_READ Kent Overstreet @ 2025-03-17 20:55 ` John Stoffel 2025-03-18 1:15 ` Kent Overstreet 1 sibling, 1 reply; 42+ messages in thread From: John Stoffel @ 2025-03-17 20:55 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > we've got an extent that can't be read, due to checksum error/bitrot. > This took some doing to fix properly, because > - We don't want to just delete such data (replace it with > KEY_TYPE_error); we never want to delete anything except when > explicitly told to by the user, and while we don't yet have an API for > "read this file even though there's errors, just give me what you > have" we definitely will in the future. So will open() just return an error? How will this look from userspace? > - Not being able to move data is a non-option: that would block copygc, > device removal, etc. > - And, moving said extent requires giving it a new checksum - strictly > required if the move has to fragment it, teaching the write/move path > about extents with bad checksums is unpalateable, and anyways we'd > like to be able to guard against more bitrot, if we can. Why does it need a new checksum if there are read errors? What happens if there are more read errors? If I have a file on a filesystem which is based on spinning rust and I get a single bit flip, I'm super happy you catch it. But now you re-checksum the file, with the read error, and return it? I'm stupid and just a user/IT guy. I want notifications, but I don't want my application to block so I can't kill it, or unmount the filesystem. Or continue to use it if I like. > So that means: > - Extents need a poison bit: "reads return errors, even though it now > has a good checksum" - this was added in a separate patch queued up > for 6.15. Sorry for being dense, but does a file have one or more extents? Or is this at a level below that? > It's an incompat feature because it's a new extent field, and old > versions can't parse extents with unknown field types, since they > won't know their sizes - meaning users will have to explicitly do an > incompat upgrade to make use of this stuff. > - The read path needs to do additional retries after checksum errors > before giving up and marking it poisoned, so that we don't > accidentally convert a transient error to permanent corruption. When doing these retries, is the filesystem locked up or will the process doing the read() be blocked from being killed? > - The read path gets a whole bunch of work to plumb precise modern error > codes around, so that e.g. the retry path, the data move path, and the > "mark extent poisoned" path all know exactly what's going on. > - Read path is responsible for marking extents poisoned after sufficient > retry attempts (controlled by a new filesystem option) > - Data move path is allowed to move extents after a read error, if it's > a checksum error (giving it a new checksum) if it's been poisoned > (i.e. the extent flags feature is enabled). So if just a single bit flips, the extent gets moved onto better storage, and the file gets re-checksummed. But what about if more bits go bad afterwards? 
> Code should be more or less finalized - still have more tests for corner > cases to write, but "write corrupt data and then tell rebalance to move > it to another device" works as expected. > TODO: > - NVME has a "read recovery level" attribute that controlls how hard the > erasure coding algorithms work - we want that plumbed. > Before we give up and move data that we know is bad, we need to try > _as hard as possible_ to get a successful read. What does this mean exactly from an end-user and sysadmin point of view? Locking up the system (no new writes to the now-problematic storage) in a way I can't kill would be annoyingly painful. Don't get me wrong, I'm happy to see data integrity stuff get added, I'm just trying to understand how it interacts with userspace. John ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-17 20:55 ` [PATCH 00/14] better handling of checksum errors/bitrot John Stoffel @ 2025-03-18 1:15 ` Kent Overstreet 2025-03-18 14:47 ` John Stoffel 0 siblings, 1 reply; 42+ messages in thread From: Kent Overstreet @ 2025-03-18 1:15 UTC (permalink / raw) To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote: > >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > > > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > > we've got an extent that can't be read, due to checksum error/bitrot. > > > This took some doing to fix properly, because > > > - We don't want to just delete such data (replace it with > > KEY_TYPE_error); we never want to delete anything except when > > explicitly told to by the user, and while we don't yet have an API for > > "read this file even though there's errors, just give me what you > > have" we definitely will in the future. > > So will open() just return an error? How will this look from > userspace? Not the open, the read - the typical case is only a single extent goes bad; it's like any other IO error. > > - Not being able to move data is a non-option: that would block copygc, > > device removal, etc. > > > - And, moving said extent requires giving it a new checksum - strictly > > required if the move has to fragment it, teaching the write/move path > > about extents with bad checksums is unpalateable, and anyways we'd > > like to be able to guard against more bitrot, if we can. > > Why does it need a new checksum if there are read errors? What > happens if there are more read errors? If I have a file on a > filesystem which is based in spinning rust and I get a single bit > flip, I'm super happy you catch it. The data move paths very strictly verify checksums as they move data around so they don't introduce bitrot. I'm not going to add if (!bitrotted_extent) checksum(); else no_checksum() Eww... Besides being gross, we also would like to guard against introducing more bitrot. > But now you re-checksum the file, with the read error, and return it? > I'm stupid and just a user/IT guy. I want notifications, but I don't > want my application to block so I can't kill it, or unmount the > filesystem. Or continue to use it if I like. The aforementioned poison bit ensures that you still get the error from the original checksum error when you read that data - unless you use an appropriate "give it to me anyways" API. > > So that means: > > > - Extents need a poison bit: "reads return errors, even though it now > > has a good checksum" - this was added in a separate patch queued up > > for 6.15. > > Sorry for being dense, but does a file have one or more extents? Or > is this at a level below that? Files have multiple extents. An extent is one contiguous range of data, and in bcachefs checksums are at the extent level, not block, so checksummed (and compressed) extents are limited to, by default, 128k. > > It's an incompat feature because it's a new extent field, and old > > versions can't parse extents with unknown field types, since they > > won't know their sizes - meaning users will have to explicitly do an > > incompat upgrade to make use of this stuff. > > > - The read path needs to do additional retries after checksum errors > > before giving up and marking it poisoned, so that we don't > > accidentally convert a transient error to permanent corruption. 
> > When doing these retries, is the filesystem locked up or will the > process doing the read() be blocked from being killed? The process doing the read() can't be killed during this, no. If requested this could be changed, but keep in mind retries are limited in number. Nothing else is "locked up", everything else proceeds as normal. > > - The read path gets a whole bunch of work to plumb precise modern error > > codes around, so that e.g. the retry path, the data move path, and the > > "mark extent poisoned" path all know exactly what's going on. > > > - Read path is responsible for marking extents poisoned after sufficient > > retry attempts (controlled by a new filesystem option) > > > - Data move path is allowed to move extents after a read error, if it's > > a checksum error (giving it a new checksum) if it's been poisoned > > (i.e. the extent flags feature is enabled). > > So if just a single bit flips, the extent gets moved onto better > storage, and the file gets re-checksummed. But what about if more > bits go bad afterwards? The new checksum means they're detected, and if you have replication enabled they'll be corrected automatically, like any other IO error. ^ permalink raw reply [flat|nested] 42+ messages in thread
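Pulling the pieces of this exchange together, the retry-then-poison policy reduces to something like the sketch below. Every name is hypothetical shorthand rather than actual bcachefs code; the real read path also tracks per-device failures and plumbs precise error codes around:

	/*
	 * Hypothetical sketch: bounded retries on checksum failure, then
	 * mark the extent poisoned so future reads fail immediately
	 * instead of re-hammering the device.  Invented names throughout.
	 */
	static int read_extent_checked(struct extent *e, void *buf,
				       unsigned max_retries /* fs option, default 3 */)
	{
		unsigned attempt = 0;

		if (extent_poisoned(e))
			return -EIO;	/* known bad: fail fast, no media I/O */

		do {
			if (!read_extent_raw(e, buf) && csum_ok(e, buf))
				return 0;
		} while (++attempt < max_retries);

		extent_mark_poisoned(e);  /* persists; a fresh checksum lets the data move */
		return -EIO;
	}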
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-18 1:15 ` Kent Overstreet @ 2025-03-18 14:47 ` John Stoffel 2025-03-20 17:15 ` Kent Overstreet 0 siblings, 1 reply; 42+ messages in thread From: John Stoffel @ 2025-03-18 14:47 UTC (permalink / raw) To: Kent Overstreet Cc: John Stoffel, linux-bcachefs, linux-block, Roland Vet, linux-fsdevel >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote: >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: >> >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if >> > we've got an extent that can't be read, due to checksum error/bitrot. >> >> > This took some doing to fix properly, because >> >> > - We don't want to just delete such data (replace it with >> > KEY_TYPE_error); we never want to delete anything except when >> > explicitly told to by the user, and while we don't yet have an API for >> > "read this file even though there's errors, just give me what you >> > have" we definitely will in the future. >> >> So will open() just return an error? How will this look from >> userspace? > Not the open, the read - the typical case is only a single extent goes > bad; it's like any other IO error. Good. But then how would an application know we got a checksum error for data corruption? Would I have to update all my code to do another special call when I get a read/write error to see if it was a corruption issue? >> > - Not being able to move data is a non-option: that would block copygc, >> > device removal, etc. >> >> > - And, moving said extent requires giving it a new checksum - strictly >> > required if the move has to fragment it, teaching the write/move path >> > about extents with bad checksums is unpalateable, and anyways we'd >> > like to be able to guard against more bitrot, if we can. >> >> Why does it need a new checksum if there are read errors? What >> happens if there are more read errors? If I have a file on a >> filesystem which is based on spinning rust and I get a single bit >> flip, I'm super happy you catch it. > The data move paths very strictly verify checksums as they move data > around so they don't introduce bitrot. Good. This is something I really liked as an idea in ZFS, happy to see it coming to bcachefs as well. > I'm not going to add > if (!bitrotted_extent) checksum(); else no_checksum() > Eww... LOL! > Besides being gross, we also would like to guard against introducing > more bitrot. >> But now you re-checksum the file, with the read error, and return it? >> I'm stupid and just a user/IT guy. I want notifications, but I don't >> want my application to block so I can't kill it, or unmount the >> filesystem. Or continue to use it if I like. > The aforementioned poison bit ensures that you still get the error from > the original checksum error when you read that data - unless you use an > appropriate "give it to me anyways" API. So this implies that I need to do extra work to A) know I'm on a bcachefs filesystem, B) that I got a read/write error and I need to do some more checks to see what exactly the error is. And if I want to re-write the file I can copy it to a new name, but only when I use the new API to say "give me all the data, even if you have a checksum error". >> > So that means: >> >> > - Extents need a poison bit: "reads return errors, even though it now >> > has a good checksum" - this was added in a separate patch queued up >> > for 6.15. 
>> Sorry for being dense, but does a file have one or more extents? Or >> is this at a level below that? > Files have multiple extents. > An extent is one contiguous range of data, and in bcachefs checksums are > at the extent level, not block, so checksummed (and compressed) extents > are limited to, by default, 128k. >> > It's an incompat feature because it's a new extent field, and old >> > versions can't parse extents with unknown field types, since they >> > won't know their sizes - meaning users will have to explicitly do an >> > incompat upgrade to make use of this stuff. >> >> > - The read path needs to do additional retries after checksum errors >> > before giving up and marking it poisoned, so that we don't >> > accidentally convert a transient error to permanent corruption. >> >> When doing these retries, is the filesystem locked up or will the >> process doing the read() be blocked from being killed? > The process doing the read() can't be killed during this, no. If > requested this could be changed, but keep in mind retries are limited in > number. How limited? And in the worst case, if I have 10 or 100 readers of a file with checksum errors, now I've blocked all those processes for X amount of time. Will this info be logged somewhere so a sysadm could possibly just do an 'rm' on the file to nuke it, or have some means of forcing a scrub? > Nothing else is "locked up", everything else proceeds as normal. But is the filesystem able to be unmounted when there's a locked up process? I'm just thinking in terms of system shutdowns when you have failing hardware and want to get things closed as cleanly as possible since you're going to clone the underlying block device onto new media ASAP in an offline manner. >> > - The read path gets a whole bunch of work to plumb precise modern error >> > codes around, so that e.g. the retry path, the data move path, and the >> > "mark extent poisoned" path all know exactly what's going on. >> >> > - Read path is responsible for marking extents poisoned after sufficient >> > retry attempts (controlled by a new filesystem option) >> >> > - Data move path is allowed to move extents after a read error, if it's >> > a checksum error (giving it a new checksum) if it's been poisoned >> > (i.e. the extent flags feature is enabled). >> >> So if just a single bit flips, the extent gets moved onto better >> storage, and the file gets re-checksummed. But what about if more >> bits go bad afterwards? > The new checksum means they're detected, and if you have replication > enabled they'll be corrected automatically, like any other IO error. Nice! ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 00/14] better handling of checksum errors/bitrot 2025-03-18 14:47 ` John Stoffel @ 2025-03-20 17:15 ` Kent Overstreet 0 siblings, 0 replies; 42+ messages in thread From: Kent Overstreet @ 2025-03-20 17:15 UTC (permalink / raw) To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel On Tue, Mar 18, 2025 at 10:47:51AM -0400, John Stoffel wrote: > >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > > > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote: > >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes: > >> > >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > >> > we've got an extent that can't be read, due to checksum error/bitrot. > >> > >> > This took some doing to fix properly, because > >> > >> > - We don't want to just delete such data (replace it with > >> > KEY_TYPE_error); we never want to delete anything except when > >> > explicitly told to by the user, and while we don't yet have an API for > >> > "read this file even though there's errors, just give me what you > >> > have" we definitely will in the future. > >> > >> So will open() just return an error? How will this look from > >> userspace? > > > Not the open, the read - the typical case is only a single extent goes > > bad; it's like any other IO error. > > Good. But then how would an application know we got a checksum error > for data corruption? Would I have to update all my code to do another > special call when I get a read/write error to see if it was a > corruption issue? We can only return -EIO via the normal IO paths. > > >> > - Not being able to move data is a non-option: that would block copygc, > >> > device removal, etc. > >> > >> > - And, moving said extent requires giving it a new checksum - strictly > >> > required if the move has to fragment it, teaching the write/move path > >> > about extents with bad checksums is unpalateable, and anyways we'd > >> > like to be able to guard against more bitrot, if we can. > >> > >> Why does it need a new checksum if there are read errors? What > >> happens if there are more read errors? If I have a file on a > >> filesystem which is based in spinning rust and I get a single bit > >> flip, I'm super happy you catch it. > > > The data move paths very strictly verify checksums as they move data > > around so they don't introduce bitrot. > > Good. This is something I really liked as an idea in ZFS, happy to > see it coming to bcachefs as well. Coming? It's been that way since long before bcachefs was merged. > > The aforementioned poison bit ensures that you still get the error from > > the original checksum error when you read that data - unless you use an > > appropriate "give it to me anyways" API. > > So this implies that I need to do extra work to A) know I'm on > bcachefs filesystem, B) that I got a read/write error and I need to do > some more checks to see what the error exactly is. > > And if I want to re-write the file I can either copy it to a new name, > but only when I use the new API to say "give me all the data, even if > you have a checksum error". This is only if you want an API for "give me possibly corrupted data". That's pretty niche. > > The process doing the read() can't be killed during this, no. If > > requested this could be changed, but keep in mind retries are limited in > > number. > > How limited? 
And in the worst case, if I have 10 or 100 readers of a file with checksum errors, now I've blocked all those processes for X amount of time. Will this info be logged somewhere so a sysadm could possibly just do an 'rm' on the file to nuke it, or have some means of forcing a scrub? The poison bit means we won't attempt additional retries on an extent, once we've reached the (configurable) limit. Future IOs will just return an error without doing the actual read, since we already know it's bad. So no, this won't bog down your system. > > Nothing else is "locked up", everything else proceeds as normal. > > But is the filesystem able to be unmounted when there's a locked up > process? I'm just thinking in terms of system shutdowns when you have > failing hardware and want to get things closed as cleanly as possible > since you're going to clone the underlying block device onto new media > ASAP in an offline manner. The number of retries defaults to 3, and there's no real reason to make it more than 5, so this isn't a real change over the current situation where drives sometimes take forever on a read as they're going out. Eventually we could add a configurable hard limit on the amount of time we spend trying to service an individual read, but that's niche so further down the road. ^ permalink raw reply [flat|nested] 42+ messages in thread