* [PATCH 00/14] better handling of checksum errors/bitrot
From: Kent Overstreet @ 2025-03-11 20:15 UTC
To: linux-bcachefs, linux-block; +Cc: Kent Overstreet, Roland Vet, linux-fsdevel
Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
we've got an extent that can't be read, due to checksum error/bitrot.
This took some doing to fix properly, because
- We don't want to just delete such data (replace it with
KEY_TYPE_error); we never want to delete anything except when
explicitly told to by the user, and while we don't yet have an API for
"read this file even though there's errors, just give me what you
have" we definitely will in the future.
- Not being able to move data is a non-option: that would block copygc,
device removal, etc.
- And, moving said extent requires giving it a new checksum - strictly
required if the move has to fragment it, teaching the write/move path
about extents with bad checksums is unpalatable, and anyways we'd
like to be able to guard against more bitrot, if we can.
So that means:
- Extents need a poison bit: "reads return errors, even though it now
has a good checksum" - this was added in a separate patch queued up
for 6.15.
It's an incompat feature because it's a new extent field, and old
versions can't parse extents with unknown field types, since they
won't know their sizes - meaning users will have to explicitly do an
incompat upgrade to make use of this stuff.
- The read path needs to do additional retries after checksum errors
before giving up and marking it poisoned, so that we don't
accidentally convert a transient error to permanent corruption.
- The read path gets a whole bunch of work to plumb precise modern error
codes around, so that e.g. the retry path, the data move path, and the
"mark extent poisoned" path all know exactly what's going on.
- Read path is responsible for marking extents poisoned after sufficient
retry attempts (controlled by a new filesystem option)
- Data move path is allowed to move extents after a read error if it was
a checksum error and the extent has been poisoned (i.e. the extent
flags feature is enabled); the moved data gets a new checksum. (A
rough sketch of this flow follows the list.)
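To make that concrete, here's a rough userspace model of the
retry-then-poison flow. It's a sketch with invented names - the struct,
the flag and the helpers are all made up for illustration; the real
logic lives in io_read.c and extents.c:

  #include <errno.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Toy model only: EXTENT_POISONED, struct extent and the helpers
   * below are invented, not bcachefs's actual types. */
  #define EXTENT_POISONED (1U << 0)

  struct extent {
          unsigned flags;
          unsigned stored_csum;
  };

  /* Stand-in for one read attempt plus checksum verify: */
  static bool read_and_verify(struct extent *e, void *buf, size_t len)
  {
          return false;   /* stub: pretend every attempt fails */
  }

  static int read_extent(struct extent *e, void *buf, size_t len,
                         unsigned max_retries)
  {
          if (e->flags & EXTENT_POISONED)
                  return -EIO;    /* known bad: fail fast, no device IO */

          for (unsigned i = 0; i < max_retries; i++)
                  if (read_and_verify(e, buf, len))
                          return 0;       /* transient error, retry won */

          /*
           * Retries exhausted: record the corruption permanently, so
           * future reads fail fast and the move paths know they may
           * rewrite this extent with a fresh checksum.
           */
          e->flags |= EXTENT_POISONED;
          return -EIO;
  }

The important property: poisoning happens only after the retry budget is
spent, so a transient failure never gets converted into permanent
corruption.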
Code should be more or less finalized - still have more tests for corner
cases to write, but "write corrupt data and then tell rebalance to move
it to another device" works as expected.
TODO:
- NVMe has a "read recovery level" attribute that controls how hard the
erasure coding algorithms work - we want that plumbed.
Before we give up and move data that we know is bad, we need to try
_as hard as possible_ to get a successful read.
Code currently lives in
https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-testing
Kent Overstreet (14):
bcachefs: Convert read path to standard error codes
bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry path
bcachefs: Read error message now indicates if it was for an internal
move
bcachefs: BCH_ERR_data_read_buffer_too_small
bcachefs: Return errors to top level bch2_rbio_retry()
bcachefs: Print message on successful read retry
bcachefs: Don't create bch_io_failures unless it's needed
bcachefs: Checksum errors get additional retries
bcachefs: __bch2_read() now takes a btree_trans
bcachefs: Poison extents that can't be read due to checksum errors
bcachefs: Data move can read from poisoned extents
bcachefs: Debug params for data corruption injection
block: Allow REQ_FUA|REQ_READ
  bcachefs: Read retries after checksum errors are now REQ_FUA
block/blk-core.c | 19 +-
fs/bcachefs/bcachefs_format.h | 2 +
fs/bcachefs/btree_io.c | 2 +-
fs/bcachefs/errcode.h | 19 +-
fs/bcachefs/extents.c | 157 +++++++++-------
fs/bcachefs/extents.h | 7 +-
fs/bcachefs/extents_types.h | 11 +-
fs/bcachefs/io_read.c | 325 +++++++++++++++++++++++++---------
fs/bcachefs/io_read.h | 21 +--
fs/bcachefs/io_write.c | 24 +++
fs/bcachefs/move.c | 26 ++-
fs/bcachefs/opts.h | 5 +
fs/bcachefs/super-io.c | 4 +
fs/bcachefs/util.c | 21 +++
fs/bcachefs/util.h | 12 ++
15 files changed, 473 insertions(+), 182 deletions(-)
--
2.47.2
* Re: [PATCH 00/14] better handling of checksum errors/bitrot
From: John Stoffel @ 2025-03-17 20:55 UTC
To: Kent Overstreet; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel
>>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
> Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> we've got an extent that can't be read, due to checksum error/bitrot.
> This took some doing to fix properly, because
> - We don't want to just delete such data (replace it with
> KEY_TYPE_error); we never want to delete anything except when
> explicitly told to by the user, and while we don't yet have an API for
> "read this file even though there's errors, just give me what you
> have" we definitely will in the future.
So will open() just return an error? How will this look from
userspace?
> - Not being able to move data is a non-option: that would block copygc,
> device removal, etc.
> - And, moving said extent requires giving it a new checksum - strictly
> required if the move has to fragment it, teaching the write/move path
> about extents with bad checksums is unpalatable, and anyways we'd
> like to be able to guard against more bitrot, if we can.
Why does it need a new checksum if there are read errors? What
happens if there are more read errors? If I have a file on a
filesystem which is based in spinning rust and I get a single bit
flip, I'm super happy you catch it.
But now you re-checksum the file, with the read error, and return it?
I'm stupid and just a user/IT guy. I want notifications, but I don't
want my application to block so I can't kill it, or unmount the
filesystem. Or continue to use it if I like.
> So that means:
> - Extents need a poison bit: "reads return errors, even though it now
> has a good checksum" - this was added in a separate patch queued up
> for 6.15.
Sorry for being dense, but does a file have one or more extents? Or
is this at a level below that?
> It's an incompat feature because it's a new extent field, and old
> versions can't parse extents with unknown field types, since they
> won't know their sizes - meaning users will have to explicitly do an
> incompat upgrade to make use of this stuff.
> - The read path needs to do additional retries after checksum errors
> before giving up and marking it poisoned, so that we don't
> accidentally convert a transient error to permanent corruption.
When doing these retries, is the filesystem locked up or will the
process doing the read() be blocked from being killed?
> - The read path gets a whole bunch of work to plumb precise modern error
> codes around, so that e.g. the retry path, the data move path, and the
> "mark extent poisoned" path all know exactly what's going on.
> - Read path is responsible for marking extents poisoned after sufficient
> retry attempts (controlled by a new filesystem option)
> - Data move path is allowed to move extents after a read error if it was
> a checksum error and the extent has been poisoned (i.e. the extent
> flags feature is enabled); the moved data gets a new checksum.
So if just a single bit flips, the extent gets moved onto better
storage, and the file gets re-checksummed. But what about if more
bits go bad afterwards?
> Code should be more or less finalized - still have more tests for corner
> cases to write, but "write corrupt data and then tell rebalance to move
> it to another device" works as expected.
> TODO:
> - NVMe has a "read recovery level" attribute that controls how hard the
> erasure coding algorithms work - we want that plumbed.
> Before we give up and move data that we know is bad, we need to try
> _as hard as possible_ to get a successful read.
What does this mean exactly from the end user and SysAdmin point of
view? Locking up the system (no new writes to the now-problematic
storage) with processes I can't kill would be annoyingly painful.
Don't get me wrong, I'm happy to see data integrity stuff get added,
I'm just trying to understand how it interacts with userspace.
John
* Re: [PATCH 00/14] better handling of checksum errors/bitrot
From: Kent Overstreet @ 2025-03-18 1:15 UTC
To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel
On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
>
> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> > we've got an extent that can't be read, due to checksum error/bitrot.
>
> > This took some doing to fix properly, because
>
> > - We don't want to just delete such data (replace it with
> > KEY_TYPE_error); we never want to delete anything except when
> > explicitly told to by the user, and while we don't yet have an API for
> > "read this file even though there's errors, just give me what you
> > have" we definitely will in the future.
>
> So will open() just return an error? How will this look from
> userspace?
Not the open, the read - the typical case is only a single extent goes
bad; it's like any other IO error.
> > - Not being able to move data is a non-option: that would block copygc,
> > device removal, etc.
>
> > - And, moving said extent requires giving it a new checksum - strictly
> > required if the move has to fragment it, teaching the write/move path
> > about extents with bad checksums is unpalatable, and anyways we'd
> > like to be able to guard against more bitrot, if we can.
>
> Why does it need a new checksum if there are read errors? What
> happens if there are more read errors? If I have a file on a
> filesystem which is based in spinning rust and I get a single bit
> flip, I'm super happy you catch it.
The data move paths very strictly verify checksums as they move data
around so they don't introduce bitrot.
I'm not going to add

  if (!bitrotted_extent) checksum(); else no_checksum()

Eww...
Besides being gross, we also would like to guard against introducing
more bitrot.
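Concretely, the rule the move path ends up enforcing looks something
like this - again a toy model with invented names, not the actual code
in move.c:

  #include <errno.h>
  #include <stddef.h>

  /* All names invented for illustration: */
  #define EXTENT_POISONED (1U << 0)

  struct extent { unsigned flags, stored_csum; };

  /* Stand-ins for the real IO and checksum helpers: */
  static void read_data(struct extent *e, void *buf, size_t len) {}
  static int write_data(struct extent *e, void *buf, size_t len) { return 0; }
  static unsigned compute_csum(void *buf, size_t len) { return 0; }

  static int move_extent(struct extent *e, void *buf, size_t len)
  {
          read_data(e, buf, len);

          if (compute_csum(buf, len) != e->stored_csum &&
              !(e->flags & EXTENT_POISONED))
                  return -EIO;    /* refuse to move unflagged bitrot */

          /*
           * The relocated copy always gets a fresh, valid checksum;
           * the poison bit survives the move, so reads of the moved
           * data still return an error.
           */
          e->stored_csum = compute_csum(buf, len);
          return write_data(e, buf, len);
  }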
> But now you re-checksum the file, with the read error, and return it?
> I'm stupid and just a user/IT guy. I want notifications, but I don't
> want my application to block so I can't kill it, or unmount the
> filesystem. Or continue to use it if I like.
The aforementioned poison bit ensures that you still get the error from
the original checksum error when you read that data - unless you use an
appropriate "give it to me anyways" API.
> > So that means:
>
> > - Extents need a poison bit: "reads return errors, even though it now
> > has a good checksum" - this was added in a separate patch queued up
> > for 6.15.
>
> Sorry for being dense, but does a file have one or more extents? Or
> is this at a level below that?
Files have multiple extents.
An extent is one contiguous range of data, and in bcachefs checksums are
at the extent level, not the block level, so checksummed (and compressed)
extents are limited to 128k by default - e.g. a 1MB file spans at least
eight checksummed extents, and corruption affects only the extent
containing it.
> > It's an incompat feature because it's a new extent field, and old
> > versions can't parse extents with unknown field types, since they
> > won't know their sizes - meaning users will have to explicitly do an
> > incompat upgrade to make use of this stuff.
>
> > - The read path needs to do additional retries after checksum errors
> > before giving up and marking it poisoned, so that we don't
> > accidentally convert a transient error to permanent corruption.
>
> When doing these retries, is the filesystem locked up or will the
> process doing the read() be blocked from being killed?
The process doing the read() can't be killed during this, no. If
requested this could be changed, but keep in mind retries are limited in
number.
Nothing else is "locked up", everything else proceeds as normal.
> > - The read path gets a whole bunch of work to plumb precise modern error
> > codes around, so that e.g. the retry path, the data move path, and the
> > "mark extent poisoned" path all know exactly what's going on.
>
> > - Read path is responsible for marking extents poisoned after sufficient
> > retry attempts (controlled by a new filesystem option)
>
> > - Data move path is allowed to move extents after a read error if it was
> > a checksum error and the extent has been poisoned (i.e. the extent
> > flags feature is enabled); the moved data gets a new checksum.
>
> So if just a single bit flips, the extent gets moved onto better
> storage, and the file gets re-checksummed. But what about if more
> bits go bad afterwards?
The new checksum means they're detected, and if you have replication
enabled they'll be corrected automatically, like any other IO error.
* Re: [PATCH 00/14] better handling of checksum errors/bitrot
From: John Stoffel @ 2025-03-18 14:47 UTC
To: Kent Overstreet
Cc: John Stoffel, linux-bcachefs, linux-block, Roland Vet,
linux-fsdevel
>>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
> On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
>> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
>>
>> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
>> > we've got an extent that can't be read, due to checksum error/bitrot.
>>
>> > This took some doing to fix properly, because
>>
>> > - We don't want to just delete such data (replace it with
>> > KEY_TYPE_error); we never want to delete anything except when
>> > explicitly told to by the user, and while we don't yet have an API for
>> > "read this file even though there's errors, just give me what you
>> > have" we definitely will in the future.
>>
>> So will open() just return an error? How will this look from
>> userspace?
> Not the open, the read - the typical case is only a single extent goes
> bad; it's like any other IO error.
Good. But then how would an application know we got a checksum error
for data corruption? Would I have to update all my code to do another
special call when I get a read/write error to see if it was a
corruption issue?
>> > - Not being able to move data is a non-option: that would block copygc,
>> > device removal, etc.
>>
>> > - And, moving said extent requires giving it a new checksum - strictly
>> > required if the move has to fragment it, teaching the write/move path
>> > about extents with bad checksums is unpalatable, and anyways we'd
>> > like to be able to guard against more bitrot, if we can.
>>
>> Why does it need a new checksum if there are read errors? What
>> happens if there are more read errors? If I have a file on a
>> filesystem which is based in spinning rust and I get a single bit
>> flip, I'm super happy you catch it.
> The data move paths very strictly verify checksums as they move data
> around so they don't introduce bitrot.
Good. This is something I really liked as an idea in ZFS, happy to
see it coming to bcachefs as well.
> I'm not going to add
>
>   if (!bitrotted_extent) checksum(); else no_checksum()
>
> Eww...
LOL!
> Besides being gross, we also would like to guard against introducing
> more bitrot.
>> But now you re-checksum the file, with the read error, and return it?
>> I'm stupid and just a user/IT guy. I want notifications, but I don't
>> want my application to block so I can't kill it, or unmount the
>> filesystem. Or continue to use it if I like.
> The aforementioned poison bit ensures that you still get the error from
> the original checksum error when you read that data - unless you use an
> appropriate "give it to me anyways" API.
So this implies that I need to do extra work to A) know I'm on a
bcachefs filesystem, B) know that I got a read/write error and need to
do some more checks to see what exactly the error is.
And if I want to re-write the file I can copy it to a new name,
but only when I use the new API to say "give me all the data, even if
you have a checksum error".
>> > So that means:
>>
>> > - Extents need a poison bit: "reads return errors, even though it now
>> > has a good checksum" - this was added in a separate patch queued up
>> > for 6.15.
>>
>> Sorry for being dense, but does a file have one or more extents? Or
>> is this at a level below that?
> Files have multiple extents.
> An extent is one contiguous range of data, and in bcachefs checksums are
> at the extent level, not the block level, so checksummed (and compressed)
> extents are limited to 128k by default.
>> > It's an incompat feature because it's a new extent field, and old
>> > versions can't parse extents with unknown field types, since they
>> > won't know their sizes - meaning users will have to explicitly do an
>> > incompat upgrade to make use of this stuff.
>>
>> > - The read path needs to do additional retries after checksum errors
>> > before giving up and marking it poisoned, so that we don't
>> > accidentally convert a transient error to permanent corruption.
>>
>> When doing these retries, is the filesystem locked up or will the
>> process doing the read() be blocked from being killed?
> The process doing the read() can't be killed during this, no. If
> requested this could be changed, but keep in mind retries are limited in
> number.
How limited? And in the worst case, if I have 10 or 100 readers of a
file with checksum errors, now I've blocked all those processes for X
amount of time. Will this info be logged somewhere so a sysadm could
possibly just do an 'rm' on the file to nuke it, or have some means of
forcing a scrub?
> Nothing else is "locked up", everything else proceeds as normal.
But is the filesystem able to be unmounted when there's a locked up
process? I'm just thinking in terms of system shutdowns when you have
failing hardware and want to get things closed as cleanly as possible
since you're going to clone the underlying block device onto new media
ASAP in an offline manner.
>> > - The read path gets a whole bunch of work to plumb precise modern error
>> > codes around, so that e.g. the retry path, the data move path, and the
>> > "mark extent poisoned" path all know exactly what's going on.
>>
>> > - Read path is responsible for marking extents poisoned after sufficient
>> > retry attempts (controlled by a new filesystem option)
>>
>> > - Data move path is allowed to move extents after a read error if it was
>> > a checksum error and the extent has been poisoned (i.e. the extent
>> > flags feature is enabled); the moved data gets a new checksum.
>>
>> So if just a single bit flips, the extent gets moved onto better
>> storage, and the file gets re-checksummed. But what about if more
>> bits go bad afterwards?
> The new checksum means they're detected, and if you have replication
> enabled they'll be corrected automatically, like any other IO error.
Nice!
* Re: [PATCH 00/14] better handling of checksum errors/bitrot
From: Kent Overstreet @ 2025-03-20 17:15 UTC
To: John Stoffel; +Cc: linux-bcachefs, linux-block, Roland Vet, linux-fsdevel
On Tue, Mar 18, 2025 at 10:47:51AM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
>
> > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
> >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@linux.dev> writes:
> >>
> >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> >> > we've got an extent that can't be read, due to checksum error/bitrot.
> >>
> >> > This took some doing to fix properly, because
> >>
> >> > - We don't want to just delete such data (replace it with
> >> > KEY_TYPE_error); we never want to delete anything except when
> >> > explicitly told to by the user, and while we don't yet have an API for
> >> > "read this file even though there's errors, just give me what you
> >> > have" we definitely will in the future.
> >>
> >> So will open() just return an error? How will this look from
> >> userspace?
>
> > Not the open, the read - the typical case is only a single extent goes
> > bad; it's like any other IO error.
>
> Good. But then how would an application know we got a checksum error
> for data corruption? Would I have to update all my code to do another
> special call when I get a read/write error to see if it was a
> corruption issue?
We can only return -EIO via the normal IO paths.
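To spell that out: from an application's point of view a poisoned
extent looks like any other failed read. Something like:

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          char buf[4096];

          if (argc < 2)
                  return 1;

          int fd = open(argv[1], O_RDONLY);
          if (fd < 0 || read(fd, buf, sizeof buf) < 0)
                  /*
                   * EIO here could be a checksum error, a poisoned
                   * extent, or a plain device failure - the syscall
                   * interface doesn't distinguish.
                   */
                  perror(argv[1]);
          return 0;
  }

Telling corruption apart from a device error needs an out-of-band
mechanism, not the read() return value.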
>
> >> > - Not being able to move data is a non-option: that would block copygc,
> >> > device removal, etc.
> >>
> >> > - And, moving said extent requires giving it a new checksum - strictly
> >> > required if the move has to fragment it, teaching the write/move path
> >> > about extents with bad checksums is unpalatable, and anyways we'd
> >> > like to be able to guard against more bitrot, if we can.
> >>
> >> Why does it need a new checksum if there are read errors? What
> >> happens if there are more read errors? If I have a file on a
> >> filesystem which is based in spinning rust and I get a single bit
> >> flip, I'm super happy you catch it.
>
> > The data move paths very strictly verify checksums as they move data
> > around so they don't introduce bitrot.
>
> Good. This is something I really liked as an idea in ZFS, happy to
> see it coming to bcachefs as well.
Coming? It's been that way since long before bcachefs was merged.
> > The aforementioned poison bit ensures that you still get the error from
> > the original checksum error when you read that data - unless you use an
> > appropriate "give it to me anyways" API.
>
> So this implies that I need to do extra work to A) know I'm on a
> bcachefs filesystem, B) know that I got a read/write error and need to
> do some more checks to see what exactly the error is.
>
> And if I want to re-write the file I can copy it to a new name,
> but only when I use the new API to say "give me all the data, even if
> you have a checksum error".
This is only if you want an API for "give me possibly corrupted data".
That's pretty niche.
> > The process doing the read() can't be killed during this, no. If
> > requested this could be changed, but keep in mind retries are limited in
> > number.
>
> How limited? And in the worst case, if I have 10 or 100 readers of a
> file with checksum errors, now I've blocked all those processes for X
> amount of time. Will this info be logged somewhere so a sysadm could
> possibly just do an 'rm' on the file to nuke it, or have some means of
> forcing a scrub?
The poison bit means we won't attempt additional retries on an extent,
once we've reached the (configurable) limit. Future IOs will just return
an error without doing the actual read, since we already know it's bad.
So no, this won't bog down your system.
> > Nothing else is "locked up", everything else proceeds as normal.
>
> But is the filesystem able to be unmounted when there's a locked up
> process? I'm just thinking in terms of system shutdowns when you have
> failing hardware and want to get things closed as cleanly as possible
> since you're going to clone the underlying block device onto new media
> ASAP in an offline manner.
The number of retries defaults to 3, and there's no real reason to make
it more than 5, so this isn't a real change over the current situation
where drives sometimes take forever on a read as they're going out.
Eventually we could add a configurable hard limit on the amount of time
we spend trying to service an individual read, but that's niche so
further down the road.