Behavior after encountering bad block

Linux Btrfs filesystem development

* Behavior after encountering bad block
@ 2020-06-19  7:24 Daniel Smedegaard Buus
  2020-06-19  7:27 ` Daniel Smedegaard Buus
  2020-06-19  7:45 ` Roman Mamedov
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Smedegaard Buus @ 2020-06-19  7:24 UTC (permalink / raw)
  To: linux-btrfs

Hi :)

I'm on Deepin 20 beta, which is based on Debian.

Linux deepin 5.3.0-3-amd64 #1 SMP deepin 5.3.15-6apricot (2020-04-13)
x86_64 GNU/Linux

btrfs-progs v4.20.1

Label: none  uuid: 01775a38-62bb-4bf2-b6a0-d5af252b3435
Total devices 1 FS bytes used 883.55MiB
devid    1 size 1000.00MiB used 999.00MiB path /dev/loop0

Data, single: total=883.00MiB, used=882.44MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=50.00MiB, used=1.09MiB
GlobalReserve, single: total=16.00MiB, used=0.00B

I was testing btrfs to see data checksumming behavior when
encountering a rotten area, so I set up a loop device backed by a 1GB
file. I filled it with a compressed file and made it rot with, e.g.,

dd if=/dev/zero of=loopie bs=1k seek=800000 count=1

That is, the equivalent of having data on a single block on an actual
hard drive go bad. I did this at different places in the loopback file,
with the same result: Reading the file back from btrfs seems possible
before the point at which the bad block of data is encountered, and
then *most* reads from beyond that point yield IO errors. E.g.:

 daniel@deepin  ~  sudo dd of=/dev/null if=/mnt/file bs=1M count=100
status=progress conv=sync,noerror
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0150797 s, 7.0 GB/s

daniel@deepin  ~  sudo dd of=/dev/null if=/mnt/file bs=1M count=100
skip=700 status=progress conv=sync,noerror
dd: error reading '/mnt/file': Input/output error
34+0 records in
34+0 records out
 ... snip 39 more errors ...

daniel@deepin  ~  sudo dd of=/dev/null if=/mnt/file bs=1M count=100
skip=600 status=progress conv=sync,noerror
dd: error reading '/mnt/file': Input/output error
66+1 records in
67+0 records out
 ... snip 36 more errors ...

 daniel@deepin  ~  sudo dd of=/dev/null if=/mnt/file bs=1M count=100
skip=300 status=progress conv=sync,noerror
dd: error reading '/mnt/file': Input/output error
26+0 records in
26+0 records out
 ... snip 63 more errors ...

This seems ... well, wrong. Like, in, bug wrong. Surely, a single
block of bad data on a device shouldn't cause btrfs to produce such a
cascade of errors, making so much data inaccessible?

Cheers :)
Daniel

* Re: Behavior after encountering bad block
  2020-06-19  7:24 Behavior after encountering bad block Daniel Smedegaard Buus
@ 2020-06-19  7:27 ` Daniel Smedegaard Buus
  2020-06-19  7:45 ` Roman Mamedov
  1 sibling, 0 replies; 8+ messages in thread
From: Daniel Smedegaard Buus @ 2020-06-19  7:27 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 19 Jun 2020 at 09:24, Daniel Smedegaard Buus
<danielbuus@gmail.com> wrote:
> This seems ... well, wrong. Like, in, bug wrong. Surely, a single
> block of bad data on a device shouldn't cause btrfs to produce such a
> cascade of errors, making so much data inaccessible?
>

Oh, wait :D Wish my "cancel sending" timeout was longer on my email.
Just realized the bug: I didn't use conv=notrunc when "rotting" my loopback
file, so I essentially destroyed the entire loopback file from that
sector to the end XD. That explains everything.

Sorry for wasting your time :/
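
A minimal sketch of the pitfall described above, run against a small scratch
file rather than the original loopback image (the temp file and variable
names are illustrative only). With seek= and no conv=notrunc, dd cuts the
output file off right after the block it writes; with conv=notrunc the rest
of the file is left in place. Scaled up to the original command, bs=1k
seek=800000 would leave roughly the first 781 MiB of the 1000 MiB image.

```shell
#!/bin/sh
# Scratch demo: dd's default truncation versus conv=notrunc.
# "img" is a throwaway temp file, not the loopback file from the thread.
img=$(mktemp)

# Create a 2 MiB backing file (2048 blocks of 1 KiB).
dd if=/dev/urandom of="$img" bs=1k count=2048 2>/dev/null
size_before=$(wc -c < "$img")

# Without conv=notrunc: the file is truncated at the seek offset, then one
# block is written, so everything after block 1000 is gone.
dd if=/dev/zero of="$img" bs=1k seek=1000 count=1 2>/dev/null
size_truncated=$(wc -c < "$img")

# Recreate the file and repeat with conv=notrunc: only block 1000 changes.
dd if=/dev/urandom of="$img" bs=1k count=2048 2>/dev/null
dd if=/dev/zero of="$img" bs=1k seek=1000 count=1 conv=notrunc 2>/dev/null
size_notrunc=$(wc -c < "$img")

echo "before:        $size_before bytes"
echo "no notrunc:    $size_truncated bytes"
echo "with notrunc:  $size_notrunc bytes"

rm -f "$img"
```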

* Re: Behavior after encountering bad block
  2020-06-19  7:24 Behavior after encountering bad block Daniel Smedegaard Buus
  2020-06-19  7:27 ` Daniel Smedegaard Buus
@ 2020-06-19  7:45 ` Roman Mamedov
  2020-06-19  8:08   ` Daniel Smedegaard Buus
  1 sibling, 1 reply; 8+ messages in thread
From: Roman Mamedov @ 2020-06-19  7:45 UTC (permalink / raw)
  To: Daniel Smedegaard Buus; +Cc: linux-btrfs

On Fri, 19 Jun 2020 09:24:26 +0200
Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:

> I was testing btrfs to see data checksumming behavior when
> encountering a rotten area, so I set up a loop device backed by a 1GB
> file. I filled it with a compressed file and made it rot with, e.g.,
> 
> dd if=/dev/zero of=loopie bs=1k seek=800000 count=1
> 
> That is, the equivalent of having data on a single block on an actual
> hard drive go bad.

Not really, because when real on-disk sectors go bad, the (properly behaving)
drive will return I/O errors rather than blocks of zeroes.

For a closer emulation of hardware bad sectors, check out dm-dust:
https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-dust.html
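
For reference, a dry-run sketch of what a dm-dust setup might look like for
a device the size of the one in the original report. It only prints the
commands instead of running them. The device path, the mapping name "dust1",
and the bad block number are placeholders, and the table layout
(start, length, "dust", device, offset, block size) follows the dm-dust
documentation linked above; treat it as an assumption to check against your
kernel version.

```shell
#!/bin/sh
# Dry-run sketch: prints dm-dust commands, does not execute them.
# DEV, NAME and the bad block number are hypothetical placeholders.
DEV=/dev/loop0
NAME=dust1
BLKSZ=512

# 1000 MiB device, as in the report. On a real system the sector count
# would come from: blockdev --getsz "$DEV"
SIZE_BYTES=$((1000 * 1024 * 1024))
SECTORS=$((SIZE_BYTES / 512))   # device-mapper tables count 512-byte sectors

TABLE="0 $SECTORS dust $DEV 0 $BLKSZ"

echo "dmsetup create $NAME --table '$TABLE'"
# Mark one block as bad, then start failing reads of bad blocks.
echo "dmsetup message $NAME 0 addbadblock 60"
echo "dmsetup message $NAME 0 enable"
# The filesystem is then mounted from the mapped device, not from $DEV.
echo "mount /dev/mapper/$NAME /mnt"
```

Once enabled, reads of the listed blocks should fail with I/O errors, which
is closer to how a failing sector presents itself than a block of zeroes.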

Roman

* Re: Behavior after encountering bad block
  2020-06-19  7:45 ` Roman Mamedov
@ 2020-06-19  8:08   ` Daniel Smedegaard Buus
  2020-06-19  9:31     ` Roman Mamedov
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Smedegaard Buus @ 2020-06-19  8:08 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Fri, 19 Jun 2020 at 09:45, Roman Mamedov <rm@romanrm.net> wrote:
>
> On Fri, 19 Jun 2020 09:24:26 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
>
> > I was testing btrfs to see data checksumming behavior when
> > encountering a rotten area, so I set up a loop device backed by a 1GB
> > file. I filled it with a compressed file and made it rot with, e.g.,
> >
> > dd if=/dev/zero of=loopie bs=1k seek=800000 count=1
> >
> > That is, the equivalent of having data on a single block on an actual
> > hard drive go bad.
>
> Not really, because when real on-disk sectors go bad, the (properly behaving)
> drive will return I/O errors rather than blocks of zeroes.
>

Well, that's why I wrote having the *data* go bad, not the drive, even
though either scenario should still effectively end up yielding the
same behavior from btrfs, albeit more slowly, and with more chatter in
the kernel log :D But check out my retraction reply from earlier — it
was just me being stupid and forgetting to use conv=notrunc on my dd
command used to damage the loopback file :) Btrfs behaves exactly as
expected when I damage the loopback file properly.

Cheers :)
Daniel

> Roman

* Re: Behavior after encountering bad block
  2020-06-19  8:08   ` Daniel Smedegaard Buus
@ 2020-06-19  9:31     ` Roman Mamedov
  2020-06-19 10:06       ` Daniel Smedegaard Buus
                         ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Roman Mamedov @ 2020-06-19  9:31 UTC (permalink / raw)
  To: Daniel Smedegaard Buus; +Cc: linux-btrfs

On Fri, 19 Jun 2020 10:08:43 +0200
Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:

> Well, that's why I wrote having the *data* go bad, not the drive

But data going bad wouldn't pass unnoticed like that (with reads resulting in
bad data), since drives have end-to-end CRC checking, including on-disk and
through the SATA interface. If data on-disk is somehow corrupted, that will be
a CRC failure on read, and still an I/O error for the host.

I only heard of some bad SSDs (SiliconMotion-based) returning corrupted data
as if nothing happened, and only when their flash lifespan is close to
depletion.

> even though either scenario should still effectively end up yielding the
> same behavior from btrfs

I believe that's also an assumption you'd want to test, if you want to be
thorough in verifying its behavior on failures or corruptions. And anyways it's
better to set up a scenario which is as close as possible to ones you'd get in
real-life.

> But check out my retraction reply from earlier — it was just me being stupid
> and forgetting to use conv=notrunc on my dd command used to damage the
> loopback file :)

Sure, I only commented on the part where it still made sense. :)

-- 
With respect,
Roman

* Re: Behavior after encountering bad block
  2020-06-19  9:31     ` Roman Mamedov
@ 2020-06-19 10:06       ` Daniel Smedegaard Buus
  2020-06-19 13:12       ` Remi Gauvin
  2020-06-19 21:03       ` Zygo Blaxell
  2 siblings, 0 replies; 8+ messages in thread
From: Daniel Smedegaard Buus @ 2020-06-19 10:06 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-btrfs

On Fri, 19 Jun 2020 at 11:31, Roman Mamedov <rm@romanrm.net> wrote:
>
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
>
> > Well, that's why I wrote having the *data* go bad, not the drive
>
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface. If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.
>
> I only heard of some bad SSDs (SiliconMotion-based) returning corrupted data
> as if nothing happened, and only when their flash lifespan is close to
> depletion.
>
> > even though either scenario should still effectively end up yielding the
> > same behavior from btrfs
>
> I believe that's also an assumption you'd want to test, if you want to be
> thorough in verifying its behavior on failures or corruptions. And anyways it's
> better to set up a scenario which is as close as possible to ones you'd get in
> real-life.
>

All good and valid points — but only presupposing that each piece is
behaving as advertised. For instance, a few years back, I discovered
that some sort of bug allowed my SiI PMP/SATA combo to randomly read
or write data incorrectly at a staggering rate when running at SATA 2
speeds under Linux, with no IO errors, and thus no warnings anywhere.
I was running a zpool on the disks attached to it, and ZFS silently
just kept retrying reads — and writes as well, as it read back and
verified written data as well — and thus I lost no data on that
occasion, simply because I was using a data checksumming filesystem.
There's a record of me seeking help about it somewhere on the
interwebs, probably in a Ubuntu forum, and I plugged a hole in the
data destruction by forcing the controllers to run at SATA 1 speeds
only.

At present, I have an old Macbook Pro that is occasionally
experiencing rotted SSD blocks, silently as well. I've discovered it
two or three times. Perhaps due to it having been dropped quite a few
times, or because of what appeared to be a bit of humidity damage
around the SSD socket (I was given it for free, because it wouldn't
recognize its SSD any longer, and thus not boot).

Also at present, I've experienced that the M.2 socket in my Ryzen rig
on a B450 board will give garbage data, at least under multiple
kernels, but perhaps not all, for reasons I'm guessing might be a
buggy driver implementation, because I have experienced no issues with
it under Windows. I've just completely stopped accessing that drive
under Linux. Which is not an issue, because the SSD on that controller
is for my Windows gaming needs anyway.

And finally, again at present, I've seen silent data corruption on
that same rig, with ZFS as the underlying FS, but my suspicion is that
these are the result of overclocking the memory and stressing out the
system for very long stretches, producing par2 and rar files for my
archiving needs.

My point is, yes, the drive and/or controller should tell me if what's
being read back isn't what was once written, but my experience tells
me to never actually rely on this being the case, lest I end up
with bad, unrecoverable data (had I been running md raid instead of
ZFS on that bad SiI rig, my entire data archive would have been
severely, silently, and irrevocably damaged at that point in time).
And the fact that ZFS and btrfs both implement checksumming underlines
the reality of that risk. Don't trust, check :)
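
As a rough illustration of "don't trust, check", the sketch below records a
checksum for a scratch file, silently overwrites one block (this time with
conv=notrunc), and compares again. It is only a userspace analogy using
sha256sum on a temp file; it is not how ZFS or btrfs implement their
checksums, and all names in it are made up for the example.

```shell
#!/bin/sh
# Scratch demo: a stored checksum exposes an overwrite that raised no I/O error.
# The temp file and variable names are illustrative only.
f=$(mktemp)
dd if=/dev/urandom of="$f" bs=1k count=1024 2>/dev/null

sum_before=$(sha256sum < "$f")

# Zero one 1 KiB block in the middle of the file. The write succeeds and the
# file size is unchanged, so nothing signals that the data is now wrong.
dd if=/dev/zero of="$f" bs=1k seek=512 count=1 conv=notrunc 2>/dev/null

sum_after=$(sha256sum < "$f")
size_after=$(wc -c < "$f")

if [ "$sum_before" != "$sum_after" ]; then
    echo "checksum mismatch: corruption detected"
else
    echo "checksums match"
fi

rm -f "$f"
```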

To be fair, I'm not trying to "fix" any of the mentioned hardware
issues with ZFS or btrfs here. I just pick a data checksumming FS by
default when I can, and right now I'm using ZFS on a scratch disk and
getting fed up with the poor performance of ZFS, so I'm looking to use
btrfs instead, as my only need right here is data checksumming, and
AFAIR btrfs performs significantly better than ZFS. That's why I was
verifying that it does indeed have functional data checksumming :)

Cheers for the input!

Daniel :)

> With respect,
> Roman

* Re: Behavior after encountering bad block
  2020-06-19  9:31     ` Roman Mamedov
  2020-06-19 10:06       ` Daniel Smedegaard Buus
@ 2020-06-19 13:12       ` Remi Gauvin
  2020-06-19 21:03       ` Zygo Blaxell
  2 siblings, 0 replies; 8+ messages in thread
From: Remi Gauvin @ 2020-06-19 13:12 UTC (permalink / raw)
  To: Roman Mamedov, Daniel Smedegaard Buus; +Cc: linux-btrfs

On 2020-06-19 5:31 a.m., Roman Mamedov wrote:
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
> 
>> Well, that's why I wrote having the *data* go bad, not the drive
> 
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface. If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.
> 

This used to be my assumption as well.  However, since I started using
BTRFS in more places, I have recorded 3 instances of BTRFS detecting
corruption that was completely unnoticed by Drive or system, before
finally hitting an SSD that knew it was hitting an error.

That's a pretty small anecdote in the grand scheme of things, and I'm
sure Zygo can give something that more closely resembles a real
statistic.... But I'm left to admit that silent corruption from drives /
I/O controllers is far more prevalent than I used to think.

* Re: Behavior after encountering bad block
  2020-06-19  9:31     ` Roman Mamedov
  2020-06-19 10:06       ` Daniel Smedegaard Buus
  2020-06-19 13:12       ` Remi Gauvin
@ 2020-06-19 21:03       ` Zygo Blaxell
  2 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-06-19 21:03 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Daniel Smedegaard Buus, linux-btrfs

On Fri, Jun 19, 2020 at 02:31:48PM +0500, Roman Mamedov wrote:
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
> 
> > Well, that's why I wrote having the *data* go bad, not the drive
> 
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface. 

Some bespoke SAN drives have proprietary firmware and wire protocols that
pass the CRC data in-band from the platter to the host for verification
(the READ and WRITE commands carry extra bytes for the CRC, so a disk
sector is 520 or 4104 bytes long).  This is a true end-to-end CRC check,
but this is not a complete data integrity solution because it only
contributes protection against data corruption while the data is inside
the disk controller.  It has no impact on any of the other silent data
corruption failure modes.

If your drive just passes 512- or 4096-byte sectors to the host, then
there is no end-to-end checking.  It is only piecemeal, partial coverage
of individual segments in the data path, with no way to detect corruption
at points in between.
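
The sector sizes mentioned above work out to the same amount of extra
per-sector data in both cases. A quick arithmetic check (the variable names
are illustrative only):

```shell
#!/bin/sh
# Extra bytes carried per sector when the check data travels in-band.
extra_512=$((520 - 512))     # 512-byte payload in a 520-byte sector
extra_4k=$((4104 - 4096))    # 4096-byte payload in a 4104-byte sector

echo "520-byte sector:  $extra_512 extra bytes per 512 bytes of data"
echo "4104-byte sector: $extra_4k extra bytes per 4096 bytes of data"
```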

> If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.

In my data set, about 1 in 20 failing disks silently corrupt some data
without indicating the data is bad.  No disk is immune to this kind
of failure, from cheap consumer SSDs to enterprise HDDs with bespoke
firmware for proprietary SAN boxes.  Failing drives do not respect the
boundaries of expected non-failing drive behavior.

About a third of silent data corruptions in spinning disks were DRAM
failures.  SSDs and HDDs use DRAM in their embedded controller boards, and
that DRAM fails at the same rate as any other commercially available DRAM.
ECC RAM in disk controllers is the most expensive and least effective way
to improve data integrity in the storage stack, so no rational vendor
offers it.

Another third of the data errors are failures related to write caching.
In these failures the contents of the write cache will be discarded after
the data was reported flushed, and later reads to discarded sectors will
return old data.  This event can be triggered by several different causes,
depending on what faults the firmware can detect and recover from and
what bugs are present in the firmware.  These failures share a defining
characteristic: they can be prevented by disabling write cache.
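
For ATA drives, one common way to do this is hdparm's -W option. The sketch
below is a dry run that only prints the commands; /dev/sdX is a placeholder,
the commands need root, and on some drives the setting may not survive a
power cycle, so it may have to be reapplied at boot.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands rather than executing them.
# DISK is a hypothetical placeholder for the drive in question.
DISK=/dev/sdX

QUERY_CMD="hdparm -W $DISK"       # report the current write-cache setting
DISABLE_CMD="hdparm -W 0 $DISK"   # switch the drive's write cache off

echo "$QUERY_CMD"
echo "$DISABLE_CMD"
```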

The remaining third are assorted bugs (botched UNC sector remappings,
write to wrong track, "magic" LBA bugs, firmware recalls, bad SSDs,
bad cables, bad power, misconfigured bus timeout/SCTERC settings,
and mishandled bus resets) or some uncategorizable mix of multiple
simultaneous failure modes.  Some of these are coincident with other
indicators of failure (e.g. unexpected SATA bus timeouts or resets), but
not IO errors during read or write operations to the specific sectors
that are corrupted.  Some of these are not drive failures per se, but
failures in adjacent parts of the system that cause the drive to 
operate improperly, corrupting data and suppressing error reports.

The other 19 out of 20 failing drives report IO errors as expected, or
fail to spin up at all.  Those failure cases are trivial.  Even mdadm
handles them easily.

> I only heard of some bad SSDs (SiliconMotion-based) returning corrupted data
> as if nothing happened, and only when their flash lifespan is close to
> depletion.

Kingston and Sandisk SSDs silently corrupt data starting as early as 20%
of rated TBW.  After some experimenting with them, I don't believe their
firmware is capable of detecting data integrity errors at any point in
their lifespan.

You can put a btrfs on one of these SSDs with DUP data and DUP metadata,
and watch it play whack-a-mole as it self-repairs the csum errors
that pop up all over the filesystem, until eventually the SSD dies.
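
A dry-run sketch of such a setup, assuming a single spare device and a
reasonably recent btrfs-progs. It only prints the commands; /dev/sdX and
/mnt/test are placeholders. DUP keeps two copies of every block on the same
device, so it halves usable capacity and only helps against corruption that
hits one copy.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands rather than executing them.
# DEV and MNT are hypothetical placeholders.
DEV=/dev/sdX
MNT=/mnt/test

MKFS_CMD="mkfs.btrfs -d dup -m dup $DEV"   # two copies of data and metadata
MOUNT_CMD="mount $DEV $MNT"
SCRUB_CMD="btrfs scrub start -B $MNT"      # verify all checksums, wait for completion
STATS_CMD="btrfs device stats $MNT"        # per-device error counters

echo "$MKFS_CMD"
echo "$MOUNT_CMD"
echo "$SCRUB_CMD"
echo "$STATS_CMD"
```

During a scrub, a block whose checksum fails can be rewritten from its
second copy, and the corruption counter reported by the stats command gives
a running tally of how often that has happened.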

> > even though either scenario should still effectively end up yielding the
> > same behavior from btrfs
> 
> I believe that's also an assumption you'd want to test, if you want to be
> thorough in verifying its behavior on failures or corruptions. And anyways it's
> better to set up a scenario which is as close as possible to ones you'd get in
> real-life.
> 
> > But check out my retraction reply from earlier — it was just me being stupid
> > and forgetting to use conv=notrunc on my dd command used to damage the
> > loopback file :)
> 
> Sure, I only commented on the part where it still made sense. :)
> 
> -- 
> With respect,
> Roman
