* Re: Behavior after encountering bad block
2020-06-19 9:31 ` Roman Mamedov
@ 2020-06-19 10:06 ` Daniel Smedegaard Buus
2020-06-19 13:12 ` Remi Gauvin
2020-06-19 21:03 ` Zygo Blaxell
2 siblings, 0 replies; 8+ messages in thread
From: Daniel Smedegaard Buus @ 2020-06-19 10:06 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-btrfs
On Fri, 19 Jun 2020 at 11:31, Roman Mamedov <rm@romanrm.net> wrote:
>
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
>
> > Well, that's why I wrote having the *data* go bad, not the drive
>
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface. If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.
>
> I only heard of some bad SSDs (SiliconMotion-based) returning corrupted data
> as if nothing happened, and only when their flash lifespan is close to
> depletion.
>
> > even though either scenario should still effectively end up yielding the
> > same behavior from btrfs
>
> I believe that's also an assumption you'd want to test, if you want to be
> thorough in verifying its behavior on failures or corruptions. And anyways it's
> better to set up a scenario which is as close as possible to ones you'd get in
> real-life.
>
All good and valid points, but only presupposing that each piece is
behaving as advertised. For instance, a few years back, I discovered
that some sort of bug allowed my SiI PMP/SATA combo to randomly read
or write data incorrectly at a staggering rate when running at SATA 2
speeds under Linux, with no IO errors, and thus no warnings anywhere.
I was running a zpool on the disks attached to it, and ZFS silently
just kept retrying reads (and writes too, since it read back and
verified written data), and thus I lost no data on that
occasion, simply because I was using a data checksumming filesystem.
There's a record of me seeking help about it somewhere on the
interwebs, probably in an Ubuntu forum, and I plugged a hole in the
data destruction by forcing the controllers to run at SATA 1 speeds
only.
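
The retry-until-it-verifies behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not how ZFS is implemented: the function name write_verified, the retry count, and the use of SHA-256 are all assumptions made for the example. Note also that a read-back through the page cache may never touch the device, so a real check would need direct I/O or a cache drop.

```python
import hashlib
import os
import tempfile


def write_verified(path, data, max_retries=3):
    """Write data, read it back, and retry until the digests match.

    Hypothetical helper for illustration only. Returns the number of
    attempts that were needed.
    """
    expected = hashlib.sha256(data).hexdigest()
    for attempt in range(1, max_retries + 1):
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to the device
        # Caveat: this read may be served from the page cache, not the disk.
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual == expected:
            return attempt
    raise OSError("read-back never matched after %d attempts" % max_retries)


with tempfile.TemporaryDirectory() as tmp:
    attempts = write_verified(os.path.join(tmp, "blob.bin"), b"\x00" * 4096)
```

On healthy hardware the first attempt succeeds; on a path that corrupts data in transit, the loop either converges on a good copy or fails loudly instead of silently.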
At present, I have an old MacBook Pro that is occasionally
experiencing rotted SSD blocks, silently as well. I've discovered it
two or three times. Perhaps due to it having been dropped quite a few
times, or because of what appeared to be a bit of humidity damage
around the SSD socket (I was given it for free, because it wouldn't
recognize its SSD any longer, and thus not boot).
Also at present, I've experienced that the M.2 socket in my Ryzen rig
on a B450 board will give garbage data, at least under multiple
kernels, but perhaps not all, for reasons I'm guessing might be a
buggy driver implementation, because I have experienced no issues with
it under Windows. I've just completely stopped accessing that drive
under Linux. Which is not an issue, because the SSD on that controller
is for my Windows gaming needs anyway.
And finally, again at present, I've seen silent data corruption on
that same rig, with ZFS as the underlying FS, but my suspicion is that
these are the result of overclocking the memory and stressing out the
system for very long stretches, producing par2 and rar files for my
archiving needs.
My point is, yes, the drive and/or controller should tell me if what's
being read back isn't what was once written, but my experience tells
me to never actually rely on this being the case, lest I end up
with bad, unrecoverable data (had I been running md raid instead of
ZFS on that bad SiI rig, my entire data archive would have been
severely, silently, and irrevocably damaged at that point in time).
And the fact that ZFS and btrfs both implement checksumming underlines
the reality of that risk. Don't trust, check :)
To be fair, I'm not trying to "fix" any of the mentioned hardware
issues with ZFS or btrfs here. I just pick a data checksumming FS by
default when I can, and right now I'm using ZFS on a scratch disk and
getting fed up with the poor performance of ZFS, so I'm looking to use
btrfs instead, as my only need right here is data checksumming, and
AFAIR btrfs performs significantly better than ZFS. That's why I was
verifying that it does indeed have functional data checksumming :)
Cheers for the input!
Daniel :)
> With respect,
> Roman
* Re: Behavior after encountering bad block
2020-06-19 9:31 ` Roman Mamedov
2020-06-19 10:06 ` Daniel Smedegaard Buus
@ 2020-06-19 13:12 ` Remi Gauvin
2020-06-19 21:03 ` Zygo Blaxell
2 siblings, 0 replies; 8+ messages in thread
From: Remi Gauvin @ 2020-06-19 13:12 UTC (permalink / raw)
To: Roman Mamedov, Daniel Smedegaard Buus; +Cc: linux-btrfs
On 2020-06-19 5:31 a.m., Roman Mamedov wrote:
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
>
>> Well, that's why I wrote having the *data* go bad, not the drive
>
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface. If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.
>
This used to be my assumption as well. However, since I started using
BTRFS in more places, I have recorded 3 instances of BTRFS detecting
corruption that was completely unnoticed by drive or system, before
finally hitting an SSD that knew it was hitting an error.
That's a pretty small anecdote in the grand scheme of things, and I'm
sure Zygo can give something that more closely resembles a real
statistic. But I have to admit that silent corruption from drives /
I/O controllers is far more prevalent than I used to think.
* Re: Behavior after encountering bad block
2020-06-19 9:31 ` Roman Mamedov
2020-06-19 10:06 ` Daniel Smedegaard Buus
2020-06-19 13:12 ` Remi Gauvin
@ 2020-06-19 21:03 ` Zygo Blaxell
2 siblings, 0 replies; 8+ messages in thread
From: Zygo Blaxell @ 2020-06-19 21:03 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Daniel Smedegaard Buus, linux-btrfs
On Fri, Jun 19, 2020 at 02:31:48PM +0500, Roman Mamedov wrote:
> On Fri, 19 Jun 2020 10:08:43 +0200
> Daniel Smedegaard Buus <danielbuus@gmail.com> wrote:
>
> > Well, that's why I wrote having the *data* go bad, not the drive
>
> But data going bad wouldn't pass unnoticed like that (with reads resulting in
> bad data), since drives have end-to-end CRC checking, including on-disk and
> through the SATA interface.
Some bespoke SAN drives have proprietary firmware and wire protocols that
pass the CRC data in-band from the platter to the host for verification
(the READ and WRITE commands carry extra bytes for the CRC, so a disk
sector is 520 or 4104 bytes long). This is a true end-to-end CRC check,
but this is not a complete data integrity solution because it only
contributes protection against data corruption while the data is inside
the disk controller. It has no impact on any of the other silent data
corruption failure modes.
If your drive just passes 512- or 4096-byte sectors to the host, then
there is no end-to-end checking. It is only piecemeal, partial coverage
of individual segments in the data path, with no way to detect corruption
at points in between.
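
As a rough illustration of what "extra bytes for the CRC" means, here is a minimal sketch of a 512-byte payload carried in a 520-byte sector. The 8-byte trailer layout (a CRC-32 followed by the low 32 bits of the LBA) and the names pack_sector and unpack_sector are assumptions invented for this example; real drives use their own formats.

```python
import struct
import zlib

PAYLOAD_SIZE = 512
TRAILER_SIZE = 8  # 520 - 512; the trailer layout below is hypothetical


def pack_sector(lba, payload):
    """Append a CRC and the LBA to a 512-byte payload, giving 520 bytes."""
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError("payload must be exactly 512 bytes")
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">II", crc, lba & 0xFFFFFFFF)


def unpack_sector(lba, raw):
    """Verify the trailer on the host side and return the payload."""
    if len(raw) != PAYLOAD_SIZE + TRAILER_SIZE:
        raise ValueError("expected a 520-byte sector")
    payload, trailer = raw[:PAYLOAD_SIZE], raw[PAYLOAD_SIZE:]
    crc, stored_lba = struct.unpack(">II", trailer)
    if crc != (zlib.crc32(payload) & 0xFFFFFFFF):
        raise OSError("CRC mismatch: payload was altered in flight or at rest")
    if stored_lba != (lba & 0xFFFFFFFF):
        raise OSError("sector belongs to a different LBA")
    return payload


sector = pack_sector(42, b"\x00" * PAYLOAD_SIZE)
payload = unpack_sector(42, sector)
```

Because the check bytes travel with the data, the host can verify them itself. A plain 512-byte transfer gives the host nothing to verify.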
> If data on-disk is somehow corrupted, that will be
> a CRC failure on read, and still an I/O error for the host.
In my data set, about 1 in 20 failing disks silently corrupt some data
without indicating the data is bad. No disk is immune to this kind
of failure, from cheap consumer SSDs to enterprise HDDs with bespoke
firmware for proprietary SAN boxes. Failing drives do not respect the
boundaries of expected non-failing drive behavior.
About a third of silent data corruptions in spinning disks were DRAM
failures. SSDs and HDDs use DRAM in their embedded controller boards, and
that DRAM fails at the same rate as any other commercially available DRAM.
ECC RAM in disk controllers is the most expensive and least effective way
to improve data integrity in the storage stack, so no rational vendor
offers it.
Another third of the data errors are failures related to write caching.
In these failures the contents of the write cache will be discarded after
the data was reported flushed, and later reads to discarded sectors will
return old data. This event can be triggered by several different causes,
depending on what faults the firmware can detect and recover from and
what bugs are present in the firmware. These failures share a defining
characteristic: they can be prevented by disabling write cache.
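
A toy model may help show why this failure is silent at the drive level: the stale sector was written correctly at some earlier time, so it still passes the drive's own per-sector check. The classes FlakyCacheDrive and ChecksummingFS are hypothetical, CRC-32 stands in for whatever the drive and filesystem really use, and the sketch assumes the filesystem's checksum record itself reached stable storage.

```python
import zlib


class FlakyCacheDrive:
    """Toy drive that keeps a CRC next to each sector and has a write cache."""

    def __init__(self):
        self.media = {}  # lba -> (data, crc) that really reached the media
        self.cache = {}  # lba -> (data, crc) still in the volatile cache

    def write(self, lba, data):
        self.cache[lba] = (data, zlib.crc32(data))

    def flush(self, lose_cache=False):
        # A faulty drive discards the cache but still reports success.
        if not lose_cache:
            self.media.update(self.cache)
        self.cache.clear()
        return True

    def read(self, lba):
        data, crc = self.cache[lba] if lba in self.cache else self.media[lba]
        if zlib.crc32(data) != crc:
            raise OSError("drive-level CRC error")
        return data


class ChecksummingFS:
    """Toy filesystem that keeps its checksums apart from the data sectors."""

    def __init__(self, drive):
        self.drive = drive
        self.csums = {}  # assumed to be stored safely elsewhere

    def write(self, lba, data, lose_cache=False):
        self.drive.write(lba, data)
        self.drive.flush(lose_cache=lose_cache)
        self.csums[lba] = zlib.crc32(data)

    def read(self, lba):
        data = self.drive.read(lba)  # passes the drive check even if stale
        if zlib.crc32(data) != self.csums[lba]:
            raise OSError("csum mismatch at LBA %d" % lba)
        return data


drive = FlakyCacheDrive()
fs = ChecksummingFS(drive)
fs.write(0, b"old contents")
fs.write(0, b"new contents", lose_cache=True)  # flush "succeeds", data is lost
stale = drive.read(0)  # the drive hands back the old data without complaint
try:
    fs.read(0)
    detected = False
except OSError:
    detected = True  # only the filesystem-level checksum notices
```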
The remaining third are assorted bugs (botched UNC sector remappings,
write to wrong track, "magic" LBA bugs, firmware recalls, bad SSDs,
bad cables, bad power, misconfigured bus timeout/SCTERC settings,
and mishandled bus resets) or some uncategorizable mix of multiple
simultaneous failure modes. Some of these are coincident with other
indicators of failure (e.g. unexpected SATA bus timeouts or resets), but
not IO errors during read or write operations to the specific sectors
that are corrupted. Some of these are not drive failures per se, but
failures in adjacent parts of the system that cause the drive to
operate improperly, corrupting data and suppressing error reports.
The other 19 out of 20 failing drives report IO errors as expected, or
fail to spin up at all. Those failure cases are trivial. Even mdadm
handles them easily.
> I only heard of some bad SSDs (SiliconMotion-based) returning corrupted data
> as if nothing happened, and only when their flash lifespan is close to
> depletion.
Kingston and Sandisk SSDs silently corrupt data starting as early as 20%
of rated TBW. After some experimenting with them, I don't believe their
firmware is capable of detecting data integrity errors at any point in
their lifespan.
You can put a btrfs on one of these SSDs with DUP data and DUP metadata,
and watch it play whack-a-mole as it self-repairs the csum errors
that pop up all over the filesystem, until eventually the SSD dies.
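
The whack-a-mole repair loop can be pictured with a small model of two copies plus a stored checksum. This is a sketch of the general idea only, not btrfs code: the class DupStore is hypothetical, CRC-32 from zlib is a stand-in for the filesystem's real checksum, and everything lives in memory.

```python
import zlib


class DupStore:
    """Toy model of a DUP profile: two copies of every block, one checksum."""

    def __init__(self):
        self.copies = [{}, {}]  # two independent copies on the same device
        self.csums = {}         # block number -> expected checksum
        self.repairs = 0

    def write(self, block, data):
        for copy in self.copies:
            copy[block] = data
        self.csums[block] = zlib.crc32(data)

    def read(self, block):
        """Return the first copy that verifies, rewriting any that failed."""
        want = self.csums[block]
        failed = []
        for index, copy in enumerate(self.copies):
            data = copy[block]
            if zlib.crc32(data) == want:
                for bad_index in failed:
                    self.copies[bad_index][block] = data  # self-repair
                    self.repairs += 1
                return data
            failed.append(index)
        raise OSError("all copies of block %d failed checksum" % block)


store = DupStore()
store.write(1, b"important data")
store.copies[0][1] = b"importent data"  # simulate silent corruption
recovered = store.read(1)               # falls back to copy 1, fixes copy 0
```

If both copies go bad before the block is read again, nothing is left to repair from, which is how the game eventually ends on a dying SSD.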
> > even though either scenario should still effectively end up yielding the
> > same behavior from btrfs
>
> I believe that's also an assumption you'd want to test, if you want to be
> thorough in verifying its behavior on failures or corruptions. And anyways it's
> better to set up a scenario which is as close as possible to ones you'd get in
> real-life.
>
> > But check out my retraction reply from earlier — it was just me being stupid
> > and forgetting to use conv=notrunc on my dd command used to damage the
> > loopback file :)
>
> Sure, I only commented on the part where it still made sense. :)
>
> --
> With respect,
> Roman
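
On the conv=notrunc point: without that flag, dd truncates the output file before writing, so the backing image shrinks instead of receiving a small patch of damage. A minimal Python sketch of the same in-place overwrite follows. The helper name corrupt_in_place, the image size, and the offset are made up for the example.

```python
import os
import tempfile


def corrupt_in_place(path, offset, garbage):
    """Overwrite bytes at offset without truncating the file.

    Opening with "r+b" keeps the existing contents, much like
    dd conv=notrunc; opening with "wb" would truncate the file first.
    """
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(garbage)
        f.flush()
        os.fsync(f.fileno())


with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "loop.img")  # hypothetical loopback image
    with open(image, "wb") as f:
        f.write(b"\x00" * (1024 * 1024))
    size_before = os.path.getsize(image)
    corrupt_in_place(image, 4096, b"\xff" * 16)
    size_after = os.path.getsize(image)  # same size, 16 bytes damaged
```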