* Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-03 21:25 UTC
To: linux-raid

Hi to all,
Wouldn't it be possible to add a sort of bit-rot detection to mdadm?
I know that MD works on blocks and not on files, but checksumming a
block should still be possible.

For example, if you read some blocks with dd, you can hash the content
and verify it on the next read/consistency check.
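For illustration, a minimal user-space sketch of that idea, assuming an md array at /dev/md0 and arbitrarily chosen region sizes and file paths (all names here are illustrative, not an mdadm feature):

    #!/bin/bash
    # Sketch: hash fixed-size regions of a block device and compare them
    # against hashes recorded on a previous run.
    DEV=/dev/md0          # illustrative device name
    CHUNK_MB=64           # region size to hash
    NCHUNKS=16            # how many regions to sample
    SUMFILE=/var/tmp/md0.sums

    newsums=$(for i in $(seq 0 $((NCHUNKS - 1))); do
        printf 'chunk %03d ' "$i"
        dd if="$DEV" bs=1M skip=$((i * CHUNK_MB)) count="$CHUNK_MB" 2>/dev/null | sha256sum
    done)

    if [ -f "$SUMFILE" ]; then
        # Any differing line means that region changed since the last run
        diff <(echo "$newsums") "$SUMFILE" || echo "WARNING: checksum mismatch detected"
    fi
    echo "$newsums" > "$SUMFILE"

Of course, on a live filesystem legitimate writes also change the hashes, which is exactly why this only works below the filesystem if the layer doing the writes is also the layer maintaining the checksums.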
* Re: Bit-Rot
From: Anthony Youngman @ 2017-03-03 21:41 UTC
To: Gandalf Corvotempesta, linux-raid

On 03/03/17 21:25, Gandalf Corvotempesta wrote:
> Hi to all,
> Wouldn't it be possible to add a sort of bit-rot detection to mdadm?
> I know that MD works on blocks and not on files, but checksumming a
> block should still be possible.
>
> For example, if you read some blocks with dd, you can hash the content
> and verify it on the next read/consistency check.

Isn't that what raid 5 does?

Actually, iirc, it doesn't read every stripe and check parity on a
read, because it would clobber performance. But I guess you could have
a switch to turn it on. It's unlikely to achieve anything.

Barring bugs in the firmware, it's pretty near 100% certain that a drive
will either return what was written, or return a read error. Drives
don't return dud data; they have quite a lot of error correction built
in.

Cheers,
Wol
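For reference, md already offers an on-demand consistency scrub that reads every stripe and counts parity/mirror mismatches; what it does not do is checksum the data itself or verify on every read. Assuming the array is md0:

    # Kick off a full check of the array (reads all members and
    # recomputes parity / compares mirrors, without repairing anything)
    echo check > /sys/block/md0/md/sync_action

    # Watch progress
    cat /proc/mdstat

    # After the check finishes, the count of inconsistent sectors found
    cat /sys/block/md0/md/mismatch_cnt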
* Re: Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-03 21:54 UTC
To: Anthony Youngman
Cc: linux-raid

2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
> Isn't that what raid 5 does?

Nothing to do with raid-5.

> Actually, iirc, it doesn't read every stripe and check parity on a
> read, because it would clobber performance. But I guess you could have
> a switch to turn it on. It's unlikely to achieve anything.
>
> Barring bugs in the firmware, it's pretty near 100% certain that a drive
> will either return what was written, or return a read error. Drives
> don't return dud data; they have quite a lot of error correction built
> in.

This is wrong.
Sometimes drives return data differently from what was stored, or
store data differently from the original.
In this case, if the real data is "1" and the drive stores "0", then
when you read back "0" no read error is raised, but the data is still
corrupted.

With bit-rot protection this could be caught and fixed: you checksum
the "1" from the source and write that to disk, and if you read back
"0" the checksum no longer matches.

This is what ZFS does. This is what Gluster does. This is what Btrfs does.
Adding this to mdadm could be an interesting feature.
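For comparison, this is what the checksumming filesystems mentioned above expose to the administrator: a scrub walks all data, verifies every block against its stored checksum, and repairs from a good copy where redundancy allows (pool and mount-point names are illustrative):

    # ZFS: verify all blocks against their checksums, repairing from
    # redundancy where possible
    zpool scrub tank
    zpool status tank

    # Btrfs: the same idea on a mounted filesystem
    btrfs scrub start /mnt/data
    btrfs scrub status /mnt/data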
* Re: Bit-Rot
From: Anthony Youngman @ 2017-03-03 22:16 UTC
To: Gandalf Corvotempesta
Cc: linux-raid

On 03/03/17 21:54, Gandalf Corvotempesta wrote:
> 2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
>> Isn't that what raid 5 does?
>
> Nothing to do with raid-5.
>
>> Actually, iirc, it doesn't read every stripe and check parity on a
>> read, because it would clobber performance. But I guess you could have
>> a switch to turn it on. It's unlikely to achieve anything.
>>
>> Barring bugs in the firmware, it's pretty near 100% certain that a drive
>> will either return what was written, or return a read error. Drives
>> don't return dud data; they have quite a lot of error correction built
>> in.
>
> This is wrong.
> Sometimes drives return data differently from what was stored, or
> store data differently from the original.
> In this case, if the real data is "1" and the drive stores "0", then
> when you read back "0" no read error is raised, but the data is still
> corrupted.

Do you have any figures? I didn't say it can't happen, I just said it
was very unlikely.

> With bit-rot protection this could be caught and fixed: you checksum
> the "1" from the source and write that to disk, and if you read back
> "0" the checksum no longer matches.

Or you just read the raid5 parity (which I don't think, by default, is
what happens). That IS your checksum. So if you think the performance
hit is worth it, write the code to add it, and turn it on. Not only
will it detect a bit-flip, but it will tell you which bit flipped, and
let you correct it.

> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
> Adding this to mdadm could be an interesting feature.

Well, seeing as I understand btrfs doesn't do raid5, only raid1, then
of course it needs some way of detecting whether a mirror is corrupt.
I don't know about gluster or ZFS. (I believe raid5/btrfs is currently
experimental, and dangerous.)

But the question remains - is the effort worth it? Can I refer you to a
very interesting article on LWN? About git, which assumes that
"if hash(A) == hash(B) then A == B". And how that was actually MORE
accurate than "if (memcmp(A, B) == 0) then A == B".

Cheers,
Wol
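As a toy illustration of what single parity gives you (plain shell arithmetic on small integers, nothing to do with the actual md implementation): the parity chunk is the XOR of the data chunks, so recomputing it on read flags that something in the stripe changed, and a chunk that is already known to be bad can be rebuilt from the others.

    #!/bin/bash
    # Three "data chunks" and their parity
    d0=0xA5; d1=0x3C; d2=0x0F
    parity=$(( d0 ^ d1 ^ d2 ))

    # Silent corruption: d1 flips a bit, no read error is reported
    d1=$(( d1 ^ 0x10 ))

    # A verifying read notices that the stripe is inconsistent...
    if (( (d0 ^ d1 ^ d2) != parity )); then
        echo "stripe mismatch detected"
    fi

    # ...and if we know d1 is the bad chunk, it can be rebuilt:
    d1=$(( d0 ^ d2 ^ parity ))
    printf 'rebuilt d1 = 0x%02X\n' "$d1"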
* Re: Bit-Rot
From: Chris Murphy @ 2017-03-05 6:01 UTC
To: Anthony Youngman
Cc: Gandalf Corvotempesta, Linux-RAID

On Fri, Mar 3, 2017 at 3:16 PM, Anthony Youngman <antlists@youngman.org.uk> wrote:
> On 03/03/17 21:54, Gandalf Corvotempesta wrote:
>> 2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
>>> Isn't that what raid 5 does?
>>
>> Nothing to do with raid-5.
>>
>>> Actually, iirc, it doesn't read every stripe and check parity on a
>>> read, because it would clobber performance. But I guess you could have
>>> a switch to turn it on. It's unlikely to achieve anything.
>>>
>>> Barring bugs in the firmware, it's pretty near 100% certain that a drive
>>> will either return what was written, or return a read error. Drives
>>> don't return dud data; they have quite a lot of error correction built
>>> in.
>>
>> This is wrong.
>> Sometimes drives return data differently from what was stored, or
>> store data differently from the original.
>> In this case, if the real data is "1" and the drive stores "0", then
>> when you read back "0" no read error is raised, but the data is still
>> corrupted.
>
> Do you have any figures? I didn't say it can't happen, I just said it
> was very unlikely.

Torn and misdirected writes do happen. There are a bunch of papers on
this problem indicating it's real. This and various other sources of
silent corruption are why ZFS and Btrfs exist.

>> With bit-rot protection this could be caught and fixed: you checksum
>> the "1" from the source and write that to disk, and if you read back
>> "0" the checksum no longer matches.
>
> Or you just read the raid5 parity (which I don't think, by default, is
> what happens). That IS your checksum. So if you think the performance
> hit is worth it, write the code to add it, and turn it on. Not only
> will it detect a bit-flip, but it will tell you which bit flipped, and
> let you correct it.

Parity isn't a checksum. Using it in this fashion is expensive because
it means computing parity for all reads, and it means you can't do
partial stripe reads any more. Next, even once you get a mismatch, it's
ambiguous which strip (mdadm chunk) is corrupt. That would normally be
exposed by the drive reporting an explicit read error. Since that
doesn't exist here, you'd have to fake-fail each strip in turn, rebuild
it from parity, and compare.

>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>> Adding this to mdadm could be an interesting feature.
>
> Well, seeing as I understand btrfs doesn't do raid5, only raid1, then
> of course it needs some way of detecting whether a mirror is corrupt.
> I don't know about gluster or ZFS. (I believe raid5/btrfs is currently
> experimental, and dangerous.)

Btrfs supports raid1, 10, 5 and 6. It's reasonable to consider raid56
experimental because it has a number of gotchas, not least of which is
that certain kinds of writes are not COW, so the COW safeguards don't
always apply in a power failure. As for dangerous, opinions vary, but
something everyone can probably agree on is that any ambiguity about
the stability of a file system looks bad.

> But the question remains - is the effort worth it?

That's the central question. And to answer it, you'd need some sort of
rough design. Where are the csums going to be stored? Do you update
data strips before or after the csums? Either way, if this is not COW,
you have a moment of complete mismatch between data and csums, with
live data. So... that's a big problem, actually. And if you have a
crash or power failure during writes, it's an even bigger problem. Do
you csum the parity?

--
Chris Murphy
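A toy user-space illustration of that ordering problem: if the data is rewritten before its checksum record (or the other way round) and the machine stops between the two steps, the next verify flags a mismatch even though nothing actually rotted, which is why real implementations journal or copy-on-write both together. File names here are illustrative only.

    #!/bin/bash
    data=/tmp/chunk.bin
    csum=/tmp/chunk.sha256

    dd if=/dev/urandom of="$data" bs=4k count=1 2>/dev/null
    sha256sum "$data" > "$csum"     # data and checksum agree

    # A rewrite happens, but we "crash" before updating the checksum
    dd if=/dev/urandom of="$data" bs=4k count=1 conv=notrunc 2>/dev/null

    # The next scrub sees a spurious mismatch on perfectly good data
    sha256sum -c "$csum" || echo "mismatch, but the data is not corrupt"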
* Re: Bit-Rot
From: Mikael Abrahamsson @ 2017-03-05 8:15 UTC
To: Gandalf Corvotempesta
Cc: Anthony Youngman, linux-raid

On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:

> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
> Adding this to mdadm could be an interesting feature.

This has been discussed several times. Yes, it would be interesting.
It's not easy to do because mdadm maps 4k blocks to 4k blocks. The only
way to "easily" add this, I imagine, would be to have an additional
"checksum" block, so that raid6 would require 3 extra drives instead
of 2.

The answer historically has been "patches welcome".

--
Mikael Abrahamsson    email: swmike@swm.pp.se
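A rough back-of-the-envelope on that layout (illustrative numbers, not a proposal): a dedicated checksum block per stripe is far more space than the checksums strictly need, but it is what keeps the 4k-to-4k mapping intact.

    #!/bin/bash
    # Illustrative numbers only
    block=4096      # bytes per data block
    csum=16         # bytes per checksum (e.g. a truncated hash)

    per_block=$(( block / csum ))
    echo "one ${block}-byte checksum block covers ${per_block} data blocks"
    awk -v c="$csum" -v b="$block" \
        'BEGIN { printf "minimum space overhead: %.2f%%\n", 100*c/b }'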
* Re: Bit-Rot
From: Pasi Kärkkäinen @ 2017-03-06 11:56 UTC
To: Mikael Abrahamsson
Cc: Gandalf Corvotempesta, Anthony Youngman, linux-raid

On Sun, Mar 05, 2017 at 09:15:39AM +0100, Mikael Abrahamsson wrote:
> On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:
>
>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>> Adding this to mdadm could be an interesting feature.
>
> This has been discussed several times. Yes, it would be interesting.
> It's not easy to do because mdadm maps 4k blocks to 4k blocks. The only
> way to "easily" add this, I imagine, would be to have an additional
> "checksum" block, so that raid6 would require 3 extra drives instead
> of 2.
>
> The answer historically has been "patches welcome".

There was/is an early prototype implementation of checksums for Linux MD RAID:

http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/
http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-presentation.pdf
http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-code.tar.bz2

There is also the T10 DIF/DIX (Data Integrity Fields / Data Integrity
eXtensions) functionality that could be used, at least if the hardware
setup is SAS-based (SAS HBA + enterprise SAS disks, with firmware on
both that is modern enough to enable DIF/DIX).

I guess MD RAID could also 'emulate' T10 DIF/DIX even if the HBA/disks
don't support it... but I don't know if that makes sense.

-- Pasi
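One quick way to see whether a given disk and HBA already offer that kind of protection (assuming the sg3_utils package is installed; the device name is illustrative, and the exact sysfs entries may vary with kernel version):

    # Does the device report a T10 protection type in READ CAPACITY(16)?
    sg_readcap --long /dev/sdb | grep -i prot

    # Has the kernel registered a block integrity profile for it?
    cat /sys/block/sdb/integrity/format 2>/dev/null
    cat /sys/block/sdb/integrity/read_verify 2>/dev/null
    cat /sys/block/sdb/integrity/write_generate 2>/dev/null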
* Re: Bit-Rot
From: Reindl Harald @ 2017-03-06 12:45 UTC
To: Pasi Kärkkäinen, Mikael Abrahamsson
Cc: Gandalf Corvotempesta, Anthony Youngman, linux-raid

On 06.03.2017 at 12:56, Pasi Kärkkäinen wrote:
> On Sun, Mar 05, 2017 at 09:15:39AM +0100, Mikael Abrahamsson wrote:
>> On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:
>>
>>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>>> Adding this to mdadm could be an interesting feature.
>>
>> This has been discussed several times. Yes, it would be interesting.
>> It's not easy to do because mdadm maps 4k blocks to 4k blocks. The only
>> way to "easily" add this, I imagine, would be to have an additional
>> "checksum" block, so that raid6 would require 3 extra drives instead
>> of 2.
>>
>> The answer historically has been "patches welcome".
>
> There was/is an early prototype implementation of checksums for Linux MD RAID:
>
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-presentation.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-code.tar.bz2

Well, it would already help if the raid-check verified that in a RAID10
both mirrors of a stripe hold identical data, instead of just waiting
for a read error from the drives. When you mix HDD and SSD, after the
first fstrim (which only affects the SSD) the sha1sum of the first
megabytes no longer matches between mirrors, while on a 4-drive HDD
array it still does - and both machines are clones of each other.

________________________________________

Machine 1, running since 2011:

for disk in sda2 sdb2 sdd2 sdc2
do
    echo -n "$disk = "
    dd if=/dev/$disk bs=1M skip=10 count=10 2>/dev/null | sha1sum
done

sda2 = 61efc1017cac02b1be7a95618215485b70a0d18d -
sdb2 = ac4ec9b1a96c9c6bbd9ba196fcb7d6cd2dbb0faa -
sdd2 = ac4ec9b1a96c9c6bbd9ba196fcb7d6cd2dbb0faa -
sdc2 = 61efc1017cac02b1be7a95618215485b70a0d18d -

________________________________________

The same on a cloned machine (just move two of the drives to the other
machine and resync both arrays with two new drives):

sda2 = 766fde5907aebc4dca39e31475b295035c95e3b4 -
sdb2 = 4f4b7f3b8f8893b2fb2f0f8b86944aa88f2cf2b6 -
sdd2 = 940ecae52580759abb33328dc464a937a66339ba -
sdc2 = 9f79a56f0f09bb422a8d40787ca28cb719866e8e -

________________________________________

* remove the 2 HDDs on the mixed machine
* overwrite them with zeros
* re-add them and wait for the rebuild
* sha1sums are identical
* fstrim -a
* sha1sums mismatch
* no alert from "raid-check"

You can repeat that as often as you want, with the same results. The
simple reason is that for the free ext4 blocks within that region, the
SSD returns zeros after fstrim while the HDD doesn't.

________________________________________

And yes, I am aware that someone or some automatism needs to decide
which of the two halves is the truth, but at least it should alert by
default.
* Re: Bit-Rot
From: Brassow Jonathan @ 2017-03-17 15:37 UTC
To: Pasi Kärkkäinen
Cc: Mikael Abrahamsson, Gandalf Corvotempesta, Anthony Youngman, linux-raid

> On Mar 6, 2017, at 5:56 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
>
> There was/is an early prototype implementation of checksums for Linux MD RAID:
>
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-presentation.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-code.tar.bz2
>
> There is also the T10 DIF/DIX (Data Integrity Fields / Data Integrity
> eXtensions) functionality that could be used, at least if the hardware
> setup is SAS-based (SAS HBA + enterprise SAS disks, with firmware on
> both that is modern enough to enable DIF/DIX).
>
> I guess MD RAID could also 'emulate' T10 DIF/DIX even if the HBA/disks
> don't support it... but I don't know if that makes sense.

There is a device-mapper target that is designed to do precisely this -
dm-integrity (see the dm-devel mailing list). It is currently being
developed as part of an authenticated-encryption project, but it could
be used for this too. Note that there is a performance penalty that
comes from emulating this.

brassow
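As a sketch of how such a standalone integrity layer can sit underneath md, assuming the integritysetup utility from the cryptsetup package is available (the exact options shown are an assumption - check the man page of your version; device names are illustrative):

    # Format each member with per-sector checksums, then open it as a
    # mapped device (sha256 here; a CRC is the usual default)
    integritysetup format --integrity sha256 /dev/sdb
    integritysetup open   --integrity sha256 /dev/sdb int-sdb

    integritysetup format --integrity sha256 /dev/sdc
    integritysetup open   --integrity sha256 /dev/sdc int-sdc

    # Build the md array on top of the integrity-backed mappings; a
    # checksum failure then surfaces to md as a read error, which md
    # can repair from the other mirror
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/mapper/int-sdb /dev/mapper/int-sdc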
* Re: Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-17 16:59 UTC
To: Brassow Jonathan
Cc: Pasi Kärkkäinen, Mikael Abrahamsson, Anthony Youngman, linux-raid

2017-03-17 16:37 GMT+01:00 Brassow Jonathan <jbrassow@redhat.com>:
> There is a device-mapper target that is designed to do precisely this -
> dm-integrity (see the dm-devel mailing list). It is currently being
> developed as part of an authenticated-encryption project, but it could
> be used for this too. Note that there is a performance penalty that
> comes from emulating this.

Probably something similar could be obtained by checking, during a
scrub, the majority of responses from all replicas. A sort of quorum.

If you have a 3-way mirror and two disks reply with "1" while the other
replies with "0", the disk with "0" has hit bit rot.

Is mdadm able to make this decision? In a 2-way mirror it would be
impossible, as you can't know which disk has the correct data, but in a
3-way mirror you have a majority. Probably the same could be done in
RAID-6, where you have two parities to evaluate.
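A minimal user-space sketch of that majority vote, assuming a 3-way mirror whose members are sdb2, sdc2 and sdd2 (names and region chosen arbitrarily); it only identifies the odd one out and does not repair anything:

    #!/bin/bash
    # Hash the same region on each mirror member and flag the minority
    members=(sdb2 sdc2 sdd2)
    declare -A hash_of

    for m in "${members[@]}"; do
        hash_of[$m]=$(dd if=/dev/$m bs=1M skip=10 count=10 2>/dev/null | sha1sum | cut -d' ' -f1)
    done

    for m in "${members[@]}"; do
        votes=$(for x in "${members[@]}"; do echo "${hash_of[$x]}"; done | grep -c "${hash_of[$m]}")
        if [ "$votes" -eq 1 ]; then
            # If all three differ, there is no majority and all get flagged
            echo "$m disagrees with the other two mirrors"
        fi
    done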