* Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-03 21:25 UTC
To: linux-raid

Hi to all,
Wouldn't it be possible to add a sort of bit-rot detection to mdadm?
I know that MD works on blocks and not on files, but checksumming a
block should still be possible.

For example, if you read some blocks with dd, you can hash the content
and verify it on the next read/consistency check.
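For illustration, a minimal user-space sketch of that idea, assuming an md array at /dev/md0 and arbitrarily chosen region sizes and file paths (all names here are illustrative, not an mdadm feature):

    #!/bin/bash
    # Sketch: hash fixed-size regions of a block device and compare them
    # against hashes recorded on a previous run.
    DEV=/dev/md0          # illustrative device name
    CHUNK_MB=64           # region size to hash
    NCHUNKS=16            # how many regions to sample
    SUMFILE=/var/tmp/md0.sums

    newsums=$(for i in $(seq 0 $((NCHUNKS - 1))); do
        printf 'chunk %03d ' "$i"
        dd if="$DEV" bs=1M skip=$((i * CHUNK_MB)) count="$CHUNK_MB" 2>/dev/null | sha256sum
    done)

    if [ -f "$SUMFILE" ]; then
        # Any differing line means that region changed since the last run
        diff <(echo "$newsums") "$SUMFILE" || echo "WARNING: checksum mismatch detected"
    fi
    echo "$newsums" > "$SUMFILE"

Of course, on a live filesystem legitimate writes also change the hashes, which is exactly why this only works below the filesystem if the layer doing the writes is also the layer maintaining the checksums.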
* Re: Bit-Rot
From: Anthony Youngman @ 2017-03-03 21:41 UTC
To: Gandalf Corvotempesta, linux-raid

On 03/03/17 21:25, Gandalf Corvotempesta wrote:
> Hi to all,
> Wouldn't it be possible to add a sort of bit-rot detection to mdadm?
> I know that MD works on blocks and not on files, but checksumming a
> block should still be possible.
>
> For example, if you read some blocks with dd, you can hash the content
> and verify it on the next read/consistency check.

Isn't that what raid 5 does?

Actually, iirc, it doesn't read every stripe and check parity on a
read, because it would clobber performance. But I guess you could have
a switch to turn it on. It's unlikely to achieve anything.

Barring bugs in the firmware, it's pretty near 100% certain that a drive
will either return what was written, or return a read error. Drives
don't return dud data; they have quite a lot of error correction built
in.

Cheers,
Wol
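For reference, md already offers an on-demand consistency scrub that reads every stripe and counts parity/mirror mismatches; what it does not do is checksum the data itself or verify on every read. Assuming the array is md0:

    # Kick off a full check of the array (reads all members and
    # recomputes parity / compares mirrors, without repairing anything)
    echo check > /sys/block/md0/md/sync_action

    # Watch progress
    cat /proc/mdstat

    # After the check finishes, the count of inconsistent sectors found
    cat /sys/block/md0/md/mismatch_cnt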
* Re: Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-03 21:54 UTC
To: Anthony Youngman
Cc: linux-raid

2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
> Isn't that what raid 5 does?

Nothing to do with raid-5.

> Actually, iirc, it doesn't read every stripe and check parity on a
> read, because it would clobber performance. But I guess you could have
> a switch to turn it on. It's unlikely to achieve anything.
>
> Barring bugs in the firmware, it's pretty near 100% certain that a drive
> will either return what was written, or return a read error. Drives
> don't return dud data; they have quite a lot of error correction built
> in.

This is wrong.
Sometimes drives return data differently from what was stored, or
store data differently from the original.
In this case, if the real data is "1" and the drive stores "0", then
when you read back "0" no read error is raised, but the data is still
corrupted.

With bit-rot protection this could be caught and fixed: you checksum
the "1" from the source and write that to disk, and if you read back
"0" the checksum no longer matches.

This is what ZFS does. This is what Gluster does. This is what Btrfs does.
Adding this to mdadm could be an interesting feature.
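For comparison, this is what the checksumming filesystems mentioned above expose to the administrator: a scrub walks all data, verifies every block against its stored checksum, and repairs from a good copy where redundancy allows (pool and mount-point names are illustrative):

    # ZFS: verify all blocks against their checksums, repairing from
    # redundancy where possible
    zpool scrub tank
    zpool status tank

    # Btrfs: the same idea on a mounted filesystem
    btrfs scrub start /mnt/data
    btrfs scrub status /mnt/data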
* Re: Bit-Rot
From: Anthony Youngman @ 2017-03-03 22:16 UTC
To: Gandalf Corvotempesta
Cc: linux-raid

On 03/03/17 21:54, Gandalf Corvotempesta wrote:
> 2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
>> Isn't that what raid 5 does?
>
> Nothing to do with raid-5.
>
>> Actually, iirc, it doesn't read every stripe and check parity on a
>> read, because it would clobber performance. But I guess you could have
>> a switch to turn it on. It's unlikely to achieve anything.
>>
>> Barring bugs in the firmware, it's pretty near 100% certain that a drive
>> will either return what was written, or return a read error. Drives
>> don't return dud data; they have quite a lot of error correction built
>> in.
>
> This is wrong.
> Sometimes drives return data differently from what was stored, or
> store data differently from the original.
> In this case, if the real data is "1" and the drive stores "0", then
> when you read back "0" no read error is raised, but the data is still
> corrupted.

Do you have any figures? I didn't say it can't happen, I just said it
was very unlikely.

> With bit-rot protection this could be caught and fixed: you checksum
> the "1" from the source and write that to disk, and if you read back
> "0" the checksum no longer matches.

Or you just read the raid5 parity (which I don't think, by default, is
what happens). That IS your checksum. So if you think the performance
hit is worth it, write the code to add it, and turn it on. Not only
will it detect a bit-flip, but it will tell you which bit flipped, and
let you correct it.

> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
> Adding this to mdadm could be an interesting feature.

Well, seeing as I understand btrfs doesn't do raid5, only raid1, then
of course it needs some way of detecting whether a mirror is corrupt.
I don't know about gluster or ZFS. (I believe raid5/btrfs is currently
experimental, and dangerous.)

But the question remains - is the effort worth it? Can I refer you to a
very interesting article on LWN? About git, which assumes that
"if hash(A) == hash(B) then A == B". And how that was actually MORE
accurate than "if (memcmp(A, B) == 0) then A == B".

Cheers,
Wol
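As a toy illustration of what single parity gives you (plain shell arithmetic on small integers, nothing to do with the actual md implementation): the parity chunk is the XOR of the data chunks, so recomputing it on read flags that something in the stripe changed, and a chunk that is already known to be bad can be rebuilt from the others.

    #!/bin/bash
    # Three "data chunks" and their parity
    d0=0xA5; d1=0x3C; d2=0x0F
    parity=$(( d0 ^ d1 ^ d2 ))

    # Silent corruption: d1 flips a bit, no read error is reported
    d1=$(( d1 ^ 0x10 ))

    # A verifying read notices that the stripe is inconsistent...
    if (( (d0 ^ d1 ^ d2) != parity )); then
        echo "stripe mismatch detected"
    fi

    # ...and if we know d1 is the bad chunk, it can be rebuilt:
    d1=$(( d0 ^ d2 ^ parity ))
    printf 'rebuilt d1 = 0x%02X\n' "$d1"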
* Re: Bit-Rot
From: Chris Murphy @ 2017-03-05 6:01 UTC
To: Anthony Youngman
Cc: Gandalf Corvotempesta, Linux-RAID

On Fri, Mar 3, 2017 at 3:16 PM, Anthony Youngman <antlists@youngman.org.uk> wrote:
> On 03/03/17 21:54, Gandalf Corvotempesta wrote:
>> 2017-03-03 22:41 GMT+01:00 Anthony Youngman <antlists@youngman.org.uk>:
>>> Isn't that what raid 5 does?
>>
>> Nothing to do with raid-5.
>>
>>> Actually, iirc, it doesn't read every stripe and check parity on a
>>> read, because it would clobber performance. But I guess you could have
>>> a switch to turn it on. It's unlikely to achieve anything.
>>>
>>> Barring bugs in the firmware, it's pretty near 100% certain that a drive
>>> will either return what was written, or return a read error. Drives
>>> don't return dud data; they have quite a lot of error correction built
>>> in.
>>
>> This is wrong.
>> Sometimes drives return data differently from what was stored, or
>> store data differently from the original.
>> In this case, if the real data is "1" and the drive stores "0", then
>> when you read back "0" no read error is raised, but the data is still
>> corrupted.
>
> Do you have any figures? I didn't say it can't happen, I just said it
> was very unlikely.

Torn and misdirected writes do happen. There are a bunch of papers on
this problem indicating it's real. This and various other sources of
silent corruption are why ZFS and Btrfs exist.

>> With bit-rot protection this could be caught and fixed: you checksum
>> the "1" from the source and write that to disk, and if you read back
>> "0" the checksum no longer matches.
>
> Or you just read the raid5 parity (which I don't think, by default, is
> what happens). That IS your checksum. So if you think the performance
> hit is worth it, write the code to add it, and turn it on. Not only
> will it detect a bit-flip, but it will tell you which bit flipped, and
> let you correct it.

Parity isn't a checksum. Using it in this fashion is expensive because
it means computing parity for all reads, and it means you can't do
partial stripe reads any more. Next, even once you get a mismatch, it's
ambiguous which strip (mdadm chunk) is corrupt. That would normally be
exposed by the drive reporting an explicit read error. Since that
doesn't exist here, you'd have to fake-fail each strip in turn, rebuild
it from parity, and compare.

>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>> Adding this to mdadm could be an interesting feature.
>
> Well, seeing as I understand btrfs doesn't do raid5, only raid1, then
> of course it needs some way of detecting whether a mirror is corrupt.
> I don't know about gluster or ZFS. (I believe raid5/btrfs is currently
> experimental, and dangerous.)

Btrfs supports raid1, 10, 5 and 6. It's reasonable to consider raid56
experimental because it has a number of gotchas, not least of which is
that certain kinds of writes are not COW, so the COW safeguards don't
always apply in a power failure. As for dangerous, opinions vary, but
something everyone can probably agree on is that any ambiguity about
the stability of a file system looks bad.

> But the question remains - is the effort worth it?

That's the central question. And to answer it, you'd need some sort of
rough design. Where are the csums going to be stored? Do you update
data strips before or after the csums? Either way, if this is not COW,
you have a moment of complete mismatch between data and csums, with
live data. So... that's a big problem, actually. And if you have a
crash or power failure during writes, it's an even bigger problem. Do
you csum the parity?

--
Chris Murphy
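A toy user-space illustration of that ordering problem: if the data is rewritten before its checksum record (or the other way round) and the machine stops between the two steps, the next verify flags a mismatch even though nothing actually rotted, which is why real implementations journal or copy-on-write both together. File names here are illustrative only.

    #!/bin/bash
    data=/tmp/chunk.bin
    csum=/tmp/chunk.sha256

    dd if=/dev/urandom of="$data" bs=4k count=1 2>/dev/null
    sha256sum "$data" > "$csum"     # data and checksum agree

    # A rewrite happens, but we "crash" before updating the checksum
    dd if=/dev/urandom of="$data" bs=4k count=1 conv=notrunc 2>/dev/null

    # The next scrub sees a spurious mismatch on perfectly good data
    sha256sum -c "$csum" || echo "mismatch, but the data is not corrupt"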
* Re: Bit-Rot
From: Mikael Abrahamsson @ 2017-03-05 8:15 UTC
To: Gandalf Corvotempesta
Cc: Anthony Youngman, linux-raid

On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:

> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
> Adding this to mdadm could be an interesting feature.

This has been discussed several times. Yes, it would be interesting.
It's not easy to do because mdadm maps 4k blocks to 4k blocks. The only
way to "easily" add this, I imagine, would be to have an additional
"checksum" block, so that raid6 would require 3 extra drives instead
of 2.

The answer historically has been "patches welcome".

--
Mikael Abrahamsson    email: swmike@swm.pp.se
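A rough back-of-the-envelope on that layout (illustrative numbers, not a proposal): a dedicated checksum block per stripe is far more space than the checksums strictly need, but it is what keeps the 4k-to-4k mapping intact.

    #!/bin/bash
    # Illustrative numbers only
    block=4096      # bytes per data block
    csum=16         # bytes per checksum (e.g. a truncated hash)

    per_block=$(( block / csum ))
    echo "one ${block}-byte checksum block covers ${per_block} data blocks"
    awk -v c="$csum" -v b="$block" \
        'BEGIN { printf "minimum space overhead: %.2f%%\n", 100*c/b }'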
* Re: Bit-Rot
From: Pasi Kärkkäinen @ 2017-03-06 11:56 UTC
To: Mikael Abrahamsson
Cc: Gandalf Corvotempesta, Anthony Youngman, linux-raid

On Sun, Mar 05, 2017 at 09:15:39AM +0100, Mikael Abrahamsson wrote:
> On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:
>
>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>> Adding this to mdadm could be an interesting feature.
>
> This has been discussed several times. Yes, it would be interesting.
> It's not easy to do because mdadm maps 4k blocks to 4k blocks. The only
> way to "easily" add this, I imagine, would be to have an additional
> "checksum" block, so that raid6 would require 3 extra drives instead
> of 2.
>
> The answer historically has been "patches welcome".

There was/is an early prototype implementation of checksums for Linux MD RAID:

http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/
http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-presentation.pdf
http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-code.tar.bz2

There is also the T10 DIF/DIX (Data Integrity Fields / Data Integrity
eXtensions) functionality that could be used, at least if the hardware
setup is SAS-based (SAS HBA + enterprise SAS disks, with firmware on
both that is modern enough to enable DIF/DIX).

I guess MD RAID could also 'emulate' T10 DIF/DIX even if the HBA/disks
don't support it... but I don't know if that makes sense.

-- Pasi
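One quick way to see whether a given disk and HBA already offer that kind of protection (assuming the sg3_utils package is installed; the device name is illustrative, and the exact sysfs entries may vary with kernel version):

    # Does the device report a T10 protection type in READ CAPACITY(16)?
    sg_readcap --long /dev/sdb | grep -i prot

    # Has the kernel registered a block integrity profile for it?
    cat /sys/block/sdb/integrity/format 2>/dev/null
    cat /sys/block/sdb/integrity/read_verify 2>/dev/null
    cat /sys/block/sdb/integrity/write_generate 2>/dev/null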
* Re: Bit-Rot
From: Reindl Harald @ 2017-03-06 12:45 UTC
To: Pasi Kärkkäinen, Mikael Abrahamsson
Cc: Gandalf Corvotempesta, Anthony Youngman, linux-raid

On 06.03.2017 at 12:56, Pasi Kärkkäinen wrote:
> On Sun, Mar 05, 2017 at 09:15:39AM +0100, Mikael Abrahamsson wrote:
>> On Fri, 3 Mar 2017, Gandalf Corvotempesta wrote:
>>
>>> This is what ZFS does. This is what Gluster does. This is what Btrfs does.
>>> Adding this to mdadm could be an interesting feature.
>>
>> This has been discussed several times. Yes, it would be interesting.
>> It's not easy to do because mdadm maps 4k blocks to 4k blocks. The only
>> way to "easily" add this, I imagine, would be to have an additional
>> "checksum" block, so that raid6 would require 3 extra drives instead
>> of 2.
>>
>> The answer historically has been "patches welcome".
>
> There was/is an early prototype implementation of checksums for Linux MD RAID:
>
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-presentation.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-code.tar.bz2

Well, it would already help if the raid-check verified that in a RAID10
both mirrors of a stripe hold identical data, instead of just waiting
for a read error from the drives. When you mix HDD and SSD, after the
first fstrim (which only affects the SSD) the sha1sum of the first
megabytes no longer matches between mirrors, while on a 4-drive HDD
array it still does - and both machines are clones of each other.

________________________________________

Machine 1, running since 2011:

for disk in sda2 sdb2 sdd2 sdc2
do
    echo -n "$disk = "
    dd if=/dev/$disk bs=1M skip=10 count=10 2>/dev/null | sha1sum
done

sda2 = 61efc1017cac02b1be7a95618215485b70a0d18d -
sdb2 = ac4ec9b1a96c9c6bbd9ba196fcb7d6cd2dbb0faa -
sdd2 = ac4ec9b1a96c9c6bbd9ba196fcb7d6cd2dbb0faa -
sdc2 = 61efc1017cac02b1be7a95618215485b70a0d18d -

________________________________________

The same on a cloned machine (just move two of the drives to the other
machine and resync both arrays with two new drives):

sda2 = 766fde5907aebc4dca39e31475b295035c95e3b4 -
sdb2 = 4f4b7f3b8f8893b2fb2f0f8b86944aa88f2cf2b6 -
sdd2 = 940ecae52580759abb33328dc464a937a66339ba -
sdc2 = 9f79a56f0f09bb422a8d40787ca28cb719866e8e -

________________________________________

* remove the 2 HDDs on the mixed machine
* overwrite them with zeros
* re-add them and wait for the rebuild
* sha1sums are identical
* fstrim -a
* sha1sums mismatch
* no alert from "raid-check"

You can repeat that as often as you want, with the same results. The
simple reason is that for the free ext4 blocks within that region, the
SSD returns zeros after fstrim while the HDD doesn't.

________________________________________

And yes, I am aware that someone or some automatism needs to decide
which of the two halves is the truth, but at least it should alert by
default.
* Re: Bit-Rot
From: Brassow Jonathan @ 2017-03-17 15:37 UTC
To: Pasi Kärkkäinen
Cc: Mikael Abrahamsson, Gandalf Corvotempesta, Anthony Youngman, linux-raid

> On Mar 6, 2017, at 5:56 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
>
> There was/is an early prototype implementation of checksums for Linux MD RAID:
>
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-presentation.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-code.tar.bz2
>
> There is also the T10 DIF/DIX (Data Integrity Fields / Data Integrity
> eXtensions) functionality that could be used, at least if the hardware
> setup is SAS-based (SAS HBA + enterprise SAS disks, with firmware on
> both that is modern enough to enable DIF/DIX).
>
> I guess MD RAID could also 'emulate' T10 DIF/DIX even if the HBA/disks
> don't support it... but I don't know if that makes sense.

There is a device-mapper target that is designed to do precisely this -
dm-integrity (see the dm-devel mailing list). It is currently being
developed as part of an authenticated-encryption project, but it could
be used for this too. Note that there is a performance penalty that
comes from emulating this.

brassow
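As a sketch of how such a standalone integrity layer can sit underneath md, assuming the integritysetup utility from the cryptsetup package is available (the exact options shown are an assumption - check the man page of your version; device names are illustrative):

    # Format each member with per-sector checksums, then open it as a
    # mapped device (sha256 here; a CRC is the usual default)
    integritysetup format --integrity sha256 /dev/sdb
    integritysetup open   --integrity sha256 /dev/sdb int-sdb

    integritysetup format --integrity sha256 /dev/sdc
    integritysetup open   --integrity sha256 /dev/sdc int-sdc

    # Build the md array on top of the integrity-backed mappings; a
    # checksum failure then surfaces to md as a read error, which md
    # can repair from the other mirror
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/mapper/int-sdb /dev/mapper/int-sdc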
* Re: Bit-Rot
From: Gandalf Corvotempesta @ 2017-03-17 16:59 UTC
To: Brassow Jonathan
Cc: Pasi Kärkkäinen, Mikael Abrahamsson, Anthony Youngman, linux-raid

2017-03-17 16:37 GMT+01:00 Brassow Jonathan <jbrassow@redhat.com>:
> There is a device-mapper target that is designed to do precisely this -
> dm-integrity (see the dm-devel mailing list). It is currently being
> developed as part of an authenticated-encryption project, but it could
> be used for this too. Note that there is a performance penalty that
> comes from emulating this.

Probably something similar could be obtained by checking, during a
scrub, the majority of responses from all replicas. A sort of quorum.

If you have a 3-way mirror and two disks reply with "1" while the other
replies with "0", the disk with "0" has hit bit rot.

Is mdadm able to make this decision? In a 2-way mirror it would be
impossible, as you can't know which disk has the correct data, but in a
3-way mirror you have a majority. Probably the same could be done in
RAID-6, where you have two parities to evaluate.
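A minimal user-space sketch of that majority vote, assuming a 3-way mirror whose members are sdb2, sdc2 and sdd2 (names and region chosen arbitrarily); it only identifies the odd one out and does not repair anything:

    #!/bin/bash
    # Hash the same region on each mirror member and flag the minority
    members=(sdb2 sdc2 sdd2)
    declare -A hash_of

    for m in "${members[@]}"; do
        hash_of[$m]=$(dd if=/dev/$m bs=1M skip=10 count=10 2>/dev/null | sha1sum | cut -d' ' -f1)
    done

    for m in "${members[@]}"; do
        votes=$(for x in "${members[@]}"; do echo "${hash_of[$x]}"; done | grep -c "${hash_of[$m]}")
        if [ "$votes" -eq 1 ]; then
            # If all three differ, there is no majority and all get flagged
            echo "$m disagrees with the other two mirrors"
        fi
    done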