* Fw: Why does one get mismatches?
@ 2010-01-20 11:52 Jon Hardcastle
2010-01-22 18:13 ` Goswin von Brederlow
2010-02-01 21:18 ` Bill Davidsen
0 siblings, 2 replies; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-20 11:52 UTC (permalink / raw)
To: linux-raid
--- On Tue, 19/1/10, Jon Hardcastle <jd_hardcastle@yahoo.com> wrote:
> From: Jon Hardcastle <jd_hardcastle@yahoo.com>
> Subject: Why does one get mismatches?
> To: linux-raid@vger.kernel.org
> Date: Tuesday, 19 January, 2010, 10:04
> Hi,
>
> I kicked off a check/repair cycle on my machine after I
> moved the physical ordering of my drives around, and I am now
> on my second check/repair cycle and it keeps finding
> mismatches.
>
> Is it correct that the mismatch value after a repair should
> equal the value reported by the preceding check? What if it
> doesn't? And what does it mean if another check STILL reveals
> mismatches?
>
> I had something similar after I reshaped from RAID 5 to 6: I
> had to run check/repair/check/repair several times before I
> got my 0.
>
>
Guys,

Anyone got any suggestions here? I am now on my ~5th check/repair, and after a reboot the first check is still returning 8.

All I have done is move the drives around. It is the same controllers/cables/etc.

I really don't like the seemingly random nature of whatever can/does/has caused these mismatches.
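For readers following along: the check and repair passes discussed here are driven through sysfs, and mismatch_cnt reports the number of inconsistent sectors found by the last pass. A minimal sketch of the cycle, assuming the array is md0 (the name is illustrative):

    # start a read-only consistency check
    echo check > /sys/block/md0/md/sync_action

    # wait for it to finish (sync_action returns to "idle")
    while grep -q check /sys/block/md0/md/sync_action; do sleep 60; done

    # sectors found inconsistent by the check
    cat /sys/block/md0/md/mismatch_cnt

    # rewrite inconsistent stripes
    echo repair > /sys/block/md0/md/sync_action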
* Re: Fw: Why does one get mismatches?
@ 2010-01-22 18:13 ` Goswin von Brederlow
1 sibling, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-22 18:13 UTC (permalink / raw)
To: Jon; +Cc: linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Anyone got any suggestions here? I am now on my ~5th check/repair, and
> after a reboot the first check is still returning 8. All I have done
> is move the drives around. It is the same controllers/cables/etc.

There is some unknown corruption going on with RAID1 that causes
mismatches, but it is believed that it will never occur on any block
that is actually in use. Swapping is a likely cause.

Any swap device on the raid? Try turning that off.
If that doesn't help, try unmounting filesystems or remounting them
read-only.

MfG
        Goswin
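Goswin's elimination steps translate to something like the following sketch (device and mount point names are illustrative):

    # is any swap device backed by the array?
    cat /proc/swaps
    swapoff /dev/md1              # if so, take it out of service

    # quiesce the filesystems on the array
    umount /mnt/data              # or: mount -o remount,ro /mnt/data

    # then re-run a check with the array otherwise idle
    echo check > /sys/block/md0/md/sync_action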
* Re: Fw: Why does one get mismatches?
@ 2010-01-24 17:40 ` Jon Hardcastle
0 siblings, 2 replies; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-24 17:40 UTC (permalink / raw)
To: Jon, Goswin von Brederlow; +Cc: linux-raid

--- On Fri, 22/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:

> There is some unknown corruption going on with RAID1 that causes
> mismatches, but it is believed that it will never occur on any block
> that is actually in use. Swapping is a likely cause.
>
> Any swap device on the raid? Try turning that off.
> If that doesn't help, try unmounting filesystems or remounting them
> read-only.

Hello, my usual savior Goswin!

The deal is: it is a 7-drive RAID 6 array. It has LVM on it and is not
used for swapping. I have unmounted all the LVs and still got
mismatches. I ran smartctl --test=long on all drives - nothing. I have
now dismantled the array and am 3/4 of the way through 'badblocks -svn'
on each of the component drives. I have a hunch that it may be a dodgy
SATA cable, but have no evidence: no errors in the logs, nothing in
dmesg.

Is there any way to get more information? I am starting to think this
only started after I changed from RAID 5 to 6... which I did less than
a month ago.

The only lead I have is that whilst doing the badblocks run, one drive
ran at ~10-15 MB/s whereas the rest are going at ~30. I have another
identical model drive coming, so I will see if that one is slow too.
But the lack of logging info is not helpful and worrying, and the
prospect of silent corruption is a big worry!
* Re: Fw: Why does one get mismatches?
@ 2010-01-24 21:52 ` Roger Heflin
1 sibling, 0 replies; 70+ messages in thread
From: Roger Heflin @ 2010-01-24 21:52 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

Jon Hardcastle wrote:
> The only lead I have is that whilst doing the badblocks run, one drive
> ran at ~10-15 MB/s whereas the rest are going at ~30. [...] But the
> lack of logging info is not helpful and worrying, and the prospect of
> silent corruption is a big worry!

It is possible that the reads are sometimes being corrupted. I have
seen a couple of different controllers fail in ways that produce read
corruption. Basically, you put 50 or so largish files with the same
checksum on the disk (50 x the file size needs to be at least 2x
greater than RAM, so the reads cannot be satisfied from cache), then
you cksum all of the files repeatedly and see if any checksum changes.
If it does, the "bad" file will move around from run to run, which
tells you the data on disk is OK and the reads are at fault.

I have seen controllers from a couple of different companies fail this
way; usually it comes from a bad PCI interface chip, or a bad
configuration (bus clocked too fast) causing PCI parity errors. I had
one controller fail outright and cause errors (replacing it with a
spare corrected things). In a second case I found that the motherboard
was running the PCI bus too fast for the number of cards: FC cards from
two different companies failed, each in a slightly different way - one
silently corrupted, the other crashed the machine at about the time an
error would have been expected - and the issue went away once I slowed
the bus down one step (PCI-X 133 -> PCI-X 100, or PCI-X 100 ->
PCI-X 66).

In both cases I did not find any write corruption, but found read
corruption often. If this happens on a RAID 5 device it would be bad
whenever you had to use parity: a corrupt read would mean the
regenerated parity would be wrong, and a later restore from that parity
would produce corrupted data.

I don't know how strong the internal SATA link protection is: if it
uses CRCs, errors on the cable are almost impossible; if it uses
parity, errors are easy. The PCI bus uses parity, so it is fairly easy
for errors to get through there, but I have only seen them very
rarely - maybe 5 times in 10,000 machine-years of operation (2000+
machines for several years).
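A sketch of the read-corruption test Roger describes (file names and sizes are illustrative; the point is that the working set must be well over the size of RAM so the reads really hit the disk rather than the page cache):

    # create 50 identical large files, here 1 GiB each
    dd if=/dev/urandom of=/mnt/test/template bs=1M count=1024
    for i in $(seq 1 50); do cp /mnt/test/template /mnt/test/file$i; done

    # checksum them repeatedly; with a controller corrupting reads,
    # the odd checksum out moves between files from run to run
    cksum /mnt/test/file* > /tmp/run1.sums
    cksum /mnt/test/file* > /tmp/run2.sums
    diff /tmp/run1.sums /tmp/run2.sums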
* Re: Fw: Why does one get mismatches?
@ 2010-01-24 23:13 ` Goswin von Brederlow
1 sibling, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-24 23:13 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Is there any way to get more information? I am starting to think this
> only started after I changed from RAID 5 to 6... which I did less than
> a month ago.

You did run a repair pass, and not just repeated check passes, right?
Check itself only counts the mismatches but does not correct them.
If the raid is unused (vgchange -a n) and you first repair and then
check, then that definitely should not find any mismatches.

MfG
        Goswin
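Spelled out, the idle repair-then-check sequence Goswin describes looks something like this (md0 and the volume group name are illustrative):

    vgchange -a n vg_data              # deactivate the LVs so nothing writes

    echo repair > /sys/block/md0/md/sync_action
    while grep -q repair /sys/block/md0/md/sync_action; do sleep 60; done

    echo check > /sys/block/md0/md/sync_action
    while grep -q check /sys/block/md0/md/sync_action; do sleep 60; done

    cat /sys/block/md0/md/mismatch_cnt # should now read 0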
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 10:07 ` Jon Hardcastle
0 siblings, 1 reply; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-25 10:07 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

--- On Sun, 24/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:

> You did run a repair pass, and not just repeated check passes, right?
> Check itself only counts the mismatches but does not correct them.
> If the raid is unused (vgchange -a n) and you first repair and then
> check, then that definitely should not find any mismatches.

Hello!

Yes, I have a simple script that first does a check and then, if there
are mismatches, does a repair. I have then been manually rerunning a
check, and I keep getting mismatches. It goes like this: 232, 8, 24, 8,
8, 16, 16, 24, 24, 8, 16, 24. But I have also done this manually and
run several repairs in a row (assuming that a repair will return 0 if
no work is to be done).

Now the array is completely dismantled and I am running badblocks on
the drives, but I am on the last 2 of the 7 drives and I still have no
leads. No bad blocks, no offline uncorrectable sectors, no pending
sectors, no dmesg errors, no nothing. I have absolutely no leads
whatsoever.

The only things I have left to try are a full memtest, disconnecting
and reseating the additional SATA controllers, and buying 7 new SATA
cables in case one is bad. But it would be REALLY helpful to know on
which drive the mismatches have occurred. Any help here would be
gratefully received!

I might even try converting the array back to RAID 5, as I remember I
had mismatches immediately after I converted from 5 to 6.
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 10:37 ` Goswin von Brederlow
0 siblings, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-25 10:37 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Now the array is completely dismantled and I am running badblocks on
> the drives, but I am on the last 2 of the 7 drives and I still have no
> leads. No bad blocks, no offline uncorrectable sectors, no pending
> sectors, no dmesg errors, no nothing.

The problem with badblocks is that it writes the same pattern
everywhere. If the problem is that data gets read from or written to
the wrong block, then that will not show up.

Try formatting each drive and running fstest [1] on it, or some other
test that verifies data integrity using a different pattern per block.

MfG
        Goswin

[1] http://mrvn.homeip.net/fstest/
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 10:52 ` Jon Hardcastle
0 siblings, 2 replies; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-25 10:52 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

--- On Mon, 25/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:

> The problem with badblocks is that it writes the same pattern
> everywhere. If the problem is that data gets read from or written to
> the wrong block, then that will not show up.
>
> Try formatting each drive and running fstest [1] on it, or some other
> test that verifies data integrity using a different pattern per block.

This is going to be a time-consuming process, as I'll have to remove
each drive from the array one at a time, test it, and then resync.

Thanks for the link, but could a similar result be achieved with the -w
option of badblocks? Or perhaps a dd if=/dev/urandom? Hmm, scratch
that - urandom won't work, as you need to be able to compare what you
read back against what you wrote.

It is just a worry, as I clearly have mismatches and therefore
corrupted data.
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 17:32 ` Goswin von Brederlow
1 sibling, 0 replies; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-25 17:32 UTC (permalink / raw)
To: Jon; +Cc: linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Thanks for the link, but could a similar result be achieved with the
> -w option of badblocks? Or perhaps a dd if=/dev/urandom? Hmm, scratch
> that - urandom won't work, as you need to be able to compare what you
> read back against what you wrote.

No. You obviously should use the -w option in badblocks. Doing a
read-only test is completely pointless, as the raid check has already
tested a read of every block without errors (I assume).

But -w will write one pattern to the whole disk, then read and compare,
and then repeat for the next pattern. If the disk messes up the address
of blocks, that won't be detected. E.g. I had a raid enclosure that
dropped a bit in the block address every once in a while. You get
really interesting corruption with that.

MfG
        Goswin
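One way to get a per-block pattern without fstest is to stamp every block with its own index and verify the stamps on read-back; a misdirected read or write then shows up as a stamp that does not match its location. A destructive sketch (the device name is illustrative; this stamps only the first bytes of each 1 MiB block, which catches addressing errors but is not a full surface test):

    dev=/dev/sdX        # component drive - ALL DATA ON IT IS DESTROYED
    blocks=$(( $(blockdev --getsize64 $dev) / 1048576 ))

    # write phase: stamp each 1 MiB block with its index
    for i in $(seq 0 $((blocks - 1))); do
        printf 'block %012d ' $i | dd of=$dev bs=1M seek=$i conv=notrunc 2>/dev/null
    done

    # verify phase: every stamp must match the block it was read from
    for i in $(seq 0 $((blocks - 1))); do
        got=$(dd if=$dev bs=1M skip=$i count=1 2>/dev/null | head -c 19)
        [ "$got" = "$(printf 'block %012d ' $i)" ] || echo "mismatch at block $i"
    done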
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 19:32 ` Iustin Pop
1 sibling, 0 replies; 70+ messages in thread
From: Iustin Pop @ 2010-01-25 19:32 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

On Mon, Jan 25, 2010 at 02:52:58AM -0800, Jon Hardcastle wrote:
> This is going to be a time-consuming process, as I'll have to remove
> each drive from the array one at a time, test it, and then resync.
>
> It is just a worry, as I clearly have mismatches and therefore
> corrupted data.

Just a comment from the benches here: looking at all the tests you have
done, my personal opinion is that this is *not* a hardware problem of
any kind, but possibly some MD software issue. I've never seen such a
high percentage of consistent and silent corruption in hardware; to me
it looks like corruption in the software, *if it is corruption at all*.

I would run a counter-test, to see at least if the 'check' result is
right:

- run your array until 'check' returns mismatches
- shut down the array
- check that the contents of the drives is indeed different, using
  something other than 'check' (e.g. checksum each 1MB block on the
  drives independently, and compare the checksum lists)
- if there are indeed diffs, start the array and run a repair (but send
  no other traffic to the array)
- shut down the array and re-run the external diff test

These tests should tell you whether check is right, and whether repair
indeed fixes the differences.

And another side note: it would be really good if md had a debug option
to actually show the checksums for the differing blocks and their
offsets, to at least see whether the same areas of the drive show
differences (it would be really funny if the diffs were, for example,
in the MD metadata :). Or does md already have something like this?
I stopped using md a year or so ago.

regards,
iustin
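A sketch of the external diff test Iustin proposes: with the array stopped, checksum each 1 MiB block of each component drive and keep one list per drive (device names are illustrative; the one-dd-per-block loop is slow but simple):

    for dev in sdb sdc sdd; do
        blocks=$(( $(blockdev --getsize64 /dev/$dev) / 1048576 ))
        for i in $(seq 0 $((blocks - 1))); do
            dd if=/dev/$dev bs=1M skip=$i count=1 2>/dev/null | md5sum
        done > /tmp/$dev.before
    done

    # assemble the array, run a repair, stop it again, dump the lists
    # a second time as /tmp/<dev>.after, then:
    diff /tmp/sdb.before /tmp/sdb.after   # shows which blocks repair rewrote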
* Re: Fw: Why does one get mismatches?
@ 2010-02-01 21:18 ` Bill Davidsen
1 sibling, 1 reply; 70+ messages in thread
From: Bill Davidsen @ 2010-02-01 21:18 UTC (permalink / raw)
To: Jon; +Cc: linux-raid

Jon Hardcastle wrote:
> Anyone got any suggestions here? I am now on my ~5th check/repair, and
> after a reboot the first check is still returning 8.
>
> All I have done is move the drives around. It is the same
> controllers/cables/etc.
>
> I really don't like the seemingly random nature of whatever
> can/does/has caused these mismatches.

If you have an ext[34] filesystem on this array, try mounting it with
data=journal (yes, it will slow down; this is a TEST). I did limited
testing using this, and it appeared to solve the problem, at least for
the eight hours I had to test.

Comment: when there is a three-way RAID-1, why doesn't repair *vote* on
the correct value instead of just making a guess?

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
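For reference, the test Bill suggests is just a mount option; a sketch (paths are illustrative, and data=journal must be set when the filesystem is mounted, not switched on later by a plain remount):

    # full data journalling: file data goes through the journal too
    mount -o data=journal /dev/md0 /mnt/test

    # or set it persistently in the superblock's default mount options
    tune2fs -o journal_data /dev/md0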
* Re: Why does one get mismatches?
@ 2010-02-01 22:37 ` Neil Brown
0 siblings, 1 reply; 70+ messages in thread
From: Neil Brown @ 2010-02-01 22:37 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Jon, linux-raid

On Mon, 01 Feb 2010 16:18:23 -0500 Bill Davidsen <davidsen@tmr.com> wrote:

> Comment: when there is a three-way RAID-1, why doesn't repair *vote*
> on the correct value instead of just making a guess?

Because truth is not democratic.

(And I defy you to define "correct" in any general way in this
context.)

NeilBrown
* Re: Why does one get mismatches?
@ 2010-02-02 15:11 ` Bill Davidsen
0 siblings, 2 replies; 70+ messages in thread
From: Bill Davidsen @ 2010-02-02 15:11 UTC (permalink / raw)
To: Neil Brown; +Cc: Jon, linux-raid

Neil Brown wrote:
> Because truth is not democratic.
>
> (And I defy you to define "correct" in any general way in this
> context.)

If you are willing to accept that the reconstructed data from RAID-[56]
is "correct", then the data from a RAID-1 majority opinion is "correct".
If you say that such recovered data is "most likely to match what was
written", then data consistent on (N+1)/2 drives of a RAID-1 should be
viewed in the same light. Call it "most likely to be correct" if you
prefer, but a value picked from a drive at random is less likely to be.

This whole discussion simply shows that for RAID-1, software RAID is
less reliable than hardware RAID (no, I don't mean fake-RAID), because
it doesn't pin the data buffer until all copies are written.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
* Re: Why does one get mismatches?
@ 2010-02-03 11:17 ` Goswin von Brederlow
1 sibling, 0 replies; 70+ messages in thread
From: Goswin von Brederlow @ 2010-02-03 11:17 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Neil Brown, Jon, linux-raid

Bill Davidsen <davidsen@tmr.com> writes:

> If you are willing to accept that the reconstructed data from
> RAID-[56] is "correct", then the data from a RAID-1 majority opinion
> is "correct".

Let's ignore for now the fact that software raid seems to write
inconsistent data only to unused blocks. If a block is really unused,
it doesn't matter what is done with it. And if it is used, then
software raid has a big bug that needs to be fixed, not repaired after
the fact.

So let's assume there actually is a true mismatch because one of the
drives returns false data on read. Then in raid1/10 with >2 copies, and
in raid6, you have a way to detect the correct data - correct as in
most likely to be what was written originally. For raid6 that means
identifying the one drive whose data leaves the rest with correct
parity, and for raid1/10 it means finding the majority.

Say you have a 10-way raid1 with 9 blocks holding the same data and one
that differs. Picking a random block is wrong 10% of the time. Do you
really think that in 10% of the cases 9 disks will be corrupt in
exactly the same way?

MfG
        Goswin
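md does not implement any such vote, but the idea is easy to illustrate offline: read the same block from every mirror and see which content the majority holds. A sketch (device names and the block number are illustrative, and it ignores the md metadata/data offset):

    block=123456
    for dev in sdb1 sdc1 sdd1; do
        dd if=/dev/$dev bs=4k skip=$block count=1 2>/dev/null |
            md5sum | sed "s|-\$|$dev|"
    done | sort | uniq -c -w 32 | sort -rn
    # the top line is the majority version; a count of 1 is the odd one out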
* Re: Why does one get mismatches?
@ 2010-02-11 5:14 ` Neil Brown
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-11 5:14 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Jon, linux-raid

On Tue, 02 Feb 2010 10:11:03 -0500 Bill Davidsen <davidsen@tmr.com> wrote:

> This whole discussion simply shows that for RAID-1, software RAID is
> less reliable than hardware RAID (no, I don't mean fake-RAID), because
> it doesn't pin the data buffer until all copies are written.

That doesn't make it less reliable. It just makes it more confusing.

But for a more complete discussion of raid recovery, and of when it
might be sensible to "vote" among the blocks, see

  http://neil.brown.name/blog/20100211050355

NeilBrown
* Re: Why does one get mismatches?
@ 2010-02-11 17:51 ` Bryan Mesich
1 sibling, 1 reply; 70+ messages in thread
From: Bryan Mesich @ 2010-02-11 17:51 UTC (permalink / raw)
To: Neil Brown; +Cc: Bill Davidsen, Jon, linux-raid

On Thu, Feb 11, 2010 at 04:14:44PM +1100, Neil Brown wrote:
> > This whole discussion simply shows that for RAID-1, software RAID is
> > less reliable than hardware RAID (no, I don't mean fake-RAID),
> > because it doesn't pin the data buffer until all copies are written.
>
> That doesn't make it less reliable. It just makes it more confusing.

I agree that Linux software RAID is no less reliable than hardware RAID
with regards to the above conversation. It is, however, confusing to
have a counter that indicates there are problems with a RAID 1 array
when in fact there are none.

I (and I'm sure others) value your expertise on this matter, but it's
hard to feel at ease when the car you're driving across the country has
the check-engine light on. In this case, I believe the mechanic when
you say the car is okay, but it might be difficult for others to
believe as I do.

I rely heavily on software RAID, as I'm sure many others do. I believe
this is quite evident in the amount of email that has been circulated
about the mismatch_cnt "problem". IMO, a user's perception of
reliability is really the root of the problem in this case. No one who
depends on this stuff wants to see weakness, and those who do see it
are going to be concerned - especially those running distributions such
as RedHat/Fedora that do weekly checks on their arrays.

Neil, you mentioned some time ago that you were going to create a patch
that would show where the mismatches are located on disk. Did you do
this, and if so, where can I find the patch?

Bryan
* Re: Why does one get mismatches?
@ 2010-02-16 21:25 ` Bill Davidsen
0 siblings, 1 reply; 70+ messages in thread
From: Bill Davidsen @ 2010-02-16 21:25 UTC (permalink / raw)
To: Bryan Mesich, Neil Brown, Jon, linux-raid

Bryan Mesich wrote:
> I agree that Linux software RAID is no less reliable than hardware
> RAID with regards to the above conversation. It is, however, confusing
> to have a counter that indicates there are problems with a RAID 1
> array when in fact there are none.

Sorry, but real hardware RAID is more reliable than software RAID, and
Neil's justification for not doing smart recovery mentions why. Note
that this refers to real hardware RAID, not fakeraid, which is just
some firmware in a BIOS driving the existing hardware.

The issue lies with data changing between the writes to the individual
drives. In hardware RAID the data traverses the memory bus once, and
only once, and goes into the controller's cache, from which it is
written to all mirrored drives. With software RAID an individual write
is done to each drive, and if the data in the buffer changes between
the write to one drive and the write to another, you get different
values on disk. Neil may be convinced that the OS somehow "knows" which
of the mirror copies is correct, i.e. most recent, and never uses the
stale data, but if that information were really available, reads would
always return the latest value, and it would not be possible to read
the same file multiple times and get different MD5sums. It would also
be possible to do a stable smart recovery by propagating the most
recent copy to the other mirror drives.

I had hoped that mounting with data=journal would lead to consistency;
that seems not to be true either.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
* Re: Why does one get mismatches?
@ 2010-02-16 21:38 ` Steven Haigh
0 siblings, 2 replies; 70+ messages in thread
From: Steven Haigh @ 2010-02-16 21:38 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Bryan Mesich, Neil Brown, Jon, linux-raid

On Tue, 16 Feb 2010 16:25:25 -0500, Bill Davidsen <davidsen@tmr.com> wrote:
> The issue lies with data changing between the writes to the individual
> drives. In hardware RAID the data traverses the memory bus once, and
> only once, and goes into the controller's cache, from which it is
> written to all mirrored drives. With software RAID an individual write
> is done to each drive, and if the data in the buffer changes between
> the write to one drive and the write to another, you get different
> values on disk.

I agree, Bill - there is an issue with software RAID1 when it comes to
some hardware. I have one machine where the ONLY way to stop the root
filesystem going read-only due to journal issues is to remove RAID.
Having RAID1 enabled gives silent corruption of both data and the
journal at seemingly random times.

I can see the data corruption by running a verify between RPM and the
data on the drive. Reinstalling the affected packages fixes things -
until some other random things get corrupted the next time.

The myth that data corruption in RAID1 ONLY happens to swap and/or
unused space on a drive is absolute rubbish.

-- 
Steven Haigh
Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
* Re: Why does one get mismatches?
@ 2010-02-17 3:19 ` Bryan Mesich
1 sibling, 0 replies; 70+ messages in thread
From: Bryan Mesich @ 2010-02-17 3:19 UTC (permalink / raw)
To: Steven Haigh; +Cc: Bill Davidsen, Neil Brown, Jon, linux-raid

On Wed, Feb 17, 2010 at 08:38:11AM +1100, Steven Haigh wrote:
> I agree, Bill - there is an issue with software RAID1 when it comes to
> some hardware. I have one machine where the ONLY way to stop the root
> filesystem going read-only due to journal issues is to remove RAID.
> Having RAID1 enabled gives silent corruption of both data and the
> journal at seemingly random times.

Maybe I missed something earlier in this thread, and if so I apologize,
but I was not aware of anyone reporting filesystem corruption due to
software RAID 1. Needless to say, that is a serious problem if it is
occurring. At work we use software RAID 1 on the majority of our
production servers and have never seen problems such as you describe.
I'm not trying to discredit you - just noting that we have not seen
similar results.

> I can see the data corruption by running a verify between RPM and the
> data on the drive. Reinstalling the affected packages fixes things -
> until some other random things get corrupted the next time.

For curiosity's sake, what kind of files did RPM report as corrupt
after running the verify? The reason I ask is that I would expect user
data to become corrupt before system files, as the latter are typically
written to disk at install/update time and never written again. Or
maybe there is a reason... correct me if I'm wrong ;)

In my last post I asked Neil if he had a patch that would indicate
where the mismatches live on disk. Have you found a way to correlate
the mismatches with your filesystem corruption?

Bryan
* Re: Why does one get mismatches?
@ 2010-02-17 23:05 ` Neil Brown
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-17 23:05 UTC (permalink / raw)
To: Steven Haigh; +Cc: Bill Davidsen, Bryan Mesich, Jon, linux-raid

On Wed, 17 Feb 2010 08:38:11 +1100 Steven Haigh <netwiz@crc.id.au> wrote:

> I can see the data corruption by running a verify between RPM and the
> data on the drive. Reinstalling the affected packages fixes things -
> until some other random things get corrupted the next time.

Sounds very much like dodgy drives.

> The myth that data corruption in RAID1 ONLY happens to swap and/or
> unused space on a drive is absolute rubbish.

Absolute rubbish does seem to be a suitable phrase here.

There is no question of data corruption. When memory changes between
being written to one device and to another, this does not cause
corruption, only inconsistency. Either the block will be written again
consistently soon, or it will never be read.

If the host crashes before the blocks are made consistent, the
inconsistency will not be visible, as the resync will fix it.

If you are getting any corruption, then it is NOT due to this facet of
the RAID1 implementation - it is due to something else. My guess is bad
hardware, anywhere from the memory to the hard drives.

NeilBrown
* Re: Why does one get mismatches?
@ 2010-02-19 15:18 ` Piergiorgio Sartor
0 siblings, 1 reply; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-02-19 15:18 UTC (permalink / raw)
To: Neil Brown; +Cc: Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid

Hi,

> When memory changes between being written to one device and to
> another, this does not cause corruption, only inconsistency. Either
> the block will be written again consistently soon, or it will never
> be read.

well, is this for sure? I mean, by design of the md subsystem? Or is it
like that because we trust the filesystem?

And why is it like that? Why not use the good old readers-writer
mechanism to make sure all copies of a block are the same when they are
written (namely, a lock)?

It seems to me, though maybe I'm wrong, not such a safe design. I
assume it should not be possible to cause this situation unless there
is a crash or a bug in the md layer.

What if a new filesystem writes a block, changing it on the fly, i.e.
during the RAID-1 writes, and then, later, reads this block again? It
will get, maybe, not the correct data.

In other words, would it not be better for the md layer to be robust
against these kinds of threats?

bye,

-- 
piergiorgio
* Re: Why does one get mismatches?
@ 2010-02-19 22:02 ` Neil Brown
0 siblings, 4 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-19 22:02 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid

On Fri, 19 Feb 2010 16:18:09 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote:

> well, is this for sure? I mean, by design of the md subsystem? Or is
> it like that because we trust the filesystem?

It is because we trust the filesystem.

> And why is it like that? Why not use the good old readers-writer
> mechanism to make sure all copies of a block are the same when they
> are written (namely, a lock)?

md is not in a position to lock the page - there is simply no way it
can stop the filesystem from changing it. The only thing it could do
would be to make a copy, then write the copy out. This would incur a
performance cost.

> It seems to me, though maybe I'm wrong, not such a safe design.

I think you are wrong.

> I assume it should not be possible to cause this situation unless
> there is a crash or a bug in the md layer.

I'm not sure what situation you are referring to...

> What if a new filesystem writes a block, changing it on the fly, i.e.
> during the RAID-1 writes, and then, later, reads this block again? It
> will get, maybe, not the correct data.

This is correct. However it would be equally correct if you were
talking about a normal disk drive rather than a RAID1 pair.

If the filesystem changes the page (or allows it to change) while a
write is pending, then it cannot know what actual data was written. So
it must write the block out again before it ever reads it in. RAID1 is
no different from any other device in this respect.

> In other words, would it not be better for the md layer to be robust
> against these kinds of threats?

Possibly, but at what cost?

There are two ways that I can imagine to 'solve' this issue.

1/ Always copy the page before writing. This would incur a significant
   overhead, both in the complexity of pre-allocating memory and in the
   delay taken to perform the copy. And it would very rarely actually
   be needed.

2/ Have the filesystem protect the page from changes while it is being
   written. This is quite possible for the filesystem to do (while it
   is impossible for md to do). There could be some performance cost
   with memory-mapped pages, as they would need to be unmapped, but
   there would be no significant cost for reads, writes, and filesystem
   metadata operations.

   Further, any filesystem that wants to make use of the integrity
   checks that newer drives provide (where the filesystem provides a
   'checksum' for the block which gets passed all the way down, written
   to storage, and returned on a read) will need to do this anyway. So
   it is likely that in the near future all significant filesystems
   will provide all the guarantees md needs in order to simply do
   nothing different.

So my feeling is that md is doing the best thing already.

I believe 'swap' will always be an issue, as unmapping swap pages
during write could be a serious performance cost.

It might be that the best thing to do with swap is to somehow mark the
area of an array used for swap as "don't care", so md never bothers to
resync it, and never reports inconsistencies there, as they really are
not an issue.

NeilBrown
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown @ 2010-02-19 22:37 ` Piergiorgio Sartor 2010-02-19 23:34 ` Asdo ` (2 subsequent siblings) 3 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-19 22:37 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid Hi, > > Or it is like that because we trust the filesystem? > > It is because we trust the filesystem. well, I hope the trust is not misplaced... :-) > md is not in a position to lock the page - there is simply no way it can stop > the filesystem from changing it. How can this be? > The only thing it could do would be to make a copy, then write the copy out. Even making a copy would not be safe, since during the copy the data could still change, or not? > This would incur a performance cost. It's a matter of deciding what is more important. > > It seems to me, maybe I'm wrong, not a so safe design. > > I think you are wrong. Could be, I never heard of situations like this. > > I assume, it should not be possible to cause this > > situation, unless there is a crash or a bug in the > > md layer. > > I'm not sure what situation you are referring to... It should not be possible to cause that different mirrors of a RAID-1 end up with different data. Otherwise, no point to have the mirroring. > > What if a new filesystem will write a block, changing > > on the fly, i.e. during RAID-1 writes, and then, later, > > reading this block again? > > > > It will get, maybe, not the correct data. > > This is correct. However it would be equally correct if you were talking > about s normal disk drive rather than a RAID1 pair. Nono, there is a huge difference. In a single drive case, the FS is responsible of writing rubbish to a single block. The result would be that a block has "strange" data, but *always* the same data. Here the situation is that the data might be "strange", but different accesses, to the same block of the RAID-1, could potentially return different data. As a byproduct of this effect, the "check" functionality becomes not so useful anymore. > If the filesystem changes the page (or allows it to change) while a write is > pending, then it cannot know what actual data was written. So it must write > the block out again before it ever reads it in. > RAID1 is no different to any other device in this respect. Is different, as mentioned above. The FS could, intentionally, change the data during a write, but later it could expect to have always the same data. In other words, the FS does not guarantee the "spatial" consistency of the data (the bytes in a block), but the "temporal" consistency (successive reads return always the same data) could be expected. And this happens in case of a normal HDD. It does not happen in RAID-1. > Possibly, but at what cost? As I wrote: it is matter to decide what is more important and useful. > There are two ways that I can imagine to 'solve' this issue. > > 1/ always copy the page before writing. This would incur a significant > overhead, both in the complexity of pre-allocation memory and in the > delay taken to perform the copy. And it would very rarely be actually > needed. Does really a copy solve the issue? Is the copy done in atomic way? The pre-allocation does not seem to me to be a problem, since it will be done once and for all (at device creation), and not dynamically. 
The copy *might* be an overhead; nevertheless I wonder if it is really so much of a problem, especially considering that, after the copy, the MD layer can optimize the transaction to the HDDs as much as it likes. > 2/ Have the filesystem protect the page from changes while it is being > written. This is quite possible for the filesystem to do (while it > is impossible for md to do). There could be some performance I'm really curious to understand what kind of thinking is behind a design allowing such a situation... I mean *system* design, not md design. > cost with memory-mapped pages as they would need to be unmapped, > but there would be no significant cost for reads, writes, and filesystem > metadata operations. > Further, any filesystem that wants to make use of the integrity checks > that newer drives provide (where the filesystem provides a 'checksum' for > the block which gets passed all the way down and written to storage, and > returned on a read) will need to do this anyway. So it is likely that in > the near future all significant filesystems will provide all the > guarantees md needs in order to simply do nothing different. That's good to know. > So my feeling is that md is doing the best thing already. I do not think this is an md issue per se; it seems to me, from the description, that this is an overall design issue. Normally, also for performance reasons, one approach is to allocate queue(s) of buffers between two modules (like FS and MD), where each of the modules always has *exclusive* access to its own buffer(s), i.e. the buffer(s) it holds in a given time frame. Once a module releases the buffer(s), it cannot touch them (read or write) anymore. Once the buffer(s) arrive at the other module, that module can do whatever it wants with them, and it is sure it has exclusive access. Real-time systems normally use techniques like this to guarantee consistency *and* performance. Anyway, thanks for the clarifications, bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown 2010-02-19 22:37 ` Piergiorgio Sartor @ 2010-02-19 23:34 ` Asdo 2010-02-20 4:27 ` Goswin von Brederlow 2010-02-20 4:23 ` Goswin von Brederlow 2010-02-24 14:54 ` Bill Davidsen 3 siblings, 1 reply; 70+ messages in thread From: Asdo @ 2010-02-19 23:34 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid Thank you for your explanation, Neil, Neil Brown wrote: > When memory changes between being written to one device and to another, this > does not cause corruption, only inconsistency. Either the block will be > written again consistently soon, or it will never be read. This is the crucial part... Why would the filesystem reuse the same memory without rewriting the *same* block? Can the same memory area be used for another block? If yes, I understand. If not, I don't understand why the block is not eventually rewritten to contain equal data on both disks. Is this a power-fail-in-the-middle thing, or can it happen even when the power is always on? Do I understand correctly that raid-456 is instead safe ("no-mismatch") because it copies the memory region? Thank you ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-19 23:34 ` Asdo @ 2010-02-20 4:27 ` Goswin von Brederlow 2010-02-20 11:12 ` Asdo 0 siblings, 1 reply; 70+ messages in thread From: Goswin von Brederlow @ 2010-02-20 4:27 UTC (permalink / raw) To: linux-raid Asdo <asdo@shiftmail.org> writes: > Thank you for your explanation, Neil, > > Neil Brown wrote: >> When memory changes between being written to one device and to another, this >> does not cause corruption, only inconsistency. Either the block will be >> written again consistently soon, or it will never be read. > > This is the crucial part... > > Why would the filesystem reuse the same memory without rewriting the > *same* block? > > Can the same memory area be used for another block? > If yes, I understand. If not, I don't understand why the block is > not eventually rewritten to contain equal data on both disks. > > Is this a power-fail-in-the-middle thing, or can it happen even when > the power is always on? The check is usually done with the filesystem mounted and in use. So one case would be that the block got written, changed and then checked before the FS decided to flush the dirty block again. The other scenario suggested in the past is that the block was written, changed and then the file deleted, making the block unused, before it got flushed again. The filesystem then sees no need to write a dirty but unused block, so it never gets rewritten. It never gets read either, so that is safe. > Do I understand correctly that raid-456 is instead safe > ("no-mismatch") because it copies the memory region? > > Thank you MfG Goswin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-20 4:27 ` Goswin von Brederlow @ 2010-02-20 11:12 ` Asdo 2010-02-21 11:13 ` Goswin von Brederlow 0 siblings, 1 reply; 70+ messages in thread From: Asdo @ 2010-02-20 11:12 UTC (permalink / raw) To: Goswin von Brederlow; +Cc: linux-raid Goswin von Brederlow wrote: > The check is usually done with the filesystem mounted and in use. So one > case would be that the block got written, changed and then checked > before the FS decided to flush the dirty block again. > > The other scenario suggested in the past is that the block was written, > changed and then the file deleted, making the block unused, This is not enough to cause the problem, if I understand correctly; it also needs to change value at this point, right? So how can it change value... is the same buffer used for another block? > before it > got flushed again. The filesystem then sees no need to write a dirty but > unused block, so it never gets rewritten. It never gets read either, so > that is safe. > ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-20 11:12 ` Asdo @ 2010-02-21 11:13 ` Goswin von Brederlow [not found] ` <8754A21825504719B463AD9809E54349@m5> 0 siblings, 1 reply; 70+ messages in thread From: Goswin von Brederlow @ 2010-02-21 11:13 UTC (permalink / raw) To: Asdo; +Cc: Goswin von Brederlow, linux-raid Asdo <asdo@shiftmail.org> writes: > Goswin von Brederlow wrote: >> The check is usually done with the filesystem mounted and in use. So one >> case would be that the block got written, changed and then checked >> before the FS decided to flush the dirty block again. >> >> The other scenario suggested in the past is that the block was written, >> changed and then the file deleted, making the block unused, > This is not enough to cause the problem, if I understand correctly; it > also needs to change value at this point, right? > So how can it change value... is the same buffer used for another block? open tempfile write tempfile raid1 starts to commit the block write some more changing the block raid1 writes the 2nd copy of the block delete file fs never recommits the dirty page Personally I don't really buy that scenario. At least not at the frequency with which mismatches occur. >> before it >> got flushed again. The filesystem then sees no need to write a dirty but >> unused block, so it never gets rewritten. It never gets read either, so >> that is safe. >> MfG Goswin ^ permalink raw reply [flat|nested] 70+ messages in thread
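For anyone who wants to test that scenario, a rough way to try to provoke it on a scratch raid1; the device name md9 and the mount point /mnt/test are assumptions, and whether it reproduces at all depends entirely on writeback timing:

    # keep dirtying pages of a temp file, then delete it mid-writeback
    dd if=/dev/urandom of=/mnt/test/tempfile bs=4k count=25600 &
    sleep 1
    rm -f /mnt/test/tempfile    # unlink while writes may still be in flight
    wait
    sync
    echo check > /sys/block/md9/md/sync_action
    cat /sys/block/md9/md/mismatch_cnt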
[parent not found: <8754A21825504719B463AD9809E54349@m5>]
[parent not found: <20100221194400.GA2570@lazy.lzy>]
* Re: Why does one get mismatches? [not found] ` <20100221194400.GA2570@lazy.lzy> @ 2010-02-22 13:01 ` Asdo 2010-02-22 13:30 ` Piergiorgio Sartor 2010-02-22 13:44 ` Piergiorgio Sartor 0 siblings, 2 replies; 70+ messages in thread From: Asdo @ 2010-02-22 13:01 UTC (permalink / raw) To: Piergiorgio Sartor Cc: Guy Watkins, 'Goswin von Brederlow', linux-raid Piergiorgio Sartor wrote: > Hi >> If someone can map a mismatch to a file, the debate would be over. >> > well, IMHO mismatches should not happen "by design", > but only due to failures or bugs. > > For me, it is not so relevant whether there is a file (or > metadata) or nothing under a mismatch; the whole idea > of "mirroring" turns out to be unusable, still IMHO, > if a mismatch can be caused intentionally. > Even "nothing"? Why? Here we are talking about "nothing". Or so it seems. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-22 13:01 ` Asdo @ 2010-02-22 13:30 ` Piergiorgio Sartor 2010-02-22 13:44 ` Piergiorgio Sartor 1 sibling, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-22 13:30 UTC (permalink / raw) To: Asdo Cc: Piergiorgio Sartor, Guy Watkins, 'Goswin von Brederlow', linux-raid Hi, > Even "nothing"? > Why? for the following reasons: 1) the "check" command is useless if there are mismatches, whether they are harmful or harmless 2) the mirroring concept implies *identical* mirrors, not identical only if the upper layer decides so 3) if a filesystem has a small bug, this will not be caught later; that is, the filesystem could cause a *wrong* mismatch (just as there are "correct" ones) 4) in general, it is not safe to offer a mirroring which is not always mirroring properly > > Here we are talking about "nothing". > > Or so it seems. As I wrote, it does not matter; it is just not correct to rely on the good will of other pieces of software to guarantee that the RAID-1 is working properly. The RAID-1 should work properly because it does work properly, not because the filesystem is kind enough to allow it to work properly. This could be a system design problem, not an MD one, of course, so I'm not saying that Neil or others should fix MD; what I'm writing is that it is astonishing to me that things work this way (or "walk this way"). That's it, I'm just surprised to learn such situations are present. bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-22 13:01 ` Asdo 2010-02-22 13:30 ` Piergiorgio Sartor @ 2010-02-22 13:44 ` Piergiorgio Sartor 1 sibling, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-22 13:44 UTC (permalink / raw) To: Asdo Cc: Piergiorgio Sartor, Guy Watkins, 'Goswin von Brederlow', linux-raid Hi again, forgot one thing... I have some PCs where those mismatches show up: sometimes more, sometimes less, sometimes none at all. All these PCs have the filesystem (ext3) directly on the RAID drive. I have one more PC, where there is an LVM layer in between. On this PC, which also has different HW (the others are all identical), I never saw mismatches. I can imagine that LVM takes care of handling the memory buffers properly. Can anyone confirm this? Thanks, bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? [not found] ` <8754A21825504719B463AD9809E54349@m5> [not found] ` <20100221194400.GA2570@lazy.lzy> @ 2010-02-24 19:42 ` Bill Davidsen 1 sibling, 0 replies; 70+ messages in thread From: Bill Davidsen @ 2010-02-24 19:42 UTC (permalink / raw) To: Guy Watkins; +Cc: 'Goswin von Brederlow', 'Asdo', linux-raid Guy Watkins wrote: > } open tempfile > } write tempfile > } raid1 starts to commit the block > } write some more changing the block > } raid1 writes the 2nd copy of the block > } delete file > } fs never recommits the dirty page > } > } Personally I don't really buy that scenario. At least not at the > } frequency with which mismatches occur. > } > } > } MfG > } Goswin > > If someone can map a mismatch to a file, the debate would be over. > Simple test: create a three-way raid-1. Run it on a heavily used ext3 file system for a few days, until it has a 3-4k mismatch count. Shut it down gracefully. Now - run e2fsck -n on each of the parts, to prove the f/s is not corrupt - mount one partition, using the ext2 type, ro,noatime[1] - do an md5sum on every file and put the output in a file[2] (on another f/s, obviously) - mount each of the mirrors the same way - run md5sum -c {saved_file} to check file content If you get files which don't compare copy to copy, you can see that the issue is real. [1] rather than explain to newbies why neither atime updates nor journal activity changes the file content, I do it this way. [2] MD5 is fine for detecting file changes. You need sha1 or such only to detect malicious changes made with intent to hide the change. Because it uses little CPU it's as good as any. Use sha256sum or similar if you doubt me. Having had mismatches on raid-1 and not on raid-6 using the same three drives, I question the "hardware error" theory of mismatch origin. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
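Bill's test, spelled out as a script. This is only a sketch: the array name, leg names and mount point are assumptions, and it additionally assumes 0.90 metadata (superblock at the end of the device), so each mirror leg carries the filesystem at offset zero and can be mounted directly once the array is stopped:

    # record checksums from the assembled array
    mount -t ext2 -o ro,noatime /dev/md0 /mnt/chk
    (cd /mnt/chk && find . -type f -exec md5sum {} + > /root/md0.md5)
    umount /mnt/chk
    mdadm --stop /dev/md0

    # compare each leg against the recorded checksums; only failures are printed
    for leg in /dev/sda1 /dev/sdb1 /dev/sdc1; do
        mount -t ext2 -o ro,noatime $leg /mnt/chk
        (cd /mnt/chk && md5sum -c /root/md0.md5 | grep -v ': OK$')
        umount /mnt/chk
    done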
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown 2010-02-19 22:37 ` Piergiorgio Sartor 2010-02-19 23:34 ` Asdo @ 2010-02-20 4:23 ` Goswin von Brederlow 2010-02-24 14:54 ` Bill Davidsen 3 siblings, 0 replies; 70+ messages in thread From: Goswin von Brederlow @ 2010-02-20 4:23 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@suse.de> writes: > On Fri, 19 Feb 2010 16:18:09 +0100 > Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > >> Hi, >> >> > When memory changes between being written to one device and to another, this >> > does not cause corruption, only inconsistency. Either the block will be >> > written again consistently soon, or it will never be read. >> >> well, is this for sure? >> I mean, by design of the md subsystem. >> >> Or is it like that because we trust the filesystem? > > It is because we trust the filesystem. > >> >> And why is it like that? Why not use the good old >> readers-writer mechanism to make sure all blocks are >> the same when they are written (namely a lock)? > > md is not in a position to lock the page - there is simply no way it can stop > the filesystem from changing it. > The only thing it could do would be to make a copy, then write the copy out. > This would incur a performance cost. > >> >> It seems to me, maybe I'm wrong, not such a safe design. > > I think you are wrong. No, he is right. The safe design is to copy or at least copy-on-write the page. Maybe this could be configurable so people can choose between really safe and fast. >> I assume, it should not be possible to cause this >> situation, unless there is a crash or a bug in the >> md layer. > > I'm not sure what situation you are referring to... > >> >> What if a new filesystem will write a block, changing >> on the fly, i.e. during RAID-1 writes, and then, later, >> reading this block again? >> >> It will get, maybe, not the correct data. > > This is correct. However it would be equally correct if you were talking > about a normal disk drive rather than a RAID1 pair. > If the filesystem changes the page (or allows it to change) while a write is > pending, then it cannot know what actual data was written. So it must write > the block out again before it ever reads it in. > RAID1 is no different to any other device in this respect. > > >> >> In other words, would it be better, for the md layer, >> to be robust against these kinds of threats? >> > > Possibly, but at what cost? > There are two ways that I can imagine to 'solve' this issue. > > 1/ always copy the page before writing. This would incur a significant > overhead, both in the complexity of pre-allocating memory and in the > delay taken to perform the copy. And it would very rarely be actually > needed. > 2/ Have the filesystem protect the page from changes while it is being > written. This is quite possible for the filesystem to do (while it > is impossible for md to do). There could be some performance > cost with memory-mapped pages as they would need to be unmapped, > but there would be no significant cost for reads, writes, and filesystem > metadata operations. > Further, any filesystem that wants to make use of the integrity checks > that newer drives provide (where the filesystem provides a 'checksum' for > the block which gets passed all the way down and written to storage, and > returned on a read) will need to do this anyway. So it is likely that in > the near future all significant filesystems will provide all the > guarantees md needs in order to simply do nothing different.
> > So my feeling is that md is doing the best thing already. > > I believe 'swap' will always be an issue as unmapping swap pages during write > could be a serious performance cost. It might be that the best thing to do > with swap is to somehow mark the area of an array used for swap as "don't > care" so md never bothers to resync it, and never reports inconsistencies > there, as they really are not an issue. > > NeilBrown Or one could turn on the copy/copy-on-write mode, at least during the test. I'm also not convinced that the performance of swap is an issue. Swap speed is already many orders of magnitude lower than real memory, making any serious use of swap prohibitive. I certainly would not care one way or the other if swapping got 50% slower. I do care about not having a mismatch count, though. MfG Goswin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown ` (2 preceding siblings ...) 2010-02-20 4:23 ` Goswin von Brederlow @ 2010-02-24 14:54 ` Bill Davidsen 2010-02-24 21:37 ` Neil Brown 3 siblings, 1 reply; 70+ messages in thread From: Bill Davidsen @ 2010-02-24 14:54 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > md is not in a position to lock the page - there is simply no way it can stop > the filesystem from changing it. > The only thing it could do would be to make a copy, then write the copy out. > This would incur a performance cost. > > Two thoughts on that - one is that for critical data, give me the option at array start time: make the copy, slow the performance, and make it more consistent. My second thought is that a checksum of the page before initiating the write and after all writes are complete might be less of a performance hit, and could still detect that the buffer had changed. >> It seems to me, maybe I'm wrong, not such a safe design. >> > > I think you are wrong. > > > This is correct. However it would be equally correct if you were talking > about a normal disk drive rather than a RAID1 pair. > If the filesystem changes the page (or allows it to change) while a write is > pending, then it cannot know what actual data was written. So it must write > the block out again before it ever reads it in. > RAID1 is no different to any other device in this respect. > > > >> In other words, would it be better, for the md layer, >> to be robust against these kinds of threats? >> >> > > Possibly, but at what cost? > There are two ways that I can imagine to 'solve' this issue. > > 1/ always copy the page before writing. This would incur a significant > overhead, both in the complexity of pre-allocating memory and in the > delay taken to perform the copy. And it would very rarely be actually > needed. > 2/ Have the filesystem protect the page from changes while it is being > written. This is quite possible for the filesystem to do (while it > is impossible for md to do). There could be some performance > cost with memory-mapped pages as they would need to be unmapped, > but there would be no significant cost for reads, writes, and filesystem > metadata operations. > Your next section somewhat mirrors my thought on md checking the data after the write to be sure it didn't change. > Further, any filesystem that wants to make use of the integrity checks > that newer drives provide (where the filesystem provides a 'checksum' for > the block which gets passed all the way down and written to storage, and > returned on a read) will need to do this anyway. So it is likely that in > the near future all significant filesystems will provide all the > guarantees md needs in order to simply do nothing different. > > So my feeling is that md is doing the best thing already. > > I believe 'swap' will always be an issue as unmapping swap pages during write > could be a serious performance cost. It might be that the best thing to do > with swap is to somehow mark the area of an array used for swap as "don't > care" so md never bothers to resync it, and never reports inconsistencies > there, as they really are not an issue.
> > NeilBrown > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 14:54 ` Bill Davidsen @ 2010-02-24 21:37 ` Neil Brown 2010-02-26 20:48 ` Bill Davidsen 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-02-24 21:37 UTC (permalink / raw) To: Bill Davidsen Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid On Wed, 24 Feb 2010 09:54:17 -0500 Bill Davidsen <davidsen@tmr.com> wrote: > Neil Brown wrote: > > md is not in a position to lock the page - there is simply no way it can stop > > the filesystem from changing it. > > The only thing it could do would be to make a copy, then write the copy out. > > This would incur a performance cost. > > > > > Two thoughts on that - one is that for critical data, give me the option > at array start time: make the copy, slow the performance, and make it > more consistent. My second thought is that a checksum of the page before > initiating the write and after all writes are complete might be less of a > performance hit, and could still detect that the buffer had changed. The idea of calculating a checksum before and after certainly has some merit, if we could choose a checksum algorithm which was sufficiently strong and sufficiently fast, though in many cases a large part of the cost would just be bringing the page contents into cache - twice. It has the advantage over copying the page of not needing to allocate extra memory. If someone wanted to try and prototype this and see how it goes, I'd be happy to advise.... NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
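The shape of the proposed check is easy to see in userspace terms. This is an illustration only: the buffer file and the scratch device are assumptions, and the real version would live in the raid1 write path and hash the page, not a file:

    before=$(md5sum < /tmp/buf.bin)
    dd if=/tmp/buf.bin of=/dev/md_scratch bs=4k count=1 oflag=direct
    after=$(md5sum < /tmp/buf.bin)
    [ "$before" = "$after" ] || echo "buffer changed in flight: schedule a rewrite"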
* Re: Why does one get mismatches? 2010-02-24 21:37 ` Neil Brown @ 2010-02-26 20:48 ` Bill Davidsen 2010-02-26 21:09 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Bill Davidsen @ 2010-02-26 20:48 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > On Wed, 24 Feb 2010 09:54:17 -0500 > Bill Davidsen <davidsen@tmr.com> wrote: > > >> Neil Brown wrote: >> >>> md is not in a position to lock the page - there is simply no way it can stop >>> the filesystem from changing it. >>> The only thing it could do would be to make a copy, then write the copy out. >>> This would incur a performance cost. >>> >>> >>> >> Two thoughts on that - one is that for critical data, give me the option >> at array start time: make the copy, slow the performance, and make it >> more consistent. My second thought is that a checksum of the page before >> initiating the write and after all writes are complete might be less of a >> performance hit, and could still detect that the buffer had changed. >> > > > The idea of calculating a checksum before and after certainly has some merit, > if we could choose a checksum algorithm which was sufficiently strong and > sufficiently fast, though in many cases a large part of the cost would just be > bringing the page contents into cache - twice. > > It has the advantage over copying the page of not needing to allocate extra > memory. > > If someone wanted to try and prototype this and see how it goes, I'd be happy > to advise.... > Disagree if you wish, but MD5 should be fine for this. While it is not cryptographically strong on files, where the size can be changed and evildoers can calculate values to add at the end of the data, it should be adequate on data of unchanging size. It's cheap, fast, and readily available. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 20:48 ` Bill Davidsen @ 2010-02-26 21:09 ` Neil Brown 2010-02-26 22:01 ` Piergiorgio Sartor ` (2 more replies) 0 siblings, 3 replies; 70+ messages in thread From: Neil Brown @ 2010-02-26 21:09 UTC (permalink / raw) To: Bill Davidsen Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid On Fri, 26 Feb 2010 15:48:58 -0500 Bill Davidsen <davidsen@tmr.com> wrote: > > > > The idea of calculating a checksum before and after certainly has some merit, > > if we could choose a checksum algorithm which was sufficiently strong and > > sufficiently fast, though in many cases a large part of the cost would just be > > bringing the page contents into cache - twice. > > > > It has the advantage over copying the page of not needing to allocate extra > > memory. > > > > If someone wanted to try and prototype this and see how it goes, I'd be happy > > to advise.... > > > > Disagree if you wish, but MD5 should be fine for this. While it is not > cryptographically strong on files, where the size can be changed and > evildoers can calculate values to add at the end of the data, it should > be adequate on data of unchanging size. It's cheap, fast, and readily > available. > Actually, I'm no longer convinced that the checksumming idea would work. If a mem-mapped page were written that the app is updating every millisecond (i.e. less than the write latency), then every time a write completed the checksum would be different, so we would have to reschedule the write, which would not be the correct behaviour at all. So I think that the only way to address this in the md layer is to copy the data and write the copy. There is already code to copy the data for write-behind that could possibly be leveraged to always do a copy. Or I could just stop setting mismatch_cnt for raid1 and raid10. That would also fix the problem :-) NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 21:09 ` Neil Brown @ 2010-02-26 22:01 ` Piergiorgio Sartor 2010-02-26 22:15 ` Bill Davidsen 2010-02-26 22:20 ` Asdo 2 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-26 22:01 UTC (permalink / raw) To: Neil Brown Cc: Bill Davidsen, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > So I think that the only way to address this in the md layer is to copy > the data and write the copy. There is already code to copy the data for > write-behind that could possibly be leveraged to always do a copy. actually, I wanted to ask how the write-behind works, because I suspected it copies the data. BTW, is it possible to set both drives (of a pair) as write-mostly, with some write-behind? > Or I could just stop setting mismatch_cnt for raid1 and raid10. That would > also fix the problem :-) Well, the "complaining" problem will be fixed... :-) bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
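For reference, write-behind is configured per write-mostly device and requires a write-intent bitmap; the usual setup marks only the slower leg, as in this sketch (device names are assumptions):

    mdadm --create /dev/md2 --level=1 --raid-devices=2 \
          --bitmap=internal --write-behind=256 \
          /dev/sda3 --write-mostly /dev/sdb3

Whether marking both legs write-mostly is useful is another matter: reads have to come from somewhere, so md will still read from a write-mostly device when no other leg is available.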
* Re: Why does one get mismatches? 2010-02-26 21:09 ` Neil Brown 2010-02-26 22:01 ` Piergiorgio Sartor @ 2010-02-26 22:15 ` Bill Davidsen 2010-02-26 22:21 ` Piergiorgio Sartor 2010-02-26 22:20 ` Asdo 2 siblings, 1 reply; 70+ messages in thread From: Bill Davidsen @ 2010-02-26 22:15 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > On Fri, 26 Feb 2010 15:48:58 -0500 > Bill Davidsen <davidsen@tmr.com> wrote: > > >>> The idea of calculating a checksum before and after certainly has some merit, >>> if we could choose a checksum algorithm which was sufficiently strong and >>> sufficiently fast, though in many cases a large part of the cost would just be >>> bringing the page contents into cache - twice. >>> >>> It has the advantage over copying the page of not needing to allocate extra >>> memory. >>> >>> If someone wanted to try and prototype this and see how it goes, I'd be happy >>> to advise.... >>> >>> >> Disagree if you wish, but MD5 should be fine for this. While it is not >> cryptographically strong on files, where the size can be changed and >> evildoers can calculate values to add at the end of the data, it should >> be adequate on data of unchanging size. It's cheap, fast, and readily >> available. >> >> > > Actually, I'm no longer convinced that the checksumming idea would work. > If a mem-mapped page were written that the app is updating every > millisecond (i.e. less than the write latency), then every time a write > completed the checksum would be different, so we would have to reschedule the > write, which would not be the correct behaviour at all. > So I think that the only way to address this in the md layer is to copy > the data and write the copy. There is already code to copy the data for > write-behind that could possibly be leveraged to always do a copy. > > Your point about the possibility is valid, but consider this: if the checksum fails, then at that point do the copy and write again. > Or I could just stop setting mismatch_cnt for raid1 and raid10. That would > also fix the problem :-) > > s/fix/hide/ ;-) My feeling is that we have many ways to change the data: O_DIRECT, aio, threads, mmap, and probably some I haven't found yet. Rather than thinking that you could prevent that without a flaming layer violation, perhaps use my thought above: detect that the data has changed, and at that point do a copy and write unchanging data to all drives. How that plays with O_DIRECT I can't say, but it sounds to me as if it should eliminate the mismatches without a huge performance impact. Let me know if this addresses your concern about rewriting forever, without taking much overhead. The question is why this happens with raid-1 and doesn't seem to with raid-[56]. And I don't see mismatches on my raid-10, although I'm pretty sure that neither mmap nor O_DIRECT is used on those arrays. What would seem to be optimal is some COW on the buffer to prevent the buffer from being modified while it's being used for actual i/o. It doesn't seem that hardware supports it: page size, buffer size and sector size all vary. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 22:15 ` Bill Davidsen @ 2010-02-26 22:21 ` Piergiorgio Sartor 0 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-26 22:21 UTC (permalink / raw) To: Bill Davidsen Cc: Neil Brown, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > The question is why this happens with raid-1 and doesn't seem to > with raid-[56]. And I don't see mismatches on my raid-10, although > I'm pretty sure that neither mmap nor O_DIRECT is used on those > arrays. I believe Neil mentioned that RAID-5/6 always makes a copy, while RAID-1/10 simply use the same page without copying. I get mismatches on RAID-10, but not on the one that has LVM on it, only on the one(s) where the filesystem (ext3) is directly on the RAID volume. bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 21:09 ` Neil Brown 2010-02-26 22:01 ` Piergiorgio Sartor 2010-02-26 22:15 ` Bill Davidsen @ 2010-02-26 22:20 ` Asdo 2010-02-27 6:01 ` Michael Evans 2 siblings, 1 reply; 70+ messages in thread From: Asdo @ 2010-02-26 22:20 UTC (permalink / raw) To: Neil Brown Cc: Bill Davidsen, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > Actually, I'm no longer convinced that the checksumming idea would work. > If a mem-mapped page were written that the app is updating every > millisecond (i.e. less than the write latency), then every time a write > completed the checksum would be different, so we would have to reschedule the > write, which would not be the correct behaviour at all. > So I think that the only way to address this in the md layer is to copy > the data and write the copy. There is already code to copy the data for > write-behind that could possibly be leveraged to always do a copy. > The concerns about slowdowns with the copy could be addressed by making the copy a runtime choice triggered by a sysfs interface: a file in /sys/block/mdX/md/ where one can echo "1" to enable copies for this type of raid. Or better, 1 could be the default (slower but safer; or if not safer, at least it would avoid needless questions about mismatches on this ML from new users, and it would allow detection of REAL mismatches, which can be due to cabling or defective disks), and echoing 0 would increase performance at the cost of seeing lots of false positive mismatches. ^ permalink raw reply [flat|nested] 70+ messages in thread
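To be concrete about what is being proposed, the knob might look like the following; the attribute name is invented purely for illustration and nothing like it exists in current kernels:

    # hypothetical sysfs attribute -- "page_copy" is a made-up name
    echo 1 > /sys/block/md0/md/page_copy   # safe: copy pages before writing
    echo 0 > /sys/block/md0/md/page_copy   # fast: write shared pages directly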
* Re: Why does one get mismatches? 2010-02-26 22:20 ` Asdo @ 2010-02-27 6:01 ` Michael Evans 2010-02-28 0:01 ` Bill Davidsen 0 siblings, 1 reply; 70+ messages in thread From: Michael Evans @ 2010-02-27 6:01 UTC (permalink / raw) To: Asdo Cc: Neil Brown, Bill Davidsen, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid On Fri, Feb 26, 2010 at 2:20 PM, Asdo <asdo@shiftmail.org> wrote: > Neil Brown wrote: >> >> Actually, I'm no longer convinced that the checksumming idea would work. >> If a mem-mapped page were written that the app is updating every >> millisecond (i.e. less than the write latency), then every time a write >> completed the checksum would be different, so we would have to reschedule >> the >> write, which would not be the correct behaviour at all. >> So I think that the only way to address this in the md layer is to copy >> the data and write the copy. There is already code to copy the data for >> write-behind that could possibly be leveraged to always do a copy. >> > > The concerns about slowdowns with the copy could be addressed by making the copy a > runtime choice triggered by a sysfs interface: a file in /sys/block/mdX/md/ > where one can echo "1" to enable copies for this type of raid. Or > better, 1 could be the default (slower but safer; or if not safer, at least > it would avoid needless questions about mismatches on this ML from new users, and > it would allow detection of REAL mismatches, which can be due to cabling or defective > disks), and echoing 0 would increase performance at the cost of seeing lots > of false positive mismatches. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Isn't there some way of making the page copy-on-write using hardware and/or an in-kernel structure? Ideally copying could be avoided /unless/ there is change. That way each operation looks like an atomic commit. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-27 6:01 ` Michael Evans @ 2010-02-28 0:01 ` Bill Davidsen 0 siblings, 0 replies; 70+ messages in thread From: Bill Davidsen @ 2010-02-28 0:01 UTC (permalink / raw) To: Michael Evans Cc: Asdo, Neil Brown, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Michael Evans wrote: > On Fri, Feb 26, 2010 at 2:20 PM, Asdo <asdo@shiftmail.org> wrote: > >> Neil Brown wrote: >> >>> Actually, I'm no longer convinced that the checksumming idea would work. >>> If a mem-mapped page were written that the app is updating every >>> millisecond (i.e. less than the write latency), then every time a write >>> completed the checksum would be different, so we would have to reschedule >>> the >>> write, which would not be the correct behaviour at all. >>> So I think that the only way to address this in the md layer is to copy >>> the data and write the copy. There is already code to copy the data for >>> write-behind that could possibly be leveraged to always do a copy. >>> >>> >> The concerns about slowdowns with the copy could be addressed by making the copy a >> runtime choice triggered by a sysfs interface: a file in /sys/block/mdX/md/ >> where one can echo "1" to enable copies for this type of raid. Or >> better, 1 could be the default (slower but safer; or if not safer, at least >> it would avoid needless questions about mismatches on this ML from new users, and >> it would allow detection of REAL mismatches, which can be due to cabling or defective >> disks), and echoing 0 would increase performance at the cost of seeing lots >> of false positive mismatches. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > Isn't there some way of making the page copy-on-write using hardware > and/or an in-kernel structure? Ideally copying could be avoided > /unless/ there is change. That way each operation looks like an > atomic commit. > As I think about this, one idea was to add a write-in-progress flag, so that the filesystem, or library, or whatever would know not to change the page. That would mean that every filesystem would need to be enhanced, or that the "safe write" would be optional on a per-filesystem level. Implementation of O_DIRECT could do it, or not, and there could be a safe way to write. However, it occurs to me that there are several other levels involved, and so it could be better but not perfect. While md could flag the start and finish of a write, you then need to have the next level, the device driver, do the same thing, so md knows when the data need no longer be frozen. "But wait, there's more," as they say: the device driver needs to track when the data are transferred to the actual device, and the device needs to report when the data actually hit the platter, or you could still have possible mismatches. All of that reminds us of the discussion of barriers, and flush cache commands, and other performance impacting practices. So in the long run I think the most effective solution, one which has the highest improvement at the lowest cost in performance, is a copy. Now if Neil liked my idea of doing a checksum before and after a write, and a copy only in the cases where the data had changed, the impact could be pretty small. All that depends on two things: Neil thinking the whole thing is worth doing, and no one finding a flaw in my proposal to do a checksum rather than a copy each time.
And to return to your original question, no. Hardware COW works on memory pages; a buffer could span pages, and a write to a page might not be to the part of the page used for the i/o buffer. So as nice as that would be, I don't think the hardware supports it. And even if you could, the COW needs to be done in the layer which tries to change the buffer, so md would set COW and the filesystem would have to deal with it. I am pretty sure that's a layering violation, big time. The advisory "write in progress" flag might be acceptable; it's information the f/s can use or not. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-17 23:05 ` Neil Brown 2010-02-19 15:18 ` Piergiorgio Sartor @ 2010-02-24 14:46 ` Bill Davidsen 2010-02-24 16:12 ` Martin K. Petersen 2010-02-24 21:32 ` Neil Brown 1 sibling, 2 replies; 70+ messages in thread From: Bill Davidsen @ 2010-02-24 14:46 UTC (permalink / raw) To: Neil Brown; +Cc: Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > On Wed, 17 Feb 2010 08:38:11 +1100 > Steven Haigh <netwiz@crc.id.au> wrote: > > >> On Tue, 16 Feb 2010 16:25:25 -0500, Bill Davidsen <davidsen@tmr.com> >> wrote: >> >>> Bryan Mesich wrote: >>> >>>> On Thu, Feb 11, 2010 at 04:14:44PM +1100, Neil Brown wrote: >>>> >>>> >>>>>> This whole discussion simply shows that for RAID-1 software RAID is >>>>>> less >>>>>> reliable than hardware RAID (no, I don't mean fake-RAID), because it >>>>>> doesn't pin the data buffer until all copies are written. >>>>>> >>>>>> >>>>> That doesn't make it less reliable. It just makes it more confusing. >>>>> >>>>> >>>> I agree that linux software RAID is no less reliable than >>>> hardware RAID with regards to the above conversation. It's >>>> however confusing to have a counter that indicates there are >>>> problems with a RAID 1 array when in fact there is not. >>>> >>>> >>> Sorry, but real hardware raid is more reliable than software raid, and >>> Neil's justification for not doing smart recovery mentions it. Note this >>> >>> refers to real hardware raid, not fakeraid which is just some firmware >>> in a BIOS to use the existing hardware. >>> >>> The issue lies with data changing between writes to multiple drives. In >>> hardware raid the data traverses the memory bus once, only once, and >>> goes into cache in the controller, from which it is written to all >>> mirrored drives. With software raid an individual write is done to each >>> drive, and if the data in the buffer changes between writes to one drive >>> >>> or the other you get different values. Neil may be convinced that the OS >>> >>> somehow "knows" which of the mirror copies is correct, i.e. most recent, >>> and never uses the stale data, but if that information was really >>> available reads would always return the latest value and it wouldn't be >>> possible to read the same file multiple times and get different MD5sums. >>> >>> It would also be possible to do a stable smart recovery by propagating >>> the most recent copy to the other mirror drives. >>> >>> I hoped that mounting data=journal would lead to consistency; that seems >>> >>> not to be true either. >>> >> I agree, Bill, there is an issue with the software RAID1 when it comes down >> to some hardware. I have one machine where the ONLY way to stop the root >> filesystem going readonly due to journal issues is to remove RAID. Having >> RAID1 enabled gives silent corruption of both data and the journal at >> seemingly random times. >> >> I can see the data corruption from running a verify between RPM and data >> on the drive. Reinstalling these packages fixes things - until some >> random things get corrupted next time. >> > > Sounds very much like dodgy drives. > > >> The myth that data corruption in RAID1 ONLY happens to swap and/or unused >> space on a drive is absolute rubbish. >> >> > > Absolute rubbish does seem to be a suitable phrase here. > There is no question of data corruption. > When memory changes between being written to one device and to another, this > does not cause corruption, only inconsistency.
Either the block will be > written again consistently soon, or it will never be read. > Just what is it that rewrites the data block? The user program doesn't know it's needed, the filesystem, if any, doesn't know it's needed, and as far as I can tell md doesn't do a checksum before issuing the write and after the last write is done, and it doesn't make a copy and write from that. So what sees that the data has changed and rewrites it? > If the host crashes before the blocks are made consistent, then the > inconsistency will not be visible as the resync will fix it. > > If you are getting any corruption, then it is NOT due to this facet of the > RAID1 implementation - it is due to something else. > My guess is bad hardware - anywhere from memory to hard drive. > Having switched an array from three-way raid-1 to raid-6, using the same kernel, utilities, and hardware, I can speak to that. When I first started to run checks, I took the array offline to do repair, and usually saw ~12k mismatches by the end of a week. After changing the array to raid-6 I never had a mismatch again. Therefore, while hardware clearly can be a factor, it is unlikely to be the cause of all mismatch events. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 14:46 ` Bill Davidsen @ 2010-02-24 16:12 ` Martin K. Petersen 2010-02-24 18:51 ` Piergiorgio Sartor 2010-02-24 21:39 ` Neil Brown 2010-02-24 21:32 ` Neil Brown 1 sibling, 2 replies; 70+ messages in thread From: Martin K. Petersen @ 2010-02-24 16:12 UTC (permalink / raw) To: Bill Davidsen; +Cc: Neil Brown, Steven Haigh, Bryan Mesich, Jon, linux-raid >>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes: >> Absolute rubbish does seem to be a suitable phrase here. There is no >> question of data corruption. When memory changes between being >> written to one device and to another, this does not cause corruption, >> only inconsistency. Either the block will be written again >> consistently soon, or it will never be read. Bill> Just what is it that rewrites the data block? The user program Bill> doesn't know it's needed, the filesystem, if any, doesn't know Bill> it's needed, and as far as I can tell md doesn't do checksum Bill> before issuing the write and after the last write is done. Doesn't Bill> make a copy and write from that. So what sees that the data has Bill> changed and rewrites it? The filesystem updates the page, causing it to be marked dirty again. The VM will then eventually schedule the page to be written out. The "when" depends on filesystem type and whether there's metadata or data in the page. In this discussion there seems to be a focus on the case where one mirror is correct and one is not. However, that's usually not how it works out. A more realistic scenario is that both mirror copies are incorrect because the page was continuously updated. I.e. both mirrors have various degrees of new and stale data inside a 4KB block. So realistically both disk blocks are wrong and there's a window until the new, correct block is written. That window will only cause problems if there is a crash and we'll need to recover. My main concern here is how big the discrepancy between the disks can get, and whether we'll end up corrupting the filesystem during recovery because we could potentially be matching metadata from one disk with journal entries from another. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 16:12 ` Martin K. Petersen @ 2010-02-24 18:51 ` Piergiorgio Sartor 2010-02-24 22:21 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-24 18:51 UTC (permalink / raw) To: Martin K. Petersen Cc: Bill Davidsen, Neil Brown, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > So realistically both disk blocks are wrong and there's a window until > the new, correct block is written. That window will only cause problems > if there is a crash and we'll need to recover. My main concern here is > how big the discrepancy between the disks can get, and whether we'll end > up corrupting the filesystem during recovery because we could > potentially be matching metadata from one disk with journal entries from > another. well, I know people will not believe me, but just this evening one of the infamous PCs with the mismatch count going up and down could not boot anymore. Reason: you must specify the filesystem type So, I started it with a live CD. My first idea was a problem with the RAID (type is 10 f2). This was assembled fine, so I tried to mount it, but mount returned the same error as above. So I tried to mount it specifying "-text3" and it was mounted. Everything seemed to be fine; I backed up the data anyhow. Some interesting discoveries: tune2fs -l /dev/md/2_0 returns the FS data, no errors. blkid /dev/md/2_0 does not return anything. Running a fsck did not find anything wrong, but it did not repair anything either. Now, I do not know if this was caused by the situation mentioned above, but for sure it is quite fishy... BTW, unrelated to the topic, any idea on how to fix this? Is there any tool that can restore the proper ID, or anything else? Thanks, bye. -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 18:51 ` Piergiorgio Sartor @ 2010-02-24 22:21 ` Neil Brown 2010-02-25 8:41 ` Piergiorgio Sartor 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-02-24 22:21 UTC (permalink / raw) To: Piergiorgio Sartor Cc: Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid On Wed, 24 Feb 2010 19:51:06 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > Hi, > > > So realistically both disk blocks are wrong and there's a window until > > the new, correct block is written. That window will only cause problems > > if there is a crash and we'll need to recover. My main concern here is > > how big the discrepancy between the disks can get, and whether we'll end > > up corrupting the filesystem during recovery because we could > > potentially be matching metadata from one disk with journal entries from > > another. > > well, I know people will not believe me, but > just this evening one of the infamous PCs with the mismatch > count going up and down could not boot anymore. I certainly believe you. > > Reason: you must specify the filesystem type This suggests that the superblock which lives at an offset of 1K into the filesystem was sufficiently corrupted that mount couldn't recognise it. > > So, I started it with a live CD. > > My first idea was a problem with the RAID (type is 10 f2). > > This was assembled fine, so I tried to mount it, but mount > returned the same error as above. > So I tried to mount it specifying "-text3" and it was mounted. That is really odd! Both the kernel ext3 module (triggered by '-text3') and the 'mount' program use exactly the same test - look for the magic number in the superblock at 1K into the device. It is very hard to see how 'mount' would fail to find something that the ext3 module finds. > Everything seemed to be fine; I backed up the data anyhow. > > Some interesting discoveries: > > tune2fs -l /dev/md/2_0 returns the FS data, no errors. > blkid /dev/md/2_0 does not return anything. This sounds very much like tune2fs and blkid are reading two different things, which is strange. Would you be able to get the first 4K from each device in the raid10: dd if=/dev/whatever of=/tmp/whatever bs=1K count=4 and then tar/gz those up and send them to me. That might give some clue. Unless the raid metadata is 1.1 or 1.2 - then I would need blocks further in the device, at the 'data offset'. The --detail output of the array might help too. > > Running a fsck did not find anything wrong, but it did > not repair anything either. Did you use "fsck -f" ?? > > Now, I do not know if this was caused by the situation > mentioned above, but for sure it is quite fishy... > > BTW, unrelated to the topic, any idea on how to fix this? > Is there any tool that can restore the proper ID, or anything else? > Until we know what is wrong, it is hard to suggest a fix. NeilBrown > Thanks, > > bye. > ^ permalink raw reply [flat|nested] 70+ messages in thread
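The requested collection, as one sequence; the component device names are assumptions, so adjust them to what /proc/mdstat reports:

    for d in /dev/sda2 /dev/sdb2; do
        dd if=$d of=/tmp/$(basename $d).raw bs=1K count=4
    done
    mdadm --detail /dev/md1 > /tmp/md1-detail.txt
    tar czf /tmp/md1-dumps.tar.gz /tmp/*.raw /tmp/md1-detail.txt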
* Re: Why does one get mismatches? 2010-02-24 22:21 ` Neil Brown @ 2010-02-25 8:41 ` Piergiorgio Sartor 2010-03-02 4:57 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-25 8:41 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > I certainly believe you. thank you! > That is really odd! Both the kernel ext3 module (triggered by '-text3') > and the 'mount' program use exactly the same test - look for the magic > number in the superblock at 1K into the device. Today I tried: blkid -p /dev/md1 (this time the live CD autoassembled the md device) and it returned something like: ambivalent result (probably more than one filesystem...) The strange thing is that the HDDs were brand new; no older partitions or filesystems were there. Anyway, I have one small correction: the RAID is not 10 f2 on this PC, but (due to a different installation) a RAID-1 with superblock 0.9, and the device partitions are set to 0xFD (RAID autoassemble). > Would you be able to get the first 4K from each device in the raid10: > dd if=/dev/whatever of=/tmp/whatever bs=1K count=4 > > and then tar/gz those up and send them to me. That might give some clue. > Unless the raid metadata is 1.1 or 1.2 - then I would need blocks further in > the device, at the 'data offset'. > The --detail output of the array might help too. I dumped the first 4K of each device; they're identical (so no mismatch there, at least). I'll send them to you, together with the detail output. > > Running a fsck did not find anything wrong, but it did > > not repair anything either. > > Did you use "fsck -f" ?? Yep. > Until we know what is wrong, it is hard to suggest a fix. Thanks a lot (also because this could turn out to be unrelated to this mailing list). bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-25 8:41 ` Piergiorgio Sartor @ 2010-03-02 4:57 ` Neil Brown 2010-03-02 18:49 ` Piergiorgio Sartor 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-03-02 4:57 UTC (permalink / raw) To: Piergiorgio Sartor Cc: Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid On Thu, 25 Feb 2010 09:41:41 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > Hi, > > > I certainly believe you. > > thank you! > > > That is really odd! Both the kernel ext3 module (triggered by '-text3') > > and the 'mount' program use exactly the same test - look for the magic > > number in the superblock at 1K into the device. > > Today I tried: blkid -p /dev/md1 (this time the live CD > autoassembled the md device) and it returned something > like: ambivalent result (probably more than one filesystem...) > > The strange thing is that the HDDs were brand new; no older > partitions or filesystems were there. > > Anyway, I have one small correction: the RAID is not 10 f2 > on this PC, but (due to a different installation) a RAID-1 > with superblock 0.9, and the device partitions are set > to 0xFD (RAID autoassemble). > > > Would you be able to get the first 4K from each device in the raid10: > > dd if=/dev/whatever of=/tmp/whatever bs=1K count=4 > > > > and then tar/gz those up and send them to me. That might give some clue. > > Unless the raid metadata is 1.1 or 1.2 - then I would need blocks further in > > the device, at the 'data offset'. > > The --detail output of the array might help too. > > I dumped the first 4K of each device; they're identical > (so no mismatch there, at least). I'll send them to you, > together with the detail output. Thanks. I finally had a look at these (sorry for the delay). If you run "file" on one of the dumps, it tells you: $ file disk1.raw disk1.raw: Minix filesystem Which isn't expected. I would expect something like $ file xxx xxx: Linux rev 1.0 ext3 filesystem data, UUID=fe55fe6f-0412-4a0a-852d-a0e21767aa35 (needs journal recovery) (large files) for an ext3 filesystem. Looking at /usr/share/misc/magic, it seems that a Minix filesystem is defined by: 0x410 leshort 0x137f Minix filesystem i.e. the 2 bytes at 0x410 into the device are 0x137f, which is exactly what we find in your dump. 0x410 in an ext3 superblock is the lower bytes of "s_free_inodes_count", the count of free inodes. Your actual number is 14881663, which is 0x00E3137F. So if you just add or remove a file, the number of free inodes should change, and your filesystem will no longer look like a Minix filesystem and your problems should go away. I guess libblkid et al. should do more sanity checks on the superblock before deciding that it really belongs to some particular filesystem. But I'm happy - this clearly isn't a raid problem. NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
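The collision is easy to verify from the shell; the device name is an assumption, and the same commands work on one of the dump files:

    # the two bytes at offset 0x410, where the Minix magic is expected
    dd if=/dev/md1 bs=1 skip=$((0x410)) count=2 2>/dev/null | od -An -tx1
    #  7f 13    <- little-endian 0x137f, the Minix magic
    printf '%d\n' 0x00E3137F    # 14881663, the ext3 free-inode count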
* Re: Why does one get mismatches? 2010-03-02 4:57 ` Neil Brown @ 2010-03-02 18:49 ` Piergiorgio Sartor 0 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-03-02 18:49 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > Thanks. I finally had a look at these (sorry for the delay). well, thank you for having a look at the thing. > If you run "file" on one of the dumps, it tells you: > > $ file disk1.raw > disk1.raw: Minix filesystem > > Which isn't expected. I would expect something like > $ file xxx > xxx: Linux rev 1.0 ext3 filesystem data, UUID=fe55fe6f-0412-4a0a-852d-a0e21767aa35 (needs journal recovery) (large files) > > for an ext3 filesystem. > > Looking at /usr/share/misc/magic, it seems that a Minix filesystem is defined > by: > 0x410 leshort 0x137f Minix filesystem > > i.e. the 2 bytes at 0x410 into the device are 0x137f, which is exactly what we > find in your dump. > > 0x410 in an ext3 superblock is the lower bytes of "s_free_inodes_count", the > count of free inodes. > Your actual number is 14881663, which is 0x00E3137F. Ah! But this means there is a bug somewhere... > So if you just add or remove a file, the number of free inodes should change, > and your filesystem will no longer look like a Minix filesystem and > your problems should go away. Uhm, OK, I just re-created the MD and the FS, so I also took the opportunity to increase the chunk size to 512K and use RAID-10. > I guess libblkid et al. should do more sanity checks on the superblock before > deciding that it really belongs to some particular filesystem. So, should one of us file a bug report somewhere? I mean, it is not only (lib)blkid, but also "file" which seems to be confused. BTW, "file" does not seem to use libblkid. > But I'm happy - this clearly isn't a raid problem. That's certainly good news. Thanks again for the explanation; I learned something today! bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
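The rebuild described above would be roughly the following; the device names and the f2 layout are assumptions:

    mdadm --create /dev/md1 --level=10 --layout=f2 --chunk=512 \
          --raid-devices=2 /dev/sda2 /dev/sdb2
    mkfs.ext3 /dev/md1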
* Re: Why does one get mismatches?
2010-02-24 16:12 ` Martin K. Petersen
2010-02-24 18:51 ` Piergiorgio Sartor
@ 2010-02-24 21:39 ` Neil Brown
[not found] ` <4B8640A2.4060307@shiftmail.org>
2010-02-28  8:09 ` Luca Berra
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-24 21:39 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

On Wed, 24 Feb 2010 11:12:09 -0500
"Martin K. Petersen" <martin.petersen@oracle.com> wrote:

> So realistically both disk blocks are wrong and there's a window until
> the new, correct block is written. That window will only cause problems
> if there is a crash and we'll need to recover. My main concern here is
> how big the discrepancy between the disks can get, and whether we'll end
> up corrupting the filesystem during recovery because we could
> potentially be matching metadata from one disk with journal entries from
> another.

After a crash, md will only read from one of the devices (the first) until a
resync has completed. So there should be no room for more confusion than you
would expect on a single device.

NeilBrown
[parent not found: <4B8640A2.4060307@shiftmail.org>]
* Re: Why does one get mismatches?
[not found] ` <4B8640A2.4060307@shiftmail.org>
@ 2010-02-25 10:41 ` Neil Brown
0 siblings, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-25 10:41 UTC (permalink / raw)
To: Asdo
Cc: Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

On Thu, 25 Feb 2010 10:19:30 +0100 Asdo <asdo@shiftmail.org> wrote:

> Neil Brown wrote:
> > On Wed, 24 Feb 2010 11:12:09 -0500
> > "Martin K. Petersen" <martin.petersen@oracle.com> wrote:
> >
> >> So realistically both disk blocks are wrong and there's a window until
> >> the new, correct block is written. That window will only cause problems
> >> if there is a crash and we'll need to recover. My main concern here is
> >> how big the discrepancy between the disks can get, and whether we'll
> >> end up corrupting the filesystem during recovery because we could
> >> potentially be matching metadata from one disk with journal entries
> >> from another.
> >
> > After a crash, md will only read from one of the devices (the first)
> > until a resync has completed. So there should be no room for more
> > confusion than you would expect on a single device.
>
> Not enough, I'd say.
> The reads are from a single device, the first, but with the writes you
> don't know whether they go first to the first device or in the reverse
> order. So I'd still be concerned by what Martin says.

I'm getting bored of repeating myself, so I won't respond to this.

> In addition, on this ML there are people reporting that the mismatches
> occur even when the system is always on, no crashes. So I think there is
> another mechanism for mismatches (not sure if in addition, or if it's
> the only mechanism).

Ditto.

> Besides, if the mechanism for mismatches is correct I'd go for the copy
> (or page lock if possible). All raids have the copy, except raid0 maybe,
> and they are not slow. Here the copy would only occur on writes, and
> raid-1 is not targeted to be SO fast on writes... Also raid-1s are
> usually on few disks, like no more than 3, so the copy is not likely to
> bottleneck the speed of the writes.

I'm sure it would be a measurable slowdown, though < 20%. Probably < 10%.
I doubt everyone would be happy with that, though you might.

> What about raid-10? Are there copies for the raid-1 part of raid-10?

No. Neither raid1 nor raid10 copy the data; only raid456 does.

NeilBrown
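To make concrete what "the copy" being discussed would mean, here is a
minimal userspace sketch (my own illustration, not md code; device paths
and block size are placeholders): each mirrored write goes through a
private bounce buffer, so later changes to the caller's buffer cannot make
the two copies diverge.

  /* bounce_write.c - illustrate stabilizing data with a copy before
   * writing it to two mirrors.  Sketch only; not how md is implemented. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <fcntl.h>

  #define BLKSZ 4096

  /* Copy first, then write the *copy* to both legs: even if the caller
   * keeps modifying 'data', both mirrors receive identical bytes. */
  static int mirror_write(int fd1, int fd2, const void *data, off_t off)
  {
          static char bounce[BLKSZ];

          memcpy(bounce, data, BLKSZ);  /* the copy raid456 does, raid1 doesn't */
          if (pwrite(fd1, bounce, BLKSZ, off) != BLKSZ ||
              pwrite(fd2, bounce, BLKSZ, off) != BLKSZ)
                  return -1;
          return 0;
  }

  int main(void)
  {
          /* placeholder paths standing in for two mirror legs */
          int fd1 = open("/tmp/leg0.img", O_WRONLY | O_CREAT, 0600);
          int fd2 = open("/tmp/leg1.img", O_WRONLY | O_CREAT, 0600);
          char data[BLKSZ] = "hello";

          if (fd1 < 0 || fd2 < 0 || mirror_write(fd1, fd2, data, 0) < 0) {
                  perror("mirror_write");
                  return 1;
          }
          return 0;
  }

The cost Neil estimates above is exactly the memcpy() per write that this
sketch adds.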
* Re: Why does one get mismatches?
2010-02-24 21:39 ` Neil Brown
[not found] ` <4B8640A2.4060307@shiftmail.org>
@ 2010-02-28  8:09 ` Luca Berra
2010-03-02  5:01 ` Neil Brown
1 sibling, 1 reply; 70+ messages in thread
From: Luca Berra @ 2010-02-28 8:09 UTC (permalink / raw)
To: linux-raid

On Thu, Feb 25, 2010 at 08:39:36AM +1100, Neil Brown wrote:
>On Wed, 24 Feb 2010 11:12:09 -0500
>"Martin K. Petersen" <martin.petersen@oracle.com> wrote:
>
>> So realistically both disk blocks are wrong and there's a window until
>> the new, correct block is written. That window will only cause problems
>> if there is a crash and we'll need to recover. My main concern here is
>> how big the discrepancy between the disks can get, and whether we'll end
>> up corrupting the filesystem during recovery because we could
>> potentially be matching metadata from one disk with journal entries from
>> another.
>
>After a crash, md will only read from one of the devices (the first) until a
>resync has completed. So there should be no room for more confusion than you
>would expect on a single device.

After thinking more about this I could come up with another concern
about write ordering.

Example:

  app writes blocks A, B, C
  md writes A on both disks
  md writes B on disk1
  app writes B again (B')
  md writes B' on disk2
  now md would write B' again on both disks, but the system crashes
  (note: C is never written, due to the crash)

Disk 1 contains A and B in the correct order; it is missing C and B', but
we don't care - the app should be able to recover from a crash.

Disk 2 contains A and B', but they are wrongly ordered, because C is
missing.

If in the above case A and C are data blocks and B contains a journal
related to A and C, booting from disk 2 could result in inconsistent
data.

Can the above really happen?
Would using barriers remove the above concern?
Am I missing something else?

L.

-- 

Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: Why does one get mismatches?
2010-02-28  8:09 ` Luca Berra
@ 2010-03-02  5:01 ` Neil Brown
2010-03-02  7:36 ` Luca Berra
0 siblings, 1 reply; 70+ messages in thread
From: Neil Brown @ 2010-03-02 5:01 UTC (permalink / raw)
To: Luca Berra; +Cc: linux-raid

On Sun, 28 Feb 2010 09:09:49 +0100
Luca Berra <bluca@comedia.it> wrote:

> On Thu, Feb 25, 2010 at 08:39:36AM +1100, Neil Brown wrote:
> >After a crash, md will only read from one of the devices (the first)
> >until a resync has completed. So there should be no room for more
> >confusion than you would expect on a single device.
>
> After thinking more about this I could come up with another concern
> about write ordering.
>
> Example:
>
>   app writes blocks A, B, C
>   md writes A on both disks
>   md writes B on disk1
>   app writes B again (B')
>   md writes B' on disk2
>   now md would write B' again on both disks, but the system crashes
>   (note: C is never written, due to the crash)
>
> Disk 1 contains A and B in the correct order; it is missing C and B', but
> we don't care - the app should be able to recover from a crash.
>
> Disk 2 contains A and B', but they are wrongly ordered, because C is
> missing.
>
> If in the above case A and C are data blocks and B contains a journal
> related to A and C, booting from disk 2 could result in inconsistent
> data.
>
> Can the above really happen?
> Would using barriers remove the above concern?
> Am I missing something else?

There is no inconsistency here that a filesystem would not equally expect
from a single device.
After the crash-while-writing B', it should expect to see either B or B',
and it does, depending on which device is primary.

Nothing to see here.

NeilBrown
* Re: Why does one get mismatches?
2010-03-02  5:01 ` Neil Brown
@ 2010-03-02  7:36 ` Luca Berra
2010-03-02 10:04 ` Michael Evans
0 siblings, 1 reply; 70+ messages in thread
From: Luca Berra @ 2010-03-02 7:36 UTC (permalink / raw)
To: linux-raid

On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>On Sun, 28 Feb 2010 09:09:49 +0100
>Luca Berra <bluca@comedia.it> wrote:
>
>> After thinking more about this I could come up with another concern
>> about write ordering.
>>
>> Example:
>>
>>   app writes blocks A, B, C
>>   md writes A on both disks
>>   md writes B on disk1
>>   app writes B again (B')
>>   md writes B' on disk2
>>   now md would write B' again on both disks, but the system crashes
>>   (note: C is never written, due to the crash)
>>
>> Disk 1 contains A and B in the correct order; it is missing C and B',
>> but we don't care - the app should be able to recover from a crash.
>>
>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>> missing.
>>
>> If in the above case A and C are data blocks and B contains a journal
>> related to A and C, booting from disk 2 could result in inconsistent
>> data.
>>
>> Can the above really happen?
>> Would using barriers remove the above concern?
>> Am I missing something else?
>
>There is no inconsistency here that a filesystem would not equally expect
>from a single device.
>After the crash-while-writing B', it should expect to see either B or B',
>and it does, depending on which device is primary.
>
>Nothing to see here.

I will try to explain better:
the problem is not related to the confusion between B or B'.

The problem is that on one disk we have B' _without_ C.

Regards,
L.

-- 

Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: Why does one get mismatches?
2010-03-02  7:36 ` Luca Berra
@ 2010-03-02 10:04 ` Michael Evans
2010-03-02 11:02 ` Luca Berra
0 siblings, 1 reply; 70+ messages in thread
From: Michael Evans @ 2010-03-02 10:04 UTC (permalink / raw)
To: linux-raid

On Mon, Mar 1, 2010 at 11:36 PM, Luca Berra <bluca@comedia.it> wrote:
> On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>>
>> On Sun, 28 Feb 2010 09:09:49 +0100
>> Luca Berra <bluca@comedia.it> wrote:
>>
>>> After thinking more about this I could come up with another concern
>>> about write ordering.
>>>
>>> Example:
>>>
>>>   app writes blocks A, B, C
>>>   md writes A on both disks
>>>   md writes B on disk1
>>>   app writes B again (B')
>>>   md writes B' on disk2
>>>   now md would write B' again on both disks, but the system crashes
>>>   (note: C is never written, due to the crash)
>>>
>>> Disk 1 contains A and B in the correct order; it is missing C and B',
>>> but we don't care - the app should be able to recover from a crash.
>>>
>>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>>> missing.
>>>
>>> If in the above case A and C are data blocks and B contains a journal
>>> related to A and C, booting from disk 2 could result in inconsistent
>>> data.
>>>
>>> Can the above really happen?
>>> Would using barriers remove the above concern?
>>> Am I missing something else?
>>
>> There is no inconsistency here that a filesystem would not equally
>> expect from a single device.
>> After the crash-while-writing B', it should expect to see either B or
>> B', and it does, depending on which device is primary.
>>
>> Nothing to see here.
>
> I will try to explain better:
> the problem is not related to the confusion between B or B'.
>
> The problem is that on one disk we have B' _without_ C.
>
> Regards,
> L.

You're demanding full atomic commits; this is precisely what journals
and /barriers/ are for.

Are you bypassing them in a quest for performance and paying for it on
crashes?
Or is this a hardware bug?
Or is it some glitch in the block-device layering leading to barrier
requests not being honored?
* Re: Why does one get mismatches?
2010-03-02 10:04 ` Michael Evans
@ 2010-03-02 11:02 ` Luca Berra
2010-03-02 12:13 ` Michael Evans
2010-03-02 18:14 ` Asdo
0 siblings, 2 replies; 70+ messages in thread
From: Luca Berra @ 2010-03-02 11:02 UTC (permalink / raw)
To: linux-raid

On Tue, Mar 02, 2010 at 02:04:47AM -0800, Michael Evans wrote:
>On Mon, Mar 1, 2010 at 11:36 PM, Luca Berra <bluca@comedia.it> wrote:
>> On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>>>> Disk 1 contains A and B in the correct order; it is missing C and B',
>>>> but we don't care - the app should be able to recover from a crash.
>>>>
>>>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>>>> missing.
>>>>
>>>> If in the above case A and C are data blocks and B contains a journal
>>>> related to A and C, booting from disk 2 could result in inconsistent
>>>> data.
>>>>
>>>> Can the above really happen?
>>>> Would using barriers remove the above concern?
>>>> Am I missing something else?
>>>
>>> There is no inconsistency here that a filesystem would not equally
>>> expect from a single device.
>>> After the crash-while-writing B', it should expect to see either B or
>>> B', and it does, depending on which device is primary.
>>>
>>> Nothing to see here.
>>
>> I will try to explain better:
>> the problem is not related to the confusion between B or B'.
>>
>> The problem is that on one disk we have B' _without_ C.
>>
>You're demanding full atomic commits; this is precisely what journals
>and /barriers/ are for.
>
>Are you bypassing them in a quest for performance and paying for it on
>crashes?
>Or is this a hardware bug?
>Or is it some glitch in the block-device layering leading to barrier
>requests not being honored?

I just asked for confirmation that with /barriers/ the scenario above
would not happen.

L.

-- 

Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: Why does one get mismatches?
2010-03-02 11:02 ` Luca Berra
@ 2010-03-02 12:13 ` Michael Evans
0 siblings, 0 replies; 70+ messages in thread
From: Michael Evans @ 2010-03-02 12:13 UTC (permalink / raw)
To: linux-raid

On Tue, Mar 2, 2010 at 3:02 AM, Luca Berra <bluca@comedia.it> wrote:
> On Tue, Mar 02, 2010 at 02:04:47AM -0800, Michael Evans wrote:
>> On Mon, Mar 1, 2010 at 11:36 PM, Luca Berra <bluca@comedia.it> wrote:
>>> On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>>>>> Disk 1 contains A and B in the correct order; it is missing C and B',
>>>>> but we don't care - the app should be able to recover from a crash.
>>>>>
>>>>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>>>>> missing.
>>>>>
>>>>> If in the above case A and C are data blocks and B contains a journal
>>>>> related to A and C, booting from disk 2 could result in inconsistent
>>>>> data.
>>>>>
>>>>> Can the above really happen?
>>>>> Would using barriers remove the above concern?
>>>>> Am I missing something else?
>>>>
>>>> There is no inconsistency here that a filesystem would not equally
>>>> expect from a single device.
>>>> After the crash-while-writing B', it should expect to see either B or
>>>> B', and it does, depending on which device is primary.
>>>>
>>>> Nothing to see here.
>>>
>>> I will try to explain better:
>>> the problem is not related to the confusion between B or B'.
>>>
>>> The problem is that on one disk we have B' _without_ C.
>>>
>> You're demanding full atomic commits; this is precisely what journals
>> and /barriers/ are for.
>>
>> Are you bypassing them in a quest for performance and paying for it on
>> crashes?
>> Or is this a hardware bug?
>> Or is it some glitch in the block-device layering leading to barrier
>> requests not being honored?
>
> I just asked for confirmation that with /barriers/ the scenario above
> would not happen.
>
> L.

Yes, obviously atomic commits require barriers.

Older hardware and operating systems that didn't allow any form of
buffering or out-of-order operations (hardware can re-arrange commits
internally now) inherently had a barrier between every operation. Modern
devices and systems have so many layers of interacting buffers, with
operation re-ordering to optimize throughput, that such predictability is
lacking unless it is explicitly requested in the form of a barrier.
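As a userspace illustration of what an ordering point buys you (my own
sketch, not from the thread; file name and offsets are placeholders): with
buffering and reordering in play, the only way to guarantee that B' never
reaches stable storage before C is to drain everything up to C first. On a
filesystem mounted with barriers enabled, fdatasync() provides that drain.

  /* ordered_write.c - enforce "C is durable before B'" with an explicit
   * flush.  Sketch with a placeholder file standing in for the device. */
  #include <stdio.h>
  #include <unistd.h>
  #include <fcntl.h>

  int main(void)
  {
          int fd = open("/tmp/journal.img", O_WRONLY | O_CREAT, 0600);
          char a[512] = "A", b[512] = "B", c[512] = "C", b2[512] = "B'";

          if (fd < 0)
                  return 1;
          pwrite(fd, a, sizeof(a), 0);
          pwrite(fd, b, sizeof(b), 512);
          pwrite(fd, c, sizeof(c), 1024);
          /* drain A, B, C to stable storage before B' may be issued;
           * without this, the kernel/drive may reorder B' ahead of C */
          if (fdatasync(fd) < 0) {
                  perror("fdatasync");
                  return 1;
          }
          pwrite(fd, b2, sizeof(b2), 512);
          return 0;
  }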
* Re: Why does one get mismatches?
2010-03-02 11:02 ` Luca Berra
2010-03-02 12:13 ` Michael Evans
@ 2010-03-02 18:14 ` Asdo
2010-03-02 18:52 ` Piergiorgio Sartor
2010-03-02 20:17 ` Neil Brown
1 sibling, 2 replies; 70+ messages in thread
From: Asdo @ 2010-03-02 18:14 UTC (permalink / raw)
To: linux-raid

Luca Berra wrote:
>>> I will try to explain better:
>>> the problem is not related to the confusion between B or B'.
>>>
>>> The problem is that on one disk we have B' _without_ C.
>>>
>> You're demanding full atomic commits; this is precisely what journals
>> and /barriers/ are for.
>>
>> Are you bypassing them in a quest for performance and paying for it on
>> crashes?
>> Or is this a hardware bug?
>> Or is it some glitch in the block-device layering leading to barrier
>> requests not being honored?
> I just asked for confirmation that with /barriers/ the scenario above
> would not happen.

I think so - it would not happen: the filesystem would stay consistent
(while the mismatches could still happen).

The problem is that barriers were only introduced for all md raid levels
in 2.6.33 (just released), and also I have read that XFS has a major
performance drop with barriers activated. People will be tempted to
disable barriers. AFAIR the performance drop was visible with one disk
alone; imagine now with RAID. And I expect similar performance drops with
other filesystems - correct me if I am wrong.

Now it would be interesting to understand why the mismatches don't
happen when LVM is above MD-raid!?
The mechanisms presented up to now on this ML for mismatches don't
explain why the same issue doesn't show up with LVM. I think.
So you might want to use raid-1 and raid-10 under LVM, just in case....
* Re: Why does one get mismatches?
2010-03-02 18:14 ` Asdo
@ 2010-03-02 18:52 ` Piergiorgio Sartor
2010-03-02 23:27 ` Asdo
1 sibling, 1 reply; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-03-02 18:52 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid

Hi,

> Now it would be interesting to understand why the mismatches don't
> happen when LVM is above MD-raid!?

well, maybe LVM copies the buffer, and after that it plays nice with MD,
i.e. no changes on the fly.

Or maybe it is just one system that behaves properly with LVM over MD.

bye,

-- 

piergiorgio
* Re: Why does one get mismatches?
2010-03-02 18:52 ` Piergiorgio Sartor
@ 2010-03-02 23:27 ` Asdo
2010-03-03  9:13 ` Piergiorgio Sartor
0 siblings, 1 reply; 70+ messages in thread
From: Asdo @ 2010-03-02 23:27 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid

Piergiorgio Sartor wrote:
> Hi,
>
>> Now it would be interesting to understand why the mismatches don't
>> happen when LVM is above MD-raid!?
>
> well, maybe LVM copies the buffer, and after that it plays nice with MD,
> i.e. no changes on the fly.

LVM copies the buffer!?
I don't think so...
LVM is near-zero overhead, so I would be surprised if it copied the buffer.
Also I don't think it was needed in their case, except maybe if there is
an I/O at the boundary of a logical volume or LVM stripe, which would
certainly be a mistake at the requestor's side.
LVM also does not merge requests AFAIR. (visible with mdstat -x 1)

> Or maybe it is just one system that behaves properly with LVM over MD.

hmm, maybe... But me also, I have never seen mismatches, and the only
raid-1's I have are above LVM (except /boot, but that's almost never
modified).
* Re: Why does one get mismatches?
2010-03-02 23:27 ` Asdo
@ 2010-03-03  9:13 ` Piergiorgio Sartor
2010-03-03 11:42 ` Asdo
0 siblings, 1 reply; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-03-03 9:13 UTC (permalink / raw)
To: Asdo; +Cc: Piergiorgio Sartor, linux-raid

Hi,

> LVM copies the buffer!?
> I don't think so...
> LVM is near-zero overhead, so I would be surprised if it copied the
> buffer.

well, I'm not so sure it is near-zero overhead (I'll explain below), and
even if it is, making copies could still be "near-zero" overhead - it
depends on where the bottlenecks are. I'm not an LVM insider, so these
are just random thoughts.

About the near-zero overhead, maybe this could open a different thread,
but just to give some numbers...

I've a bunch of RAID-6 volumes, made of USB disks, i.e. using PATA<->USB
bridges. These volumes are aggregated using LVM and, on top of that,
there is a LUKS container.

The raw read performance on the RAID-6 is, in the best case, around
48MB/s, which is pretty good for USB; I guess it would be difficult to
get more.

The raw read performance of the LVM volume is ~38MB/s.
The raw read performance of the LUKS is ~28MB/s (actually maybe a bit
less). Each further layer loses about 10MB/s.

I guess this is much more visible with USB than in SATA/SAS situations,
since going from 205 to 195 might go unnoticed.

This is not a CPU problem: the PC is dual core, one core runs and it
never exceeds 30%. The USB is slow enough to allow all the operations to
be performed in real time.

Nevertheless, LVM is doing something there; in this setup it has an
overhead of about 20%, far from zero. So the 10MB/s loss could be -
again, I've no idea how LVM works - caused by copying. It could also be
something else, of course; it would be interesting to have more
information from some expert (also to optimize my USB setup, if
possible).

> Also I don't think it was needed in their case, except maybe if

Maybe, but if the filesystem can play with the buffer while it is
submitted, then I would rather copy the data. Again, some expert opinion
would be appreciated.

> LVM also does not merge requests AFAIR. (visible with mdstat -x 1)

BTW, what's that? I mean "mdstat -x 1"...

> But me also, I have never seen mismatches, and the only raid-1's I have
> are above LVM (except /boot, but that's almost never modified).

Well, that's good, you confirmed my experience.

I've also RAID-10 on LVM and never got mismatches, while the plain
RAID-10 sometimes did.

bye,

-- 

piergiorgio
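For anyone wanting to repeat this kind of per-layer measurement, here is a
small sketch (device path is a placeholder) that times a raw sequential
read through whichever layer you point it at - the raid device, the LV, or
the LUKS mapping - using O_DIRECT so the page cache doesn't inflate the
numbers.

  /* rawread.c - crude sequential-read benchmark for one block layer.
   * Usage sketch: ./rawread /dev/md0   (or /dev/vg/lv, /dev/mapper/luks) */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/time.h>

  #define CHUNK (1 << 20)         /* 1 MiB per read */
  #define TOTAL 256u              /* read 256 MiB in total */

  int main(int argc, char **argv)
  {
          int fd = open(argc > 1 ? argv[1] : "/dev/md0", O_RDONLY | O_DIRECT);
          void *buf;
          struct timeval t0, t1;
          unsigned i;

          if (fd < 0 || posix_memalign(&buf, 4096, CHUNK)) {
                  perror("setup");
                  return 1;
          }
          gettimeofday(&t0, NULL);
          for (i = 0; i < TOTAL; i++)
                  if (read(fd, buf, CHUNK) != CHUNK) {
                          perror("read");
                          return 1;
                  }
          gettimeofday(&t1, NULL);
          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
          printf("%u MiB in %.2f s = %.1f MB/s\n", TOTAL, secs, TOTAL / secs);
          return 0;
  }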
* Re: Why does one get mismatches?
2010-03-03  9:13 ` Piergiorgio Sartor
@ 2010-03-03 11:42 ` Asdo
2010-03-03 12:03 ` Piergiorgio Sartor
0 siblings, 1 reply; 70+ messages in thread
From: Asdo @ 2010-03-03 11:42 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid

Piergiorgio Sartor wrote:
> I've a bunch of RAID-6 volumes, made of USB disks, i.e. using
> PATA<->USB bridges.

Don't your bridges ever drop out or break? What is the brand/model?
Years ago I broke a lot of those, just by using the disks intensively.
They probably kind of overheated, then failed. They couldn't last two
days of intense disk activity... They were Chinese stuff bought on eBay,
though.

> These volumes are aggregated using LVM and, on top of that, there
> is a LUKS container.
>
> The raw read performance on the RAID-6 is, in the best case, around
> 48MB/s, which is pretty good for USB; I guess it would be difficult to
> get more.

48MB/sec can be good for one disk, but it's bad for many disks attached
separately to USB ports...

> The raw read performance of the LVM volume is ~38MB/s.
> The raw read performance of the LUKS is ~28MB/s (actually maybe a bit
> less).

Might your LVM, or a partition within it, be misaligned? Or did you not
set the readahead?
http://www.beowulf.org/archive/2007-May/018359.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg10804.html
People using LVM on arrays giving hundreds of MB/sec see slowdowns on the
order of a few percent:
http://article.gmane.org/gmane.linux.raid/18302

>> LVM also does not merge requests AFAIR. (visible with mdstat -x 1)
> BTW, what's that? I mean "mdstat -x 1"...

I'm sorry, I meant "iostat -x 1".

>> But me also, I have never seen mismatches, and the only raid-1's I have
>> are above LVM (except /boot, but that's almost never modified).
>
> Well, that's good, you confirmed my experience.
>
> I've also RAID-10 on LVM and never got mismatches, while the plain
> RAID-10 sometimes did.

This fact needs further investigation, methinks...
We could ask the LVM people if LVM really copies the buffer.

Regards
A.
* Re: Why does one get mismatches?
2010-03-03 11:42 ` Asdo
@ 2010-03-03 12:03 ` Piergiorgio Sartor
0 siblings, 0 replies; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-03-03 12:03 UTC (permalink / raw)
To: Asdo; +Cc: Piergiorgio Sartor, linux-raid

Hi,

> > I've a bunch of RAID-6 volumes, made of USB disks, i.e. using
> > PATA<->USB bridges.
> Don't your bridges ever drop out or break?

never had problems with broken bridges, but other problems I had, like
unreliable transfers under certain conditions.

> What is the brand/model?

The best I could find were from Digitus; they've a pretty standard
chipset, JMicron I guess, but they seem to be built better than others,
still with JMicron. No problems, so far.
I've two other brands, same chipset, which seem reliable for the SATA
part, but the PATA does not work properly. All the others, from different
vendors with different chipsets, never had problems.
I can imagine that the PSU that comes with them might be a weak point; I
saw really poor quality units.
On the other hand, it's a bunch of RAID-6 for a reason... :-)

> Years ago I broke a lot of those, just by using the disks intensively.
> They probably kind of overheated, then failed. They couldn't last two
> days of intense disk activity... They were Chinese stuff bought on
> eBay, though.

My use case is offline storage, so I'll not use the box for two days
straight, but I have used it for several hours.

> 48MB/sec can be good for one disk, but it's bad for many disks attached
> separately to USB ports...

Well, maybe I forgot to mention that the HDDs are connected to the PC
through a USB hub (three 4-to-1 USB hubs, to be precise), i.e. one single
USB connection. This can do, in theory, 60MB/s; in practice I never saw
more than 50MB/s, in ideal conditions.
So, in my view, 48MB/s is pretty much the maximum you can get.

> Might your LVM, or a partition within it, be misaligned? Or did you not
> set the readahead?

LVM takes care to align itself - this is in the new version - and the
readahead also seems to be automagically set.
Nevertheless, the LUKS is aligned by hand.

> http://www.beowulf.org/archive/2007-May/018359.html
> http://www.mail-archive.com/linux-raid@vger.kernel.org/msg10804.html
> People using LVM on arrays giving hundreds of MB/sec see slowdowns on
> the order of a few percent:
> http://article.gmane.org/gmane.linux.raid/18302

Thanks for the links, I'll have a look.

> > I've also RAID-10 on LVM and never got mismatches, while the plain
> > RAID-10 sometimes did.
>
> This fact needs further investigation, methinks...
> We could ask the LVM people if LVM really copies the buffer.

Or, in general, if they have any explanation for this observation of
ours.

Thanks,

bye,

-- 

piergiorgio
* Re: Why does one get mismatches?
2010-03-02 18:14 ` Asdo
2010-03-02 18:52 ` Piergiorgio Sartor
@ 2010-03-02 20:17 ` Neil Brown
1 sibling, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-03-02 20:17 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid

On Tue, 02 Mar 2010 19:14:25 +0100 Asdo <asdo@shiftmail.org> wrote:

> Luca Berra wrote:
> >>> I will try to explain better:
> >>> the problem is not related to the confusion between B or B'.
> >>>
> >>> The problem is that on one disk we have B' _without_ C.
> >>>
> >> You're demanding full atomic commits; this is precisely what journals
> >> and /barriers/ are for.
> >>
> >> Are you bypassing them in a quest for performance and paying for it
> >> on crashes?
> >> Or is this a hardware bug?
> >> Or is it some glitch in the block-device layering leading to barrier
> >> requests not being honored?
> > I just asked for confirmation that with /barriers/ the scenario above
> > would not happen.
>
> I think so - it would not happen: the filesystem would stay consistent
> (while the mismatches could still happen).
>
> The problem is that barriers were only introduced for all md raid levels
> in 2.6.33 (just released), and also I have read that XFS has a major
> performance drop with barriers activated. People will be tempted to
> disable barriers. AFAIR the performance drop was visible with one disk
> alone; imagine now with RAID. And I expect similar performance drops
> with other filesystems - correct me if I am wrong.

The barrier support added in 2.6.33 was for striped md arrays.

RAID1, which is not striped, has had barrier support since about 2.6.16,
as it is much easier to implement.

NeilBrown

> Now it would be interesting to understand why the mismatches don't
> happen when LVM is above MD-raid!?
> The mechanisms presented up to now on this ML for mismatches don't
> explain why the same issue doesn't show up with LVM. I think.
> So you might want to use raid-1 and raid-10 under LVM, just in case....
* Re: Why does one get mismatches?
2010-02-24 14:46 ` Bill Davidsen
2010-02-24 16:12 ` Martin K. Petersen
@ 2010-02-24 21:32 ` Neil Brown
2010-02-25  7:22 ` Goswin von Brederlow
2010-02-25  8:47 ` John Robinson
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-24 21:32 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Steven Haigh, Bryan Mesich, Jon, linux-raid

On Wed, 24 Feb 2010 09:46:23 -0500
Bill Davidsen <davidsen@tmr.com> wrote:

> > There is no question of data corruption.
> > When memory changes between being written to one device and to another,
> > this does not cause corruption, only inconsistency. Either the block
> > will be written again consistently soon, or it will never be read.
>
> Just what is it that rewrites the data block? The user program doesn't
> know it's needed, the filesystem, if any, doesn't know it's needed, and
> as far as I can tell md doesn't checksum before issuing the write and
> again after the last write is done, or make a copy and write from that.
> So what sees that the data has changed and rewrites it?

The filesystem re-writes the block, though probably it is more accurate to
say 'the page cache' rewrites the block (the page cache is essentially
just a library of code that the filesystem uses).

When a page is changed, its 'Dirty' flag is set.
Before a page is written out, the Dirty flag is cleared.
So if a page is written differently to two devices, then it must have been
changed after the Dirty flag was cleared, so the Dirty flag will be set,
so the page cache will try to write it out again (after about 30 seconds,
or at unmount time).

When accessing a block device directly ( > /dev/md0 ) the page cache is
still used and will still write out any page that has the Dirty flag set.

If you open /dev/md0 with O_DIRECT there is no page cache involved and so
no setting of Dirty flags. So you could engineer a situation with O_DIRECT
that writes different data to the two devices, but you would have to try
fairly hard.

NeilBrown
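To make that last point concrete, here is a sketch of the "fairly hard"
case (my own construction, not from the thread; the device path is a
placeholder, and running it against a real array would deliberately create
mismatches and destroy data at offset 0). A second thread keeps scribbling
on the buffer while the main thread writes it with O_DIRECT; with no page
cache copy involved, each mirror leg can DMA a different snapshot of the
buffer, and no Dirty flag will ever trigger a corrective rewrite.

  /* direct_mismatch.c - deliberately race a buffer against an O_DIRECT
   * write.  Demonstration sketch only; do NOT run on data you care about. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <pthread.h>

  #define BLKSZ 4096
  static char *buf;
  static volatile int done;

  /* keep mutating the buffer while the write is in flight */
  static void *scribbler(void *arg)
  {
          (void)arg;
          while (!done)
                  memset(buf, rand() & 0xff, BLKSZ);
          return NULL;
  }

  int main(void)
  {
          pthread_t t;
          int fd = open("/dev/md0", O_WRONLY | O_DIRECT);  /* placeholder */

          if (fd < 0 || posix_memalign((void **)&buf, BLKSZ, BLKSZ))
                  return 1;
          pthread_create(&t, NULL, scribbler, NULL);
          /* the same user page is handed to each mirror leg in turn;
           * whatever the scribbler changed in between becomes a mismatch */
          pwrite(fd, buf, BLKSZ, 0);
          done = 1;
          pthread_join(t, NULL);
          return 0;
  }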
* Re: Why does one get mismatches?
2010-02-24 21:32 ` Neil Brown
@ 2010-02-25  7:22 ` Goswin von Brederlow
2010-02-25  7:39 ` Neil Brown
0 siblings, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-02-25 7:22 UTC (permalink / raw)
To: Neil Brown; +Cc: Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

Neil Brown <neilb@suse.de> writes:

> On Wed, 24 Feb 2010 09:46:23 -0500
> Bill Davidsen <davidsen@tmr.com> wrote:
>
>> Just what is it that rewrites the data block? The user program doesn't
>> know it's needed, the filesystem, if any, doesn't know it's needed, and
>> as far as I can tell md doesn't checksum before issuing the write and
>> again after the last write is done, or make a copy and write from that.
>> So what sees that the data has changed and rewrites it?
>
> The filesystem re-writes the block, though probably it is more accurate
> to say 'the page cache' rewrites the block (the page cache is
> essentially just a library of code that the filesystem uses).
>
> When a page is changed, its 'Dirty' flag is set.
> Before a page is written out, the Dirty flag is cleared.
> So if a page is written differently to two devices, then it must have
> been changed after the Dirty flag was cleared, so the Dirty flag will be
> set, so the page cache will try to write it out again (after about 30
> seconds, or at unmount time).

So maybe MD could check the dirty flag after a write and then output a
warning, so we can track down the issue. MD could also rewrite the page
prior to setting the disks in-sync, until the dirty bit is clear after a
write.

MfG
        Goswin
* Re: Why does one get mismatches?
2010-02-25  7:22 ` Goswin von Brederlow
@ 2010-02-25  7:39 ` Neil Brown
0 siblings, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-25 7:39 UTC (permalink / raw)
To: Goswin von Brederlow
Cc: Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

On Thu, 25 Feb 2010 08:22:10 +0100
Goswin von Brederlow <goswin-v-b@web.de> wrote:

> Neil Brown <neilb@suse.de> writes:
>
> > The filesystem re-writes the block, though probably it is more
> > accurate to say 'the page cache' rewrites the block (the page cache is
> > essentially just a library of code that the filesystem uses).
> >
> > When a page is changed, its 'Dirty' flag is set.
> > Before a page is written out, the Dirty flag is cleared.
> > So if a page is written differently to two devices, then it must have
> > been changed after the Dirty flag was cleared, so the Dirty flag will
> > be set, so the page cache will try to write it out again (after about
> > 30 seconds, or at unmount time).
>
> So maybe MD could check the dirty flag after a write and then output a
> warning, so we can track down the issue. MD could also rewrite the page
> prior to setting the disks in-sync, until the dirty bit is clear after a
> write.

md isn't able to see the dirty bit. It gets a 'bio', which has a 'biovec',
which has a list of pages with offset and size. It does not know if the
page is in the page cache or not, so it cannot know if the dirty flag on
the page means anything or not.

Yes, it technically could check the dirty bit, and if it saw any of them
set then it could reschedule the writes. However:

1. This is a layering violation - it is the wrong thing to do.
2. It might not work. The filesystem could keep the 'dirty' status
   elsewhere, such as in a 'buffer_head', and only copy it through to the
   page occasionally.
3. It could cause a live-lock. If an application is changing a mapped
   page quite regularly, then the current page cache will write it out
   every 30 seconds or so. Your proposed change would write it out again
   and again, as soon as each previous write completes.

So, no: we cannot do that.

NeilBrown
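For the curious, the check being rejected would look roughly like the
sketch below (written for illustration only, using the 2.6-era bio_vec
iteration API; it is hypothetical code, not anything in md - and, per the
three objections above, not something that should be merged).

  #include <linux/bio.h>
  #include <linux/mm.h>

  /* Hypothetical helper: scan a bio's pages for the Dirty flag before
   * declaring the mirrors in-sync. */
  static int bio_has_dirty_pages(struct bio *bio)
  {
          struct bio_vec *bvec;
          int i;

          bio_for_each_segment(bvec, bio, i)
                  if (PageDirty(bvec->bv_page))
                          return 1;   /* page re-dirtied while in flight */
          return 0;
  }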
* Re: Why does one get mismatches?
2010-02-24 21:32 ` Neil Brown
2010-02-25  7:22 ` Goswin von Brederlow
@ 2010-02-25  8:47 ` John Robinson
2010-02-25  9:07 ` Neil Brown
1 sibling, 1 reply; 70+ messages in thread
From: John Robinson @ 2010-02-25 8:47 UTC (permalink / raw)
To: linux-raid

On 24/02/2010 21:32, Neil Brown wrote:
[...]
> If you open /dev/md0 with O_DIRECT there is no page cache involved and
> so no setting of Dirty flags. So you could engineer a situation with
> O_DIRECT that writes different data to the two devices, but you would
> have to try fairly hard.

Hang on. O_DIRECT sets off all sorts of alarm bells for me, not that I
understand it properly. Of course there's O_DIRECT on files too. Linus
Torvalds is quite outspoken about it: http://kerneltrap.org/node/7563

Could we be seeing mismatches because applications are opening their
files with O_DIRECT in a (perhaps misguided) attempt to get better
performance?

Cheers,

John.
* Re: Why does one get mismatches?
2010-02-25  8:47 ` John Robinson
@ 2010-02-25  9:07 ` Neil Brown
0 siblings, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-25 9:07 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid

On Thu, 25 Feb 2010 08:47:59 +0000
John Robinson <john.robinson@anonymous.org.uk> wrote:

> On 24/02/2010 21:32, Neil Brown wrote:
> [...]
> > If you open /dev/md0 with O_DIRECT there is no page cache involved and
> > so no setting of Dirty flags. So you could engineer a situation with
> > O_DIRECT that writes different data to the two devices, but you would
> > have to try fairly hard.
>
> Hang on. O_DIRECT sets off all sorts of alarm bells for me, not that I
> understand it properly. Of course there's O_DIRECT on files too. Linus
> Torvalds is quite outspoken about it: http://kerneltrap.org/node/7563
>
> Could we be seeing mismatches because applications are opening their
> files with O_DIRECT in a (perhaps misguided) attempt to get better
> performance?

Unlikely.
The app would need to be doing async direct-io, or it would need to have
multiple threads, and in either case it would need to change the buffer
that was being written while the write was happening. And that would be a
pretty dumb thing to do unless it almost immediately wrote the same buffer
out again.

So not exactly impossible, but probably the least likely of the various
possible explanations.

NeilBrown
* Re: Why does one get mismatches?
2010-02-11  5:14 ` Neil Brown
2010-02-11 17:51 ` Bryan Mesich
@ 2010-02-11 18:12 ` Piergiorgio Sartor
1 sibling, 0 replies; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-02-11 18:12 UTC (permalink / raw)
To: Neil Brown; +Cc: Bill Davidsen, Jon, linux-raid

Hi all,

> > This whole discussion simply shows that for RAID-1 software RAID is
> > less reliable than hardware RAID (no, I don't mean fake-RAID), because
> > it doesn't pin the data buffer until all copies are written.
>
> That doesn't make it less reliable. It just makes it more confusing.

well, sorry to say, but it makes it useless.

The problem is: how can we be sure that the FS really plays tricks only
with blocks which will be unused?

In other words, either there should be an agreed and confirmed interface
between the caller (FS) and the called (MD), handling the situation
properly (i.e. the FS will not play these pranks), or the called (MD)
should be robust against all the possible nasty things the caller (FS)
can do.

Because what will happen if someone introduces a new FS which works fine
with everything but software RAID?

Similarly, I've some identical PCs with RAID-10 f2. Starting with Fedora
12, there is a weekly check of the RAID array (with email notification,
BTW without the mismatch count...).

On these PCs I get mismatches, sometimes. Checking the mismatch count, I
found out that it changes: sometimes a bit more, sometimes a bit less
(or zero).

Now, IMHO the check is completely useless and even annoying. I've got
mismatches, changing, but I do not know how serious they are. Not good...
I could have lost data or not, and I do not know...

> But for a more complete discussion on raid recovery and when it might
> be sensible to "vote" among the blocks, see
> http://neil.brown.name/blog/20100211050355

Nice discussion. Especially the clarification about the unclean shutdown
event. This could be, in effect, a killer for the majority-select (or
RAID-6 reconstruction) decision.

I personally agree with your conclusion.

Anyway, I miss, or did not get, one more point. Specifically, the "smart
recovery" should be composed of two steps. One is detecting where the
problems are. This means not only the stripe but, in the case of RAID-6,
also the *potential* component (HDD) of the array.

The reason is that, as I already wrote some time ago, there is a *huge*
difference between having all the mismatches *potentially* on one single
component, or spread across several. The first case clearly gives more
information and allows a better judgment of the situation.

Thanks,

bye,

-- 

piergiorgio
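For reference, the count the notification scripts are missing is exported
through md's standard sysfs interface; a check can be triggered and the
result read back as in this sketch (md0 is a placeholder; in a real script
you would poll sync_action until it returns to "idle" before reading the
count).

  /* mismatch.c - trigger a check on an md array and read mismatch_cnt.
   * Sketch; run as root, /dev/md0 assumed. */
  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/sys/block/md0/md/sync_action", "w");
          char cnt[32];

          if (!f || fputs("check", f) == EOF) {
                  perror("sync_action");
                  return 1;
          }
          fclose(f);
          /* ... wait for the check to finish (poll sync_action) ... */
          f = fopen("/sys/block/md0/md/mismatch_cnt", "r");
          if (!f || !fgets(cnt, sizeof(cnt), f)) {
                  perror("mismatch_cnt");
                  return 1;
          }
          printf("mismatch_cnt: %s", cnt);
          return 0;
  }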