* Fw: Why does one get mismatches?
@ 2010-01-20 11:52 Jon Hardcastle
2010-01-22 18:13 ` Goswin von Brederlow
2010-02-01 21:18 ` Bill Davidsen
0 siblings, 2 replies; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-20 11:52 UTC (permalink / raw)
To: linux-raid
--- On Tue, 19/1/10, Jon Hardcastle <jd_hardcastle@yahoo.com> wrote:
> From: Jon Hardcastle <jd_hardcastle@yahoo.com>
> Subject: Why does one get mismatches?
> To: linux-raid@vger.kernel.org
> Date: Tuesday, 19 January, 2010, 10:04
> Hi,
>
> I kicked off a check/repair cycle on my machine after I
> moved the physical ordering of my drives around, and I am now
> on my second check/repair cycle and it keeps finding
> mismatches.
>
> Is it correct that the mismatch value after a repair should
> equal the value reported by the preceding check? What if it
> doesn't? And what does it mean if another check STILL reveals
> mismatches?
>
> I had something similar after I reshaped from RAID 5 to 6: I
> had to run check/repair/check/repair several times before I
> got my 0.
>
>
Guys,

Anyone got any suggestions here? I am now on my ~5th check/repair, and after a reboot the first check is still returning 8.

All I have done is move the drives around. It is the same controllers/cables/etc.

I really don't like the seemingly random nature of whatever can/does/has caused these mismatches.
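For readers following along: the check and repair passes discussed here are driven through sysfs, and mismatch_cnt reports the number of inconsistent sectors found by the last pass. A minimal sketch of the cycle, assuming the array is md0 (the name is illustrative):

    # start a read-only consistency check
    echo check > /sys/block/md0/md/sync_action

    # wait for it to finish (sync_action returns to "idle")
    while grep -q check /sys/block/md0/md/sync_action; do sleep 60; done

    # sectors found inconsistent by the check
    cat /sys/block/md0/md/mismatch_cnt

    # rewrite inconsistent stripes
    echo repair > /sys/block/md0/md/sync_action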
* Re: Fw: Why does one get mismatches?
@ 2010-01-22 18:13 ` Goswin von Brederlow
1 sibling, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-22 18:13 UTC (permalink / raw)
To: Jon; +Cc: linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Anyone got any suggestions here? I am now on my ~5th check/repair, and
> after a reboot the first check is still returning 8. All I have done
> is move the drives around. It is the same controllers/cables/etc.

There is some unknown corruption going on with RAID1 that causes
mismatches, but it is believed that it will never occur on any block
that is actually in use. Swapping is a likely cause.

Any swap device on the raid? Try turning that off.
If that doesn't help, try unmounting filesystems or remounting them
read-only.

MfG
        Goswin
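Goswin's elimination steps translate to something like the following sketch (device and mount point names are illustrative):

    # is any swap device backed by the array?
    cat /proc/swaps
    swapoff /dev/md1              # if so, take it out of service

    # quiesce the filesystems on the array
    umount /mnt/data              # or: mount -o remount,ro /mnt/data

    # then re-run a check with the array otherwise idle
    echo check > /sys/block/md0/md/sync_action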
* Re: Fw: Why does one get mismatches?
@ 2010-01-24 17:40 ` Jon Hardcastle
0 siblings, 2 replies; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-24 17:40 UTC (permalink / raw)
To: Jon, Goswin von Brederlow; +Cc: linux-raid

--- On Fri, 22/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:

> There is some unknown corruption going on with RAID1 that causes
> mismatches, but it is believed that it will never occur on any block
> that is actually in use. Swapping is a likely cause.
>
> Any swap device on the raid? Try turning that off.
> If that doesn't help, try unmounting filesystems or remounting them
> read-only.

Hello, my usual savior Goswin!

The deal is: it is a 7-drive RAID 6 array. It has LVM on it and is not
used for swapping. I have unmounted all the LVs and still got
mismatches. I ran smartctl --test=long on all drives - nothing. I have
now dismantled the array and am 3/4 of the way through 'badblocks -svn'
on each of the component drives. I have a hunch that it may be a dodgy
SATA cable, but have no evidence: no errors in the logs, nothing in
dmesg.

Is there any way to get more information? I am starting to think this
only started after I changed from RAID 5 to 6... which I did less than
a month ago.

The only lead I have is that whilst doing the badblocks run, one drive
ran at ~10-15 MB/s whereas the rest are going at ~30. I have another
identical model drive coming, so I will see if that one is slow too.
But the lack of logging info is not helpful and worrying, and the
prospect of silent corruption is a big worry!
* Re: Fw: Why does one get mismatches?
@ 2010-01-24 21:52 ` Roger Heflin
1 sibling, 0 replies; 70+ messages in thread
From: Roger Heflin @ 2010-01-24 21:52 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

Jon Hardcastle wrote:
> The only lead I have is that whilst doing the badblocks run, one drive
> ran at ~10-15 MB/s whereas the rest are going at ~30. [...] But the
> lack of logging info is not helpful and worrying, and the prospect of
> silent corruption is a big worry!

It is possible that the reads are sometimes being corrupted. I have
seen a couple of different controllers fail in ways that produce read
corruption. Basically, you put 50 or so largish files with the same
checksum on the disk (50 x the file size needs to be at least 2x
greater than RAM, so the reads cannot be satisfied from cache), then
you cksum all of the files repeatedly and see if any checksum changes.
If it does, the "bad" file will move around from run to run, which
tells you the data on disk is OK and the reads are at fault.

I have seen controllers from a couple of different companies fail this
way; usually it comes from a bad PCI interface chip, or a bad
configuration (bus clocked too fast) causing PCI parity errors. I had
one controller fail outright and cause errors (replacing it with a
spare corrected things). In a second case I found that the motherboard
was running the PCI bus too fast for the number of cards: FC cards from
two different companies failed, each in a slightly different way - one
silently corrupted, the other crashed the machine at about the time an
error would have been expected - and the issue went away once I slowed
the bus down one step (PCI-X 133 -> PCI-X 100, or PCI-X 100 ->
PCI-X 66).

In both cases I did not find any write corruption, but found read
corruption often. If this happens on a RAID 5 device it would be bad
whenever you had to use parity: a corrupt read would mean the
regenerated parity would be wrong, and a later restore from that parity
would produce corrupted data.

I don't know how strong the internal SATA link protection is: if it
uses CRCs, errors on the cable are almost impossible; if it uses
parity, errors are easy. The PCI bus uses parity, so it is fairly easy
for errors to get through there, but I have only seen them very
rarely - maybe 5 times in 10,000 machine-years of operation (2000+
machines for several years).
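A sketch of the read-corruption test Roger describes (file names and sizes are illustrative; the point is that the working set must be well over the size of RAM so the reads really hit the disk rather than the page cache):

    # create 50 identical large files, here 1 GiB each
    dd if=/dev/urandom of=/mnt/test/template bs=1M count=1024
    for i in $(seq 1 50); do cp /mnt/test/template /mnt/test/file$i; done

    # checksum them repeatedly; with a controller corrupting reads,
    # the odd checksum out moves between files from run to run
    cksum /mnt/test/file* > /tmp/run1.sums
    cksum /mnt/test/file* > /tmp/run2.sums
    diff /tmp/run1.sums /tmp/run2.sums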
* Re: Fw: Why does one get mismatches?
@ 2010-01-24 23:13 ` Goswin von Brederlow
1 sibling, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-24 23:13 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Is there any way to get more information? I am starting to think this
> only started after I changed from RAID 5 to 6... which I did less than
> a month ago.

You did run a repair pass, and not just repeated check passes, right?
Check itself only counts the mismatches but does not correct them.
If the raid is unused (vgchange -a n) and you first repair and then
check, then that definitely should not find any mismatches.

MfG
        Goswin
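Spelled out, the idle repair-then-check sequence Goswin describes looks something like this (md0 and the volume group name are illustrative):

    vgchange -a n vg_data              # deactivate the LVs so nothing writes

    echo repair > /sys/block/md0/md/sync_action
    while grep -q repair /sys/block/md0/md/sync_action; do sleep 60; done

    echo check > /sys/block/md0/md/sync_action
    while grep -q check /sys/block/md0/md/sync_action; do sleep 60; done

    cat /sys/block/md0/md/mismatch_cnt # should now read 0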
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 10:07 ` Jon Hardcastle
0 siblings, 1 reply; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-25 10:07 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

--- On Sun, 24/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:

> You did run a repair pass, and not just repeated check passes, right?
> Check itself only counts the mismatches but does not correct them.
> If the raid is unused (vgchange -a n) and you first repair and then
> check, then that definitely should not find any mismatches.

Hello!

Yes, I have a simple script that first does a check and then, if there
are mismatches, does a repair. I have then been manually rerunning a
check, and I keep getting mismatches. It goes like this: 232, 8, 24, 8,
8, 16, 16, 24, 24, 8, 16, 24. But I have also done this manually and
run several repairs in a row (assuming that a repair will return 0 if
no work is to be done).

Now the array is completely dismantled and I am running badblocks on
the drives, but I am on the last 2 of the 7 drives and I still have no
leads. No bad blocks, no offline uncorrectable sectors, no pending
sectors, no dmesg errors, no nothing. I have absolutely no leads
whatsoever.

The only things I have left to try are a full memtest, disconnecting
and reseating the additional SATA controllers, and buying 7 new SATA
cables in case one is bad. But it would be REALLY helpful to know on
which drive the mismatches have occurred. Any help here would be
gratefully received!

I might even try converting the array back to RAID 5, as I remember I
had mismatches immediately after I converted from 5 to 6.
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 10:37 ` Goswin von Brederlow
0 siblings, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-25 10:37 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Now the array is completely dismantled and I am running badblocks on
> the drives, but I am on the last 2 of the 7 drives and I still have no
> leads. No bad blocks, no offline uncorrectable sectors, no pending
> sectors, no dmesg errors, no nothing.

The problem with badblocks is that it writes the same pattern
everywhere. If the problem is that data gets read from or written to
the wrong block, then that will not show up.

Try formatting each drive and running fstest [1] on it, or some other
test that verifies data integrity using a different pattern per block.

MfG
        Goswin

[1] http://mrvn.homeip.net/fstest/
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 10:52 ` Jon Hardcastle
0 siblings, 2 replies; 70+ messages in thread
From: Jon Hardcastle @ 2010-01-25 10:52 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

--- On Mon, 25/1/10, Goswin von Brederlow <goswin-v-b@web.de> wrote:

> The problem with badblocks is that it writes the same pattern
> everywhere. If the problem is that data gets read from or written to
> the wrong block, then that will not show up.
>
> Try formatting each drive and running fstest [1] on it, or some other
> test that verifies data integrity using a different pattern per block.

This is going to be a time-consuming process, as I'll have to remove
each drive from the array one at a time, test it, and then resync.

Thanks for the link, but could a similar result be achieved with the -w
option of badblocks? Or perhaps a dd if=/dev/urandom? Hmm, scratch
that - urandom won't work, as you need to be able to compare what you
read back against what you wrote.

It is just a worry, as I clearly have mismatches and therefore
corrupted data.
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 17:32 ` Goswin von Brederlow
1 sibling, 0 replies; 70+ messages in thread
From: Goswin von Brederlow @ 2010-01-25 17:32 UTC (permalink / raw)
To: Jon; +Cc: linux-raid

Jon Hardcastle <jd_hardcastle@yahoo.com> writes:

> Thanks for the link, but could a similar result be achieved with the
> -w option of badblocks? Or perhaps a dd if=/dev/urandom? Hmm, scratch
> that - urandom won't work, as you need to be able to compare what you
> read back against what you wrote.

No. You obviously should use the -w option in badblocks. Doing a
read-only test is completely pointless, as the raid check has already
tested a read of every block without errors (I assume).

But -w will write one pattern to the whole disk, then read and compare,
and then repeat for the next pattern. If the disk messes up the address
of blocks, that won't be detected. E.g. I had a raid enclosure that
dropped a bit in the block address every once in a while. You get
really interesting corruption with that.

MfG
        Goswin
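One way to get a per-block pattern without fstest is to stamp every block with its own index and verify the stamps on read-back; a misdirected read or write then shows up as a stamp that does not match its location. A destructive sketch (the device name is illustrative; this stamps only the first bytes of each 1 MiB block, which catches addressing errors but is not a full surface test):

    dev=/dev/sdX        # component drive - ALL DATA ON IT IS DESTROYED
    blocks=$(( $(blockdev --getsize64 $dev) / 1048576 ))

    # write phase: stamp each 1 MiB block with its index
    for i in $(seq 0 $((blocks - 1))); do
        printf 'block %012d ' $i | dd of=$dev bs=1M seek=$i conv=notrunc 2>/dev/null
    done

    # verify phase: every stamp must match the block it was read from
    for i in $(seq 0 $((blocks - 1))); do
        got=$(dd if=$dev bs=1M skip=$i count=1 2>/dev/null | head -c 19)
        [ "$got" = "$(printf 'block %012d ' $i)" ] || echo "mismatch at block $i"
    done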
* Re: Fw: Why does one get mismatches?
@ 2010-01-25 19:32 ` Iustin Pop
1 sibling, 0 replies; 70+ messages in thread
From: Iustin Pop @ 2010-01-25 19:32 UTC (permalink / raw)
To: Jon; +Cc: Goswin von Brederlow, linux-raid

On Mon, Jan 25, 2010 at 02:52:58AM -0800, Jon Hardcastle wrote:
> This is going to be a time-consuming process, as I'll have to remove
> each drive from the array one at a time, test it, and then resync.
>
> It is just a worry, as I clearly have mismatches and therefore
> corrupted data.

Just a comment from the benches here: looking at all the tests you have
done, my personal opinion is that this is *not* a hardware problem of
any kind, but possibly some MD software issue. I've never seen such a
high percentage of consistent and silent corruption in hardware; to me
it looks like corruption in the software, *if it is corruption at all*.

I would run a counter-test, to see at least if the 'check' result is
right:

- run your array until 'check' returns mismatches
- shut down the array
- check that the contents of the drives is indeed different, using
  something other than 'check' (e.g. checksum each 1MB block on the
  drives independently, and compare the checksum lists)
- if there are indeed diffs, start the array and run a repair (but send
  no other traffic to the array)
- shut down the array and re-run the external diff test

These tests should tell you whether check is right, and whether repair
indeed fixes the differences.

And another side note: it would be really good if md had a debug option
to actually show the checksums for the differing blocks and their
offsets, to at least see whether the same areas of the drive show
differences (it would be really funny if the diffs were, for example,
in the MD metadata :). Or does md already have something like this?
I stopped using md a year or so ago.

regards,
iustin
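A sketch of the external diff test Iustin proposes: with the array stopped, checksum each 1 MiB block of each component drive and keep one list per drive (device names are illustrative; the one-dd-per-block loop is slow but simple):

    for dev in sdb sdc sdd; do
        blocks=$(( $(blockdev --getsize64 /dev/$dev) / 1048576 ))
        for i in $(seq 0 $((blocks - 1))); do
            dd if=/dev/$dev bs=1M skip=$i count=1 2>/dev/null | md5sum
        done > /tmp/$dev.before
    done

    # assemble the array, run a repair, stop it again, dump the lists
    # a second time as /tmp/<dev>.after, then:
    diff /tmp/sdb.before /tmp/sdb.after   # shows which blocks repair rewrote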
* Re: Fw: Why does one get mismatches?
@ 2010-02-01 21:18 ` Bill Davidsen
1 sibling, 1 reply; 70+ messages in thread
From: Bill Davidsen @ 2010-02-01 21:18 UTC (permalink / raw)
To: Jon; +Cc: linux-raid

Jon Hardcastle wrote:
> Anyone got any suggestions here? I am now on my ~5th check/repair, and
> after a reboot the first check is still returning 8.
>
> All I have done is move the drives around. It is the same
> controllers/cables/etc.
>
> I really don't like the seemingly random nature of whatever
> can/does/has caused these mismatches.

If you have an ext[34] filesystem on this array, try mounting it with
data=journal (yes, it will slow down; this is a TEST). I did limited
testing using this, and it appeared to solve the problem, at least for
the eight hours I had to test.

Comment: when there is a three-way RAID-1, why doesn't repair *vote* on
the correct value instead of just making a guess?

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
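For reference, the test Bill suggests is just a mount option; a sketch (paths are illustrative, and data=journal must be set when the filesystem is mounted, not switched on later by a plain remount):

    # full data journalling: file data goes through the journal too
    mount -o data=journal /dev/md0 /mnt/test

    # or set it persistently in the superblock's default mount options
    tune2fs -o journal_data /dev/md0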
* Re: Why does one get mismatches?
@ 2010-02-01 22:37 ` Neil Brown
0 siblings, 1 reply; 70+ messages in thread
From: Neil Brown @ 2010-02-01 22:37 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Jon, linux-raid

On Mon, 01 Feb 2010 16:18:23 -0500 Bill Davidsen <davidsen@tmr.com> wrote:

> Comment: when there is a three-way RAID-1, why doesn't repair *vote*
> on the correct value instead of just making a guess?

Because truth is not democratic.

(And I defy you to define "correct" in any general way in this
context.)

NeilBrown
* Re: Why does one get mismatches?
@ 2010-02-02 15:11 ` Bill Davidsen
0 siblings, 2 replies; 70+ messages in thread
From: Bill Davidsen @ 2010-02-02 15:11 UTC (permalink / raw)
To: Neil Brown; +Cc: Jon, linux-raid

Neil Brown wrote:
> Because truth is not democratic.
>
> (And I defy you to define "correct" in any general way in this
> context.)

If you are willing to accept that the reconstructed data from RAID-[56]
is "correct", then the data from a RAID-1 majority opinion is "correct".
If you say that such recovered data is "most likely to match what was
written", then data consistent on (N+1)/2 drives of a RAID-1 should be
viewed in the same light. Call it "most likely to be correct" if you
prefer, but a value picked from a drive at random is less likely to be.

This whole discussion simply shows that for RAID-1, software RAID is
less reliable than hardware RAID (no, I don't mean fake-RAID), because
it doesn't pin the data buffer until all copies are written.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
* Re: Why does one get mismatches?
@ 2010-02-03 11:17 ` Goswin von Brederlow
1 sibling, 0 replies; 70+ messages in thread
From: Goswin von Brederlow @ 2010-02-03 11:17 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Neil Brown, Jon, linux-raid

Bill Davidsen <davidsen@tmr.com> writes:

> If you are willing to accept that the reconstructed data from
> RAID-[56] is "correct", then the data from a RAID-1 majority opinion
> is "correct".

Let's ignore for now the fact that software raid seems to write
inconsistent data only to unused blocks. If a block is really unused,
it doesn't matter what is done with it. And if it is used, then
software raid has a big bug that needs to be fixed, not repaired after
the fact.

So let's assume there actually is a true mismatch because one of the
drives returns false data on read. Then in raid1/10 with >2 copies, and
in raid6, you have a way to detect the correct data - correct as in
most likely to be what was written originally. For raid6 that means
identifying the one drive whose data leaves the rest with correct
parity, and for raid1/10 it means finding the majority.

Say you have a 10-way raid1 with 9 blocks holding the same data and one
that differs. Picking a random block is wrong 10% of the time. Do you
really think that in 10% of the cases 9 disks will be corrupt in
exactly the same way?

MfG
        Goswin
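md does not implement any such vote, but the idea is easy to illustrate offline: read the same block from every mirror and see which content the majority holds. A sketch (device names and the block number are illustrative, and it ignores the md metadata/data offset):

    block=123456
    for dev in sdb1 sdc1 sdd1; do
        dd if=/dev/$dev bs=4k skip=$block count=1 2>/dev/null |
            md5sum | sed "s|-\$|$dev|"
    done | sort | uniq -c -w 32 | sort -rn
    # the top line is the majority version; a count of 1 is the odd one out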
* Re: Why does one get mismatches?
@ 2010-02-11 5:14 ` Neil Brown
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-11 5:14 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Jon, linux-raid

On Tue, 02 Feb 2010 10:11:03 -0500 Bill Davidsen <davidsen@tmr.com> wrote:

> This whole discussion simply shows that for RAID-1, software RAID is
> less reliable than hardware RAID (no, I don't mean fake-RAID), because
> it doesn't pin the data buffer until all copies are written.

That doesn't make it less reliable. It just makes it more confusing.

But for a more complete discussion of raid recovery, and of when it
might be sensible to "vote" among the blocks, see

  http://neil.brown.name/blog/20100211050355

NeilBrown
* Re: Why does one get mismatches?
@ 2010-02-11 17:51 ` Bryan Mesich
1 sibling, 1 reply; 70+ messages in thread
From: Bryan Mesich @ 2010-02-11 17:51 UTC (permalink / raw)
To: Neil Brown; +Cc: Bill Davidsen, Jon, linux-raid

On Thu, Feb 11, 2010 at 04:14:44PM +1100, Neil Brown wrote:
> > This whole discussion simply shows that for RAID-1, software RAID is
> > less reliable than hardware RAID (no, I don't mean fake-RAID),
> > because it doesn't pin the data buffer until all copies are written.
>
> That doesn't make it less reliable. It just makes it more confusing.

I agree that Linux software RAID is no less reliable than hardware RAID
with regards to the above conversation. It is, however, confusing to
have a counter that indicates there are problems with a RAID 1 array
when in fact there are none.

I (and I'm sure others) value your expertise on this matter, but it's
hard to feel at ease when the car you're driving across the country has
the check-engine light on. In this case, I believe the mechanic when
you say the car is okay, but it might be difficult for others to
believe as I do.

I rely heavily on software RAID, as I'm sure many others do. I believe
this is quite evident in the amount of email that has been circulated
about the mismatch_cnt "problem". IMO, a user's perception of
reliability is really the root of the problem in this case. No one who
depends on this stuff wants to see weakness, and those who do see it
are going to be concerned - especially those running distributions such
as RedHat/Fedora that do weekly checks on their arrays.

Neil, you mentioned some time ago that you were going to create a patch
that would show where the mismatches are located on disk. Did you do
this, and if so, where can I find the patch?

Bryan
* Re: Why does one get mismatches?
@ 2010-02-16 21:25 ` Bill Davidsen
0 siblings, 1 reply; 70+ messages in thread
From: Bill Davidsen @ 2010-02-16 21:25 UTC (permalink / raw)
To: Bryan Mesich, Neil Brown, Jon, linux-raid

Bryan Mesich wrote:
> I agree that Linux software RAID is no less reliable than hardware
> RAID with regards to the above conversation. It is, however, confusing
> to have a counter that indicates there are problems with a RAID 1
> array when in fact there are none.

Sorry, but real hardware RAID is more reliable than software RAID, and
Neil's justification for not doing smart recovery mentions why. Note
that this refers to real hardware RAID, not fakeraid, which is just
some firmware in a BIOS driving the existing hardware.

The issue lies with data changing between the writes to the individual
drives. In hardware RAID the data traverses the memory bus once, and
only once, and goes into the controller's cache, from which it is
written to all mirrored drives. With software RAID an individual write
is done to each drive, and if the data in the buffer changes between
the write to one drive and the write to another, you get different
values on disk. Neil may be convinced that the OS somehow "knows" which
of the mirror copies is correct, i.e. most recent, and never uses the
stale data, but if that information were really available, reads would
always return the latest value, and it would not be possible to read
the same file multiple times and get different MD5sums. It would also
be possible to do a stable smart recovery by propagating the most
recent copy to the other mirror drives.

I had hoped that mounting with data=journal would lead to consistency;
that seems not to be true either.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
* Re: Why does one get mismatches?
@ 2010-02-16 21:38 ` Steven Haigh
0 siblings, 2 replies; 70+ messages in thread
From: Steven Haigh @ 2010-02-16 21:38 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Bryan Mesich, Neil Brown, Jon, linux-raid

On Tue, 16 Feb 2010 16:25:25 -0500, Bill Davidsen <davidsen@tmr.com> wrote:
> The issue lies with data changing between the writes to the individual
> drives. In hardware RAID the data traverses the memory bus once, and
> only once, and goes into the controller's cache, from which it is
> written to all mirrored drives. With software RAID an individual write
> is done to each drive, and if the data in the buffer changes between
> the write to one drive and the write to another, you get different
> values on disk.

I agree, Bill - there is an issue with software RAID1 when it comes to
some hardware. I have one machine where the ONLY way to stop the root
filesystem going read-only due to journal issues is to remove RAID.
Having RAID1 enabled gives silent corruption of both data and the
journal at seemingly random times.

I can see the data corruption by running a verify between RPM and the
data on the drive. Reinstalling the affected packages fixes things -
until some other random things get corrupted the next time.

The myth that data corruption in RAID1 ONLY happens to swap and/or
unused space on a drive is absolute rubbish.

-- 
Steven Haigh
Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
* Re: Why does one get mismatches?
@ 2010-02-17 3:19 ` Bryan Mesich
1 sibling, 0 replies; 70+ messages in thread
From: Bryan Mesich @ 2010-02-17 3:19 UTC (permalink / raw)
To: Steven Haigh; +Cc: Bill Davidsen, Neil Brown, Jon, linux-raid

On Wed, Feb 17, 2010 at 08:38:11AM +1100, Steven Haigh wrote:
> I agree, Bill - there is an issue with software RAID1 when it comes to
> some hardware. I have one machine where the ONLY way to stop the root
> filesystem going read-only due to journal issues is to remove RAID.
> Having RAID1 enabled gives silent corruption of both data and the
> journal at seemingly random times.

Maybe I missed something earlier in this thread, and if so I apologize,
but I was not aware of anyone reporting filesystem corruption due to
software RAID 1. Needless to say, that is a serious problem if it is
occurring. At work we use software RAID 1 on the majority of our
production servers and have never seen problems such as you describe.
I'm not trying to discredit you - just noting that we have not seen
similar results.

> I can see the data corruption by running a verify between RPM and the
> data on the drive. Reinstalling the affected packages fixes things -
> until some other random things get corrupted the next time.

For curiosity's sake, what kind of files did RPM report as corrupt
after running the verify? The reason I ask is that I would expect user
data to become corrupt before system files, as the latter are typically
written to disk at install/update time and never written again. Or
maybe there is a reason... correct me if I'm wrong ;)

In my last post I asked Neil if he had a patch that would indicate
where the mismatches live on disk. Have you found a way to correlate
the mismatches with your filesystem corruption?

Bryan
* Re: Why does one get mismatches?
@ 2010-02-17 23:05 ` Neil Brown
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-17 23:05 UTC (permalink / raw)
To: Steven Haigh; +Cc: Bill Davidsen, Bryan Mesich, Jon, linux-raid

On Wed, 17 Feb 2010 08:38:11 +1100 Steven Haigh <netwiz@crc.id.au> wrote:

> I can see the data corruption by running a verify between RPM and the
> data on the drive. Reinstalling the affected packages fixes things -
> until some other random things get corrupted the next time.

Sounds very much like dodgy drives.

> The myth that data corruption in RAID1 ONLY happens to swap and/or
> unused space on a drive is absolute rubbish.

Absolute rubbish does seem to be a suitable phrase here.

There is no question of data corruption. When memory changes between
being written to one device and to another, this does not cause
corruption, only inconsistency. Either the block will be written again
consistently soon, or it will never be read.

If the host crashes before the blocks are made consistent, the
inconsistency will not be visible, as the resync will fix it.

If you are getting any corruption, then it is NOT due to this facet of
the RAID1 implementation - it is due to something else. My guess is bad
hardware, anywhere from the memory to the hard drives.

NeilBrown
* Re: Why does one get mismatches?
@ 2010-02-19 15:18 ` Piergiorgio Sartor
0 siblings, 1 reply; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-02-19 15:18 UTC (permalink / raw)
To: Neil Brown; +Cc: Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid

Hi,

> When memory changes between being written to one device and to
> another, this does not cause corruption, only inconsistency. Either
> the block will be written again consistently soon, or it will never
> be read.

well, is this for sure? I mean, by design of the md subsystem? Or is it
like that because we trust the filesystem?

And why is it like that? Why not use the good old readers-writer
mechanism to make sure all copies of a block are the same when they are
written (namely, a lock)?

It seems to me, though maybe I'm wrong, not such a safe design. I
assume it should not be possible to cause this situation unless there
is a crash or a bug in the md layer.

What if a new filesystem writes a block, changing it on the fly, i.e.
during the RAID-1 writes, and then, later, reads this block again? It
will get, maybe, not the correct data.

In other words, would it not be better for the md layer to be robust
against these kinds of threats?

bye,

-- 
piergiorgio
* Re: Why does one get mismatches?
@ 2010-02-19 22:02 ` Neil Brown
0 siblings, 4 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-19 22:02 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid

On Fri, 19 Feb 2010 16:18:09 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote:

> well, is this for sure? I mean, by design of the md subsystem? Or is
> it like that because we trust the filesystem?

It is because we trust the filesystem.

> And why is it like that? Why not use the good old readers-writer
> mechanism to make sure all copies of a block are the same when they
> are written (namely, a lock)?

md is not in a position to lock the page - there is simply no way it
can stop the filesystem from changing it. The only thing it could do
would be to make a copy, then write the copy out. This would incur a
performance cost.

> It seems to me, though maybe I'm wrong, not such a safe design.

I think you are wrong.

> I assume it should not be possible to cause this situation unless
> there is a crash or a bug in the md layer.

I'm not sure what situation you are referring to...

> What if a new filesystem writes a block, changing it on the fly, i.e.
> during the RAID-1 writes, and then, later, reads this block again? It
> will get, maybe, not the correct data.

This is correct. However it would be equally correct if you were
talking about a normal disk drive rather than a RAID1 pair.

If the filesystem changes the page (or allows it to change) while a
write is pending, then it cannot know what actual data was written. So
it must write the block out again before it ever reads it in. RAID1 is
no different from any other device in this respect.

> In other words, would it not be better for the md layer to be robust
> against these kinds of threats?

Possibly, but at what cost?

There are two ways that I can imagine to 'solve' this issue.

1/ Always copy the page before writing. This would incur a significant
   overhead, both in the complexity of pre-allocating memory and in the
   delay taken to perform the copy. And it would very rarely actually
   be needed.

2/ Have the filesystem protect the page from changes while it is being
   written. This is quite possible for the filesystem to do (while it
   is impossible for md to do). There could be some performance cost
   with memory-mapped pages, as they would need to be unmapped, but
   there would be no significant cost for reads, writes, and filesystem
   metadata operations.

   Further, any filesystem that wants to make use of the integrity
   checks that newer drives provide (where the filesystem provides a
   'checksum' for the block which gets passed all the way down, written
   to storage, and returned on a read) will need to do this anyway. So
   it is likely that in the near future all significant filesystems
   will provide all the guarantees md needs in order to simply do
   nothing different.

So my feeling is that md is doing the best thing already.

I believe 'swap' will always be an issue, as unmapping swap pages
during write could be a serious performance cost.

It might be that the best thing to do with swap is to somehow mark the
area of an array used for swap as "don't care", so md never bothers to
resync it, and never reports inconsistencies there, as they really are
not an issue.

NeilBrown
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown @ 2010-02-19 22:37 ` Piergiorgio Sartor 2010-02-19 23:34 ` Asdo ` (2 subsequent siblings) 3 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-19 22:37 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid Hi, > > Or it is like that because we trust the filesystem? > > It is because we trust the filesystem. well, I hope the trust is not misplaced... :-) > md is not in a position to lock the page - there is simply no way it can stop > the filesystem from changing it. How can this be? > The only thing it could do would be to make a copy, then write the copy out. Even making a copy would not be safe, since during the copy the data could still change, or not? > This would incur a performance cost. It's a matter of deciding what is more important. > > It seems to me, maybe I'm wrong, not a so safe design. > > I think you are wrong. Could be, I never heard of situations like this. > > I assume, it should not be possible to cause this > > situation, unless there is a crash or a bug in the > > md layer. > > I'm not sure what situation you are referring to... It should not be possible to cause that different mirrors of a RAID-1 end up with different data. Otherwise, no point to have the mirroring. > > What if a new filesystem will write a block, changing > > on the fly, i.e. during RAID-1 writes, and then, later, > > reading this block again? > > > > It will get, maybe, not the correct data. > > This is correct. However it would be equally correct if you were talking > about s normal disk drive rather than a RAID1 pair. Nono, there is a huge difference. In a single drive case, the FS is responsible of writing rubbish to a single block. The result would be that a block has "strange" data, but *always* the same data. Here the situation is that the data might be "strange", but different accesses, to the same block of the RAID-1, could potentially return different data. As a byproduct of this effect, the "check" functionality becomes not so useful anymore. > If the filesystem changes the page (or allows it to change) while a write is > pending, then it cannot know what actual data was written. So it must write > the block out again before it ever reads it in. > RAID1 is no different to any other device in this respect. Is different, as mentioned above. The FS could, intentionally, change the data during a write, but later it could expect to have always the same data. In other words, the FS does not guarantee the "spatial" consistency of the data (the bytes in a block), but the "temporal" consistency (successive reads return always the same data) could be expected. And this happens in case of a normal HDD. It does not happen in RAID-1. > Possibly, but at what cost? As I wrote: it is matter to decide what is more important and useful. > There are two ways that I can imagine to 'solve' this issue. > > 1/ always copy the page before writing. This would incur a significant > overhead, both in the complexity of pre-allocation memory and in the > delay taken to perform the copy. And it would very rarely be actually > needed. Does really a copy solve the issue? Is the copy done in atomic way? The pre-allocation does not seem to me to be a problem, since it will be done once and for all (at device creation), and not dynamically. 
The copy *might* be an overhead; nevertheless I wonder if it is really so much of a problem, especially considering that, after the copy, the MD layer can optimize the transaction to the HDDs as much as it likes. > 2/ Have the filesystem protect the page from changes while it is being > written. This is quite possible for the filesystem to do (while it > is impossible for md to do). There could be some performance I'm really curious to understand what kind of thinking is behind a design allowing such a situation... I mean *system* design, not md design. > cost with memory-mapped pages as they would need to be unmapped, > but there would be no significant cost for reads, writes, and filesystem > metadata operations. > Further, any filesystem that wants to make use of the integrity checks > that newer drives provide (where the filesystem provides a 'checksum' for > the block which gets passed all the way down and written to storage, and > returned on a read) will need to do this anyway. So it is likely that in > the near future all significant filesystems will provide all the > guarantees md needs in order to simply do nothing different. That's good to know. > So my feeling is that md is doing the best thing already. I do not think this is an md issue per se; it seems to me, from the description, that this is an overall design issue. Normally, also for performance reasons, one approach is to allocate queue(s) of buffers between two modules (like FS and MD), where each of the modules always has *exclusive* access to its own buffer(s), i.e. the buffer(s) it holds in a given time frame. Once a module releases the buffer(s), it cannot touch them (read or write) anymore. Once the buffer(s) arrive at the other module, that module can do whatever it wants with them, and it is sure it has exclusive access. Real-time systems normally use techniques like this to guarantee consistency *and* performance. Anyway, thanks for the clarifications, bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown 2010-02-19 22:37 ` Piergiorgio Sartor @ 2010-02-19 23:34 ` Asdo 2010-02-20 4:27 ` Goswin von Brederlow 2010-02-20 4:23 ` Goswin von Brederlow 2010-02-24 14:54 ` Bill Davidsen 3 siblings, 1 reply; 70+ messages in thread From: Asdo @ 2010-02-19 23:34 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bill Davidsen, Bryan Mesich, Jon, linux-raid Thank you for your explanation, Neil, Neil Brown wrote: > When memory changes between being written to one device and to another, this > does not cause corruption, only inconsistency. Either the block will be > written again consistently soon, or it will never be read. This is the crucial part... Why would the filesystem reuse the same memory without rewriting the *same* block? Can the same memory area be used for another block? If yes, I understand. If not, I don't understand why the block is not eventually rewritten to contain equal data on both disks. Is this a power-fail-in-the-middle thing, or can it happen even when the power is always on? Do I understand correctly that raid-456 is instead safe ("no-mismatch") because it copies the memory region? Thank you ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-19 23:34 ` Asdo @ 2010-02-20 4:27 ` Goswin von Brederlow 2010-02-20 11:12 ` Asdo 0 siblings, 1 reply; 70+ messages in thread From: Goswin von Brederlow @ 2010-02-20 4:27 UTC (permalink / raw) To: linux-raid Asdo <asdo@shiftmail.org> writes: > Thank you for your explanation, Neil, > > Neil Brown wrote: >> When memory changes between being written to one device and to another, this >> does not cause corruption, only inconsistency. Either the block will be >> written again consistently soon, or it will never be read. > > This is the crucial part... > > Why would the filesystem reuse the same memory without rewriting the > *same* block? > > Can the same memory area be used for another block? > If yes, I understand. If not, I don't understand why the block is > not eventually rewritten to contain equal data on both disks. > > Is this a power-fail-in-the-middle thing, or can it happen even when > the power is always on? The check is usually done with the filesystem mounted and in use. So one case would be that the block got written, changed and then checked before the FS decided to flush the dirty block again. The other scenario suggested in the past is that the block was written, changed and then the file deleted, making the block unused, before it got flushed again. The filesystem then sees no need to write a dirty but unused block, so it never gets rewritten. It never gets read either, so that is safe. > Do I understand correctly that raid-456 is instead safe > ("no-mismatch") because it copies the memory region? > > Thank you MfG Goswin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-20 4:27 ` Goswin von Brederlow @ 2010-02-20 11:12 ` Asdo 2010-02-21 11:13 ` Goswin von Brederlow 0 siblings, 1 reply; 70+ messages in thread From: Asdo @ 2010-02-20 11:12 UTC (permalink / raw) To: Goswin von Brederlow; +Cc: linux-raid Goswin von Brederlow wrote: > The check is usually done with the filesystem mounted and in use. So one > case would be that the block got written, changed and then checked > before the FS decided to flush the dirty block again. > > The other scenario suggested in the past is that the block was written, > changed and then the file deleted, making the block unused, This is not enough to cause the problem, if I understand correctly; it also needs to change value at this point, right? So how can it change value... is the same buffer used for another block? > before it > got flushed again. The filesystem then sees no need to write a dirty but > unused block, so it never gets rewritten. It never gets read either, so > that is safe. > ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-20 11:12 ` Asdo @ 2010-02-21 11:13 ` Goswin von Brederlow [not found] ` <8754A21825504719B463AD9809E54349@m5> 0 siblings, 1 reply; 70+ messages in thread From: Goswin von Brederlow @ 2010-02-21 11:13 UTC (permalink / raw) To: Asdo; +Cc: Goswin von Brederlow, linux-raid Asdo <asdo@shiftmail.org> writes: > Goswin von Brederlow wrote: >> The check is usually done with the filesystem mounted and in use. So one >> case would be that the block got written, changed and then checked >> before the FS decided to flush the dirty block again. >> >> The other scenario suggested in the past is that the block was written, >> changed and then the file deleted, making the block unused, > This is not enough to cause the problem, if I understand correctly; it > also needs to change value at this point, right? > So how can it change value... is the same buffer used for another block? open tempfile write tempfile raid1 starts to commit the block write some more changing the block raid1 writes the 2nd copy of the block delete file fs never recommits the dirty page Personally I don't really buy that scenario. At least not at the frequency with which mismatches occur. >> before it >> got flushed again. The filesystem then sees no need to write a dirty but >> unused block, so it never gets rewritten. It never gets read either, so >> that is safe. >> MfG Goswin ^ permalink raw reply [flat|nested] 70+ messages in thread
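For anyone who wants to test that scenario, a rough way to try to provoke it on a scratch raid1; the device name md9 and the mount point /mnt/test are assumptions, and whether it reproduces at all depends entirely on writeback timing:

    # keep dirtying pages of a temp file, then delete it mid-writeback
    dd if=/dev/urandom of=/mnt/test/tempfile bs=4k count=25600 &
    sleep 1
    rm -f /mnt/test/tempfile    # unlink while writes may still be in flight
    wait
    sync
    echo check > /sys/block/md9/md/sync_action
    cat /sys/block/md9/md/mismatch_cnt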
[parent not found: <8754A21825504719B463AD9809E54349@m5>]
[parent not found: <20100221194400.GA2570@lazy.lzy>]
* Re: Why does one get mismatches? [not found] ` <20100221194400.GA2570@lazy.lzy> @ 2010-02-22 13:01 ` Asdo 2010-02-22 13:30 ` Piergiorgio Sartor 2010-02-22 13:44 ` Piergiorgio Sartor 0 siblings, 2 replies; 70+ messages in thread From: Asdo @ 2010-02-22 13:01 UTC (permalink / raw) To: Piergiorgio Sartor Cc: Guy Watkins, 'Goswin von Brederlow', linux-raid Piergiorgio Sartor wrote: > Hi >> If someone can map a mismatch to a file, the debate would be over. >> > well, IMHO mismatches should not happen "by design", > but only due to failures or bugs. > > For me, it is not so relevant whether there is a file (or > metadata) or nothing under a mismatch; the whole idea > of "mirroring" turns out to be unusable, still IMHO, > if a mismatch can be caused intentionally. > Even "nothing"? Why? Here we are talking about "nothing". Or so it seems. ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-22 13:01 ` Asdo @ 2010-02-22 13:30 ` Piergiorgio Sartor 2010-02-22 13:44 ` Piergiorgio Sartor 1 sibling, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-22 13:30 UTC (permalink / raw) To: Asdo Cc: Piergiorgio Sartor, Guy Watkins, 'Goswin von Brederlow', linux-raid Hi, > Even "nothing"? > Why? for the following reasons: 1) the "check" command is useless if there are mismatches, whether they are harmful or harmless 2) the mirroring concept implies *identical* mirrors, not identical only if the upper layer decides so 3) if a filesystem has a small bug, this will not be caught later; that is, the filesystem could cause a *wrong* mismatch (just as there are "correct" ones) 4) in general, it is not safe to offer a mirroring which is not always mirroring properly > > Here we are talking about "nothing". > > Or so it seems. As I wrote, it does not matter; it is just not correct to rely on the good will of other pieces of software to guarantee that the RAID-1 is working properly. The RAID-1 should work properly because it does work properly, not because the filesystem is kind enough to allow it to work properly. This could be a system design problem, not an MD one, of course, so I'm not saying that Neil or others should fix MD; what I'm writing is that it is astonishing to me that things work this way (or "walk this way"). That's it, I'm just surprised to learn such situations are present. bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-22 13:01 ` Asdo 2010-02-22 13:30 ` Piergiorgio Sartor @ 2010-02-22 13:44 ` Piergiorgio Sartor 1 sibling, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-22 13:44 UTC (permalink / raw) To: Asdo Cc: Piergiorgio Sartor, Guy Watkins, 'Goswin von Brederlow', linux-raid Hi again, forgot one thing... I have some PCs where those mismatches show up: sometimes more, sometimes less, sometimes none at all. All these PCs have the filesystem (ext3) directly on the RAID drive. I have one more PC, where there is an LVM layer in between. On this PC, which also has different HW (the others are all identical), I never saw mismatches. I can imagine that LVM takes care of handling the memory buffers properly. Can anyone confirm this? Thanks, bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? [not found] ` <8754A21825504719B463AD9809E54349@m5> [not found] ` <20100221194400.GA2570@lazy.lzy> @ 2010-02-24 19:42 ` Bill Davidsen 1 sibling, 0 replies; 70+ messages in thread From: Bill Davidsen @ 2010-02-24 19:42 UTC (permalink / raw) To: Guy Watkins; +Cc: 'Goswin von Brederlow', 'Asdo', linux-raid Guy Watkins wrote: > } open tempfile > } write tempfile > } raid1 starts to commit the block > } write some more changing the block > } raid1 writes the 2nd copy of the block > } delete file > } fs never recommits the dirty page > } > } Personally I don't really buy that scenario. At least not at the > } frequency with which mismatches occur. > } > } > } MfG > } Goswin > > If someone can map a mismatch to a file, the debate would be over. > Simple test: create a three-way raid-1. Run it on a heavily used ext3 file system for a few days, until it has a 3-4k mismatch count. Shut it down gracefully. Now - run e2fsck -n on each of the parts, to prove the f/s is not corrupt - mount one partition, using the ext2 type, ro,noatime[1] - do an md5sum on every file and put the output in a file[2] (on another f/s, obviously) - mount each of the mirrors the same way - run md5sum -c {saved_file} to check file content If you get files which don't compare copy to copy, you can see that the issue is real. [1] rather than explain to newbies why neither atime updates nor journal activity changes the file content, I do it this way. [2] MD5 is fine for detecting file changes. You need sha1 or such only to detect malicious changes made with intent to hide the change. Because it uses little CPU it's as good as any. Use sha256sum or similar if you doubt me. Having had mismatches on raid-1 and not on raid-6 using the same three drives, I question the "hardware error" theory of mismatch origin. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
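Bill's test, spelled out as a script. This is only a sketch: the array name, leg names and mount point are assumptions, and it additionally assumes 0.90 metadata (superblock at the end of the device), so each mirror leg carries the filesystem at offset zero and can be mounted directly once the array is stopped:

    # record checksums from the assembled array
    mount -t ext2 -o ro,noatime /dev/md0 /mnt/chk
    (cd /mnt/chk && find . -type f -exec md5sum {} + > /root/md0.md5)
    umount /mnt/chk
    mdadm --stop /dev/md0

    # compare each leg against the recorded checksums; only failures are printed
    for leg in /dev/sda1 /dev/sdb1 /dev/sdc1; do
        mount -t ext2 -o ro,noatime $leg /mnt/chk
        (cd /mnt/chk && md5sum -c /root/md0.md5 | grep -v ': OK$')
        umount /mnt/chk
    done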
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown 2010-02-19 22:37 ` Piergiorgio Sartor 2010-02-19 23:34 ` Asdo @ 2010-02-20 4:23 ` Goswin von Brederlow 2010-02-24 14:54 ` Bill Davidsen 3 siblings, 0 replies; 70+ messages in thread From: Goswin von Brederlow @ 2010-02-20 4:23 UTC (permalink / raw) To: linux-raid Neil Brown <neilb@suse.de> writes: > On Fri, 19 Feb 2010 16:18:09 +0100 > Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > >> Hi, >> >> > When memory changes between being written to one device and to another, this >> > does not cause corruption, only inconsistency. Either the block will be >> > written again consistently soon, or it will never be read. >> >> well, is this for sure? >> I mean, by design of the md subsystem. >> >> Or is it like that because we trust the filesystem? > > It is because we trust the filesystem. > >> >> And why is it like that? Why not use the good old >> readers-writer mechanism to make sure all blocks are >> the same when they are written (namely a lock)? > > md is not in a position to lock the page - there is simply no way it can stop > the filesystem from changing it. > The only thing it could do would be to make a copy, then write the copy out. > This would incur a performance cost. > >> >> It seems to me, maybe I'm wrong, not such a safe design. > > I think you are wrong. No, he is right. The safe design is to copy or at least copy-on-write the page. Maybe this could be configurable so people can choose between really safe and fast. >> I assume, it should not be possible to cause this >> situation, unless there is a crash or a bug in the >> md layer. > > I'm not sure what situation you are referring to... > >> >> What if a new filesystem will write a block, changing >> on the fly, i.e. during RAID-1 writes, and then, later, >> reading this block again? >> >> It will get, maybe, not the correct data. > > This is correct. However it would be equally correct if you were talking > about a normal disk drive rather than a RAID1 pair. > If the filesystem changes the page (or allows it to change) while a write is > pending, then it cannot know what actual data was written. So it must write > the block out again before it ever reads it in. > RAID1 is no different to any other device in this respect. > > >> >> In other words, would it be better, for the md layer, >> to be robust against these kinds of threats? >> > > Possibly, but at what cost? > There are two ways that I can imagine to 'solve' this issue. > > 1/ always copy the page before writing. This would incur a significant > overhead, both in the complexity of pre-allocating memory and in the > delay taken to perform the copy. And it would very rarely be actually > needed. > 2/ Have the filesystem protect the page from changes while it is being > written. This is quite possible for the filesystem to do (while it > is impossible for md to do). There could be some performance > cost with memory-mapped pages as they would need to be unmapped, > but there would be no significant cost for reads, writes, and filesystem > metadata operations. > Further, any filesystem that wants to make use of the integrity checks > that newer drives provide (where the filesystem provides a 'checksum' for > the block which gets passed all the way down and written to storage, and > returned on a read) will need to do this anyway. So it is likely that in > the near future all significant filesystems will provide all the > guarantees md needs in order to simply do nothing different.
> > So my feeling is that md is doing the best thing already. > > I believe 'swap' will always be an issue as unmapping swap pages during write > could be a serious performance cost. It might be that the best thing to do > with swap is to somehow mark the area of an array used for swap as "don't > care" so md never bothers to resync it, and never reports inconsistencies > there, as they really are not an issue. > > NeilBrown Or one could turn on the copy/copy-on-write mode, at least during the test. I'm also not convinced that the performance of swap is an issue. Swap speed is already many orders of magnitude lower than real memory, making any serious use of swap prohibitive. I certainly would not care one way or the other if swapping got 50% slower. I do care about not having a mismatch count, though. MfG Goswin ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-19 22:02 ` Neil Brown ` (2 preceding siblings ...) 2010-02-20 4:23 ` Goswin von Brederlow @ 2010-02-24 14:54 ` Bill Davidsen 2010-02-24 21:37 ` Neil Brown 3 siblings, 1 reply; 70+ messages in thread From: Bill Davidsen @ 2010-02-24 14:54 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > md is not in a position to lock the page - there is simply no way it can stop > the filesystem from changing it. > The only thing it could do would be to make a copy, then write the copy out. > This would incur a performance cost. > > Two thoughts on that - one is that for critical data, give me the option at array start time: make the copy, slow the performance, and make it more consistent. My second thought is that a checksum of the page before initiating the write and after all writes are complete might be less of a performance hit, and could still detect that the buffer had changed. >> It seems to me, maybe I'm wrong, not such a safe design. >> > > I think you are wrong. > > > This is correct. However it would be equally correct if you were talking > about a normal disk drive rather than a RAID1 pair. > If the filesystem changes the page (or allows it to change) while a write is > pending, then it cannot know what actual data was written. So it must write > the block out again before it ever reads it in. > RAID1 is no different to any other device in this respect. > > > >> In other words, would it be better, for the md layer, >> to be robust against these kinds of threats? >> >> > > Possibly, but at what cost? > There are two ways that I can imagine to 'solve' this issue. > > 1/ always copy the page before writing. This would incur a significant > overhead, both in the complexity of pre-allocating memory and in the > delay taken to perform the copy. And it would very rarely be actually > needed. > 2/ Have the filesystem protect the page from changes while it is being > written. This is quite possible for the filesystem to do (while it > is impossible for md to do). There could be some performance > cost with memory-mapped pages as they would need to be unmapped, > but there would be no significant cost for reads, writes, and filesystem > metadata operations. > Your next section somewhat mirrors my thought on md checking the data after the write to be sure it didn't change. > Further, any filesystem that wants to make use of the integrity checks > that newer drives provide (where the filesystem provides a 'checksum' for > the block which gets passed all the way down and written to storage, and > returned on a read) will need to do this anyway. So it is likely that in > the near future all significant filesystems will provide all the > guarantees md needs in order to simply do nothing different. > > So my feeling is that md is doing the best thing already. > > I believe 'swap' will always be an issue as unmapping swap pages during write > could be a serious performance cost. It might be that the best thing to do > with swap is to somehow mark the area of an array used for swap as "don't > care" so md never bothers to resync it, and never reports inconsistencies > there, as they really are not an issue.
> > NeilBrown > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 14:54 ` Bill Davidsen @ 2010-02-24 21:37 ` Neil Brown 2010-02-26 20:48 ` Bill Davidsen 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-02-24 21:37 UTC (permalink / raw) To: Bill Davidsen Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid On Wed, 24 Feb 2010 09:54:17 -0500 Bill Davidsen <davidsen@tmr.com> wrote: > Neil Brown wrote: > > md is not in a position to lock the page - there is simply no way it can stop > > the filesystem from changing it. > > The only thing it could do would be to make a copy, then write the copy out. > > This would incur a performance cost. > > > > > Two thoughts on that - one is that for critical data, give me the option > at array start time: make the copy, slow the performance, and make it > more consistent. My second thought is that a checksum of the page before > initiating the write and after all writes are complete might be less of a > performance hit, and could still detect that the buffer had changed. The idea of calculating a checksum before and after certainly has some merit, if we could choose a checksum algorithm which was sufficiently strong and sufficiently fast, though in many cases a large part of the cost would just be bringing the page contents into cache - twice. It has the advantage over copying the page of not needing to allocate extra memory. If someone wanted to try and prototype this and see how it goes, I'd be happy to advise.... NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
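The shape of the proposed check is easy to see in userspace terms. This is an illustration only: the buffer file and the scratch device are assumptions, and the real version would live in the raid1 write path and hash the page, not a file:

    before=$(md5sum < /tmp/buf.bin)
    dd if=/tmp/buf.bin of=/dev/md_scratch bs=4k count=1 oflag=direct
    after=$(md5sum < /tmp/buf.bin)
    [ "$before" = "$after" ] || echo "buffer changed in flight: schedule a rewrite"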
* Re: Why does one get mismatches? 2010-02-24 21:37 ` Neil Brown @ 2010-02-26 20:48 ` Bill Davidsen 2010-02-26 21:09 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Bill Davidsen @ 2010-02-26 20:48 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > On Wed, 24 Feb 2010 09:54:17 -0500 > Bill Davidsen <davidsen@tmr.com> wrote: > > >> Neil Brown wrote: >> >>> md is not in a position to lock the page - there is simply no way it can stop >>> the filesystem from changing it. >>> The only thing it could do would be to make a copy, then write the copy out. >>> This would incur a performance cost. >>> >>> >>> >> Two thoughts on that - one is that for critical data, give me the option >> at array start time: make the copy, slow the performance, and make it >> more consistent. My second thought is that a checksum of the page before >> initiating the write and after all writes are complete might be less of a >> performance hit, and could still detect that the buffer had changed. >> > > > The idea of calculating a checksum before and after certainly has some merit, > if we could choose a checksum algorithm which was sufficiently strong and > sufficiently fast, though in many cases a large part of the cost would just be > bringing the page contents into cache - twice. > > It has the advantage over copying the page of not needing to allocate extra > memory. > > If someone wanted to try and prototype this and see how it goes, I'd be happy > to advise.... > Disagree if you wish, but MD5 should be fine for this. While it is not cryptographically strong on files, where the size can be changed and evildoers can calculate values to add at the end of the data, it should be adequate on data of unchanging size. It's cheap, fast, and readily available. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 20:48 ` Bill Davidsen @ 2010-02-26 21:09 ` Neil Brown 2010-02-26 22:01 ` Piergiorgio Sartor ` (2 more replies) 0 siblings, 3 replies; 70+ messages in thread From: Neil Brown @ 2010-02-26 21:09 UTC (permalink / raw) To: Bill Davidsen Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid On Fri, 26 Feb 2010 15:48:58 -0500 Bill Davidsen <davidsen@tmr.com> wrote: > > > > The idea of calculating a checksum before and after certainly has some merit, > > if we could choose a checksum algorithm which was sufficiently strong and > > sufficiently fast, though in many cases a large part of the cost would just be > > bringing the page contents into cache - twice. > > > > It has the advantage over copying the page of not needing to allocate extra > > memory. > > > > If someone wanted to try and prototype this and see how it goes, I'd be happy > > to advise.... > > > > Disagree if you wish, but MD5 should be fine for this. While it is not > cryptographically strong on files, where the size can be changed and > evildoers can calculate values to add at the end of the data, it should > be adequate on data of unchanging size. It's cheap, fast, and readily > available. > Actually, I'm no longer convinced that the checksumming idea would work. If a mem-mapped page were written that the app is updating every millisecond (i.e. less than the write latency), then every time a write completed the checksum would be different, so we would have to reschedule the write, which would not be the correct behaviour at all. So I think that the only way to address this in the md layer is to copy the data and write the copy. There is already code to copy the data for write-behind that could possibly be leveraged to always do a copy. Or I could just stop setting mismatch_cnt for raid1 and raid10. That would also fix the problem :-) NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 21:09 ` Neil Brown @ 2010-02-26 22:01 ` Piergiorgio Sartor 2010-02-26 22:15 ` Bill Davidsen 2010-02-26 22:20 ` Asdo 2 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-26 22:01 UTC (permalink / raw) To: Neil Brown Cc: Bill Davidsen, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > So I think that the only way to address this in the md layer is to copy > the data and write the copy. There is already code to copy the data for > write-behind that could possibly be leveraged to always do a copy. actually, I wanted to ask how the write-behind works, because I suspected it copies the data. BTW, is it possible to set both drives (of a pair) as write-mostly, with some write-behind? > Or I could just stop setting mismatch_cnt for raid1 and raid10. That would > also fix the problem :-) Well, the "complaining" problem will be fixed... :-) bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
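For reference, write-behind is configured per write-mostly device and requires a write-intent bitmap; the usual setup marks only the slower leg, as in this sketch (device names are assumptions):

    mdadm --create /dev/md2 --level=1 --raid-devices=2 \
          --bitmap=internal --write-behind=256 \
          /dev/sda3 --write-mostly /dev/sdb3

Whether marking both legs write-mostly is useful is another matter: reads have to come from somewhere, so md will still read from a write-mostly device when no other leg is available.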
* Re: Why does one get mismatches? 2010-02-26 21:09 ` Neil Brown 2010-02-26 22:01 ` Piergiorgio Sartor @ 2010-02-26 22:15 ` Bill Davidsen 2010-02-26 22:21 ` Piergiorgio Sartor 2010-02-26 22:20 ` Asdo 2 siblings, 1 reply; 70+ messages in thread From: Bill Davidsen @ 2010-02-26 22:15 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > On Fri, 26 Feb 2010 15:48:58 -0500 > Bill Davidsen <davidsen@tmr.com> wrote: > > >>> The idea of calculating a checksum before and after certainly has some merit, >>> if we could choose a checksum algorithm which was sufficiently strong and >>> sufficiently fast, though in many cases a large part of the cost would just be >>> bringing the page contents into cache - twice. >>> >>> It has the advantage over copying the page of not needing to allocate extra >>> memory. >>> >>> If someone wanted to try and prototype this and see how it goes, I'd be happy >>> to advise.... >>> >>> >> Disagree if you wish, but MD5 should be fine for this. While it is not >> cryptographically strong on files, where the size can be changed and >> evildoers can calculate values to add at the end of the data, it should >> be adequate on data of unchanging size. It's cheap, fast, and readily >> available. >> >> > > Actually, I'm no longer convinced that the checksumming idea would work. > If a mem-mapped page were written that the app is updating every > millisecond (i.e. less than the write latency), then every time a write > completed the checksum would be different, so we would have to reschedule the > write, which would not be the correct behaviour at all. > So I think that the only way to address this in the md layer is to copy > the data and write the copy. There is already code to copy the data for > write-behind that could possibly be leveraged to always do a copy. > > Your point about the possibility is valid, but consider this: if the checksum fails, then at that point do the copy and write again. > Or I could just stop setting mismatch_cnt for raid1 and raid10. That would > also fix the problem :-) > > s/fix/hide/ ;-) My feeling is that we have many ways to change the data: O_DIRECT, aio, threads, mmap, and probably some I haven't found yet. Rather than thinking that you could prevent that without a flaming layer violation, perhaps use my thought above: detect that the data has changed, and at that point do a copy and write unchanging data to all drives. How that plays with O_DIRECT I can't say, but it sounds to me as if it should eliminate the mismatches without a huge performance impact. Let me know if this addresses your concern about rewriting forever, without taking much overhead. The question is why this happens with raid-1 and doesn't seem to with raid-[56]. And I don't see mismatches on my raid-10, although I'm pretty sure that neither mmap nor O_DIRECT is used on those arrays. What would seem to be optimal is some COW on the buffer to prevent the buffer from being modified while it's being used for actual i/o. It doesn't seem that hardware supports it: page size, buffer size and sector size all vary. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 22:15 ` Bill Davidsen @ 2010-02-26 22:21 ` Piergiorgio Sartor 0 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-26 22:21 UTC (permalink / raw) To: Bill Davidsen Cc: Neil Brown, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > The question is why this happens with raid-1 and doesn't seem to > with raid-[56]. And I don't see mismatches on my raid-10, although > I'm pretty sure that neither mmap nor O_DIRECT is used on those > arrays. I believe Neil mentioned that RAID-5/6 always makes a copy, while RAID-1/10 simply use the same page without copying. I get mismatches on RAID-10, but not on the one that has LVM on it, only on the one(s) where the filesystem (ext3) is directly on the RAID volume. bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-26 21:09 ` Neil Brown 2010-02-26 22:01 ` Piergiorgio Sartor 2010-02-26 22:15 ` Bill Davidsen @ 2010-02-26 22:20 ` Asdo 2010-02-27 6:01 ` Michael Evans 2 siblings, 1 reply; 70+ messages in thread From: Asdo @ 2010-02-26 22:20 UTC (permalink / raw) To: Neil Brown Cc: Bill Davidsen, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > Actually, I'm no longer convinced that the checksumming idea would work. > If a mem-mapped page were written that the app is updating every > millisecond (i.e. less than the write latency), then every time a write > completed the checksum would be different, so we would have to reschedule the > write, which would not be the correct behaviour at all. > So I think that the only way to address this in the md layer is to copy > the data and write the copy. There is already code to copy the data for > write-behind that could possibly be leveraged to always do a copy. > The concerns about slowdowns with the copy could be addressed by making the copy a runtime choice triggered by a sysfs interface: a file in /sys/block/mdX/md/ where one can echo "1" to enable copies for this type of raid. Or better, 1 could be the default (slower but safer; or if not safer, at least it would avoid needless questions about mismatches on this ML from new users, and it would allow detection of REAL mismatches, which can be due to cabling or defective disks), and echoing 0 would increase performance at the cost of seeing lots of false positive mismatches. ^ permalink raw reply [flat|nested] 70+ messages in thread
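To be concrete about what is being proposed, the knob might look like the following; the attribute name is invented purely for illustration and nothing like it exists in current kernels:

    # hypothetical sysfs attribute -- "page_copy" is a made-up name
    echo 1 > /sys/block/md0/md/page_copy   # safe: copy pages before writing
    echo 0 > /sys/block/md0/md/page_copy   # fast: write shared pages directly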
* Re: Why does one get mismatches? 2010-02-26 22:20 ` Asdo @ 2010-02-27 6:01 ` Michael Evans 2010-02-28 0:01 ` Bill Davidsen 0 siblings, 1 reply; 70+ messages in thread From: Michael Evans @ 2010-02-27 6:01 UTC (permalink / raw) To: Asdo Cc: Neil Brown, Bill Davidsen, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid On Fri, Feb 26, 2010 at 2:20 PM, Asdo <asdo@shiftmail.org> wrote: > Neil Brown wrote: >> >> Actually, I'm no longer convinced that the checksumming idea would work. >> If a mem-mapped page were written that the app is updating every >> millisecond (i.e. less than the write latency), then every time a write >> completed the checksum would be different, so we would have to reschedule >> the >> write, which would not be the correct behaviour at all. >> So I think that the only way to address this in the md layer is to copy >> the data and write the copy. There is already code to copy the data for >> write-behind that could possibly be leveraged to always do a copy. >> > > The concerns about slowdowns with the copy could be addressed by making the copy a > runtime choice triggered by a sysfs interface: a file in /sys/block/mdX/md/ > where one can echo "1" to enable copies for this type of raid. Or > better, 1 could be the default (slower but safer; or if not safer, at least > it would avoid needless questions about mismatches on this ML from new users, and > it would allow detection of REAL mismatches, which can be due to cabling or defective > disks), and echoing 0 would increase performance at the cost of seeing lots > of false positive mismatches. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Isn't there some way of making the page copy-on-write using hardware and/or an in-kernel structure? Ideally copying could be avoided /unless/ there is change. That way each operation looks like an atomic commit. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-27 6:01 ` Michael Evans @ 2010-02-28 0:01 ` Bill Davidsen 0 siblings, 0 replies; 70+ messages in thread From: Bill Davidsen @ 2010-02-28 0:01 UTC (permalink / raw) To: Michael Evans Cc: Asdo, Neil Brown, Piergiorgio Sartor, Steven Haigh, Bryan Mesich, Jon, linux-raid Michael Evans wrote: > On Fri, Feb 26, 2010 at 2:20 PM, Asdo <asdo@shiftmail.org> wrote: > >> Neil Brown wrote: >> >>> Actually, I'm no longer convinced that the checksumming idea would work. >>> If a mem-mapped page were written that the app is updating every >>> millisecond (i.e. less than the write latency), then every time a write >>> completed the checksum would be different, so we would have to reschedule >>> the >>> write, which would not be the correct behaviour at all. >>> So I think that the only way to address this in the md layer is to copy >>> the data and write the copy. There is already code to copy the data for >>> write-behind that could possibly be leveraged to always do a copy. >>> >>> >> The concerns about slowdowns with the copy could be addressed by making the copy a >> runtime choice triggered by a sysfs interface: a file in /sys/block/mdX/md/ >> where one can echo "1" to enable copies for this type of raid. Or >> better, 1 could be the default (slower but safer; or if not safer, at least >> it would avoid needless questions about mismatches on this ML from new users, and >> it would allow detection of REAL mismatches, which can be due to cabling or defective >> disks), and echoing 0 would increase performance at the cost of seeing lots >> of false positive mismatches. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > Isn't there some way of making the page copy-on-write using hardware > and/or an in-kernel structure? Ideally copying could be avoided > /unless/ there is change. That way each operation looks like an > atomic commit. > As I think about this, one idea was to add a write-in-progress flag, so that the filesystem, or library, or whatever would know not to change the page. That would mean that every filesystem would need to be enhanced, or that the "safe write" would be optional on a per-filesystem level. Implementation of O_DIRECT could do it, or not, and there could be a safe way to write. However, it occurs to me that there are several other levels involved, and so it could be better but not perfect. While md could flag the start and finish of a write, you then need to have the next level, the device driver, do the same thing, so md knows when the data need no longer be frozen. "But wait, there's more," as they say: the device driver needs to track when the data are transferred to the actual device, and the device needs to report when the data actually hit the platter, or you could still have possible mismatches. All of that reminds us of the discussion of barriers, and flush cache commands, and other performance impacting practices. So in the long run I think the most effective solution, one which has the highest improvement at the lowest cost in performance, is a copy. Now if Neil liked my idea of doing a checksum before and after a write, and a copy only in the cases where the data had changed, the impact could be pretty small. All that depends on two things: Neil thinking the whole thing is worth doing, and no one finding a flaw in my proposal to do a checksum rather than a copy each time.
And to return to your original question, no. Hardware COW works on memory pages; a buffer could span pages, and a write to a page might not be to the part of the page used for the i/o buffer. So as nice as that would be, I don't think the hardware supports it. And even if you could, the COW needs to be done in the layer which tries to change the buffer, so md would set COW and the filesystem would have to deal with it. I am pretty sure that's a layering violation, big time. The advisory "write in progress" flag might be acceptable; it's information the f/s can use or not. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-17 23:05 ` Neil Brown 2010-02-19 15:18 ` Piergiorgio Sartor @ 2010-02-24 14:46 ` Bill Davidsen 2010-02-24 16:12 ` Martin K. Petersen 2010-02-24 21:32 ` Neil Brown 1 sibling, 2 replies; 70+ messages in thread From: Bill Davidsen @ 2010-02-24 14:46 UTC (permalink / raw) To: Neil Brown; +Cc: Steven Haigh, Bryan Mesich, Jon, linux-raid Neil Brown wrote: > On Wed, 17 Feb 2010 08:38:11 +1100 > Steven Haigh <netwiz@crc.id.au> wrote: > > >> On Tue, 16 Feb 2010 16:25:25 -0500, Bill Davidsen <davidsen@tmr.com> >> wrote: >> >>> Bryan Mesich wrote: >>> >>>> On Thu, Feb 11, 2010 at 04:14:44PM +1100, Neil Brown wrote: >>>> >>>> >>>>>> This whole discussion simply shows that for RAID-1 software RAID is >>>>>> less >>>>>> reliable than hardware RAID (no, I don't mean fake-RAID), because it >>>>>> doesn't pin the data buffer until all copies are written. >>>>>> >>>>>> >>>>> That doesn't make it less reliable. It just makes it more confusing. >>>>> >>>>> >>>> I agree that linux software RAID is no less reliable than >>>> hardware RAID with regards to the above conversation. It's >>>> however confusing to have a counter that indicates there are >>>> problems with a RAID 1 array when in fact there is not. >>>> >>>> >>> Sorry, but real hardware raid is more reliable than software raid, and >>> Neil's justification for not doing smart recovery mentions it. Note this >>> >>> refers to real hardware raid, not fakeraid which is just some firmware >>> in a BIOS to use the existing hardware. >>> >>> The issue lies with data changing between writes to multiple drives. In >>> hardware raid the data traverses the memory bus once, only once, and >>> goes into cache in the controller, from which it is written to all >>> mirrored drives. With software raid an individual write is done to each >>> drive, and if the data in the buffer changes between writes to one drive >>> >>> or the other you get different values. Neil may be convinced that the OS >>> >>> somehow "knows" which of the mirror copies is correct, i.e. most recent, >>> and never uses the stale data, but if that information was really >>> available reads would always return the latest value and it wouldn't be >>> possible to read the same file multiple times and get different MD5sums. >>> >>> It would also be possible to do a stable smart recovery by propagating >>> the most recent copy to the other mirror drives. >>> >>> I hoped that mounting data=journal would lead to consistency; that seems >>> >>> not to be true either. >>> >> I agree, Bill, there is an issue with the software RAID1 when it comes down >> to some hardware. I have one machine where the ONLY way to stop the root >> filesystem going readonly due to journal issues is to remove RAID. Having >> RAID1 enabled gives silent corruption of both data and the journal at >> seemingly random times. >> >> I can see the data corruption from running a verify between RPM and data >> on the drive. Reinstalling these packages fixes things - until some >> random things get corrupted next time. >> > > Sounds very much like dodgy drives. > > >> The myth that data corruption in RAID1 ONLY happens to swap and/or unused >> space on a drive is absolute rubbish. >> >> > > Absolute rubbish does seem to be a suitable phrase here. > There is no question of data corruption. > When memory changes between being written to one device and to another, this > does not cause corruption, only inconsistency.
Either the block will be > written again consistently soon, or it will never be read. > Just what is it that rewrites the data block? The user program doesn't know it's needed, the filesystem, if any, doesn't know it's needed, and as far as I can tell md doesn't do a checksum before issuing the write and after the last write is done, and it doesn't make a copy and write from that. So what sees that the data has changed and rewrites it? > If the host crashes before the blocks are made consistent, then the > inconsistency will not be visible as the resync will fix it. > > If you are getting any corruption, then it is NOT due to this facet of the > RAID1 implementation - it is due to something else. > My guess is bad hardware - anywhere from memory to hard drive. > Having switched an array from three-way raid-1 to raid-6, using the same kernel, utilities, and hardware, I can speak to that. When I first started to run checks, I took the array offline to do repair, and usually saw ~12k mismatches by the end of a week. After changing the array to raid-6 I never had a mismatch again. Therefore, while hardware clearly can be a factor, it is unlikely to be the cause of all mismatch events. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 14:46 ` Bill Davidsen @ 2010-02-24 16:12 ` Martin K. Petersen 2010-02-24 18:51 ` Piergiorgio Sartor 2010-02-24 21:39 ` Neil Brown 2010-02-24 21:32 ` Neil Brown 1 sibling, 2 replies; 70+ messages in thread From: Martin K. Petersen @ 2010-02-24 16:12 UTC (permalink / raw) To: Bill Davidsen; +Cc: Neil Brown, Steven Haigh, Bryan Mesich, Jon, linux-raid >>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes: >> Absolute rubbish does seem to be a suitable phrase here. There is no >> question of data corruption. When memory changes between being >> written to one device and to another, this does not cause corruption, >> only inconsistency. Either the block will be written again >> consistently soon, or it will never be read. Bill> Just what is it that rewrites the data block? The user program Bill> doesn't know it's needed, the filesystem, if any, doesn't know Bill> it's needed, and as far as I can tell md doesn't do checksum Bill> before issuing the write and after the last write is done. Doesn't Bill> make a copy and write from that. So what sees that the data has Bill> changed and rewrites it? The filesystem updates the page, causing it to be marked dirty again. The VM will then eventually schedule the page to be written out. The "when" depends on filesystem type and whether there's metadata or data in the page. In this discussion there seems to be a focus on the case where one mirror is correct and one is not. However, that's usually not how it works out. A more realistic scenario is that both mirror copies are incorrect because the page was continuously updated. I.e. both mirrors have various degrees of new and stale data inside a 4KB block. So realistically both disk blocks are wrong and there's a window until the new, correct block is written. That window will only cause problems if there is a crash and we'll need to recover. My main concern here is how big the discrepancy between the disks can get, and whether we'll end up corrupting the filesystem during recovery because we could potentially be matching metadata from one disk with journal entries from another. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 16:12 ` Martin K. Petersen @ 2010-02-24 18:51 ` Piergiorgio Sartor 2010-02-24 22:21 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-24 18:51 UTC (permalink / raw) To: Martin K. Petersen Cc: Bill Davidsen, Neil Brown, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > So realistically both disk blocks are wrong and there's a window until > the new, correct block is written. That window will only cause problems > if there is a crash and we'll need to recover. My main concern here is > how big the discrepancy between the disks can get, and whether we'll end > up corrupting the filesystem during recovery because we could > potentially be matching metadata from one disk with journal entries from > another. well, I know people will not believe me, but just this evening one of the infamous PCs with the mismatch count going up and down could not boot anymore. Reason: you must specify the filesystem type So, I started it with a live CD. My first idea was a problem with the RAID (type is 10 f2). This was assembled fine, so I tried to mount it, but mount returned the same error as above. So I tried to mount it specifying "-text3" and it was mounted. Everything seemed to be fine; I backed up the data anyhow. Some interesting discoveries: tune2fs -l /dev/md/2_0 returns the FS data, no errors. blkid /dev/md/2_0 does not return anything. Running a fsck did not find anything wrong, but it did not repair anything either. Now, I do not know if this was caused by the situation mentioned above, but for sure it is quite fishy... BTW, unrelated to the topic, any idea on how to fix this? Is there any tool that can restore the proper ID, or anything else? Thanks, bye. -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-24 18:51 ` Piergiorgio Sartor @ 2010-02-24 22:21 ` Neil Brown 2010-02-25 8:41 ` Piergiorgio Sartor 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-02-24 22:21 UTC (permalink / raw) To: Piergiorgio Sartor Cc: Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid On Wed, 24 Feb 2010 19:51:06 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > Hi, > > > So realistically both disk blocks are wrong and there's a window until > > the new, correct block is written. That window will only cause problems > > if there is a crash and we'll need to recover. My main concern here is > > how big the discrepancy between the disks can get, and whether we'll end > > up corrupting the filesystem during recovery because we could > > potentially be matching metadata from one disk with journal entries from > > another. > > well, I know people will not believe me, but > just this evening one of the infamous PCs with the mismatch > count going up and down could not boot anymore. I certainly believe you. > > Reason: you must specify the filesystem type This suggests that the superblock which lives at an offset of 1K into the filesystem was sufficiently corrupted that mount couldn't recognise it. > > So, I started it with a live CD. > > My first idea was a problem with the RAID (type is 10 f2). > > This was assembled fine, so I tried to mount it, but mount > returned the same error as above. > So I tried to mount it specifying "-text3" and it was mounted. That is really odd! Both the kernel ext3 module (triggered by '-text3') and the 'mount' program use exactly the same test - look for the magic number in the superblock at 1K into the device. It is very hard to see how 'mount' would fail to find something that the ext3 module finds. > Everything seemed to be fine; I backed up the data anyhow. > > Some interesting discoveries: > > tune2fs -l /dev/md/2_0 returns the FS data, no errors. > blkid /dev/md/2_0 does not return anything. This sounds very much like tune2fs and blkid are reading two different things, which is strange. Would you be able to get the first 4K from each device in the raid10: dd if=/dev/whatever of=/tmp/whatever bs=1K count=4 and then tar/gz those up and send them to me. That might give some clue. Unless the raid metadata is 1.1 or 1.2 - then I would need blocks further in the device, at the 'data offset'. The --detail output of the array might help too. > > Running a fsck did not find anything wrong, but it did > not repair anything either. Did you use "fsck -f" ?? > > Now, I do not know if this was caused by the situation > mentioned above, but for sure it is quite fishy... > > BTW, unrelated to the topic, any idea on how to fix this? > Is there any tool that can restore the proper ID, or anything else? > Until we know what is wrong, it is hard to suggest a fix. NeilBrown > Thanks, > > bye. > ^ permalink raw reply [flat|nested] 70+ messages in thread
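The requested collection, as one sequence; the component device names are assumptions, so adjust them to what /proc/mdstat reports:

    for d in /dev/sda2 /dev/sdb2; do
        dd if=$d of=/tmp/$(basename $d).raw bs=1K count=4
    done
    mdadm --detail /dev/md1 > /tmp/md1-detail.txt
    tar czf /tmp/md1-dumps.tar.gz /tmp/*.raw /tmp/md1-detail.txt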
* Re: Why does one get mismatches? 2010-02-24 22:21 ` Neil Brown @ 2010-02-25 8:41 ` Piergiorgio Sartor 2010-03-02 4:57 ` Neil Brown 0 siblings, 1 reply; 70+ messages in thread From: Piergiorgio Sartor @ 2010-02-25 8:41 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > I certainly believe you. thank you! > That is really odd! Both the kernel ext3 module (triggered by '-text3') > and the 'mount' program use exactly the same test - look for the magic > number in the superblock at 1K into the device. Today I tried: blkid -p /dev/md1 (this time the live CD autoassembled the md device) and it returned something like: ambivalent result (probably more than one filesystem...) The strange thing is that the HDDs were brand new; no older partitions or filesystems were there. Anyway, I have one small correction: the RAID is not 10 f2 on this PC, but (due to a different installation) a RAID-1 with superblock 0.9, and the device partitions are set to 0xFD (RAID autoassemble). > Would you be able to get the first 4K from each device in the raid10: > dd if=/dev/whatever of=/tmp/whatever bs=1K count=4 > > and then tar/gz those up and send them to me. That might give some clue. > Unless the raid metadata is 1.1 or 1.2 - then I would need blocks further in > the device, at the 'data offset'. > The --detail output of the array might help too. I dumped the first 4K of each device; they're identical (so no mismatch there, at least). I'll send them to you, together with the detail output. > > Running a fsck did not find anything wrong, but it did > > not repair anything either. > > Did you use "fsck -f" ?? Yep. > Until we know what is wrong, it is hard to suggest a fix. Thanks a lot (also because this could turn out to be unrelated to this mailing list). bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Why does one get mismatches? 2010-02-25 8:41 ` Piergiorgio Sartor @ 2010-03-02 4:57 ` Neil Brown 2010-03-02 18:49 ` Piergiorgio Sartor 0 siblings, 1 reply; 70+ messages in thread From: Neil Brown @ 2010-03-02 4:57 UTC (permalink / raw) To: Piergiorgio Sartor Cc: Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid On Thu, 25 Feb 2010 09:41:41 +0100 Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> wrote: > Hi, > > > I certainly believe you. > > thank you! > > > That is really odd! Both the kernel ext3 module (triggered by '-text3') > > and the 'mount' program use exactly the same test - look for the magic > > number in the superblock at 1K into the device. > > Today I tried: blkid -p /dev/md1 (this time the live CD > autoassembled the md device) and it returned something > like: ambivalent result (probably more than one filesystem...) > > The strange thing is that the HDDs were brand new; no older > partitions or filesystems were there. > > Anyway, I have one small correction: the RAID is not 10 f2 > on this PC, but (due to a different installation) a RAID-1 > with superblock 0.9, and the device partitions are set > to 0xFD (RAID autoassemble). > > > Would you be able to get the first 4K from each device in the raid10: > > dd if=/dev/whatever of=/tmp/whatever bs=1K count=4 > > > > and then tar/gz those up and send them to me. That might give some clue. > > Unless the raid metadata is 1.1 or 1.2 - then I would need blocks further in > > the device, at the 'data offset'. > > The --detail output of the array might help too. > > I dumped the first 4K of each device; they're identical > (so no mismatch there, at least). I'll send them to you, > together with the detail output. Thanks. I finally had a look at these (sorry for the delay). If you run "file" on one of the dumps, it tells you: $ file disk1.raw disk1.raw: Minix filesystem Which isn't expected. I would expect something like $ file xxx xxx: Linux rev 1.0 ext3 filesystem data, UUID=fe55fe6f-0412-4a0a-852d-a0e21767aa35 (needs journal recovery) (large files) for an ext3 filesystem. Looking at /usr/share/misc/magic, it seems that a Minix filesystem is defined by: 0x410 leshort 0x137f Minix filesystem i.e. the 2 bytes at 0x410 into the device are 0x137f, which is exactly what we find in your dump. 0x410 in an ext3 superblock is the lower bytes of "s_free_inodes_count", the count of free inodes. Your actual number is 14881663, which is 0x00E3137F. So if you just add or remove a file, the number of free inodes should change, and your filesystem will no longer look like a Minix filesystem and your problems should go away. I guess libblkid et al. should do more sanity checks on the superblock before deciding that it really belongs to some particular filesystem. But I'm happy - this clearly isn't a raid problem. NeilBrown ^ permalink raw reply [flat|nested] 70+ messages in thread
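The collision is easy to verify from the shell; the device name is an assumption, and the same commands work on one of the dump files:

    # the two bytes at offset 0x410, where the Minix magic is expected
    dd if=/dev/md1 bs=1 skip=$((0x410)) count=2 2>/dev/null | od -An -tx1
    #  7f 13    <- little-endian 0x137f, the Minix magic
    printf '%d\n' 0x00E3137F    # 14881663, the ext3 free-inode count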
* Re: Why does one get mismatches? 2010-03-02 4:57 ` Neil Brown @ 2010-03-02 18:49 ` Piergiorgio Sartor 0 siblings, 0 replies; 70+ messages in thread From: Piergiorgio Sartor @ 2010-03-02 18:49 UTC (permalink / raw) To: Neil Brown Cc: Piergiorgio Sartor, Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid Hi, > Thanks. I finally had a look at these (sorry for the delay). well, thank you for having a look at the thing. > If you run "file" on one of the dumps, it tells you: > > $ file disk1.raw > disk1.raw: Minix filesystem > > Which isn't expected. I would expect something like > $ file xxx > xxx: Linux rev 1.0 ext3 filesystem data, UUID=fe55fe6f-0412-4a0a-852d-a0e21767aa35 (needs journal recovery) (large files) > > for an ext3 filesystem. > > Looking at /usr/share/misc/magic, it seems that a Minix filesystem is defined > by: > 0x410 leshort 0x137f Minix filesystem > > i.e. the 2 bytes at 0x410 into the device are 0x137f, which is exactly what we > find in your dump. > > 0x410 in an ext3 superblock is the lower bytes of "s_free_inodes_count", the > count of free inodes. > Your actual number is 14881663, which is 0x00E3137F. Ah! But this means there is a bug somewhere... > So if you just add or remove a file, the number of free inodes should change, > and your filesystem will no longer look like a Minix filesystem and > your problems should go away. Uhm, OK, I just re-created the MD and the FS, so I also took the opportunity to increase the chunk size to 512K and use RAID-10. > I guess libblkid et al. should do more sanity checks on the superblock before > deciding that it really belongs to some particular filesystem. So, should one of us file a bug report somewhere? I mean, it is not only (lib)blkid, but also "file" which seems to be confused. BTW, "file" does not seem to use libblkid. > But I'm happy - this clearly isn't a raid problem. That's certainly good news. Thanks again for the explanation; I learned something today! bye, -- piergiorgio ^ permalink raw reply [flat|nested] 70+ messages in thread
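The rebuild described above would be roughly the following; the device names and the f2 layout are assumptions:

    mdadm --create /dev/md1 --level=10 --layout=f2 --chunk=512 \
          --raid-devices=2 /dev/sda2 /dev/sdb2
    mkfs.ext3 /dev/md1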
* Re: Why does one get mismatches?
2010-02-24 16:12 ` Martin K. Petersen
2010-02-24 18:51 ` Piergiorgio Sartor
@ 2010-02-24 21:39 ` Neil Brown
[not found] ` <4B8640A2.4060307@shiftmail.org>
2010-02-28  8:09 ` Luca Berra
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-24 21:39 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

On Wed, 24 Feb 2010 11:12:09 -0500
"Martin K. Petersen" <martin.petersen@oracle.com> wrote:

> So realistically both disk blocks are wrong and there's a window until
> the new, correct block is written. That window will only cause problems
> if there is a crash and we'll need to recover. My main concern here is
> how big the discrepancy between the disks can get, and whether we'll end
> up corrupting the filesystem during recovery because we could
> potentially be matching metadata from one disk with journal entries from
> another.

After a crash, md will only read from one of the devices (the first) until a
resync has completed. So there should be no room for more confusion than you
would expect on a single device.

NeilBrown
[parent not found: <4B8640A2.4060307@shiftmail.org>]
* Re: Why does one get mismatches?
[not found] ` <4B8640A2.4060307@shiftmail.org>
@ 2010-02-25 10:41 ` Neil Brown
0 siblings, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-25 10:41 UTC (permalink / raw)
To: Asdo
Cc: Martin K. Petersen, Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

On Thu, 25 Feb 2010 10:19:30 +0100 Asdo <asdo@shiftmail.org> wrote:

> Neil Brown wrote:
> > On Wed, 24 Feb 2010 11:12:09 -0500
> > "Martin K. Petersen" <martin.petersen@oracle.com> wrote:
> >
> >> So realistically both disk blocks are wrong and there's a window until
> >> the new, correct block is written. That window will only cause problems
> >> if there is a crash and we'll need to recover. My main concern here is
> >> how big the discrepancy between the disks can get, and whether we'll
> >> end up corrupting the filesystem during recovery because we could
> >> potentially be matching metadata from one disk with journal entries
> >> from another.
> >
> > After a crash, md will only read from one of the devices (the first)
> > until a resync has completed. So there should be no room for more
> > confusion than you would expect on a single device.
>
> Not enough, I'd say.
> The reads are from a single device, the first, but with the writes you
> don't know whether they go first to the first device or in the reverse
> order. So I'd still be concerned by what Martin says.

I'm getting bored of repeating myself, so I won't respond to this.

> In addition, on this ML there are people reporting that the mismatches
> occur even when the system is always on, no crashes. So I think there is
> another mechanism for mismatches (not sure if in addition, or if it's
> the only mechanism).

Ditto.

> Besides, if the mechanism for mismatches is correct I'd go for the copy
> (or page lock if possible). All raids have the copy, except raid0 maybe,
> and they are not slow. Here the copy would only occur on writes, and
> raid-1 is not targeted to be SO fast on writes... Also raid-1s are
> usually on few disks, like no more than 3, so the copy is not likely to
> bottleneck the speed of the writes.

I'm sure it would be a measurable slowdown, though < 20%. Probably < 10%.
I doubt everyone would be happy with that, though you might.

> What about raid-10? Are there copies for the raid-1 part of raid-10?

No. Neither raid1 nor raid10 copy the data; only raid456 does.

NeilBrown
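To make concrete what "the copy" being discussed would mean, here is a
minimal userspace sketch (my own illustration, not md code; device paths
and block size are placeholders): each mirrored write goes through a
private bounce buffer, so later changes to the caller's buffer cannot make
the two copies diverge.

  /* bounce_write.c - illustrate stabilizing data with a copy before
   * writing it to two mirrors.  Sketch only; not how md is implemented. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <fcntl.h>

  #define BLKSZ 4096

  /* Copy first, then write the *copy* to both legs: even if the caller
   * keeps modifying 'data', both mirrors receive identical bytes. */
  static int mirror_write(int fd1, int fd2, const void *data, off_t off)
  {
          static char bounce[BLKSZ];

          memcpy(bounce, data, BLKSZ);  /* the copy raid456 does, raid1 doesn't */
          if (pwrite(fd1, bounce, BLKSZ, off) != BLKSZ ||
              pwrite(fd2, bounce, BLKSZ, off) != BLKSZ)
                  return -1;
          return 0;
  }

  int main(void)
  {
          /* placeholder paths standing in for two mirror legs */
          int fd1 = open("/tmp/leg0.img", O_WRONLY | O_CREAT, 0600);
          int fd2 = open("/tmp/leg1.img", O_WRONLY | O_CREAT, 0600);
          char data[BLKSZ] = "hello";

          if (fd1 < 0 || fd2 < 0 || mirror_write(fd1, fd2, data, 0) < 0) {
                  perror("mirror_write");
                  return 1;
          }
          return 0;
  }

The cost Neil estimates above is exactly the memcpy() per write that this
sketch adds.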
* Re: Why does one get mismatches?
2010-02-24 21:39 ` Neil Brown
[not found] ` <4B8640A2.4060307@shiftmail.org>
@ 2010-02-28  8:09 ` Luca Berra
2010-03-02  5:01 ` Neil Brown
1 sibling, 1 reply; 70+ messages in thread
From: Luca Berra @ 2010-02-28 8:09 UTC (permalink / raw)
To: linux-raid

On Thu, Feb 25, 2010 at 08:39:36AM +1100, Neil Brown wrote:
>On Wed, 24 Feb 2010 11:12:09 -0500
>"Martin K. Petersen" <martin.petersen@oracle.com> wrote:
>
>> So realistically both disk blocks are wrong and there's a window until
>> the new, correct block is written. That window will only cause problems
>> if there is a crash and we'll need to recover. My main concern here is
>> how big the discrepancy between the disks can get, and whether we'll end
>> up corrupting the filesystem during recovery because we could
>> potentially be matching metadata from one disk with journal entries from
>> another.
>
>After a crash, md will only read from one of the devices (the first) until a
>resync has completed. So there should be no room for more confusion than you
>would expect on a single device.

After thinking more about this I could come up with another concern
about write ordering.

Example:

  app writes blocks A, B, C
  md writes A on both disks
  md writes B on disk1
  app writes B again (B')
  md writes B' on disk2
  now md would write B' again on both disks, but the system crashes
  (note: C is never written, due to the crash)

Disk 1 contains A and B in the correct order; it is missing C and B', but
we don't care - the app should be able to recover from a crash.

Disk 2 contains A and B', but they are wrongly ordered, because C is
missing.

If in the above case A and C are data blocks and B contains a journal
related to A and C, booting from disk 2 could result in inconsistent
data.

Can the above really happen?
Would using barriers remove the above concern?
Am I missing something else?

L.

-- 

Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: Why does one get mismatches?
2010-02-28  8:09 ` Luca Berra
@ 2010-03-02  5:01 ` Neil Brown
2010-03-02  7:36 ` Luca Berra
0 siblings, 1 reply; 70+ messages in thread
From: Neil Brown @ 2010-03-02 5:01 UTC (permalink / raw)
To: Luca Berra; +Cc: linux-raid

On Sun, 28 Feb 2010 09:09:49 +0100
Luca Berra <bluca@comedia.it> wrote:

> On Thu, Feb 25, 2010 at 08:39:36AM +1100, Neil Brown wrote:
> >After a crash, md will only read from one of the devices (the first)
> >until a resync has completed. So there should be no room for more
> >confusion than you would expect on a single device.
>
> After thinking more about this I could come up with another concern
> about write ordering.
>
> Example:
>
>   app writes blocks A, B, C
>   md writes A on both disks
>   md writes B on disk1
>   app writes B again (B')
>   md writes B' on disk2
>   now md would write B' again on both disks, but the system crashes
>   (note: C is never written, due to the crash)
>
> Disk 1 contains A and B in the correct order; it is missing C and B', but
> we don't care - the app should be able to recover from a crash.
>
> Disk 2 contains A and B', but they are wrongly ordered, because C is
> missing.
>
> If in the above case A and C are data blocks and B contains a journal
> related to A and C, booting from disk 2 could result in inconsistent
> data.
>
> Can the above really happen?
> Would using barriers remove the above concern?
> Am I missing something else?

There is no inconsistency here that a filesystem would not equally expect
from a single device.
After the crash-while-writing B', it should expect to see either B or B',
and it does, depending on which device is primary.

Nothing to see here.

NeilBrown
* Re: Why does one get mismatches?
2010-03-02  5:01 ` Neil Brown
@ 2010-03-02  7:36 ` Luca Berra
2010-03-02 10:04 ` Michael Evans
0 siblings, 1 reply; 70+ messages in thread
From: Luca Berra @ 2010-03-02 7:36 UTC (permalink / raw)
To: linux-raid

On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>On Sun, 28 Feb 2010 09:09:49 +0100
>Luca Berra <bluca@comedia.it> wrote:
>
>> After thinking more about this I could come up with another concern
>> about write ordering.
>>
>> Example:
>>
>>   app writes blocks A, B, C
>>   md writes A on both disks
>>   md writes B on disk1
>>   app writes B again (B')
>>   md writes B' on disk2
>>   now md would write B' again on both disks, but the system crashes
>>   (note: C is never written, due to the crash)
>>
>> Disk 1 contains A and B in the correct order; it is missing C and B',
>> but we don't care - the app should be able to recover from a crash.
>>
>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>> missing.
>>
>> If in the above case A and C are data blocks and B contains a journal
>> related to A and C, booting from disk 2 could result in inconsistent
>> data.
>>
>> Can the above really happen?
>> Would using barriers remove the above concern?
>> Am I missing something else?
>
>There is no inconsistency here that a filesystem would not equally expect
>from a single device.
>After the crash-while-writing B', it should expect to see either B or B',
>and it does, depending on which device is primary.
>
>Nothing to see here.

I will try to explain better:
the problem is not related to the confusion between B or B'.

The problem is that on one disk we have B' _without_ C.

Regards,
L.

-- 

Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: Why does one get mismatches?
2010-03-02  7:36 ` Luca Berra
@ 2010-03-02 10:04 ` Michael Evans
2010-03-02 11:02 ` Luca Berra
0 siblings, 1 reply; 70+ messages in thread
From: Michael Evans @ 2010-03-02 10:04 UTC (permalink / raw)
To: linux-raid

On Mon, Mar 1, 2010 at 11:36 PM, Luca Berra <bluca@comedia.it> wrote:
> On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>>
>> On Sun, 28 Feb 2010 09:09:49 +0100
>> Luca Berra <bluca@comedia.it> wrote:
>>
>>> After thinking more about this I could come up with another concern
>>> about write ordering.
>>>
>>> Example:
>>>
>>>   app writes blocks A, B, C
>>>   md writes A on both disks
>>>   md writes B on disk1
>>>   app writes B again (B')
>>>   md writes B' on disk2
>>>   now md would write B' again on both disks, but the system crashes
>>>   (note: C is never written, due to the crash)
>>>
>>> Disk 1 contains A and B in the correct order; it is missing C and B',
>>> but we don't care - the app should be able to recover from a crash.
>>>
>>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>>> missing.
>>>
>>> If in the above case A and C are data blocks and B contains a journal
>>> related to A and C, booting from disk 2 could result in inconsistent
>>> data.
>>>
>>> Can the above really happen?
>>> Would using barriers remove the above concern?
>>> Am I missing something else?
>>
>> There is no inconsistency here that a filesystem would not equally
>> expect from a single device.
>> After the crash-while-writing B', it should expect to see either B or
>> B', and it does, depending on which device is primary.
>>
>> Nothing to see here.
>
> I will try to explain better:
> the problem is not related to the confusion between B or B'.
>
> The problem is that on one disk we have B' _without_ C.
>
> Regards,
> L.

You're demanding full atomic commits; this is precisely what journals
and /barriers/ are for.

Are you bypassing them in a quest for performance and paying for it on
crashes?
Or is this a hardware bug?
Or is it some glitch in the block-device layering leading to barrier
requests not being honored?
* Re: Why does one get mismatches?
2010-03-02 10:04 ` Michael Evans
@ 2010-03-02 11:02 ` Luca Berra
2010-03-02 12:13 ` Michael Evans
2010-03-02 18:14 ` Asdo
0 siblings, 2 replies; 70+ messages in thread
From: Luca Berra @ 2010-03-02 11:02 UTC (permalink / raw)
To: linux-raid

On Tue, Mar 02, 2010 at 02:04:47AM -0800, Michael Evans wrote:
>On Mon, Mar 1, 2010 at 11:36 PM, Luca Berra <bluca@comedia.it> wrote:
>> On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>>>> Disk 1 contains A and B in the correct order; it is missing C and B',
>>>> but we don't care - the app should be able to recover from a crash.
>>>>
>>>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>>>> missing.
>>>>
>>>> If in the above case A and C are data blocks and B contains a journal
>>>> related to A and C, booting from disk 2 could result in inconsistent
>>>> data.
>>>>
>>>> Can the above really happen?
>>>> Would using barriers remove the above concern?
>>>> Am I missing something else?
>>>
>>> There is no inconsistency here that a filesystem would not equally
>>> expect from a single device.
>>> After the crash-while-writing B', it should expect to see either B or
>>> B', and it does, depending on which device is primary.
>>>
>>> Nothing to see here.
>>
>> I will try to explain better:
>> the problem is not related to the confusion between B or B'.
>>
>> The problem is that on one disk we have B' _without_ C.
>>
>You're demanding full atomic commits; this is precisely what journals
>and /barriers/ are for.
>
>Are you bypassing them in a quest for performance and paying for it on
>crashes?
>Or is this a hardware bug?
>Or is it some glitch in the block-device layering leading to barrier
>requests not being honored?

I just asked for confirmation that with /barriers/ the scenario above
would not happen.

L.

-- 

Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: Why does one get mismatches?
2010-03-02 11:02 ` Luca Berra
@ 2010-03-02 12:13 ` Michael Evans
0 siblings, 0 replies; 70+ messages in thread
From: Michael Evans @ 2010-03-02 12:13 UTC (permalink / raw)
To: linux-raid

On Tue, Mar 2, 2010 at 3:02 AM, Luca Berra <bluca@comedia.it> wrote:
> On Tue, Mar 02, 2010 at 02:04:47AM -0800, Michael Evans wrote:
>> On Mon, Mar 1, 2010 at 11:36 PM, Luca Berra <bluca@comedia.it> wrote:
>>> On Tue, Mar 02, 2010 at 04:01:00PM +1100, Neil Brown wrote:
>>>>> Disk 1 contains A and B in the correct order; it is missing C and B',
>>>>> but we don't care - the app should be able to recover from a crash.
>>>>>
>>>>> Disk 2 contains A and B', but they are wrongly ordered, because C is
>>>>> missing.
>>>>>
>>>>> If in the above case A and C are data blocks and B contains a journal
>>>>> related to A and C, booting from disk 2 could result in inconsistent
>>>>> data.
>>>>>
>>>>> Can the above really happen?
>>>>> Would using barriers remove the above concern?
>>>>> Am I missing something else?
>>>>
>>>> There is no inconsistency here that a filesystem would not equally
>>>> expect from a single device.
>>>> After the crash-while-writing B', it should expect to see either B or
>>>> B', and it does, depending on which device is primary.
>>>>
>>>> Nothing to see here.
>>>
>>> I will try to explain better:
>>> the problem is not related to the confusion between B or B'.
>>>
>>> The problem is that on one disk we have B' _without_ C.
>>>
>> You're demanding full atomic commits; this is precisely what journals
>> and /barriers/ are for.
>>
>> Are you bypassing them in a quest for performance and paying for it on
>> crashes?
>> Or is this a hardware bug?
>> Or is it some glitch in the block-device layering leading to barrier
>> requests not being honored?
>
> I just asked for confirmation that with /barriers/ the scenario above
> would not happen.
>
> L.

Yes, obviously atomic commits require barriers.

Older hardware and operating systems that didn't allow any form of
buffering or out-of-order operations (hardware can re-arrange commits
internally now) inherently had a barrier between every operation. Modern
devices and systems have so many layers of interacting buffers, with
operation re-ordering to optimize throughput, that such predictability is
lacking unless it is explicitly requested in the form of a barrier.
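As a userspace illustration of what an ordering point buys you (my own
sketch, not from the thread; file name and offsets are placeholders): with
buffering and reordering in play, the only way to guarantee that B' never
reaches stable storage before C is to drain everything up to C first. On a
filesystem mounted with barriers enabled, fdatasync() provides that drain.

  /* ordered_write.c - enforce "C is durable before B'" with an explicit
   * flush.  Sketch with a placeholder file standing in for the device. */
  #include <stdio.h>
  #include <unistd.h>
  #include <fcntl.h>

  int main(void)
  {
          int fd = open("/tmp/journal.img", O_WRONLY | O_CREAT, 0600);
          char a[512] = "A", b[512] = "B", c[512] = "C", b2[512] = "B'";

          if (fd < 0)
                  return 1;
          pwrite(fd, a, sizeof(a), 0);
          pwrite(fd, b, sizeof(b), 512);
          pwrite(fd, c, sizeof(c), 1024);
          /* drain A, B, C to stable storage before B' may be issued;
           * without this, the kernel/drive may reorder B' ahead of C */
          if (fdatasync(fd) < 0) {
                  perror("fdatasync");
                  return 1;
          }
          pwrite(fd, b2, sizeof(b2), 512);
          return 0;
  }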
* Re: Why does one get mismatches?
2010-03-02 11:02 ` Luca Berra
2010-03-02 12:13 ` Michael Evans
@ 2010-03-02 18:14 ` Asdo
2010-03-02 18:52 ` Piergiorgio Sartor
2010-03-02 20:17 ` Neil Brown
1 sibling, 2 replies; 70+ messages in thread
From: Asdo @ 2010-03-02 18:14 UTC (permalink / raw)
To: linux-raid

Luca Berra wrote:
>>> I will try to explain better:
>>> the problem is not related to the confusion between B or B'.
>>>
>>> The problem is that on one disk we have B' _without_ C.
>>>
>> You're demanding full atomic commits; this is precisely what journals
>> and /barriers/ are for.
>>
>> Are you bypassing them in a quest for performance and paying for it on
>> crashes?
>> Or is this a hardware bug?
>> Or is it some glitch in the block-device layering leading to barrier
>> requests not being honored?
> I just asked for confirmation that with /barriers/ the scenario above
> would not happen.

I think so - it would not happen: the filesystem would stay consistent
(while the mismatches could still happen).

The problem is that barriers were only introduced for all md raid levels
in 2.6.33 (just released), and also I have read that XFS has a major
performance drop with barriers activated. People will be tempted to
disable barriers. AFAIR the performance drop was visible with one disk
alone; imagine now with RAID. And I expect similar performance drops with
other filesystems - correct me if I am wrong.

Now it would be interesting to understand why the mismatches don't
happen when LVM is above MD-raid!?
The mechanisms presented up to now on this ML for mismatches don't
explain why the same issue doesn't show up with LVM. I think.
So you might want to use raid-1 and raid-10 under LVM, just in case....
* Re: Why does one get mismatches?
2010-03-02 18:14 ` Asdo
@ 2010-03-02 18:52 ` Piergiorgio Sartor
2010-03-02 23:27 ` Asdo
1 sibling, 1 reply; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-03-02 18:52 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid

Hi,

> Now it would be interesting to understand why the mismatches don't
> happen when LVM is above MD-raid!?

well, maybe LVM copies the buffer, and after that it plays nice with MD,
i.e. no changes on the fly.

Or maybe it is just one system that behaves properly with LVM over MD.

bye,

-- 

piergiorgio
* Re: Why does one get mismatches?
2010-03-02 18:52 ` Piergiorgio Sartor
@ 2010-03-02 23:27 ` Asdo
2010-03-03  9:13 ` Piergiorgio Sartor
0 siblings, 1 reply; 70+ messages in thread
From: Asdo @ 2010-03-02 23:27 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid

Piergiorgio Sartor wrote:
> Hi,
>
>> Now it would be interesting to understand why the mismatches don't
>> happen when LVM is above MD-raid!?
>
> well, maybe LVM copies the buffer, and after that it plays nice with MD,
> i.e. no changes on the fly.

LVM copies the buffer!?
I don't think so...
LVM is near-zero overhead, so I would be surprised if it copied the buffer.
Also I don't think it was needed in their case, except maybe if there is
an I/O at the boundary of a logical volume or LVM stripe, which would
certainly be a mistake at the requestor's side.
LVM also does not merge requests AFAIR. (visible with mdstat -x 1)

> Or maybe it is just one system that behaves properly with LVM over MD.

hmm, maybe... But me also, I have never seen mismatches, and the only
raid-1's I have are above LVM (except /boot, but that's almost never
modified).
* Re: Why does one get mismatches?
2010-03-02 23:27 ` Asdo
@ 2010-03-03  9:13 ` Piergiorgio Sartor
2010-03-03 11:42 ` Asdo
0 siblings, 1 reply; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-03-03 9:13 UTC (permalink / raw)
To: Asdo; +Cc: Piergiorgio Sartor, linux-raid

Hi,

> LVM copies the buffer!?
> I don't think so...
> LVM is near-zero overhead, so I would be surprised if it copied the
> buffer.

well, I'm not so sure it is near-zero overhead (I'll explain below), and
even if it is, making copies could still be "near-zero" overhead - it
depends on where the bottlenecks are. I'm not an LVM insider, so these
are just random thoughts.

About the near-zero overhead, maybe this could open a different thread,
but just to give some numbers...

I've a bunch of RAID-6 volumes, made of USB disks, i.e. using PATA<->USB
bridges. These volumes are aggregated using LVM and, on top of that,
there is a LUKS container.

The raw read performance on the RAID-6 is, in the best case, around
48MB/s, which is pretty good for USB; I guess it would be difficult to
get more.

The raw read performance of the LVM volume is ~38MB/s.
The raw read performance of the LUKS is ~28MB/s (actually maybe a bit
less). Each further layer loses about 10MB/s.

I guess this is much more visible with USB than in SATA/SAS situations,
since going from 205 to 195 might go unnoticed.

This is not a CPU problem: the PC is dual core, one core runs and it
never exceeds 30%. The USB is slow enough to allow all the operations to
be performed in real time.

Nevertheless, LVM is doing something there; in this setup it has an
overhead of about 20%, far from zero. So the 10MB/s loss could be -
again, I've no idea how LVM works - caused by copying. It could also be
something else, of course; it would be interesting to have more
information from some expert (also to optimize my USB setup, if
possible).

> Also I don't think it was needed in their case, except maybe if

Maybe, but if the filesystem can play with the buffer while it is
submitted, then I would rather copy the data. Again, some expert opinion
would be appreciated.

> LVM also does not merge requests AFAIR. (visible with mdstat -x 1)

BTW, what's that? I mean "mdstat -x 1"...

> But me also, I have never seen mismatches, and the only raid-1's I have
> are above LVM (except /boot, but that's almost never modified).

Well, that's good, you confirmed my experience.

I've also RAID-10 on LVM and never got mismatches, while the plain
RAID-10 sometimes did.

bye,

-- 

piergiorgio
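For anyone wanting to repeat this kind of per-layer measurement, here is a
small sketch (device path is a placeholder) that times a raw sequential
read through whichever layer you point it at - the raid device, the LV, or
the LUKS mapping - using O_DIRECT so the page cache doesn't inflate the
numbers.

  /* rawread.c - crude sequential-read benchmark for one block layer.
   * Usage sketch: ./rawread /dev/md0   (or /dev/vg/lv, /dev/mapper/luks) */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/time.h>

  #define CHUNK (1 << 20)         /* 1 MiB per read */
  #define TOTAL 256u              /* read 256 MiB in total */

  int main(int argc, char **argv)
  {
          int fd = open(argc > 1 ? argv[1] : "/dev/md0", O_RDONLY | O_DIRECT);
          void *buf;
          struct timeval t0, t1;
          unsigned i;

          if (fd < 0 || posix_memalign(&buf, 4096, CHUNK)) {
                  perror("setup");
                  return 1;
          }
          gettimeofday(&t0, NULL);
          for (i = 0; i < TOTAL; i++)
                  if (read(fd, buf, CHUNK) != CHUNK) {
                          perror("read");
                          return 1;
                  }
          gettimeofday(&t1, NULL);
          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
          printf("%u MiB in %.2f s = %.1f MB/s\n", TOTAL, secs, TOTAL / secs);
          return 0;
  }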
* Re: Why does one get mismatches?
2010-03-03  9:13 ` Piergiorgio Sartor
@ 2010-03-03 11:42 ` Asdo
2010-03-03 12:03 ` Piergiorgio Sartor
0 siblings, 1 reply; 70+ messages in thread
From: Asdo @ 2010-03-03 11:42 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: linux-raid

Piergiorgio Sartor wrote:
> I've a bunch of RAID-6 volumes, made of USB disks, i.e. using
> PATA<->USB bridges.

Don't your bridges ever drop out or break? What is the brand/model?
Years ago I broke a lot of those, just by using the disks intensively.
They probably kind of overheated, then failed. They couldn't last two
days of intense disk activity... They were Chinese stuff bought on eBay,
though.

> These volumes are aggregated using LVM and, on top of that, there
> is a LUKS container.
>
> The raw read performance on the RAID-6 is, in the best case, around
> 48MB/s, which is pretty good for USB; I guess it would be difficult to
> get more.

48MB/sec can be good for one disk, but it's bad for many disks attached
separately to USB ports...

> The raw read performance of the LVM volume is ~38MB/s.
> The raw read performance of the LUKS is ~28MB/s (actually maybe a bit
> less).

Might your LVM, or a partition within it, be misaligned? Or did you not
set the readahead?
http://www.beowulf.org/archive/2007-May/018359.html
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg10804.html
People using LVM on arrays giving hundreds of MB/sec see slowdowns on the
order of a few percent:
http://article.gmane.org/gmane.linux.raid/18302

>> LVM also does not merge requests AFAIR. (visible with mdstat -x 1)
> BTW, what's that? I mean "mdstat -x 1"...

I'm sorry, I meant "iostat -x 1".

>> But me also, I have never seen mismatches, and the only raid-1's I have
>> are above LVM (except /boot, but that's almost never modified).
>
> Well, that's good, you confirmed my experience.
>
> I've also RAID-10 on LVM and never got mismatches, while the plain
> RAID-10 sometimes did.

This fact needs further investigation, methinks...
We could ask the LVM people if LVM really copies the buffer.

Regards
A.
* Re: Why does one get mismatches?
2010-03-03 11:42 ` Asdo
@ 2010-03-03 12:03 ` Piergiorgio Sartor
0 siblings, 0 replies; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-03-03 12:03 UTC (permalink / raw)
To: Asdo; +Cc: Piergiorgio Sartor, linux-raid

Hi,

> > I've a bunch of RAID-6 volumes, made of USB disks, i.e. using
> > PATA<->USB bridges.
> Don't your bridges ever drop out or break?

never had problems with broken bridges, but other problems I had, like
unreliable transfers under certain conditions.

> What is the brand/model?

The best I could find were from Digitus; they've a pretty standard
chipset, JMicron I guess, but they seem to be built better than others,
still with JMicron. No problems, so far.
I've two other brands, same chipset, which seem reliable for the SATA
part, but the PATA does not work properly. All the others, from different
vendors with different chipsets, never had problems.
I can imagine that the PSU that comes with them might be a weak point; I
saw really poor quality units.
On the other hand, it's a bunch of RAID-6 for a reason... :-)

> Years ago I broke a lot of those, just by using the disks intensively.
> They probably kind of overheated, then failed. They couldn't last two
> days of intense disk activity... They were Chinese stuff bought on
> eBay, though.

My use case is offline storage, so I'll not use the box for two days
straight, but I have used it for several hours.

> 48MB/sec can be good for one disk, but it's bad for many disks attached
> separately to USB ports...

Well, maybe I forgot to mention that the HDDs are connected to the PC
through a USB hub (three 4-to-1 USB hubs, to be precise), i.e. one single
USB connection. This can do, in theory, 60MB/s; in practice I never saw
more than 50MB/s, in ideal conditions.
So, in my view, 48MB/s is pretty much the maximum you can get.

> Might your LVM, or a partition within it, be misaligned? Or did you not
> set the readahead?

LVM takes care to align itself - this is in the new version - and the
readahead also seems to be automagically set.
Nevertheless, the LUKS is aligned by hand.

> http://www.beowulf.org/archive/2007-May/018359.html
> http://www.mail-archive.com/linux-raid@vger.kernel.org/msg10804.html
> People using LVM on arrays giving hundreds of MB/sec see slowdowns on
> the order of a few percent:
> http://article.gmane.org/gmane.linux.raid/18302

Thanks for the links, I'll have a look.

> > I've also RAID-10 on LVM and never got mismatches, while the plain
> > RAID-10 sometimes did.
>
> This fact needs further investigation, methinks...
> We could ask the LVM people if LVM really copies the buffer.

Or, in general, if they have any explanation for this observation of
ours.

Thanks,

bye,

-- 

piergiorgio
* Re: Why does one get mismatches?
2010-03-02 18:14 ` Asdo
2010-03-02 18:52 ` Piergiorgio Sartor
@ 2010-03-02 20:17 ` Neil Brown
1 sibling, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-03-02 20:17 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid

On Tue, 02 Mar 2010 19:14:25 +0100 Asdo <asdo@shiftmail.org> wrote:

> Luca Berra wrote:
> >>> I will try to explain better:
> >>> the problem is not related to the confusion between B or B'.
> >>>
> >>> The problem is that on one disk we have B' _without_ C.
> >>>
> >> You're demanding full atomic commits; this is precisely what journals
> >> and /barriers/ are for.
> >>
> >> Are you bypassing them in a quest for performance and paying for it
> >> on crashes?
> >> Or is this a hardware bug?
> >> Or is it some glitch in the block-device layering leading to barrier
> >> requests not being honored?
> > I just asked for confirmation that with /barriers/ the scenario above
> > would not happen.
>
> I think so - it would not happen: the filesystem would stay consistent
> (while the mismatches could still happen).
>
> The problem is that barriers were only introduced for all md raid levels
> in 2.6.33 (just released), and also I have read that XFS has a major
> performance drop with barriers activated. People will be tempted to
> disable barriers. AFAIR the performance drop was visible with one disk
> alone; imagine now with RAID. And I expect similar performance drops
> with other filesystems - correct me if I am wrong.

The barrier support added in 2.6.33 was for striped md arrays.

RAID1, which is not striped, has had barrier support since about 2.6.16,
as it is much easier to implement.

NeilBrown

> Now it would be interesting to understand why the mismatches don't
> happen when LVM is above MD-raid!?
> The mechanisms presented up to now on this ML for mismatches don't
> explain why the same issue doesn't show up with LVM. I think.
> So you might want to use raid-1 and raid-10 under LVM, just in case....
* Re: Why does one get mismatches?
2010-02-24 14:46 ` Bill Davidsen
2010-02-24 16:12 ` Martin K. Petersen
@ 2010-02-24 21:32 ` Neil Brown
2010-02-25  7:22 ` Goswin von Brederlow
2010-02-25  8:47 ` John Robinson
1 sibling, 2 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-24 21:32 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Steven Haigh, Bryan Mesich, Jon, linux-raid

On Wed, 24 Feb 2010 09:46:23 -0500
Bill Davidsen <davidsen@tmr.com> wrote:

> > There is no question of data corruption.
> > When memory changes between being written to one device and to another,
> > this does not cause corruption, only inconsistency. Either the block
> > will be written again consistently soon, or it will never be read.
>
> Just what is it that rewrites the data block? The user program doesn't
> know it's needed, the filesystem, if any, doesn't know it's needed, and
> as far as I can tell md doesn't checksum before issuing the write and
> again after the last write is done, or make a copy and write from that.
> So what sees that the data has changed and rewrites it?

The filesystem re-writes the block, though probably it is more accurate to
say 'the page cache' rewrites the block (the page cache is essentially
just a library of code that the filesystem uses).

When a page is changed, its 'Dirty' flag is set.
Before a page is written out, the Dirty flag is cleared.
So if a page is written differently to two devices, then it must have been
changed after the Dirty flag was cleared, so the Dirty flag will be set,
so the page cache will try to write it out again (after about 30 seconds,
or at unmount time).

When accessing a block device directly ( > /dev/md0 ) the page cache is
still used and will still write out any page that has the Dirty flag set.

If you open /dev/md0 with O_DIRECT there is no page cache involved and so
no setting of Dirty flags. So you could engineer a situation with O_DIRECT
that writes different data to the two devices, but you would have to try
fairly hard.

NeilBrown
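To make that last point concrete, here is a sketch of the "fairly hard"
case (my own construction, not from the thread; the device path is a
placeholder, and running it against a real array would deliberately create
mismatches and destroy data at offset 0). A second thread keeps scribbling
on the buffer while the main thread writes it with O_DIRECT; with no page
cache copy involved, each mirror leg can DMA a different snapshot of the
buffer, and no Dirty flag will ever trigger a corrective rewrite.

  /* direct_mismatch.c - deliberately race a buffer against an O_DIRECT
   * write.  Demonstration sketch only; do NOT run on data you care about. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <pthread.h>

  #define BLKSZ 4096
  static char *buf;
  static volatile int done;

  /* keep mutating the buffer while the write is in flight */
  static void *scribbler(void *arg)
  {
          (void)arg;
          while (!done)
                  memset(buf, rand() & 0xff, BLKSZ);
          return NULL;
  }

  int main(void)
  {
          pthread_t t;
          int fd = open("/dev/md0", O_WRONLY | O_DIRECT);  /* placeholder */

          if (fd < 0 || posix_memalign((void **)&buf, BLKSZ, BLKSZ))
                  return 1;
          pthread_create(&t, NULL, scribbler, NULL);
          /* the same user page is handed to each mirror leg in turn;
           * whatever the scribbler changed in between becomes a mismatch */
          pwrite(fd, buf, BLKSZ, 0);
          done = 1;
          pthread_join(t, NULL);
          return 0;
  }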
* Re: Why does one get mismatches?
2010-02-24 21:32 ` Neil Brown
@ 2010-02-25  7:22 ` Goswin von Brederlow
2010-02-25  7:39 ` Neil Brown
0 siblings, 1 reply; 70+ messages in thread
From: Goswin von Brederlow @ 2010-02-25 7:22 UTC (permalink / raw)
To: Neil Brown; +Cc: Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

Neil Brown <neilb@suse.de> writes:

> On Wed, 24 Feb 2010 09:46:23 -0500
> Bill Davidsen <davidsen@tmr.com> wrote:
>
>> Just what is it that rewrites the data block? The user program doesn't
>> know it's needed, the filesystem, if any, doesn't know it's needed, and
>> as far as I can tell md doesn't checksum before issuing the write and
>> again after the last write is done, or make a copy and write from that.
>> So what sees that the data has changed and rewrites it?
>
> The filesystem re-writes the block, though probably it is more accurate
> to say 'the page cache' rewrites the block (the page cache is
> essentially just a library of code that the filesystem uses).
>
> When a page is changed, its 'Dirty' flag is set.
> Before a page is written out, the Dirty flag is cleared.
> So if a page is written differently to two devices, then it must have
> been changed after the Dirty flag was cleared, so the Dirty flag will be
> set, so the page cache will try to write it out again (after about 30
> seconds, or at unmount time).

So maybe MD could check the dirty flag after a write and then output a
warning, so we can track down the issue. MD could also rewrite the page
prior to setting the disks in-sync, until the dirty bit is clear after a
write.

MfG
        Goswin
* Re: Why does one get mismatches?
2010-02-25  7:22 ` Goswin von Brederlow
@ 2010-02-25  7:39 ` Neil Brown
0 siblings, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-25 7:39 UTC (permalink / raw)
To: Goswin von Brederlow
Cc: Bill Davidsen, Steven Haigh, Bryan Mesich, Jon, linux-raid

On Thu, 25 Feb 2010 08:22:10 +0100
Goswin von Brederlow <goswin-v-b@web.de> wrote:

> Neil Brown <neilb@suse.de> writes:
>
> > The filesystem re-writes the block, though probably it is more
> > accurate to say 'the page cache' rewrites the block (the page cache is
> > essentially just a library of code that the filesystem uses).
> >
> > When a page is changed, its 'Dirty' flag is set.
> > Before a page is written out, the Dirty flag is cleared.
> > So if a page is written differently to two devices, then it must have
> > been changed after the Dirty flag was cleared, so the Dirty flag will
> > be set, so the page cache will try to write it out again (after about
> > 30 seconds, or at unmount time).
>
> So maybe MD could check the dirty flag after a write and then output a
> warning, so we can track down the issue. MD could also rewrite the page
> prior to setting the disks in-sync, until the dirty bit is clear after a
> write.

md isn't able to see the dirty bit. It gets a 'bio', which has a 'biovec',
which has a list of pages with offset and size. It does not know if the
page is in the page cache or not, so it cannot know if the dirty flag on
the page means anything or not.

Yes, it technically could check the dirty bit, and if it saw any of them
set then it could reschedule the writes. However:

1. This is a layering violation - it is the wrong thing to do.
2. It might not work. The filesystem could keep the 'dirty' status
   elsewhere, such as in a 'buffer_head', and only copy it through to the
   page occasionally.
3. It could cause a live-lock. If an application is changing a mapped
   page quite regularly, then the current page cache will write it out
   every 30 seconds or so. Your proposed change would write it out again
   and again, as soon as each previous write completes.

So, no: we cannot do that.

NeilBrown
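For the curious, the check being rejected would look roughly like the
sketch below (written for illustration only, using the 2.6-era bio_vec
iteration API; it is hypothetical code, not anything in md - and, per the
three objections above, not something that should be merged).

  #include <linux/bio.h>
  #include <linux/mm.h>

  /* Hypothetical helper: scan a bio's pages for the Dirty flag before
   * declaring the mirrors in-sync. */
  static int bio_has_dirty_pages(struct bio *bio)
  {
          struct bio_vec *bvec;
          int i;

          bio_for_each_segment(bvec, bio, i)
                  if (PageDirty(bvec->bv_page))
                          return 1;   /* page re-dirtied while in flight */
          return 0;
  }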
* Re: Why does one get mismatches?
2010-02-24 21:32 ` Neil Brown
2010-02-25  7:22 ` Goswin von Brederlow
@ 2010-02-25  8:47 ` John Robinson
2010-02-25  9:07 ` Neil Brown
1 sibling, 1 reply; 70+ messages in thread
From: John Robinson @ 2010-02-25 8:47 UTC (permalink / raw)
To: linux-raid

On 24/02/2010 21:32, Neil Brown wrote:
[...]
> If you open /dev/md0 with O_DIRECT there is no page cache involved and
> so no setting of Dirty flags. So you could engineer a situation with
> O_DIRECT that writes different data to the two devices, but you would
> have to try fairly hard.

Hang on. O_DIRECT sets off all sorts of alarm bells for me, not that I
understand it properly. Of course there's O_DIRECT on files too. Linus
Torvalds is quite outspoken about it: http://kerneltrap.org/node/7563

Could we be seeing mismatches because applications are opening their
files with O_DIRECT in a (perhaps misguided) attempt to get better
performance?

Cheers,

John.
* Re: Why does one get mismatches?
2010-02-25  8:47 ` John Robinson
@ 2010-02-25  9:07 ` Neil Brown
0 siblings, 0 replies; 70+ messages in thread
From: Neil Brown @ 2010-02-25 9:07 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid

On Thu, 25 Feb 2010 08:47:59 +0000
John Robinson <john.robinson@anonymous.org.uk> wrote:

> On 24/02/2010 21:32, Neil Brown wrote:
> [...]
> > If you open /dev/md0 with O_DIRECT there is no page cache involved and
> > so no setting of Dirty flags. So you could engineer a situation with
> > O_DIRECT that writes different data to the two devices, but you would
> > have to try fairly hard.
>
> Hang on. O_DIRECT sets off all sorts of alarm bells for me, not that I
> understand it properly. Of course there's O_DIRECT on files too. Linus
> Torvalds is quite outspoken about it: http://kerneltrap.org/node/7563
>
> Could we be seeing mismatches because applications are opening their
> files with O_DIRECT in a (perhaps misguided) attempt to get better
> performance?

Unlikely.
The app would need to be doing async direct-io, or it would need to have
multiple threads, and in either case it would need to change the buffer
that was being written while the write was happening. And that would be a
pretty dumb thing to do unless it almost immediately wrote the same buffer
out again.

So not exactly impossible, but probably the least likely of the various
possible explanations.

NeilBrown
* Re: Why does one get mismatches?
2010-02-11  5:14 ` Neil Brown
2010-02-11 17:51 ` Bryan Mesich
@ 2010-02-11 18:12 ` Piergiorgio Sartor
1 sibling, 0 replies; 70+ messages in thread
From: Piergiorgio Sartor @ 2010-02-11 18:12 UTC (permalink / raw)
To: Neil Brown; +Cc: Bill Davidsen, Jon, linux-raid

Hi all,

> > This whole discussion simply shows that for RAID-1 software RAID is
> > less reliable than hardware RAID (no, I don't mean fake-RAID), because
> > it doesn't pin the data buffer until all copies are written.
>
> That doesn't make it less reliable. It just makes it more confusing.

well, sorry to say, but it makes it useless.

The problem is: how can we be sure that the FS really plays tricks only
with blocks which will be unused?

In other words, either there should be an agreed and confirmed interface
between the caller (FS) and the called (MD), handling the situation
properly (i.e. the FS will not play these pranks), or the called (MD)
should be robust against all the possible nasty things the caller (FS)
can do.

Because what will happen if someone introduces a new FS which works fine
with everything but software RAID?

Similarly, I've some identical PCs with RAID-10 f2. Starting with Fedora
12, there is a weekly check of the RAID array (with email notification,
BTW without the mismatch count...).

On these PCs I get mismatches, sometimes. Checking the mismatch count, I
found out that it changes: sometimes a bit more, sometimes a bit less
(or zero).

Now, IMHO the check is completely useless and even annoying. I've got
mismatches, changing, but I do not know how serious they are. Not good...
I could have lost data or not, and I do not know...

> But for a more complete discussion on raid recovery and when it might
> be sensible to "vote" among the blocks, see
> http://neil.brown.name/blog/20100211050355

Nice discussion. Especially the clarification about the unclean shutdown
event. This could be, in effect, a killer for the majority-select (or
RAID-6 reconstruction) decision.

I personally agree with your conclusion.

Anyway, I miss, or did not get, one more point. Specifically, the "smart
recovery" should be composed of two steps. One is detecting where the
problems are. This means not only the stripe but, in the case of RAID-6,
also the *potential* component (HDD) of the array.

The reason is that, as I already wrote some time ago, there is a *huge*
difference between having all the mismatches *potentially* on one single
component, or spread across several. The first case clearly gives more
information and allows a better judgment of the situation.

Thanks,

bye,

-- 

piergiorgio
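For reference, the count the notification scripts are missing is exported
through md's standard sysfs interface; a check can be triggered and the
result read back as in this sketch (md0 is a placeholder; in a real script
you would poll sync_action until it returns to "idle" before reading the
count).

  /* mismatch.c - trigger a check on an md array and read mismatch_cnt.
   * Sketch; run as root, /dev/md0 assumed. */
  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/sys/block/md0/md/sync_action", "w");
          char cnt[32];

          if (!f || fputs("check", f) == EOF) {
                  perror("sync_action");
                  return 1;
          }
          fclose(f);
          /* ... wait for the check to finish (poll sync_action) ... */
          f = fopen("/sys/block/md0/md/mismatch_cnt", "r");
          if (!f || !fgets(cnt, sizeof(cnt), f)) {
                  perror("mismatch_cnt");
                  return 1;
          }
          printf("mismatch_cnt: %s", cnt);
          return 0;
  }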