* mismatch_cnt questions
From: Christian Pernegger @ 2007-03-04 11:22 UTC
To: linux-raid

Hello,

these questions apparently got buried in another thread, so here goes again ...

I have a mismatch_cnt of 384 on a 2-way mirror. The box runs 2.6.17.4 and can't really be rebooted or have its kernel updated easily.

1) Where does the mismatch come from? The box hasn't been down since the creation of the array.

2) How much data is 384? Blocks? Chunks? Bytes?

3) Is the "repair" sync action safe to use on the above kernel? Any other methods / additional steps for fixing this?

Regards,

C.

* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-04 11:50 UTC
To: Christian Pernegger; +Cc: linux-raid

On Sunday March 4, pernegger@gmail.com wrote:
> I have a mismatch_cnt of 384 on a 2-way mirror.
> The box runs 2.6.17.4 and can't really be rebooted or have its kernel
> updated easily
>
> 1) Where does the mismatch come from?
> The box hasn't been down since the creation of the array.

Do you have swap on the mirror at all?

I recently discovered/realised that when 'swap' writes to a raid1 it can end up with different data on the different devices. This is perfectly acceptable, as in that case the data will never be read.

If you don't have swap, then I don't know what is happening.

> 2) How much data is 384? Blocks? Chunks? Bytes?

The unit is 'sectors', but the granularity is about 64K, so '384' means 3 different 64K sections of the device showed an error. One day I might reduce the granularity.

> 3) Is the "repair" sync action safe to use on the above kernel? Any
> other methods / additional steps for fixing this?

"repair" is safe, though it may not be effective. "repair" for raid1 did not work until Jan 26th this year. Before then it was identical in effect to 'check'.

NeilBrown

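For reference, a minimal sketch of driving the check from userspace through the md sysfs interface; /dev/md0 is a placeholder array name and the paths assume a kernel that exposes these files:

   # trigger a read-only consistency check of the example array md0
   echo check > /sys/block/md0/md/sync_action

   # watch progress; the action returns to 'idle' when the pass completes
   cat /proc/mdstat
   cat /sys/block/md0/md/sync_action

   # number of mismatched sectors found by the last check/repair pass
   cat /sys/block/md0/md/mismatch_cnt
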
* Re: mismatch_cnt questions
From: Christian Pernegger @ 2007-03-04 12:01 UTC
To: linux-raid

Hey, that was quick ... thanks!

> > 1) Where does the mismatch come from? The box hasn't been down since
> > the creation of the array.
>
> Do you have swap on the mirror at all?

As a matter of fact I do, /dev/md0_p2 is a swap partition.

> I recently discovered/realised that when 'swap' writes to a raid1 it can
> end up with different data on the different devices. This is perfectly
> acceptable as in that case the data will never be read.

Interesting ... care to elaborate a little?

Would disabling swap, running mkswap again and rerunning check return 0 in this case?

Regards,

C.

* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-04 22:19 UTC
To: Christian Pernegger; +Cc: linux-raid

On Sunday March 4, pernegger@gmail.com wrote:
> > Do you have swap on the mirror at all?
>
> As a matter of fact I do, /dev/md0_p2 is a swap partition.
>
> Interesting ... care to elaborate a little?

When we write to a raid1, the data is DMAed from memory out to each device independently, so if the memory changes between the two (or more) DMA operations, you will get inconsistency between the devices.

When the data being written is part of a file, the page will still be dirty after the write 'completes', so another write will be issued fairly soon (depending on various VM settings) and so the inconsistency will only be visible for a short time, and you probably won't notice.

If this happens when writing to swap - i.e. if the page is dirtied while the write is happening - then the swap system will just forget that that page was written out. It is obviously still active, so some other page will get swapped out instead. There will never be any attempt to write out the 'correct' data to the device, as that doesn't really mean anything. As more swap activity happens it is quite possible that the inconsistent area of the array will be written again with consistent data, but it is also quite possible that it won't be written for a long time. Long enough that a 'check' will find it.

In any of these cases there is no risk of data corruption, as the inconsistent area of the array will never be read from.

> Would disabling swap, running mkswap again and rerunning check return
> 0 in this case?

Disable swap, write to the entire swap area

   dd if=/dev/zero of=/dev/md0_p2 bs=1M

then mkswap and rerun 'check' and it should return '0'. It did for me.

NeilBrown

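Put together as one hedged sequence (device names are taken from this thread, with md0 assumed to be the underlying array; keep swap disabled until the check has finished so nothing dirties the area mid-rewrite):

   swapoff /dev/md0_p2                          # stop using the swap partition
   dd if=/dev/zero of=/dev/md0_p2 bs=1M         # rewrite the whole area so all mirrors agree
   mkswap /dev/md0_p2                           # recreate the swap signature
   echo check > /sys/block/md0/md/sync_action   # rerun the consistency check
   cat /sys/block/md0/md/mismatch_cnt           # should now read 0
   swapon /dev/md0_p2                           # re-enable swap afterwards
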
* Re: mismatch_cnt questions - how about raid10?
From: Peter Rabbitson @ 2007-03-06 10:04 UTC
To: linux-raid

Neil Brown wrote:
> When we write to a raid1, the data is DMAed from memory out to each
> device independently, so if the memory changes between the two (or
> more) DMA operations, you will get inconsistency between the devices.

Does this apply to raid 10 devices too? And in the case of LVM, if swap is on top of a LV which is part of a VG which has a single PV as the raid array - will this happen as well? Or will the LVM layer take the data once and distribute exact copies of it to the PVs (in this case just the raid), effectively giving the raid array invariable data?

* Re: mismatch_cnt questions - how about raid10?
From: Neil Brown @ 2007-03-06 10:20 UTC
To: Peter Rabbitson; +Cc: linux-raid

On Tuesday March 6, rabbit@rabbit.us wrote:
> Does this apply to raid 10 devices too? And in the case of LVM, if swap
> is on top of a LV which is part of a VG which has a single PV as the
> raid array - will this happen as well?

Yes, it applies to raid10 too.

I don't know the details of the inner workings of LVM, but I doubt it will make a difference. Copying the data in memory is just too costly to do if it can be avoided. With LVM and raid1/10 it can be avoided with no significant cost. With raid4/5/6, not copying into the cache can cause data corruption, so we always copy.

NeilBrown

* Re: mismatch_cnt questions - how about raid10?
From: Peter Rabbitson @ 2007-03-06 10:56 UTC
To: linux-raid

Neil Brown wrote:
> Yes, it applies to raid10 too.
>
> I don't know the details of the inner workings of LVM, but I doubt it
> will make a difference. Copying the data in memory is just too costly
> to do if it can be avoided. With LVM and raid1/10 it can be avoided
> with no significant cost.
> With raid4/5/6, not copying into the cache can cause data corruption.
> So we always copy.

I see. So basically, for those of us who want to run swap on raid 1 or 10 and at the same time want to rely on mismatch_cnt for early problem detection, the only option is to create a separate md device just for the swap. Is this about right?

* Re: mismatch_cnt questions - how about raid10?
From: Justin Piszcz @ 2007-03-06 10:59 UTC
To: Peter Rabbitson; +Cc: linux-raid

On Tue, 6 Mar 2007, Peter Rabbitson wrote:
> I see. So basically, for those of us who want to run swap on raid 1 or 10
> and at the same time want to rely on mismatch_cnt for early problem
> detection, the only option is to create a separate md device just for the
> swap. Is this about right?
[trim]

That is what I do.

/dev/md0 - swap
/dev/md1 - boot
/dev/md2 - root

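A minimal sketch of creating such a dedicated swap mirror, assuming two drives with matching spare partitions (the partition names below are placeholders, not taken from the thread):

   # build a small raid1 used only for swap
   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
   mkswap /dev/md0
   swapon /dev/md0
   # any mismatch_cnt reported on the data arrays (md1, md2 above) then stays meaningful
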
* Re: mismatch_cnt questions - how about raid10?
From: Neil Brown @ 2007-03-12 5:35 UTC
To: Peter Rabbitson; +Cc: linux-raid

On Tuesday March 6, rabbit@rabbit.us wrote:
> I see. So basically, for those of us who want to run swap on raid 1 or
> 10 and at the same time want to rely on mismatch_cnt for early problem
> detection, the only option is to create a separate md device just for
> the swap. Is this about right?

Though it is less likely, a regular filesystem could still (I think) genuinely write different data to different devices in a raid1/10.

So relying on mismatch_cnt for early problem detection probably isn't really workable.

And I think that if a drive is returning bad data without signalling an error, then you are very much into the 'late' side of problem detection.

I see the 'check' and 'repair' functions mostly as valuable for the fact that they read every block and will surface latent bad blocks early. If they ever find a discrepancy, then it is either perfectly normal, or something seriously wrong that could have been wrong for a while....

NeilBrown

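Scheduling such a full-read pass is straightforward to script; a rough sketch follows, under the assumption that the arrays are named md0-md2 and a local 'mail' command is available (some distributions ship their own checkarray-style wrapper instead):

   #!/bin/sh
   # periodic scrub: read every block of each array and report mismatches
   for md in md0 md1 md2; do
       echo check > /sys/block/$md/md/sync_action
       # wait for this array's pass to finish before starting the next
       while [ "$(cat /sys/block/$md/md/sync_action)" != "idle" ]; do sleep 60; done
       echo "$md mismatch_cnt: $(cat /sys/block/$md/md/mismatch_cnt)"
   done | mail -s "md check results" root
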
* Re: mismatch_cnt questions - how about raid10?
From: Peter Rabbitson @ 2007-03-12 14:26 UTC
To: linux-raid

Neil Brown wrote:
> Though it is less likely, a regular filesystem could still (I think)
> genuinely write different data to different devices in a raid1/10.
>
> So relying on mismatch_cnt for early problem detection probably isn't
> really workable.
>
> And I think that if a drive is returning bad data without signalling
> an error, then you are very much into the 'late' side of problem
> detection.

I agree with the latter, but my concern is not so much with the cause as with the effect. From past discussion on the list I gather that no special effort is made to determine which chunk to take as 'valid', even though more than 2 logically identical chunks might be present (raid1/10). And you also seem to think that the DMA syndrome might even apply to plain fast-changing filesystems, let alone something with multiple layers (fs on lvm on raid).

So here is my question: how (theoretically) safe is it to use a raid1/10 array for something very disk intensive, e.g. a mail spool? How likely is it that the effect you described above will creep different blocks onto the disks and subsequently return the wrong data to the kernel? Should I look into raid5/6 for this kind of activity, in case both uptime and data integrity are my number one priorities and I am willing to sacrifice performance?

Thank you

* Re: mismatch_cnt questions
From: Eyal Lebedinsky @ 2007-03-04 21:21 UTC
To: Neil Brown; +Cc: Christian Pernegger, linux-raid

Neil Brown wrote:
> On Sunday March 4, pernegger@gmail.com wrote:
>> I have a mismatch_cnt of 384 on a 2-way mirror.
[trim]
>> 3) Is the "repair" sync action safe to use on the above kernel? Any
>> other methods / additional steps for fixing this?
>
> "repair" is safe, though it may not be effective.
> "repair" for raid1 did not work until Jan 26th this year.
> Before then it was identical in effect to 'check'.

How is "repair" safe but not effective? When it finds a mismatch, how does it know which part is correct and which should be fixed (which copy of raid1, or which block in raid5)?

When a disk fails we know what to rewrite, but when we discover a mismatch we do not have this knowledge. It may corrupt the good copy of a raid1.

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>

* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-04 22:30 UTC
To: Eyal Lebedinsky; +Cc: Christian Pernegger, linux-raid

On Monday March 5, eyal@eyal.emu.id.au wrote:
> How is "repair" safe but not effective? When it finds a mismatch, how does
> it know which part is correct and which should be fixed (which copy of
> raid1, or which block in raid5)?

It is not 'effective' in that before 26jan2007 it did not actually copy the chosen data on to the other drives, i.e. a 'repair' had the same effect as a 'check', which is 'safe'.

> When a disk fails we know what to rewrite, but when we discover a mismatch
> we do not have this knowledge. It may corrupt the good copy of a raid1.

If a block differs between the different drives in a raid1, then no copy is 'good'. It is possible that one copy is the one you think you want, but you probably wouldn't know by looking at it. The worst situation is to have inconsistent data. If you read and get one value, then later read and get another value, that is really bad.

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy and writing it over all other copies. For raid5 we assume the data is correct and update the parity.

You might be able to imagine a failure scenario where this produces the 'wrong' result, but I'm confident that in the majority of cases it is as good as any other option. If we had something like ZFS, which tracks checksums for all blocks, and could somehow get that information usefully into the md level, then maybe we could do something better.

I suspect that it would be very rare for raid5 to detect a mismatch during a 'check', and raid1 would only see them when a write was aborted, such as swap can do, and filesystems might do occasionally (e.g. truncate a file that was recently written to).

NeilBrown

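For reference, a sketch of driving a repair the same way the check is driven, assuming /dev/md0 as the array and a kernel recent enough that raid1 'repair' actually rewrites (per the January 2007 note above):

   echo repair > /sys/block/md0/md/sync_action
   # wait until the action drops back to 'idle'
   while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do sleep 30; done
   # a follow-up check should then report no remaining mismatches
   echo check > /sys/block/md0/md/sync_action
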
* Re: mismatch_cnt questions
From: Eyal Lebedinsky @ 2007-03-05 7:45 UTC
To: Neil Brown; +Cc: Christian Pernegger, linux-raid

Neil Brown wrote:
[trim Q re how resync fixes data]
> For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
> and writing it over all other copies.
> For raid5 we assume the data is correct and update the parity.

Can raid6 identify the bad block (two parity blocks could allow this if only one block has bad data in a stripe)? If so, does it?

This will surely mean more value for raid6 than just the two-disk-failure protection.

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat

* detecting/correcting _slightly_ flaky disks
From: Michael Stumpf @ 2007-03-05 14:56 UTC
To: linux-raid

I'm trying to assemble an array (raid 5) of 8 older, but not yet old-age, ATA 120 gig disks, but there is intermittent flakiness in one or more of the drives. Symptoms:

* Won't boot sometimes. Even after moving to 2 power supplies and monitoring the amp spikes, sometimes I get "clicking" from 1-2 of the drives after the startup.

* When initiating a SMART long test, so far two of them have:
  + passed 50-75% of the time
  + when "failed", didn't actually fail, just got perpetually stuck at an arbitrary % of test remaining
  + if I cancel and restart the test, often they pass

I've heard clicking from some drives when executing SMART long tests. I'm doing 4 drives at a time, but still can't isolate the culprit, and I don't want to use the laborious "sit and listen by the computer" method to determine which are dying - I would prefer a tool to detect the issue.

I know there's a problem with one or more because my issues with my primary array disappeared the minute I used LVM to remove these devices (and upgrade to some larger/newer ones).

Two questions:

1) Is it smartest to isolate which drives are clicking and chuck them into the wood chipper, given the circumstances?

2) Are there tools that are designed to determine if a drive is fit for duty? dd_rescue et al. seem focused on saving a dying drive; spinrite seems to be controversial black-magic marketing, etc. I could try the manufacturer-shipped tools, but given their black-box nature I have no idea how much (or little) is being done by their tests. What do you folks recommend?

Thanks in advance.
--Michael Stumpf

* Re: detecting/correcting _slightly_ flaky disks
From: Justin Piszcz @ 2007-03-05 15:09 UTC
To: Michael Stumpf; +Cc: linux-raid

On Mon, 5 Mar 2007, Michael Stumpf wrote:
> 2) Are there tools that are designed to determine if a drive is fit for
> duty? dd_rescue et al. seem focused on saving a dying drive; spinrite
> seems to be controversial black-magic marketing, etc. I could try the
> manufacturer-shipped tools, but given their black-box nature I have no
> idea how much (or little) is being done by their tests. What do you
> folks recommend?
[trim]

This is what I use:

799] What is the best way to verify a hard drive has no bad blocks?

   /usr/bin/time badblocks -b 512 -s -v -w /dev/hdg

Note, this will wipe anything out on the drive. There is also a non-destructive write mode; check the manpage for badblocks(8).

This operation usually takes 12 hours or so on a 400GB drive. If this passes and the short+long tests pass without error, the drive is probably OK for the time being.

Also, what does smartctl -a /dev/hda for each of your drives show?

Justin.

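A hedged sketch of the gentler variants mentioned above (the device name /dev/hdg is carried over from the example; -n is badblocks' non-destructive read-write mode and is considerably slower than the destructive -w run):

   # non-destructive read-write test (keeps existing data, still exercises every block)
   badblocks -n -s -v /dev/hdg

   # kick off the drive's own self-test, then read the log once it finishes
   smartctl -t long /dev/hdg        # or -t short for a quick pass
   smartctl -l selftest /dev/hdg
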
* Re: detecting/correcting _slightly_ flaky disks
From: Michael Stumpf @ 2007-03-05 17:01 UTC
To: Justin Piszcz; +Cc: linux-raid

This is the drive I think is most suspect. What isn't obvious, because it isn't listed in the self-test log, is that between #1 and #2 there was an aborted, hung test. The #4 short test that was aborted was also a hung test that I eventually aborted manually -- I heard clicking from drives at that time, though I can't swear it was from this drive.

I'm not sure I fully understand the nuances of this report. If anything jumps out at you, I'd appreciate a tip on how you read it. (To me, it looks mostly healthy.)

> Also, what does smartctl -a /dev/hda for each of your drives show?
>
> Justin.

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD1200JB-75CRA0
Serial Number:    WD-WMA8C3115683
Firmware Version: 16.06V76
User Capacity:    120,000,000,000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Mar 05 10:52:05 2007 CAST
SMART support is: Available - device has SMART capability.
                  Enabled status cached by OS, trying SMART RETURN STATUS cmd.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline
data collection:                 (4680) seconds.
Offline data collection
capabilities:                    (0x3b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  87) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   146   098   021    Pre-fail  Always       -       3491
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       399
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22147
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       397
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error   00%        299              -
# 2  Extended offline    Interrupted (host reset)  50%        279              -
# 3  Short offline       Completed without error   00%        279              -
# 4  Short offline       Aborted by host           80%        279              -
# 5  Extended offline    Completed without error   00%        102              -
# 6  Extended offline    Completed without error   00%        1026             -
# 7  Extended offline    Completed without error   00%        859              -
# 8  Extended offline    Completed without error   00%        692              -
# 9  Extended offline    Completed without error   00%        525              -
#10  Extended offline    Completed without error   00%        380              -
#11  Extended offline    Completed without error   00%        370              -

Device does not support Selective Self Tests/Logging

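As a quick, hedged triage aid, the handful of attributes that most often flag impending failure in a report like the one above can be pulled out with a one-liner (the device name is an example):

   smartctl -A /dev/hdg | egrep 'Reallocated_Sector|Reallocated_Event|Current_Pending|Offline_Uncorrectable|UDMA_CRC'

Non-zero raw values in those rows are usually worth more attention than the overall PASSED verdict.
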
* Re: detecting/correcting _slightly_ flaky disks
From: Justin Piszcz @ 2007-03-05 17:11 UTC
To: Michael Stumpf; +Cc: linux-raid

Besides having been run for a long time, I don't see anything strange with this drive.

Justin.

On Mon, 5 Mar 2007, Michael Stumpf wrote:
> This is the drive I think is most suspect. What isn't obvious, because it
> isn't listed in the self-test log, is that between #1 and #2 there was an
> aborted, hung test. The #4 short test that was aborted was also a hung
> test that I eventually aborted manually -- I heard clicking from drives
> at that time, though I can't swear it was from this drive.
>
> I'm not sure I fully understand the nuances of this report. If anything
> jumps out at you, I'd appreciate a tip on how you read it. (To me, it
> looks mostly healthy.)
[trim - full smartctl output quoted]

* Re: detecting/correcting _slightly_ flaky disks
From: Bill Davidsen @ 2007-03-07 0:14 UTC
To: mjstumpf; +Cc: Justin Piszcz, linux-raid

Michael Stumpf wrote:
> This is the drive I think is most suspect. What isn't obvious, because it
> isn't listed in the self-test log, is that between #1 and #2 there was an
> aborted, hung test.
[trim]
> I'm not sure I fully understand the nuances of this report. If anything
> jumps out at you, I'd appreciate a tip on how you read it. (To me, it
> looks mostly healthy.)

For what it's worth, if you are getting hung tests, either your drive or power supply should be redeployed as a paperweight. My opinion...

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: detecting/correcting _slightly_ flaky disks
From: Michael Stumpf @ 2007-03-07 1:37 UTC
To: Bill Davidsen; +Cc: Justin Piszcz, linux-raid

Bill Davidsen wrote:
> For what it's worth, if you are getting hung tests, either your drive
> or power supply should be redeployed as a paperweight. My opinion...

I don't disagree, but I'd like to find something more concrete or repeatable, especially given that these give an audible click when failing. The problem I'm having is that I can't nail down precisely where the problem is, although your suggestion makes a lot of sense.

After running Justin's suggested badblocks test, I'm kind of disturbed to see that all these drives are passing with flying colors.

Firmware issue? WD had it in the past.

* Re: detecting/correcting _slightly_ flaky disks
From: berk walker @ 2007-03-07 13:57 UTC
To: mjstumpf; +Cc: Bill Davidsen, Justin Piszcz, linux-raid

Michael Stumpf wrote:
> I don't disagree, but I'd like to find something more concrete or
> repeatable, especially given that these give an audible click when
> failing. The problem I'm having is that I can't nail down precisely
> where the problem is, although your suggestion makes a lot of sense.
>
> After running Justin's suggested badblocks test, I'm kind of disturbed
> to see that all these drives are passing with flying colors.
>
> Firmware issue? WD had it in the past.
[trim]

One nice thing: if your cables are OK, and your power is OK, then you can trash the electronics and transplant from a similar drive with bad sectors.

"load head, seek, spindle, unload head" is not a nice thing for the hardware.

b-

* Re: detecting/correcting _slightly_ flaky disks
From: Bill Davidsen @ 2007-03-07 15:01 UTC
To: mjstumpf; +Cc: Justin Piszcz, linux-raid

Michael Stumpf wrote:
> I don't disagree, but I'd like to find something more concrete or
> repeatable, especially given that these give an audible click when
> failing. The problem I'm having is that I can't nail down precisely
> where the problem is, although your suggestion makes a lot of sense.

Well, here's a thought if you are inclined... power up and go into BIOS config mode. That will leave the drives powered but not in use. Now pull the power cable out on one of them. Does the drive make a familiar click as the heads do an emergency park? That's the easiest thing to check which might cause the click.

One thing your SMART output doesn't include is Temp, which might or might not tell you anything. You could try hddtemp, but SMART would probably report it if the sensor was there.

> After running Justin's suggested badblocks test, I'm kind of disturbed
> to see that all these drives are passing with flying colors.
>
> Firmware issue? WD had it in the past.

Certainly you could check for newer firmware, and check to see if all drives have the same level.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

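Two quick ways to gather that information, as a sketch (the device name is a placeholder; hddtemp only helps if the drive actually exposes a temperature attribute):

   hddtemp /dev/hda                         # report the drive temperature, if available
   smartctl -i /dev/hda | grep -i firmware  # confirm the firmware revision on each drive
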
* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-05 23:40 UTC
To: Eyal Lebedinsky; +Cc: Christian Pernegger, linux-raid

On Monday March 5, eyal@eyal.emu.id.au wrote:
> Can raid6 identify the bad block (two parity blocks could allow this
> if only one block has bad data in a stripe)? If so, does it?

No, it doesn't.

I guess that maybe it could: rebuild each block in turn based on the xor parity, and then test if the Q-syndrome is satisfied. But I doubt the gain would be worth the pain.

What we really want is drives that store 520 byte sectors so that a checksum can be passed all the way up and down through the stack .... or something like that.

NeilBrown

* Re: mismatch_cnt questions
From: Bill Davidsen @ 2007-03-07 0:22 UTC
To: Neil Brown; +Cc: Eyal Lebedinsky, Christian Pernegger, linux-raid

Neil Brown wrote:
> On Monday March 5, eyal@eyal.emu.id.au wrote:
>> Can raid6 identify the bad block (two parity blocks could allow this
>> if only one block has bad data in a stripe)? If so, does it?
>
> No, it doesn't.
>
> I guess that maybe it could:
>   Rebuild each block in turn based on the xor parity, and then test
>   if the Q-syndrome is satisfied.
> but I doubt the gain would be worth the pain.

What's the value of "I have a drive which returned bad data" vs. "I have a whole array and some part of it returned bad data"? What's the cost of doing that identification, since it need only be done when the data are inconsistent between the drives and give a parity or Q mismatch? It seems easy, given that you are going to read all the pertinent sectors into memory anyway. If the drive can be identified, the data can be rewritten with confidence.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-08 6:39 UTC
To: Neil Brown; +Cc: Eyal Lebedinsky, Christian Pernegger, linux-raid

Neil Brown wrote:
> What we really want is drives that store 520 byte sectors so that a
> checksum can be passed all the way up and down through the stack
> .... or something like that.

A lot of SCSI disks have that option, but I believe it's not arbitrary bytes. In particular, the integrity check portion is only 2 bytes, 16 bits.

One option, of course, would be to store, say, 16 sectors/pages/blocks in 17 physical sectors/pages/blocks, where the last one is a packing of some sort of high-powered integrity checks, e.g. SHA-256, or even an ECC block. This would hurt performance substantially, but it would be highly useful for very high data integrity applications.

I will look at the mathematics of trying to do this with RAID-6, but I'm 99% sure RAID-6 isn't sufficient to do it, even with syndrome set recomputation on every read.

-hpa

* Re: mismatch_cnt questions
From: Martin K. Petersen @ 2007-03-08 13:54 UTC
To: H. Peter Anvin; +Cc: Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

>> What we really want is drives that store 520 byte sectors so that a
>> checksum can be passed all the way up and down through the stack
>> .... or something like that.

hpa> A lot of SCSI disks have that option, but I believe it's not
hpa> arbitrary bytes. In particular, the integrity check portion is
hpa> only 2 bytes, 16 bits.

It's important to distinguish between drives that support 520 byte sectors and drives that include the Data Integrity Feature, which also uses 520 byte sectors.

Most regular SCSI drives can be formatted with 520 byte sectors, and a lot of disk arrays use the extra space to store an internal checksum. The downside to 520 byte sectors is that it makes buffer management a pain, as 512 bytes of data is followed by 8 bytes of protection data. That sucks when writing - say - a 4KB block, because your scatterlist becomes long and twisted, having to interleave data and protection data every sector.

The Data Integrity Feature also uses 520 byte sectors. The difference is that the format of the 8 bytes is well defined, and that both initiator and target are capable of verifying the integrity of an I/O. It is correct that the CRC is only 16 bits.

DIF is strictly between HBA and disk. I'm lobbying HBA vendors to expose it to the OS so we can use it. I'm also lobbying to get them to allow us to submit the data and the protection data in separate scatterlists so we don't have to do the interleaving at the OS level.

hpa> One option, of course, would be to store, say, 16
hpa> sectors/pages/blocks in 17 physical sectors/pages/blocks, where
hpa> the last one is a packing of some sort of high-powered integrity
hpa> checks, e.g. SHA-256, or even an ECC block.

A while ago I tinkered with something like that. I actually cheated and stored the checking data in a different partition on the same drive. It was a pretty simple test using my DIF code (i.e. 8 bytes per sector).

I wanted to see how badly the extra seeks would affect us. The results weren't too discouraging, but I decided I liked the ZFS approach better (having the checksum in the fs parent block, which you'll be reading anyway).

--
Martin K. Petersen      Oracle Linux Engineering

* Re: mismatch_cnt questions
From: Bill Davidsen @ 2007-03-09 2:00 UTC
To: Martin K. Petersen; +Cc: H. Peter Anvin, Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

Martin K. Petersen wrote:
> The Data Integrity Feature also uses 520 byte sectors. The difference
> is that the format of the 8 bytes is well defined, and that both
> initiator and target are capable of verifying the integrity of an I/O.
> It is correct that the CRC is only 16 bits.

When last I looked at Hamming codes, and that would be 1989 or 1990, I believe I learned that the number of Hamming bits needed to cover N data bits was 1+log2(N), which for 512 bytes would be 1+12, and fits into a 16 bit field nicely. I don't know that I would go that way - fix any one-bit error, detect any two-bit error - rather than a CRC, which gives me only one chance in 64k of an undetected data error, but I find it interesting.

I also looked at fire codes, which at the time would still have been a viable topic for a thesis. I remember nothing about how they worked whatsoever.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-09 4:20 UTC
To: Bill Davidsen; +Cc: Martin K. Petersen, Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

Bill Davidsen wrote:
> When last I looked at Hamming codes, and that would be 1989 or 1990, I
> believe I learned that the number of Hamming bits needed to cover N data
> bits was 1+log2(N), which for 512 bytes would be 1+12, and fits into a
> 16 bit field nicely. I don't know that I would go that way - fix any
> one-bit error, detect any two-bit error - rather than a CRC, which gives
> me only one chance in 64k of an undetected data error, but I find it
> interesting.

A Hamming code across the bytes of a sector is pretty darn pointless, since that's not a typical failure pattern.

-hpa

* Re: mismatch_cnt questions
From: Bill Davidsen @ 2007-03-09 5:20 UTC
To: H. Peter Anvin; +Cc: Martin K. Petersen, Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

H. Peter Anvin wrote:
> A Hamming code across the bytes of a sector is pretty darn pointless,
> since that's not a typical failure pattern.

I just thought it was perhaps one of those little-known facts that meaningful ECC could fit in 16 bits. I mentioned that I wouldn't go that way, mainly because it would be less effective at catching multibit errors. This was a "fun fact" for all those folks who missed Hamming codes in their education, because they are old tech.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-08 6:34 UTC
To: Eyal Lebedinsky; +Cc: Neil Brown, Christian Pernegger, linux-raid

Eyal Lebedinsky wrote:
> Can raid6 identify the bad block (two parity blocks could allow this
> if only one block has bad data in a stripe)? If so, does it?
>
> This will surely mean more value for raid6 than just the
> two-disk-failure protection.

No. It's not mathematically possible.

-hpa

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-08 7:00 UTC
To: H. Peter Anvin; +Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid

H. Peter Anvin wrote:
> Eyal Lebedinsky wrote:
>> Can raid6 identify the bad block (two parity blocks could allow this
>> if only one block has bad data in a stripe)? If so, does it?
>
> No. It's not mathematically possible.

Okay, I've thought about it, and I got it wrong the first time (off-the-cuff misapplication of the pigeonhole principle.) It apparently *is* possible (for notation and algebra rules, see my paper):

Let's assume we know exactly one of the data (Dn) drives is corrupt (ignoring the case of P or Q corruption for now.) That means instead of Dn we have a corrupt value, Xn. Note that which data drive is corrupt (n) is not known.

We compute P' and Q' as the computed values over the corrupt set.

   P+P' = Dn+Xn
   Q+Q' = g^n Dn + g^n Xn          g = {02}
   Q+Q' = g^n (Dn+Xn)

By assumption, Dn != Xn, so P+P' = Dn+Xn != {00}. g^n is *never* {00}, so Q+Q' = g^n (Dn+Xn) != {00}.

   (Q+Q')/(P+P') = [g^n (Dn+Xn)]/(Dn+Xn) = g^n

Since n is known to be in the range [0,255), we thus have:

   n = log_g((Q+Q')/(P+P'))

... which is a well-defined relation.

For the case where either the P or the Q drive is corrupt (and the data drives are all good), this is easily detected by the fact that if P is the corrupt drive, Q+Q' = {00}; similarly, if Q is the corrupt drive, P+P' = {00}. Obviously, if P+P' = Q+Q' = {00}, then as far as RAID-6 can discover, there is no corruption in the drive set.

So, yes, RAID-6 *can* detect single drive corruption, and even tell you which drive it is, if you're willing to compute a full syndrome set (P', Q') on every read (as well as on every write.)

Note: RAID-6 cannot detect 2-drive corruption, unless of course the corruption is in different byte positions. If multiple corresponding byte positions are corrupt, then the algorithm above will generally point you to a completely innocent drive.

-hpa

* Re: mismatch_cnt questions 2007-03-08 7:00 ` H. Peter Anvin @ 2007-03-08 8:21 ` H. Peter Anvin 2007-03-13 9:58 ` Andre Noll 0 siblings, 1 reply; 36+ messages in thread From: H. Peter Anvin @ 2007-03-08 8:21 UTC (permalink / raw) To: H. Peter Anvin Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid I have just updated the paper at: http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf ... with this information (in slightly different notation and with a bit more detail.) -hpa ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2007-03-08 8:21 ` H. Peter Anvin @ 2007-03-13 9:58 ` Andre Noll 2007-03-13 23:46 ` H. Peter Anvin 0 siblings, 1 reply; 36+ messages in thread From: Andre Noll @ 2007-03-13 9:58 UTC (permalink / raw) To: H. Peter Anvin Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid On 00:21, H. Peter Anvin wrote: > I have just updated the paper at: > > http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf > > ... with this information (in slightly different notation and with a bit > more detail.) There's a typo in the new section: s/By assumption, X_z != D_n/By assumption, X_z != D_z/ Regards Andre -- The only person who always got his work done by Friday was Robinson Crusoe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2007-03-13 9:58 ` Andre Noll @ 2007-03-13 23:46 ` H. Peter Anvin 0 siblings, 0 replies; 36+ messages in thread From: H. Peter Anvin @ 2007-03-13 23:46 UTC (permalink / raw) To: Andre Noll; +Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid Andre Noll wrote: > On 00:21, H. Peter Anvin wrote: >> I have just updated the paper at: >> >> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf >> >> ... with this information (in slightly different notation and with a bit >> more detail.) > > There's a typo in the new section: > > s/By assumption, X_z != D_n/By assumption, X_z != D_z/ > Thanks, fixed. -hpa ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2007-03-04 22:30 ` Neil Brown 2007-03-05 7:45 ` Eyal Lebedinsky @ 2007-03-06 6:27 ` Paul Davidson 1 sibling, 0 replies; 36+ messages in thread From: Paul Davidson @ 2007-03-06 6:27 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Hi Neil, I've been following this thread with interest and I have a few questions. Neil Brown wrote: > On Monday March 5, eyal@eyal.emu.id.au wrote: > >>Neil Brown wrote: > >>When a disk fails we know what to rewrite, but when we discover a mismatch >>we do not have this knowledge. It may corrupt the good copy of a raid1. > > If a block differs between the different drives in a raid1, then no > copy is 'good'. It is possible that one copy is the one you think you > want, but you probably wouldn't know by looking at it. > The worst situation is to have inconsistent data. If you read and get > one value, then later read and get another value, that is really bad. > > For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy > and writing it over all other copies. > For raid5 we assume the data is correct and update the parity. Wouldn't it be better to signal an error rather than potentially corrupt data - or perhaps this already happens? Does the above only refer to a 'repair' action? I'm worrying here about silent data corruption that gets onto my backup tapes. If an error was (is?) signaled by the raid system during the backup and could be tracked to the file being copied at the time, it would allow recovery of the data from a prior backup. If raid remains silent, the corrupted data eventually gets copied onto my entire backup rotation. Can you comment on this? FWIW, my 600GB raid5 array shows mismatch_cnt of 24 when I 'check' it - that machine has hung up on occasion. Cheers, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
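As an aside on the backup worry above: a pre-backup scrub is easy to script against md's sysfs files. The sketch below is only an illustration and makes assumptions - it relies on /sys/block/<array>/md/sync_action and mismatch_cnt being present, "md0" is just an example array name, and it only ever requests 'check', never 'repair', so it rewrites nothing.

# Run a 'check' pass and refuse to proceed with a backup if mismatches are found.
import sys
import time

def scrub_and_report(md="md0"):
    base = "/sys/block/%s/md" % md
    with open(base + "/sync_action", "w") as f:
        f.write("check")                  # count mismatches only; no rewriting
    while True:
        with open(base + "/sync_action") as f:
            if f.read().strip() == "idle":
                break                     # the check pass has finished
        time.sleep(30)
    with open(base + "/mismatch_cnt") as f:
        return int(f.read())              # units are sectors, as noted earlier in the thread

if __name__ == "__main__":
    count = scrub_and_report(sys.argv[1] if len(sys.argv) > 1 else "md0")
    if count:
        print("WARNING: mismatch_cnt = %d - investigate before backing up" % count)
        sys.exit(1)
    print("array looks consistent")

A wrapper like this, run as root before the backup job, at least turns a silent mismatch into a loud one; it cannot say which copy or which file is affected, which is exactly the limitation discussed in the rest of the thread.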
* Re: mismatch_cnt questions 2007-03-04 11:50 ` Neil Brown 2007-03-04 12:01 ` Christian Pernegger 2007-03-04 21:21 ` mismatch_cnt questions Eyal Lebedinsky @ 2008-05-12 11:16 ` Bas van Schaik 2008-05-12 14:31 ` Justin Piszcz 2 siblings, 1 reply; 36+ messages in thread From: Bas van Schaik @ 2008-05-12 11:16 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Neil Brown wrote: > On Sunday March 4, pernegger@gmail.com wrote: > > (...) >> 3) Is the "repair" sync action safe to use on the above kernel? Any >> other methods / additional steps for fixing this? >> > > "repair" is safe, though it may not be effective. > "repair" for raid1 was did not work until Jan 26th this year. > Before then it was identical in effect to 'check'. > Sorry to dig up such an old thread, but I'd like to know since when (which kernel version) "repair" (recompute parity, assuming data is consistent) for raid5 is effective? -- Bas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2008-05-12 11:16 ` Bas van Schaik @ 2008-05-12 14:31 ` Justin Piszcz 0 siblings, 0 replies; 36+ messages in thread From: Justin Piszcz @ 2008-05-12 14:31 UTC (permalink / raw) To: Bas van Schaik; +Cc: Neil Brown, linux-raid On Mon, 12 May 2008, Bas van Schaik wrote: > Neil Brown wrote: >> On Sunday March 4, pernegger@gmail.com wrote: >> >> (...) >>> 3) Is the "repair" sync action safe to use on the above kernel? Any >>> other methods / additional steps for fixing this? >>> >> >> "repair" is safe, though it may not be effective. >> "repair" for raid1 was did not work until Jan 26th this year. >> Before then it was identical in effect to 'check'. >> > Sorry to dig up such an old thread, but I'd like to know since when > (which kernel version) "repair" (recompute parity, assuming data is > consistent) for raid5 is effective? Good question. ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads: [~2008-05-12 14:31 UTC | newest]

Thread overview: 36+ messages
2007-03-04 11:22 mismatch_cnt questions Christian Pernegger
2007-03-04 11:50 ` Neil Brown
2007-03-04 12:01 ` Christian Pernegger
2007-03-04 22:19 ` Neil Brown
2007-03-06 10:04 ` mismatch_cnt questions - how about raid10? Peter Rabbitson
2007-03-06 10:20 ` Neil Brown
2007-03-06 10:56 ` Peter Rabbitson
2007-03-06 10:59 ` Justin Piszcz
2007-03-12 5:35 ` Neil Brown
2007-03-12 14:26 ` Peter Rabbitson
2007-03-04 21:21 ` mismatch_cnt questions Eyal Lebedinsky
2007-03-04 22:30 ` Neil Brown
2007-03-05 7:45 ` Eyal Lebedinsky
2007-03-05 14:56 ` detecting/correcting _slightly_ flaky disks Michael Stumpf
2007-03-05 15:09 ` Justin Piszcz
2007-03-05 17:01 ` Michael Stumpf
2007-03-05 17:11 ` Justin Piszcz
2007-03-07 0:14 ` Bill Davidsen
2007-03-07 1:37 ` Michael Stumpf
2007-03-07 13:57 ` berk walker
2007-03-07 15:01 ` Bill Davidsen
2007-03-05 23:40 ` mismatch_cnt questions Neil Brown
2007-03-07 0:22 ` Bill Davidsen
2007-03-08 6:39 ` H. Peter Anvin
2007-03-08 13:54 ` Martin K. Petersen
2007-03-09 2:00 ` Bill Davidsen
2007-03-09 4:20 ` H. Peter Anvin
2007-03-09 5:20 ` Bill Davidsen
2007-03-08 6:34 ` H. Peter Anvin
2007-03-08 7:00 ` H. Peter Anvin
2007-03-08 8:21 ` H. Peter Anvin
2007-03-13 9:58 ` Andre Noll
2007-03-13 23:46 ` H. Peter Anvin
2007-03-06 6:27 ` Paul Davidson
2008-05-12 11:16 ` Bas van Schaik
2008-05-12 14:31 ` Justin Piszcz