* Redundancy check using "echo check > sync_action": error reporting? @ 2008-03-16 14:21 Bas van Schaik 2008-03-16 15:14 ` Janek Kozicki 0 siblings, 1 reply; 44+ messages in thread From: Bas van Schaik @ 2008-03-16 14:21 UTC (permalink / raw) To: linux-raid Hi all, As we speak, I'm trying to debug a really weird type of filesystem corruption in a quite complex layered system with networking involved: ATA over Ethernet - RAID5 - LVM - CryptoLoop - EXT3 In plain English: four storage servers export a bunch of block devices using AoE, and the "cluster frontend" uses those devices to build three RAID5 arrays. Those arrays are the basis of a large LVM volume group, in which a Logical Volume was created with an encrypted 2.5TB EXT3 filesystem (cryptoloop). Recently the system suffered massive filesystem corruption, which even made e2fsck crash. Theodore Tso was able to partially analyze and fix the filesystem and found out that some random garbage was written to the EXT3 inode tables, as well as some other weird corruption. Personally, I suspect that one of the storage servers or the network caused these severe corruptions, but I have never seen any errors at the RAID5 level. The (Debian) system runs a monthly check of the RAID5 arrays using Martin F. Krafft's checkarray script. Basically this script performs an "echo check > /sys/block/$array/md/sync_action" for all arrays. With my (basic) knowledge of RAID5 I assume this check only recomputes the parity and compares it to the stored XOR value. This makes me wonder: 1) Will the kernel actually warn me when an inconsistency is found? Reading some other posts on the lists, it seems the kernel will print a "read error corrected!" message; is that correct? Note that I'm using kernel 2.6.18 (Debian stable); was it already implemented that way in that kernel? 2) How can the RAID code actually correct such a read error on RAID5? How does it know which device actually contains the faulty data? The answers to those questions are very important to me: if the kernel actually warns me when an inconsistency is found, then the absence of such warnings rules out the possibility that there is something wrong with the network or one of the storage servers. Actually, that would mean that the "cluster frontend" is causing the corruptions. Kind regards, -- Bas van Schaik ^ permalink raw reply [flat|nested] 44+ messages in thread
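For reference, the core of what checkarray does can be sketched in a few lines of shell. This is a simplified illustration of the "echo check" loop described above, not the actual script (which adds locking, option parsing and error handling):

    #!/bin/sh
    # Start a background redundancy check on every md array found in sysfs.
    for md in /sys/block/md*; do
        [ -w "$md/md/sync_action" ] || continue
        echo check > "$md/md/sync_action"
    done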
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-16 14:21 Redundancy check using "echo check > sync_action": error reporting? Bas van Schaik @ 2008-03-16 15:14 ` Janek Kozicki 2008-03-20 13:32 ` Bas van Schaik 0 siblings, 1 reply; 44+ messages in thread From: Janek Kozicki @ 2008-03-16 15:14 UTC (permalink / raw) Cc: linux-raid Bas van Schaik said: (by the date of Sun, 16 Mar 2008 15:21:11 +0100) > As we speak, I'm trying to debug a really weird type of filesystem > corruption in a quite complex layered system with networking involved: AFAIK, even for the simplest case where corruption happens between a head and a disk platter during a write operation - the RAID has no way to detect that. Unless it discovers later that the parity on another disk is wrong, and it is automatically updated to reflect the corrupted data. Does it produce a message during resync in such a case? Someone here should be able to answer this. -- Janek Kozicki | ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-16 15:14 ` Janek Kozicki @ 2008-03-20 13:32 ` Bas van Schaik 2008-03-20 13:47 ` Robin Hill 0 siblings, 1 reply; 44+ messages in thread From: Bas van Schaik @ 2008-03-20 13:32 UTC (permalink / raw) To: Janek Kozicki; +Cc: linux-raid Janek Kozicki wrote: > Bas van Schaik said: (by the date of Sun, 16 Mar 2008 15:21:11 +0100) > > >> As we speak, I'm trying to debug a really weird type of filesystem >> corruption in a quite complex layered system with networking involved: >> > > AFAIK, even for the simplest case where corruption happens between a > head and a disk platter during a write operation - the RAID has no way > to detect that. Unless it discovers later that the parity on > another disk is wrong, and it is automatically updated to reflect the > corrupted data. Does it produce a message during resync in such a case? > Someone here should be able to answer this. > Anyone able to answer the last and most important question: does it produce a message during resync in case of corruption? That would be great! -- Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 13:32 ` Bas van Schaik @ 2008-03-20 13:47 ` Robin Hill 2008-03-20 14:19 ` Bas van Schaik 0 siblings, 1 reply; 44+ messages in thread From: Robin Hill @ 2008-03-20 13:47 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 952 bytes --] On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: > Anyone able to answer the last and most important question: does it > produce a message during resync in case of corruption? That would be great! > There's no explicit message produced by the md module, no. You need to check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many mismatches there are. Similarly, following a repair this will indicate how many mismatches it thinks have been fixed (by updating the parity block to match the data blocks). I've no idea whether the checkarray script you're using is checking this counter - there seems little point in having a special script if it isn't though. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
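Concretely, once a check has finished, the counter Robin mentions can be read straight from sysfs. A minimal illustration (md0 is just a placeholder for the array in question):

    # After an "echo check > sync_action" run has completed:
    cat /sys/block/md0/md/mismatch_cnt    # 0 means no inconsistencies were found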
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 13:47 ` Robin Hill @ 2008-03-20 14:19 ` Bas van Schaik 2008-03-20 14:45 ` Robin Hill 2008-03-20 16:35 ` Theodore Tso 0 siblings, 2 replies; 44+ messages in thread From: Bas van Schaik @ 2008-03-20 14:19 UTC (permalink / raw) To: linux-raid; +Cc: Theodore Tso Robin Hill wrote: > On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: > >> Anyone able to answer the last and most important question: does it >> produce a message during resync in case of corruption? That would be great! >> > There's no explicit message produced by the md module, no. You need to > check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > mismatches there are. Similarly, following a repair this will indicate > how many mismatches it thinks have been fixed (by updating the parity > block to match the data blocks). > Marvellous! I naively assumed that the module would warn me, but that's not true. Wouldn't it be appropriate to print a message to dmesg if such a mismatch occurs during a check? Such a mismatch clearly means that there is something wrong with your hardware lying beneath md, doesn't it? > I've no idea whether the checkarray script you're using is checking this > counter - there seems little point in having a special script if it > isn't though. > If I understand the meaning of this counter, it would be sufficient to check the value of it _before_ the check operation and compare that value to the counter value _after_ the check. If the counter has increased: the check has encountered some inconsistencies which should be reported. Please correct me if I'm wrong! Cheers, Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 14:19 ` Bas van Schaik @ 2008-03-20 14:45 ` Robin Hill 2008-03-20 15:16 ` Bas van Schaik 2008-03-20 16:35 ` Theodore Tso 1 sibling, 1 reply; 44+ messages in thread From: Robin Hill @ 2008-03-20 14:45 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2491 bytes --] On Thu Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > Robin Hill wrote: > > On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: > > > >> Anyone able to answer the last and most important question: does it > >> produce a message during resync in case of corruption? That would be great! > >> > > There's no explicit message produced by the md module, no. You need to > > check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > > mismatches there are. Similarly, following a repair this will indicate > > how many mismatches it thinks have been fixed (by updating the parity > > block to match the data blocks). > > > Marvellous! I naively assumed that the module would warn me, but that's > not true. Wouldn't it be appropriate to print a message to dmesg if such > a mismatch occurs during a check? Such a mismatch clearly means that > there is something wrong with your hardware lying beneath md, doesn't it? > With a RAID5 then mostly, yes - there may be errors caused by transient situations (interference, cosmic rays, etc) which are entirely independent of the hardware. With other RAID versions it's not quite as clear cut. For example with RAID1 it's possible for the in-memory data to have been changed between writing to each disk (especially with swap disks) - this isn't necessarily an issue (and certainly not a hardware one). > > I've no idea whether the checkarray script you're using is checking this > > counter - there seems little point in having a special script if it > > isn't though. > > > If I understand the meaning of this counter, it would be sufficient to > check the value of it _before_ the check operation and compare that > value to the counter value _after_ the check. If the counter has > increased: the check has encountered some inconsistencies which should > be reported. > Please correct me if I'm wrong! > Depends on what the previous operation was. After a repair, the counter will indicate the number of errors fixed, not the number remaining. Theoretically, after a repair there will be no errors remaining, so any value (> 0) in the counter after a check would indicate an issue to be reported. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
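Putting Robin's two points together, a check-and-report wrapper can be sketched as below. This is an illustrative sketch, not part of checkarray; the array name, the polling interval and the use of mail(1) are all assumptions:

    #!/bin/sh
    # Run a check on md0 and report if the array shows any mismatches.
    MD=/sys/block/md0/md
    echo check > "$MD/sync_action"
    # Poll until the check finishes; mismatch_cnt is only meaningful then.
    while [ "$(cat "$MD/sync_action")" != "idle" ]; do
        sleep 60
    done
    count=$(cat "$MD/mismatch_cnt")
    if [ "$count" -gt 0 ]; then
        echo "md0: check reported mismatch_cnt=$count" | \
            mail -s "RAID consistency warning" root
    fi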
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 14:45 ` Robin Hill @ 2008-03-20 15:16 ` Bas van Schaik 2008-03-20 16:04 ` Robin Hill 0 siblings, 1 reply; 44+ messages in thread From: Bas van Schaik @ 2008-03-20 15:16 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > On Thu Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > > >> Robin Hill wrote: >> >>> On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: >>> >>> >>>> Anyone able to answer the last and most important question: does it >>>> produce a message during resync in case of corruption? That would be great! >>>> >>>> >>> There's no explicit message produced by the md module, no. You need to >>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>> mismatches there are. Similarly, following a repair this will indicate >>> how many mismatches it thinks have been fixed (by updating the parity >>> block to match the data blocks). >>> >>> >> Marvellous! I naively assumed that the module would warn me, but that's >> not true. Wouldn't it be appropriate to print a message to dmesg if such >> a mismatch occurs during a check? Such a mismatch clearly means that >> there is something wrong with your hardware lying beneath md, doesn't it? >> >> > With a RAID5 then mostly, yes - there may be errors caused by transient > situations (interference, cosmic rays, etc) which are entirely > independent of the hardware. With other RAID versions it's not quite as > clear cut. For example with RAID1 it's possible for the in-memory data > to have been changed between writing to each disk (especially with swap > disks) - this isn't necessarily an issue (and certainly not a hardware > one). > Maybe I understand something wrong then. In an ideal situation, the following should hold: - for RAID5: all data blocks should XOR to the parity block - for RAID1: all bits should be identical If the redundancy check encounters an anomaly, something should be fixed. If something should be fixed, clearly something went wrong somewhere in the past. Or can you give an example where the statements mentioned above don't hold and nothing is wrong? >>> I've no idea whether the checkarray script you're using is checking this >>> counter - there seems little point in having a special script if it >>> isn't though. >>> >>> >> If I understand the meaning of this counter, it would be sufficient to >> check the value of it _before_ the check operation and compare that >> value to the counter value _after_ the check. If the counter has >> increased: the check has encountered some inconsistencies which should >> be reported. >> Please correct me if I'm wrong > Depends on what the previous operation was. After a repair, the counter > will indicate the number of errors fixed, not the number remaining. > Theoretically, after a repair there will be no errors remaining, so any > value (> 0) in the counter after a check would indicate an issue to be > reported. > Bottom line: I just want to know if an md check (using "echo check > sync_action") encountered any inconsistencies. If so, in my setup that would probably mean there is something wrong (bits flipping somewhere between md, the bus, the NIC, the network, the NIC of a storage server, etc.) I just don't want to be surprised by any major filesystem corruptions anymore! Cheers, Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 15:16 ` Bas van Schaik @ 2008-03-20 16:04 ` Robin Hill 0 siblings, 0 replies; 44+ messages in thread From: Robin Hill @ 2008-03-20 16:04 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 1956 bytes --] On Thu Mar 20, 2008 at 04:16:08PM +0100, Bas van Schaik wrote: > Maybe I understand something wrong then. In an ideal situation, the > following should hold: > - for RAID5: all data blocks should XOR to the parity block > - for RAID1: all bits should be identical > > If the redundancy check encounters an anomaly, something should be fixed. > If something should be fixed, clearly something went wrong somewhere in > the past. Or can you give an example where the statements mentioned > above don't hold and nothing is wrong? > My understanding is that, for RAID1 at least (and possibly for any other mirrored setup), the data is not written to both disks simultaneously; therefore there's a chance for the data to be modified (in memory) between writes (or for the check to read the disks between writes). This is usually only a temporary situation (i.e. the block is due to be rewritten anyway) but does show up occasionally in checks, particularly with swap partitions. > Bottom line: I just want to know if an md check (using "echo check > > sync_action") encountered any inconsistencies. If so, in my setup that > would probably mean there is something wrong (bits flipping somewhere > between md, the bus, the NIC, the network, the NIC of a storage server, > etc.) > > I just don't want to be surprised by any major filesystem corruptions > anymore! > For this, a simple check for a non-zero value in the /sys/block/md{X}/md/mismatch_cnt entry will indicate an issue. Note that the repair stage only rewrites the parity - there's no way to know whether the actual error was in the data or parity though, so there may still be corruption after running a repair. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 14:19 ` Bas van Schaik 2008-03-20 14:45 ` Robin Hill @ 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill ` (2 more replies) 1 sibling, 3 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-20 16:35 UTC (permalink / raw) To: Bas van Schaik; +Cc: linux-raid On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > > There's no explicit message produced by the md module, no. You need to > > check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > > mismatches there are. Similarly, following a repair this will indicate > > how many mismatches it thinks have been fixed (by updating the parity > > block to match the data blocks). > > > Marvellous! I naively assumed that the module would warn me, but that's > not true. Wouldn't it be appropriate to print a message to dmesg if such > a mismatch occurs during a check? Such a mismatch clearly means that > there is something wrong with your hardware lying beneath md, doesn't it? If a mismatch is detected in a RAID-6 configuration, it should be possible to figure out what should be fixed (since with two parity blocks there should be enough redundancy not only to detect an error, but to correct it.) Out of curiosity, does md do this automatically, either when reading from a stripe, or during a resync operation? - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 16:35 ` Theodore Tso @ 2008-03-20 17:10 ` Robin Hill 0 siblings, 0 replies; 44+ messages in thread From: Robin Hill @ 2008-03-20 17:10 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 1272 bytes --] On Thu Mar 20, 2008 at 12:35:51PM -0400, Theodore Tso wrote: > If a mismatch is detected in a RAID-6 configuration, it should be > possible to figure out what should be fixed (since with two parity blocks > there should be enough redundancy not only to detect an error, but to > correct it.) Out of curiosity, does md do this automatically, either > when reading from a stripe, or during a resync operation? > I'm not sure about during a read (though my understanding is that the parity is entirely ignored here, so no checking is done in any RAID configuration). As for during resync, not as of last time this came up, no (I've not looked at the code but Neil was certainly opposed to trying to do this then). The problem is that you can only (safely) do a repair if you know _for a fact_ that only a single block is corrupt. Otherwise there's a reasonable chance of further corrupting the data by "repairing" good blocks to match the bad. The current md code just recalculates both parity blocks. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill @ 2008-03-20 17:39 ` Andre Noll 2008-03-20 18:02 ` Theodore Tso 2008-03-20 23:08 ` Peter Rabbitson 2 siblings, 1 reply; 44+ messages in thread From: Andre Noll @ 2008-03-20 17:39 UTC (permalink / raw) To: Theodore Tso; +Cc: Bas van Schaik, linux-raid [-- Attachment #1: Type: text/plain, Size: 580 bytes --] On 12:35, Theodore Tso wrote: > If a mismatch is detected in a RAID-6 configuration, it should be > possible to figure out what should be fixed It can be figured out under the assumption that exactly one drive has bad data and all other ones have good data. But that seems to be an assumption that is hard to verify in reality. > Out of curiosity, does md do this automatically, either > when reading from a stripe, or during a resync operation? Nope, md does no such thing. Andre -- The only person who always got his work done by Friday was Robinson Crusoe [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 17:39 ` Andre Noll @ 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-20 18:02 UTC (permalink / raw) To: Andre Noll; +Cc: Bas van Schaik, linux-raid On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: > On 12:35, Theodore Tso wrote: > > > If a mismatch is detected in a RAID-6 configuration, it should be > > possible to figure out what should be fixed > > It can be figured out under the assumption that exactly one drive has > bad data and all other ones have good data. But that seems to be an > assumption that is hard to verify in reality. True, but it's what ECC memory does. :-) And most people agree that it's a useful thing to do with memory. If you do ECC syndrome checking on every read, and follow that up with periodic scrubbing so that you catch (and correct) errors quickly, it is a reasonable assumption to make. Obviously a warning should be given when you do this kind of ECC fixup, and if there is an increasing number of ECC fixups that are being done, that should set off alarms that maybe there is a hardware problem that needs to be addressed. Regards, - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 18:02 ` Theodore Tso @ 2008-03-20 18:57 ` Andre Noll 2008-03-21 14:02 ` Ric Wheeler 2008-03-21 20:19 ` NeilBrown 2 siblings, 0 replies; 44+ messages in thread From: Andre Noll @ 2008-03-20 18:57 UTC (permalink / raw) To: Theodore Tso; +Cc: Bas van Schaik, linux-raid [-- Attachment #1: Type: text/plain, Size: 1625 bytes --] On 14:02, Theodore Tso wrote: > On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: > > On 12:35, Theodore Tso wrote: > > > > > If a mismatch is detected in a RAID-6 configuration, it should be > > > possible to figure out what should be fixed > > > > It can be figured out under the assumption that exactly one drive has > > bad data and all other ones have good data. But that seems to be an > > assumption that is hard to verify in reality. > > True, but it's what ECC memory does. :-) And most people agree that > it's a useful thing to do with memory. > > If you do ECC syndrome checking on every read, and follow that up with > periodic scrubbing so that you catch (and correct) errors quickly, it > is a reasonable assumption to make. > > Obviously a warning should be given when you do this kind of ECC > fixup, and if there is an increasing number of ECC fixups that are > being done, that should set off alarms that maybe there is a hardware > problem that needs to be addressed. I agree, but not everybody likes the idea of doing this kind of error correction also for hard disks in raid6 [1]. In case of a hard power failure it may well happen that any given subset of the disks in the array is up to date and all others are not. So in practice the situation for hard disks is different from that of memory modules. OTOH, it's probably the best thing one can do, so I'd vote for implementing this feature. Andre [1] http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09863.html -- The only person who always got his work done by Friday was Robinson Crusoe [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll @ 2008-03-21 14:02 ` Ric Wheeler 2008-03-21 20:19 ` NeilBrown 2 siblings, 0 replies; 44+ messages in thread From: Ric Wheeler @ 2008-03-21 14:02 UTC (permalink / raw) To: Theodore Tso; +Cc: Andre Noll, Bas van Schaik, linux-raid, Martin K. Petersen Theodore Tso wrote: > On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: >> On 12:35, Theodore Tso wrote: >> >>> If a mismatch is detected in a RAID-6 configuration, it should be >>> possible to figure out what should be fixed >> It can be figured out under the assumption that exactly one drive has >> bad data and all other ones have good data. But that seems to be an >> assumption that is hard to verify in reality. > > True, but it's what ECC memory does. :-) And most people agree that > it's a useful thing to do with memory. > > If you do ECC syndrome checking on every read, and follow that up with > periodic scrubbing so that you catch (and correct) errors quickly, it > is a reasonable assumption to make. > > Obviously a warning should be given when you do this kind of ECC > fixups, and if there is an increasing number of ECC fixups that are > being done, that should set off alarms that maybe there is a hardware > problem that needs to be addressed. > > Regards, > > - Ted This might have been stated before in the thread, but most of the raid rebuilds are triggered by easily identified drive failures (i.e., a completely dead drive or a sequence of bad sectors that generate an IO error as we read from the platter). Fortunately, these are also the most common failures in RAID boxes ;-) The way you deal with the class of errors that don't trigger obvious failures is to do some kind of background scrubbing or add extra protection data to the disk. Martin Petersen presented the new "DIF" work at the FS/IO workshop. This might be an interesting feature to build into MD raid devices: http://oss.oracle.com/projects/data-integrity/documentation/ You would need to reformat your drives, so this is not a generic solution for all users, but it really does address the core of the issue. ric ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll 2008-03-21 14:02 ` Ric Wheeler @ 2008-03-21 20:19 ` NeilBrown 2008-03-21 20:45 ` Ric Wheeler 2008-03-22 17:13 ` Bill Davidsen 2 siblings, 2 replies; 44+ messages in thread From: NeilBrown @ 2008-03-21 20:19 UTC (permalink / raw) To: Theodore Tso; +Cc: Andre Noll, Bas van Schaik, linux-raid On Fri, March 21, 2008 5:02 am, Theodore Tso wrote: > On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: >> On 12:35, Theodore Tso wrote: >> >> > If a mismatch is detected in a RAID-6 configuration, it should be >> > possible to figure out what should be fixed >> >> It can be figured out under the assumption that exactly one drive has >> bad data and all other ones have good data. But that seems to be an >> assumption that is hard to verify in reality. > > True, but it's what ECC memory does. :-) And most people agree that > it's a useful thing to do with memory. > > If you do ECC syndrome checking on every read, and follow that up with > periodic scrubbing so that you catch (and correct) errors quickly, it > is a reasonable assumption to make. My problem with this is that I don't have a good model for what might cause the error, so I cannot reason about what responses are justifiable. The analogy with ECC memory is, I think, poor. With ECC memory there are electro/physical processes which can cause a bit to change independently of any other bit with very low probability, so treating an ECC error as a single bit error is reasonable. The analogy with a disk drive would be a media error. However disk drives record CRC (or similar) checks so that media errors get reported as errors, not as incorrect data. So the analogy doesn't hold. Where else could the error come from? Presumably a bit-flip on some transfer bus between main memory and the media. There are several of these busses (mem to controller, controller to device, internal to device). The corruption could happen on the write or on the read. When you write to a RAID6 you often write several blocks to different devices at the same time. Are these really likely to be independent events wrt whatever is causing the corruption? I don't know. But without a clear model, it isn't clear to me that any particular action will be certain to improve the situation in all cases. And how often does silent corruption happen on modern hard drives? How often do you write something and later successfully read something else when it isn't due to a major hardware problem that is causing much more than just occasional errors? The ZFS people seem to say that their checksumming of all data shows up a lot of these cases. If that is true, how come people who don't use ZFS aren't reporting lots of data corruption? So yes: there are lots of things that *could* be done. But without a model for the "threat", an analysis of how the remedy would actually affect every different possible scenario, and some idea of the probability of the remedy being needed, it is very hard to justify a change of this sort. And there are plenty of other things to be coded that are genuinely useful - like converting a RAID5 to a RAID6 while online... NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 20:19 ` NeilBrown @ 2008-03-21 20:45 ` Ric Wheeler 0 siblings, 0 replies; 44+ messages in thread From: Ric Wheeler @ 2008-03-21 20:45 UTC (permalink / raw) To: NeilBrown; +Cc: Theodore Tso, Andre Noll, Bas van Schaik, linux-raid NeilBrown wrote: > On Fri, March 21, 2008 5:02 am, Theodore Tso wrote: >> On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: >>> On 12:35, Theodore Tso wrote: >>> >>>> If a mismatch is detected in a RAID-6 configuration, it should be >>>> possible to figure out what should be fixed >>> It can be figured out under the assumption that exactly one drive has >>> bad data and all other ones have good data. But that seems to be an >>> assumption that is hard to verify in reality. >> True, but it's what ECC memory does. :-) And most people agree that >> it's a useful thing to do with memory. >> >> If you do ECC syndrome checking on every read, and follow that up with >> periodic scrubbing so that you catch (and correct) errors quickly, it >> is a reasonable assumption to make. > > My problem with this is that I don't have a good model for what might > cause the error, so I cannot reason about what responses are justifiable. > > The analogy with ECC memory is, I think, poor. With ECC memory there are > electro/physical processes which can cause a bit to change independently > of any other bit with very low probability, so treating an ECC error as > a single bit error is reasonable. > > The analogy with a disk drive would be a media error. However disk drives > record CRC (or similar) checks so that media errors get reported as errors, > not as incorrect data. So the analogy doesn't hold. The challenge is only when you don't get an error on the IO. If you have bad hardware somewhere off platter, you can get silent corruption. In this case, if you look at Martin's presentation on DIF, we could do something that a check could leverage on a per sector basis for software raid. > > Where else could the error come from? Presumably a bit-flip on some > transfer bus between main memory and the media. There are several > of these busses (mem to controller, controller to device, internal to > device). The corruption could happen on the write or on the read. > When you write to a RAID6 you often write several blocks to different > devices at the same time. Are these really likely to be independent > events wrt whatever is causing the corruption? > > I don't know. But without a clear model, it isn't clear to me that > any particular action will be certain to improve the situation in > all cases. It can come from a lot of things (see the recent papers from FAST and NetApp for example). > > And how often does silent corruption happen on modern hard drives? > How often do you write something and later successfully read something > else when it isn't due to a major hardware problem that is causing > much more than just occasional errors? > > The ZFS people seem to say that their checksumming of all data shows > up a lot of these cases. If that is true, how come people who > don't use ZFS aren't reporting lots of data corruption? > > So yes: there are lots of things that *could* be done. But without > a model for the "threat", an analysis of how the remedy would actually > affect every different possible scenario, and some idea of the > probability of the remedy being needed, it is very hard to > justify a change of this sort.
> And there are plenty of other things to be coded that are genuinely > useful - like converting a RAID5 to a RAID6 while online... > > NeilBrown I really think that we might be able to leverage the DIF standard if and when it rolls out. ric ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 20:19 ` NeilBrown 2008-03-21 20:45 ` Ric Wheeler @ 2008-03-22 17:13 ` Bill Davidsen 1 sibling, 0 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-22 17:13 UTC (permalink / raw) To: NeilBrown; +Cc: Theodore Tso, Andre Noll, Bas van Schaik, linux-raid NeilBrown wrote: > My problem with this is that I don't have a good model for what might > cause the error, so I cannot reason about what responses are justifiable. > > The analogy with ECC memory is, I think, poor. With ECC memory there are > electro/physical processes which can cause a bit to change independently > of any other bit with very low probability, so treating an ECC error as > a single bit error is reasonable. > > The analogy with a disk drive would be a media error. However disk drives > record CRC (or similar) checks so that media errors get reported as errors, > not as incorrect data. So the analogy doesn't hold. > > Where else could the error come from? Presumably a bit-flip on some > transfer bus between main memory and the media. There are several > of these busses (mem to controller, controller to device, internal to > device). The corruption could happen on the write or on the read. > When you write to a RAID6 you often write several blocks to different > devices at the same time. Are these really likely to be independent > events wrt whatever is causing the corruption? > Based on what I have read and seen, some of these errors come in pairs and are caused by a drive just writing to the wrong sector. This can come from errors in the O/S (unlikely), disk hardware (unlikely), or disk firmware (least unlikely). So you get the data written to the wrong place (which makes that stripe invalid) and parity change or mirror copies written to the right place(s). Thus, two bad stripes to be detected on "check," neither of which will return a hardware error on a read. > I don't know. But without a clear model, it isn't clear to me that > any particular action will be certain to improve the situation in > all cases. > > Agreed, the only cases I've identified where improvement is possible are raid1 with multiple copies, and raid6. Doing the recovery I outlined the other day will not make things better in all cases, but will never make things worse (statistically) and should recover both failures if the cause is "single misplaced write." > And how often does silent corruption happen on modern hard drives? > How often do you write something and later successfully read something > else when it isn't due to a major hardware problem that is causing > much more than just occasional errors? > > Very seldom; all my critical data is checked by software CRC, and these failures just don't happen. But I have owned drives in the past which had firmware revisions which had error rates as high as 2/10TB, which went away on the same drives after firmware updates. So while it is rare, it can and does happen occasionally. > So yes: there are lots of things that *could* be done. But without > a model for the "threat", an analysis of how the remedy would actually > affect every different possible scenario, and some idea of the > probability of the remedy being needed, it is very hard to > justify a change of this sort. > I hope I have provided a plausible model for one error source.
If I have identified the model correctly, errors will always happen in pairs, in normal operation rather than during some unclean system shutdown due to O/S crash or power failure. > And there are plenty of other things to be coded that are genuinely > useful - like converting a RAID5 to a RAID6 while online... > I would suggest that upgrading an array to larger drives is more common; having a fully automated upgrade path would be useful to far more users. So if I have (for example) four 320GB drives and want to upgrade to 500GB drives, I attach a 500GB drive and say something like "on /dev/md2 migrate /dev/sda1 to /dev/sde1" and have it done in such a way that it will fail safe and at the end sda1 will be out of the array and sde1 will be in, without having to enter multiple commands per drive to get this done. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck ^ permalink raw reply [flat|nested] 44+ messages in thread
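The closest existing path with plain mdadm looks roughly like the sketch below. The device names are Bill's examples; note that, unlike the single command he proposes, failing the old drive drops redundancy for the duration of the rebuild, so this is not the fail-safe path he is asking for:

    # Sketch: migrate /dev/md2 from sda1 to a larger sde1 with mdadm.
    mdadm /dev/md2 --add /dev/sde1     # new drive joins as a spare
    mdadm /dev/md2 --fail /dev/sda1    # old drive is kicked; rebuild onto sde1 starts
    # Wait until the rebuild has finished before pulling the old drive.
    while grep -q recovery /proc/mdstat; do sleep 60; done
    mdadm /dev/md2 --remove /dev/sda1
    # Once every member has been upgraded, the array can use the extra space:
    mdadm --grow /dev/md2 --size=max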
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill 2008-03-20 17:39 ` Andre Noll @ 2008-03-20 23:08 ` Peter Rabbitson 2008-03-21 14:24 ` Bill Davidsen 2008-03-25 4:24 ` Neil Brown 2 siblings, 2 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-20 23:08 UTC (permalink / raw) To: Theodore Tso; +Cc: Bas van Schaik, linux-raid Theodore Tso wrote: > On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>> There's no explicit message produced by the md module, no. You need to >>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>> mismatches there are. Similarly, following a repair this will indicate >>> how many mismatches it thinks have been fixed (by updating the parity >>> block to match the data blocks). >>> >> Marvellous! I naively assumed that the module would warn me, but that's >> not true. Wouldn't it be appropriate to print a message to dmesg if such >> a mismatch occurs during a check? Such a mismatch clearly means that >> there is something wrong with your hardware lying beneath md, doesn't it? > > If a mismatch is detected in a RAID-6 configuration, it should be > possible to figure out what should be fixed (since with two parity blocks > there should be enough redundancy not only to detect an error, but to > correct it.) Out of curiosity, does md do this automatically, either > when reading from a stripe, or during a resync operation? > In my modest experience with root/high performance spool on various raid levels I can pretty much conclude that the current check mechanism doesn't do enough to give power to the user. We can debate all we want about what the MD driver should do when it finds a mismatch, yet there is no way for the user to figure out what the mismatch is and take appropriate action. This does not apply only to RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the only wrong value is taken and written all over the other good blocks? I think that the solution is rather simple, and I would contribute a patch if I had any C experience. The current check mechanism remains the same - mismatch_cnt is incremented/reset just the same as before. However on every mismatching chunk the system printks the following: 1) the start offset of the chunk (md1/10) or stripe (md5/6) within the MD device 2) one line for every active disk containing: a) the offset of the chunk within the MD component b) a {md5|sha1}sum of the chunk In a common case array this will take no more than 8 lines in dmesg. However it will allow: 1) For a human to determine at a glance which disk holds a mismatching chunk in raid 1/10 2) Determine the same for raid 6 using a userspace tool which will calculate the parity for every possible permutation of chunks 3) using some external tools to determine which file might have been affected on the layered file system Now of course the problem remains how to repair the array using the information obtained above. I think the best way would be to extend the syntax of repair itself, so that: echo repair > .../sync_action would use the old heuristics echo repair <mdoffset> <component N> > .../sync_action will update the chunk on drive N which corresponds to the chunk/stripe at mdoffset within the MD device, using the information from the other drives, and not the other way around as might happen with just a repair. Just my 2c Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
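Until something like that exists in the kernel, the RAID1 part of the idea can be approximated from userspace. A rough sketch, assuming the offset of a suspect chunk is already known; the device names and the offset are made up, and the md superblock's data offset (if any for the format in use) has to be accounted for:

    # Hash the same 64KiB chunk on each leg of a RAID1 pair to see
    # which copy disagrees. skip= counts 64k blocks from the start of
    # the member device and must include the md data offset, if any.
    for dev in /dev/sda1 /dev/sdb1; do
        printf '%s: ' "$dev"
        dd if="$dev" bs=64k skip=12345 count=1 2>/dev/null | md5sum
    done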
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 23:08 ` Peter Rabbitson @ 2008-03-21 14:24 ` Bill Davidsen 2008-03-21 14:52 ` Peter Rabbitson 0 siblings, 1 reply; 44+ messages in thread From: Bill Davidsen @ 2008-03-21 14:24 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Theodore Tso, Bas van Schaik, linux-raid Peter Rabbitson wrote: > Theodore Tso wrote: >> On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>>> There's no explicit message produced by the md module, no. You >>>> need to >>>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>>> mismatches there are. Similarly, following a repair this will >>>> indicate >>>> how many mismatches it thinks have been fixed (by updating the parity >>>> block to match the data blocks). >>>> >>> Marvellous! I naively assumed that the module would warn me, but that's >>> not true. Wouldn't it be appropriate to print a message to dmesg if >>> such >>> a mismatch occurs during a check? Such a mismatch clearly means that >>> there is something wrong with your hardware lying beneath md, >>> doesn't it? >> >> If a mismatch is detected in a RAID-6 configuration, it should be >> possible to figure out what should be fixed (since with two parity blocks >> there should be enough redundancy not only to detect an error, but to >> correct it.) Out of curiosity, does md do this automatically, either >> when reading from a stripe, or during a resync operation? >> > > In my modest experience with root/high performance spool on various > raid levels I can pretty much conclude that the current check > mechanism doesn't do enough to give power to the user. We can debate > all we want about what the MD driver should do when it finds a > mismatch, yet there is no way for the user to figure out what the > mismatch is and take appropriate action. This does not apply only to > RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the only > wrong value is taken and written all over the other good blocks? > > I think that the solution is rather simple, and I would contribute a > patch if I had any C experience. The current check mechanism remains > the same - mismatch_cnt is incremented/reset just the same as before. > However on every mismatching chunk the system printks the following: > > 1) the start offset of the chunk (md1/10) or stripe (md5/6) within the > MD device > 2) one line for every active disk containing: > a) the offset of the chunk within the MD component > b) a {md5|sha1}sum of the chunk > > In a common case array this will take no more than 8 lines in dmesg. > However it will allow: > > 1) For a human to determine at a glance which disk holds a mismatching > chunk in raid 1/10 > 2) Determine the same for raid 6 using a userspace tool which will > calculate the parity for every possible permutation of chunks > 3) using some external tools to determine which file might have been > affected on the layered file system > > > Now of course the problem remains how to repair the array using the > information obtained above. I think the best way would be to extend > the syntax of repair itself, so that: > > echo repair > .../sync_action would use the old heuristics > > echo repair <mdoffset> <component N> > .../sync_action will update the > chunk on drive N which corresponds to the chunk/stripe at mdoffset > within the MD device, using the information from the other drives, and > not the other way around as might happen with just a repair.
I totally agree, not doing the most likely to be correct thing seems to be the one argument for hardware raid. There are two cases in which software can determine (a) if it is likely that there is a single bad block, and (b) what the correct value for that block is. raid1 - more than one copy If there are multiple copies of the data, and N-1 agree, then it is more likely that the mismatched copy is the bad one, and should be rewritten with the data in the other copies. This is never less likely to be correct than selecting one copy at random and writing it over all others, so it can only be a help. raid6 - assume and check Given an error in raid6, if parity A appears correct and parity B does not, assume that the non-matching parity is bad and regenerate. If neither parity appears correct, for each data block assume it is bad and recalculate a recovery value using the A and B parities. If the data pattern generated is the same for recovery using either parity, assume that the data is bad and rewrite. Again, this is more likely to be correct than assuming that both parities are wrong. Obviously if no "most likely" bad data or parity information can be identified then recalculating both parity blocks is the only way to "fix" the array, but it leaves undetectable bad data. I would like an option to do repairs using these two methods, which would give a high probability that whatever "fixes" were applied were actually recovering the correct data. Yes, I know that errors like this are less common than pure hardware errors; does that justify something less than best practice during recovery? -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck ^ permalink raw reply [flat|nested] 44+ messages in thread
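For the raid6 "assume and check" case, the underlying algebra can be stated precisely. This follows H. Peter Anvin's "The mathematics of RAID-6" paper and sketches the standard single-error location; it is not a description of what md currently does:

    % RAID-6 syndromes over GF(2^8), data blocks D_0 .. D_{n-1}, generator g:
    \[ P = \bigoplus_{i=0}^{n-1} D_i, \qquad Q = \bigoplus_{i=0}^{n-1} g^i D_i \]
    % If exactly one data block D_z was corrupted to D'_z, recomputing the
    % syndromes from the observed data as P' and Q' gives:
    \[ P \oplus P' = D_z \oplus D'_z, \qquad Q \oplus Q' = g^z (D_z \oplus D'_z) \]
    % so the index of the bad device falls out of the ratio:
    \[ z = \log_g \left( (Q \oplus Q') / (P \oplus P') \right) \]
    % If only P' (or only Q') mismatches, the corresponding parity block
    % itself is the one that is bad.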
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 14:24 ` Bill Davidsen @ 2008-03-21 14:52 ` Peter Rabbitson 2008-03-21 17:13 ` Theodore Tso 2008-03-21 23:01 ` Bill Davidsen 0 siblings, 2 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-21 14:52 UTC (permalink / raw) To: Bill Davidsen; +Cc: Theodore Tso, Bas van Schaik, linux-raid Bill Davidsen wrote: > Peter Rabbitson wrote: >> Theodore Tso wrote: >>> On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>>>> There's no explicit message produced by the md module, no. You >>>>> need to >>>>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>>>> mismatches there are. Similarly, following a repair this will >>>>> indicate >>>>> how many mismatches it thinks have been fixed (by updating the parity >>>>> block to match the data blocks). >>>>> >>>> Marvellous! I naively assumed that the module would warn me, but that's >>>> not true. Wouldn't it be appropriate to print a message to dmesg if >>>> such >>>> a mismatch occurs during a check? Such a mismatch clearly means that >>>> there is something wrong with your hardware lying beneath md, >>>> doesn't it? >>> >>> If a mismatch is detected in a RAID-6 configuration, it should be >>> possible to figure out what should be fixed (since with two parity blocks >>> there should be enough redundancy not only to detect an error, but to >>> correct it.) Out of curiosity, does md do this automatically, either >>> when reading from a stripe, or during a resync operation? >>> >> >> In my modest experience with root/high performance spool on various >> raid levels I can pretty much conclude that the current check >> mechanism doesn't do enough to give power to the user. We can debate >> all we want about what the MD driver should do when it finds a >> mismatch, yet there is no way for the user to figure out what the >> mismatch is and take appropriate action. This does not apply only to >> RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the only >> wrong value is taken and written all over the other good blocks? >> >> I think that the solution is rather simple, and I would contribute a >> patch if I had any C experience. The current check mechanism remains >> the same - mismatch_cnt is incremented/reset just the same as before. >> However on every mismatching chunk the system printks the following: >> >> 1) the start offset of the chunk (md1/10) or stripe (md5/6) within the >> MD device >> 2) one line for every active disk containing: >> a) the offset of the chunk within the MD component >> b) a {md5|sha1}sum of the chunk >> >> In a common case array this will take no more than 8 lines in dmesg. >> However it will allow: >> >> 1) For a human to determine at a glance which disk holds a mismatching >> chunk in raid 1/10 >> 2) Determine the same for raid 6 using a userspace tool which will >> calculate the parity for every possible permutation of chunks >> 3) using some external tools to determine which file might have been >> affected on the layered file system >> >> >> Now of course the problem remains how to repair the array using the >> information obtained above.
>> I think the best way would be to extend >> the syntax of repair itself, so that: >> >> echo repair > .../sync_action would use the old heuristics >> >> echo repair <mdoffset> <component N> > .../sync_action will update the >> chunk on drive N which corresponds to the chunk/stripe at mdoffset >> within the MD device, using the information from the other drives, and >> not the other way around as might happen with just a repair. > > I totally agree, not doing the most likely to be correct thing seems to > be the one argument for hardware raid. There are two cases in which > software can determine (a) if it is likely that there is a single bad > block, and (b) what the correct value for that block is. > > <snip> > I was actually specifically advocating that md must _not_ do anything on its own. Just provide the hooks to get information (what is the current stripe state) and update information (the described repair extension). The logic that you are describing can live only in an external app, it has no place in-kernel. Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 14:52 ` Peter Rabbitson @ 2008-03-21 17:13 ` Theodore Tso 2008-03-21 17:35 ` Peter Rabbitson 2008-03-21 17:43 ` Robin Hill 2008-03-21 23:01 ` Bill Davidsen 1 sibling, 2 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-21 17:13 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Bill Davidsen, Bas van Schaik, linux-raid On Fri, Mar 21, 2008 at 03:52:31PM +0100, Peter Rabbitson wrote: > I was actually specifically advocating that md must _not_ do anything on > its own. Just provide the hooks to get information (what is the current > stripe state) and update information (the described repair extension). The > logic that you are describing can live only in an external app, it has no > place in-kernel. Why not? If md doesn't do anything on its own, then when it detects a disagreement between the data and the two parity blocks, it has two choices (a) return possibly incorrect data to the application, or (b) return an I/O error and cause the application to blow up. Sure, it could then give the information so that the external repair tool can fix it up after the fact, but that seems like a really lousy thing to do as far as the original application is concerned. (Or I suppose you could try to block the userspace application until the repair tool has a chance to do automatically what md could have done automatically in the kernel anyway, but that has other problems.) So what's the harm in having an option where md does exactly what ECC memory does, which is when it can fix things up, to do so? I bet most system administrators would turn it on in a heartbeat. - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 17:13 ` Theodore Tso @ 2008-03-21 17:35 ` Peter Rabbitson 2008-03-22 13:27 ` Theodore Tso 2008-03-21 17:43 ` Robin Hill 1 sibling, 1 reply; 44+ messages in thread From: Peter Rabbitson @ 2008-03-21 17:35 UTC (permalink / raw) To: Theodore Tso; +Cc: Bill Davidsen, Bas van Schaik, linux-raid Theodore Tso wrote: > On Fri, Mar 21, 2008 at 03:52:31PM +0100, Peter Rabbitson wrote: >> I was actually specifically advocating that md must _not_ do anything on >> its own. Just provide the hooks to get information (what is the current >> stripe state) and update information (the described repair extension). The >> logic that you are describing can live only in an external app, it has no >> place in-kernel. > > Why not? If md doesn't do anything on its own, then when it detects a > disagreement between the data and the two parity blocks, it has two > choices (a) return possibly incorrect data to the application, or (b) > return an I/O error and cause the application to blow up. > > <snip> > > So what's the harm in having an option where md does exactly what ECC > memory does, which is when it can fix things up, to do so? I bet most > system administrators would turn it on in a heartbeat. > With ECC memory you are checking for inconsistency on _every_single_read_ whereas the md scrubbing happens at best once a month if the admin turned the feature on. Moreover when md actually detects a mismatch the overwhelming chance is nobody needs this block at this moment, and might not need it for days to come. I think what is eluding this thread is the fact that md does not read _any_ redundant blocks unless it absolutely has to. And when it has to - you already have a missing chunk and can not apply ECC techniques either. Of course it would be possible to instruct md to always read all data+parity chunks and make a comparison on every read. The performance would not be much to write home about though. Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 17:35 ` Peter Rabbitson @ 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-22 13:27 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Bill Davidsen, Bas van Schaik, linux-raid On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: > > Of course it would be possible to instruct md to always read all > data+parity chunks and make a comparison on every read. The performance > would not be much to write home about though. Yeah, and that's probably the real problem with this scheme. You basically reduce the read bandwidth of your array down to a single (slowest) disk --- basically the same reason why RAID-2 is a commercial failure. I suspect the best thing we *can* do is, for filesystems that include checksums in the metadata and/or the data blocks, if the CRC doesn't match, to have the filesystem tell the RAID subsystem, "um, could you send me copies of the data from all of the RAID-1 mirrors, and see if one of the copies from the mirrors causes a valid checksum". Something similar could be done with RAID-5/RAID-6 arrays, if the fs layer could ask the RAID subsystem, "the external checksum for this block is bad; can you recalculate it from all available parity stripes assuming the data stripe is invalid". Ext4 has metadata checksums; U Wisconsin's Iron filesystem (sponsored with a grant from EMC) did it for both data and metadata, if memory serves me correctly. ZFS smashed through the RAID abstraction barrier and sucked up RAID functionality into the filesystem so they could do this sort of thing; but with the right new set of interfaces, it should be possible to add this functionality without reimplementing RAID in each filesystem. As far as the question of how often this happens, where a disk silently corrupts a block without returning a media error, it definitely happens. Larry McVoy tells a story of periodically running a per-file CRC across backup/archival filesystems, and was able to detect files that had not been modified changing out from under him. One way this can happen is if the disk accidentally writes some block to the wrong location on disk; the blockguard extension and various enterprise databases (since they can control their db-specific on-disk format) will encode the intended location of a block in their per-block checksums, to detect this specific type of failure, which should be a broad hint that this sort of thing can and does happen. Does it happen as much as ZFS's marketing literature implies? Probably not. But as you start making bigger and bigger filesystems, the chances that even relatively improbable errors happen start increasing significantly. Of course, the flip side of the argument is that if you are using the huge arrays to store things like music and video, maybe you don't care about a small amount of data corruption, since it might not be noticeable to the human eye/ear. That's a pretty weak argument though, and it sends shivers up the spines of people who are storing, for example, medical images of X-ray or CAT scans. - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 13:27 ` Theodore Tso @ 2008-03-22 14:00 ` Bas van Schaik 2008-03-25 4:44 ` Neil Brown 2008-03-25 9:19 ` Mattias Wadenstein 2 siblings, 0 replies; 44+ messages in thread From: Bas van Schaik @ 2008-03-22 14:00 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-raid Theodore Tso wrote: > On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: > >> Of course it would be possible to instruct md to always read all >> data+parity chunks and make a comparison on every read. The performance >> would not be much to write home about though. >> > > (...) > > Does it happen as much as ZFS's marketing literature implies? > Probably not. But as you start making bigger and bigger filesystems, > the chances that even relatively improbable errors happen start > increasing significantly. Of course, the flip side of the argument is > that if you are using the huge arrays to store things like music and > video, maybe you don't care about a small amount of data corruption, > since it might not be noticeable to the human eye/ear. That's a > pretty weak argument though, and it sends shivers up the spines of > people who are storing, for example, medical images of X-ray or CAT > scans. > I totally agree with you, Ted, although I think your idea of a filesystem communicating with RAID in a sophisticated way kind of conflicts with the "layered approach" which is chosen in the world of Linux. Should that be a reason not to implement this feature? I don't think so. Although most of you sketch scenarios in which it is very rare that corruptions occur, I think you should also take into account that storage is booming and growing like never before. This trend has caused people (like me) to use other media to transfer and store data, using the network for example. The assumption that data corruption is rare because the bus and the disk are very reliable doesn't hold anymore: other ways of communication are much more sensitive to corruption. Of course, protection against these types of corruption should be implemented in the appropriate layer (using checksums over packets, like TCP does), but I think it is a little bit naive to assume that this will succeed in all cases. On the other hand it would not make sense to read every block after writing it (to check its consistency), but it might be a nice feature to extend the monthly consistency check with advanced error reporting features. Users who don't care (storing music and video, using Ted's example) would disable this check; administrators like me (storing large amounts of medical data) could run this check every week or so. Regards, -- Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik @ 2008-03-25 4:44 ` Neil Brown 2008-03-25 15:17 ` Bill Davidsen 2008-03-25 9:19 ` Mattias Wadenstein 2 siblings, 1 reply; 44+ messages in thread From: Neil Brown @ 2008-03-25 4:44 UTC (permalink / raw) To: Theodore Tso; +Cc: Peter Rabbitson, Bill Davidsen, Bas van Schaik, linux-raid On Saturday March 22, tytso@MIT.EDU wrote: > On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: > > > > Of course it would be possible to instruct md to always read all > > data+parity chunks and make a comparison on every read. The performance > > would not be much to write home about though. > > Yeah, and that's probably the real problem with this scheme. You > basically reduce the read bandwidth of your array down to a single > (slowest) disk --- basically the same reason why RAID-2 is a > commercial failure. Exactly. > > I suspect the best thing we *can* do, for filesystems that > include checksums in the metadata and/or the data blocks, is if the > CRC doesn't match, to have the filesystem tell the RAID subsystem, > "um, could you send me copies of the data from all of the RAID-1 > mirrors, and see if one of the copies from the mirrors causes a valid > checksum". Something similar could be done with RAID-5/RAID-6 arrays, > if the fs layer could ask the RAID subsystem, "the external checksum > for this block is bad; can you recalculate it from all available > parity stripes assuming the data stripe is invalid". Something along these lines would be very appropriate I think. Particularly for raid1. For raid5/raid6 it is possible that a valid block in the same stripe was read and written before the faulty block was read. This would correct the parity so when the bad block was found, there would be no way to recover the correct data. Still, having the possibility of recovery might be better than not having it. > > As far as the question of how often this happens, where a disk > silently corrupts a block without returning a media error, it > definitely happens. Larry McVoy tells a story of periodically running > a per-file CRC across backup/archival filesystems, and was able to > detect files that had not been modified changing out from under him. > One way this can happen is if the disk accidentally writes some block > to the wrong location on disk; the blockguard extension and various > enterprise databases (since they can control their db-specific on-disk > format) will encode the intended location of a block in their > per-block checksums, to detect this specific type of failure, which > should be a broad hint that this sort of thing can and does happen. The "address data was corrupted" is certainly a credible possibility. I remember reading that SCSI has a parity check for data, but not for the command, which includes the storage address. With the raid6 algorithm, we can tell which device has an error (assuming only one device does) for each byte in the block. If this returns the same device for every byte in the block, it is probably reasonable to assume that exactly that block is bad. Still, if we only do that on the monthly 'check', it could be too late. I'm not sure that "surviving some data corruptions, if you are lucky" is really better than surviving none. We don't want to provide a false sense of security.... but maybe RAID already does that. A filesystem that always writes full stripes and never over-writes valid data. 
And that (optionally) stores checksums for everything is looking more and more appealing. The trouble is, I don't seem to have enough "spare time" :-) NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
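Neil's observation that raid6 can identify the erring device deserves a worked example. Below is a self-contained sketch of the standard GF(2^8) algebra (generator 2, polynomial 0x11d) behind the md raid6 code; the drive count and byte values are invented, and the kernel naturally uses optimised table lookups rather than these naive loops. The idea: a single corrupt data drive z shifts the P syndrome by the error value e and the Q syndrome by g^z * e, so the ratio of the two deltas reveals z.

def gf_mul(a, b):
    # Multiply in GF(2^8) modulo the raid6 polynomial x^8+x^4+x^3+x^2+1.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r

# log/antilog tables for the generator g = 2
EXP, LOG = [0] * 255, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = gf_mul(x, 2)

def syndromes(data):
    # P = xor of the data bytes, Q = sum over GF(2^8) of g^i * D_i
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q

def locate(data, p_stored, q_stored):
    p, q = syndromes(data)
    dp, dq = p ^ p_stored, q ^ q_stored
    if not dp and not dq:
        return None                     # consistent at this byte position
    if dp and dq:
        # Data drive z corrupt: dp = e, dq = g^z * e, so z = log(dq/dp).
        # A result >= len(data) means the single-error assumption fails.
        return (LOG[dq] - LOG[dp]) % 255
    return "P" if dq == 0 else "Q"      # only one syndrome off: that parity block is bad

data = [0x12, 0x34, 0x56, 0x78]         # one byte from each of 4 data drives
p, q = syndromes(data)
data[2] ^= 0x5A                         # silent corruption on drive 2
print(locate(data, p, q))               # -> 2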
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-25 4:44 ` Neil Brown @ 2008-03-25 15:17 ` Bill Davidsen 0 siblings, 0 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-25 15:17 UTC (permalink / raw) To: Neil Brown; +Cc: Theodore Tso, Peter Rabbitson, Bas van Schaik, linux-raid Neil Brown wrote: > On Saturday March 22, tytso@MIT.EDU wrote: > >> On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: >> >>> Of course it would be possible to instruct md to always read all >>> data+parity chunks and make a comparison on every read. The performance >>> would not be much to write home about though. >>> >> Yeah, and that's probably the real problem with this scheme. You >> basically reduce the read bandwidth of your array down to a single >> (slowest) disk --- basically the same reason why RAID-2 is a >> commercial failure. >> > > Exactly. > > In some cases that would be acceptable. Obviously in the general case it's not required. >> I suspect the best thing we *can* do, for filesystems that >> include checksums in the metadata and/or the data blocks, is if the >> CRC doesn't match, to have the filesystem tell the RAID subsystem, >> "um, could you send me copies of the data from all of the RAID-1 >> mirrors, and see if one of the copies from the mirrors causes a valid >> checksum". Something similar could be done with RAID-5/RAID-6 arrays, >> if the fs layer could ask the RAID subsystem, "the external checksum >> for this block is bad; can you recalculate it from all available >> parity stripes assuming the data stripe is invalid". >> > > Something along these lines would be very appropriate I think. > Particularly for raid1. > For raid5/raid6 it is possible that a valid block in the same stripe > was read and written before the faulty block was read. This would > correct the parity so when the bad block was found, there would be no > way to recover the correct data. > Still, having the possibility of recovery might be better than not > having it. > > >> As far as the question of how often this happens, where a disk >> silently corrupts a block without returning a media error, it >> definitely happens. Larry McVoy tells a story of periodically running >> a per-file CRC across backup/archival filesystems, and was able to >> detect files that had not been modified changing out from under him. >> One way this can happen is if the disk accidentally writes some block >> to the wrong location on disk; the blockguard extension and various >> enterprise databases (since they can control their db-specific on-disk >> format) will encode the intended location of a block in their >> per-block checksums, to detect this specific type of failure, which >> should be a broad hint that this sort of thing can and does happen. >> > > The "address data was corrupted" is certainly a credible possibility. > I remember reading that SCSI has a parity check for data, but not for > the command, which includes the storage address. > > With the raid6 algorithm, we can tell which device has an error > (assuming only one device does) for each byte in the block. > If this returns the same device for every byte in the block, it is > probably reasonable to assume that exactly that block is bad. > Still, if we only do that on the monthly 'check', it could be too > late. > > I think the old saying "better late than never" applies: once the user knows that there is a problem via 'check' and fixes it if possible, some form of recovery would then at least be possible. 
> I'm not sure that "surviving some data corruptions, if you are lucky" > is really better than surviving none. We don't want to provide a > false sense of security.... but maybe RAID already does that. > > A filesystem that always writes full stripes and never over-writes > valid data. And that (optionally) stores checksums for everything is > looking more and more appealing. The trouble is, I don't seem to have > enough "spare time" :-) > Frankly I think your limited time is better spent on raid; there are undoubtedly plenty of things on your "to do" list. I'd like to hope that raid5e is at least on that list, but I would be the first to say that performance improvements for raid5 would benefit more people. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik 2008-03-25 4:44 ` Neil Brown @ 2008-03-25 9:19 ` Mattias Wadenstein 2 siblings, 0 replies; 44+ messages in thread From: Mattias Wadenstein @ 2008-03-25 9:19 UTC (permalink / raw) To: Theodore Tso; +Cc: Peter Rabbitson, Bill Davidsen, Bas van Schaik, linux-raid On Sat, 22 Mar 2008, Theodore Tso wrote: > On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: >> >> Of course it would be possible to instruct md to always read all >> data+parity chunks and make a comparison on every read. The performance >> would not be much to write home about though. > > Yeah, and that's probably the real problem with this scheme. You > basically reduce the read bandwidth of your array down to a single > (slowest) disk --- basically the same reason why RAID-2 is a > commercial failure. I don't really see this as a problem. Most of my filesystems are not anywhere near their performance limit and reading a strip from the parity disks as well as all the strips from the data disks in a raid6 setup probably would be less than a 50% performance hit, so I would very much appreciate a "paranoid parity check on every read" flag to set on _some_ of my raids. /Mattias Wadenstein ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 17:13 ` Theodore Tso 2008-03-21 17:35 ` Peter Rabbitson @ 2008-03-21 17:43 ` Robin Hill 1 sibling, 0 replies; 44+ messages in thread From: Robin Hill @ 2008-03-21 17:43 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2763 bytes --] On Fri Mar 21, 2008 at 01:13:48PM -0400, Theodore Tso wrote: > On Fri, Mar 21, 2008 at 03:52:31PM +0100, Peter Rabbitson wrote: > > I was actually specifically advocating that md must _not_ do anything on > > its own. Just provide the hooks to get information (what is the current > > stripe state) and update information (the described repair extension). The > > logic that you are describing can live only in an external app, it has no > > place in-kernel. > > Why not? If md doesn't do anything on its own, then when it detects a > disagreement between the data and the two parity blocks, it has two > choices (a) return possibly incorrect data to the application, or (b) > return an I/O error and cause the application to blow up. > > Sure, it could then give the information so that the external repair > tool can fix it up after the fact, but that seems like a really lousy > thing to do as far as the original application is concerned. (Or I > suppose you could try to block the userspace application until the > repair tool has a chance to do automatically what md could have done > automatically in the kernel anyway, but that has other problems.) > > So what's the harm in having an option where md does exactly what ECC > memory does, which is when it can fix things up, to do so? I bet most > system administrators would turn it on in a heartbeat. > Depends on how you look at things. ECC memory is designed to deal with occasional mismatches caused by such obscure and rare events as cosmic radiation. RAID subsystems, on the other hand, are designed to deal with catastrophic failures of one (or more) drives. There's no trivially explainable reason why a drive would sporadically suffer from incorrect data reading/writing (unlike with ECC memory) so there's no recovery case. Admittedly, it would be possible to do this, but that would mean adding an extra read penalty on every RAID read (and, in some situations, throwing away the advantages of parallelism) in order to cover the exceptionally rare case where a drive has (for unknown reason) written the wrong data. Personally, this would be an option I'd avoid like the plague. If I know there's an issue then I replace the hardware; otherwise I expect the system to work as fast as possible on the assumption that all is correct. Admittedly, a check/repair option to view/select how the blocks are recovered might be useful, but I'd also see this sitting well outside the md code. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 14:52 ` Peter Rabbitson 2008-03-21 17:13 ` Theodore Tso @ 2008-03-21 23:01 ` Bill Davidsen 2008-03-21 23:45 ` Carlos Carvalho 2008-03-21 23:55 ` Robin Hill 1 sibling, 2 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-21 23:01 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Theodore Tso, Bas van Schaik, linux-raid Peter Rabbitson wrote: > I was actually specifically advocating that md must _not_ do anything > on its own. Just provide the hooks to get information (what is the > current stripe state) and update information (the described repair > extension). The logic that you are describing can live only in an > external app, it has no place in-kernel. So you advocate the current code being in the kernel, which absent a hardware error makes blind assumptions about which data is valid and which is not and in all cases hides the problem, instead of the code I proposed, which in some cases will be able to avoid action which is provably wrong and never be less likely to do the wrong thing than the current code? Currently the "repair" action (which *is* in the kernel now) takes no advantage of the additional information available in these cases I noted. By what logic do you conclude that the user meant "hide the error" when using the "repair" action? What I propose is never less likely to be correct than what the current code does, why would you not want to improve the chances of getting the repair correct? -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:01 ` Bill Davidsen @ 2008-03-21 23:45 ` Carlos Carvalho 2008-03-22 17:19 ` Bill Davidsen 0 siblings, 1 reply; 44+ messages in thread From: Carlos Carvalho @ 2008-03-21 23:45 UTC (permalink / raw) To: linux-raid Bill Davidsen (davidsen@tmr.com) wrote on 21 March 2008 19:01: >Peter Rabbitson wrote: >> I was actually specifically advocating that md must _not_ do anything ************************* >> on its own. Just provide the hooks to get information (what is the ********** >> current stripe state) and update information (the described repair >> extension). The logic that you are describing can live only in an >> external app, it has no place in-kernel. > >So you advocate the current code being in the kernel, which absent a >hardware error makes blind assumptions about which data is valid and >which is not and in all cases hides the problem, instead of the code I >proposed, which in some cases will be able to avoid action which is >provably wrong and never be less likely to do the wrong thing than the >current code? The current code doesn't do anything on its own; it must be invoked by the user, which is an important difference. I agree that blindly setting parity is not good; that's an argument for removing it from the kernel, not adding something :-) Why is it there? This is for Neil to answer; I merely conjecture that it was already there. For example, it's necessary after a raid5 array is created, because it's done by creating an n-1 degraded array and adding the last disk afterwards. It's also done when an array is dirty. This is a situation where it's done without asking the user but it seems to me that in this case that's the right action: if the parity doesn't agree with the data it's either because the parity was not yet updated at the moment of the unclean shutdown or because it was updated but not the data itself. In both cases the parity should reflect the current data situation. The /sys/..../sync_action is just an interface added much later to trigger the code. The check action is useful but I think repair is too risky. I doubt it should be available. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:45 ` Carlos Carvalho @ 2008-03-22 17:19 ` Bill Davidsen 0 siblings, 0 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-22 17:19 UTC (permalink / raw) To: Carlos Carvalho; +Cc: linux-raid Carlos Carvalho wrote: > Bill Davidsen (davidsen@tmr.com) wrote on 21 March 2008 19:01: > >Peter Rabbitson wrote: > >> I was actually specifically advocating that md must _not_ do anything > ************************* > >> on its own. Just provide the hooks to get information (what is the > ********** > >> current stripe state) and update information (the described repair > >> extension). The logic that you are describing can live only in an > >> external app, it has no place in-kernel. > > > >So you advocate the current code being in the kernel, which absent a > >hardware error makes blind assumptions about which data is valid and > >which is not and in all cases hides the problem, instead of the code I > >proposed, which in some cases will be able to avoid action which is > >provably wrong and never be less likely to do the wrong thing than the > >current code? > > The current code doesn't do anything on its own; it must be invoked by > the user, which is an important difference. > > Difference from what? Is issuing the 'repair' action on its own? How would adding code which lets that repair have a higher chance of success be bad? Sector consistency errors don't show up during normal operation; there's no hardware error, just bad data. They only show up during 'check' or 'repair,' so the recovery would never be triggered without express user request. > I agree that blindly setting parity is not good; that's an argument > for removing it from the kernel, not adding something :-) > > Why is it there? This is for Neil to answer; I merely conjecture that > it was already there. For example, it's necessary after a raid5 array > is created, because it's done by creating an n-1 degraded array and > adding the last disk afterwards. It's also done when an array is > dirty. This is a situation where it's done without asking the user but > it seems to me that in this case that's the right action: if the > parity doesn't agree with the data it's either because the parity was > not yet updated at the moment of the unclean shutdown or because it > was updated but not the data itself. In both cases the parity should > reflect the current data situation. > > The /sys/..../sync_action is just an interface added much later to > trigger the code. The check action is useful but I think repair is too > risky. I doubt it should be available. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:01 ` Bill Davidsen 2008-03-21 23:45 ` Carlos Carvalho @ 2008-03-21 23:55 ` Robin Hill 2008-03-22 10:03 ` Peter Rabbitson 1 sibling, 1 reply; 44+ messages in thread From: Robin Hill @ 2008-03-21 23:55 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2298 bytes --] On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: > Peter Rabbitson wrote: >> I was actually specifically advocating that md must _not_ do anything on >> its own. Just provide the hooks to get information (what is the current >> stripe state) and update information (the described repair extension). The >> logic that you are describing can live only in an external app, it has no >> place in-kernel. > > So you advocate the current code being in the kernel, which absent a > hardware error makes blind assumptions about which data is valid and which > is not and in all cases hides the problem, instead of the code I proposed, > which in some cases will be able to avoid action which is provably wrong > and never be less likely to do the wrong thing than the current code? > I would certainly advocate that the current (entirely automatic) code belongs in the kernel whereas any code requiring user intervention/decision making belongs in a user process, yes. That's not to say that the former should be preferred over the latter though, but there's really no reason to remove the in-kernel automated process until (or even after) a user-side repair process has been coded. > Currently the "repair" action (which *is* in the kernel now) takes no > advantage of the additional information available in these cases I noted. > By what logic do you conclude that the user meant "hide the error" when > using the "repair" action? What I propose is never less likely to be > correct than what the current code does, why would you not want to improve > the chances of getting the repair correct? > That is, of course, a separate issue to whether it should be in-kernel. I would entirely agree that user-level processes should be able to access and manipulate the low-level RAID data/metadata (via the md layer) in order to facilitate more advanced repair functions, but this should be separate from, and in addition to, the "ignorant" parity-updating repair process currently in place. Just my 2p, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:55 ` Robin Hill @ 2008-03-22 10:03 ` Peter Rabbitson 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson 0 siblings, 2 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-22 10:03 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: > >> Peter Rabbitson wrote: >>> I was actually specifically advocating that md must _not_ do anything on >>> its own. Just provide the hooks to get information (what is the current >>> stripe state) and update information (the described repair extension). The >>> logic that you are describing can live only in an external app, it has no >>> place in-kernel. >> So you advocate the current code being in the kernel, which absent a >> hardware error makes blind assumptions about which data is valid and which >> is not and in all cases hides the problem, instead of the code I proposed, >> which in some cases will be able to avoid action which is provably wrong >> and never be less likely to do the wrong thing than the current code? >> > I would certainly advocate that the current (entirely automatic) code > belongs in the kernel whereas any code requiring user > intervention/decision making belongs in a user process, yes. That's not > to say that the former should be preferred over the latter though, but > there's really no reason to remove the in-kernel automated process until > (or even after) a user-side repair process has been coded. I am asserting that automatic repair is infeasible in most highly-redundant cases. Let's take the root raid1 of one of my busiest servers: /dev/md0: Version : 00.90.03 Creation Time : Tue Mar 20 21:58:54 2007 Raid Level : raid1 Array Size : 6000128 (5.72 GiB 6.14 GB) Used Dev Size : 6000128 (5.72 GiB 6.14 GB) Raid Devices : 4 Total Devices : 4 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sat Mar 22 05:55:08 2008 State : clean Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) Events : 0.183270 As you can see it is pretty old, and does not have many events to speak of. Yet every month when the automatic check is issued I get between 512 and 2048 in mismatch_cnt. I maintain md5sums of all files on this filesystem, and there were no deviations for the lifetime of the array (of course there are mismatches after upgrades, after log appends etc, but they are all expected). So all I can do with this array is issue a blind repair, without even having the chance to find what exactly is causing this. Yes, it is raid1 and I could do 1:1 comparison to find which is the offending block. How about raid10 -n f3? There is no way I can figure out _what_ is giving me a problem. I do not know if it is a hardware error (the md5 sums speak against it), some process with weird write patterns resulting in heavy DMA, or a bug in md itself. By the way there is no swap file on this array. Just / and /var, with a moderately busy mail spool on top. >> Currently the "repair" action (which *is* in the kernel now) takes no >> advantage of the additional information available in these cases I noted. >> By what logic do you conclude that the user meant "hide the error" when >> using the "repair" action? 
What I propose is never less likely to be >> correct than what the current code does, why would you not want to improve >> the chances of getting the repair correct? >> > That is, of course, a separate issue to whether it should be in-kernel. > I would entirely agree that user-level processes should be able to > access and manipulate the low-level RAID data/metadata (via the md > layer) in order to facilitate more advanced repair functions, but this > should be separate from, and in addition to, the "ignorant" > parity-updating repair process currently in place. > I am trying to convey the idea that a first step to a userland process would be full disclosure of what is going on. A non-zero mismatch_cnt on a multigigabyte array makes an admin very uneasy, without giving him a chance to assess the situation. Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
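The 1:1 comparison Peter alludes to is scriptable from userspace today, at least for raid1. A rough sketch follows; the device names and size are placeholders, it should be run read-only on a quiescent array, and with 0.90 metadata the superblock sits near the end of each member, so either stop before it or expect a guaranteed mismatch there.

import hashlib

MEMBERS = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]  # placeholders
CHUNK = 64 * 1024
DATA_END = 6000128 * 1024   # bytes of data area, e.g. "Used Dev Size" in KiB

# Read the same chunk from every member and report any offset where the
# copies disagree, with a per-member checksum so the odd one out is obvious.
files = [open(dev, "rb") for dev in MEMBERS]
offset = 0
while offset < DATA_END:
    chunks = [f.read(CHUNK) for f in files]
    if any(c != chunks[0] for c in chunks[1:]):
        print(f"mismatch at member offset {offset}:")
        for dev, c in zip(MEMBERS, chunks):
            print(f"  {dev}  md5={hashlib.md5(c).hexdigest()}")
    offset += CHUNK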
* What do Events actually mean? 2008-03-22 10:03 ` Peter Rabbitson @ 2008-03-22 10:42 ` Justin Piszcz 2008-03-22 17:35 ` David Greaves 2008-03-25 3:58 ` Neil Brown 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson 1 sibling, 2 replies; 44+ messages in thread From: Justin Piszcz @ 2008-03-22 10:42 UTC (permalink / raw) To: linux-raid; +Cc: Peter Rabbitson On Sat, 22 Mar 2008, Peter Rabbitson wrote: > Robin Hill wrote: >> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >> >>> Peter Rabbitson wrote: > UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) > Events : 0.183270 > > As you can see it is pretty old, and does not have many events to speak of. What do the 'Events' actually represent and what do they mean for RAID0, RAID1, RAID5 etc? How are they calculated? Justin. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz @ 2008-03-22 17:35 ` David Greaves 2008-03-22 17:48 ` Justin Piszcz 2008-03-25 3:58 ` Neil Brown 1 sibling, 1 reply; 44+ messages in thread From: David Greaves @ 2008-03-22 17:35 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Peter Rabbitson Justin Piszcz wrote: > > > On Sat, 22 Mar 2008, Peter Rabbitson wrote: > >> Robin Hill wrote: >>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>> >>>> Peter Rabbitson wrote: > >> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host >> Arzamas) >> Events : 0.183270 >> >> As you can see it is pretty old, and does not have many events to >> speak of. > > What do the 'Events' actually represent and what do they mean for RAID0, > RAID1, RAID5 etc? > > How are they calculated? http://linux-raid.osdl.org/index.php/Event David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 17:35 ` David Greaves @ 2008-03-22 17:48 ` Justin Piszcz 2008-03-22 18:02 ` David Greaves 0 siblings, 1 reply; 44+ messages in thread From: Justin Piszcz @ 2008-03-22 17:48 UTC (permalink / raw) To: David Greaves; +Cc: linux-raid, Peter Rabbitson On Sat, 22 Mar 2008, David Greaves wrote: > Justin Piszcz wrote: >> >> >> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >> >>> Robin Hill wrote: >>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>> >>>>> Peter Rabbitson wrote: >> >>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host >>> Arzamas) >>> Events : 0.183270 >>> >>> As you can see it is pretty old, and does not have many events to >>> speak of. >> >> What do the 'Events' actually represent and what do they mean for RAID0, >> RAID1, RAID5 etc? >> >> How are they calculated? > > http://linux-raid.osdl.org/index.php/Event Empty? There is currently no text in this page, you can search for this page title in other pages or edit this page ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 17:48 ` Justin Piszcz @ 2008-03-22 18:02 ` David Greaves 0 siblings, 0 replies; 44+ messages in thread From: David Greaves @ 2008-03-22 18:02 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Peter Rabbitson Justin Piszcz wrote: > > > On Sat, 22 Mar 2008, David Greaves wrote: > >> Justin Piszcz wrote: >>> >>> >>> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >>> >>>> Robin Hill wrote: >>>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>>> >>>>>> Peter Rabbitson wrote: >>> >>>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host >>>> Arzamas) >>>> Events : 0.183270 >>>> >>>> As you can see it is pretty old, and does not have many events to >>>> speak of. >>> >>> What do the 'Events' actually represent and what do they mean for RAID0, >>> RAID1, RAID5 etc? >>> >>> How are they calculated? >> >> http://linux-raid.osdl.org/index.php/Event > > Empty? > There is currently no text in this page, you can search for this page > title in other pages or edit this page And? You think I was being too subtle <grin> David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz 2008-03-22 17:35 ` David Greaves @ 2008-03-25 3:58 ` Neil Brown 2008-03-26 8:57 ` David Greaves 2008-03-26 8:57 ` David Greaves 1 sibling, 2 replies; 44+ messages in thread From: Neil Brown @ 2008-03-25 3:58 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Peter Rabbitson On Saturday March 22, jpiszcz@lucidpixels.com wrote: > > > On Sat, 22 Mar 2008, Peter Rabbitson wrote: > > > Robin Hill wrote: > >> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: > >> > >>> Peter Rabbitson wrote: > > > UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) > > Events : 0.183270 > > > > As you can see it is pretty old, and does not have many events to speak of. > > What do the 'Events' actually represent and what do they mean for RAID0, > RAID1, RAID5 etc? An 'event' is one of: switch from 'active' to 'clean' switch from 'clean' to 'active' device fails device is added spare replaces a failed device after a rebuild I think that is all. None of these are meaningful for RAID0, so the 'events' counter on RAID0 should be stable. Unfortunately, the number looks like a decimal but isn't. It is a 64-bit number. We print out the top 32 bits, then the bottom 32 bits. I don't remember why. Maybe I'll 'fix' it. > > How are they calculated? events = events + 1; Feel free to merge this text into the wiki. NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
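The printing Neil describes explains the "Events : 0.183270" quoted above: the value is one 64-bit counter rendered as <top 32 bits>.<bottom 32 bits>, not a decimal fraction. A two-line sketch:

events = 183270                                  # high word 0, low word 183270
print(f"{events >> 32}.{events & 0xFFFFFFFF}")   # -> 0.183270
# Only after 2**32 events would this read "1.0", which is why it merely
# looks like a decimal.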
* Re: What do Events actually mean? 2008-03-25 3:58 ` Neil Brown @ 2008-03-26 8:57 ` David Greaves 2008-03-26 8:57 ` David Greaves 1 sibling, 0 replies; 44+ messages in thread From: David Greaves @ 2008-03-26 8:57 UTC (permalink / raw) To: Neil Brown; +Cc: Justin Piszcz, linux-raid, Peter Rabbitson Neil Brown wrote: > On Saturday March 22, jpiszcz@lucidpixels.com wrote: >> >> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >> >>> Robin Hill wrote: >>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>> >>>>> Peter Rabbitson wrote: >>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) >>> Events : 0.183270 >>> >>> As you can see it is pretty old, and does not have many events to speak of. >> What do the 'Events' actually represent and what do they mean for RAID0, >> RAID1, RAID5 etc? > > An 'event' is one of: > switch from 'active' to 'clean' > switch from 'clean' to 'active' > device fails > device is added > spare replaces a failed device after a rebuild > > I think that is all. > > None of these are meaningful for RAID0, so the 'events' counter on > RAID0 should be stable. > > Unfortunately, the number looks like a decimal but isn't. > It is a 64-bit number. We print out the top 32 bits, then the bottom > 32 bits. I don't remember why. Maybe I'll 'fix' it. > >> How are they calculated? > > events = events + 1; > > > Feel free to merge this text into the wiki. http://linux-raid.osdl.org/index.php?title=Event I also added: == What are they for? == When an array is assembled, all the disks should have the same number of events. If they don't then something odd happened. eg: If one drive fails then the remaining drives have their event counter incremented. When the array is re-assembled the failed drive has a different event count and is not included in the assembly. This led me to ponder: How/when are events reset to equality? I wrote: The event count on a drive is set to zero on creation and reset to the majority on a resync or a forced assembly. Is that right? David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-25 3:58 ` Neil Brown 2008-03-26 8:57 ` David Greaves @ 2008-03-26 8:57 ` David Greaves 1 sibling, 0 replies; 44+ messages in thread From: David Greaves @ 2008-03-26 8:57 UTC (permalink / raw) To: Neil Brown; +Cc: Justin Piszcz, linux-raid, Peter Rabbitson Neil Brown wrote: > On Saturday March 22, jpiszcz@lucidpixels.com wrote: >> >> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >> >>> Robin Hill wrote: >>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>> >>>>> Peter Rabbitson wrote: >>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) >>> Events : 0.183270 >>> >>> As you can see it is pretty old, and does not have many events to speak of. >> What do the 'Events' actually represent and what do they mean for RAID0, >> RAID1, RAID5 etc? > > An 'event' is one of: > switch from 'active' to 'clean' > switch from 'clean' to 'active' > device fails > device is added > spare replaces a failed device after a rebuild > > I think that is all. > > None of these are meaningful for RAID0, so the 'events' counter on > RAID0 should be stable. > > Unfortunately, the number looks like a decimal but isn't. > It is a 64-bit number. We print out the top 32 bits, then the bottom > 32 bits. I don't remember why. Maybe I'll 'fix' it. > >> How are they calculated? > > events = events + 1; > > > Feel free to merge this text into the wiki. Thanks Neil :) http://linux-raid.osdl.org/index.php?title=Event I also added: == What are they for? == When an array is assembled, all the disks should have the same number of events. If they don't then something odd happened. eg: If one drive fails then the remaining drives have their event counter incremented. When the array is re-assembled the failed drive has a different event count and is not included in the assembly. This led me to ponder: How/when are events reset to equality? I wrote: The event count on a drive is set to zero on creation and reset to the majority on a resync or a forced assembly. Is that right? David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 10:03 ` Peter Rabbitson 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz @ 2008-05-04 7:30 ` Peter Rabbitson 2008-05-06 6:36 ` Luca Berra 1 sibling, 1 reply; 44+ messages in thread From: Peter Rabbitson @ 2008-05-04 7:30 UTC (permalink / raw) To: linux-raid Peter Rabbitson wrote: > Robin Hill wrote: >> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >> >>> Peter Rabbitson wrote: >>>> I was actually specifically advocating that md must _not_ do >>>> anything on its own. Just provide the hooks to get information (what >>>> is the current stripe state) and update information (the described >>>> repair extension). The logic that you are describing can live only >>>> in an external app, it has no place in-kernel. >>> So you advocate the current code being in the kernel, which absent a >>> hardware error makes blind assumptions about which data is valid and >>> which is not and in all cases hides the problem, instead of the code >>> I proposed, which in some cases will be able to avoid action which is >>> provably wrong and never be less likely to do the wrong thing than >>> the current code? >>> >> I would certainly advocate that the current (entirely automatic) code >> belongs in the kernel whereas any code requiring user >> intervention/decision making belongs in a user process, yes. That's not >> to say that the former should be preferred over the latter though, but >> there's really no reason to remove the in-kernel automated process until >> (or even after) a user-side repair process has been coded. > > I am asserting that automatic repair is infeasible in most > highly-redundant cases. Let's take the root raid1 of one of my busiest > servers: > > /dev/md0: > Version : 00.90.03 > Creation Time : Tue Mar 20 21:58:54 2007 > Raid Level : raid1 > Array Size : 6000128 (5.72 GiB 6.14 GB) > Used Dev Size : 6000128 (5.72 GiB 6.14 GB) > Raid Devices : 4 > Total Devices : 4 > Preferred Minor : 0 > Persistence : Superblock is persistent > > Update Time : Sat Mar 22 05:55:08 2008 > State : clean > Active Devices : 4 > Working Devices : 4 > Failed Devices : 0 > Spare Devices : 0 > > UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host > Arzamas) > Events : 0.183270 > > As you can see it is pretty old, and does not have many events to speak > of. Yet every month when the automatic check is issued I get between 512 > and 2048 in mismatch_cnt. I maintain md5sums of all files on this > filesystem, and there were no deviations for the lifetime of the array > (of course there are mismatches after upgrades, after log appends etc, > but they are all expected). So all I can do with this array is issue a > blind repair, without even having the chance to find what exactly is > causing this. Yes, it is raid1 and I could do 1:1 comparison to find > which is the offending block. How about raid10 -n f3? There is no way I > can figure out _what_ is giving me a problem. I do not know if it is a > hardware error (the md5 sums speak against it), some process with weird > write patterns resulting in heavy DMA, or a bug in md itself. > > By the way there is no swap file on this array. Just / and /var, with a > moderately busy mail spool on top. > I want to resurrect this discussion with a peculiar observation - the above mismatch was caused by GRUB. I had some time this weekend and decided to take device snapshots of the 4 array members as listed above while / is mounted ro. 
After stripping the md superblock I ended up with data from slots 1, 2 and 3 being identical, and 0 (my primary boot device) being different by about 10 bytes. Hexediting revealed that the bytes in question belong to /boot/grub/default. I realized that my grub config contains a savedefault clause, which updates the file on the raw ext3 volume before any raid assembly has taken place. Executing grub-set-default from within a booted system (with a mounted assembled raid) resulted in the subsequent md check returning 0 mismatches. To add insult to injury, the ways savedefault and grub-set-default update said file are different (comments vs empty lines). So even if one savedefault's the same entry as the one set initially by grub-set-default - the result will still be a raid1 mismatch. I assume that this condition is benign, but wanted to bring this to the attention of the masses anyway. Cheers Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson @ 2008-05-06 6:36 ` Luca Berra 0 siblings, 0 replies; 44+ messages in thread From: Luca Berra @ 2008-05-06 6:36 UTC (permalink / raw) To: linux-raid On Sun, May 04, 2008 at 09:30:02AM +0200, Peter Rabbitson wrote: >I want to resurrect this discussion with a peculiar observation - the above >mismatch was caused by GRUB. ... >I realized that my grub config contains a savedefault clause, which updates >the file on the raw ext3 volume before any raid assembly has taken place. >Executing grub-set-default from within a booted system (with a mounted >assembled raid) resulted in the subsequent md check returning 0 mismatches. this has been a long-standing issue with grub 1.x; hopefully some day grub 2 will be production ready and the issue will be forgotten (real md raid support for grub 2 was added in Google SoC 2006). L. -- Luca Berra -- bluca@comedia.it Communication Media & Services S.r.l. /"\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 23:08 ` Peter Rabbitson 2008-03-21 14:24 ` Bill Davidsen @ 2008-03-25 4:24 ` Neil Brown 2008-03-25 9:00 ` Peter Rabbitson 1 sibling, 1 reply; 44+ messages in thread From: Neil Brown @ 2008-03-25 4:24 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Theodore Tso, Bas van Schaik, linux-raid On Friday March 21, rabbit+list@rabbit.us wrote: > Theodore Tso wrote: > > On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > >>> There's no explicit message produced by the md module, no. You need to > >>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > >>> mismatches there are. Similarly, following a repair this will indicate > >>> how many mismatches it thinks have been fixed (by updating the parity > >>> block to match the data blocks). > >>> > >> Marvellous! I naively assumed that the module would warn me, but that's > >> not true. Wouldn't it be appropriate to print a message to dmesg if such > >> a mismatch occurs during a check? Such a mismatch clearly means that > >> there is something wrong with your hardware lying beneath md, doesn't it? > > > > If a mismatch is detected in a RAID-6 configuration, it should be > > possible to figure out what should be fixed (since with two hot spares > > there should be enough redundancy not only to detect an error, but to > > correct it.) Out of curiosity, does md do this automatically, either > > when reading from a stripe, or during a resync operation? > > > > In my modest experience with root/high performance spool on various raid > levels I can pretty much conclude that the current check mechanism doesn't do > enough to give power to the user. We can debate all we want about what the MD > driver should do when it finds a mismatch, yet there is no way for the user to > figure out what the mismatch is and take appropriate action. This does not > apply only to RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the > only wrong value is taken and written all over the other good blocks? > > I think that the solution is rather simple, and I would contribute a patch if > I had any C experience. The current check mechanism remains the same - > mismatch_cnt is incremented/reset just the same as before. However on every > mismatching chunk the system printks the following: > > 1) the start offset of the chunk(md1/10) or stripe(md5/6) within the MD device > 2) one line for every active disk containing: > a) the offset of the chunk within the MD component > b) a {md5|sha1}sum of the chunk More logging probably would be appropriate. I wouldn't emit too much detail from the kernel though. Just enough to identify the location. Have the userspace tool do all the more interesting stuff. You would want to rate limit the message though, so that you don't get piles of messages when initialising the array... > > In a common case array this will take no more than 8 lines in dmesg. However > it will allow: > > 1) For a human to determine at a glance which disk holds a mismatching chunk > in raid 1/10 > 2) Determine the same for raid 6 using a userspace tool which will calculate > the parity for every possible permutation of chunks > 3) using some external tools to determine which file might have been affected > on the layered file system > > > Now of course the problem remains how to repair the array using the > information obtained above. 
I think the best way would be to extend the syntax > of repair itself, so that: > > echo repair > .../sync_action would use the old heuristics > > echo repair <mdoffset> <component N> > .../sync_action will update the chunk > on drive N which corresponds to the chunk/stripe at mdoffset within the MD > device, using the information from the other drives, and not the other way > around as might happen with just a repair. Suspend the array, update the raw devices, then re-enable the array. All from user-space. No magic parsing of 'sync_action' input. NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
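A loose sketch of the userspace flow Neil outlines, for a raid1 member. It assumes md's suspend_lo/suspend_hi sysfs attributes (which block array I/O over a sector range) behave as documented and that an active member can be opened for writing; the array name, member, offsets and the source of the known-good chunk are all hypothetical, and this is an untested illustration of the idea, not a tool.

SYSFS = "/sys/block/md0/md"                   # hypothetical array
MEMBER = "/dev/sdc1"                          # member holding the bad copy
BYTE_OFF = 123456 * 512                       # chunk offset on the member
                                              # (1.x metadata adds a data offset)
good = open("/tmp/good_chunk", "rb").read()   # reconstructed elsewhere

def sysfs_write(attr, value):
    with open(f"{SYSFS}/{attr}", "w") as f:
        f.write(str(value))

first = BYTE_OFF // 512
sysfs_write("suspend_lo", first)              # block array I/O over the range
sysfs_write("suspend_hi", first + len(good) // 512)
with open(MEMBER, "r+b") as dev:              # rewrite the chunk behind md's back
    dev.seek(BYTE_OFF)
    dev.write(good)
sysfs_write("suspend_hi", 0)                  # lift the suspension again
sysfs_write("suspend_lo", 0)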
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-25 4:24 ` Neil Brown @ 2008-03-25 9:00 ` Peter Rabbitson 0 siblings, 0 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-25 9:00 UTC (permalink / raw) To: Neil Brown; +Cc: Theodore Tso, Bas van Schaik, linux-raid Neil Brown wrote: > On Friday March 21, rabbit+list@rabbit.us wrote: >> Theodore Tso wrote: >>> On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>>>> There's no explicit message produced by the md module, no. You need to >>>>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>>>> mismatches there are. Similarly, following a repair this will indicate >>>>> how many mismatches it thinks have been fixed (by updating the parity >>>>> block to match the data blocks). >>>>> >>>> Marvellous! I naively assumed that the module would warn me, but that's >>>> not true. Wouldn't it be appropriate to print a message to dmesg if such >>>> a mismatch occurs during a check? Such a mismatch clearly means that >>>> there is something wrong with your hardware lying beneath md, doesn't it? >>> If a mismatch is detected in a RAID-6 configuration, it should be >>> possible to figure out what should be fixed (since with two hot spares >>> there should be enough redundancy not only to detect an error, but to >>> correct it.) Out of curiosity, does md do this automatically, either >>> when reading from a stripe, or during a resync operation? >>> >> In my modest experience with root/high performance spool on various raid >> levels I can pretty much conclude that the current check mechanism doesn't do >> enough to give power to the user. We can debate all we want about what the MD >> driver should do when it finds a mismatch, yet there is no way for the user to >> figure out what the mismatch is and take appropriate action. This does not >> apply only to RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the >> only wrong value is taken and written all over the other good blocks? >> >> I think that the solution is rather simple, and I would contribute a patch if >> I had any C experience. The current check mechanism remains the same - >> mismatch_cnt is incremented/reset just the same as before. However on every >> mismatching chunk the system printks the following: >> >> 1) the start offset of the chunk(md1/10) or stripe(md5/6) within the MD device >> 2) one line for every active disk containing: >> a) the offset of the chunk within the MD component >> b) a {md5|sha1}sum of the chunk > > More logging probably would be appropriate. > I wouldn't emit too much detail from the kernel though. Just enough > to identify the location. Have the userspace tool do all the more > interesting stuff. True. The only reason I suggested checksum information was because the blocks are already in memory, and checksum routines are readily available. > You would want to rate limit the message though, so that you don't get > piles of messages when initialising the array... More realistically one would want to be able to flip a switch in /sys/block/mdX/md/ to see any advanced logging at all. So basically you run your monthly checks, one of them comes back with non-zero mismatch_cnt, you echo 1 > /sys/block/mdX/md/sync_action_debug and look at your logs. >> In a common case array this will take no more than 8 lines in dmesg. 
However >> it will allow: >> >> 1) For a human to determine at a glance which disk holds a mismatching chunk >> in raid 1/10 >> 2) Determine the same for raid 6 using a userspace tool which will calculate >> the parity for every possible permutation of chunks >> 3) using some external tools to determine which file might have been affected >> on the layered file system >> >> >> Now of course the problem remains how to repair the array using the >> information obtained above. I think the best way would be to extend the syntax >> of repair itself, so that: >> >> echo repair > .../sync_action would use the old heuristics >> >> echo repair <mdoffset> <component N> > .../sync_action will update the chunk >> on drive N which corresponds to the chunk/stripe at mdoffset within the MD >> device, using the information from the other drives, and not the other way >> around as might happen with just a repair. > > Suspend the array, update the raw devices, then re-enable the array. > All from user-space. > No magic parsing of 'sync_action' input. > The sole advantage of 'repair' is that you do not take the array offline. It doesn't even have to be 'repair', it can be something like 'refresh' or 'relocate'. The point is that such a simple interface would be a clean way to fix any inconsistencies in any RAID level without taking it offline. ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread [newest: 2008-05-06 6:36 UTC] Thread overview: 44+ messages -- 2008-03-16 14:21 Redundancy check using "echo check > sync_action": error reporting? Bas van Schaik 2008-03-16 15:14 ` Janek Kozicki 2008-03-20 13:32 ` Bas van Schaik 2008-03-20 13:47 ` Robin Hill 2008-03-20 14:19 ` Bas van Schaik 2008-03-20 14:45 ` Robin Hill 2008-03-20 15:16 ` Bas van Schaik 2008-03-20 16:04 ` Robin Hill 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill 2008-03-20 17:39 ` Andre Noll 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll 2008-03-21 14:02 ` Ric Wheeler 2008-03-21 20:19 ` NeilBrown 2008-03-21 20:45 ` Ric Wheeler 2008-03-22 17:13 ` Bill Davidsen 2008-03-20 23:08 ` Peter Rabbitson 2008-03-21 14:24 ` Bill Davidsen 2008-03-21 14:52 ` Peter Rabbitson 2008-03-21 17:13 ` Theodore Tso 2008-03-21 17:35 ` Peter Rabbitson 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik 2008-03-25 4:44 ` Neil Brown 2008-03-25 15:17 ` Bill Davidsen 2008-03-25 9:19 ` Mattias Wadenstein 2008-03-21 17:43 ` Robin Hill 2008-03-21 23:01 ` Bill Davidsen 2008-03-21 23:45 ` Carlos Carvalho 2008-03-22 17:19 ` Bill Davidsen 2008-03-21 23:55 ` Robin Hill 2008-03-22 10:03 ` Peter Rabbitson 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz 2008-03-22 17:35 ` David Greaves 2008-03-22 17:48 ` Justin Piszcz 2008-03-22 18:02 ` David Greaves 2008-03-25 3:58 ` Neil Brown 2008-03-26 8:57 ` David Greaves 2008-03-26 8:57 ` David Greaves 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson 2008-05-06 6:36 ` Luca Berra 2008-03-25 4:24 ` Neil Brown 2008-03-25 9:00 ` Peter Rabbitson