* Redundancy check using "echo check > sync_action": error reporting? @ 2008-03-16 14:21 Bas van Schaik 2008-03-16 15:14 ` Janek Kozicki 0 siblings, 1 reply; 44+ messages in thread From: Bas van Schaik @ 2008-03-16 14:21 UTC (permalink / raw) To: linux-raid Hi all, As we speak, I'm trying to debug a really weird type of filesystem corruption in a quite complex layered system with networking involved: ATA over Ethernet - RAID5 - LVM - CryptoLoop - EXT3 In plain English: four storage servers export a bunch of block devices using AoE, and the "cluster frontend" uses those devices to build three RAID5 arrays. Those arrays are the basis of a large LVM volume group, in which a Logical Volume was created with an encrypted 2.5TB EXT3 filesystem (cryptoloop). Recently the system suffered massive filesystem corruption, which even made e2fsck crash. Theodore Tso was able to partially analyze and fix the filesystem and found out that some random garbage was written to the EXT3 inode tables, as well as some other weird corruption. Personally, I suspect that one of the storage servers or the network caused these severe corruptions, but I have never seen any errors at the RAID5 level. The (Debian) system runs a monthly check of the RAID5 arrays using Martin F. Krafft's checkarray script. Basically this script performs an "echo check > /sys/block/$array/md/sync_action" for all arrays. With my (basic) knowledge of RAID5 I assume this check only recomputes the parity and compares it to the stored XOR value. This makes me wonder: 1) Will the kernel actually warn me when an inconsistency is found? Reading some other posts on the lists, it seems the kernel will print a "read error corrected!" message; is that correct? Note that I'm using kernel 2.6.18 (Debian stable); was it already implemented that way in that kernel? 2) How can the RAID code actually correct such a read error on RAID5? How does it know which device actually contains the faulty data? The answers to those questions are very important to me: if the kernel actually warns me when an inconsistency is found, then the absence of such warnings rules out the possibility that there is something wrong with the network or one of the storage servers. Actually, that would mean that the "cluster frontend" is causing the corruptions. Kind regards, -- Bas van Schaik ^ permalink raw reply [flat|nested] 44+ messages in thread
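For reference, the core of what checkarray does can be sketched in a few lines of shell. This is a simplified illustration of the "echo check" loop described above, not the actual script (which adds locking, option parsing and error handling):

    #!/bin/sh
    # Start a background redundancy check on every md array found in sysfs.
    for md in /sys/block/md*; do
        [ -w "$md/md/sync_action" ] || continue
        echo check > "$md/md/sync_action"
    done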
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-16 14:21 Redundancy check using "echo check > sync_action": error reporting? Bas van Schaik @ 2008-03-16 15:14 ` Janek Kozicki 2008-03-20 13:32 ` Bas van Schaik 0 siblings, 1 reply; 44+ messages in thread From: Janek Kozicki @ 2008-03-16 15:14 UTC (permalink / raw) Cc: linux-raid Bas van Schaik said: (by the date of Sun, 16 Mar 2008 15:21:11 +0100) > As we speak, I'm trying to debug a really weird type of filesystem > corruption in a quite complex layered system with networking involved: AFAIK, even for the simplest case where corruption happens between a head and a disk platter during a write operation - the RAID has no way to detect that. Unless it discovers later that the parity on another disk is wrong, and it is automatically updated to reflect the corrupted data. Does it produce a message during resync in such a case? Someone here should be able to answer this. -- Janek Kozicki | ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-16 15:14 ` Janek Kozicki @ 2008-03-20 13:32 ` Bas van Schaik 2008-03-20 13:47 ` Robin Hill 0 siblings, 1 reply; 44+ messages in thread From: Bas van Schaik @ 2008-03-20 13:32 UTC (permalink / raw) To: Janek Kozicki; +Cc: linux-raid Janek Kozicki wrote: > Bas van Schaik said: (by the date of Sun, 16 Mar 2008 15:21:11 +0100) > > >> As we speak, I'm trying to debug a really weird type of filesystem >> corruption in a quite complex layered system with networking involved: >> > > AFAIK, even for the simplest case where corruption happens between a > head and a disk platter during a write operation - the RAID has no way > to detect that. Unless it discovers later that the parity on > another disk is wrong, and it is automatically updated to reflect the > corrupted data. Does it produce a message during resync in such a case? > Someone here should be able to answer this. > Anyone able to answer the last and most important question: does it produce a message during resync in case of corruption? That would be great! -- Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 13:32 ` Bas van Schaik @ 2008-03-20 13:47 ` Robin Hill 2008-03-20 14:19 ` Bas van Schaik 0 siblings, 1 reply; 44+ messages in thread From: Robin Hill @ 2008-03-20 13:47 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 952 bytes --] On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: > Anyone able to answer the last and most important question: does it > produce a message during resync in case of corruption? That would be great! > There's no explicit message produced by the md module, no. You need to check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many mismatches there are. Similarly, following a repair this will indicate how many mismatches it thinks have been fixed (by updating the parity block to match the data blocks). I've no idea whether the checkarray script you're using is checking this counter - there seems little point in having a special script if it isn't though. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
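Concretely, once a check has finished, the counter Robin mentions can be read straight from sysfs. A minimal illustration (md0 is just a placeholder for the array in question):

    # After an "echo check > sync_action" run has completed:
    cat /sys/block/md0/md/mismatch_cnt    # 0 means no inconsistencies were found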
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 13:47 ` Robin Hill @ 2008-03-20 14:19 ` Bas van Schaik 2008-03-20 14:45 ` Robin Hill 2008-03-20 16:35 ` Theodore Tso 0 siblings, 2 replies; 44+ messages in thread From: Bas van Schaik @ 2008-03-20 14:19 UTC (permalink / raw) To: linux-raid; +Cc: Theodore Tso Robin Hill wrote: > On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: > >> Anyone able to answer the last and most important question: does it >> produce a message during resync in case of corruption? That would be great! >> > There's no explicit message produced by the md module, no. You need to > check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > mismatches there are. Similarly, following a repair this will indicate > how many mismatches it thinks have been fixed (by updating the parity > block to match the data blocks). > Marvellous! I naively assumed that the module would warn me, but that's not true. Wouldn't it be appropriate to print a message to dmesg if such a mismatch occurs during a check? Such a mismatch clearly means that there is something wrong with your hardware lying beneath md, doesn't it? > I've no idea whether the checkarray script you're using is checking this > counter - there seems little point in having a special script if it > isn't though. > If I understand the meaning of this counter, it would be sufficient to check the value of it _before_ the check operation and compare that value to the counter value _after_ the check. If the counter has increased: the check has encountered some inconsistencies which should be reported. Please correct me if I'm wrong! Cheers, Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 14:19 ` Bas van Schaik @ 2008-03-20 14:45 ` Robin Hill 2008-03-20 15:16 ` Bas van Schaik 2008-03-20 16:35 ` Theodore Tso 1 sibling, 1 reply; 44+ messages in thread From: Robin Hill @ 2008-03-20 14:45 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2491 bytes --] On Thu Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > Robin Hill wrote: > > On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: > > > >> Anyone able to answer the last and most important question: does it > >> produce a message during resync in case of corruption? That would be great! > >> > > There's no explicit message produced by the md module, no. You need to > > check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > > mismatches there are. Similarly, following a repair this will indicate > > how many mismatches it thinks have been fixed (by updating the parity > > block to match the data blocks). > > > Marvellous! I naively assumed that the module would warn me, but that's > not true. Wouldn't it be appropriate to print a message to dmesg if such > a mismatch occurs during a check? Such a mismatch clearly means that > there is something wrong with your hardware lying beneath md, doesn't it? > With a RAID5 then mostly, yes - there may be errors caused by transient situations (interference, cosmic rays, etc) which are entirely independent of the hardware. With other RAID versions it's not quite as clear cut. For example with RAID1 it's possible for the in-memory data to have been changed between writing to each disk (especially with swap disks) - this isn't necessarily an issue (and certainly not a hardware one). > > I've no idea whether the checkarray script you're using is checking this > > counter - there seems little point in having a special script if it > > isn't though. > > > If I understand the meaning of this counter, it would be sufficient to > check the value of it _before_ the check operation and compare that > value to the counter value _after_ the check. If the counter has > increased: the check has encountered some inconsistencies which should > be reported. > Please correct me if I'm wrong! > Depends on what the previous operation was. After a repair, the counter will indicate the number of errors fixed, not the number remaining. Theoretically, after a repair there will be no errors remaining, so any value (> 0) in the counter after a check would indicate an issue to be reported. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
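Putting Robin's two points together, a check-and-report wrapper can be sketched as below. This is an illustrative sketch, not part of checkarray; the array name, the polling interval and the use of mail(1) are all assumptions:

    #!/bin/sh
    # Run a check on md0 and report if the array shows any mismatches.
    MD=/sys/block/md0/md
    echo check > "$MD/sync_action"
    # Poll until the check finishes; mismatch_cnt is only meaningful then.
    while [ "$(cat "$MD/sync_action")" != "idle" ]; do
        sleep 60
    done
    count=$(cat "$MD/mismatch_cnt")
    if [ "$count" -gt 0 ]; then
        echo "md0: check reported mismatch_cnt=$count" | \
            mail -s "RAID consistency warning" root
    fi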
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 14:45 ` Robin Hill @ 2008-03-20 15:16 ` Bas van Schaik 2008-03-20 16:04 ` Robin Hill 0 siblings, 1 reply; 44+ messages in thread From: Bas van Schaik @ 2008-03-20 15:16 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > On Thu Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > > >> Robin Hill wrote: >> >>> On Thu Mar 20, 2008 at 02:32:37PM +0100, Bas van Schaik wrote: >>> >>> >>>> Anyone able to answer the last and most important question: does it >>>> produce a message during resync in case of corruption? That would be great! >>>> >>>> >>> There's no explicit message produced by the md module, no. You need to >>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>> mismatches there are. Similarly, following a repair this will indicate >>> how many mismatches it thinks have been fixed (by updating the parity >>> block to match the data blocks). >>> >>> >> Marvellous! I naively assumed that the module would warn me, but that's >> not true. Wouldn't it be appropriate to print a message to dmesg if such >> a mismatch occurs during a check? Such a mismatch clearly means that >> there is something wrong with your hardware lying beneath md, doesn't it? >> >> > With a RAID5 then mostly, yes - there may be errors caused by transient > situations (interference, cosmic rays, etc) which are entirely > independent of the hardware. With other RAID versions it's not quite as > clear cut. For example with RAID1 it's possible for the in-memory data > to have been changed between writing to each disk (especially with swap > disks) - this isn't necessarily an issue (and certainly not a hardware > one). > Maybe I understand something wrong then. In an ideal situation, the following should hold: - for RAID5: all data blocks should XOR to the parity block - for RAID1: all bits should be identical If the redundancy check encounters an anomaly, something should be fixed. If something should be fixed, clearly something went wrong somewhere in the past. Or can you give an example where the statements mentioned above don't hold and nothing is wrong? >>> I've no idea whether the checkarray script you're using is checking this >>> counter - there seems little point in having a special script if it >>> isn't though. >>> >>> >> If I understand the meaning of this counter, it would be sufficient to >> check the value of it _before_ the check operation and compare that >> value to the counter value _after_ the check. If the counter has >> increased: the check has encountered some inconsistencies which should >> be reported. >> Please correct me if I'm wrong > Depends on what the previous operation was. After a repair, the counter > will indicate the number of errors fixed, not the number remaining. > Theoretically, after a repair there will be no errors remaining, so any > value (> 0) in the counter after a check would indicate an issue to be > reported. > Bottom line: I just want to know if an md check (using "echo check > sync_action") encountered any inconsistencies. If so, in my setup that would probably mean there is something wrong (bits flipping somewhere between md, the bus, the NIC, the network, the NIC of a storage server, etc.) I just don't want to be surprised by any major filesystem corruptions anymore! Cheers, Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 15:16 ` Bas van Schaik @ 2008-03-20 16:04 ` Robin Hill 0 siblings, 0 replies; 44+ messages in thread From: Robin Hill @ 2008-03-20 16:04 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 1956 bytes --] On Thu Mar 20, 2008 at 04:16:08PM +0100, Bas van Schaik wrote: > Maybe I understand something wrong then. In an ideal situation, the > following should hold: > - for RAID5: all data blocks should XOR to the parity block > - for RAID1: all bits should be identical > > If the redundancy check encounters an anomaly, something should be fixed. > If something should be fixed, clearly something went wrong somewhere in > the past. Or can you give an example where the statements mentioned > above don't hold and nothing is wrong? > My understanding is that, for RAID1 at least (and possibly for any other mirrored setup), the data is not written to both disks simultaneously; therefore there's a chance for the data to be modified (in memory) between writes (or for the check to read the disks between writes). This is usually only a temporary situation (i.e. the block is due to be rewritten anyway) but does show up occasionally in checks, particularly with swap partitions. > Bottom line: I just want to know if an md check (using "echo check > > sync_action") encountered any inconsistencies. If so, in my setup that > would probably mean there is something wrong (bits flipping somewhere > between md, the bus, the NIC, the network, the NIC of a storage server, > etc.) > > I just don't want to be surprised by any major filesystem corruptions > anymore! > For this, a simple check for a non-zero value in the /sys/block/md{X}/md/mismatch_cnt entry will indicate an issue. Note that the repair stage only rewrites the parity - there's no way to know whether the actual error was in the data or parity though, so there may still be corruption after running a repair. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 14:19 ` Bas van Schaik 2008-03-20 14:45 ` Robin Hill @ 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill ` (2 more replies) 1 sibling, 3 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-20 16:35 UTC (permalink / raw) To: Bas van Schaik; +Cc: linux-raid On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > > There's no explicit message produced by the md module, no. You need to > > check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > > mismatches there are. Similarly, following a repair this will indicate > > how many mismatches it thinks have been fixed (by updating the parity > > block to match the data blocks). > > > Marvellous! I naively assumed that the module would warn me, but that's > not true. Wouldn't it be appropriate to print a message to dmesg if such > a mismatch occurs during a check? Such a mismatch clearly means that > there is something wrong with your hardware lying beneath md, doesn't it? If a mismatch is detected in a RAID-6 configuration, it should be possible to figure out what should be fixed (since with two parity blocks there should be enough redundancy not only to detect an error, but to correct it.) Out of curiosity, does md do this automatically, either when reading from a stripe, or during a resync operation? - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 16:35 ` Theodore Tso @ 2008-03-20 17:10 ` Robin Hill 0 siblings, 0 replies; 44+ messages in thread From: Robin Hill @ 2008-03-20 17:10 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 1272 bytes --] On Thu Mar 20, 2008 at 12:35:51PM -0400, Theodore Tso wrote: > If a mismatch is detected in a RAID-6 configuration, it should be > possible to figure out what should be fixed (since with two parity blocks > there should be enough redundancy not only to detect an error, but to > correct it.) Out of curiosity, does md do this automatically, either > when reading from a stripe, or during a resync operation? > I'm not sure about during a read (though my understanding is that the parity is entirely ignored here, so no checking is done in any RAID configuration). As for during resync, not as of last time this came up, no (I've not looked at the code but Neil was certainly opposed to trying to do this then). The problem is that you can only (safely) do a repair if you know _for a fact_ that only a single block is corrupt. Otherwise there's a reasonable chance of further corrupting the data by "repairing" good blocks to match the bad. The current md code just recalculates both parity blocks. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill @ 2008-03-20 17:39 ` Andre Noll 2008-03-20 18:02 ` Theodore Tso 2008-03-20 23:08 ` Peter Rabbitson 2 siblings, 1 reply; 44+ messages in thread From: Andre Noll @ 2008-03-20 17:39 UTC (permalink / raw) To: Theodore Tso; +Cc: Bas van Schaik, linux-raid [-- Attachment #1: Type: text/plain, Size: 580 bytes --] On 12:35, Theodore Tso wrote: > If a mismatch is detected in a RAID-6 configuration, it should be > possible to figure out what should be fixed It can be figured out under the assumption that exactly one drive has bad data and all other ones have good data. But that seems to be an assumption that is hard to verify in reality. > Out of curiosity, does md do this automatically, either > when reading from a stripe, or during a resync operation? Nope, md does no such thing. Andre -- The only person who always got his work done by Friday was Robinson Crusoe [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 17:39 ` Andre Noll @ 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-20 18:02 UTC (permalink / raw) To: Andre Noll; +Cc: Bas van Schaik, linux-raid On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: > On 12:35, Theodore Tso wrote: > > > If a mismatch is detected in a RAID-6 configuration, it should be > > possible to figure out what should be fixed > > It can be figured out under the assumption that exactly one drive has > bad data and all other ones have good data. But that seems to be an > assumption that is hard to verify in reality. True, but it's what ECC memory does. :-) And most people agree that it's a useful thing to do with memory. If you do ECC syndrome checking on every read, and follow that up with periodic scrubbing so that you catch (and correct) errors quickly, it is a reasonable assumption to make. Obviously a warning should be given when you do this kind of ECC fixup, and if there is an increasing number of ECC fixups that are being done, that should set off alarms that maybe there is a hardware problem that needs to be addressed. Regards, - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 18:02 ` Theodore Tso @ 2008-03-20 18:57 ` Andre Noll 2008-03-21 14:02 ` Ric Wheeler 2008-03-21 20:19 ` NeilBrown 2 siblings, 0 replies; 44+ messages in thread From: Andre Noll @ 2008-03-20 18:57 UTC (permalink / raw) To: Theodore Tso; +Cc: Bas van Schaik, linux-raid [-- Attachment #1: Type: text/plain, Size: 1625 bytes --] On 14:02, Theodore Tso wrote: > On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: > > On 12:35, Theodore Tso wrote: > > > > > If a mismatch is detected in a RAID-6 configuration, it should be > > > possible to figure out what should be fixed > > > > It can be figured out under the assumption that exactly one drive has > > bad data and all other ones have good data. But that seems to be an > > assumption that is hard to verify in reality. > > True, but it's what ECC memory does. :-) And most people agree that > it's a useful thing to do with memory. > > If you do ECC syndrome checking on every read, and follow that up with > periodic scrubbing so that you catch (and correct) errors quickly, it > is a reasonable assumption to make. > > Obviously a warning should be given when you do this kind of ECC > fixup, and if there is an increasing number of ECC fixups that are > being done, that should set off alarms that maybe there is a hardware > problem that needs to be addressed. I agree, but not everybody likes the idea of doing this kind of error correction also for hard disks in raid6 [1]. In case of a hard power failure it may well happen that any given subset of the disks in the array is up to date and all others are not. So in practice the situation for hard disks is different from that of memory modules. OTOH, it's probably the best thing one can do, so I'd vote for implementing this feature. Andre [1] http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09863.html -- The only person who always got his work done by Friday was Robinson Crusoe [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll @ 2008-03-21 14:02 ` Ric Wheeler 2008-03-21 20:19 ` NeilBrown 2 siblings, 0 replies; 44+ messages in thread From: Ric Wheeler @ 2008-03-21 14:02 UTC (permalink / raw) To: Theodore Tso; +Cc: Andre Noll, Bas van Schaik, linux-raid, Martin K. Petersen Theodore Tso wrote: > On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: >> On 12:35, Theodore Tso wrote: >> >>> If a mismatch is detected in a RAID-6 configuration, it should be >>> possible to figure out what should be fixed >> It can be figured out under the assumption that exactly one drive has >> bad data and all other ones have good data. But that seems to be an >> assumption that is hard to verify in reality. > > True, but it's what ECC memory does. :-) And most people agree that > it's a useful thing to do with memory. > > If you do ECC syndrome checking on every read, and follow that up with > periodic scrubbing so that you catch (and correct) errors quickly, it > is a reasonable assumption to make. > > Obviously a warning should be given when you do this kind of ECC > fixups, and if there is an increasing number of ECC fixups that are > being done, that should set off alarms that maybe there is a hardware > problem that needs to be addressed. > > Regards, > > - Ted This might have been stated before in the thread, but most of the raid rebuilds are triggered by easily identified drive failures (i.e., a completely dead drive or a sequence of bad sectors that generate an IO error as we read from the platter). Fortunately, these are also the most common failures in RAID boxes ;-) The way you deal with the class of errors that don't trigger obvious failures is to do some kind of background scrubbing or add extra protection data to the disk. Martin Petersen presented the new "DIF" work at the FS/IO workshop. This might be an interesting feature to build into MD raid devices: http://oss.oracle.com/projects/data-integrity/documentation/ You would need to reformat your drives, so this is not a generic solution for all users, but it really does address the core of the issue. ric ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll 2008-03-21 14:02 ` Ric Wheeler @ 2008-03-21 20:19 ` NeilBrown 2008-03-21 20:45 ` Ric Wheeler 2008-03-22 17:13 ` Bill Davidsen 2 siblings, 2 replies; 44+ messages in thread From: NeilBrown @ 2008-03-21 20:19 UTC (permalink / raw) To: Theodore Tso; +Cc: Andre Noll, Bas van Schaik, linux-raid On Fri, March 21, 2008 5:02 am, Theodore Tso wrote: > On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: >> On 12:35, Theodore Tso wrote: >> >> > If a mismatch is detected in a RAID-6 configuration, it should be >> > possible to figure out what should be fixed >> >> It can be figured out under the assumption that exactly one drive has >> bad data and all other ones have good data. But that seems to be an >> assumption that is hard to verify in reality. > > True, but it's what ECC memory does. :-) And most people agree that > it's a useful thing to do with memory. > > If you do ECC syndrome checking on every read, and follow that up with > periodic scrubbing so that you catch (and correct) errors quickly, it > is a reasonable assumption to make. My problem with this is that I don't have a good model for what might cause the error, so I cannot reason about what responses are justifiable. The analogy with ECC memory is, I think, poor. With ECC memory there are electro/physical processes which can cause a bit to change independently of any other bit with very low probability, so treating an ECC error as a single bit error is reasonable. The analogy with a disk drive would be a media error. However disk drives record CRC (or similar) checks so that media errors get reported as errors, not as incorrect data. So the analogy doesn't hold. Where else could the error come from? Presumably a bit-flip on some transfer bus between main memory and the media. There are several of these busses (mem to controller, controller to device, internal to device). The corruption could happen on the write or on the read. When you write to a RAID6 you often write several blocks to different devices at the same time. Are these really likely to be independent events wrt whatever is causing the corruption? I don't know. But without a clear model, it isn't clear to me that any particular action will be certain to improve the situation in all cases. And how often does silent corruption happen on modern hard drives? How often do you write something and later successfully read something else when it isn't due to a major hardware problem that is causing much more than just occasional errors? The ZFS people seem to say that their checksumming of all data shows up a lot of these cases. If that is true, how come people who don't use ZFS aren't reporting lots of data corruption? So yes: there are lots of things that *could* be done. But without a model for the "threat", an analysis of how the remedy would actually affect every different possible scenario, and some idea of the probability of the remedy being needed, it is very hard to justify a change of this sort. And there are plenty of other things to be coded that are genuinely useful - like converting a RAID5 to a RAID6 while online... NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 20:19 ` NeilBrown @ 2008-03-21 20:45 ` Ric Wheeler 0 siblings, 0 replies; 44+ messages in thread From: Ric Wheeler @ 2008-03-21 20:45 UTC (permalink / raw) To: NeilBrown; +Cc: Theodore Tso, Andre Noll, Bas van Schaik, linux-raid NeilBrown wrote: > On Fri, March 21, 2008 5:02 am, Theodore Tso wrote: >> On Thu, Mar 20, 2008 at 06:39:06PM +0100, Andre Noll wrote: >>> On 12:35, Theodore Tso wrote: >>> >>>> If a mismatch is detected in a RAID-6 configuration, it should be >>>> possible to figure out what should be fixed >>> It can be figured out under the assumption that exactly one drive has >>> bad data and all other ones have good data. But that seems to be an >>> assumption that is hard to verify in reality. >> True, but it's what ECC memory does. :-) And most people agree that >> it's a useful thing to do with memory. >> >> If you do ECC syndrome checking on every read, and follow that up with >> periodic scrubbing so that you catch (and correct) errors quickly, it >> is a reasonable assumption to make. > > My problem with this is that I don't have a good model for what might > cause the error, so I cannot reason about what responses are justifiable. > > The analogy with ECC memory is, I think, poor. With ECC memory there are > electro/physical processes which can cause a bit to change independently > of any other bit with very low probability, so treating an ECC error as > a single bit error is reasonable. > > The analogy with a disk drive would be a media error. However disk drives > record CRC (or similar) checks so that media errors get reported as errors, > not as incorrect data. So the analogy doesn't hold. The challenge is only when you don't get an error on the IO. If you have bad hardware somewhere off platter, you can get silent corruption. In this case, if you look at Martin's presentation on DIF, we could do something that a check could leverage on a per sector basis for software raid. > > Where else could the error come from? Presumably a bit-flip on some > transfer bus between main memory and the media. There are several > of these busses (mem to controller, controller to device, internal to > device). The corruption could happen on the write or on the read. > When you write to a RAID6 you often write several blocks to different > devices at the same time. Are these really likely to be independent > events wrt whatever is causing the corruption? > > I don't know. But without a clear model, it isn't clear to me that > any particular action will be certain to improve the situation in > all cases. It can come from a lot of things (see the recent papers from FAST and NetApp for example). > > And how often does silent corruption happen on modern hard drives? > How often do you write something and later successfully read something > else when it isn't due to a major hardware problem that is causing > much more than just occasional errors? > > The ZFS people seem to say that their checksumming of all data shows > up a lot of these cases. If that is true, how come people who > don't use ZFS aren't reporting lots of data corruption? > > So yes: there are lots of things that *could* be done. But without > a model for the "threat", an analysis of how the remedy would actually > affect every different possible scenario, and some idea of the > probability of the remedy being needed, it is very hard to > justify a change of this sort.
> And there are plenty of other things to be coded that are genuinely > useful - like converting a RAID5 to a RAID6 while online... > > NeilBrown I really think that we might be able to leverage the DIF standard if and when it rolls out. ric ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 20:19 ` NeilBrown 2008-03-21 20:45 ` Ric Wheeler @ 2008-03-22 17:13 ` Bill Davidsen 1 sibling, 0 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-22 17:13 UTC (permalink / raw) To: NeilBrown; +Cc: Theodore Tso, Andre Noll, Bas van Schaik, linux-raid NeilBrown wrote: > My problem with this is that I don't have a good model for what might > cause the error, so I cannot reason about what responses are justifiable. > > The analogy with ECC memory is, I think, poor. With ECC memory there are > electro/physical processes which can cause a bit to change independently > of any other bit with very low probability, so treating an ECC error as > a single bit error is reasonable. > > The analogy with a disk drive would be a media error. However disk drives > record CRC (or similar) checks so that media errors get reported as errors, > not as incorrect data. So the analogy doesn't hold. > > Where else could the error come from? Presumably a bit-flip on some > transfer bus between main memory and the media. There are several > of these busses (mem to controller, controller to device, internal to > device). The corruption could happen on the write or on the read. > When you write to a RAID6 you often write several blocks to different > devices at the same time. Are these really likely to be independent > events wrt whatever is causing the corruption? > Based on what I have read and seen, some of these errors come in pairs and are caused by a drive just writing to the wrong sector. This can come from errors in the O/S (unlikely), disk hardware (unlikely), or disk firmware (least unlikely). So you get the data written to the wrong place (which makes that stripe invalid) and parity change or mirror copies written to the right place(s). Thus, two bad stripes to be detected on "check," neither of which will return a hardware error on a read. > I don't know. But without a clear model, it isn't clear to me that > any particular action will be certain to improve the situation in > all cases. > > Agreed, the only cases I've identified where improvement is possible are raid1 with multiple copies, and raid6. Doing the recovery I outlined the other day will not make things better in all cases, but will never make things worse (statistically) and should recover both failures if the cause is "single misplaced write." > And how often does silent corruption happen on modern hard drives? > How often do you write something and later successfully read something > else when it isn't due to a major hardware problem that is causing > much more than just occasional errors? > > Very seldom; all my critical data is checked by software CRC, and these failures just don't happen. But I have owned drives in the past which had firmware revisions which had error rates as high as 2/10TB, which went away on the same drives after firmware updates. So while it is rare, it can and does happen occasionally. > So yes: there are lots of things that *could* be done. But without > a model for the "threat", an analysis of how the remedy would actually > affect every different possible scenario, and some idea of the > probability of the remedy being needed, it is very hard to > justify a change of this sort. > I hope I have provided a plausible model for one error source.
If I have identified the model correctly, errors will always happen in pairs, in normal operation rather than during some unclean system shutdown due to O/S crash or power failure. > And there are plenty of other things to be coded that are genuinely > useful - like converting a RAID5 to a RAID6 while online... > I would suggest that upgrading an array to larger drives is more common; having a fully automated upgrade path would be useful to far more users. So if I have (for example) four 320GB drives and want to upgrade to 500GB drives, I attach a 500GB drive and say something like "on /dev/md2 migrate /dev/sda1 to /dev/sde1" and have it done in such a way that it will fail safe and at the end sda1 will be out of the array and sde1 will be in, without having to enter multiple commands per drive to get this done. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck ^ permalink raw reply [flat|nested] 44+ messages in thread
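The closest existing path with plain mdadm looks roughly like the sketch below. The device names are Bill's examples; note that, unlike the single command he proposes, failing the old drive drops redundancy for the duration of the rebuild, so this is not the fail-safe path he is asking for:

    # Sketch: migrate /dev/md2 from sda1 to a larger sde1 with mdadm.
    mdadm /dev/md2 --add /dev/sde1     # new drive joins as a spare
    mdadm /dev/md2 --fail /dev/sda1    # old drive is kicked; rebuild onto sde1 starts
    # Wait until the rebuild has finished before pulling the old drive.
    while grep -q recovery /proc/mdstat; do sleep 60; done
    mdadm /dev/md2 --remove /dev/sda1
    # Once every member has been upgraded, the array can use the extra space:
    mdadm --grow /dev/md2 --size=max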
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill 2008-03-20 17:39 ` Andre Noll @ 2008-03-20 23:08 ` Peter Rabbitson 2008-03-21 14:24 ` Bill Davidsen 2008-03-25 4:24 ` Neil Brown 2 siblings, 2 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-20 23:08 UTC (permalink / raw) To: Theodore Tso; +Cc: Bas van Schaik, linux-raid Theodore Tso wrote: > On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>> There's no explicit message produced by the md module, no. You need to >>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>> mismatches there are. Similarly, following a repair this will indicate >>> how many mismatches it thinks have been fixed (by updating the parity >>> block to match the data blocks). >>> >> Marvellous! I naively assumed that the module would warn me, but that's >> not true. Wouldn't it be appropriate to print a message to dmesg if such >> a mismatch occurs during a check? Such a mismatch clearly means that >> there is something wrong with your hardware lying beneath md, doesn't it? > > If a mismatch is detected in a RAID-6 configuration, it should be > possible to figure out what should be fixed (since with two parity blocks > there should be enough redundancy not only to detect an error, but to > correct it.) Out of curiosity, does md do this automatically, either > when reading from a stripe, or during a resync operation? > In my modest experience with root/high performance spool on various raid levels I can pretty much conclude that the current check mechanism doesn't do enough to give power to the user. We can debate all we want about what the MD driver should do when it finds a mismatch, yet there is no way for the user to figure out what the mismatch is and take appropriate action. This does not apply only to RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the only wrong value is taken and written all over the other good blocks? I think that the solution is rather simple, and I would contribute a patch if I had any C experience. The current check mechanism remains the same - mismatch_cnt is incremented/reset just the same as before. However on every mismatching chunk the system printks the following: 1) the start offset of the chunk (md1/10) or stripe (md5/6) within the MD device 2) one line for every active disk containing: a) the offset of the chunk within the MD component b) a {md5|sha1}sum of the chunk In a common case array this will take no more than 8 lines in dmesg. However it will allow: 1) For a human to determine at a glance which disk holds a mismatching chunk in raid 1/10 2) Determine the same for raid 6 using a userspace tool which will calculate the parity for every possible permutation of chunks 3) using some external tools to determine which file might have been affected on the layered file system Now of course the problem remains how to repair the array using the information obtained above. I think the best way would be to extend the syntax of repair itself, so that: echo repair > .../sync_action would use the old heuristics echo repair <mdoffset> <component N> > .../sync_action will update the chunk on drive N which corresponds to the chunk/stripe at mdoffset within the MD device, using the information from the other drives, and not the other way around as might happen with just a repair. Just my 2c Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
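Until something like that exists in the kernel, the RAID1 part of the idea can be approximated from userspace. A rough sketch, assuming the offset of a suspect chunk is already known; the device names and the offset are made up, and the md superblock's data offset (if any for the format in use) has to be accounted for:

    # Hash the same 64KiB chunk on each leg of a RAID1 pair to see
    # which copy disagrees. skip= counts 64k blocks from the start of
    # the member device and must include the md data offset, if any.
    for dev in /dev/sda1 /dev/sdb1; do
        printf '%s: ' "$dev"
        dd if="$dev" bs=64k skip=12345 count=1 2>/dev/null | md5sum
    done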
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 23:08 ` Peter Rabbitson @ 2008-03-21 14:24 ` Bill Davidsen 2008-03-21 14:52 ` Peter Rabbitson 0 siblings, 1 reply; 44+ messages in thread From: Bill Davidsen @ 2008-03-21 14:24 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Theodore Tso, Bas van Schaik, linux-raid Peter Rabbitson wrote: > Theodore Tso wrote: >> On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>>> There's no explicit message produced by the md module, no. You >>>> need to >>>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>>> mismatches there are. Similarly, following a repair this will >>>> indicate >>>> how many mismatches it thinks have been fixed (by updating the parity >>>> block to match the data blocks). >>>> >>> Marvellous! I naively assumed that the module would warn me, but that's >>> not true. Wouldn't it be appropriate to print a message to dmesg if >>> such >>> a mismatch occurs during a check? Such a mismatch clearly means that >>> there is something wrong with your hardware lying beneath md, >>> doesn't it? >> >> If a mismatch is detected in a RAID-6 configuration, it should be >> possible to figure out what should be fixed (since with two parity blocks >> there should be enough redundancy not only to detect an error, but to >> correct it.) Out of curiosity, does md do this automatically, either >> when reading from a stripe, or during a resync operation? >> > > In my modest experience with root/high performance spool on various > raid levels I can pretty much conclude that the current check > mechanism doesn't do enough to give power to the user. We can debate > all we want about what the MD driver should do when it finds a > mismatch, yet there is no way for the user to figure out what the > mismatch is and take appropriate action. This does not apply only to > RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the only > wrong value is taken and written all over the other good blocks? > > I think that the solution is rather simple, and I would contribute a > patch if I had any C experience. The current check mechanism remains > the same - mismatch_cnt is incremented/reset just the same as before. > However on every mismatching chunk the system printks the following: > > 1) the start offset of the chunk (md1/10) or stripe (md5/6) within the > MD device > 2) one line for every active disk containing: > a) the offset of the chunk within the MD component > b) a {md5|sha1}sum of the chunk > > In a common case array this will take no more than 8 lines in dmesg. > However it will allow: > > 1) For a human to determine at a glance which disk holds a mismatching > chunk in raid 1/10 > 2) Determine the same for raid 6 using a userspace tool which will > calculate the parity for every possible permutation of chunks > 3) using some external tools to determine which file might have been > affected on the layered file system > > > Now of course the problem remains how to repair the array using the > information obtained above. I think the best way would be to extend > the syntax of repair itself, so that: > > echo repair > .../sync_action would use the old heuristics > > echo repair <mdoffset> <component N> > .../sync_action will update the > chunk on drive N which corresponds to the chunk/stripe at mdoffset > within the MD device, using the information from the other drives, and > not the other way around as might happen with just a repair.
I totally agree, not doing the most likely to be correct thing seems to be the one argument for hardware raid. There are two cases in which software can determine (a) if it is likely that there is a single bad block, and (b) what the correct value for that block is. raid1 - more than one copy If there are multiple copies of the data, and N-1 agree, then it is more likely that the mismatched copy is the bad one, and should be rewritten with the data in the other copies. This is never less likely to be correct than selecting one copy at random and writing it over all others, so it can only be a help. raid6 - assume and check Given an error in raid6, if parity A appears correct and parity B does not, assume that the non-matching parity is bad and regenerate. If neither parity appears correct, for each data block assume it is bad and recalculate a recovery value using the A and B parities. If the data pattern generated is the same for recovery using either parity, assume that the data is bad and rewrite. Again, this is more likely to be correct than assuming that both parities are wrong. Obviously if no "most likely" bad data or parity information can be identified then recalculating both parity blocks is the only way to "fix" the array, but it leaves undetectable bad data. I would like an option to do repairs using these two methods, which would give a high probability that whatever "fixes" were applied were actually recovering the correct data. Yes, I know that errors like this are less common than pure hardware errors; does that justify something less than best practice during recovery? -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck ^ permalink raw reply [flat|nested] 44+ messages in thread
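For the raid6 "assume and check" case, the underlying algebra can be stated precisely. This follows H. Peter Anvin's "The mathematics of RAID-6" paper and sketches the standard single-error location; it is not a description of what md currently does:

    % RAID-6 syndromes over GF(2^8), data blocks D_0 .. D_{n-1}, generator g:
    \[ P = \bigoplus_{i=0}^{n-1} D_i, \qquad Q = \bigoplus_{i=0}^{n-1} g^i D_i \]
    % If exactly one data block D_z was corrupted to D'_z, recomputing the
    % syndromes from the observed data as P' and Q' gives:
    \[ P \oplus P' = D_z \oplus D'_z, \qquad Q \oplus Q' = g^z (D_z \oplus D'_z) \]
    % so the index of the bad device falls out of the ratio:
    \[ z = \log_g \left( (Q \oplus Q') / (P \oplus P') \right) \]
    % If only P' (or only Q') mismatches, the corresponding parity block
    % itself is the one that is bad.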
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 14:24 ` Bill Davidsen @ 2008-03-21 14:52 ` Peter Rabbitson 2008-03-21 17:13 ` Theodore Tso 2008-03-21 23:01 ` Bill Davidsen 0 siblings, 2 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-21 14:52 UTC (permalink / raw) To: Bill Davidsen; +Cc: Theodore Tso, Bas van Schaik, linux-raid Bill Davidsen wrote: > Peter Rabbitson wrote: >> Theodore Tso wrote: >>> On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>>>> There's no explicit message produced by the md module, no. You >>>>> need to >>>>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>>>> mismatches there are. Similarly, following a repair this will >>>>> indicate >>>>> how many mismatches it thinks have been fixed (by updating the parity >>>>> block to match the data blocks). >>>>> >>>> Marvellous! I naively assumed that the module would warn me, but that's >>>> not true. Wouldn't it be appropriate to print a message to dmesg if >>>> such >>>> a mismatch occurs during a check? Such a mismatch clearly means that >>>> there is something wrong with your hardware lying beneath md, >>>> doesn't it? >>> >>> If a mismatch is detected in a RAID-6 configuration, it should be >>> possible to figure out what should be fixed (since with two parity blocks >>> there should be enough redundancy not only to detect an error, but to >>> correct it.) Out of curiosity, does md do this automatically, either >>> when reading from a stripe, or during a resync operation? >>> >> >> In my modest experience with root/high performance spool on various >> raid levels I can pretty much conclude that the current check >> mechanism doesn't do enough to give power to the user. We can debate >> all we want about what the MD driver should do when it finds a >> mismatch, yet there is no way for the user to figure out what the >> mismatch is and take appropriate action. This does not apply only to >> RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the only >> wrong value is taken and written all over the other good blocks? >> >> I think that the solution is rather simple, and I would contribute a >> patch if I had any C experience. The current check mechanism remains >> the same - mismatch_cnt is incremented/reset just the same as before. >> However on every mismatching chunk the system printks the following: >> >> 1) the start offset of the chunk (md1/10) or stripe (md5/6) within the >> MD device >> 2) one line for every active disk containing: >> a) the offset of the chunk within the MD component >> b) a {md5|sha1}sum of the chunk >> >> In a common case array this will take no more than 8 lines in dmesg. >> However it will allow: >> >> 1) For a human to determine at a glance which disk holds a mismatching >> chunk in raid 1/10 >> 2) Determine the same for raid 6 using a userspace tool which will >> calculate the parity for every possible permutation of chunks >> 3) using some external tools to determine which file might have been >> affected on the layered file system >> >> >> Now of course the problem remains how to repair the array using the >> information obtained above.
>> I think the best way would be to extend >> the syntax of repair itself, so that: >> >> echo repair > .../sync_action would use the old heuristics >> >> echo repair <mdoffset> <component N> > .../sync_action will update the >> chunk on drive N which corresponds to the chunk/stripe at mdoffset >> within the MD device, using the information from the other drives, and >> not the other way around as might happen with just a repair. > > I totally agree, not doing the most likely to be correct thing seems to > be the one argument for hardware raid. There are two cases in which > software can determine (a) if it is likely that there is a single bad > block, and (b) what the correct value for that block is. > > <snip> > I was actually specifically advocating that md must _not_ do anything on its own. Just provide the hooks to get information (what is the current stripe state) and update information (the described repair extension). The logic that you are describing can live only in an external app, it has no place in-kernel. Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 14:52 ` Peter Rabbitson @ 2008-03-21 17:13 ` Theodore Tso 2008-03-21 17:35 ` Peter Rabbitson 2008-03-21 17:43 ` Robin Hill 2008-03-21 23:01 ` Bill Davidsen 1 sibling, 2 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-21 17:13 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Bill Davidsen, Bas van Schaik, linux-raid On Fri, Mar 21, 2008 at 03:52:31PM +0100, Peter Rabbitson wrote: > I was actually specifically advocating that md must _not_ do anything on > its own. Just provide the hooks to get information (what is the current > stripe state) and update information (the described repair extension). The > logic that you are describing can live only in an external app, it has no > place in-kernel. Why not? If md doesn't do anything on its own, then when it detects a disagreement between the data and the two parity blocks, it has two choices (a) return possibly incorrect data to the application, or (b) return an I/O error and cause the application to blow up. Sure, it could then give the information so that the external repair tool can fix it up after the fact, but that seems like a really lousy thing to do as far as the original application is concerned. (Or I suppose you could try to block the userspace application until the repair tool has a chance to do automatically what md could have done automatically in the kernel anyway, but that has other problems.) So what's the harm in having an option where md does exactly what ECC memory does, which is when it can fix things up, to do so? I bet most system administrators would turn it on in a heartbeat. - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 17:13 ` Theodore Tso @ 2008-03-21 17:35 ` Peter Rabbitson 2008-03-22 13:27 ` Theodore Tso 2008-03-21 17:43 ` Robin Hill 1 sibling, 1 reply; 44+ messages in thread From: Peter Rabbitson @ 2008-03-21 17:35 UTC (permalink / raw) To: Theodore Tso; +Cc: Bill Davidsen, Bas van Schaik, linux-raid Theodore Tso wrote: > On Fri, Mar 21, 2008 at 03:52:31PM +0100, Peter Rabbitson wrote: >> I was actually specifically advocating that md must _not_ do anything on >> its own. Just provide the hooks to get information (what is the current >> stripe state) and update information (the described repair extension). The >> logic that you are describing can live only in an external app, it has no >> place in-kernel. > > Why not? If md doesn't do anything on its own, then when it detects a > disagreement between the data and the two parity blocks, it has two > choices (a) return possibly incorrect data to the application, or (b) > return an I/O error and cause the application to blow up. > > <snip> > > So what's the harm in having an option where md does exactly what ECC > memory does, which is when it can fix things up, to do so? I bet most > system administrators would turn it on in a heartbeat. > With ECC memory you are checking for inconsistency on _every_single_read_ whereas the md scrubbing happens at best once a month if the admin turned the feature on. Moreover when md actually detects a mismatch the overwhelming chance is nobody needs this block at this moment, and might not need it for days to come. I think what is eluding this thread is the fact that md does not read _any_ redundant blocks unless it absolutely has to. And when it has to - you already have a missing chunk and can not apply ECC techniques either. Of course it would be possible to instruct md to always read all data+parity chunks and make a comparison on every read. The performance would not be much to write home about though. Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 17:35 ` Peter Rabbitson @ 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Theodore Tso @ 2008-03-22 13:27 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Bill Davidsen, Bas van Schaik, linux-raid On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: > > Of course it would be possible to instruct md to always read all > data+parity chunks and make a comparison on every read. The performance > would not be much to write home about though. Yeah, and that's probably the real problem with this scheme. You basically reduce the read bandwidth of your array down to a single (slowest) disk --- basically the same reason why RAID-2 is a commercial failure. I suspect the best thing we *can* do is, for filesystems that include checksums in the metadata and/or the data blocks, if the CRC doesn't match, to have the filesystem tell the RAID subsystem, "um, could you send me copies of the data from all of the RAID-1 mirrors, and see if one of the copies from the mirrors causes a valid checksum". Something similar could be done with RAID-5/RAID-6 arrays, if the fs layer could ask the RAID subsystem, "the external checksum for this block is bad; can you recalculate it from all available parity stripes assuming the data stripe is invalid". Ext4 has metadata checksums; U Wisconsin's Iron filesystem (sponsored with a grant from EMC) did it for both data and metadata, if memory serves me correctly. ZFS smashed through the RAID abstraction barrier and sucked up RAID functionality into the filesystem so they could do this sort of thing; but with the right new set of interfaces, it should be possible to add this functionality without reimplementing RAID in each filesystem. As far as the question of how often this happens, where a disk silently corrupts a block without returning a media error, it definitely happens. Larry McVoy tells a story of periodically running a per-file CRC across backup/archival filesystems, and was able to detect files that had not been modified changing out from under him. One way this can happen is if the disk accidentally writes some block to the wrong location on disk; the blockguard extension and various enterprise databases (since they can control their db-specific on-disk format) will encode the intended location of a block in their per-block checksums, to detect this specific type of failure, which should be a broad hint that this sort of thing can and does happen. Does it happen as much as ZFS's marketing literature implies? Probably not. But as you start making bigger and bigger filesystems, the chances that even relatively improbable errors happen start increasing significantly. Of course, the flip side of the argument is that if you are using the huge arrays to store things like music and video, maybe you don't care about a small amount of data corruption, since it might not be noticeable to the human eye/ear. That's a pretty weak argument though, and it sends shivers up the spines of people who are storing, for example, medical images of X-ray or CAT scans. - Ted ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 13:27 ` Theodore Tso @ 2008-03-22 14:00 ` Bas van Schaik 2008-03-25 4:44 ` Neil Brown 2008-03-25 9:19 ` Mattias Wadenstein 2 siblings, 0 replies; 44+ messages in thread From: Bas van Schaik @ 2008-03-22 14:00 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-raid Theodore Tso wrote: > On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: > >> Of course it would be possible to instruct md to always read all >> data+parity chunks and make a comparison on every read. The performance >> would not be much to write home about though. >> > > (...) > > Does it happen as much as ZFS's marketing literature implies? > Probably not. But as you start making bigger and bigger filesystems, > the chances that even relatively improbable errors happen start > increasing significantly. Of course, the flip side of the argument is > that if you are using the huge arrays to store things like music and > video, maybe you don't care about a small amount of data corruption, > since it might not be noticeable to the human eye/ear. That's a > pretty weak argument though, and it sends shivers up the spines of > people who are storing, for example, medical images of X-ray or CAT > scans. > I totally agree with you, Ted, although I think your idea of a filesystem communicating with RAID in a sophisticated way kind of conflicts with the "layered approach" which is chosen in the world of Linux. Should that be a reason not to implement this feature? I don't think so. Although most of you sketch scenarios in which it is very rare that corruptions occur, I think you should also take into account that storage is booming and growing like never before. This trend has caused people (like me) to use other media to transfer and store data, using the network for example. The assumption that data corruption is rare because the bus and the disk are very reliable doesn't hold anymore: other ways of communication are much more sensitive to corruption. Of course, protection against these types of corruption should be implemented in the appropriate layer (using checksums over packets, like TCP does), but I think it is a little bit naive to assume that this will succeed in all cases. On the other hand it would not make sense to read every block after writing it (to check its consistency), but it might be a nice feature to extend the monthly consistency check with advanced error reporting features. Users who don't care (storing music and video, using Ted's example) would disable this check; administrators like me (storing large amounts of medical data) could run this check every week or so. Regards, -- Bas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik @ 2008-03-25 4:44 ` Neil Brown 2008-03-25 15:17 ` Bill Davidsen 2008-03-25 9:19 ` Mattias Wadenstein 2 siblings, 1 reply; 44+ messages in thread From: Neil Brown @ 2008-03-25 4:44 UTC (permalink / raw) To: Theodore Tso; +Cc: Peter Rabbitson, Bill Davidsen, Bas van Schaik, linux-raid On Saturday March 22, tytso@MIT.EDU wrote: > On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: > > > > Of course it would be possible to instruct md to always read all > > data+parity chunks and make a comparison on every read. The performance > > would not be much to write home about though. > > Yeah, and that's probably the real problem with this scheme. You > basically reduce the read bandwidth of your array down to a single > (slowest) disk --- basically the same reason why RAID-2 is a > commercial failure. Exactly. > > I suspect the best thing we *can* do, for filesystems that > include checksums in the metadata and/or the data blocks, is if the > CRC doesn't match, to have the filesystem tell the RAID subsystem, > "um, could you send me copies of the data from all of the RAID-1 > mirrors, and see if one of the copies from the mirrors causes a valid > checksum". Something similar could be done with RAID-5/RAID-6 arrays, > if the fs layer could ask the RAID subsystem, "the external checksum > for this block is bad; can you recalculate it from all available > parity stripes assuming the data stripe is invalid". Something along these lines would be very appropriate I think. Particularly for raid1. For raid5/raid6 it is possible that a valid block in the same stripe was read and written before the faulty block was read. This would correct the parity so when the bad block was found, there would be no way to recover the correct data. Still, having the possibility of recovery might be better than not having it. > > As far as the question of how often this happens, where a disk > silently corrupts a block without returning a media error, it > definitely happens. Larry McVoy tells a story of periodically running > a per-file CRC across backup/archival filesystems, and was able to > detect files that had not been modified changing out from under him. > One way this can happen is if the disk accidentally writes some block > to the wrong location on disk; the blockguard extension and various > enterprise databases (since they can control their db-specific on-disk > format) will encode the intended location of a block in their > per-block checksums, to detect this specific type of failure, which > should be a broad hint that this sort of thing can and does happen. The "address data was corrupted" is certainly a credible possibility. I remember reading that SCSI has a parity check for data, but not for the command, which includes the storage address. With the raid6 algorithm, we can tell which device has an error (assuming only one device does) for each byte in the block. If this returns the same device for every byte in the block, it is probably reasonable to assume that exactly that block is bad. Still, if we only do that on the monthly 'check', it could be too late. I'm not sure that "surviving some data corruptions, if you are lucky" is really better than surviving none. We don't want to provide a false sense of security.... but maybe RAID already does that. A filesystem that always writes full stripes and never over-writes valid data. 
And that (optionally) stores checksums for everything is looking more and more appealing. The trouble is, I don't seem to have enough "spare time" :-) NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
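Neil's observation that raid6 can identify the erring device deserves a worked example. Below is a self-contained sketch of the standard GF(2^8) algebra (generator 2, polynomial 0x11d) behind the md raid6 code; the drive count and byte values are invented, and the kernel naturally uses optimised table lookups rather than these naive loops. The idea: a single corrupt data drive z shifts the P syndrome by the error value e and the Q syndrome by g^z * e, so the ratio of the two deltas reveals z.

def gf_mul(a, b):
    # Multiply in GF(2^8) modulo the raid6 polynomial x^8+x^4+x^3+x^2+1.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r

# log/antilog tables for the generator g = 2
EXP, LOG = [0] * 255, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x = gf_mul(x, 2)

def syndromes(data):
    # P = xor of the data bytes, Q = sum over GF(2^8) of g^i * D_i
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q

def locate(data, p_stored, q_stored):
    p, q = syndromes(data)
    dp, dq = p ^ p_stored, q ^ q_stored
    if not dp and not dq:
        return None                     # consistent at this byte position
    if dp and dq:
        # Data drive z corrupt: dp = e, dq = g^z * e, so z = log(dq/dp).
        # A result >= len(data) means the single-error assumption fails.
        return (LOG[dq] - LOG[dp]) % 255
    return "P" if dq == 0 else "Q"      # only one syndrome off: that parity block is bad

data = [0x12, 0x34, 0x56, 0x78]         # one byte from each of 4 data drives
p, q = syndromes(data)
data[2] ^= 0x5A                         # silent corruption on drive 2
print(locate(data, p, q))               # -> 2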
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-25 4:44 ` Neil Brown @ 2008-03-25 15:17 ` Bill Davidsen 0 siblings, 0 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-25 15:17 UTC (permalink / raw) To: Neil Brown; +Cc: Theodore Tso, Peter Rabbitson, Bas van Schaik, linux-raid Neil Brown wrote: > On Saturday March 22, tytso@MIT.EDU wrote: > >> On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: >> >>> Of course it would be possible to instruct md to always read all >>> data+parity chunks and make a comparison on every read. The performance >>> would not be much to write home about though. >>> >> Yeah, and that's probably the real problem with this scheme. You >> basically reduce the read bandwidth of your array down to a single >> (slowest) disk --- basically the same reason why RAID-2 is a >> commercial failure. >> > > Exactly. > > In some cases that would be acceptable. Obviously in the general case it's not required. >> I suspect the best thing we *can* do, for filesystems that >> include checksums in the metadata and/or the data blocks, is if the >> CRC doesn't match, to have the filesystem tell the RAID subsystem, >> "um, could you send me copies of the data from all of the RAID-1 >> mirrors, and see if one of the copies from the mirrors causes a valid >> checksum". Something similar could be done with RAID-5/RAID-6 arrays, >> if the fs layer could ask the RAID subsystem, "the external checksum >> for this block is bad; can you recalculate it from all available >> parity stripes assuming the data stripe is invalid". >> > > Something along these lines would be very appropriate I think. > Particularly for raid1. > For raid5/raid6 it is possible that a valid block in the same stripe > was read and written before the faulty block was read. This would > correct the parity so when the bad block was found, there would be no > way to recover the correct data. > Still, having the possibility of recovery might be better than not > having it. > > >> As far as the question of how often this happens, where a disk >> silently corrupts a block without returning a media error, it >> definitely happens. Larry McVoy tells a story of periodically running >> a per-file CRC across backup/archival filesystems, and was able to >> detect files that had not been modified changing out from under him. >> One way this can happen is if the disk accidentally writes some block >> to the wrong location on disk; the blockguard extension and various >> enterprise databases (since they can control their db-specific on-disk >> format) will encode the intended location of a block in their >> per-block checksums, to detect this specific type of failure, which >> should be a broad hint that this sort of thing can and does happen. >> > > The "address data was corrupted" is certainly a credible possibility. > I remember reading that SCSI has a parity check for data, but not for > the command, which includes the storage address. > > With the raid6 algorithm, we can tell which device has an error > (assuming only one device does) for each byte in the block. > If this returns the same device for every byte in the block, it is > probably reasonable to assume that exactly that block is bad. > Still, if we only do that on the monthly 'check', it could be too > late. > > I think the old saying "better late than never" applies: once the user knows that there is a problem via 'check' and fixes it if possible, some form of recovery would then at least be possible. 
> I'm not sure that "surviving some data corruptions, if you are lucky" > is really better than surviving none. We don't want to provide a > false sense of security.... but maybe RAID already does that. > > A filesystem that always writes full stripes and never over-writes > valid data. And that (optionally) stores checksums for everything is > looking more and more appealing. The trouble is, I don't seem to have > enough "spare time" :-) > Frankly I think your limited time is better spent on raid; there are undoubtedly plenty of things on your "to do" list. I'd like to hope that raid5e is at least on that list, but I would be the first to say that performance improvements for raid5 would benefit more people. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik 2008-03-25 4:44 ` Neil Brown @ 2008-03-25 9:19 ` Mattias Wadenstein 2 siblings, 0 replies; 44+ messages in thread From: Mattias Wadenstein @ 2008-03-25 9:19 UTC (permalink / raw) To: Theodore Tso; +Cc: Peter Rabbitson, Bill Davidsen, Bas van Schaik, linux-raid On Sat, 22 Mar 2008, Theodore Tso wrote: > On Fri, Mar 21, 2008 at 06:35:43PM +0100, Peter Rabbitson wrote: >> >> Of course it would be possible to instruct md to always read all >> data+parity chunks and make a comparison on every read. The performance >> would not be much to write home about though. > > Yeah, and that's probably the real problem with this scheme. You > basically reduce the read bandwidth of your array down to a single > (slowest) disk --- basically the same reason why RAID-2 is a > commercial failure. I don't really see this as a problem. Most of my filesystems are not anywhere near their performance limit and reading a strip from the parity disks as well as all the strips from the data disks in a raid6 setup probably would be less than a 50% performance hit, so I would very much appreciate a "paranoid parity check on every read" flag to set on _some_ of my raids. /Mattias Wadenstein ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 17:13 ` Theodore Tso 2008-03-21 17:35 ` Peter Rabbitson @ 2008-03-21 17:43 ` Robin Hill 1 sibling, 0 replies; 44+ messages in thread From: Robin Hill @ 2008-03-21 17:43 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2763 bytes --] On Fri Mar 21, 2008 at 01:13:48PM -0400, Theodore Tso wrote: > On Fri, Mar 21, 2008 at 03:52:31PM +0100, Peter Rabbitson wrote: > > I was actually specifically advocating that md must _not_ do anything on > > its own. Just provide the hooks to get information (what is the current > > stripe state) and update information (the described repair extension). The > > logic that you are describing can live only in an external app, it has no > > place in-kernel. > > Why not? If md doesn't do anything on its own, then when it detects a > disagreement between the data and the two parity blocks, it has two > choices (a) return possibly incorrect data to the application, or (b) > return an I/O error and cause the application to blow up. > > Sure, it could then give the information so that the external repair > tool can fix it up after the fact, but that seems like a really lousy > thing to do as far as the original application is concerned. (Or I > suppose you could try to block the userspace application until the > repair tool has a chance to do automatically what md could have done > automatically in the kernel anyway, but that has other problems.) > > So what's the harm in having an option where md does exactly what ECC > memory does, which is when it can fix things up, to do so? I bet most > system administrators would turn it on in a heartbeat. > Depends on how you look at things. ECC memory is designed to deal with occasional mismatches caused by such obscure and rare events as cosmic radiation. RAID subsystems, on the other hand, are designed to deal with catastrophic failures of one (or more) drives. There's no trivially explainable reason why a drive would sporadically suffer from incorrect data reading/writing (unlike with ECC memory) so there's no recovery case. Admittedly, it would be possible to do this, but that would mean adding an extra read penalty on every RAID read (and, in some situations, throwing away the advantages of parallelism) in order to cover the exceptionally rare case where a drive has (for unknown reason) written the wrong data. Personally, this would be an option I'd avoid like the plague. If I know there's an issue then I replace the hardware; otherwise I expect the system to work as fast as possible on the assumption that all is correct. Admittedly, a check/repair option to view/select how the blocks are recovered might be useful, but I'd also see this sitting well outside the md code. Cheers, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 14:52 ` Peter Rabbitson 2008-03-21 17:13 ` Theodore Tso @ 2008-03-21 23:01 ` Bill Davidsen 2008-03-21 23:45 ` Carlos Carvalho 2008-03-21 23:55 ` Robin Hill 1 sibling, 2 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-21 23:01 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Theodore Tso, Bas van Schaik, linux-raid Peter Rabbitson wrote: > I was actually specifically advocating that md must _not_ do anything > on its own. Just provide the hooks to get information (what is the > current stripe state) and update information (the described repair > extension). The logic that you are describing can live only in an > external app, it has no place in-kernel. So you advocate the current code being in the kernel, which absent a hardware error makes blind assumptions about which data is valid and which is not and in all cases hides the problem, instead of the code I proposed, which in some cases will be able to avoid action which is provably wrong and never be less likely to do the wrong thing than the current code? Currently the "repair" action (which *is* in the kernel now) takes no advantage of the additional information available in these cases I noted. By what logic do you conclude that the user meant "hide the error" when using the "repair" action? What I propose is never less likely to be correct than what the current code does, why would you not want to improve the chances of getting the repair correct? -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:01 ` Bill Davidsen @ 2008-03-21 23:45 ` Carlos Carvalho 2008-03-22 17:19 ` Bill Davidsen 0 siblings, 1 reply; 44+ messages in thread From: Carlos Carvalho @ 2008-03-21 23:45 UTC (permalink / raw) To: linux-raid Bill Davidsen (davidsen@tmr.com) wrote on 21 March 2008 19:01: >Peter Rabbitson wrote: >> I was actually specifically advocating that md must _not_ do anything ************************* >> on its own. Just provide the hooks to get information (what is the ********** >> current stripe state) and update information (the described repair >> extension). The logic that you are describing can live only in an >> external app, it has no place in-kernel. > >So you advocate the current code being in the kernel, which absent a >hardware error makes blind assumptions about which data is valid and >which is not and in all cases hides the problem, instead of the code I >proposed, which in some cases will be able to avoid action which is >provably wrong and never be less likely to do the wrong thing than the >current code? The current code doesn't do anything on its own; it must be invoked by the user, which is an important difference. I agree that blindly setting parity is not good; that's an argument for removing it from the kernel, not adding something :-) Why is it there? This is for Neil to answer; I merely conjecture that it was already there. For example, it's necessary after a raid5 array is created, because it's done by creating an n-1 degraded array and adding the last disk afterwards. It's also done when an array is dirty. This is a situation where it's done without asking the user but it seems to me that in this case that's the right action: if the parity doesn't agree with the data it's either because the parity was not yet updated at the moment of the unclean shutdown or because it was updated but not the data itself. In both cases the parity should reflect the current data situation. The /sys/..../sync_action is just an interface added much later to trigger the code. The check action is useful but I think repair is too risky. I doubt it should be available. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:45 ` Carlos Carvalho @ 2008-03-22 17:19 ` Bill Davidsen 0 siblings, 0 replies; 44+ messages in thread From: Bill Davidsen @ 2008-03-22 17:19 UTC (permalink / raw) To: Carlos Carvalho; +Cc: linux-raid Carlos Carvalho wrote: > Bill Davidsen (davidsen@tmr.com) wrote on 21 March 2008 19:01: > >Peter Rabbitson wrote: > >> I was actually specifically advocating that md must _not_ do anything > ************************* > >> on its own. Just provide the hooks to get information (what is the > ********** > >> current stripe state) and update information (the described repair > >> extension). The logic that you are describing can live only in an > >> external app, it has no place in-kernel. > > > >So you advocate the current code being in the kernel, which absent a > >hardware error makes blind assumptions about which data is valid and > >which is not and in all cases hides the problem, instead of the code I > >proposed, which in some cases will be able to avoid action which is > >provably wrong and never be less likely to do the wrong thing than the > >current code? > > The current code doesn't do anything on its own; it must be invoked by > the user, which is an important difference. > > Difference from what? Is issuing the 'repair' action on its own? How would adding code which lets that repair have a higher chance of success be bad? Sector consistency errors don't show up during normal operation; there's no hardware error, just bad data. They only show up during 'check' or 'repair,' so the recovery would never be triggered without express user request. > I agree that blindly setting parity is not good; that's an argument > for removing it from the kernel, not adding something :-) > > Why is it there? This is for Neil to answer; I merely conjecture that > it was already there. For example, it's necessary after a raid5 array > is created, because it's done by creating an n-1 degraded array and > adding the last disk afterwards. It's also done when an array is > dirty. This is a situation where it's done without asking the user but > it seems to me that in this case that's the right action: if the > parity doesn't agree with the data it's either because the parity was > not yet updated at the moment of the unclean shutdown or because it > was updated but not the data itself. In both cases the parity should > reflect the current data situation. > > The /sys/..../sync_action is just an interface added much later to > trigger the code. The check action is useful but I think repair is too > risky. I doubt it should be available. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:01 ` Bill Davidsen 2008-03-21 23:45 ` Carlos Carvalho @ 2008-03-21 23:55 ` Robin Hill 2008-03-22 10:03 ` Peter Rabbitson 1 sibling, 1 reply; 44+ messages in thread From: Robin Hill @ 2008-03-21 23:55 UTC (permalink / raw) To: linux-raid [-- Attachment #1: Type: text/plain, Size: 2298 bytes --] On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: > Peter Rabbitson wrote: >> I was actually specifically advocating that md must _not_ do anything on >> its own. Just provide the hooks to get information (what is the current >> stripe state) and update information (the described repair extension). The >> logic that you are describing can live only in an external app, it has no >> place in-kernel. > > So you advocate the current code being in the kernel, which absent a > hardware error makes blind assumptions about which data is valid and which > is not and in all cases hides the problem, instead of the code I proposed, > which in some cases will be able to avoid action which is provably wrong > and never be less likely to do the wrong thing than the current code? > I would certainly advocate that the current (entirely automatic) code belongs in the kernel whereas any code requiring user intervention/decision making belongs in a user process, yes. That's not to say that the former should be preferred over the latter though, but there's really no reason to remove the in-kernel automated process until (or even after) a user-side repair process has been coded. > Currently the "repair" action (which *is* in the kernel now) takes no > advantage of the additional information available in these cases I noted. > By what logic do you conclude that the user meant "hide the error" when > using the "repair" action? What I propose is never less likely to be > correct than what the current code does, why would you not want to improve > the chances of getting the repair correct? > That is, of course, a separate issue to whether it should be in-kernel. I would entirely agree that user-level processes should be able to access and manipulate the low-level RAID data/metadata (via the md layer) in order to facilitate more advanced repair functions, but this should be separate from, and in addition to, the "ignorant" parity-updating repair process currently in place. Just my 2p, Robin -- ___ ( ' } | Robin Hill <robin@robinhill.me.uk> | / / ) | Little Jim says .... | // !! | "He fallen in de water !!" | [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-21 23:55 ` Robin Hill @ 2008-03-22 10:03 ` Peter Rabbitson 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson 0 siblings, 2 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-22 10:03 UTC (permalink / raw) To: linux-raid Robin Hill wrote: > On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: > >> Peter Rabbitson wrote: >>> I was actually specifically advocating that md must _not_ do anything on >>> its own. Just provide the hooks to get information (what is the current >>> stripe state) and update information (the described repair extension). The >>> logic that you are describing can live only in an external app, it has no >>> place in-kernel. >> So you advocate the current code being in the kernel, which absent a >> hardware error makes blind assumptions about which data is valid and which >> is not and in all cases hides the problem, instead of the code I proposed, >> which in some cases will be able to avoid action which is provably wrong >> and never be less likely to do the wrong thing than the current code? >> > I would certainly advocate that the current (entirely automatic) code > belongs in the kernel whereas any code requiring user > intervention/decision making belongs in a user process, yes. That's not > to say that the former should be preferred over the latter though, but > there's really no reason to remove the in-kernel automated process until > (or even after) a user-side repair process has been coded. I am asserting that automatic repair is infeasible in most highly-redundant cases. Let's take the root raid1 of one of my busiest servers: /dev/md0: Version : 00.90.03 Creation Time : Tue Mar 20 21:58:54 2007 Raid Level : raid1 Array Size : 6000128 (5.72 GiB 6.14 GB) Used Dev Size : 6000128 (5.72 GiB 6.14 GB) Raid Devices : 4 Total Devices : 4 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sat Mar 22 05:55:08 2008 State : clean Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) Events : 0.183270 As you can see it is pretty old, and does not have many events to speak of. Yet every month when the automatic check is issued I get between 512 and 2048 in mismatch_cnt. I maintain md5sums of all files on this filesystem, and there were no deviations for the lifetime of the array (of course there are mismatches after upgrades, after log appends etc, but they are all expected). So all I can do with this array is issue a blind repair, without even having the chance to find what exactly is causing this. Yes, it is raid1 and I could do 1:1 comparison to find which is the offending block. How about raid10 -n f3? There is no way I can figure out _what_ is giving me a problem. I do not know if it is a hardware error (the md5 sums speak against it), some process with weird write patterns resulting in heavy DMA, or a bug in md itself. By the way there is no swap file on this array. Just / and /var, with a moderately busy mail spool on top. >> Currently the "repair" action (which *is* in the kernel now) takes no >> advantage of the additional information available in these cases I noted. >> By what logic do you conclude that the user meant "hide the error" when >> using the "repair" action? 
What I propose is never less likely to be >> correct than what the current code does, why would you not want to improve >> the chances of getting the repair correct? >> > That is, of course, a separate issue to whether it should be in-kernel. > I would entirely agree that user-level processes should be able to > access and manipulate the low-level RAID data/metadata (via the md > layer) in order to facilitate more advanced repair functions, but this > should be separate from, and in addition to, the "ignorant" > parity-updating repair process currently in place. > I am trying to convey the idea that a first step to a userland process would be full disclosure of what is going on. A non-zero mismatch_cnt on a multigigabyte array makes an admin very uneasy, without giving him a chance to assess the situation. Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
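The 1:1 comparison Peter alludes to is scriptable from userspace today, at least for raid1. A rough sketch follows; the device names and size are placeholders, it should be run read-only on a quiescent array, and with 0.90 metadata the superblock sits near the end of each member, so either stop before it or expect a guaranteed mismatch there.

import hashlib

MEMBERS = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]  # placeholders
CHUNK = 64 * 1024
DATA_END = 6000128 * 1024   # bytes of data area, e.g. "Used Dev Size" in KiB

# Read the same chunk from every member and report any offset where the
# copies disagree, with a per-member checksum so the odd one out is obvious.
files = [open(dev, "rb") for dev in MEMBERS]
offset = 0
while offset < DATA_END:
    chunks = [f.read(CHUNK) for f in files]
    if any(c != chunks[0] for c in chunks[1:]):
        print(f"mismatch at member offset {offset}:")
        for dev, c in zip(MEMBERS, chunks):
            print(f"  {dev}  md5={hashlib.md5(c).hexdigest()}")
    offset += CHUNK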
* What do Events actually mean? 2008-03-22 10:03 ` Peter Rabbitson @ 2008-03-22 10:42 ` Justin Piszcz 2008-03-22 17:35 ` David Greaves 2008-03-25 3:58 ` Neil Brown 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson 1 sibling, 2 replies; 44+ messages in thread From: Justin Piszcz @ 2008-03-22 10:42 UTC (permalink / raw) To: linux-raid; +Cc: Peter Rabbitson On Sat, 22 Mar 2008, Peter Rabbitson wrote: > Robin Hill wrote: >> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >> >>> Peter Rabbitson wrote: > UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) > Events : 0.183270 > > As you can see it is pretty old, and does not have many events to speak of. What do the 'Events' actually represent and what do they mean for RAID0, RAID1, RAID5 etc? How are they calculated? Justin. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz @ 2008-03-22 17:35 ` David Greaves 2008-03-22 17:48 ` Justin Piszcz 2008-03-25 3:58 ` Neil Brown 1 sibling, 1 reply; 44+ messages in thread From: David Greaves @ 2008-03-22 17:35 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Peter Rabbitson Justin Piszcz wrote: > > > On Sat, 22 Mar 2008, Peter Rabbitson wrote: > >> Robin Hill wrote: >>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>> >>>> Peter Rabbitson wrote: > >> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host >> Arzamas) >> Events : 0.183270 >> >> As you can see it is pretty old, and does not have many events to >> speak of. > > What do the 'Events' actually represent and what do they mean for RAID0, > RAID1, RAID5 etc? > > How are they calculated? http://linux-raid.osdl.org/index.php/Event David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 17:35 ` David Greaves @ 2008-03-22 17:48 ` Justin Piszcz 2008-03-22 18:02 ` David Greaves 0 siblings, 1 reply; 44+ messages in thread From: Justin Piszcz @ 2008-03-22 17:48 UTC (permalink / raw) To: David Greaves; +Cc: linux-raid, Peter Rabbitson On Sat, 22 Mar 2008, David Greaves wrote: > Justin Piszcz wrote: >> >> >> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >> >>> Robin Hill wrote: >>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>> >>>>> Peter Rabbitson wrote: >> >>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host >>> Arzamas) >>> Events : 0.183270 >>> >>> As you can see it is pretty old, and does not have many events to >>> speak of. >> >> What do the 'Events' actually represent and what do they mean for RAID0, >> RAID1, RAID5 etc? >> >> How are they calculated? > > http://linux-raid.osdl.org/index.php/Event Empty? There is currently no text in this page, you can search for this page title in other pages or edit this page ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 17:48 ` Justin Piszcz @ 2008-03-22 18:02 ` David Greaves 0 siblings, 0 replies; 44+ messages in thread From: David Greaves @ 2008-03-22 18:02 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Peter Rabbitson Justin Piszcz wrote: > > > On Sat, 22 Mar 2008, David Greaves wrote: > >> Justin Piszcz wrote: >>> >>> >>> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >>> >>>> Robin Hill wrote: >>>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>>> >>>>>> Peter Rabbitson wrote: >>> >>>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host >>>> Arzamas) >>>> Events : 0.183270 >>>> >>>> As you can see it is pretty old, and does not have many events to >>>> speak of. >>> >>> What do the 'Events' actually represent and what do they mean for RAID0, >>> RAID1, RAID5 etc? >>> >>> How are they calculated? >> >> http://linux-raid.osdl.org/index.php/Event > > Empty? > There is currently no text in this page, you can search for this page > title in other pages or edit this page And? You think I was being too subtle <grin> David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz 2008-03-22 17:35 ` David Greaves @ 2008-03-25 3:58 ` Neil Brown 2008-03-26 8:57 ` David Greaves 2008-03-26 8:57 ` David Greaves 1 sibling, 2 replies; 44+ messages in thread From: Neil Brown @ 2008-03-25 3:58 UTC (permalink / raw) To: Justin Piszcz; +Cc: linux-raid, Peter Rabbitson On Saturday March 22, jpiszcz@lucidpixels.com wrote: > > > On Sat, 22 Mar 2008, Peter Rabbitson wrote: > > > Robin Hill wrote: > >> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: > >> > >>> Peter Rabbitson wrote: > > > UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) > > Events : 0.183270 > > > > As you can see it is pretty old, and does not have many events to speak of. > > What do the 'Events' actually represent and what do they mean for RAID0, > RAID1, RAID5 etc? An 'event' is one of: switch from 'active' to 'clean' switch from 'clean' to 'active' device fails device is added spare replaces a failed device after a rebuild I think that is all. None of these are meaningful for RAID0, so the 'events' counter on RAID0 should be stable. Unfortunately, the number looks like a decimal but isn't. It is a 64-bit number. We print out the top 32 bits, then the bottom 32 bits. I don't remember why. Maybe I'll 'fix' it. > > How are they calculated? events = events + 1; Feel free to merge this text into the wiki. NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
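The printing Neil describes explains the "Events : 0.183270" quoted above: the value is one 64-bit counter rendered as <top 32 bits>.<bottom 32 bits>, not a decimal fraction. A two-line sketch:

events = 183270                                  # high word 0, low word 183270
print(f"{events >> 32}.{events & 0xFFFFFFFF}")   # -> 0.183270
# Only after 2**32 events would this read "1.0", which is why it merely
# looks like a decimal.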
* Re: What do Events actually mean? 2008-03-25 3:58 ` Neil Brown @ 2008-03-26 8:57 ` David Greaves 2008-03-26 8:57 ` David Greaves 1 sibling, 0 replies; 44+ messages in thread From: David Greaves @ 2008-03-26 8:57 UTC (permalink / raw) To: Neil Brown; +Cc: Justin Piszcz, linux-raid, Peter Rabbitson Neil Brown wrote: > On Saturday March 22, jpiszcz@lucidpixels.com wrote: >> >> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >> >>> Robin Hill wrote: >>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>> >>>>> Peter Rabbitson wrote: >>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) >>> Events : 0.183270 >>> >>> As you can see it is pretty old, and does not have many events to speak of. >> What do the 'Events' actually represent and what do they mean for RAID0, >> RAID1, RAID5 etc? > > An 'event' is one of: > switch from 'active' to 'clean' > switch from 'clean' to 'active' > device fails > device is added > spare replaces a failed device after a rebuild > > I think that is all. > > None of these are meaningful for RAID0, so the 'events' counter on > RAID0 should be stable. > > Unfortunately, the number looks like a decimal but isn't. > It is a 64-bit number. We print out the top 32 bits, then the bottom > 32 bits. I don't remember why. Maybe I'll 'fix' it. > >> How are they calculated? > > events = events + 1; > > > Feel free to merge this text into the wiki. http://linux-raid.osdl.org/index.php?title=Event I also added: == What are they for? == When an array is assembled, all the disks should have the same number of events. If they don't then something odd happened. eg: If one drive fails then the remaining drives have their event counter incremented. When the array is re-assembled the failed drive has a different event count and is not included in the assembly. This led me to ponder: How/when are events reset to equality? I wrote: The event count on a drive is set to zero on creation and reset to the majority on a resync or a forced assembly. Is that right? David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: What do Events actually mean? 2008-03-25 3:58 ` Neil Brown 2008-03-26 8:57 ` David Greaves @ 2008-03-26 8:57 ` David Greaves 1 sibling, 0 replies; 44+ messages in thread From: David Greaves @ 2008-03-26 8:57 UTC (permalink / raw) To: Neil Brown; +Cc: Justin Piszcz, linux-raid, Peter Rabbitson Neil Brown wrote: > On Saturday March 22, jpiszcz@lucidpixels.com wrote: >> >> On Sat, 22 Mar 2008, Peter Rabbitson wrote: >> >>> Robin Hill wrote: >>>> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >>>> >>>>> Peter Rabbitson wrote: >>> UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host Arzamas) >>> Events : 0.183270 >>> >>> As you can see it is pretty old, and does not have many events to speak of. >> What do the 'Events' actually represent and what do they mean for RAID0, >> RAID1, RAID5 etc? > > An 'event' is one of: > switch from 'active' to 'clean' > switch from 'clean' to 'active' > device fails > device is added > spare replaces a failed device after a rebuild > > I think that is all. > > None of these are meaningful for RAID0, so the 'events' counter on > RAID0 should be stable. > > Unfortunately, the number looks like a decimal but isn't. > It is a 64-bit number. We print out the top 32 bits, then the bottom > 32 bits. I don't remember why. Maybe I'll 'fix' it. > >> How are they calculated? > > events = events + 1; > > > Feel free to merge this text into the wiki. Thanks Neil :) http://linux-raid.osdl.org/index.php?title=Event I also added: == What are they for? == When an array is assembled, all the disks should have the same number of events. If they don't then something odd happened. eg: If one drive fails then the remaining drives have their event counter incremented. When the array is re-assembled the failed drive has a different event count and is not included in the assembly. This led me to ponder: How/when are events reset to equality? I wrote: The event count on a drive is set to zero on creation and reset to the majority on a resync or a forced assembly. Is that right? David ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-22 10:03 ` Peter Rabbitson 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz @ 2008-05-04 7:30 ` Peter Rabbitson 2008-05-06 6:36 ` Luca Berra 1 sibling, 1 reply; 44+ messages in thread From: Peter Rabbitson @ 2008-05-04 7:30 UTC (permalink / raw) To: linux-raid Peter Rabbitson wrote: > Robin Hill wrote: >> On Fri Mar 21, 2008 at 07:01:43PM -0400, Bill Davidsen wrote: >> >>> Peter Rabbitson wrote: >>>> I was actually specifically advocating that md must _not_ do >>>> anything on its own. Just provide the hooks to get information (what >>>> is the current stripe state) and update information (the described >>>> repair extension). The logic that you are describing can live only >>>> in an external app, it has no place in-kernel. >>> So you advocate the current code being in the kernel, which absent a >>> hardware error makes blind assumptions about which data is valid and >>> which is not and in all cases hides the problem, instead of the code >>> I proposed, which in some cases will be able to avoid action which is >>> provably wrong and never be less likely to do the wrong thing than >>> the current code? >>> >> I would certainly advocate that the current (entirely automatic) code >> belongs in the kernel whereas any code requiring user >> intervention/decision making belongs in a user process, yes. That's not >> to say that the former should be preferred over the latter though, but >> there's really no reason to remove the in-kernel automated process until >> (or even after) a user-side repair process has been coded. > > I am asserting that automatic repair is infeasible in most > highly-redundant cases. Let's take the root raid1 of one of my busiest > servers: > > /dev/md0: > Version : 00.90.03 > Creation Time : Tue Mar 20 21:58:54 2007 > Raid Level : raid1 > Array Size : 6000128 (5.72 GiB 6.14 GB) > Used Dev Size : 6000128 (5.72 GiB 6.14 GB) > Raid Devices : 4 > Total Devices : 4 > Preferred Minor : 0 > Persistence : Superblock is persistent > > Update Time : Sat Mar 22 05:55:08 2008 > State : clean > Active Devices : 4 > Working Devices : 4 > Failed Devices : 0 > Spare Devices : 0 > > UUID : b6a11a74:8b069a29:6e26228f:2ab99bd0 (local to host > Arzamas) > Events : 0.183270 > > As you can see it is pretty old, and does not have many events to speak > of. Yet every month when the automatic check is issued I get between 512 > and 2048 in mismatch_cnt. I maintain md5sums of all files on this > filesystem, and there were no deviations for the lifetime of the array > (of course there are mismatches after upgrades, after log appends etc, > but they are all expected). So all I can do with this array is issue a > blind repair, without even having the chance to find what exactly is > causing this. Yes, it is raid1 and I could do 1:1 comparison to find > which is the offending block. How about raid10 -n f3? There is no way I > can figure out _what_ is giving me a problem. I do not know if it is a > hardware error (the md5 sums speak against it), some process with weird > write patterns resulting in heavy DMA, or a bug in md itself. > > By the way there is no swap file on this array. Just / and /var, with a > moderately busy mail spool on top. > I want to resurrect this discussion with a peculiar observation - the above mismatch was caused by GRUB. I had some time this weekend and decided to take device snapshots of the 4 array members as listed above while / is mounted ro. 
After stripping the md superblock I ended up with data from slots 1, 2 and 3 being identical, and 0 (my primary boot device) being different by about 10 bytes. Hexediting revealed that the bytes in question belong to /boot/grub/default. I realized that my grub config contains a savedefault clause, which updates the file on the raw ext3 volume before any raid assembly has taken place. Executing grub-set-default from within a booted system (with a mounted assembled raid) resulted in the subsequent md check returning 0 mismatches. To add insult to injury, the ways savedefault and grub-set-default update said file are different (comments vs empty lines). So even if one savedefault's the same entry as the one set initially by grub-set-default - the result will still be a raid1 mismatch. I assume that this condition is benign, but wanted to bring this to the attention of the masses anyway. Cheers Peter ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson @ 2008-05-06 6:36 ` Luca Berra 0 siblings, 0 replies; 44+ messages in thread From: Luca Berra @ 2008-05-06 6:36 UTC (permalink / raw) To: linux-raid On Sun, May 04, 2008 at 09:30:02AM +0200, Peter Rabbitson wrote: >I want to resurrect this discussion with a peculiar observation - the above >mismatch was caused by GRUB. ... >I realized that my grub config contains a savedefault clause, which updates >the file on the raw ext3 volume before any raid assembly has taken place. >Executing grub-set-default from within a booted system (with a mounted >assembled raid) resulted in the subsequent md check returning 0 mismatches. this has been a long-standing issue with grub 1.x; hopefully some day grub 2 will be production ready and the issue will be forgotten (real md raid support for grub 2 was added in Google SoC 2006). L. -- Luca Berra -- bluca@comedia.it Communication Media & Services S.r.l. /"\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-20 23:08 ` Peter Rabbitson 2008-03-21 14:24 ` Bill Davidsen @ 2008-03-25 4:24 ` Neil Brown 2008-03-25 9:00 ` Peter Rabbitson 1 sibling, 1 reply; 44+ messages in thread From: Neil Brown @ 2008-03-25 4:24 UTC (permalink / raw) To: Peter Rabbitson; +Cc: Theodore Tso, Bas van Schaik, linux-raid On Friday March 21, rabbit+list@rabbit.us wrote: > Theodore Tso wrote: > > On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: > >>> There's no explicit message produced by the md module, no. You need to > >>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many > >>> mismatches there are. Similarly, following a repair this will indicate > >>> how many mismatches it thinks have been fixed (by updating the parity > >>> block to match the data blocks). > >>> > >> Marvellous! I naively assumed that the module would warn me, but that's > >> not true. Wouldn't it be appropriate to print a message to dmesg if such > >> a mismatch occurs during a check? Such a mismatch clearly means that > >> there is something wrong with your hardware lying beneath md, doesn't it? > > > > If a mismatch is detected in a RAID-6 configuration, it should be > > possible to figure out what should be fixed (since with two hot spares > > there should be enough redundancy not only to detect an error, but to > > correct it.) Out of curiosity, does md do this automatically, either > > when reading from a stripe, or during a resync operation? > > > > In my modest experience with root/high performance spool on various raid > levels I can pretty much conclude that the current check mechanism doesn't do > enough to give power to the user. We can debate all we want about what the MD > driver should do when it finds a mismatch, yet there is no way for the user to > figure out what the mismatch is and take appropriate action. This does not > apply only to RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the > only wrong value is taken and written all over the other good blocks? > > I think that the solution is rather simple, and I would contribute a patch if > I had any C experience. The current check mechanism remains the same - > mismatch_cnt is incremented/reset just the same as before. However on every > mismatching chunk the system printks the following: > > 1) the start offset of the chunk(md1/10) or stripe(md5/6) within the MD device > 2) one line for every active disk containing: > a) the offset of the chunk within the MD component > b) a {md5|sha1}sum of the chunk More logging probably would be appropriate. I wouldn't emit too much detail from the kernel though. Just enough to identify the location. Have the userspace tool do all the more interesting stuff. You would want to rate limit the message though, so that you don't get piles of messages when initialising the array... > > In a common case array this will take no more than 8 lines in dmesg. However > it will allow: > > 1) For a human to determine at a glance which disk holds a mismatching chunk > in raid 1/10 > 2) Determine the same for raid 6 using a userspace tool which will calculate > the parity for every possible permutation of chunks > 3) using some external tools to determine which file might have been affected > on the layered file system > > > Now of course the problem remains how to repair the array using the > information obtained above. 
I think the best way would be to extend the syntax > of repair itself, so that: > > echo repair > .../sync_action would use the old heuristics > > echo repair <mdoffset> <component N> > .../sync_action will update the chunk > on drive N which corresponds to the chunk/stripe at mdoffset within the MD > device, using the information from the other drives, and not the other way > around as might happen with just a repair. Suspend the array, update the raw devices, then re-enable the array. All from user-space. No magic parsing of 'sync_action' input. NeilBrown ^ permalink raw reply [flat|nested] 44+ messages in thread
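A loose sketch of the userspace flow Neil outlines, for a raid1 member. It assumes md's suspend_lo/suspend_hi sysfs attributes (which block array I/O over a sector range) behave as documented and that an active member can be opened for writing; the array name, member, offsets and the source of the known-good chunk are all hypothetical, and this is an untested illustration of the idea, not a tool.

SYSFS = "/sys/block/md0/md"                   # hypothetical array
MEMBER = "/dev/sdc1"                          # member holding the bad copy
BYTE_OFF = 123456 * 512                       # chunk offset on the member
                                              # (1.x metadata adds a data offset)
good = open("/tmp/good_chunk", "rb").read()   # reconstructed elsewhere

def sysfs_write(attr, value):
    with open(f"{SYSFS}/{attr}", "w") as f:
        f.write(str(value))

first = BYTE_OFF // 512
sysfs_write("suspend_lo", first)              # block array I/O over the range
sysfs_write("suspend_hi", first + len(good) // 512)
with open(MEMBER, "r+b") as dev:              # rewrite the chunk behind md's back
    dev.seek(BYTE_OFF)
    dev.write(good)
sysfs_write("suspend_hi", 0)                  # lift the suspension again
sysfs_write("suspend_lo", 0)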
* Re: Redundancy check using "echo check > sync_action": error reporting? 2008-03-25 4:24 ` Neil Brown @ 2008-03-25 9:00 ` Peter Rabbitson 0 siblings, 0 replies; 44+ messages in thread From: Peter Rabbitson @ 2008-03-25 9:00 UTC (permalink / raw) To: Neil Brown; +Cc: Theodore Tso, Bas van Schaik, linux-raid Neil Brown wrote: > On Friday March 21, rabbit+list@rabbit.us wrote: >> Theodore Tso wrote: >>> On Thu, Mar 20, 2008 at 03:19:08PM +0100, Bas van Schaik wrote: >>>>> There's no explicit message produced by the md module, no. You need to >>>>> check the /sys/block/md{X}/md/mismatch_cnt entry to find out how many >>>>> mismatches there are. Similarly, following a repair this will indicate >>>>> how many mismatches it thinks have been fixed (by updating the parity >>>>> block to match the data blocks). >>>>> >>>> Marvellous! I naively assumed that the module would warn me, but that's >>>> not true. Wouldn't it be appropriate to print a message to dmesg if such >>>> a mismatch occurs during a check? Such a mismatch clearly means that >>>> there is something wrong with your hardware lying beneath md, doesn't it? >>> If a mismatch is detected in a RAID-6 configuration, it should be >>> possible to figure out what should be fixed (since with two hot spares >>> there should be enough redundancy not only to detect an error, but to >>> correct it.) Out of curiosity, does md do this automatically, either >>> when reading from a stripe, or during a resync operation? >>> >> In my modest experience with root/high performance spool on various raid >> levels I can pretty much conclude that the current check mechanism doesn't do >> enough to give power to the user. We can debate all we want about what the MD >> driver should do when it finds a mismatch, yet there is no way for the user to >> figure out what the mismatch is and take appropriate action. This does not >> apply only to RAID5/6 - what about RAID1/10 with >2 chunk copies? What if the >> only wrong value is taken and written all over the other good blocks? >> >> I think that the solution is rather simple, and I would contribute a patch if >> I had any C experience. The current check mechanism remains the same - >> mismatch_cnt is incremented/reset just the same as before. However on every >> mismatching chunk the system printks the following: >> >> 1) the start offset of the chunk(md1/10) or stripe(md5/6) within the MD device >> 2) one line for every active disk containing: >> a) the offset of the chunk within the MD component >> b) a {md5|sha1}sum of the chunk > > More logging probably would be appropriate. > I wouldn't emit too much detail from the kernel though. Just enough > to identify the location. Have the userspace tool do all the more > interesting stuff. True. The only reason I suggested checksum information was because the blocks are already in memory, and checksum routines are readily available. > You would want to rate limit the message though, so that you don't get > piles of messages when initialising the array... More realistically one would want to be able to flip a switch in /sys/block/mdX/md/ to see any advanced logging at all. So basically you run your monthly checks, one of them comes back with non-zero mismatch_cnt, you echo 1 > /sys/block/mdX/md/sync_action_debug and look at your logs. >> In a common case array this will take no more than 8 lines in dmesg. 
However >> it will allow: >> >> 1) For a human to determine at a glance which disk holds a mismatching chunk >> in raid 1/10 >> 2) Determine the same for raid 6 using a userspace tool which will calculate >> the parity for every possible permutation of chunks >> 3) using some external tools to determine which file might have been affected >> on the layered file system >> >> >> Now of course the problem remains how to repair the array using the >> information obtained above. I think the best way would be to extend the syntax >> of repair itself, so that: >> >> echo repair > .../sync_action would use the old heuristics >> >> echo repair <mdoffset> <component N> > .../sync_action will update the chunk >> on drive N which corresponds to the chunk/stripe at mdoffset within the MD >> device, using the information from the other drives, and not the other way >> around as might happen with just a repair. > > Suspend the array, update the raw devices, then re-enable the array. > All from user-space. > No magic parsing of 'sync_action' input. > The sole advantage of 'repair' is that you do not take the array offline. It doesn't even have to be 'repair', it can be something like 'refresh' or 'relocate'. The point is that such a simple interface would be a clean way to fix any inconsistencies in any RAID level without taking it offline. ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread [newest: 2008-05-06 6:36 UTC] Thread overview: 44+ messages -- 2008-03-16 14:21 Redundancy check using "echo check > sync_action": error reporting? Bas van Schaik 2008-03-16 15:14 ` Janek Kozicki 2008-03-20 13:32 ` Bas van Schaik 2008-03-20 13:47 ` Robin Hill 2008-03-20 14:19 ` Bas van Schaik 2008-03-20 14:45 ` Robin Hill 2008-03-20 15:16 ` Bas van Schaik 2008-03-20 16:04 ` Robin Hill 2008-03-20 16:35 ` Theodore Tso 2008-03-20 17:10 ` Robin Hill 2008-03-20 17:39 ` Andre Noll 2008-03-20 18:02 ` Theodore Tso 2008-03-20 18:57 ` Andre Noll 2008-03-21 14:02 ` Ric Wheeler 2008-03-21 20:19 ` NeilBrown 2008-03-21 20:45 ` Ric Wheeler 2008-03-22 17:13 ` Bill Davidsen 2008-03-20 23:08 ` Peter Rabbitson 2008-03-21 14:24 ` Bill Davidsen 2008-03-21 14:52 ` Peter Rabbitson 2008-03-21 17:13 ` Theodore Tso 2008-03-21 17:35 ` Peter Rabbitson 2008-03-22 13:27 ` Theodore Tso 2008-03-22 14:00 ` Bas van Schaik 2008-03-25 4:44 ` Neil Brown 2008-03-25 15:17 ` Bill Davidsen 2008-03-25 9:19 ` Mattias Wadenstein 2008-03-21 17:43 ` Robin Hill 2008-03-21 23:01 ` Bill Davidsen 2008-03-21 23:45 ` Carlos Carvalho 2008-03-22 17:19 ` Bill Davidsen 2008-03-21 23:55 ` Robin Hill 2008-03-22 10:03 ` Peter Rabbitson 2008-03-22 10:42 ` What do Events actually mean? Justin Piszcz 2008-03-22 17:35 ` David Greaves 2008-03-22 17:48 ` Justin Piszcz 2008-03-22 18:02 ` David Greaves 2008-03-25 3:58 ` Neil Brown 2008-03-26 8:57 ` David Greaves 2008-03-26 8:57 ` David Greaves 2008-05-04 7:30 ` Redundancy check using "echo check > sync_action": error reporting? Peter Rabbitson 2008-05-06 6:36 ` Luca Berra 2008-03-25 4:24 ` Neil Brown 2008-03-25 9:00 ` Peter Rabbitson