* Re: Bad blocks are killing us!

From: Neil Brown @ 2004-11-15 22:27 UTC
To: Guy Watkins; +Cc: linux-raid

On Monday November 15, guy@watkins-home.com wrote:
> Neil,
>     This is a private email.  You can post it if you want.
snip
>
>     Anyway, in the past there have been threads about correcting bad
> blocks automatically within md.  I think a RAID1 patch was created that
> will attempt to correct a bad block automatically.  Is it likely that you
> will pursue this for RAID5 and maybe RAID6?  I hope so.

My current plans for md are:

 1/ Incorporate the "bitmap resync" patches that have been floating
    around for some months.  This involves a reasonable amount of work,
    as I want them to work with raid5/6/10 as well as raid1.  raid10 is
    particularly interesting, as resync there is quite different from
    recovery.

 2/ Look at recovering from failed reads that can be fixed by a write.
    I am considering leveraging the "bitmap resync" stuff for this.
    With the bitmap stuff in place, you can let the kernel kick out a
    drive that has a read error, let user-space have a quick look at the
    drive and see if it might be a recoverable error, and then give the
    drive back to the kernel.  It will then do a partial resync based on
    the bitmap information, thus writing the bad blocks, and all should
    be fine.  This would mean re-writing several megabytes instead of a
    few sectors, but I don't think that is a big cost.  There are a few
    issues that make it a bit less trivial than that, but it will
    probably be my starting point.  The new "faulty" personality will
    allow this to be tested easily.

 3/ Look at background data scans - i.e. read the whole array and check
    that parity/copies are correct.  This will be triggered and monitored
    by user-space.  If a read error happens during the scan, we trip the
    recovery code discussed above.

While these are my current intentions, there are no guarantees and
definitely no time frame.  I get to spend about 50%-60% of my time on this
at the moment, so there is hope.

>     About RAID6, you have fixed a bug or 2 in the last few weeks.  Would
> you consider RAID6 stable (safe) yet?

I'm not really in a position to answer that.  The code is structurally
very similar to raid5, so there is a good chance that there are no races
or awkward edge cases (unless there still are some in raid5).  The
"parity" arithmetic has been extensively tested out of the kernel and
seems to be reliable.  Basic testing seems to show that it largely works,
but I haven't done more than very basic testing myself.  So it is probably
fairly close to stable.

What it really needs is lots of testing.  Build a filesystem on a raid6
and then, in a loop: mount / do a metadata-intensive stress test / umount /
fsck -f.  While that is happening, fail, remove, and re-add various
drives.  Try to cover all combinations of failing active drives and
spares-being-rebuilt while 0, 1, or 2 drives are missing.  Try using a
"faulty" device and causing it to fail, as well as just
"mdadm --set-faulty".

If you cannot get it to fail, you will have increased your confidence in
its safety.

NeilBrown

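Neil's test recipe can be scripted in a few lines.  The sketch below is one
possible way to drive it, assuming a throwaway test array /dev/md0 built
from loop devices, a scratch mount point, and fsstress as the
metadata-intensive workload; the device names, the workload and the timing
are all assumptions for illustration, not something from the mail above.

#!/usr/bin/env python3
# Rough raid6 torture loop in the spirit of Neil's suggestion.
# /dev/md0, the loop devices, the mount point and fsstress are assumptions.
import random
import subprocess
import time

MD = "/dev/md0"                                 # disposable test array
DEVS = [f"/dev/loop{i}" for i in range(6)]      # its component devices
MNT = "/mnt/raid6test"                          # scratch mount point

def run(*cmd):
    print("+", " ".join(cmd))
    return subprocess.call(cmd)

def fail_remove_readd():
    # Fail, remove and later re-add one randomly chosen component,
    # so rebuilds overlap with the filesystem load.
    dev = random.choice(DEVS)
    run("mdadm", MD, "--fail", dev)
    run("mdadm", MD, "--remove", dev)
    time.sleep(random.uniform(1, 30))           # run degraded for a while
    run("mdadm", MD, "--add", dev)              # kicks off recovery

for iteration in range(1000):
    run("mount", MD, MNT)
    stress = subprocess.Popen(
        ["fsstress", "-d", MNT, "-n", "10000", "-p", "4"])
    while stress.poll() is None:                # inject failures mid-run
        time.sleep(random.uniform(5, 60))
        if stress.poll() is None:
            fail_remove_readd()
    run("umount", MNT)
    if run("fsck", "-f", "-n", MD) != 0:        # read-only check; non-zero
        print("corruption at iteration", iteration)  # means problems found
        break

Varying which device fails, and whether a spare is still mid-rebuild when
the next one is failed, covers the combinations Neil lists above.
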
* Re: Bad blocks are killing us!

From: Maurilio Longo @ 2004-11-16 16:28 UTC
To: Neil Brown; +Cc: Guy Watkins, linux-raid

Neil Brown wrote:
> My current plans for md are:
[...]
> 2/ Look at recovering from failed reads that can be fixed by a write.
>    [...] With the bitmap stuff in place, you can let the kernel kick out
>    a drive that has a read error, let user-space have a quick look at the
>    drive and see if it might be a recoverable error, and then give the
>    drive back to the kernel. [...] The new "faulty" personality will
>    allow this to be tested easily.

I think 2/ should run unattended for at least a few retries; then, if they
all fail, kick out the disk and/or call a user-space program to see what's
going on.  I say this because an occasional read error should not kick out
a disk or require user intervention to fix it (as it does now).  And it
seems to me that new disks have a lot of bad sectors, regardless of their
brand.

Just my .02 euro cents :)

regards.

--
 __________
|  |  | |__| md2520@mclink.it
|_|_|_|____| Team OS/2 Italia

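Maurilio's suggestion amounts to a small policy wrapped around the
read-error path.  A hypothetical sketch - the reread, rewrite and kick-out
hooks are invented names, not md interfaces:

def handle_read_error(dev, sector, reread, rewrite_from_redundancy, kick_out):
    # Retry a few times unattended; on persistent failure try a corrective
    # rewrite reconstructed from the other disks; only then fail the device.
    for attempt in range(3):
        data = reread(dev, sector)
        if data is not None:
            return data                       # transient error, nothing to do
    if rewrite_from_redundancy(dev, sector):  # reconstruct and write back
        return reread(dev, sector)            # drive should have remapped it
    kick_out(dev)                             # give up: mark the device faulty
    return None
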
* RE: Bad blocks are killing us!

From: Guy @ 2004-11-16 18:18 UTC
To: 'Neil Brown'; +Cc: linux-raid

This sounds great!

But...

2/ Do you intend to create a user-space program to attempt to correct the
bad block and put the device back in the array automatically?  I hope so.

If not, please consider correcting the bad block without kicking the
device out.  Reason: once the device is kicked out, a second bad block on
another device is fatal to the array.  And this has been happening a lot
lately.

3/ Maybe don't do the bad block scan if the array is degraded.  Reason: if
a bad block is found, that would kick out a second disk, which is fatal.
Since the stated purpose of this is to "check parity/copies are correct",
you probably can't do this anyway.  I just want to be sure.  Also, if a
device is kicked during the scan, the scan should pause or abort.  The
scan can resume once the array has been corrected.  I would be happy if
the scan had to be restarted from the start, so a pause or abort is fine
with me.

Thanks for your time,
Guy

* RE: Bad blocks are killing us!

From: Neil Brown @ 2004-11-16 23:04 UTC
To: Guy; +Cc: linux-raid

On Tuesday November 16, bugzilla@watkins-home.com wrote:
> This sounds great!
>
> But...
>
> 2/ Do you intend to create a user-space program to attempt to correct the
> bad block and put the device back in the array automatically?  I hope so.

Definitely.  It would be added to the functionality of "mdadm --monitor".

> If not, please consider correcting the bad block without kicking the
> device out.  Reason: once the device is kicked out, a second bad block on
> another device is fatal to the array.  And this has been happening a lot
> lately.

This is one of several things that makes it "a bit less trivial" than
simply using the bitmap stuff.  I will keep your comment in mind when I
start looking at this in more detail.  Thanks.

> 3/ Maybe don't do the bad block scan if the array is degraded.  Reason: if
> a bad block is found, that would kick out a second disk, which is fatal.
> Since the stated purpose of this is to "check parity/copies are correct",
> you probably can't do this anyway.  I just want to be sure.  Also, if a
> device is kicked during the scan, the scan should pause or abort.  The
> scan can resume once the array has been corrected.  I would be happy if
> the scan had to be restarted from the start, so a pause or abort is fine
> with me.

I hadn't thought about that yet.  I suspect there would be little point in
doing a scan when there was no redundancy.  However, a scan on a degraded
raid6 that could still safely lose one drive would probably make sense.

NeilBrown

* RE: Bad blocks are killing us!

From: Guy @ 2004-11-16 23:07 UTC
To: 'Neil Brown'; +Cc: linux-raid

Neil said:
"I hadn't thought about that yet.  I suspect there would be little point
in doing a scan when there was no redundancy.  However, a scan on a
degraded raid6 that could still safely lose one drive would probably make
sense."

I agree.

Also a RAID1 with 2 or more working devices.  Don't forget, some people
have 3 or more devices in their RAID1 arrays, from what I have read
anyway.

Thanks,
Guy

* Badstripe proposal (was Re: Bad blocks are killing us!)

From: David Greaves @ 2004-11-17 13:21 UTC
To: Guy; +Cc: 'Neil Brown', linux-raid, dean-list-linux-raid

Just for discussion...

Proposal:
md devices to have a badstripe table and space for re-allocation.

Benefits:
Allows multiple block-level failures on any combination of component md
devices, provided parity is not compromised.
Zero impact on performance in non-degraded mode.
No need for scanning (although it may be used as a trigger).
Works for all md personalities.

Overview:
Provide an 'on or off-array' store for any stripes impacted by block-level
failure.  Unlike a disk's bad-block allocation this would be a temporary
store, since we'd insist on the underlying devices recovering fully from
the problem before restoring full health.  This allows us to cope
transiently and, in the event of non-recoverable errors, until the disk is
replaced.

Downsides:
Resync'ing with multiple failing drives is more complex (but more
resilient).
Some kind of store handler is needed.

Description:
I've structured this to look at the md driver, the userspace daemon, the
store, failing drives, and replacing and resync'ing drives.

md:
For normal md access the badstripe list has no entries and is ignored.  A
badstripe-table size check is required prior to each stripe access.

If a write error occurs, rewrite the stripe to a store, noting, and
marking bad, the originating (faulty) stripe (and the offending
device/block) in the badstripe table.  The device is marked 'failing'.
If a read error occurs, attempt to reconstruct the stripe from the other
devices, then follow the write-error path.

For normal md access against stripes appearing in the badstripe list:
* Lock the badstripe table against the daemon (and other md threads).
* Check the stripe is still in the bad stripe list.
* If not, the userland daemon fixed it.  Release the lock.  Carry on as
  normal.
* If so, read/write from the reserved area.
* Release the badstripe lock.

Daemon:
A userland daemon could examine the reserved area, attempt a repair on a
faulty stripe and, if it succeeds, restore the stripe and mark the
badstripe entry as clean, thus freeing up the reserved area and restoring
perfect health.
The daemon would:
* lock the badstripe table against md
* write the stripe back to the previously faulty area, which shouldn't
  need locking against md since it's "not in use"
* correct the badstripe table
* release the lock
If the daemon fails, the badstripe entry is marked as unrecoverable.

If the daemon has failed to correct the error (unrecoverable in the
badstripe table), the drive should be kept as failing (not faulty) and
should be replaced.  The intention is to allow a failing drive to continue
to be used in the event of a subsequent bad drive event.

The Store:
This could be reserved stripes at the start (?) of the component devices,
read/written using the current personality.  Alternatively it could be a
filesystem-level store (possibly remote, on a resilient device, or just in
/tmp).

Failing drives:
From a reading point of view it seems possible to treat a failing drive as
a faulty drive - until the event of another read failure on another drive.
In that case the read-error path above could still access the failing
drive to attempt a recovery.  This may help when recovering from a failing
drive where you want to minimise load against it.  It may not be
worthwhile.  Writing would still have to continue, to maintain sync.

Drive replacement + resync:
If multiple devices go 'failing', how are they removed (since they are all
in use)?  A spare needs to be added, and then the resync code needs to
ensure that one of the failing disks is synced to the spare.  Then the
failing disk is made faulty and then removed.

This could be done by having a progression:
  failing
  failing-pending-remove
  faulty

As I said above, a failing drive is not used for reads, only for writes.
Presumably a drive that is sync'ing is used for writes but not reads.  So
if we add a good drive and mark it syncing, and simultaneously mark the
drive it replaces failing-pending-remove, then the f-p-r drive won't be
written to but is available for essential reads until the new drive is
ready.

Some thoughts:
How much overhead is involved in checking each stripe read/write address
against a *small* bad-stripe table?  Probably none, because most of the
time, for a healthy md, the number of entries is 0.

Does the temporary space even have to be in the md space?  Would it be
easier to make it a file (not in the filesystem on the md device!!)?  This
avoids any messing with stripe offsets etc.

I don't claim to understand md's locking - the stuff above is a simplistic
start on the additional locking related to moving stuff in and out of the
badstripes area.  I don't know where contention is handled - md driver or
fs.

This is essentially only useful for single (or at least 'few') badblock
errors - is that a problem worth solving?  (From the thread title I assume
so.)

How intrusive is this?  I can't really judge.  It mainly feels like error
handling - and maybe handing off to a reused/simplified loopback-like
device could handle 'hits' against the reserved area.

I'm only starting to read the code/device-drivers books etc., so if I'm
talking rubbish then I'll apologise for your time and keep quiet :)

David

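To make the md/daemon locking protocol concrete, here is a toy user-space
model of the proposal.  Every name in it (BadStripeTable, the store, the
read/rewrite hooks) is invented for illustration; it is not md code, and it
holds the lock slightly more conservatively than the description above
strictly requires.

import threading

class BadStripeTable:
    # Toy model: md redirects I/O for listed stripes into a small reserved
    # store; a daemon later repairs the original location and drops the
    # entry.  All names here are invented for illustration.

    def __init__(self):
        self.lock = threading.Lock()   # serialises md threads and the daemon
        self.entries = {}              # stripe_no -> slot in the reserved store
        self.store = {}                # slot -> parked stripe data

    # --- md side ---------------------------------------------------------
    def read_stripe(self, stripe_no, read_disk):
        if not self.entries:                    # fast path: healthy array
            return read_disk(stripe_no)
        with self.lock:
            slot = self.entries.get(stripe_no)
            if slot is None:                    # daemon fixed it meanwhile
                return read_disk(stripe_no)
            return self.store[slot]             # serve from the reserved area

    def write_error(self, stripe_no, data):
        # The stripe could not be written in place: park it in the store and
        # remember it; the owning device is now considered 'failing'.
        with self.lock:
            slot = max(self.store, default=-1) + 1
            self.store[slot] = data
            self.entries[stripe_no] = slot

    # --- daemon side -------------------------------------------------------
    def daemon_repair(self, stripe_no, rewrite_disk):
        with self.lock:
            slot = self.entries.get(stripe_no)
            if slot is None:
                return True                      # nothing to do
            if rewrite_disk(stripe_no, self.store[slot]):
                del self.entries[stripe_no]      # back in place: full health
                del self.store[slot]
                return True
            return False                         # leave entry: unrecoverable

The fast path is just an emptiness test, which is the "zero impact on a
healthy array" property claimed in the proposal.
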
* Re: Badstripe proposal (was Re: Bad blocks are killing us!)

From: Maurilio Longo @ 2004-11-18 9:59 UTC
To: David Greaves; +Cc: Guy, 'Neil Brown', linux-raid, dean-list-linux-raid

David and others,

I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
bad-block management layer; maybe it could be merged inside md to have
write-error management as well, maybe :)

regards.

--
 __________
|  |  | |__| md2520@mclink.it
|_|_|_|____| Team OS/2 Italia

* Re: Badstripe proposal (was Re: Bad blocks are killing us!)

From: Robin Bowes @ 2004-11-18 10:29 UTC
To: Maurilio Longo; +Cc: David Greaves, Guy, 'Neil Brown', linux-raid, dean-list-linux-raid

Maurilio Longo wrote:
> David and others,
>
> I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
> bad-block management layer; maybe it could be merged inside md to have
> write-error management as well, maybe :)

Good point.

When I set up my RAID (albeit a relatively non-critical domestic system) I
considered whether or not to use EVMS because of the bad-block relocation
feature.

In the end, I went with md, lvm2, and mdadm, primarily (as I recall)
because at the time EVMS did not support lvm2 and also required additional
kernel patches to support bad-block relocation.  Plus, there were issues
about having the root filesystem managed by EVMS (it requires setting up a
custom initrd).

All things considered, I chose the path of least resistance and went with
md/lvm2/mdadm.

Perhaps it's time to reconsider that and look at EVMS again?

R.
--
http://robinbowes.com

* Re: Badstripe proposal (was Re: Bad blocks are killing us!)

From: Jure Pečar @ 2004-11-19 17:12 UTC
To: linux-raid

On Thu, 18 Nov 2004 10:59:58 +0100
Maurilio Longo <maurilio.longo@libero.it> wrote:

> David and others,
>
> I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
> bad-block management layer; maybe it could be merged inside md to have
> write-error management as well, maybe :)

BBR, as it is called, is implemented with a device mapper.  AFAIK one can
take dm-managed block devices and build md with them.

--
Jure Pečar
http://jure.pecar.org/

* Re: Badstripe proposal (was Re: Bad blocks are killing us!)

From: Maurilio Longo @ 2004-11-20 13:15 UTC
To: Jure Pečar; +Cc: linux-raid

Jure,

can we use dm-bbr without installing evms?  I'd like not to install it,
since it seems to me a cumbersome tool, while my systems with raid1 and
grub are fairly easy to create and maintain.

regards.

Jure Pečar wrote:
> BBR, as it is called, is implemented with a device mapper.  AFAIK one can
> take dm-managed block devices and build md with them.

--
 __________
|  |  | |__| md2520@mclink.it
|_|_|_|____| Team OS/2 Italia

* Re: Badstripe proposal (was Re: Bad blocks are killing us!)

From: Jure Pečar @ 2004-11-21 18:23 UTC
To: Maurilio Longo; +Cc: linux-raid

On Sat, 20 Nov 2004 14:15:30 +0100
Maurilio Longo <maurilio.longo@libero.it> wrote:

> Jure,
>
> can we use dm-bbr without installing evms?  I'd like not to install it,
> since it seems to me a cumbersome tool, while my systems with raid1 and
> grub are fairly easy to create and maintain.
>
> regards.

I think dm-bbr is just another dm target and so can be created and used
with dmsetup just as other dm stuff.

--
Jure Pečar
http://jure.pecar.org/

* RE: Bad blocks are killing us!

From: dean gaudet @ 2004-11-16 23:29 UTC
To: Neil Brown; +Cc: Guy, linux-raid

i've been trying to puzzle out a related solution... suppose there were
"lock/unlock stripe" operations.  md delays all i/o to a locked stripe
indefinitely (or errors out after a timeout, both solutions work).

a userland daemon could then walk through an array locking stripes
performing whatever corrective actions it desires (raid5/6 reconstruction,
raid1 reconstruction using "voting" for >2 ways).

there are some deadlock scenarios which the daemon must avoid with
mlockall() and by having static memory allocation requirements.

that gives us a proactive detect/repair solution... maybe it also gives us
a reactive solution: when a read error occurs, md could lock the stripe,
wake the daemon, let it repair and unlock.

there's good and bad aspects to this idea... i figured it's worth
mentioning though.

-dean

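dean's walk is essentially the loop below.  The lock_stripe()/unlock_stripe()
calls and the find_bad() helper are hypothetical - md offers no such
interface - so this only illustrates the shape of the daemon he describes,
not something runnable against a real array.

def scrub_walk(array, nr_stripes, lock_stripe, unlock_stripe,
               read_members, find_bad, write_member):
    # Hypothetical daemon loop: lock each stripe, check the copies/parity,
    # issue a corrective write if one member disagrees, unlock, move on.
    # The real daemon would mlockall() and preallocate its memory so it
    # cannot deadlock against writeback while a stripe is locked.
    repaired = []
    for s in range(nr_stripes):
        lock_stripe(array, s)              # md queues further I/O to stripe s
        try:
            blocks = read_members(array, s)     # raw data from every member
            bad = find_bad(blocks)              # parity check / majority vote
            if bad is not None:
                member, good_data = bad
                write_member(array, s, member, good_data)
                repaired.append((s, member))
        finally:
            unlock_stripe(array, s)        # md releases the queued I/O
    return repaired
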
* Re: Bad blocks are killing us!

From: Bruce Lowekamp @ 2004-11-17 21:58 UTC
To: Neil Brown; +Cc: Guy Watkins, linux-raid

2: Thanks for devoting the time to getting this done.  Personally, for the
PATA arrays I use, this approach is a bit overkill---if the rewrite
succeeds, it's ok (unless I start to see repeated errors, in which case I
yank the drive); if the rewrite doesn't succeed, it's dead and I have to
yank the drive.  I don't have any useful diagnostic tools at Linux
user-level other than smart badblocks scans, which would just confirm the
bad sectors.  Personally, I wouldn't go to the effort to keep (parts of)
the drive in the array if it can't be rewritten successfully---I've never
seen a drive last long in that situation, and I think that drive is really
dead.  The only problems I've had in practice have been with multiple
accumulated read errors---and rewriting those would make them go away
quickly.  I would just want the data rewritten at user level, and log the
event so I can monitor the array for failures and look at the smart output
or take a drive offline for testing (with vendor diag tools) if it starts
to have frequent errors.

Naturally, as long as the more complex approach of kicking to user level
allows the user level to return immediately to let the kernel rewrite the
stripe, I think it's fine.  I agree that writing several megabytes is not
an issue in any way.  IMHO, feel free to hang the whole system for a few
seconds if necessary---no one should be using md in an RT-critical
application, and bad blocks are relatively rare.

3: The data scans are an interesting idea.  Right now I run daily smart
short scans and weekly smart long scans to try to catch any bad blocks
before I get multiple errors.  Assuming there aren't any uncaught CRC
errors, I feel comfortable with that approach, but the md-level approach
might be better.  But I'm not sure I see the point of it---unless you have
raid6 with multiple parity blocks, if a disk actually has the wrong
information recorded on it, I don't think you can detect which drive is
bad, just that one of them is.  So I don't think you gain anything beyond
what a standard smart long scan, or just cat'ing the raw device, would
give you in terms of forcing the whole drive to be read.

Bruce

On Tue, 16 Nov 2004 09:27:17 +1100, Neil Brown <neilb@cse.unsw.edu.au> wrote:
> 2/ Look at recovering from failed reads that can be fixed by a write.
>    I am considering leveraging the "bitmap resync" stuff for this. [...]
>    The new "faulty" personality will allow this to be tested easily.

--
Bruce Lowekamp (lowekamp@cs.wm.edu)
Computer Science Dept, College of William and Mary

* RE: Bad blocks are killing us!

From: Guy Watkins @ 2004-11-18 1:46 UTC
To: 'Bruce Lowekamp', 'Neil Brown'; +Cc: linux-raid

2 things about your comments:

1.
You said:
"no one should be using md in an RT-critical application"

I am sorry to hear that!  What do you recommend?  Windows 2000 maybe?

2.
You said:
"but the md-level approach might be better.  But I'm not sure I see the
point of it---unless you have raid6 with multiple parity blocks, if a disk
actually has the wrong information recorded on it I don't think you can
detect which drive is bad, just that one of them is."

If there is a parity block that does not match the data, true, you do not
know which device has the wrong data.  However, if you do not "correct"
the parity, when a device fails it will be reconstructed differently than
it was before it failed.  This will just cause more corrupt data.  The
parity must be made consistent with whatever data is on the data blocks to
prevent this corruption of data.  With RAID6 it should be possible to
determine which block is wrong.  It would be a pain in the @$$, but I
think it would be doable.  I will explain my theory if someone asks.

Guy

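The theory Guy alludes to is the standard RAID-6 property: because Q is a
Reed-Solomon syndrome over GF(2^8) rather than a second XOR, a stripe in
which exactly one data block is silently wrong can be both located and
repaired from the stored P and Q.  The byte-level sketch below uses the
same field polynomial and {02} generator as the kernel's raid6 code, but it
is only an illustration of the arithmetic, not md's implementation - and,
as Bruce notes in the next message, it only helps if you can assume that
exactly one block in the stripe is wrong.

# GF(2^8) arithmetic with the raid6 polynomial x^8+x^4+x^3+x^2+1 (0x11d).
def gf_mul(a, b):
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return p

GF_EXP, GF_LOG = [0] * 256, [0] * 256
x = 1
for i in range(255):
    GF_EXP[i], GF_LOG[x] = x, i
    x = gf_mul(x, 2)                     # generator g = {02}

def syndromes(data):
    # P = xor of the data bytes, Q = sum of g^i * D_i over GF(2^8).
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q

def locate_single_error(data, p_stored, q_stored):
    # Returns None if the stripe is consistent, ('P'/'Q', value) if only a
    # parity block is wrong, or (disk index, corrected byte) if exactly one
    # data block is wrong.  Undefined if more than one block is bad.
    p_calc, q_calc = syndromes(data)
    sp, sq = p_calc ^ p_stored, q_calc ^ q_stored
    if sp == 0 and sq == 0:
        return None
    if sq == 0:
        return ('P', p_calc)             # stored P itself is the bad block
    if sp == 0:
        return ('Q', q_calc)             # stored Q itself is the bad block
    z = (GF_LOG[sq] - GF_LOG[sp]) % 255  # g^z = sq / sp
    return z, data[z] ^ sp               # sp is exactly the error byte

# Tiny demo: corrupt one data byte and find it again.
good = [0x11, 0x22, 0x33, 0x44, 0x55]
p, q = syndromes(good)
bad = list(good)
bad[3] ^= 0x5A
assert locate_single_error(bad, p, q) == (3, good[3])
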
* Re: Bad blocks are killing us!

From: Bruce Lowekamp @ 2004-11-18 16:03 UTC
To: Guy Watkins; +Cc: Neil Brown, linux-raid

On Wed, 17 Nov 2004 20:46:59 -0500, Guy Watkins <guy@watkins-home.com> wrote:
> 2 things about your comments:
>
> 1.
> You said:
> "no one should be using md in an RT-critical application"
>
> I am sorry to hear that!  What do you recommend?  Windows 2000 maybe?

Very funny.  I would only use hardware raid if I had a safety-related RT
system.  (As much as I love md, and I do keep all of my project data on a
very large md raid5.)

> 2.
> If there is a parity block that does not match the data, true, you do not
> know which device has the wrong data.  However, if you do not "correct"
> the parity, when a device fails it will be reconstructed differently than
> it was before it failed.  This will just cause more corrupt data.  The
> parity must be made consistent with whatever data is on the data blocks to
> prevent this corruption of data.  With RAID6 it should be possible to
> determine which block is wrong.  It would be a pain in the @$$, but I
> think it would be doable.  I will explain my theory if someone asks.

The question with a raid5 parity error is, how do you correct it?  You're
right that if a disk fails, the data changes, and that is bad.  But, IMHO,
I don't want the raid subsystem to guess what the correct data is if it
detects that there is that sort of an error.  Flag the error and take the
array offline.  That system needs some sort of diagnosis to determine if
data has actually been lost.  If it happened with my /home partition, I
would probably verify the data with backups.  If it was a different
partition, I might just run fsck on it.  But I think the user needs to be
involved if data loss was detected.

I don't know enough about how the md raid6 implementation works, but a
naive approach of removing each drive and seeing when you find one that
disagrees with the parity of the n-1 other drives seems like it would
work.  Don't think I would want to code it.  Still, at least then you can
correct the data and notify the user level.  But no data was lost, so
continue as normal.

Of course, personally, if md told me a drive had developed an undetected
bit error, I would remove the drive immediately for more diagnostics and
let it switch to a spare, and would probably rather that be the default
behavior if there's a hot spare.  But I'm a bit paranoid...

Bruce

* Re: Bad blocks are killing us!

From: Dieter Stueken @ 2004-11-19 18:47 UTC
To: linux-raid

Guy Watkins wrote:
> If there is a parity block that does not match the data, true, you do not
> know which device has the wrong data.  However, if you do not "correct"
> the parity, when a device fails it will be reconstructed differently than
> it was before it failed.  This will just cause more corrupt data.  The
> parity must be made consistent with whatever data is on the data blocks to
> prevent this corruption of data.  With RAID6 it should be possible to
> determine which block is wrong.  It would be a pain in the @$$, but I
> think it would be doable.  I will explain my theory if someone asks.

This is exactly the same conflict a single drive has with an unreadable
sector.  It notices the sector being bad, and it cannot fulfill any read
request until the data is rewritten or erased.  The single drive cannot
(and should never try to!) silently replace the bad sector with some spare
sector, as it cannot recover the content.  Likewise, the RAID system
cannot solve this problem automagically, and never should try, as the
former content cannot be deduced any more.

But notice that we have two very different problems to examine.  The above
problem arises if all disks of the RAID system claim to read correct data,
whereas the parity information tells us that one of them must be wrong.
As long as we don't have RAID6 to recover such single errors, the data is
LOST and cannot be recovered.

This is very different from the situation where one of the disks DOES
report an internal CRC error.  In this case your data CAN be recovered
reliably from the parity information, and in most cases even successfully
written back to the disk.

But there is also a difference between the problem for RAID compared to
the internal disk: whereas the disk always reads all CRC data for the
sector to verify its integrity, the RAID system does not normally check
the validity of the parity information by default (this is why the idea of
data scans actually came up).

So, if a scan discovers bad parity information, the only action that can
(and must!) be taken is to tag this piece of data as invalid.  And it is
very important not only to log that information somewhere; it is even more
important to prevent further reads of this piece of lost data.  Otherwise
that definitely invalid data may be read without any notice, may get
written back again, and thus turn into valid data, even though it has
become garbage.

People often argue for some spare-sector management, which would solve all
problems.  I think this is an illusion.  Spare sectors can only be useful
if you fail WRITING data, not when reading data failed or data loss
occurred.  This is realized already within the single disks in a
sufficient way (I think).  If your disk gives write errors, you either
have a very old one without internal spare-sector management, or your disk
has run out of spare sectors already.  Read errors are much more frequent
than write errors and thus a much more important issue.

Dieter Stüken.

--
Dieter Stüken, con terra GmbH, Münster
     stueken@conterra.de
     http://www.conterra.de/
     (0)251-7474-501

* RE: Bad blocks are killing us! 2004-11-22 8:22 ` Dieter Stueken @ 2004-11-22 9:17 ` Guy 0 siblings, 0 replies; 18+ messages in thread From: Guy @ 2004-11-22 9:17 UTC (permalink / raw) To: 'Dieter Stueken', linux-raid Summary for anyone that missed this thread. If a RAID5 array is scanned to verify that the parity data matches the data, how should the system handle a mismatch? Assume no disks had read errors. If the parity does not agree, then 1 or more disks are wrong. The parity disk could be wrong, in which case no data is lost or corrupt, yet. But a disk failure at this time would corrupt data unless it was the parity disk for this stripe. If any other disk is wrong, then data is corrupt. If you are lucky, the corrupt data will be un-used space. In this case no data is really corrupt, yet. I agree with your assessment, or you agree with mine! :) I disagree on how it should be handled. Now, what to do if a parity error occurs? As I see it, we have these possible options: A. Ignore it. This is what is done today. But log the blocks affected. Risk of data corruption is high. And know that the risk of additional data corruption is increased when a disk fails. Re-building to the spare has the effect of correcting the parity so the error is now masked. B. Just correct the parity, you stand a high risk of data corruption without knowing about it. But without correcting the parity you still have the same risk. Log the blocks affected. By correcting the parity, no additional corruption will occur when a disk fails. C. Mark all blocks (or chunks) affected by the parity error as unreadable. This would cause data loss, but no corruption. The data loss would be the size of the mismatch (in blocks or chunks) times the number of disks in the array - 1. This acts more like a disk drive when a sector can't be read. Log the blocks affected. In the case of a 14 disk RAID5 array, a single sector parity error would cause 13 sectors to be lost, or much more if going by chunks. Optionally, still allow option B at some later time at the user's request. This would allow the user to determine what data is affected, then attempt to recover some of the data. D. Report the error, and allow manual parity correction. This is like option A then at user request option B. E. All of the above. Have the option to choose which of the above you want. This will allow each user to choose how the system will handle parity errors. This option should be configured per array, not system wide. Guy -----Original Message----- From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Dieter Stueken Sent: Monday, November 22, 2004 3:22 AM To: linux-raid@vger.kernel.org Subject: Re: Bad blocks are killing us! Guy Watkins wrote: > ... but the md-level > approach might be better. But I'm not sure I see the point of > it---unless you have raid 6 with multiple parity blocks, if a disk > actually has the wrong information recorded on it I don't think you > can detect which drive is bad, just that one of them is." > > If there is a parity block that does not match the data, true you do not > know which device has the wrong data. However, if you do not "correct" the > parity, when a device fails, it will be constructed differently than it was > before it failed. This will just cause more corrupt data. The parity must > be made consistent with whatever data is on the data blocks to prevent this > corrosion of data. With RAID6 it should be possible to determine which > block is wrong. 
> It would be a pain in the @$$, but I think it would be doable. I will
> explain my theory if someone asks.

This is exactly the same conflict a single drive has with an unreadable
sector. It marks the sector as bad and cannot fulfill any read request
until the data is rewritten or erased. The single drive cannot (and
should never try to!) silently replace the bad sector with a spare
sector, as it cannot recover the content. Likewise, the RAID system
cannot solve this problem automagically, and never should try to, as
the former content can no longer be deduced.

But notice that we have two very different problems to examine. The
above problem arises when all disks of the RAID system claim to read
correct data, yet the parity information tells us that one of them must
be wrong. As long as we don't have RAID6 to identify and recover a
single wrong block, that data is LOST and cannot be recovered. This is
very different from the situation where one of the disks DOES report an
internal CRC error. In that case your data CAN be recovered reliably
from the parity information and, in most cases, written back to the
disk successfully.

But there is also a difference between the problem for RAID and the
problem inside the disk: whereas the disk always reads the CRC data for
a sector to verify its integrity, the RAID system does not normally
check the validity of the parity information at all (this is why the
idea of data scans came up in the first place).

So, if a scan discovers bad parity information, the only action that
can (and must!) be taken is to tag this piece of data as invalid. It is
not enough to log that information somewhere; it is even more important
to prevent further reads of this piece of lost data. Otherwise this
definitely invalid data may be read again without any notice, and may
even get written back and thus turn into "valid" data, even though it
has become garbage.

People often argue for some kind of spare-sector management, which
would supposedly solve all problems. I think this is an illusion. Spare
sectors are only useful when WRITING data fails, not when reading data
failed or data was lost. This is already handled well enough inside the
individual disks (I think). If your disk gives write errors, you either
have a very old one without internal spare-sector management, or your
disk has already run out of spare sectors.

Read errors are far more frequent than write errors and are thus a much
more important issue.

Dieter Stüken.

--
Dieter Stüken, con terra GmbH, Münster
stueken@conterra.de
http://www.conterra.de/
(0)251-7474-501

^ permalink raw reply	[flat|nested] 18+ messages in thread
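To make the scrub-and-policy discussion in Guy's summary concrete, here is a
minimal user-space sketch in Python. It assumes a toy in-memory stripe whose
parity block is the XOR of its data blocks; the function names and the policy
strings are hypothetical and map onto options A-D above. This is not md code,
just an illustration of the decision being discussed.

# Illustrative sketch only -- not md code. Assumes a toy "array" of
# equal-length byte strings where one block per stripe holds XOR parity.
# Policy names follow Guy's options: "ignore" (A), "fix-parity" (B),
# "mark-bad" (C), "report" (D).

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def scrub_stripe(data_blocks, parity_block, policy, bad_stripes, stripe_no, log):
    """Check one stripe; on mismatch, act according to the chosen policy.
    Returns the parity block that should be on disk afterwards."""
    expected = xor_blocks(data_blocks)
    if expected == parity_block:
        return parity_block                      # stripe is consistent

    log.append(f"parity mismatch in stripe {stripe_no}")
    if policy == "ignore":                       # option A: log only
        return parity_block
    if policy == "fix-parity":                   # option B: rewrite parity
        return expected
    if policy == "mark-bad":                     # option C: refuse further reads
        bad_stripes.add(stripe_no)
        return parity_block
    if policy == "report":                       # option D: leave repair to the admin
        print(f"stripe {stripe_no}: rerun with repair enabled to rewrite parity")
        return parity_block
    raise ValueError(f"unknown policy {policy!r}")

if __name__ == "__main__":
    # Three data blocks plus a deliberately wrong parity block.
    data = [b"\x01" * 4, b"\x02" * 4, b"\x04" * 4]
    parity = b"\x00" * 4                         # correct parity would be 0x07
    bad, log = set(), []
    scrub_stripe(data, parity, "mark-bad", bad, 0, log)
    print(log, bad)                              # -> ['parity mismatch in stripe 0'] {0}

Option E would amount to storing the chosen policy per array and passing it
into the scrub, rather than hard-coding one behaviour system-wide.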
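Dieter's point that a detected mismatch must do more than appear in a log can
be sketched the same way: once a stripe is known to hold unrecoverable data,
reads of it should fail, just as reads of a bad sector fail on a single disk,
until the stripe is rewritten. The class and method names below are
hypothetical and do not correspond to any existing md or mdadm interface.

# Illustrative sketch only -- not how md works. Models the idea that a
# detected parity mismatch must block further reads of the affected
# stripe until the data is rewritten.

class BadStripeError(IOError):
    pass

class ToyArray:
    def __init__(self, stripes):
        self.stripes = stripes          # stripe number -> bytes
        self.bad = set()                # stripes known to hold lost data

    def mark_bad(self, n):
        """Called by the scrubber when parity and data disagree."""
        self.bad.add(n)

    def read(self, n):
        if n in self.bad:
            # Refuse to return data we know is garbage.
            raise BadStripeError(f"stripe {n} holds unrecoverable data")
        return self.stripes[n]

    def write(self, n, data):
        # A full rewrite of the stripe makes its content well-defined
        # again, so the "bad" tag can be dropped -- analogous to a disk
        # remapping a bad sector when it is next written.
        self.stripes[n] = data
        self.bad.discard(n)

if __name__ == "__main__":
    a = ToyArray({0: b"ok", 1: b"??"})
    a.mark_bad(1)
    try:
        a.read(1)
    except BadStripeError as e:
        print("read refused:", e)
    a.write(1, b"fresh")
    print(a.read(1))                    # reads succeed again after the rewrite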
end of thread, other threads:[~2004-11-22 9:17 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200411150522.iAF5MNN18341@www.watkins-home.com>
2004-11-15 22:27 ` Bad blocks are killing us! Neil Brown
2004-11-16 16:28 ` Maurilio Longo
2004-11-16 18:18 ` Guy
2004-11-16 23:04 ` Neil Brown
2004-11-16 23:07 ` Guy
2004-11-17 13:21 ` Badstripe proposal (was Re: Bad blocks are killing us!) David Greaves
2004-11-18 9:59 ` Maurilio Longo
2004-11-18 10:29 ` Robin Bowes
2004-11-19 17:12             ` Jure Pečar
2004-11-20 13:15 ` Maurilio Longo
2004-11-21 18:23               ` Jure Pečar
2004-11-16 23:29 ` Bad blocks are killing us! dean gaudet
2004-11-17 21:58 ` Bruce Lowekamp
2004-11-18 1:46 ` Guy Watkins
2004-11-18 16:03 ` Bruce Lowekamp
2004-11-19 18:47 ` Dieter Stueken
2004-11-22 8:22 ` Dieter Stueken
2004-11-22 9:17 ` Guy