* Re: Bad blocks are killing us!
[not found] <200411150522.iAF5MNN18341@www.watkins-home.com>
@ 2004-11-15 22:27 ` Neil Brown
2004-11-16 16:28 ` Maurilio Longo
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Neil Brown @ 2004-11-15 22:27 UTC (permalink / raw)
To: Guy Watkins; +Cc: linux-raid
On Monday November 15, guy@watkins-home.com wrote:
> Neil,
> This is a private email. You can post it if you want.
snip
>
> Anyway, in the past there have been threads about correcting bad
> blocks automatically within md. I think a RAID1 patch was created that will
> attempt to correct a bad block automatically. Is it likely that you will
> pursue this for RAID5 and maybe RAID6? I hope so.
My current plans for md are:
1/ incorporate the "bitmap resync" patches that have been floating
around for some months. This involves a reasonable amount of
work as I want them to work with raid5/6/10 as well as raid1.
raid10 is particularly interesting as resync is quite different
from recovery there.
2/ Look at recovering from failed reads that can be fixed by a
write. I am considering leveraging the "bitmap resync" stuff for
this. With the bitmap stuff in place, you can let the kernel kick
out a drive that has a read error, let user-space have a quick
look at the drive and see if it might be a recoverable error, and
then give the drive back to the kernel. It will then do a partial
resync based on the bitmap information, thus writing the bad
blocks, and all should be fine. This would mean re-writing
several megabytes instead of a few sectors, but I don't think that
is a big cost. There are a few issues that make it a bit less
trivial than that, but it will probably be my starting point.
The new "faulty" personality will allow this to be tested easily.
3/ Look at background data scans - i.e. read the whole array and
check that parity/copies are correct. This will be triggered and
monitored by user-space. If a read error happens during the scan,
we trip the recovery code discussed above.
While these are my current intentions, there are no guarantees and
definitely no time frame.
I get to spend about 50%-60% of my time on this at the moment, so
there is hope.
> About RAID6, you have fixed a bug or 2 in the last few weeks. Would
> you consider RAID6 stable (safe) yet?
I'm not really in a position to answer that.
The code is structurally very similar to raid5, so there is a good
chance that there are no races or awkward edge cases (unless there
still are some in raid5).
The "parity" arithmetic has been extensively tested out of the kernel
and seems to be reliable.
Basic testing seems to show that it largely works, but I haven't done
more than very basic testing myself.
So it is probably fairly close to stable. What it really needs is
lots of testing.
Build a filesystem on a raid6 and then in a loop:
mount / do metadata-intensive stress test / umount / fsck -f
while that is happening, fail, remove, and re-add various drives.
Try to cover all combinations of failing active drives and
spaces-being-rebuilt while 0, 1, or 2 drives are missing.
Try using a "faulty" device and causing it to fail as well as just
"mdadm --set-faulty".
If you cannot get it to fail, you will have increased your confidence
of it's safety.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Bad blocks are killing us!
2004-11-15 22:27 ` Bad blocks are killing us! Neil Brown
@ 2004-11-16 16:28 ` Maurilio Longo
2004-11-16 18:18 ` Guy
2004-11-17 21:58 ` Bruce Lowekamp
2 siblings, 0 replies; 18+ messages in thread
From: Maurilio Longo @ 2004-11-16 16:28 UTC (permalink / raw)
To: Neil Brown; +Cc: Guy Watkins, linux-raid
Neil Brown ha scritto:
> On Monday November 15, guy@watkins-home.com wrote:
> > Neil,
> > This is a private email. You can post it if you want.
> snip
> >
> > Anyway, in the past there have been threads about correcting bad
> > blocks automatically within md. I think a RAID1 patch was created that will
> > attempt to correct a bad block automatically. Is it likely that you will
> > pursue this for RAID5 and maybe RAID6? I hope so.
>
> My current plans for md are:
[...]
>
> 2/ Look at recovering from failed reads that can be fixed by a
> write. I am considering leveraging the "bitmap resync" stuff for
> this. With the bitmap stuff in place, you can let the kernel kick
> out a drive that has a read error, let user-space have a quick
> look at the drive and see if it might be a recoverable error, and
> then give the drive back to the kernel. It will then do a partial
> resync based on the bitmap information, thus writing the bad
> blocks, and all should be fine. This would mean re-writing
> several megabytes instead of a few sectors, but I don't think that
> is a big cost. There are a few issues that make it a bit less
> trivial than that, but it will probably be my starting point.
> The new "faulty" personality will allow this to be tested easily.
>
I think 2/ should go unattended for at least a few retries, then, if they all
fail, kick-out disk and/or call user-space program to see what's going on, I say
this because an occasional read error should not kick-out a disk or require user
intervention to fix it (as it is now).
And it seems to me that new disks have a lot of badsectors regardless their brand.
just my .02 euro cents :)
regards.
--
__________
| | | |__| md2520@mclink.it
|_|_|_|____| Team OS/2 Italia
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Bad blocks are killing us!
2004-11-15 22:27 ` Bad blocks are killing us! Neil Brown
2004-11-16 16:28 ` Maurilio Longo
@ 2004-11-16 18:18 ` Guy
2004-11-16 23:04 ` Neil Brown
2004-11-17 21:58 ` Bruce Lowekamp
2 siblings, 1 reply; 18+ messages in thread
From: Guy @ 2004-11-16 18:18 UTC (permalink / raw)
To: 'Neil Brown'; +Cc: linux-raid
This sounds great!
But...
2/ Do you intend to create a user space program to attempt to correct the
bad block and put the device back in the array automatically? I hope so.
If not, please consider correcting the bad block without kicking the device
out. Reason: Once the device is kicked out, a second bad block on another
device is fatal to the array. And this has been happening a lot lately.
3/ Maybe don't do the bad block scan if the array is degraded. Reason: If
a bad block is found, that would kick out a second disk, which is fatal.
Since the stated purpose of this is to "check parity/copies are correct"
then you probably can't do this anyway. I just want to be sure. Also, if
during the scan, if a device is kicked, the scan should pause or abort. The
scan can resume once the array has been corrected. I would be happy if the
scan had to be restarted from the start. So a pause or abort is fine with
me.
thanks for your time,
Guy
-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Neil Brown
Sent: Monday, November 15, 2004 5:27 PM
To: Guy Watkins
Cc: linux-raid@vger.kernel.org
Subject: Re: Bad blocks are killing us!
On Monday November 15, guy@watkins-home.com wrote:
> Neil,
> This is a private email. You can post it if you want.
snip
>
> Anyway, in the past there have been threads about correcting bad
> blocks automatically within md. I think a RAID1 patch was created that
will
> attempt to correct a bad block automatically. Is it likely that you will
> pursue this for RAID5 and maybe RAID6? I hope so.
My current plans for md are:
1/ incorporate the "bitmap resync" patches that have been floating
around for some months. This involves a reasonable amount of
work as I want them to work with raid5/6/10 as well as raid1.
raid10 is particularly interesting as resync is quite different
from recovery there.
2/ Look at recovering from failed reads that can be fixed by a
write. I am considering leveraging the "bitmap resync" stuff for
this. With the bitmap stuff in place, you can let the kernel kick
out a drive that has a read error, let user-space have a quick
look at the drive and see if it might be a recoverable error, and
then give the drive back to the kernel. It will then do a partial
resync based on the bitmap information, thus writing the bad
blocks, and all should be fine. This would mean re-writing
several megabytes instead of a few sectors, but I don't think that
is a big cost. There are a few issues that make it a bit less
trivial than that, but it will probably be my starting point.
The new "faulty" personality will allow this to be tested easily.
3/ Look at background data scans - i.e. read the whole array and
check that parity/copies are correct. This will be triggered and
monitored by user-space. If a read error happens during the scan,
we trip the recovery code discussed above.
While these are my current intentions, there are no guarantees and
definitely no time frame.
I get to spend about 50%-60% of my time on this at the moment, so
there is hope.
> About RAID6, you have fixed a bug or 2 in the last few weeks. Would
> you consider RAID6 stable (safe) yet?
I'm not really in a position to answer that.
The code is structurally very similar to raid5, so there is a good
chance that there are no races or awkward edge cases (unless there
still are some in raid5).
The "parity" arithmetic has been extensively tested out of the kernel
and seems to be reliable.
Basic testing seems to show that it largely works, but I haven't done
more than very basic testing myself.
So it is probably fairly close to stable. What it really needs is
lots of testing.
Build a filesystem on a raid6 and then in a loop:
mount / do metadata-intensive stress test / umount / fsck -f
while that is happening, fail, remove, and re-add various drives.
Try to cover all combinations of failing active drives and
spaces-being-rebuilt while 0, 1, or 2 drives are missing.
Try using a "faulty" device and causing it to fail as well as just
"mdadm --set-faulty".
If you cannot get it to fail, you will have increased your confidence
of it's safety.
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Bad blocks are killing us!
2004-11-16 18:18 ` Guy
@ 2004-11-16 23:04 ` Neil Brown
2004-11-16 23:07 ` Guy
2004-11-16 23:29 ` Bad blocks are killing us! dean gaudet
0 siblings, 2 replies; 18+ messages in thread
From: Neil Brown @ 2004-11-16 23:04 UTC (permalink / raw)
To: Guy; +Cc: linux-raid
On Tuesday November 16, bugzilla@watkins-home.com wrote:
> This sounds great!
>
> But...
>
> 2/ Do you intend to create a user space program to attempt to correct the
> bad block and put the device back in the array automatically? I
> hope so.
Definitely. It would be added to the functionality of "mdadm --monitor".
>
> If not, please consider correcting the bad block without kicking the device
> out. Reason: Once the device is kicked out, a second bad block on another
> device is fatal to the array. And this has been happening a lot
> lately.
This one of several things that makes it "a bit less trivial" than
simply using the bitmap stuff. I will keep your comment in mind when
I start looking at this in more detail. Thanks.
>
> 3/ Maybe don't do the bad block scan if the array is degraded. Reason: If
> a bad block is found, that would kick out a second disk, which is fatal.
> Since the stated purpose of this is to "check parity/copies are correct"
> then you probably can't do this anyway. I just want to be sure. Also, if
> during the scan, if a device is kicked, the scan should pause or abort. The
> scan can resume once the array has been corrected. I would be happy if the
> scan had to be restarted from the start. So a pause or abort is fine with
> me.
I hadn't thought about that yet. I suspect there would be little
point in doing a scan when there was no redundancy. However a scan on
a degraded raid6 that could still safely loose one drive would
probably make sense.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Bad blocks are killing us!
2004-11-16 23:04 ` Neil Brown
@ 2004-11-16 23:07 ` Guy
2004-11-17 13:21 ` Badstripe proposal (was Re: Bad blocks are killing us!) David Greaves
2004-11-16 23:29 ` Bad blocks are killing us! dean gaudet
1 sibling, 1 reply; 18+ messages in thread
From: Guy @ 2004-11-16 23:07 UTC (permalink / raw)
To: 'Neil Brown'; +Cc: linux-raid
Neil said:
"I hadn't thought about that yet. I suspect there would be little
point in doing a scan when there was no redundancy. However a scan on
a degraded raid6 that could still safely loose one drive would
probably make sense."
I agree.
Also a RAID1 with 2 or more working devices. Don't forget, some people have
3 or more devices on the RAID1 arrays. From what I have read anyway.
Thanks,
Guy
-----Original Message-----
From: Neil Brown [mailto:neilb@cse.unsw.edu.au]
Sent: Tuesday, November 16, 2004 6:04 PM
To: Guy
Cc: linux-raid@vger.kernel.org
Subject: RE: Bad blocks are killing us!
On Tuesday November 16, bugzilla@watkins-home.com wrote:
> This sounds great!
>
> But...
>
> 2/ Do you intend to create a user space program to attempt to correct the
> bad block and put the device back in the array automatically? I
> hope so.
Definitely. It would be added to the functionality of "mdadm --monitor".
>
> If not, please consider correcting the bad block without kicking the
device
> out. Reason: Once the device is kicked out, a second bad block on
another
> device is fatal to the array. And this has been happening a lot
> lately.
This one of several things that makes it "a bit less trivial" than
simply using the bitmap stuff. I will keep your comment in mind when
I start looking at this in more detail. Thanks.
>
> 3/ Maybe don't do the bad block scan if the array is degraded. Reason:
If
> a bad block is found, that would kick out a second disk, which is fatal.
> Since the stated purpose of this is to "check parity/copies are correct"
> then you probably can't do this anyway. I just want to be sure. Also, if
> during the scan, if a device is kicked, the scan should pause or abort.
The
> scan can resume once the array has been corrected. I would be happy if
the
> scan had to be restarted from the start. So a pause or abort is fine with
> me.
I hadn't thought about that yet. I suspect there would be little
point in doing a scan when there was no redundancy. However a scan on
a degraded raid6 that could still safely loose one drive would
probably make sense.
NeilBrown
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Bad blocks are killing us!
2004-11-16 23:04 ` Neil Brown
2004-11-16 23:07 ` Guy
@ 2004-11-16 23:29 ` dean gaudet
1 sibling, 0 replies; 18+ messages in thread
From: dean gaudet @ 2004-11-16 23:29 UTC (permalink / raw)
To: Neil Brown; +Cc: Guy, linux-raid
i've been trying to puzzle out a related solution... suppose there were
"lock/unlock stripe" operations. md delays all i/o to a locked stripe
indefinitely (or errors out after a timeout, both solutions work).
a userland daemon could then walk through an array locking stripes
performing whatever corrective actions it desires (raid5/6 reconstruction,
raid1 reconstruction using "voting" for >2 ways).
there are some deadlock scenarios which the daemon must avoid with
mlockall() and by having static memory allocation requirements.
that gives us a proactive detect/repair solution... maybe it also gives us
a reactive solution: when a read error occurs, md could lock the stripe,
wake the daemon, let it repair and unlock.
there's good and bad aspects to this idea... i figured it's worth
mentioning though.
-dean
^ permalink raw reply [flat|nested] 18+ messages in thread
* Badstripe proposal (was Re: Bad blocks are killing us!)
2004-11-16 23:07 ` Guy
@ 2004-11-17 13:21 ` David Greaves
2004-11-18 9:59 ` Maurilio Longo
0 siblings, 1 reply; 18+ messages in thread
From: David Greaves @ 2004-11-17 13:21 UTC (permalink / raw)
To: Guy; +Cc: 'Neil Brown', linux-raid, dean-list-linux-raid
Just for discussion...
Proposal:
md devices to have a badstripe table and space for re-allocation
Benefits:
Allows multiple block level failures on any combination of component md
devices provided parity is not compromised.
Zero impact on performance in non-degraded mode.
No need for scanning (although it may be used as a trigger)
Works for all md personalities.
Overview:
Provide an 'on or off-array' store for any stripes impacted by block
level failure.
Unlike a disk's badblock allocation this would be a temporary store
since we'd insist on the underlying devices recovering fully from the
problem before restoring full health.
This allows us to cope transiently and, in the event of non-recoverable
errors, until the disk is replaced.
Downsides:
Resync'ing with multiple failing drives is more complex (but more resilient)
Some kind of store handler is needed.
Description:
I've structured this to look at the md driver, the userspace daemon, the
store, failing drives and replacing and resync'ing drives.
md:
For normal md access the badstripe list has no entries and is ignored. A
badstripe size check is required prior to each stripe access.
If a write error occurs, rewrite the stripe to a store noting, and
marking bad, the originating (faulty) stripe (and offending
device/block) in the badstripe table. The device is marked 'failing'.
If a read error occurs, attempt to reconstruct the stripe from the other
devices then follow the write error path.
For normal md access against stripes appearing in the badstripe list:
* Lock the badstripe table against the daemon (and other md threads)
* Check the stripe is still in the bad stripe list
* If not then the userland daemon fixed it. Release lock. Carry on as
normal.
* If so then read/write from the reserved area.
* Release badstripe lock.
Daemon:
A userland daemon could examine the reserved area, attempt a repair on a
faulty stripe and, if it succeeds, could restore the stripe and mark the
badstripe entry as clean thus freeing up the reserved area and restoring
perfect health.
The daemon would:
* lock the badstripe table against md
* write the stripe back to the previously faulty area which shouldn't
need locking against md since it's "not in use"
* correct the badstripe table
* release the lock
If the daemon fails then the badstripe entry is marked as unrecoverable.
If the daemon has failed to correct the error (unrecoverable in the
badstripe table) then the drive should be kept as failing (not faulty)
and should be replaced. The intention is to allow a failing drive to
continue to be used in the event of a subsequent bad drive event.
The Store:
This could be reserved stripes at the start (?) of the component devices
read/written using the current personality. Alternatively it could be a
filesystem level store (possibly remote, on a resilient device or just
in /tmp).
Failing drives:
From a reading point of view it seems possible to treat a failing drive
as a faulty drive - until the event of another read failure on another
drive. In that case the read error case above could still access the
failing drive to attempt a recovery. This may help in the event of
recovery from a failing drive where you want to minimise load against
it. It may not be worthwhile.
Writing would still have to continue to maintain sync.
Drive replacement + resync:
If multiple devices go 'failing' then how are they removed (since they
are all in use). A spare needs to be added and then the resync code
needs to ensure that one of the failing disks is synced to the spare.
Then the failing disk is made faulty and then removed.
This could be done by having a progression:
failing
failing-pending-remove
faulty
As I said above a failing drive is not used for reads, only for writes.
Presumably a drive that is sync'ing is used for writes but not reads.
So if we add a good drive and mark it syncing and simultaneously mark
the drive it replaces failing-pending-remove then the f-p-r drive won't
be written to but is available for essential reads until the new drive
is ready.
Some thoughts:
How much overhead is involved in checking each stripe read/write address
against a *small* bad-stripe table. Probably none because most of the
time, for a healthy md, the number of entries is 0.
Does the temporary space even have to be in the md space? Would it be
easier to make it a file (not in the filesystem on the md device!!) This
avoids any messing with stripe offsets etc.
I don't claim to understand md's locking - the stuff above is a
simplistic start on the additional locking related to moving stuff in
and out of the badstripes area. I don't know where contention is handled
- md driver or fs.
This is essentially only useful for single (or at least 'few') badblock
errors - is that a problem worth solving (from the thread title I assume
so).
How intrusive is this? I can't really judge. It mainly feels like error
handling - and maybe handing off to a reused/simplified loopback-like
device could handle 'hits' against the reserved area.
I'm only starting to read the code/device drivers books etc etc so if
I'm talking rubbish then I'll apologise for your time and keep quiet :)
David
Guy wrote:
>Neil said:
>"I hadn't thought about that yet. I suspect there would be little
>point in doing a scan when there was no redundancy. However a scan on
>a degraded raid6 that could still safely loose one drive would
>probably make sense."
>
>I agree.
>
>Also a RAID1 with 2 or more working devices. Don't forget, some people have
>3 or more devices on the RAID1 arrays. From what I have read anyway.
>
>Thanks,
>Guy
>
>-----Original Message-----
>From: Neil Brown [mailto:neilb@cse.unsw.edu.au]
>Sent: Tuesday, November 16, 2004 6:04 PM
>To: Guy
>Cc: linux-raid@vger.kernel.org
>Subject: RE: Bad blocks are killing us!
>
>On Tuesday November 16, bugzilla@watkins-home.com wrote:
>
>
>>This sounds great!
>>
>>But...
>>
>>2/ Do you intend to create a user space program to attempt to correct the
>>bad block and put the device back in the array automatically? I
>>hope so.
>>
>>
>
>Definitely. It would be added to the functionality of "mdadm --monitor".
>
>
>
>>If not, please consider correcting the bad block without kicking the
>>
>>
>device
>
>
>>out. Reason: Once the device is kicked out, a second bad block on
>>
>>
>another
>
>
>>device is fatal to the array. And this has been happening a lot
>>lately.
>>
>>
>
>This one of several things that makes it "a bit less trivial" than
>simply using the bitmap stuff. I will keep your comment in mind when
>I start looking at this in more detail. Thanks.
>
>
>
>>3/ Maybe don't do the bad block scan if the array is degraded. Reason:
>>
>>
>If
>
>
>>a bad block is found, that would kick out a second disk, which is fatal.
>>Since the stated purpose of this is to "check parity/copies are correct"
>>then you probably can't do this anyway. I just want to be sure. Also, if
>>during the scan, if a device is kicked, the scan should pause or abort.
>>
>>
>The
>
>
>>scan can resume once the array has been corrected. I would be happy if
>>
>>
>the
>
>
>>scan had to be restarted from the start. So a pause or abort is fine with
>>me.
>>
>>
>
>I hadn't thought about that yet. I suspect there would be little
>point in doing a scan when there was no redundancy. However a scan on
>a degraded raid6 that could still safely loose one drive would
>probably make sense.
>
>NeilBrown
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Bad blocks are killing us!
2004-11-15 22:27 ` Bad blocks are killing us! Neil Brown
2004-11-16 16:28 ` Maurilio Longo
2004-11-16 18:18 ` Guy
@ 2004-11-17 21:58 ` Bruce Lowekamp
2004-11-18 1:46 ` Guy Watkins
2 siblings, 1 reply; 18+ messages in thread
From: Bruce Lowekamp @ 2004-11-17 21:58 UTC (permalink / raw)
To: Neil Brown; +Cc: Guy Watkins, linux-raid
2: Thanks for devoting the time for getting this done. Personally,
for the PATA arrays I use, this approach is a bit overkill---if the
rewrite succeeds, it's ok (unless I start to see repeated errors, in
which case I yank the drive), if the rewrite doesn't succeed, it's
dead and I have to yank the drive. I don't have any useful
diagnostic tools at linux user-level other than smart badblocks scans,
which would just confirm the bad sectors. Personally, I wouldn't go
to the effort to keep (parts of) the drive in the array if it can't be
rewritten successfully---I've never seen a drive last long in that
situation, and I think that drive is really dead. The only problems
I've had in practice have been with mutliple accumulated read
errors---and rewriting those would make them go away quickly. I would
just want the data rewritten at user level, and log the event so I can
monitor the array for failures and look at the smart output or take a
drive offline for testing (with vendor diag tools) if it starts to
have frequent errors. Naturally, as long as the more complex approach
of kicking to user level allows the user-level to return immediately
to let the kernel rewrite the stripe, I think it's fine.
I agree that writing several megabytes is not an issue in any way.
IMHO, feel free to hang the whole system for a few seconds if
necessary---no one should be using md in an RT-critical application,
and bad blocks are relatively rare.
3: The data scans is an interesting idea. Right now I run daily smart
short scans and weekly smart long scans to try to catch any bad blocks
before I get multiple errors. Assuming there aren't any uncaught CRC
errors, I feel comfortable with that approach, but the md-level
approach might be better. But I'm not sure I see the point of
it---unless you have raid 6 with multiple parity blocks, if a disk
actually has the wrong information recorded on it I don't think you
can detect which drive is bad, just that one of them is. So I don't
think you gain anything beyond what a standard smart long scan or just
cat'ing the raw device would give you in terms of forcing the whole
drive to be read.
Bruce
On Tue, 16 Nov 2004 09:27:17 +1100, Neil Brown <neilb@cse.unsw.edu.au> wrote:
> 2/ Look at recovering from failed reads that can be fixed by a
> write. I am considering leveraging the "bitmap resync" stuff for
> this. With the bitmap stuff in place, you can let the kernel kick
> out a drive that has a read error, let user-space have a quick
> look at the drive and see if it might be a recoverable error, and
> then give the drive back to the kernel. It will then do a partial
> resync based on the bitmap information, thus writing the bad
> blocks, and all should be fine. This would mean re-writing
> several megabytes instead of a few sectors, but I don't think that
> is a big cost. There are a few issues that make it a bit less
> trivial than that, but it will probably be my starting point.
> The new "faulty" personality will allow this to be tested easily.
--
Bruce Lowekamp (lowekamp@cs.wm.edu)
Computer Science Dept, College of William and Mary
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Bad blocks are killing us!
2004-11-17 21:58 ` Bruce Lowekamp
@ 2004-11-18 1:46 ` Guy Watkins
2004-11-18 16:03 ` Bruce Lowekamp
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Guy Watkins @ 2004-11-18 1:46 UTC (permalink / raw)
To: 'Bruce Lowekamp', 'Neil Brown'; +Cc: linux-raid
2 things about your comments:
1.
You said:
"no one should be using md in an RT-critical application"
I am sorry to hear that! What do recommend? Windows 2000 maybe?
2.
You said:
"but the md-level
approach might be better. But I'm not sure I see the point of
it---unless you have raid 6 with multiple parity blocks, if a disk
actually has the wrong information recorded on it I don't think you
can detect which drive is bad, just that one of them is."
If there is a parity block that does not match the data, true you do not
know which device has the wrong data. However, if you do not "correct" the
parity, when a device fails, it will be constructed differently than it was
before it failed. This will just cause more corrupt data. The parity must
be made consistent with whatever data is on the data blocks to prevent this
corrosion of data. With RAID6 it should be possible to determine which
block is wrong. It would be a pain in the @$$, but I think it would be
doable. I will explain my theory if someone asks.
Guy
-----Original Message-----
From: Bruce Lowekamp [mailto:brucelowekamp@gmail.com]
Sent: Wednesday, November 17, 2004 4:58 PM
To: Neil Brown
Cc: Guy Watkins; linux-raid@vger.kernel.org
Subject: Re: Bad blocks are killing us!
2: Thanks for devoting the time for getting this done. Personally,
for the PATA arrays I use, this approach is a bit overkill---if the
rewrite succeeds, it's ok (unless I start to see repeated errors, in
which case I yank the drive), if the rewrite doesn't succeed, it's
dead and I have to yank the drive. I don't have any useful
diagnostic tools at linux user-level other than smart badblocks scans,
which would just confirm the bad sectors. Personally, I wouldn't go
to the effort to keep (parts of) the drive in the array if it can't be
rewritten successfully---I've never seen a drive last long in that
situation, and I think that drive is really dead. The only problems
I've had in practice have been with mutliple accumulated read
errors---and rewriting those would make them go away quickly. I would
just want the data rewritten at user level, and log the event so I can
monitor the array for failures and look at the smart output or take a
drive offline for testing (with vendor diag tools) if it starts to
have frequent errors. Naturally, as long as the more complex approach
of kicking to user level allows the user-level to return immediately
to let the kernel rewrite the stripe, I think it's fine.
I agree that writing several megabytes is not an issue in any way.
IMHO, feel free to hang the whole system for a few seconds if
necessary---no one should be using md in an RT-critical application,
and bad blocks are relatively rare.
3: The data scans is an interesting idea. Right now I run daily smart
short scans and weekly smart long scans to try to catch any bad blocks
before I get multiple errors. Assuming there aren't any uncaught CRC
errors, I feel comfortable with that approach, but the md-level
approach might be better. But I'm not sure I see the point of
it---unless you have raid 6 with multiple parity blocks, if a disk
actually has the wrong information recorded on it I don't think you
can detect which drive is bad, just that one of them is. So I don't
think you gain anything beyond what a standard smart long scan or just
cat'ing the raw device would give you in terms of forcing the whole
drive to be read.
Bruce
On Tue, 16 Nov 2004 09:27:17 +1100, Neil Brown <neilb@cse.unsw.edu.au>
wrote:
> 2/ Look at recovering from failed reads that can be fixed by a
> write. I am considering leveraging the "bitmap resync" stuff for
> this. With the bitmap stuff in place, you can let the kernel kick
> out a drive that has a read error, let user-space have a quick
> look at the drive and see if it might be a recoverable error, and
> then give the drive back to the kernel. It will then do a partial
> resync based on the bitmap information, thus writing the bad
> blocks, and all should be fine. This would mean re-writing
> several megabytes instead of a few sectors, but I don't think that
> is a big cost. There are a few issues that make it a bit less
> trivial than that, but it will probably be my starting point.
> The new "faulty" personality will allow this to be tested easily.
--
Bruce Lowekamp (lowekamp@cs.wm.edu)
Computer Science Dept, College of William and Mary
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Badstripe proposal (was Re: Bad blocks are killing us!)
2004-11-17 13:21 ` Badstripe proposal (was Re: Bad blocks are killing us!) David Greaves
@ 2004-11-18 9:59 ` Maurilio Longo
2004-11-18 10:29 ` Robin Bowes
2004-11-19 17:12 ` Jure Pe_ar
0 siblings, 2 replies; 18+ messages in thread
From: Maurilio Longo @ 2004-11-18 9:59 UTC (permalink / raw)
To: David Greaves; +Cc: Guy, 'Neil Brown', linux-raid, dean-list-linux-raid
David and others,
I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
bad-block management layer, maybe it could be merged inside md to have write
errors management as well, maybe :)
regards.
David Greaves ha scritto:
> Just for discussion...
>
> Proposal:
> md devices to have a badstripe table and space for re-allocation
>
> Benefits:
> Allows multiple block level failures on any combination of component md
> devices provided parity is not compromised.
> Zero impact on performance in non-degraded mode.
> No need for scanning (although it may be used as a trigger)
> Works for all md personalities.
>
> Overview:
> Provide an 'on or off-array' store for any stripes impacted by block
> level failure.
> Unlike a disk's badblock allocation this would be a temporary store
> since we'd insist on the underlying devices recovering fully from the
> problem before restoring full health.
> This allows us to cope transiently and, in the event of non-recoverable
> errors, until the disk is replaced.
>
> Downsides:
> Resync'ing with multiple failing drives is more complex (but more resilient)
> Some kind of store handler is needed.
>
> Description:
> I've structured this to look at the md driver, the userspace daemon, the
> store, failing drives and replacing and resync'ing drives.
>
> md:
> For normal md access the badstripe list has no entries and is ignored. A
> badstripe size check is required prior to each stripe access.
>
> If a write error occurs, rewrite the stripe to a store noting, and
> marking bad, the originating (faulty) stripe (and offending
> device/block) in the badstripe table. The device is marked 'failing'.
> If a read error occurs, attempt to reconstruct the stripe from the other
> devices then follow the write error path.
>
> For normal md access against stripes appearing in the badstripe list:
> * Lock the badstripe table against the daemon (and other md threads)
> * Check the stripe is still in the bad stripe list
> * If not then the userland daemon fixed it. Release lock. Carry on as
> normal.
> * If so then read/write from the reserved area.
> * Release badstripe lock.
>
> Daemon:
> A userland daemon could examine the reserved area, attempt a repair on a
> faulty stripe and, if it succeeds, could restore the stripe and mark the
> badstripe entry as clean thus freeing up the reserved area and restoring
> perfect health.
> The daemon would:
> * lock the badstripe table against md
> * write the stripe back to the previously faulty area which shouldn't
> need locking against md since it's "not in use"
> * correct the badstripe table
> * release the lock
> If the daemon fails then the badstripe entry is marked as unrecoverable.
>
> If the daemon has failed to correct the error (unrecoverable in the
> badstripe table) then the drive should be kept as failing (not faulty)
> and should be replaced. The intention is to allow a failing drive to
> continue to be used in the event of a subsequent bad drive event.
>
> The Store:
> This could be reserved stripes at the start (?) of the component devices
> read/written using the current personality. Alternatively it could be a
> filesystem level store (possibly remote, on a resilient device or just
> in /tmp).
>
> Failing drives:
> From a reading point of view it seems possible to treat a failing drive
> as a faulty drive - until the event of another read failure on another
> drive. In that case the read error case above could still access the
> failing drive to attempt a recovery. This may help in the event of
> recovery from a failing drive where you want to minimise load against
> it. It may not be worthwhile.
> Writing would still have to continue to maintain sync.
>
> Drive replacement + resync:
> If multiple devices go 'failing' then how are they removed (since they
> are all in use). A spare needs to be added and then the resync code
> needs to ensure that one of the failing disks is synced to the spare.
> Then the failing disk is made faulty and then removed.
>
> This could be done by having a progression:
> failing
> failing-pending-remove
> faulty
>
> As I said above a failing drive is not used for reads, only for writes.
> Presumably a drive that is sync'ing is used for writes but not reads.
> So if we add a good drive and mark it syncing and simultaneously mark
> the drive it replaces failing-pending-remove then the f-p-r drive won't
> be written to but is available for essential reads until the new drive
> is ready.
>
> Some thoughts:
> How much overhead is involved in checking each stripe read/write address
> against a *small* bad-stripe table. Probably none because most of the
> time, for a healthy md, the number of entries is 0.
>
> Does the temporary space even have to be in the md space? Would it be
> easier to make it a file (not in the filesystem on the md device!!) This
> avoids any messing with stripe offsets etc.
>
> I don't claim to understand md's locking - the stuff above is a
> simplistic start on the additional locking related to moving stuff in
> and out of the badstripes area. I don't know where contention is handled
> - md driver or fs.
>
> This is essentially only useful for single (or at least 'few') badblock
> errors - is that a problem worth solving (from the thread title I assume
> so).
>
> How intrusive is this? I can't really judge. It mainly feels like error
> handling - and maybe handing off to a reused/simplified loopback-like
> device could handle 'hits' against the reserved area.
>
> I'm only starting to read the code/device drivers books etc etc so if
> I'm talking rubbish then I'll apologise for your time and keep quiet :)
>
> David
>
> Guy wrote:
>
> >Neil said:
> >"I hadn't thought about that yet. I suspect there would be little
> >point in doing a scan when there was no redundancy. However a scan on
> >a degraded raid6 that could still safely loose one drive would
> >probably make sense."
> >
> >I agree.
> >
> >Also a RAID1 with 2 or more working devices. Don't forget, some people have
> >3 or more devices on the RAID1 arrays. From what I have read anyway.
> >
> >Thanks,
> >Guy
> >
> >-----Original Message-----
> >From: Neil Brown [mailto:neilb@cse.unsw.edu.au]
> >Sent: Tuesday, November 16, 2004 6:04 PM
> >To: Guy
> >Cc: linux-raid@vger.kernel.org
> >Subject: RE: Bad blocks are killing us!
> >
> >On Tuesday November 16, bugzilla@watkins-home.com wrote:
> >
> >
> >>This sounds great!
> >>
> >>But...
> >>
> >>2/ Do you intend to create a user space program to attempt to correct the
> >>bad block and put the device back in the array automatically? I
> >>hope so.
> >>
> >>
> >
> >Definitely. It would be added to the functionality of "mdadm --monitor".
> >
> >
> >
> >>If not, please consider correcting the bad block without kicking the
> >>
> >>
> >device
> >
> >
> >>out. Reason: Once the device is kicked out, a second bad block on
> >>
> >>
> >another
> >
> >
> >>device is fatal to the array. And this has been happening a lot
> >>lately.
> >>
> >>
> >
> >This one of several things that makes it "a bit less trivial" than
> >simply using the bitmap stuff. I will keep your comment in mind when
> >I start looking at this in more detail. Thanks.
> >
> >
> >
> >>3/ Maybe don't do the bad block scan if the array is degraded. Reason:
> >>
> >>
> >If
> >
> >
> >>a bad block is found, that would kick out a second disk, which is fatal.
> >>Since the stated purpose of this is to "check parity/copies are correct"
> >>then you probably can't do this anyway. I just want to be sure. Also, if
> >>during the scan, if a device is kicked, the scan should pause or abort.
> >>
> >>
> >The
> >
> >
> >>scan can resume once the array has been corrected. I would be happy if
> >>
> >>
> >the
> >
> >
> >>scan had to be restarted from the start. So a pause or abort is fine with
> >>me.
> >>
> >>
> >
> >I hadn't thought about that yet. I suspect there would be little
> >point in doing a scan when there was no redundancy. However a scan on
> >a degraded raid6 that could still safely loose one drive would
> >probably make sense.
> >
> >NeilBrown
> >
> >-
> >To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
> >
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
__________
| | | |__| md2520@mclink.it
|_|_|_|____| Team OS/2 Italia
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Badstripe proposal (was Re: Bad blocks are killing us!)
2004-11-18 9:59 ` Maurilio Longo
@ 2004-11-18 10:29 ` Robin Bowes
2004-11-19 17:12 ` Jure Pe_ar
1 sibling, 0 replies; 18+ messages in thread
From: Robin Bowes @ 2004-11-18 10:29 UTC (permalink / raw)
To: Maurilio Longo
Cc: David Greaves, Guy, 'Neil Brown', linux-raid,
dean-list-linux-raid
Maurilio Longo wrote:
> David and others,
>
> I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
> bad-block management layer, maybe it could be merged inside md to have write
> errors management as well, maybe :)
Good point.
When I set up my RAID (albeit a relatviely non-critical domestic system)
I considered whether or not to use EVMS because of the bad block
relocation feature.
In the end, I went with md, lvm2, and mdadm, primarily as I recall
because at the time EVMS did not support lvm2 and also required
additional kernel patches to support bad block relocation. Plus, there
were issues about having the root filesystem managed by EVMS (it
requires the set up of a custom initrd).
All things considered, I chose the path of least resistance and went
with md/lvm2/mdadm. Perhaps it's time to reconsider that and look at
EVMS again?
R.
--
http://robinbowes.com
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Bad blocks are killing us!
2004-11-18 1:46 ` Guy Watkins
@ 2004-11-18 16:03 ` Bruce Lowekamp
2004-11-19 18:47 ` Dieter Stueken
2004-11-22 8:22 ` Dieter Stueken
2 siblings, 0 replies; 18+ messages in thread
From: Bruce Lowekamp @ 2004-11-18 16:03 UTC (permalink / raw)
To: Guy Watkins; +Cc: Neil Brown, linux-raid
On Wed, 17 Nov 2004 20:46:59 -0500, Guy Watkins <guy@watkins-home.com> wrote:
> 2 things about your comments:
>
> 1.
> You said:
> "no one should be using md in an RT-critical application"
>
> I am sorry to hear that! What do recommend? Windows 2000 maybe?
Very funny. I would only use hardware raid if I had a safety-related
RT system. (as much as I love md, and I do keep all of my project
data on a very large md raid5.)
>
> 2.
> You said:
> "but the md-level
> approach might be better. But I'm not sure I see the point of
> it---unless you have raid 6 with multiple parity blocks, if a disk
> actually has the wrong information recorded on it I don't think you
> can detect which drive is bad, just that one of them is."
>
> If there is a parity block that does not match the data, true you do not
> know which device has the wrong data. However, if you do not "correct" the
> parity, when a device fails, it will be constructed differently than it was
> before it failed. This will just cause more corrupt data. The parity must
> be made consistent with whatever data is on the data blocks to prevent this
> corrosion of data. With RAID6 it should be possible to determine which
> block is wrong. It would be a pain in the @$$, but I think it would be
> doable. I will explain my theory if someone asks.
The question with a raid5 parity error is, how do you correct it?
You're right that if a disk fails, the data changes, and that is bad.
But, IMHO, I don't want the raid subsystem to guess what the correct
data is if it detects that there is that sort of an error. Flag the
error and take the array offline. That system needs some sort of
diagnosis to determine if data has actually been lost. If it happened
with my /home partition, I would probably verify the data with
backups. If it was a different partition, I might just run fsck on
it. But I think the user needs to be involved if data loss was
detected.
I don't know enough about how the md raid6 implementation works, but a
naive approach of removing each drive and seeing when you find one
that disagrees with the parity of the n-1 other drives seems like it
would work. Don't think I would want to code it. Still, at least
then you can correct the data and notify the user level. But no data
was lost, so continue as normal. Of course, personally, if md told me
a drive had developed an undetected bit error, I would remove the
drive immediately for more diagnostics and let it switch to a spare,
and would probably rather that be the default behavior if there's a
hot spare. But I'm a bit paranoid...
Bruce
--
Bruce Lowekamp (lowekamp@cs.wm.edu)
Computer Science Dept, College of William and Mary
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Badstripe proposal (was Re: Bad blocks are killing us!)
2004-11-18 9:59 ` Maurilio Longo
2004-11-18 10:29 ` Robin Bowes
@ 2004-11-19 17:12 ` Jure Pe_ar
2004-11-20 13:15 ` Maurilio Longo
1 sibling, 1 reply; 18+ messages in thread
From: Jure Pe_ar @ 2004-11-19 17:12 UTC (permalink / raw)
To: linux-raid
On Thu, 18 Nov 2004 10:59:58 +0100
Maurilio Longo <maurilio.longo@libero.it> wrote:
> David and others,
>
> I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
> bad-block management layer, maybe it could be merged inside md to have
> write errors management as well, maybe :)
BBR as it is called is implemented with a device mapper. AFAIK one can take
dm managed block devices and build md with them.
--
Jure Pečar
http://jure.pecar.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Bad blocks are killing us!
2004-11-18 1:46 ` Guy Watkins
2004-11-18 16:03 ` Bruce Lowekamp
@ 2004-11-19 18:47 ` Dieter Stueken
2004-11-22 8:22 ` Dieter Stueken
2 siblings, 0 replies; 18+ messages in thread
From: Dieter Stueken @ 2004-11-19 18:47 UTC (permalink / raw)
To: linux-raid
Guy Watkins wrote:
> "but the md-level
> approach might be better. But I'm not sure I see the point of
> it---unless you have raid 6 with multiple parity blocks, if a disk
> actually has the wrong information recorded on it I don't think you
> can detect which drive is bad, just that one of them is."
>
> If there is a parity block that does not match the data, true you do not
> know which device has the wrong data. However, if you do not "correct" the
> parity, when a device fails, it will be constructed differently than it was
> before it failed. This will just cause more corrupt data. The parity must
> be made consistent with whatever data is on the data blocks to prevent this
> corrosion of data. With RAID6 it should be possible to determine which
> block is wrong. It would be a pain in the @$$, but I think it would be
> doable. I will explain my theory if someone asks.
This is exactly the same conflict, a single drive has with a unreadable sector.
It notices the sector beeing bad, and it can not fulfill any read request, until
the data is not rewritten or erased. The single drive can not (and should never
try to!) silently replace the bad sector by some spare sectors, as it can not
recover the content.
Also the RAID system can not solve this problem automagically, and never should
do so, as the former content can not be deduced any more. But notice, that we
have two very different problems to examine: The above problem arises, if all
disks of the RAID system claim to read correct data, whereas the parity information
tells us, that one of them must be wrong. As long as we don't have RAID6,
to recover single bit errors, the data is LOST and can not be recovered.
This is very different to the situation, if one of the disks DOES report
an internal crc-error. In this case your data CAN be recovered reliable from the
parity information, and in most cases even successfully written back to the disk.
But there is also a difference between the problem for RAID compared to the internal
disk: Whereas the disk always reads all CRC data for the sector to verify its integrity,
the RAID system does not normally check the validity of the parity information
by default. (this is, why the idea of data scans actually came up). So, if a scan
discovers a bad parity information, the only action that can (and must!) be taken
is, to tag this peace of data to be invalid. And it is very important, not only
to log that information somewhere. It is even more important to prevent further readings
of this peace of lost data. Otherwise those definitely invalid data may be read
without any notice, may get written back again and thus turn into valid data,
even though it become garbage.
People often argue for some spare sector management, which would solve all problems.
I think this is an illusion. Spare sectors can only be useful if you fail WRITING data,
not when reading data failed or data loss occurred. This is realized already within
the single disks in a sufficient way (I think). If your disk gives write errors, you
either have a very old one, without internal spare sector management, or your disk
run out of spare sectors already. Read errors are quite more frequent than write
errors and thus a much more important issue.
Dieter Stüken.
--
Dieter Stüken, con terra GmbH, Münster
stueken@conterra.de
http://www.conterra.de/
(0)251-7474-501
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Badstripe proposal (was Re: Bad blocks are killing us!)
2004-11-19 17:12 ` Jure Pe_ar
@ 2004-11-20 13:15 ` Maurilio Longo
2004-11-21 18:23 ` Jure Pe_ar
0 siblings, 1 reply; 18+ messages in thread
From: Maurilio Longo @ 2004-11-20 13:15 UTC (permalink / raw)
To: Jure Pe ar; +Cc: linux-raid
Jure Pe,
can we use dm-bbr without installing evms? I'd like not to install it since it
seems to me a cumbersome tool while my systems with raid1 and grub are fairly
easy to create and maintain.
regards.
Jure Pe ar ha scritto:
> On Thu, 18 Nov 2004 10:59:58 +0100
> Maurilio Longo <maurilio.longo@libero.it> wrote:
>
> > David and others,
> >
> > I'd like to add that evms ( http://evms.sourceforge.net/ ) already has a
> > bad-block management layer, maybe it could be merged inside md to have
> > write errors management as well, maybe :)
>
> BBR as it is called is implemented with a device mapper. AFAIK one can take
> dm managed block devices and build md with them.
>
> --
>
> Jure Peèar
> http://jure.pecar.org/
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
__________
| | | |__| md2520@mclink.it
|_|_|_|____| Team OS/2 Italia
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Badstripe proposal (was Re: Bad blocks are killing us!)
2004-11-20 13:15 ` Maurilio Longo
@ 2004-11-21 18:23 ` Jure Pe_ar
0 siblings, 0 replies; 18+ messages in thread
From: Jure Pe_ar @ 2004-11-21 18:23 UTC (permalink / raw)
To: Maurilio Longo; +Cc: linux-raid
On Sat, 20 Nov 2004 14:15:30 +0100
Maurilio Longo <maurilio.longo@libero.it> wrote:
> Jure Pe,
>
> can we use dm-bbr without installing evms? I'd like not to install it
> since it seems to me a cumbersome tool while my systems with raid1 and
> grub are fairly easy to create and maintain.
>
> regards.
I think dm-bbr is just another dm target and so can be created and used with
dmsetup just as other dm stuff.
--
Jure Pečar
http://jure.pecar.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Bad blocks are killing us!
2004-11-18 1:46 ` Guy Watkins
2004-11-18 16:03 ` Bruce Lowekamp
2004-11-19 18:47 ` Dieter Stueken
@ 2004-11-22 8:22 ` Dieter Stueken
2004-11-22 9:17 ` Guy
2 siblings, 1 reply; 18+ messages in thread
From: Dieter Stueken @ 2004-11-22 8:22 UTC (permalink / raw)
To: linux-raid
Guy Watkins wrote:
> ... but the md-level
> approach might be better. But I'm not sure I see the point of
> it---unless you have raid 6 with multiple parity blocks, if a disk
> actually has the wrong information recorded on it I don't think you
> can detect which drive is bad, just that one of them is."
>
> If there is a parity block that does not match the data, true you do not
> know which device has the wrong data. However, if you do not "correct" the
> parity, when a device fails, it will be constructed differently than it was
> before it failed. This will just cause more corrupt data. The parity must
> be made consistent with whatever data is on the data blocks to prevent this
> corrosion of data. With RAID6 it should be possible to determine which
> block is wrong. It would be a pain in the @$$, but I think it would be
> doable. I will explain my theory if someone asks.
This is exactly the same conflict, a single drive has with a unreadable sector.
It notices the sector as bad, and it can not fulfill any read request, until
the data is not rewritten or erased. The single drive can not (and should never
try to!) silently replace the bad sector by some spare sectors, as it can not
recover the content.
Also the RAID system can not solve this problem automagically, and never should
do so, as the former content can not be deduced any more. But notice, that we
have two very different problems to examine: The above problem arises, if all
disks of the RAID system claim to read correct data, whereas the parity information
tells us, that one of them must be wrong. As long as we don't have RAID6,
to recover single bit errors, the data is LOST and can not be recovered.
This is very different to the situation, when one of the disks DOES reports
an internal crc-error. In this case your data CAN be recovered reliable from the
parity information, and in most cases successfully written back to the disk.
But there is also a difference between the problem for RAID compared to the internal
disk: Whereas the disk always reads all CRC data for the sector to verify its integrity,
the RAID system does not normally check the validity of the parity information
by default. (this is, why the idea of data scans actually came up). So, if a scan
discovers a bad parity information, the only action that can (and must!) be taken
is, to tag this piece of data to be invalid. And it is very important, not only
to log that information somewhere. It is even more important to prevent further readings
of this piece of lost data. Otherwise those definitely invalid data may be read
without any notice again, may even get written back again and thus turns into valid
data, even though it become garbage.
People oftenargue for some spare sector management, which would solve all problems.
I think this is an illusion. Spare sectors can only be useful if you fail WRITING data,
not when reading data failed or data loss occurred. This is realized already within
the single disks in a sufficient way (I think). If your disk gives write errors, you
either have a very old one, without internal spare sector management, or your disk
run out of spare sectors already. Read errors are quite more frequent than write
errors and thus a much more important issue.
Dieter Stüken.
--
Dieter Stüken, con terra GmbH, Münster
stueken@conterra.de
http://www.conterra.de/
(0)251-7474-501
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: Bad blocks are killing us!
2004-11-22 8:22 ` Dieter Stueken
@ 2004-11-22 9:17 ` Guy
0 siblings, 0 replies; 18+ messages in thread
From: Guy @ 2004-11-22 9:17 UTC (permalink / raw)
To: 'Dieter Stueken', linux-raid
Summary for anyone that missed this thread.
If a RAID5 array is scanned to verify that the parity data matches the data,
how should the system handle a mismatch? Assume no disks had read errors.
If the parity does not agree, then 1 or more disks are wrong. The parity
disk could be wrong, in which case no data is lost or corrupt, yet. But a
disk failure at this time would corrupt data unless it was the parity disk
for this stripe. If any other disk is wrong, then data is corrupt. If you
are lucky, the corrupt data will be un-used space. In this case no data is
really corrupt, yet.
I agree with your assessment, or you agree with mine! :)
I disagree on how it should be handled.
Now, what to do if a parity error occurs? As I see it, we have these
possible options:
A. Ignore it. This is what is done today. But log the blocks affected.
Risk of data corruption is high. And know that the risk of additional data
corruption is increased when a disk fails. Re-building to the spare has the
effect of correcting the parity so the error is now masked.
B. Just correct the parity, you stand a high risk of data corruption without
knowing about it. But without correcting the parity you still have the same
risk. Log the blocks affected. By correcting the parity, no additional
corruption will occur when a disk fails.
C. Mark all blocks (or chunks) affected by the parity error as unreadable.
This would cause data loss, but no corruption. The data loss would be the
size of the mismatch (in blocks or chunks) times the number of disks in the
array - 1. This acts more like a disk drive when a sector can't be read.
Log the blocks affected. In the case of a 14 disk RAID5 array, a single
sector parity error would cause 13 sectors to be lost, or much more if going
by chunks. Optionally, still allow option B at some later time at the
user's request. This would allow the user to determine what data is
affected, then attempt to recover some of the data.
D. Report the error, and allow manual parity correction. This is like
option A then at user request option B.
E. All of the above. Have the option to choose which of the above you want.
This will allow each user to choose how the system will handle parity
errors. This option should be configured per array, not system wide.
Guy
-----Original Message-----
From: linux-raid-owner@vger.kernel.org
[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Dieter Stueken
Sent: Monday, November 22, 2004 3:22 AM
To: linux-raid@vger.kernel.org
Subject: Re: Bad blocks are killing us!
Guy Watkins wrote:
> ... but the md-level
> approach might be better. But I'm not sure I see the point of
> it---unless you have raid 6 with multiple parity blocks, if a disk
> actually has the wrong information recorded on it I don't think you
> can detect which drive is bad, just that one of them is."
>
> If there is a parity block that does not match the data, true you do not
> know which device has the wrong data. However, if you do not "correct"
the
> parity, when a device fails, it will be constructed differently than it
was
> before it failed. This will just cause more corrupt data. The parity
must
> be made consistent with whatever data is on the data blocks to prevent
this
> corrosion of data. With RAID6 it should be possible to determine which
> block is wrong. It would be a pain in the @$$, but I think it would be
> doable. I will explain my theory if someone asks.
This is exactly the same conflict, a single drive has with a unreadable
sector.
It notices the sector as bad, and it can not fulfill any read request, until
the data is not rewritten or erased. The single drive can not (and should
never
try to!) silently replace the bad sector by some spare sectors, as it can
not
recover the content.
Also the RAID system can not solve this problem automagically, and never
should
do so, as the former content can not be deduced any more. But notice, that
we
have two very different problems to examine: The above problem arises, if
all
disks of the RAID system claim to read correct data, whereas the parity
information
tells us, that one of them must be wrong. As long as we don't have RAID6,
to recover single bit errors, the data is LOST and can not be recovered.
This is very different to the situation, when one of the disks DOES reports
an internal crc-error. In this case your data CAN be recovered reliable from
the
parity information, and in most cases successfully written back to the disk.
But there is also a difference between the problem for RAID compared to the
internal
disk: Whereas the disk always reads all CRC data for the sector to verify
its integrity,
the RAID system does not normally check the validity of the parity
information
by default. (this is, why the idea of data scans actually came up). So, if a
scan
discovers a bad parity information, the only action that can (and must!) be
taken
is, to tag this piece of data to be invalid. And it is very important, not
only
to log that information somewhere. It is even more important to prevent
further readings
of this piece of lost data. Otherwise those definitely invalid data may be
read
without any notice again, may even get written back again and thus turns
into valid
data, even though it become garbage.
People oftenargue for some spare sector management, which would solve all
problems.
I think this is an illusion. Spare sectors can only be useful if you fail
WRITING data,
not when reading data failed or data loss occurred. This is realized already
within
the single disks in a sufficient way (I think). If your disk gives write
errors, you
either have a very old one, without internal spare sector management, or
your disk
run out of spare sectors already. Read errors are quite more frequent than
write
errors and thus a much more important issue.
Dieter Stüken.
--
Dieter Stüken, con terra GmbH, Münster
stueken@conterra.de
http://www.conterra.de/
(0)251-7474-501
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2004-11-22 9:17 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <200411150522.iAF5MNN18341@www.watkins-home.com>
2004-11-15 22:27 ` Bad blocks are killing us! Neil Brown
2004-11-16 16:28 ` Maurilio Longo
2004-11-16 18:18 ` Guy
2004-11-16 23:04 ` Neil Brown
2004-11-16 23:07 ` Guy
2004-11-17 13:21 ` Badstripe proposal (was Re: Bad blocks are killing us!) David Greaves
2004-11-18 9:59 ` Maurilio Longo
2004-11-18 10:29 ` Robin Bowes
2004-11-19 17:12 ` Jure Pe_ar
2004-11-20 13:15 ` Maurilio Longo
2004-11-21 18:23 ` Jure Pe_ar
2004-11-16 23:29 ` Bad blocks are killing us! dean gaudet
2004-11-17 21:58 ` Bruce Lowekamp
2004-11-18 1:46 ` Guy Watkins
2004-11-18 16:03 ` Bruce Lowekamp
2004-11-19 18:47 ` Dieter Stueken
2004-11-22 8:22 ` Dieter Stueken
2004-11-22 9:17 ` Guy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).