linux-raid.vger.kernel.org archive mirror
* Feature Request/Suggestion - "Drive Linking"
@ 2006-08-29 16:21 Neil Bortnak
  2006-08-29 17:43 ` dean gaudet
  2006-09-03 14:59 ` Tuomas Leikola
  0 siblings, 2 replies; 6+ messages in thread
From: Neil Bortnak @ 2006-08-29 16:21 UTC (permalink / raw)
  To: linux-raid

Hi Everybody,

I had this major recovery last week after a hardware failure monkeyed
things up pretty badly. About halfway through I had a couple of ideas,
and I thought I'd suggest/ask them.

1) "Drive Linking": So let's say I have a 6 disk RAID5 array and I have
reason to believe one of the drives will fail (funny noises, SMART
warnings or it's *really* slow compared to the other drives, etc). It
would be nice to put in a new drive, link it to the failing disk so that
it copies all of the data to the new one and mirrors new writes as they
happen.

This way I could get the replacement in and do the resync without
actually having to degrade the array first. When it's done, pulling out
the failing disk automatically breaks the link and everything goes back
to normal. Or, if you break the link in software, it removes the old
disk from the array and wipes out the superblock automatically.

Maybe there is a way to do this already and I just missed it, but I
don't think so. I'm not really keen on degrading the array, in case the
system finds an unrecoverable error on one of the other disks during
the resync and the whole thing comes crashing down in a dual-disk
failure. In fact, I'm not keen on degrading the array, period.


2) This sort of brings up a subject I'm getting increasingly paranoid
about. It seems to me that if disk 1 develops an unrecoverable error at
block 500 and disk 4 develops one at 55,000, I'm going to get a double
disk failure as soon as one of the bad blocks is read (or some other
system problem ->makes it look like<- some random block is
unrecoverable). Such an error should not bring the whole thing to a
crashing halt. I know I can recover from that sort of error manually,
but yuk.

It seems to me that as arrays get larger and larger, failure-handling
mechanisms better than "wipe out 750G of mirror and put the array in
jeopardy because a single block is unrecoverable" need to be developed.
Can bad block redirection help us add a layer of defense, at least in
the short term? Granted, if a disk block is unrecoverable because all
the drive's spare sectors are used up, the chances are the drive will
die off soon anyway, but I'd rather get one last kick at doing a clean
rebuild (maybe a la the disk linking idea above) before ejecting the
drive. The current methods employed by RAID 1-6 seem a bit crude. Fine
for 20 years ago, but showing their age with today's increasingly
massive data sets.

I'm quite thankful for all the MD work and this isn't a criticism. I'm
merely interested in the problem and curious about other people's
thoughts on the matter. Maybe we can move from something that paints in
broad strokes like RAID 1-6 and look towards an all-new RAID-OMG. I'm
basically thinking it's prudent to apply security's idea of "defense in
depth" to drive safety.


3) So this last rebuild I had to do was for a system with a double disk
failure and no backup (no, not my system; I would have had a backup,
since we all know RAID doesn't protect against a lot of threats). I
managed to get it done, but I ended up writing a lot of offline,
userspace verification and resync tools in Perl and C and editing the
superblocks with hexedit.

An extra tool to edit superblock fields would be very keen.

If no one is horrified by the fact that I wrote the other recovery
tools in Perl, I would be happy to clean them up and submit them. I
wrote one to verify a given disk's data against the other disks and
report errors (optionally fixing them). It also has a range feature so
you don't have to do the whole disk. The other is similar, but I built
it for high-speed bulk resyncing from userspace (no need to have RAID
in the kernel).


4) And finally (for today at least), can mdadm do the equivalent of
NetApp's or 3Ware's disk scrubbing? I know I can check an array manually
with a /sys entry, but it would be cool to have mdadm optionally run
these checks for all the arrays on the system and keep rerunning them
as they finish. Just part of its monitoring duties, really. Someone like
me only cares about data integrity and uptime, not speed. I heard
something like that was going in, but I don't know its status.
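
By the /sys entry I mean something along these lines -- a rough,
cron-able sketch, with the device names just as an example:

  # start a background consistency check of every md array on the box;
  # 'check' reads everything and rewrites any sector it can't read from
  # the reconstructed data, but doesn't touch blocks that merely mismatch
  for f in /sys/block/md*/md/sync_action; do
          echo check > "$f"
  done

Having mdadm schedule and repeat that by itself is the part I'm after.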

Thanks!

Neil



* Re: Feature Request/Suggestion - "Drive Linking"
  2006-08-29 16:21 Feature Request/Suggestion - "Drive Linking" Neil Bortnak
@ 2006-08-29 17:43 ` dean gaudet
  2006-09-03 14:59 ` Tuomas Leikola
  1 sibling, 0 replies; 6+ messages in thread
From: dean gaudet @ 2006-08-29 17:43 UTC (permalink / raw)
  To: Neil Bortnak; +Cc: linux-raid

On Wed, 30 Aug 2006, Neil Bortnak wrote:

> Hi Everybody,
> 
> I had this major recovery last week after a hardware failure monkeyed
> things up pretty badly. About halfway through I had a couple of ideas,
> and I thought I'd suggest/ask them.
> 
> 1) "Drive Linking": So let's say I have a 6 disk RAID5 array and I have
> reason to believe one of the drives will fail (funny noises, SMART
> warnings or it's *really* slow compared to the other drives, etc). It
> would be nice to put in a new drive, link it to the failing disk so that
> it copies all of the data to the new one and mirrors new writes as they
> happen.

http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

works for any raid level actually.
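
roughly, if i remember the writeup right (device names below are only an
example -- md0 is the raid5, sdc1 the suspect member, sde1 its
replacement, md1 a temporary mirror -- see the link for the exact steps):

  # add a write-intent bitmap so the suspect disk can go back in with a
  # short resync instead of a full one
  mdadm --grow /dev/md0 --bitmap=internal

  # pull the suspect disk out of the raid5...
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1

  # ...wrap it in a superblockless, degraded raid1 and put that back in
  # its old slot (plain --add also works; the bitmap keeps the resync short)
  mdadm --build /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 missing
  mdadm /dev/md0 --re-add /dev/md1

  # attach the replacement and let the raid1 copy the suspect disk over
  mdadm /dev/md1 --add /dev/sde1

  # once the raid1 has finished syncing, drop the old disk -- sde1 keeps
  # serving in its place
  mdadm /dev/md1 --fail /dev/sdc1 --remove /dev/sdc1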


> 2) This sort of brings up a subject I'm getting increasingly paranoid
> about. It seems to me that if disk 1 develops an unrecoverable error at
> block 500 and disk 4 develops one at 55,000, I'm going to get a double
> disk failure as soon as one of the bad blocks is read (or some other
> system problem ->makes it look like<- some random block is
> unrecoverable). Such an error should not bring the whole thing to a
> crashing halt. I know I can recover from that sort of error manually,
> but yuk.

Neil Brown made some improvements in this area as of 2.6.15... when md gets a 
read error it won't knock the entire drive out immediately -- it first 
attempts to reconstruct the sectors from the other drives and write them 
back.  this covers a lot of the failure cases because the drive will 
either successfully complete the write in-place, or use its reallocation 
pool.  the kernel logs when it makes such a correction (but the log wasn't 
very informative until 2.6.18ish i think).

if you watch SMART data (either through smartd logging changes for you, or 
if you diff the output regularly) you can see this activity happen as 
well.

you can also use the check/repair sync_actions to force this to happen 
when you know a disk has a Current_Pending_Sector (i.e. pending read 
error).
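
for reference, the knobs look like this (md0 just as an example; run one
or the other, not both at once):

  # 'check' reads every block and rewrites anything unreadable from the
  # reconstructed data, so it will clear a Current_Pending_Sector
  echo check > /sys/block/md0/md/sync_action
  cat /proc/mdstat                  # progress shows up here

  # 'repair' does the same but also rewrites parity/copies that don't match
  echo repair > /sys/block/md0/md/sync_action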

-dean


* Re: Feature Request/Suggestion - "Drive Linking"
  2006-08-29 16:21 Feature Request/Suggestion - "Drive Linking" Neil Bortnak
  2006-08-29 17:43 ` dean gaudet
@ 2006-09-03 14:59 ` Tuomas Leikola
  2006-09-03 18:35   ` Michael Tokarev
  1 sibling, 1 reply; 6+ messages in thread
From: Tuomas Leikola @ 2006-09-03 14:59 UTC (permalink / raw)
  To: Neil Bortnak; +Cc: linux-raid

> This way I could get the replacement in and do the resync without
> actually having to degrade the array first.

<snip>

> 2) This sort of brings up a subject I'm getting increasingly paranoid
> about. It seems to me that if disk 1 develops an unrecoverable error at
> block 500 and disk 4 develops one at 55,000, I'm going to get a double
> disk failure as soon as one of the bad blocks is read

Here's an alternate description. On the first 'unrecoverable' error, the
disk is marked as FAILING, which means that a spare is immediately
taken into use to replace the failing one. The disk is not kicked, and
its readable blocks can still be used to rebuild blocks on other
FAILING disks.

The rebuild can be more like a ddrescue-type operation, which is
probably a lot faster in the raid6 case (most blocks are copied straight
from the FAILING disk instead of being reconstructed from parity), and
the disk can be automatically kicked once the sync is done. If normal
read traffic is kept off the FAILING disk, the rebuild is also faster
simply because seeks are avoided on a busy system.
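
(By a ddrescue-type operation I mean roughly what one would do by hand
today with the array stopped -- device names purely for illustration:

  # copy everything readable from the failing member onto its replacement,
  # skipping and logging bad sectors instead of giving up on them
  ddrescue -f /dev/sdc1 /dev/sde1 /root/sdc1-rescue.log

except done by md in place, with only the skipped sectors reconstructed
from the other disks.)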

Personally, I feel this is a good idea; count my vote in.

- Tuomas



* Re: Feature Request/Suggestion - "Drive Linking"
  2006-09-03 14:59 ` Tuomas Leikola
@ 2006-09-03 18:35   ` Michael Tokarev
  2006-09-04 16:55     ` Bill Davidsen
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Tokarev @ 2006-09-03 18:35 UTC (permalink / raw)
  To: Tuomas Leikola; +Cc: Neil Bortnak, linux-raid

Tuomas Leikola wrote:
[]
> Here's an alternate description. On the first 'unrecoverable' error, the
> disk is marked as FAILING, which means that a spare is immediately
> taken into use to replace the failing one. The disk is not kicked, and
> its readable blocks can still be used to rebuild blocks on other
> FAILING disks.
> 
> The rebuild can be more like a ddrescue-type operation, which is
> probably a lot faster in the raid6 case (most blocks are copied straight
> from the FAILING disk instead of being reconstructed from parity), and
> the disk can be automatically kicked once the sync is done. If normal
> read traffic is kept off the FAILING disk, the rebuild is also faster
> simply because seeks are avoided on a busy system.

It's not that simple.  The issue is with writes.  If there's a "failing"
disk, the md code will need to keep track of the "up-to-date", or "good",
sectors on it vs. the "obsolete" ones.  I.e., when a write fails, the data
in that block is either unreadable (but may become readable on the next
try, say after a temperature change or whatnot), or readable but
containing old data, or readable but containing some random garbage.  So
at least those blocks should not be copied to the spare during the
resync, and should not be read at all, to avoid returning wrong data to
userspace.  In short, if the array isn't stopped (or switched to
read-only), we have to watch the writes and remember which ones failed.
Which is a non-trivial change.  Yes, bitmaps help somewhat here.

/mjt



* Re: Feature Request/Suggestion - "Drive Linking"
  2006-09-03 18:35   ` Michael Tokarev
@ 2006-09-04 16:55     ` Bill Davidsen
  2006-09-05  6:33       ` dean gaudet
  0 siblings, 1 reply; 6+ messages in thread
From: Bill Davidsen @ 2006-09-04 16:55 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Tuomas Leikola, Neil Bortnak, linux-raid

Michael Tokarev wrote:

>Tuomas Leikola wrote:
>[]
>  
>
>>Here's an alternate description. On the first 'unrecoverable' error, the
>>disk is marked as FAILING, which means that a spare is immediately
>>taken into use to replace the failing one. The disk is not kicked, and
>>its readable blocks can still be used to rebuild blocks on other
>>FAILING disks.
>>
>>The rebuild can be more like a ddrescue-type operation, which is
>>probably a lot faster in the raid6 case (most blocks are copied straight
>>from the FAILING disk instead of being reconstructed from parity), and
>>the disk can be automatically kicked once the sync is done. If normal
>>read traffic is kept off the FAILING disk, the rebuild is also faster
>>simply because seeks are avoided on a busy system.
>>    
>>
>
>It's not that simple.  The issue is with writes.  If there's a "failing"
>disk, the md code will need to keep track of the "up-to-date", or "good",
>sectors on it vs. the "obsolete" ones.  I.e., when a write fails, the data
>in that block is either unreadable (but may become readable on the next
>try, say after a temperature change or whatnot), or readable but
>containing old data, or readable but containing some random garbage.  So
>at least those blocks should not be copied to the spare during the
>resync, and should not be read at all, to avoid returning wrong data to
>userspace.  In short, if the array isn't stopped (or switched to
>read-only), we have to watch the writes and remember which ones failed.
>Which is a non-trivial change.  Yes, bitmaps help somewhat here.
>  
>
It would seem that much of the code needed is already there. When doing
the recovery, the spare can be treated as a RAID1 copy of the failing
drive, with all sectors out of date. Then the sectors from the failing
drive can be copied, using reconstruction where needed, until there is a
valid copy on the new drive.

There are several decision points during this process:
- do writes still go to the failing drive, or just to the spare?
- do you mark the failing drive as "failed" once the good copy is created?

But I think most of the logic exists; the hardest part would be deciding
what to do. The existing code looks as if it could be hooked to do this
far more easily than writing new code. In fact, several suggested recovery
schemes involve stopping the RAID5, replacing the failing drive with a
newly created RAID1, etc. So the method is valid; it would just be nice to
have it happen without human intervention.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: Feature Request/Suggestion - "Drive Linking"
  2006-09-04 16:55     ` Bill Davidsen
@ 2006-09-05  6:33       ` dean gaudet
  0 siblings, 0 replies; 6+ messages in thread
From: dean gaudet @ 2006-09-05  6:33 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Michael Tokarev, Tuomas Leikola, Neil Bortnak, linux-raid

On Mon, 4 Sep 2006, Bill Davidsen wrote:

> But I think most of the logic exists; the hardest part would be deciding what
> to do. The existing code looks as if it could be hooked to do this far more
> easily than writing new code. In fact, several suggested recovery schemes
> involve stopping the RAID5, replacing the failing drive with a newly created
> RAID1, etc. So the method is valid; it would just be nice to have it happen
> without human intervention.

you don't actually have to stop the raid5 if you're using bitmaps... you 
can just remove the disk, create a (superblockless) raid1 and put the 
raid1 back in place.

the whole process could be handled a lot like mdadm handles spare groups 
already... there isn't a lot more kernel support required.
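
(spare groups being the existing mdadm.conf / --monitor feature, roughly:

  # /etc/mdadm.conf -- mdadm --monitor will move a spare between arrays
  # that share a spare-group when one of them loses a disk
  ARRAY /dev/md0 UUID=...  spare-group=main
  ARRAY /dev/md1 UUID=...  spare-group=main

so the "copy the suspect disk onto a spare" logic could plug in at the
same level.)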

the largest problem is if a power failure occurs before the process 
finishes.  i'm 95% certain that even during a reconstruction, raid1 writes 
go to all copies even if the write is beyond the current sync position[1] 
-- so the raid5 superblock would definitely have been written to the 
partial disk... so that means on a reboot there'll be two disks which look 
like they're both the same (valid) component of the raid5, and one of them 
definitely isn't.

maybe there's some trick to handle this situation -- aside from ensuring 
the array won't come up automatically on reboot until after the process 
has finished.

one way to handle it would be to have an option for raid1 resync which 
suppresses writes which are beyond the resync position... then you could 
zero the new disk superblock to start with, and then start up the resync 
-- then it won't have a valid superblock until the entire disk is copied.
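
(the zeroing half of that exists already, of course -- something like
this on the incoming disk, device name just an example:

  # wipe any stale md superblock so the new disk can't be mistaken for a
  # valid raid5 member if we crash mid-copy
  mdadm --zero-superblock /dev/sde1

the resync option that holds writes back until the sync position passes
them is the part that would be new.)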

-dean

[1] there's normally a really good reason for raid1 to mirror all writes 
even if they're beyond the resync point... consider the case where you 
have a system crash and have 2 essentially identical mirrors which then 
need a resync... and the source disk dies during the resync.

if all writes have been mirrored then the other disk is already usable 
(in fact it's essentially arbitrary which of the mirrors is used as the 
resync source after the crash -- they're all equally (un)likely to have 
the most current data)... without bitmaps this sort of thing is a common 
scenario, and it has certainly saved my data more than once.


end of thread

Thread overview: 6+ messages
2006-08-29 16:21 Feature Request/Suggestion - "Drive Linking" Neil Bortnak
2006-08-29 17:43 ` dean gaudet
2006-09-03 14:59 ` Tuomas Leikola
2006-09-03 18:35   ` Michael Tokarev
2006-09-04 16:55     ` Bill Davidsen
2006-09-05  6:33       ` dean gaudet
