linux-raid.vger.kernel.org archive mirror
* proactive-raid-disk-replacement
@ 2006-09-08  8:48 Michael Tokarev
  2006-09-08  9:24 ` proactive-raid-disk-replacement dean gaudet
  2006-09-10  0:02 ` proactive-raid-disk-replacement Bodo Thiesen
  0 siblings, 2 replies; 9+ messages in thread
From: Michael Tokarev @ 2006-09-08  8:48 UTC (permalink / raw)
  To: Linux RAID

Recently Dean Gaudet, in a thread titled 'Feature
Request/Suggestion - "Drive Linking"', mentioned his
document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

I've read it, and have some umm.. concerns.  Here's why:

....
> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> mdadm /dev/md4 -r /dev/sdh1
> mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
> mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
> mdadm /dev/md4 --re-add /dev/md5
> mdadm /dev/md5 -a /dev/sdh1
>
> ... wait a few hours for md5 resync...

And here's the problem.  While the new disk, sdh1, is resynced from
the old, probably failing disk sde1, chances are high that there will
be an unreadable block on sde1.  And that means the whole thing
will not work -- md5 initially contains one working drive (sde1)
and one spare (sdh1) which is being converted (resynced) into a
working disk.  But after a read error on sde1, md5 will contain one
failed drive and one spare -- for raid1 that's a fatal combination.

Yet at the same time, it would be perfectly easy to reconstruct that
failing block from the other component devices of md4.

That is to say: this way of replacing a disk in a software raid array
isn't much better than just removing the old drive and adding the new
one.  And if the drive you're replacing is failing (according to SMART,
for example), this method is more likely to fail.

/mjt


* Re: proactive-raid-disk-replacement
  2006-09-08  8:48 proactive-raid-disk-replacement Michael Tokarev
@ 2006-09-08  9:24 ` dean gaudet
  2006-09-08 10:47   ` proactive-raid-disk-replacement Michael Tokarev
  2006-09-10  0:02 ` proactive-raid-disk-replacement Bodo Thiesen
  1 sibling, 1 reply; 9+ messages in thread
From: dean gaudet @ 2006-09-08  9:24 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Linux RAID

On Fri, 8 Sep 2006, Michael Tokarev wrote:

> Recently Dean Gaudet, in a thread titled 'Feature
> Request/Suggestion - "Drive Linking"', mentioned his
> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
> 
> I've read it, and have some umm.. concerns.  Here's why:
> 
> ....
> > mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> > mdadm /dev/md4 -r /dev/sdh1
> > mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
> > mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
> > mdadm /dev/md4 --re-add /dev/md5
> > mdadm /dev/md5 -a /dev/sdh1
> >
> > ... wait a few hours for md5 resync...
> 
> And here's the problem.  While the new disk, sdh1, is resynced from
> the old, probably failing disk sde1, chances are high that there will
> be an unreadable block on sde1.  And that means the whole thing
> will not work -- md5 initially contains one working drive (sde1)
> and one spare (sdh1) which is being converted (resynced) into a
> working disk.  But after a read error on sde1, md5 will contain one
> failed drive and one spare -- for raid1 that's a fatal combination.
> 
> Yet at the same time, it would be perfectly easy to reconstruct that
> failing block from the other component devices of md4.

this statement is an argument for native support for this type of activity 
in md itself.

> That is to say: this way of replacing a disk in a software raid array
> isn't much better than just removing the old drive and adding the new one.

hmm... i'm not sure i agree.  in your proposal you're guaranteed to have 
no redundancy while you wait for the new disk to sync in the raid5.

in my proposal the probability that you'll retain redundancy through the 
entire process is non-zero.  we can debate how non-zero it is, but 
non-zero is greater than zero.

i'll admit it depends a heck of a lot on how long you wait to replace your 
disks, but i prefer to replace mine well before they get to the point 
where just reading the entire disk is guaranteed to result in problems.


> And if the drive you're replacing is failing (according to SMART,
> for example), this method is more likely to fail.

my practice is to run regular SMART long self tests, which tend to find 
Current_Pending_Sectors (which are generally read errors waiting to 
happen) and then launch a "repair" sync action... that generally drops the 
Current_Pending_Sector back to zero.  either through a realloc or just 
simply rewriting the block.  if it's a realloc then i consider if there's 
enough of them to warrant replacing the disk...
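
for concreteness, roughly what that setup looks like here -- device names
and the schedule are only examples, adjust to taste:

  /dev/sde -a -o on -S on -s L/../../6/03      # in /etc/smartd.conf:
                                               # long self-test, sat 03:00
  echo repair > /sys/block/md4/md/sync_action  # scrub after a pending-sector
                                               # warning shows up in the mail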

so for me the chances of a read error while doing the raid1 thing aren't 
as high as they could be...

but yeah you've convinced me this solution isn't good enough.

-dean


* Re: proactive-raid-disk-replacement
  2006-09-08  9:24 ` proactive-raid-disk-replacement dean gaudet
@ 2006-09-08 10:47   ` Michael Tokarev
  2006-09-08 18:44     ` proactive-raid-disk-replacement dean gaudet
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Tokarev @ 2006-09-08 10:47 UTC (permalink / raw)
  To: dean gaudet; +Cc: Linux RAID

dean gaudet wrote:
> On Fri, 8 Sep 2006, Michael Tokarev wrote:
> 
>> Recently Dean Gaudet, in a thread titled 'Feature
>> Request/Suggestion - "Drive Linking"', mentioned his
>> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
>>
>> I've read it, and have some umm.. concerns.  Here's why:
>>
>> ....
>>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4

By the way, don't specify bitmap-chunk for an internal bitmap.
It's only needed for a file-based (external) bitmap.  With an internal
bitmap there is a fixed amount of space in the superblock for it, so
the bitmap chunk size is derived from the array size and that space.
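
I.e., for an internal bitmap it's enough to say (same array as in your
example) and let mdadm/md pick the chunk size itself:

  mdadm --grow --bitmap=internal /dev/md4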

>>> mdadm /dev/md4 -r /dev/sdh1
>>> mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
>>> mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
>>> mdadm /dev/md4 --re-add /dev/md5
>>> mdadm /dev/md5 -a /dev/sdh1
>>>
>>> ... wait a few hours for md5 resync...
>> And here's the problem.  While the new disk, sdh1, is resynced from
>> the old, probably failing disk sde1, chances are high that there will
>> be an unreadable block on sde1.  And that means the whole thing
>> will not work -- md5 initially contains one working drive (sde1)
>> and one spare (sdh1) which is being converted (resynced) into a
>> working disk.  But after a read error on sde1, md5 will contain one
>> failed drive and one spare -- for raid1 that's a fatal combination.
>>
>> Yet at the same time, it would be perfectly easy to reconstruct that
>> failing block from the other component devices of md4.
> 
> this statement is an argument for native support for this type of activity 
> in md itself.

Yes, definitely.

>> That is to say: this way of replacing a disk in a software raid array
>> isn't much better than just removing the old drive and adding the new one.
> 
> hmm... i'm not sure i agree.  in your proposal you're guaranteed to have 
> no redundancy while you wait for the new disk to sync in the raid5.

It's not a proposal per se, it's just another possible way (used by the
majority of users, I think, because it's way simpler ;)

> in my proposal the probability that you'll retain redundancy through the 
> entire process is non-zero.  we can debate how non-zero it is, but 
> non-zero is greater than zero.

Yes, there will be no redundancy in "my" variant, guaranteed.  And yes,
there is some chance of completing the whole of "your" process without
a glitch.

> i'll admit it depends a heck of a lot on how long you wait to replace your 
> disks, but i prefer to replace mine well before they get to the point 
> where just reading the entire disk is guaranteed to result in problems.
> 
>> And if the drive you're replacing is failing (according to SMART,
>> for example), this method is more likely to fail.
> 
> my practice is to run regular SMART long self tests, which tend to find 
> Current_Pending_Sectors (which are generally read errors waiting to 
> happen) and then launch a "repair" sync action... that generally drops the 
> Current_Pending_Sector back to zero.  either through a realloc or just 
> simply rewriting the block.  if it's a realloc then i consider if there's 
> enough of them to warrant replacing the disk...
> 
> so for me the chances of a read error while doing the raid1 thing aren't 
> as high as they could be...

So the whole thing goes this way:
  0) do a SMART selftest ;)
  1) do a repair pass over the whole array
  2) copy the data from the failing drive to the new one
    (using a temporary superblock-less array)
  2a) if step 2 still fails, probably due to new bad sectors,
      go the "old way", removing the failing drive and adding
      the new one.

That's 2x or 3x (or 4x counting the selftest, but that should be
done regardless) more work than just going the "old way" from the
beginning, but there's still some chance of it completing flawlessly
in 2 steps, without losing redundancy.
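
Spelled out with the device names from your document (just a sketch; the
repair trigger assumes a kernel recent enough to have the md sync_action
interface):

  smartctl -t long /dev/sde                    # 0)
  echo repair > /sys/block/md4/md/sync_action  # 1)
  mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1     # 2) then as in your doc:
  mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
  mdadm /dev/md4 --re-add /dev/md5
  mdadm /dev/md5 -a /dev/sdh1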

Too complicated and too long for most people I'd say ;)

I can think of yet another way, which is only partly possible with the
current md code, in 3 variants.

1)  Offline the array, stop it.
    Make a copy of the drive using dd with conv=noerror (or similar),
     noting the bad blocks.
    Mark those bad blocks in the bitmap as dirty.
    Assemble the array with the new drive, letting it resync to the new
    drive the blocks we were unable to copy previously.

This variant does not lose redundancy at all, but requires the array to
be off-line during the whole copy procedure.  What's missing (which has
been discussed on linux-raid@ recently too) is the ability to mark those
"bad" blocks in the bitmap.

2)  The same, but without offlining the array.  Hot-remove the drive, copy
   it to the new drive, flip the necessary bitmap bits, re-add the new
   drive, and let the raid code resync the missing blocks and whatever
   changed during the copy (the array was still active, so something
   might have changed).

This variant still loses redundancy, but not much of it, provided the bitmap
code works correctly.
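
A rough sketch of this variant, assuming same-size partitions (so the
copied superblock on the new drive ends up where md expects it) and
assuming the bitmap-bit-flipping step existed, which it doesn't today:

  mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
  dd if=/dev/sde1 of=/dev/sdh1 bs=64k conv=noerror,sync
  ( ...here we'd flip the bitmap bits for the blocks dd couldn't read... )
  mdadm /dev/md4 --re-add /dev/sdh1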

3)  The same as your way, with the difference that we tell md to *skip* and
  ignore possible errors during resync (which is also not possible currently).

> but yeah you've convinced me this solution isn't good enough.

But all this, all 5 (so far ;) ways, aren't nice ;)

/mjt


* Re: proactive-raid-disk-replacement
  2006-09-08 10:47   ` proactive-raid-disk-replacement Michael Tokarev
@ 2006-09-08 18:44     ` dean gaudet
  0 siblings, 0 replies; 9+ messages in thread
From: dean gaudet @ 2006-09-08 18:44 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Linux RAID

On Fri, 8 Sep 2006, Michael Tokarev wrote:

> dean gaudet wrote:
> > On Fri, 8 Sep 2006, Michael Tokarev wrote:
> > 
> >> Recently Dean Gaudet, in a thread titled 'Feature
> >> Request/Suggestion - "Drive Linking"', mentioned his
> >> document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt
> >>
> >> I've read it, and have some umm.. concerns.  Here's why:
> >>
> >> ....
> >>> mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> 
> By the way, don't specify bitmap-chunk for an internal bitmap.
> It's only needed for a file-based (external) bitmap.  With an internal
> bitmap there is a fixed amount of space in the superblock for it, so
> the bitmap chunk size is derived from the array size and that space.

yeah, sorry, that was with an older version of mdadm which didn't calculate
the chunksize correctly for an internal bitmap on a large enough array... i
should have mentioned that in the post.  it's fixed in newer mdadm.


> > my practice is to run regular SMART long self tests, which tend to find 
> > Current_Pending_Sectors (which are generally read errors waiting to 
> > happen) and then launch a "repair" sync action... that generally drops the 
> > Current_Pending_Sector back to zero.  either through a realloc or just 
> > simply rewriting the block.  if it's a realloc then i consider if there's 
> > enough of them to warrant replacing the disk...
> > 
> > so for me the chances of a read error while doing the raid1 thing aren't 
> > as high as they could be...
> 
> So the whole thing goes this way:
>   0) do a SMART selftest ;)
>   1) do a repair pass over the whole array
>   2) copy the data from the failing drive to the new one
>     (using a temporary superblock-less array)
>   2a) if step 2 still fails, probably due to new bad sectors,
>       go the "old way", removing the failing drive and adding
>       the new one.
> 
> That's 2x or 3x (or 4x counting the selftest, but that should be
> done regardless) more work than just going the "old way" from the
> beginning, but there's still some chance of it completing flawlessly
> in 2 steps, without losing redundancy.

well it's more "work" but i don't actually manually launch the SMART 
tests, smartd does that.  i just notice when i get mail indicating 
Current_Pending_Sectors has gone up.

but i'm starting to lean towards SMART short tests (in case they test 
something i can't test with a full surface read) and regular crontabbed 
rate-limited repair or check actions.
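
something along these lines is what i have in mind -- the numbers and
schedule are arbitrary:

  # root crontab: weekly check, capped so it doesn't kill live i/o
  0 4 * * 0  echo 10000 > /sys/block/md4/md/sync_speed_max; echo check > /sys/block/md4/md/sync_action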


> 2)  The same, but without offlining the array.  Hot-remove the drive, copy
>    it to the new drive, flip the necessary bitmap bits, re-add the new
>    drive, and let the raid code resync the missing blocks and whatever
>    changed during the copy (the array was still active, so something
>    might have changed).
> 
> This variant still loses redundancy, but not much of it, provided the bitmap
> code works correctly.


i like this method.  it yields the minimal disk copy time because there's
no competition with the live traffic... and you can recover if another
disk has errors while you're doing the copy.


> 3)  The same as your way, with the difference that we tell md to *skip* and
>   ignore possible errors during resync (which is also not possible currently).

maybe we could hand it a bitmap to record the errors in... so we could
merge it with the raid5 bitmap later.

still not really the best solution though, is it?

we really want a solution similar to raid10...

-dean


* Re: proactive-raid-disk-replacement
  2006-09-08  8:48 proactive-raid-disk-replacement Michael Tokarev
  2006-09-08  9:24 ` proactive-raid-disk-replacement dean gaudet
@ 2006-09-10  0:02 ` Bodo Thiesen
  2006-09-10 10:53   ` proactive-raid-disk-replacement Tuomas Leikola
  1 sibling, 1 reply; 9+ messages in thread
From: Bodo Thiesen @ 2006-09-10  0:02 UTC (permalink / raw)
  To: Linux RAID

Michael Tokarev <mjt@tls.msk.ru> wrote:

> > mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
> > mdadm /dev/md4 -r /dev/sdh1
> > mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
> > mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
> > mdadm /dev/md4 --re-add /dev/md5
> > mdadm /dev/md5 -a /dev/sdh1
> >
> > ... wait a few hours for md5 resync...
> 
> And here's the problem.  While the new disk, sdh1, is resynced from
> the old, probably failing disk sde1, chances are high that there will
> be an unreadable block on sde1.

So we need a way to feed the redundancy from the raid5 back to the raid1.

Here is a short 5-minute brainstorm I did to check whether it's possible
to manage this, and I think it is:

Requirements:
	Any RAID with parity of any kind needs to provide so-called "virtual
	block devices", which carry the same data as the underlying block
	devices the array is composed of.  If the underlying block device
	can't read a block, that block will be calculated from the other
	raid disks and hence is still readable through the virtual block
	device.

	e.g. having the disks sda1 .. sde1 in a raid5 means the raid
	provides not one new block device (e.g. /dev/md4 as in the example
	above) but six (the one just mentioned, plus maybe we call them
	/dev/vsda1 .. /dev/vsde1 or /dev/mapper/vsda1 .. /dev/mapper/vsde1
	or even /dev/mapper/virtual/sda1 .. /dev/mapper/virtual/sde1).  For
	ease, I'll call them just vsdx1 here.

	Reading any block from vsda1 will yield the same data as reading
	from sda1 at any time (except when reading from sda1 fails; then
	vsda1 will still carry that data).

Now, construct the following nested raid structure:

	sda1 + vsda1 + missing = /dev/md10 RAID1 w/o super block
	sdb1 + vsdb1 + missing = /dev/md11 RAID1 w/o super block
	sdc1 + vsdc1 + missing = /dev/md12 RAID1 w/o super block
	sdd1 + vsdd1 + missing = /dev/md13 RAID1 w/o super block
	sde1 + vsde1 + missing = /dev/md14 RAID1 w/o super block

	md10 + md11 + md12 + md13 + md14 = /dev/md4 RAID5 optionally with sb

Problem:

	As long as md4 is not active, vsdx1 is not available.  So the arrays
	md1x need to be created with 1 disk out of 3.  After md4 has been
	assembled, vsdx1 needs to be added.  Now we get another problem:
	there must be no sync between sdx1 and vsdx1 (they are more or less
	the same device).  So there should be an option to mdadm like
	--assume-sync for hot-add.
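
Just to make it concrete, in mdadm terms the per-disk arrays and the
later hot-add might look roughly like this (entirely hypothetical: the
vsdx1 devices and an --assume-sync option don't exist today; the --build
pattern is the same one used for md5 above):

	mdadm --build /dev/md10 -ayes --level=1 --raid-devices=3 /dev/sda1 missing missing
	(... likewise md11 .. md14, then assemble md4 from md10 .. md14 ...)
	mdadm /dev/md10 --add /dev/vsda1	# would need --assume-sync here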

What we get:

	As soon as we decide to replace a disk (like sde1 above), we just
	hot-add sdh1 to the sde1 raid1 array.  That array will start
	resyncing.  If a block now can't be read from sde1, it's just taken
	from vsde1 (where that block will be reconstructed from the raid5).

	After syncing to sdh1 has completed, sde1 may be removed from the
	array.

We would lose redundancy at no time - the only lost redundancy is that of
the already failing sde1, which we can't work around anyway (except by
using raid6 etc.).

This is only a brainstorm, and I don't know what internal effects could
cause problems.  For example: the resync of the raid1 array reads a bad
block from sde1 and triggers a reconstruction via vsde1, while in parallel
the raid5 itself detects (e.g. as a result of a user-space read) that sde1
has failed and tries to write that block back to sde1, while in the raid1
the same rewrite is already pending ... problems upon problems, but the
devil is in the details, as ever ...

Regards, Bodo


* Re: proactive-raid-disk-replacement
  2006-09-10  0:02 ` proactive-raid-disk-replacement Bodo Thiesen
@ 2006-09-10 10:53   ` Tuomas Leikola
  2006-09-16 14:28     ` proactive-raid-disk-replacement Bill Davidsen
  0 siblings, 1 reply; 9+ messages in thread
From: Tuomas Leikola @ 2006-09-10 10:53 UTC (permalink / raw)
  To: Bodo Thiesen; +Cc: Linux RAID

On 9/10/06, Bodo Thiesen <bothie@gmx.de> wrote:
> So we need a way to feed the redundancy from the raid5 back to the raid1.
<snip long explanation>

Sounds awfully complicated to me. Perhaps this is how it internally
works, but my 2 cents go to the option to gracefully remove a device
(migrating to a spare without losing redundancy) in the kernel (or
mdadm).

I'm thinking

mdadm /dev/raid-device -a /dev/new-disk
mdadm /dev/raid-device --graceful-remove /dev/failing-disk

also hopefully a path to do this instead of kicking (multiple) disks
when bad blocks occur.


* Re: proactive-raid-disk-replacement
  2006-09-10 10:53   ` proactive-raid-disk-replacement Tuomas Leikola
@ 2006-09-16 14:28     ` Bill Davidsen
  2006-09-16 17:47       ` proactive-raid-disk-replacement Guy
  0 siblings, 1 reply; 9+ messages in thread
From: Bill Davidsen @ 2006-09-16 14:28 UTC (permalink / raw)
  To: Tuomas Leikola; +Cc: Bodo Thiesen, Linux RAID

Tuomas Leikola wrote:

> On 9/10/06, Bodo Thiesen <bothie@gmx.de> wrote:
>
>> So we need a way to feed the redundancy from the raid5 back to the
>> raid1.
>
> <snip long explanation>
>
> Sounds awfully complicated to me. Perhaps this is how it internally
> works, but my 2 cents go to the option to gracefully remove a device
> (migrating to a spare without losing redundancy) in the kernel (or
> mdadm).
>
> I'm thinking
>
> mdadm /dev/raid-device -a /dev/new-disk
> mdadm /dev/raid-device --graceful-remove /dev/failing-disk
>
> also hopefully a path to do this instead of kicking (multiple) disks
> when bad blocks occur. 


Actually, an internal implementation is really needed if this is to be
generally useful to a non-guru.  And it has other possible uses as well.
If there were just a --migrate command:
  mdadm --migrate /dev/md0 /dev/sda /dev/sdf
as an example for discussion, the whole process of not only moving the
data but also getting recovered information from the RAID array could be
done by software which does the right thing: creating superblocks, copying
the UUID, etc.  And as a last step it could invalidate the superblock on
the failing drive (so reboots would work right) and leave the array
running on the new drive.

But wait, there's more!  Assume that I want to upgrade from a set of
250GB drives to 400GB drives.  Using this feature I could replace one
drive at a time, then --grow the array.  The process for doing that is
currently complex, and the many manual steps invite errors.
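
As an illustration (the --migrate command is of course hypothetical; the
--grow step is plain mdadm):

  mdadm --migrate /dev/md0 /dev/sda /dev/sdf   # repeat per member drive
  mdadm --grow /dev/md0 --size=max             # once all members are 400GB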

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* RE: proactive-raid-disk-replacement
  2006-09-16 14:28     ` proactive-raid-disk-replacement Bill Davidsen
@ 2006-09-16 17:47       ` Guy
  2006-09-21 16:28         ` proactive-raid-disk-replacement Bill Davidsen
  0 siblings, 1 reply; 9+ messages in thread
From: Guy @ 2006-09-16 17:47 UTC (permalink / raw)
  To: 'Bill Davidsen', 'Tuomas Leikola'
  Cc: 'Bodo Thiesen', 'Linux RAID'



} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Bill Davidsen
} Sent: Saturday, September 16, 2006 10:29 AM
} To: Tuomas Leikola
} Cc: Bodo Thiesen; Linux RAID
} Subject: Re: proactive-raid-disk-replacement
} 
} Tuomas Leikola wrote:
} 
} > On 9/10/06, Bodo Thiesen <bothie@gmx.de> wrote:
} >
} >> So we need a way to feed the redundancy from the raid5 back to the
} >> raid1.
} >
} > <snip long explanation>
} >
} > Sounds awfully complicated to me. Perhaps this is how it internally
} > works, but my 2 cents go to the option to gracefully remove a device
} > (migrating to a spare without losing redundancy) in the kernel (or
} > mdadm).
} >
} > I'm thinking
} >
} > mdadm /dev/raid-device -a /dev/new-disk
} > mdadm /dev/raid-device --graceful-remove /dev/failing-disk
} >
} > also hopefully a path to do this instead of kicking (multiple) disks
} > when bad blocks occur.
} 
} 
} Actually, an internal implementation is really needed if this is to be
} generally useful to a non-guru.  And it has other possible uses as well.
} If there were just a --migrate command:
}   mdadm --migrate /dev/md0 /dev/sda /dev/sdf
} as an example for discussion, the whole process of not only moving the
} data but also getting recovered information from the RAID array could be
} done by software which does the right thing: creating superblocks, copying
} the UUID, etc.  And as a last step it could invalidate the superblock on
} the failing drive (so reboots would work right) and leave the array
} running on the new drive.
} 
} But wait, there's more!  Assume that I want to upgrade from a set of
} 250GB drives to 400GB drives.  Using this feature I could replace one
} drive at a time, then --grow the array.  The process for doing that is
} currently complex, and the many manual steps invite errors.

I like the migrate option, or whatever you want to call it.  However, if the
disk is failing I would want to avoid reading from the failing disk and
reconstruct the data from the other disks instead, only reading from the
failing disk if you find a bad block on another disk.  This would put less
strain on the failing disk, possibly allowing it to last long enough to
finish the migration.  Your data would remain redundant throughout the process.

However, if I am upgrading to larger disks, it would be best to read from the
disk being replaced.  You could even migrate many disks at the same time.
Your data would remain redundant throughout the process.

Guy

} 
} --
} bill davidsen <davidsen@tmr.com>
}   CTO TMR Associates, Inc
}   Doing interesting things with small computers since 1979
} 
} -
} To unsubscribe from this list: send the line "unsubscribe linux-raid" in
} the body of a message to majordomo@vger.kernel.org
} More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: proactive-raid-disk-replacement
  2006-09-16 17:47       ` proactive-raid-disk-replacement Guy
@ 2006-09-21 16:28         ` Bill Davidsen
  0 siblings, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2006-09-21 16:28 UTC (permalink / raw)
  To: Guy; +Cc: 'Tuomas Leikola', 'Bodo Thiesen',
	'Linux RAID'

Guy wrote:

>} -----Original Message-----
>} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>} owner@vger.kernel.org] On Behalf Of Bill Davidsen
>} Sent: Saturday, September 16, 2006 10:29 AM
>} To: Tuomas Leikola
>} Cc: Bodo Thiesen; Linux RAID
>} Subject: Re: proactive-raid-disk-replacement
>} 
>} Tuomas Leikola wrote:
>} 
>} > On 9/10/06, Bodo Thiesen <bothie@gmx.de> wrote:
>} >
>} >> So we need a way to feed the redundancy from the raid5 back to the
>} >> raid1.
>} >
>} > <snip long explanation>
>} >
>} > Sounds awfully complicated to me. Perhaps this is how it internally
>} > works, but my 2 cents go to the option to gracefully remove a device
>} > (migrating to a spare without losing redundancy) in the kernel (or
>} > mdadm).
>} >
>} > I'm thinking
>} >
>} > mdadm /dev/raid-device -a /dev/new-disk
>} > mdadm /dev/raid-device --graceful-remove /dev/failing-disk
>} >
>} > also hopefully a path to do this instead of kicking (multiple) disks
>} > when bad blocks occur.
>} 
>} 
>} Actually, an internal implementation is really needed if this is to be
>} generally useful to a non-guru.  And it has other possible uses as well.
>} If there were just a --migrate command:
>}   mdadm --migrate /dev/md0 /dev/sda /dev/sdf
>} as an example for discussion, the whole process of not only moving the
>} data but also getting recovered information from the RAID array could be
>} done by software which does the right thing: creating superblocks, copying
>} the UUID, etc.  And as a last step it could invalidate the superblock on
>} the failing drive (so reboots would work right) and leave the array
>} running on the new drive.
>} 
>} But wait, there's more!  Assume that I want to upgrade from a set of
>} 250GB drives to 400GB drives.  Using this feature I could replace one
>} drive at a time, then --grow the array.  The process for doing that is
>} currently complex, and the many manual steps invite errors.
>
>I like the migrate option, or whatever you want to call it.  However, if the
>disk is failing I would want to avoid reading from the failing disk and
>reconstruct the data from the other disks instead, only reading from the
>failing disk if you find a bad block on another disk.  This would put less
>strain on the failing disk, possibly allowing it to last long enough to
>finish the migration.  Your data would remain redundant throughout the process.
>  
>
This is one of those "maybe" things: the data move would take longer,
increasing the chance of a total failure... etc.  Someone else might speak
to this; I generally find that non-total failures usually show up when
writing bad sectors, while reading doesn't cause problems.  Note _usually_.
In either case, during the migration you wouldn't want to write to the
failing drive, so it would gradually fall out of currency in any case.

>However, if I am upgrading to larger disks, it would be best to read from the
>disk being replaced.  You could even migrate many disks at the same time.
>Your data would remain redundant throughout the process.
>
Migrating many at a time presupposes room for more new drives, and makes
the code seem a lot more complex.  I would hope this doesn't happen often
enough to make efficiency important.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


