* Distributed spares
From: Bill Davidsen @ 2008-10-13 21:50 UTC
To: Neil Brown; +Cc: Linux RAID

Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s) distributed
over multiple drives. This has come up again, so I thought I'd just mention
why, and what advantages it offers.

By spreading the spare over multiple drives, the head motion of normal access
is spread over one (or several) more drives. This reduces seeks, improves
performance, etc. The benefit shrinks as the number of drives in the array
grows; obviously with four drives, using only three for normal operation is
slower than using four. And by using all the drives all the time, the chance
of a spare having gone bad undetected is reduced.

This becomes important as array drive counts shrink. Lower cost for drives
($100/TB!), and attempts to cut power use by running fewer drives, result in
an overall drop in drive count, which matters in serious applications.

All that said, I would really like to bring this up one more time, even if
the answer is "no interest."

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

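To make the idea concrete, here is a minimal sketch of one possible rotating
data/parity/spare layout. It is illustrative only: the rotation rule is an
assumption made up for this example, not the layout used by md or by any
actual RAID-5E/5EE implementation.

  # Illustrative only: a made-up rotating layout for an N-device array that
  # carries N-2 data chunks, 1 parity chunk and 1 idle "spare" chunk per
  # stripe, so every drive sees normal I/O and no whole drive sits idle.

  def raid5e_stripe_map(stripe, ndev):
      """Return the role of each device in one stripe."""
      parity = (ndev - 1 - stripe) % ndev   # parity rotates across devices
      spare = (parity + 1) % ndev           # spare chunk trails the parity
      roles, d = [], 0
      for dev in range(ndev):
          if dev == parity:
              roles.append("P")
          elif dev == spare:
              roles.append("S")
          else:
              roles.append("D%d" % d)
              d += 1
      return roles

  if __name__ == "__main__":
      for s in range(6):
          print("stripe %d: %s" % (s, "  ".join(raid5e_stripe_map(s, 5))))

With five drives this gives the same usable capacity as a four-drive RAID-5
plus a dedicated hot spare; the difference is that all five spindles serve
normal reads and writes.
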
* Re: Distributed spares
From: Justin Piszcz @ 2008-10-13 22:11 UTC
To: Bill Davidsen; +Cc: Neil Brown, Linux RAID

On Mon, 13 Oct 2008, Bill Davidsen wrote:

> Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s) distributed
> over multiple drives. This has come up again, so I thought I'd just mention
> why, and what advantages it offers.
>
> By spreading the spare over multiple drives, the head motion of normal
> access is spread over one (or several) more drives. This reduces seeks,
> improves performance, etc. The benefit shrinks as the number of drives in
> the array grows; obviously with four drives, using only three for normal
> operation is slower than using four. And by using all the drives all the
> time, the chance of a spare having gone bad undetected is reduced.
>
> This becomes important as array drive counts shrink. Lower cost for drives
> ($100/TB!), and attempts to cut power use by running fewer drives, result
> in an overall drop in drive count, which matters in serious applications.
>
> All that said, I would really like to bring this up one more time, even if
> the answer is "no interest."

Bill,

Not a bad idea; however, can the same not be achieved (somewhat) by
performing daily short and weekly long SMART self-tests on the drives to
validate their health? I find this to work fairly well on a large scale.

Justin.

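For reference, the schedule Justin describes is what smartmontools' smartd is
built for. A sketch; device names are placeholders, and whether a -d option
is also needed depends on the controller:

  # /etc/smartd.conf: monitor health, run a short self-test daily at 02:00
  # and a long self-test every Saturday at 03:00, mail root on trouble.
  /dev/sda -a -s (S/../.././02|L/../../6/03) -m root
  /dev/sdb -a -s (S/../.././02|L/../../6/03) -m root

  # One-off tests by hand:
  #   smartctl -t short /dev/sda      (a couple of minutes)
  #   smartctl -t long  /dev/sda      (full surface read)
  #   smartctl -l selftest /dev/sda   (view the results)
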
* Re: Distributed spares
From: Billy Crook @ 2008-10-13 22:30 UTC
To: Justin Piszcz; +Cc: Bill Davidsen, Neil Brown, Linux RAID

Just my two cents... Those daily SMART tests or regularly running badblocks
are fine, but they're not 'real' load. A test can't prove everything is
right; at best it can only prove it didn't find anything wrong. A
distributed spare would exert 'real' load on the spare because the spare
disks ARE the live disks.

On a side note, it would be handy to have a daemon that could run in the
background on large RAID-1s or RAID-6s and, once a month, pull each disk out
of the array sequentially, completely overwrite it, check it with badblocks
several times, run the SMART tests, etc., then rejoin it, reinstall grub,
wait an hour and move on. The point being, of course, to kill weak drives
off early and in a controlled manner. It would be even nicer if there were a
way to hot-transfer one raid component to another without setting anything
faulty. I suppose you could make all the components of the real array be
single-disk RAID-1 arrays for that purpose. Then you could have one extra
disk set aside for this sort of scrubbing, and never even be down one of
your parities. I guess I should add that to my todo list...

-Billy

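A rough sketch of one pass of the cycle Billy describes, using only existing
tools; array and device names are placeholders. Note that the array runs
degraded while the disk is out, which is exactly the objection raised further
down the thread:

  #!/bin/sh
  # Exercise and scrub one member (/dev/sdc1) of /dev/md0, then rejoin it.
  # WARNING: md0 is degraded for the whole test; a second failure loses data.
  set -e
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  badblocks -wsv /dev/sdc1            # destructive write test, four patterns
  smartctl -t long /dev/sdc           # long self-test...
  sleep 7200                          # ...crudely wait for it to finish
  smartctl -l selftest /dev/sdc       # inspect the result before trusting it
  mdadm /dev/md0 --add /dev/sdc1      # rejoin; md resyncs the member
  grub-install /dev/sdc               # reinstall the boot loader if this disk boots
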
* Re: Distributed spares
From: Keld Jørn Simonsen @ 2008-10-13 23:29 UTC
To: Billy Crook; +Cc: Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

On Mon, Oct 13, 2008 at 05:30:49PM -0500, Billy Crook wrote:
> Just my two cents... Those daily SMART tests or regularly running badblocks
> are fine, but they're not 'real' load. A test can't prove everything is
> right; at best it can only prove it didn't find anything wrong. A
> distributed spare would exert 'real' load on the spare because the spare
> disks ARE the live disks.
>
> On a side note, it would be handy to have a daemon that could run in the
> background on large RAID-1s or RAID-6s and, once a month, pull each disk
> out of the array sequentially, completely overwrite it, check it with
> badblocks several times, run the SMART tests, etc., then rejoin it,
> reinstall grub, wait an hour and move on. The point being, of course, to
> kill weak drives off early and in a controlled manner. It would be even
> nicer if there were a way to hot-transfer one raid component to another
> without setting anything faulty. I suppose you could make all the
> components of the real array be single-disk RAID-1 arrays for that
> purpose. Then you could have one extra disk set aside for this sort of
> scrubbing, and never even be down one of your parities. I guess I should
> add that to my todo list...

I have also been thinking a little about this. My idea is that if bit errors
develop on disks, then at first there is maybe one bit error, and the CRC
check on the disk sectors finds and corrects it. If you rewrite such
sectors, that bit error is corrected, and you prevent the one-bit error from
developing into a two-bit error that is not correctable by the CRC. Is there
some merit to this idea?

Furthermore, if bad luck has struck, then in the case of redundant RAIDs you
could, when the on-disk check fails, identify the block in error and
recreate it from the redundant information. That would be good for RAID-1,
RAID-10, RAID-5 and RAID-6. If the block then could not be rewritten without
errors, it could be added to a bad-block list and remapped.

I think there is nothing novel in a scheme like this, but I would like to
know whether it is implemented somewhere. Articles say that bit errors on
disks are becoming more and more frequent, so schemes like this may help the
scary scenario somewhat.

best regards
keld

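Part of this does already exist in md: a scrub pass reads every stripe, and a
sector that fails to read is reconstructed from the other members and
rewritten, which gives the drive a chance to remap it. A sketch of how it is
usually driven (md0 is a placeholder):

  # "check" reads everything and counts parity/copy mismatches;
  # "repair" additionally rewrites whatever disagrees.  Unreadable
  # sectors are reconstructed and rewritten in either mode.
  echo check > /sys/block/md0/md/sync_action
  # ...wait for it to finish, then:
  cat /sys/block/md0/md/mismatch_cnt
  echo repair > /sys/block/md0/md/sync_action   # only if rewrites are wanted

  # Several distributions already run this from cron monthly, e.g. Debian's
  # /usr/share/mdadm/checkarray script.
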
* Re: Distributed spares
From: Martin K. Petersen @ 2008-10-14 10:12 UTC
To: Keld Jørn Simonsen
Cc: Billy Crook, Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

>>>>> "Keld" == Keld Jørn Simonsen <keld@dkuug.dk> writes:

Keld> I have also been thinking a little about this. My idea is that if
Keld> bit errors develop on disks, then at first there is maybe one bit
Keld> error, and the CRC check on the disk sectors finds and corrects it.

Keld> If you rewrite such sectors, that bit error is corrected, and you
Keld> prevent the one-bit error from developing into a two-bit error
Keld> that is not correctable by the CRC.

I think you are assuming that disks are much simpler than they actually are.

A modern disk drive protects a 512-byte sector with a pretty strong ECC that
is capable of correcting errors of up to ~50 bytes. Yes, that's bytes.

Also, many drive firmwares will internally keep track of problematic media
areas and rewrite or reallocate the affected blocks. That includes rewriting
sectors that are susceptible to bleed due to being adjacent to write hot
spots.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: Distributed spares
From: Keld Jørn Simonsen @ 2008-10-14 13:06 UTC
To: Martin K. Petersen
Cc: Billy Crook, Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

On Tue, Oct 14, 2008 at 06:12:29AM -0400, Martin K. Petersen wrote:
> I think you are assuming that disks are much simpler than they actually
> are.
>
> A modern disk drive protects a 512-byte sector with a pretty strong ECC
> that is capable of correcting errors of up to ~50 bytes. Yes, that's
> bytes.
>
> Also, many drive firmwares will internally keep track of problematic
> media areas and rewrite or reallocate the affected blocks. That includes
> rewriting sectors that are susceptible to bleed due to being adjacent to
> write hot spots.

Good to know. Could you tell me whether this is actually true for ordinary
state-of-the-art SATA disks, or only for more expensive drives? Do you have
a good reference for it?

best regards
keld

* RE: Distributed spares
From: David Lethe @ 2008-10-14 13:20 UTC
To: Martin K. Petersen, Keld Jørn Simonsen
Cc: Billy Crook, Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

> -----Original Message-----
> From: Martin K. Petersen
> Sent: Tuesday, October 14, 2008 5:12 AM
>
> I think you are assuming that disks are much simpler than they actually
> are.
>
> A modern disk drive protects a 512-byte sector with a pretty strong ECC
> that is capable of correcting errors of up to ~50 bytes. Yes, that's
> bytes.
>
> Also, many drive firmwares will internally keep track of problematic
> media areas and rewrite or reallocate the affected blocks. That includes
> rewriting sectors that are susceptible to bleed due to being adjacent to
> write hot spots.

Martin is absolutely correct. Enterprise-class drives have come a long way.
They will scan and fix blocks (though certainly not 100% of them) in the
background. The $99 disk drives you get at the local computer retailer now
even have limited BGMS (background media scan) and repair capability. If you
run the built-in diagnostics on a disk drive, you can be presented with a
list of known bad blocks, and when a drive powers up you can sometimes get a
bad-block display in POST.

How about a baby step? When you run offline or online tests, or even plain
media scans, you get a list of known defects. How about a program that
rewrites a RAID-1/3/5/6 stripe, given just the physical device name and the
known bad block number?

As for checking out a disk: the prior poster's idea of putting the RAID into
degraded mode for the purpose of checking out a disk is, frankly, nuts.
NEVER degrade anything. Just use the hot spare: hot-clone the disk in
question to the hot spare, then make that disk the new hot spare, and
repeat. Think of it as a "rotating the tires" mode.

David @ santools com
http://www.santools.com/smart/unix/manual

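Something close to David's "baby step" can be approximated with md's existing
knobs, if the kernel exposes sync_min/sync_max: bound the scrub window around
the suspect area and trigger a repair. The hard part, translating a component
drive's defect LBA into an array sector (data offset, chunk size, layout), is
not shown here, and the sector number below is a placeholder:

  # Hypothetical sketch: repair only the region around one known-bad area.
  ARRAY_SECTOR=123456789    # placeholder: must be computed from the defect LBA
  WINDOW=2048               # sectors of slack on either side

  echo $((ARRAY_SECTOR - WINDOW)) > /sys/block/md0/md/sync_min
  echo $((ARRAY_SECTOR + WINDOW)) > /sys/block/md0/md/sync_max
  echo repair > /sys/block/md0/md/sync_action
  # ...wait, then restore the full range:
  echo 0   > /sys/block/md0/md/sync_min
  echo max > /sys/block/md0/md/sync_max
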
* non-degraded component replacement was Re: Distributed spares
From: David Greaves @ 2008-10-14 12:02 UTC
To: Billy Crook; +Cc: Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID, dean

Billy Crook wrote:
> It would be even nicer if there were a way to hot-transfer one raid
> component to another without setting anything faulty. I suppose you could
> make all the components of the real array be single-disk RAID-1 arrays for
> that purpose. Then you could have one extra disk set aside for this sort
> of scrubbing, and never even be down one of your parities. I guess I
> should add that to my todo list...

IMHO this one should be high on the todo list, especially if it's a
prerequisite for other improvements to resilience.

Right now, if a drive fails or shows signs of going bad, you are in a very
risky situation. I'm sure most here know why: removing the failing drive and
installing a good one to re-sync leaves you very vulnerable; if another
drive fails (even one bad block) then you lose data.

The solution involves RAID-1, but it needs a twist of RAID-5/6:
http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

I think this is what was discussed. Assume md0 has drives A B C D, D is
failing, and E is new:

* add E as a spare
* set E to mirror the 'failing' drive D (with a bitmap?)
* subsequent writes go to both D and E
* recover 99+% of the data from D to E by simple mirroring
* any md0 or D->E read failure on D is recovered by reading A, B and C, not
  E, unless E is in sync; D is not failed out (and it's these tricks that
  stop users from doing all this manually)
* any md0 sector read failure on A, B or C can still (hopefully) be served
  from D even if that region has not yet been mirrored to E (also not
  possible manually)
* once E is fully mirrored, D is removed and the job is done

Personally I think this feature is more important than the reshaping
requests; of course that's just one opinion :)

David

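Billy's layering idea, quoted above, can be spelled out with plain mdadm. A
sketch assuming the array is built this way from day one; device names are
placeholders, and each extra RAID-1 layer costs a superblock and some
bookkeeping:

  # Each member of the "real" array is a deliberately one-sided RAID-1.
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 missing
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb1 missing
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc1 missing
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdd1 missing
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/md1 /dev/md2 /dev/md3 /dev/md4

  # Later, migrate member 1 from sda1 to a new disk without degrading md0:
  mdadm /dev/md1 --add /dev/sde1                      # mirrors sda1 -> sde1 live
  # ...wait for the RAID-1 resync to complete...
  mdadm /dev/md1 --fail /dev/sda1 --remove /dev/sda1  # retire the old disk

The write-up linked above gets a similar effect on an already-existing array
by temporarily slipping a superblock-less mirror under the suspect member;
see that document for the exact steps and caveats.
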
* Re: non-degraded component replacement was Re: Distributed spares
From: Billy Crook @ 2008-10-14 13:18 UTC
To: David Greaves; +Cc: Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID, dean

On Tue, Oct 14, 2008 at 07:02, David Greaves <david@dgreaves.com> wrote:
> IMHO this one should be high on the todo list, especially if it's a
> prerequisite for other improvements to resilience.

Here's the process as I thought it out; I'm sure it can be improved upon.
Component C is the current drive that one wishes to take out of service;
component N is the new drive that one wishes to put in service in its place.

1. Redirect incoming writes from component C to components C AND N.
2. Check that component N is the same size as C or larger.
3. Create a counter curBlock to track the position of the copy, initialized
   to 0.
4. While curBlock < component C's block count:
   - copy block curBlock from component C to curBlock on component N;
   - if the copy fails, try to reconstruct that block from the other disks
     using parity and write the result to component N;
   - increment curBlock.
5. Once the copy is complete, optionally verify: reset curBlock to 0 and,
   while curBlock < component C's block count, compare block curBlock on C
   with block curBlock on N; if a compare fails, terminate with an error and
   stop mirroring writes to N; otherwise increment curBlock.
6. Redirect reads to component N only.
7. Stop writing to component C and write only to N.
8. Present some notification that the process is done.

At all points during the process, redundancy should be as good as or better
than before, and the process can be aborted at any time without disrupting
the array. This could be represented, IMHO, with a different status
character in /proc/mdstat, say M (for migrating), and I'd call the
capability "hot raid component migration" -- just so long as people realise
it's an option for replacing raid components more safely. I bet the majority
of the code needed is already in the raid1 personality. You could accomplish
the same thing by building your 'real' array on top of single-disk raid1
arrays, but oh, that would be messy to look at!

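A toy user-space model of the copy and verify phases above, operating on two
image files, just to pin down the control flow. It ignores the live write
mirroring, locking and kernel integration entirely, and
reconstruct_from_parity is a named stand-in rather than a real function:

  import os

  BLOCK = 64 * 1024

  def migrate(c_path, n_path, reconstruct_from_parity=None):
      """Copy component image C to N block by block, then verify."""
      size = os.path.getsize(c_path)          # N must already exist, >= size
      with open(c_path, "rb") as c, open(n_path, "r+b") as n:
          pos = 0
          while pos < size:                    # copy phase
              try:
                  c.seek(pos)
                  buf = c.read(BLOCK)
              except OSError:                  # unreadable: rebuild instead
                  if reconstruct_from_parity is None:
                      raise
                  buf = reconstruct_from_parity(pos, BLOCK)
              n.seek(pos)
              n.write(buf)
              pos += len(buf)
          pos = 0
          while pos < size:                    # verify phase
              c.seek(pos)
              n.seek(pos)
              if c.read(BLOCK) != n.read(BLOCK):
                  raise RuntimeError("verify failed at offset %d" % pos)
              pos += BLOCK
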
* Re: Distributed spares
From: Bill Davidsen @ 2008-10-14 23:20 UTC
To: Justin Piszcz; +Cc: Neil Brown, Linux RAID

Justin Piszcz wrote:
> Bill,
>
> Not a bad idea; however, can the same not be achieved (somewhat) by
> performing daily short and weekly long SMART self-tests on the drives to
> validate their health? I find this to work fairly well on a large scale.

Not really: the performance benefit comes from spreading head motion over
(at least) one more drive. You can get a check on basic functionality with
SMART, but it doesn't beat the drive the way real load does. Add to that the
unfortunate problem that more realistic testing also takes up I/O bandwidth
for non-productive transfers. Better to be doing actual live data transfers
to those drives if you can.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

* Re: Distributed spares
From: Neil Brown @ 2008-10-14 10:04 UTC
To: Bill Davidsen; +Cc: Linux RAID

On Monday October 13, davidsen@tmr.com wrote:
> Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s) distributed
> over multiple drives. This has come up again, so I thought I'd just mention
> why, and what advantages it offers.
>
> By spreading the spare over multiple drives, the head motion of normal
> access is spread over one (or several) more drives. This reduces seeks,
> improves performance, etc. The benefit shrinks as the number of drives in
> the array grows; obviously with four drives, using only three for normal
> operation is slower than using four. And by using all the drives all the
> time, the chance of a spare having gone bad undetected is reduced.
>
> This becomes important as array drive counts shrink. Lower cost for drives
> ($100/TB!), and attempts to cut power use by running fewer drives, result
> in an overall drop in drive count, which matters in serious applications.
>
> All that said, I would really like to bring this up one more time, even if
> the answer is "no interest."

How are your coding skills?

The tricky bit is encoding the new state. We can no longer tell the
difference between "optimal" and "degraded" based on the number of in-sync
devices. We also need some state flag to say that the "distributed spare"
has been constructed. Maybe that could be encoded in the "layout".

We would also need to allow a "recovery" pass to happen without having
actually added any spares, or having any non-in-sync devices. That probably
means passing the decision "is a recovery pending" down into the personality
rather than making it in common code. Maybe have some field in the mddev
structure which the personality sets if a recovery is worth trying. Or maybe
just try it anyway after any significant change, and if the personality
finds nothing can be done, it aborts.

I'm happy to advise on, review, and eventually accept patches.

NeilBrown

* Re: Distributed spares
From: Bill Davidsen @ 2008-10-16 23:50 UTC
To: Neil Brown; +Cc: Linux RAID

Neil Brown wrote:
> How are your coding skills?
>
> The tricky bit is encoding the new state. We can no longer tell the
> difference between "optimal" and "degraded" based on the number of in-sync
> devices. We also need some state flag to say that the "distributed spare"
> has been constructed. Maybe that could be encoded in the "layout".
>
> We would also need to allow a "recovery" pass to happen without having
> actually added any spares, or having any non-in-sync devices. That
> probably means passing the decision "is a recovery pending" down into the
> personality rather than making it in common code. Maybe have some field in
> the mddev structure which the personality sets if a recovery is worth
> trying. Or maybe just try it anyway after any significant change, and if
> the personality finds nothing can be done, it aborts.

My coding skills are fine here, but I have to do a lot of planning before
even considering this. Here's why: say you have a five-drive RAID-5e and you
are running happily. A drive fails! Now you can rebuild onto the spare, but
the spare "drive" must be assembled from the spare space on the remaining
functional drives, so it can't be laid out pre-failure; the allocation has
to be defined after you see what you have left. Does that sound ugly and
complex? Does to me, too. So I'm thinking about this, and doing some
reading, but it's not quite as simple as I thought.

> I'm happy to advise on, review, and eventually accept patches.

Actually, what I think I would do is build a test bed in software before
trying this in the kernel, then run the kernel part in a virtual machine. I
have another idea, which has about 75% of the benefit with 10% of the
complexity. Since it sounds too good to be true it probably is; I'll get
back to you after I think about the simpler solution. I distrust free-lunch
algorithms.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

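One possible way around deciding the allocation after the failure: if the
spare chunks follow a fixed rotation (as in the earlier illustrative layout),
the post-failure placement is already determined; stripe by stripe, the chunk
that lived on the dead device is rebuilt into that stripe's spare slot. A
sketch under the same assumed rotation, which again is not any real
implementation's rule:

  def raid5e_stripe_map(stripe, ndev):
      # same made-up rotation as the earlier sketch
      parity = (ndev - 1 - stripe) % ndev
      spare = (parity + 1) % ndev
      roles, d = [], 0
      for dev in range(ndev):
          if dev == parity:
              roles.append("P")
          elif dev == spare:
              roles.append("S")
          else:
              roles.append("D%d" % d)
              d += 1
      return roles

  def degraded_map(stripe, ndev, failed):
      roles = raid5e_stripe_map(stripe, ndev)
      spare = roles.index("S")
      if failed != spare:                     # if the dead disk only held the
          roles[spare] = roles[failed] + "'"  # spare chunk, nothing to rebuild
      roles[failed] = "x"
      return roles

  if __name__ == "__main__":
      for s in range(5):
          print("stripe %d: %s" % (s, "  ".join(degraded_map(s, 5, failed=2))))

The open question is then less about allocation and more about recording, in
the layout or superblock, which device the rebuilt chunks now stand in for.
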
* RE: Distributed spares
From: David Lethe @ 2008-10-17 4:09 UTC
To: Bill Davidsen, Neil Brown; +Cc: Linux RAID

> -----Original Message-----
> From: Bill Davidsen
> Sent: Thursday, October 16, 2008 6:50 PM
>
> My coding skills are fine here, but I have to do a lot of planning before
> even considering this. Here's why: say you have a five-drive RAID-5e and
> you are running happily. A drive fails! Now you can rebuild onto the
> spare, but the spare "drive" must be assembled from the spare space on the
> remaining functional drives, so it can't be laid out pre-failure; the
> allocation has to be defined after you see what you have left. Does that
> sound ugly and complex? Does to me, too. So I'm thinking about this, and
> doing some reading, but it's not quite as simple as I thought.
>
> Actually, what I think I would do is build a test bed in software before
> trying this in the kernel, then run the kernel part in a virtual machine.
> I have another idea, which has about 75% of the benefit with 10% of the
> complexity. Since it sounds too good to be true it probably is; I'll get
> back to you after I think about the simpler solution. I distrust
> free-lunch algorithms.

With all due respect, RAID-5E isn't practical. There are too many corner
cases, from the performance implications to simply deciding where to put the
parity and spare blocks so that when a disk fails you haven't ended up with
the hot-spare chunk sitting on the drive that just died. What about all the
dozens of utilities that would need to know about RAID-5E to work properly?
The cynic in me says that if patches dealing with mdadm on established RAID
levels are still being recalled (like today), RAID-5E is going to be much
worse. Algorithms dealing with drive failures and unrecoverable read/write
errors in normal operation as well as during rebuilds, expansions, and
journalization/optimization are not well understood. It is new territory.

If you want multiple distributed spares, just use RAID-6; it is better than
RAID-5 in that respect, and nobody has to reinvent the wheel. Your "hot
spare" is still distributed across all of the disks, and you can survive
multiple drive failures. If your motivation is performance, then buy faster
disks and additional controller(s), optimize your storage pools, and tune
your md settings to be more compatible with your filesystem parameters. Or
even look at your application and see if anything can be done to reduce the
I/O count. The fastest I/Os are the ones you eliminate.

David

* Re: Distributed spares
From: Bill Davidsen @ 2008-10-17 13:46 UTC
To: David Lethe; +Cc: Neil Brown, Linux RAID

David Lethe wrote:
> With all due respect, RAID-5E isn't practical. There are too many corner
> cases, from the performance implications to simply deciding where to put
> the parity and spare blocks so that when a disk fails you haven't ended up
> with the hot-spare chunk sitting on the drive that just died.

Having run 38 multi-TB machines for an ISP using RAID-5e in the SCSI
controller, I feel pretty sure the practicality is established; only the
ability to reinvent that particular wheel is in question. The complexity is
that the hot-spare drive needs to be defined after the first drive failure,
using the spare sectors on the remaining functional drives.

> What about all the dozens of utilities that would need to know about
> RAID-5E to work properly?

What did you have in mind? Virtually all utilities working at the filesystem
level have no need to know anything; they treat any array as a drive, in
black-box fashion.

> The cynic in me says that if patches dealing with mdadm on established
> RAID levels are still being recalled (like today), RAID-5E is going to be
> much worse.

Let's definitely not add anything to the kernel, then, since a
feature-static kernel is so much more stable. Features like the software
RAID-10 (not 1+0) are not established in any standard I've seen, but they
work just fine. And this is not unexplored territory: a distributed spare is
called RAID-5e on the IBM servers I used, and I believe Storage Computer (in
NH) has a similar feature they call "RAID-7" and trademark.

> Algorithms dealing with drive failures and unrecoverable read/write errors
> in normal operation as well as during rebuilds, expansions, and
> journalization/optimization are not well understood. It is new territory.

That's why I'm being quite cautious about saying I can do this; the coding
is easy, it's finding out what to code that's hard. It appears that
configuration decisions need to be made after the failure event, before the
rebuild. Yes, it's complex. But from experience I can say that performance
during rebuild is far better with a distributed spare than beating the snot
out of one newly added spare, as with other RAID levels. So there's a
performance benefit in both the normal case and the rebuild case, and a side
benefit of faster rebuild time. The full recovery after replacing the failed
drive is also an interesting time. :-(

> If you want multiple distributed spares, just use RAID-6; it is better
> than RAID-5 in that respect, and nobody has to reinvent the wheel. If your
> motivation is performance, then buy faster disks and additional
> controller(s), optimize your storage pools, and tune your md settings to
> be more compatible with your filesystem parameters.

The motivation is to get the best performance from the hardware you have.
Adding hardware cost so you can use storage hardware inefficiently is
*really* not practical. Neither power, cooling, drives, nor floor space is
cheap enough to use poorly.

> The fastest I/Os are the ones you eliminate.

And the fastest seeks are the ones you don't do, because you spread head
motion over more drives. Once you have a distributed spare in the kernel,
you have a free performance gain; or as free as using either more CPU or
more memory for mapping will allow. Most people will trade a little of
either for better performance.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

* Re: Distributed spares
From: Neil Brown @ 2008-10-20 1:11 UTC
To: Bill Davidsen; +Cc: David Lethe, Linux RAID

On Friday October 17, davidsen@tmr.com wrote:
> Having run 38 multi-TB machines for an ISP using RAID-5e in the SCSI
> controller, I feel pretty sure the practicality is established; only the
> ability to reinvent that particular wheel is in question. The complexity
> is that the hot-spare drive needs to be defined after the first drive
> failure, using the spare sectors on the remaining functional drives.

I don't think that will be particularly complex. It will just be a bit of
code in raid5_compute_sector. The detail of "which device has failed" would
be stored in ->algorithm somehow.

There is an interesting question of how general we want the code to be,
e.g. do we want to be able to configure an array with two distributed
spares? I suspect that people would rarely want two, and never want three,
so it would be worth making two work if the code didn't get too complex,
which I don't think it would (but I'm not certain).

> But from experience I can say that performance during rebuild is far
> better with a distributed spare than beating the snot out of one newly
> added spare, as with other RAID levels. So there's a performance benefit
> in both the normal case and the rebuild case, and a side benefit of
> faster rebuild time.

I cannot see why rebuilding a raid5e would be faster than rebuilding a raid5
onto a fresh device. In each case you need to read from n-1 devices and
write to one device, so all devices are constantly doing IO at the same
rate. In the raid5 case you could get better streaming, as each device is
either "always reading" or "always writing", whereas in a raid5e rebuild
devices will sometimes read and sometimes write. So if anything I would
expect raid5e to rebuild more slowly, though you would probably only notice
this with small chunk sizes.

I agree that (with suitably large chunk sizes) you should be able to get
better throughput on raid5e.

NeilBrown

* Re: Distributed spares
From: Gabor Gombas @ 2008-10-17 13:09 UTC
To: Neil Brown; +Cc: Bill Davidsen, Linux RAID

On Tue, Oct 14, 2008 at 09:04:25PM +1100, Neil Brown wrote:
> The tricky bit is encoding the new state. We can no longer tell the
> difference between "optimal" and "degraded" based on the number of in-sync
> devices. We also need some state flag to say that the "distributed spare"
> has been constructed. Maybe that could be encoded in the "layout".

Or you need to add a "virtual" spare that does not have an actual block
device behind it. Or rather, it could be a virtual disk constructed from the
spare chunks on the data disks; maybe device mapper could be used here? If
you could create such a virtual disk, then maybe the normal RAID-5 code
could just do the rest. Of course, the mapping of stripes to disk locations
has to be changed to account for the "black holes" now belonging to the
virtual spare device.

Hmm: if you create DM devices for all the disks that just leave the proper
holes out of the mapping, and you also create a DM device from the holes,
and you build an MD RAID-5 plus spare on top of those DM devices, then you
could have RAID-5e right now, completely from userspace. The superblock
would have to be changed so nobody ever tries to assemble a RAID-5 from the
raw disks, and there may be some other details...

Gabor

-- 
---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
---------------------------------------------------------

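A sketch of the device-mapper half of that experiment, assuming three 4 GiB
members that each reserve their last 1 GiB as spare space; all names, sizes
and offsets are invented, and the superblock/assembly problem Gabor mentions
is the real catch. Putting the reserved space at the end keeps each per-disk
data view a simple prefix, so no stripe remapping is needed; the obvious
flaw, raised earlier in the thread, is that a failed disk takes a third of
the "spare" with it.

  # Three 4 GiB partitions (8388608 sectors); the last 1 GiB (2097152
  # sectors, starting at sector 6291456) of each is reserved.
  for d in a b c; do
      # data view: the first 3 GiB of each disk
      echo "0 6291456 linear /dev/sd${d}1 0" | dmsetup create sd${d}_data
  done

  # virtual spare: the three reserved tails concatenated (3 GiB, i.e. the
  # same size as one data view, which a spare has to be)
  printf '%s\n' \
      '0 2097152 linear /dev/sda1 6291456' \
      '2097152 2097152 linear /dev/sdb1 6291456' \
      '4194304 2097152 linear /dev/sdc1 6291456' | dmsetup create vspare

  # In principle the array is then built on top of the views:
  #   mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
  #         /dev/mapper/sda_data /dev/mapper/sdb_data /dev/mapper/sdc_data \
  #         /dev/mapper/vspare
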