From: Bill Davidsen <davidsen@tmr.com>
To: David Lethe <david@santools.com>
Cc: Neil Brown <neilb@suse.de>, Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: Distributed spares
Date: Fri, 17 Oct 2008 09:46:49 -0400
Message-ID: <48F89749.5030500@tmr.com>
In-Reply-To: <A20315AE59B5C34585629E258D76A97C024F81A1@34093-C3-EVS3.exchange.rackspace.com>
David Lethe wrote:
>
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of Bill Davidsen
>> Sent: Thursday, October 16, 2008 6:50 PM
>> To: Neil Brown
>> Cc: Linux RAID
>> Subject: Re: Distributed spares
>>
>> Neil Brown wrote:
>>
>>> On Monday October 13, davidsen@tmr.com wrote:
>>>
>>>> Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s)
>>>> distributed over multiple drives. This has come up again, so I
>>>> thought I'd just mention why, and what advantages it offers.
>>>>
>>>> By spreading the spare over multiple drives, the head motion of
>>>> normal access is spread over one (or several) more drives. This
>>>> reduces seeks, improves performance, etc. The benefit diminishes as
>>>> the number of drives in the array gets larger; obviously with four
>>>> drives, using only three for normal operation is slower than using
>>>> four, etc. And by using all the drives all the time, the chance of
>>>> a spare going bad undetected is reduced.
>>>>
>>>> This becomes important as array drive counts shrink. Lower cost for
>>>> drives ($100/TB!), and attempts to drop power use by using fewer
>>>> drives, result in an overall drop in drive count, important in
>>>> serious applications.
>>>>
>>>> All that said, I would really like to bring this up one more time,
>>>> even if the answer is "no interest."
>>>>
>>> How are your coding skills?
>>>
>>> The tricky bit is encoding the new state.
>>> We can no longer tell the difference between "optimal" and
>>> "degraded" based on the number of in-sync devices. We also need
>>> some state flag to say that the "distributed spare" has been
>>> constructed. Maybe that could be encoded in the "layout".
>>>
>>> We would also need to allow a "recovery" pass to happen without
>>> having actually added any spares, or having any non-in-sync
>>> devices. That probably means passing the decision "is a recovery
>>> pending" down into the personality rather than making it in common
>>> code. Maybe have some field in the mddev structure which the
>>> personality sets if a recovery is worth trying. Or maybe just try
>>> it anyway after any significant change, and if the personality
>>> finds nothing can be done it aborts.
>>>
>> My coding skills are fine here, but I have to do a lot of planning
>> before even considering this. Here's why: say you have a five-drive
>> RAID-5e, and you are running happily. A drive fails! Now you can
>> rebuild on the spare drive, but the spare drive must be created from
>> the parts of the remaining functional drives, so it can't be done
>> pre-failure; the allocation has to be defined after you see what you
>> have left. Does that sound ugly and complex? Does to me, too. So I'm
>> thinking about this, and doing some reading, but it's not quite as
>> simple as I thought.
>>
>>> I'm happy to advise on, review, and eventually accept patches.
>>>
>>>
>> Actually, what I think I would do is build a test bed in software
>> before trying this in the kernel, then run the kernel part in a
>> virtual machine. I have another idea, which has about 75% of the
>> benefit with 10% of the complexity. Since it sounds too good to be
>> true it probably is; I'll get back to you after I think about the
>> simpler solution. I distrust free-lunch algorithms.
>>
>>> NeilBrown
>>>
>> --
>>
>> Bill Davidsen <davidsen@tmr.com>
>> "Woe unto the statesman who makes war without a reason that will
>> still be valid when the war is over..." Otto von Bismarck
>>
>
> With all due respect, RAID5E isn't practical. Too many corner cases
> dealing with performance implications, and with where you even put
> the parity block, to ensure that when a disk fails you haven't put
> yourself into a situation where the hot spare chunk is located on
> the disk drive that just died.
>
Having run 38 multi-TB machines for an ISP using RAID5e in the SCSI
controller, I feel pretty sure that the practicality is established, and
only the ability to reinvent that particular wheel is in question. The
complexity is that the hot spare drive needs to be defined after the 1st
drive failure, using the spare sectors on the functional drives.
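To make that concrete, here is a rough user-space sketch of the kind
of post-failure remapping I mean (purely illustrative; the names are
invented, and this is not how the controller firmware or md does it).
Each survivor reserves a spare region, and the dead drive's chunks are
dealt out round-robin into those regions, so no single drive absorbs
all of the rebuild writes:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only -- invented names, not md code.  An n-disk
 * RAID-5e where each member reserves space at its end to hold
 * 1/(n-1) of one drive's chunks as the distributed spare.
 */
struct spare_map {
    int ndisks;            /* members in the original array      */
    int failed;            /* index of the member that died      */
    uint64_t spare_start;  /* first spare chunk on each survivor */
};

/* Where does chunk 'stripe' of the failed drive get rebuilt?
 * Survivors take turns, skipping the dead slot.
 */
static void remap_chunk(const struct spare_map *m, uint64_t stripe,
                        int *disk, uint64_t *chunk)
{
    int d = stripe % (m->ndisks - 1);

    *disk  = (d >= m->failed) ? d + 1 : d;   /* skip the dead disk */
    *chunk = m->spare_start + stripe / (m->ndisks - 1);
}

int main(void)
{
    struct spare_map m = { .ndisks = 5, .failed = 2, .spare_start = 1000 };

    for (uint64_t s = 0; s < 8; s++) {
        int disk;
        uint64_t chunk;

        remap_chunk(&m, s, &disk, &chunk);
        printf("lost chunk %2llu -> disk %d, chunk %llu\n",
               (unsigned long long)s, disk, (unsigned long long)chunk);
    }
    return 0;
}

The ugly part is exactly what that sketch glosses over: the map
depends on which drive failed, so it can't be written down ahead of
time.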
> What about all of those dozens of utilities that need to know about
> RAID5E to work properly?
>
What did you have in mind? Virtually all utilities working at the
filesystem level have no need to know anything, and treat any array as a
drive in a black box fashion.
> The cynic in me says that if they're still having to recall patches
> (like today) that deal with mdadm on established RAID levels, then
> RAID5E is going to be much worse.
>
Let's definitely not add anything to the kernel, then, since a
feature-static kernel is much more stable. Features like the software
RAID-10 (not 1+0) layouts aren't established in any standard I've seen,
but they work just fine. And this is not unexplored territory: the
distributed spare is called RAID5e on the IBM servers I used, and I
believe Storage Computer (in NH) has a similar feature they call, and
have trademarked, "RAID-7".
> Algorithms dealing with drive failures, unrecoverable read/write
> errors on normal operations as well as rebuilds, expansions, and
> journalization/optimization are not well understood. It is new
> territory.
>
That's why I'm being quite cautious about saying I can do this: the
coding is easy, it's finding out what to code that's hard. It appears
that configuration decisions need to be made after the failure event,
before the rebuild. Yes, it's complex. But from experience I can say
that performance during rebuild is far better with a distributed spare
than with beating the snot out of one newly added spare, as other RAID
levels do. So there's a performance benefit for both the normal case
and the rebuild case, and a side benefit of faster rebuild time.
The full recovery after replacing the failed drive is also an
interesting time. :-(
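A back-of-envelope illustration (not a benchmark, just the obvious
arithmetic): the rebuild has to read roughly the same amount from
every survivor either way, but with a dedicated hot spare every
reconstructed chunk is written to one drive, while a distributed spare
shares those writes among the n-1 survivors:

#include <stdio.h>

int main(void)
{
    for (int n = 4; n <= 8; n++) {
        double dedicated   = 100.0;           /* % of rebuild writes on the lone spare */
        double distributed = 100.0 / (n - 1); /* % per survivor, spare spread out      */

        printf("%d drives: %3.0f%% of rebuild writes on a dedicated spare, "
               "%2.0f%% per survivor with a distributed spare\n",
               n, dedicated, distributed);
    }
    return 0;
}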
> If you want multiple distributed spares, just do RAID6; it is better
> than RAID5 in that respect, and nobody has to re-invent the wheel.
> Your "hot spare" is still distributed across all of the disks, and
> you can survive multiple drive failures. If your motivation is
> performance, then buy faster disks, additional controller(s),
> optimize your storage pools, and tune your md settings to be more
> compatible with your filesystem parameters. Or even look at your
> application and see if anything can be done to reduce the I/O count.
>
The motivation is to get the best performance from the hardware you
have. Adding hardware so you can keep using your storage inefficiently
is *really* not practical. Neither power, cooling, drives, nor floor
space is cheap enough to use poorly.
> The fastest I/Os are the ones you eliminate.
>
And the fastest seeks are the ones you don't do, because you spread
head motion over more drives. But once you have a distributed spare in
the kernel, you get a free performance gain, or as free as using
either more CPU or more memory for the mapping will allow. Most people
will trade a little of either for better performance.
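For the mapping cost, the normal-operation layout can be as cheap as a
couple of modulo operations per stripe. A guess at one possible
rotating layout (not IBM's actual RAID-5e geometry and not md layout
code, just an illustration of the idea):

#include <stdio.h>

enum role { DATA, PARITY, SPARE };

/* Per stripe, one member holds parity, the next holds the spare
 * chunk (unused until a failure), and the rest hold data, so the
 * extra head motion is shared by every spindle instead of one drive
 * sitting idle.
 */
static enum role chunk_role(int ndisks, unsigned long stripe, int disk)
{
    int parity = stripe % ndisks;
    int spare  = (parity + 1) % ndisks;

    if (disk == parity)
        return PARITY;
    if (disk == spare)
        return SPARE;
    return DATA;
}

int main(void)
{
    static const char *name[] = { "D", "P", "S" };
    int n = 5;

    for (unsigned long s = 0; s < 5; s++) {
        printf("stripe %lu:", s);
        for (int d = 0; d < n; d++)
            printf(" %s", name[chunk_role(n, s, d)]);
        printf("\n");
    }
    return 0;
}

The memory cost only shows up after a failure, when the remapping
table for the spare has to be built and consulted.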
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck