Re: high throughput storage server?

From: David Brown <david@westcontrol.com>
To: linux-raid@vger.kernel.org
Subject: Re: high throughput storage server?
Date: Wed, 23 Feb 2011 14:56:34 +0100
Message-ID: <ik33nt$fd8$1@dough.gmane.org>
In-Reply-To: <4D64A082.9000601@hardwarefreak.com>

On 23/02/2011 06:52, Stan Hoeppner wrote:
> David Brown put forth on 2/22/2011 8:18 AM:
>
>> Yes, this is definitely true - RAID10 is less affected by running
>> degraded, and recovering is faster and involves less disk wear.  The
>> disadvantage compared to RAID6 is, of course, if the other half of a
>> disk pair dies during recovery then your raid is gone - with RAID6 you
>> have better worst-case redundancy.
>
> The odds of the mirror partner dying during rebuild are very very long,
> and the odds of suffering a URE are very low.  However, in the case of
> RAID5/6, moreso with RAID5, with modern very large drives (1/2/3TB),
> there is being quite a bit written these days about unrecoverable read
> error rates.  Using a sufficient number of these very large disks will
> at some point guarantee a URE during an array rebuild, which may very
> likely cost you your entire array.  This is because every block of every
> remaining disk (assuming full disk RAID not small partitions on each
> disk) must be read during a RAID5/6 rebuild.  I don't have the equation
> handy but Google should be able to fetch it for you.  IIRC this is one
> of the reasons RAID6 is becoming more popular today.  Not just because
> it can survive an additional disk failure, but that it's more resilient
> to a URE during a rebuild.
>

It is certainly the case that the chance of a second failure when doing
a RAID5/6 rebuild goes up with the number of disks (since all the disks
are stressed during the rebuild, and any failures are relevant), while
with a RAID10 rebuild the chance of a second failure is limited to the
single surviving mirror partner being read.

However, as disks get bigger, the chance of errors on any given disk is 
increasing.  And the fact remains that if you have a failure on a RAID10 
system, you then have a single point of failure during the rebuild 
period - while with RAID6 you still have redundancy (obviously RAID5 is 
far worse here).
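
For reference, the equation Stan mentions can be approximated with a 
simple independent-bit-error model.  The sketch below is only that: the 
URE rate of 1 in 10^14 bits is an assumed spec-sheet style figure (it 
does not come from this thread), the function names are made up for 
illustration, and real errors are not independent per bit.

```python
import math

TB = 10**12  # decimal terabyte, as used on drive labels


def p_at_least_one_ure(bytes_read, ure_per_bit=1e-14):
    """Probability of hitting at least one URE while reading bytes_read.

    Assumes each bit fails independently with probability ure_per_bit
    (an assumed figure, not a measured one).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))


def rebuild_read_bytes(level, n_disks, disk_bytes):
    """Bytes that must be read to rebuild one failed disk."""
    if level == "raid10":
        # Only the surviving mirror partner is read.
        return disk_bytes
    if level in ("raid5", "raid6"):
        # Every remaining disk is read in full.
        return (n_disks - 1) * disk_bytes
    raise ValueError("unknown RAID level: %s" % level)


for level in ("raid10", "raid5"):
    to_read = rebuild_read_bytes(level, n_disks=8, disk_bytes=2 * TB)
    print(level, round(p_at_least_one_ure(to_read), 3))
```

Note that with RAID6 a URE during a single-disk rebuild can still be 
recovered from the second parity, so the figure above is the chance of 
needing that extra redundancy rather than the chance of losing the array.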

> With a RAID10 rebuild, as you're only reading entire contents of a
> single disk, the odds of encountering a URE are much lower than with a
> RAID5 with the same number of drives, simply due to the total number of
> bits read.
>
>> Once md raid has support for bad block lists, hot replace, and non-sync
>> lists, then the differences will be far less clear.  If a disk in a RAID
>> 5/6 set has a few failures (rather than dying completely), then it will
>> run as normal except when bad blocks are accessed.  This means for all
>> but the few bad blocks, the degraded performance will be full speed. And
>
> You're muddying the definition of a "degraded RAID".
>

That could be the case - I'll try to be clearer.  It is certainly 
possible that I'm getting terminology wrong.

>> if you use "hot replace" to replace the partially failed drive, the
>> rebuild will have almost exactly the same characteristics as RAID10
>> rebuilds - apart from the bad blocks, which must be recovered by parity
>> calculations, you have a straight disk-to-disk copy.
>
> Are you saying you'd take a "partially failing" drive in a RAID5/6 and
> simply do a full disk copy onto the spare, except "bad blocks",
> rebuilding those in the normal fashion, simply to approximate the
> recover speed of RAID10?
>
> I think your logic is a tad flawed here.  If a drive is already failing,
> why on earth would you trust it, period?  I think you'd be asking for
> trouble doing this.  This is precisely one of the reasons many hardware
> RAID controllers have historically kicked drives offline after the first
> signs of trouble--if a drive is acting flaky we don't want to trust it,
> but replace it as soon as possible.
>

I don't know if you've followed the recent "md road-map: 2011" thread (I 
can't see any replies from you in the thread), but that is my reference 
point here.

Sometimes disks die suddenly and catastrophically.  When that happens, 
the disk is gone and needs to be kicked offline.

Other times, you have a single-event corruption - for some reason, a 
particular block got corrupted.  And sometimes the disk is wearing out - 
disks have a set of replacement blocks for re-locating known bad blocks, 
and in the end these will run out.  Either you get a URE, or a write
failure.

(I don't have any idea what the ratio of these sorts of failure modes is.)

If you have a drive with a few failures, then the rest of the data is 
still correct.  You can expect that if the drive returns data 
successfully for a read, then the data is valid - that's what the 
drive's ECC is for.  But you would not want to trust it with new data, 
and you would want to replace it as soon as possible.

The point of md raid's planned "bad block list" is to track which areas 
of the drive should not be used.  And the "hot replace" feature is aimed 
at making a direct copy of a disk - excluding the bad blocks - to make 
replacement of failed drives faster and safer.  Since the failing drive 
is not removed from the array until the hot replace takes over, you 
still have full redundancy for most of the array - just not for stripes 
that contain a bad block.
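
As a rough illustration of the idea (not md's actual implementation, 
which was still only planned at the time of writing), the toy model 
below treats each disk as a list of integer blocks in a RAID5-style 
set, where the XOR of all blocks in a stripe is zero.  The function 
name and the data layout are hypothetical.

```python
from functools import reduce
from operator import xor


def hot_replace(disks, failing, bad_blocks):
    """Build the contents of a replacement for disks[failing].

    Good blocks are copied straight from the failing disk; blocks listed
    in bad_blocks are rebuilt by XOR-ing the same stripe on all other
    disks (single-parity reconstruction).
    """
    replacement = []
    for stripe in range(len(disks[failing])):
        if stripe not in bad_blocks:
            # Straight disk-to-disk copy, as in a RAID1/RAID10 rebuild.
            replacement.append(disks[failing][stripe])
        else:
            # Only here do the other disks need to be read.
            others = (d[stripe] for i, d in enumerate(disks) if i != failing)
            replacement.append(reduce(xor, others))
    return replacement


# Two data disks plus one parity disk, four stripes each.
d0 = [0x11, 0x22, 0x33, 0x44]
d1 = [0xA0, 0xB0, 0xC0, 0xD0]
parity = [a ^ b for a, b in zip(d0, d1)]

# Disk 1 has developed one unreadable block at stripe 2.
damaged = list(d1)
damaged[2] = None
new_disk = hot_replace([d0, damaged, parity], failing=1, bad_blocks={2})
print(new_disk == d1)
```

In this model the other disks are touched only for stripe 2; every 
other block is a direct copy from the failing disk.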

I can well imagine that hardware RAID controllers don't have this sort 
of flexibility.

> The assumption is that the data on the array is far more valuable than
> the cost of a single drive or the entire hardware for that matter.  In
> most environments this is the case.  Everyone seems fond of the WD20EARS
> drives (which I disdain).  I hear they're loved because Newegg has them
> for less than $100.  What's your 2TB of data on that drive worth?  In
> the case of a MythTV box, to the owner, that $100 is worth more than the
> content.  In a business setting, I'd dare say the data on that drive is
> worth far more than the $100 cost of the drive and the admin $$ time
> required to replace/rebuild it.
>
> In the MythTV case what you propose might be a worthwhile risk.  In a
> business environment, definitely not.
>

I believe it is the value of the data - and the value of keeping as much
redundancy as you can while minimising the risky rebuild period - that is
Neil Brown's motivation behind the bad block list and hot replace.  It
could well be that I'm not explaining it very well - but this is /not/
about saving money by continuing to use a dodgy disk even though you
know it is failing.  It is about a dodgy disk that still holds most of the
data set being a lot better than no disk when it comes to rebuild speed
and data redundancy.

Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where 
you have a RAID5 or RAID6 built from RAID1 pairs?  You get all the
rebuild benefits of RAID1 or RAID10, such as simple and fast direct 
copies for rebuilds, and little performance degradation.  But you also 
get multiple failure redundancy from the RAID5 or RAID6.  It could be 
that it is excessive - that the extra redundancy is not worth the 
performance cost (you still have poor small write performance).
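
To put rough numbers on that trade-off, here is a small sketch that 
compares usable capacity and guaranteed failure tolerance for parity 
RAID layered over two-way mirror pairs.  The function name is 
hypothetical, and the model assumes identical disks and plain two-way 
mirrors.

```python
def mirrored_parity_layout(n_disks, parity):
    """Capacity and worst-case redundancy for parity RAID over RAID1 pairs.

    parity=0 models RAID10, parity=1 RAID1+5, parity=2 RAID1+6.
    Returns (usable_disks, failures_always_survived).
    """
    if n_disks % 2:
        raise ValueError("need an even number of disks for mirror pairs")
    pairs = n_disks // 2
    if pairs <= parity:
        raise ValueError("not enough mirror pairs for this parity level")
    usable = pairs - parity
    # The array is lost only when parity + 1 whole pairs are lost, and
    # losing a pair takes two disk failures, so the worst case needs
    # 2 * (parity + 1) failures; one fewer is always survivable.
    survived = 2 * (parity + 1) - 1
    return usable, survived


for name, parity in (("RAID10", 0), ("RAID1+5", 1), ("RAID1+6", 2)):
    usable, survived = mirrored_parity_layout(12, parity)
    print(name, "usable disks:", usable, "any failures survived:", survived)
```

The extra redundancy costs one or two more disks' worth of capacity on 
top of the half already given up to mirroring.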
