From: Stan Hoeppner <stan@hardwarefreak.com>
To: joystick <joystick@shiftmail.org>
Cc: James Plank <plank@cs.utk.edu>, Ric Wheeler <rwheeler@redhat.com>,
	Andrea Mazzoleni <amadvance@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	linux-raid@vger.kernel.org, linux-btrfs@vger.kernel.org,
	David Brown <david.brown@hesbynett.no>,
	David Smith <creamyfish@gmail.com>
Subject: Re: Triple parity and beyond
Date: Thu, 21 Nov 2013 18:30:49 -0600
Message-ID: <528EA5B9.3000801@hardwarefreak.com>
In-Reply-To: <528DBF85.6010303@shiftmail.org>

On 11/21/2013 2:08 AM, joystick wrote:
> On 21/11/2013 02:28, Stan Hoeppner wrote:
...
>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>> average--and that is probably being kind to the drive makers.  With 6 or
>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>> minimum 72 hours or more, probably over 100, and probably more yet for
>> 3P.  And with larger drive count arrays the rebuild times approach a
>> week.  Whose users can go a week with degraded performance?  This is
>> simply unreasonable, at best.  I say it's completely unacceptable.
>>
>> With these gargantuan drives coming soon, the probability of multiple
>> UREs during rebuild is pretty high.
> 
> No because if you are correct about the very high CPU overhead during

I made no such claim.

> rebuild (which I don't see as so dramatic, since Andrea claims 500MB/s
> for triple parity, probably parallelizable across multiple cores), the
> speed of rebuild decreases proportionally

The rebuild time of a parity array normally has little to do with CPU
overhead.  The bulk of the elapsed time is due to:

1.  The serial nature of the rebuild algorithm
2.  The random IO pattern of the reads
3.  The rotational latency of the drives

#3 is typically the largest portion of the elapsed time.
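
To put rough numbers on this (all inputs are assumptions from this
thread, not measurements: a 20TB drive, ~300MB/s average streaming rate,
and Andrea's ~500MB/s triple-parity figure), a back-of-envelope sketch
in Python:

    TB = 1e12
    drive_capacity = 20 * TB     # hypothetical 20TB drive
    streaming_rate = 300e6       # assumed average streaming rate, bytes/s
    parity_rate    = 500e6       # Andrea's quoted triple-parity rate, bytes/s

    # Best case: a pure mirror rebuild at full streaming bandwidth.
    print(drive_capacity / streaming_rate / 3600)        # ~18.5 hours

    # Single-core CPU time to run the parity math over the same data.
    print(drive_capacity / parity_rate / 3600)           # ~11.1 hours

    # If seeks and rotational latency cut the effective per-drive rate
    # to a third of streaming, elapsed time lands in the 50-100+ hour
    # range discussed above.
    print(drive_capacity / (streaming_rate / 3) / 3600)  # ~55.6 hours

Even the pessimistic single-core CPU estimate is smaller than the
best-case IO time, which is why #2 and #3 dominate.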

> and hence the stress on and heating of the drives are proportionally
> reduced, approximating those of normal operation.
> And how often have you seen a drive failure in a week during normal
> operation?

This depends greatly on one's normal operation.  In general, for most
users of parity arrays, any full array operation such as a rebuild or
reshape is far more taxing on the drives, in both power draw and heat
dissipation, than 'normal' operation.

> But in reality, consider that a non-naive implementation of
> multiple-parity would probably use only the single parity during
> reconstruction when just one disk fails, falling back to the multiple
> parities only for the stripes which are unreadable at single parity.
> So the speed and time of reconstruction, and the performance penalty,
> would be those of raid5 except in exceptional situations of multiple
> failures.

That may very well be, but it doesn't change #2 and #3 above.
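
For what it's worth, here is a minimal sketch of that strategy as a toy
Python model; the names and helpers are illustrative only, not md code:

    from functools import reduce

    def xor_blocks(blocks):
        # Columnwise XOR of equal-sized blocks.
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    def rebuild_block(surviving, p_parity, higher_parity_solver=None):
        # surviving: data blocks from the remaining disks, with None
        # where a URE made a block unreadable; p_parity: the stripe's
        # RAID5-style XOR parity block.
        if all(b is not None for b in surviving):
            # Fast path: only the failed disk is missing.  Plain XOR,
            # the same cost as a raid5 rebuild.
            return xor_blocks(surviving + [p_parity])
        # Slow path: extra unreadable sectors in this stripe, so only
        # now do we pay for the Reed-Solomon math over Q, R, ...
        return higher_parity_solver(surviving, p_parity)

    # Demo: 3 data blocks, one lost with its disk.
    d = [b"\x01\x01", b"\x02\x02", b"\x03\x03"]
    assert rebuild_block([d[0], d[1]], xor_blocks(d)) == d[2]

Even on the fast path, though, every surviving disk is still read end to
end, so the seek and rotational costs stand.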

>> What I envision is an array type, something similar to RAID 51, i.e.
>> striped parity over mirror pairs. ....
> 
> I don't like your approach of raid 51: it has the write overhead of
> raid5, with the waste of space of raid1.
> So it can be used as neither a performance array nor a capacity array.

I don't like it either.  It's a compromise.  But as RAID1/10 will soon
be unusable due to URE probability during rebuild, I think it's a
relatively good compromise for some users and some workloads.
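
To put a number on that URE claim, assuming the commonly quoted
consumer-drive spec of one unrecoverable read error per 1e14 bits (an
illustration, not a measurement):

    bits_read = 20e12 * 8        # one full read of a 20TB drive
    ure_rate  = 1e-14            # common consumer-drive spec, per bit

    print(bits_read * ure_rate)          # ~1.6 expected UREs per rebuild
    print((1 - ure_rate) ** bits_read)   # ~0.20 chance of a URE-free rebuild

At an enterprise-class 1e-15 the odds of a clean pass improve to roughly
85%, but with a plain mirror any single URE during the rebuild still
costs data.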

> In the scope of this discussion (we are talking about very large
> arrays), 

Capacity, yes; drive count, no.  Drive capacities are increasing at a
much faster rate than our need for storage space.  As we move forward,
the trend will be toward building larger-capacity arrays with fewer disks.

> the waste of space of your solution, higher than 50%, will make
> your solution cost double the price.

This is the classic mirror vs parity argument.  Using 1 more disk to add
parity to striped mirrors doesn't change it.  "Waste" is in the eye of
the beholder.  Anyone currently using RAID10 will have no problem
dedicating one more disk for uptime and protection.
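
To make the space comparison concrete, the usable-capacity fractions for
the layouts under discussion; the raid51 line assumes striped single
parity over n/2 mirror pairs, as described above:

    # Usable fraction of n total drives (even n for the mirrored cases).
    def raid10(n): return (n // 2) / n       # always 50%
    def raid6(n):  return (n - 2) / n
    def triple(n): return (n - 3) / n
    def raid51(n): return (n // 2 - 1) / n   # pairs, minus one pair's worth of parity

    for n in (8, 12, 20):
        print(n, raid10(n), raid6(n), triple(n), raid51(n))

    # n= 8: raid10 50%  raid6 75%  3P 63%  raid51 38%
    # n=12: raid10 50%  raid6 83%  3P 75%  raid51 42%
    # n=20: raid10 50%  raid6 90%  3P 85%  raid51 45%

So yes, the waste is above 50% at small drive counts, converging toward
the plain RAID10 figure as n grows.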

> A competitor for the multiple-parity scheme might be raid65 or 66, but
> this is a much dirtier approach than multiple parity if you think of
> the kind of RMW and overhead that will occur during normal operation.

Neither of those has any advantage over multi-parity.  I suggested this
approach because it retains all of the advantages of RAID10 but one.  We
sacrifice fast random write performance for protection against UREs, the
same reason behind 3P.  That's what the single parity is for, and that
alone.

I suggest that anyone in the future needing fast random write IOPS is
going to move those workloads to SSD, which is steadily increasing in
capacity.  And I suggest anyone building arrays with 10-20TB drives
isn't in need of fast random write IOPS.  Whether this approach is
valuable to anyone depends on whether the remaining attributes of
RAID10, with the added URE protection, are worth the drive count.
Obviously proponents of traditional parity arrays will not think so.
Users of RAID10 may.  Even if md never supports such a scheme, I bet
we'll see something similar to this in enterprise gear, where rebuilds
need to be 'fast' and performance degradation due to a downed drive is
not acceptable.

-- 
Stan
