linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Stan Hoeppner <stan@hardwarefreak.com>
To: NeilBrown <neilb@suse.de>, John Williams <jwilliams4200@gmail.com>
Cc: James Plank <plank@cs.utk.edu>, Ric Wheeler <rwheeler@redhat.com>,
	Andrea Mazzoleni <amadvance@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Linux RAID Mailing List <linux-raid@vger.kernel.org>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	David Brown <david.brown@hesbynett.no>,
	David Smith <creamyfish@gmail.com>
Subject: Re: Triple parity and beyond
Date: Sat, 23 Nov 2013 22:03:52 -0600	[thread overview]
Message-ID: <52917AA8.1040300@hardwarefreak.com> (raw)
In-Reply-To: <20131123181208.5103bee4@notabene.brown>

On 11/23/2013 1:12 AM, NeilBrown wrote:
> On Fri, 22 Nov 2013 21:34:41 -0800 John Williams <jwilliams4200@gmail.com>

>> Even a single 8x PCIe 3.0 card has potentially over 7GB/s of bandwidth.
>>
>> Bottom line is that IO bandwidth is not a problem for a system with
>> prudently chosen hardware.

Quite right.

>> More likely is that you would be CPU limited (rather than bus limited)
>> in a high-parity rebuild where more than one drive failed. But even
>> that is not likely to be too bad, since Andrea's single-threaded
>> recovery code can recover two drives at nearly 1GB/s on one of my
>> machines. I think the code could probably be threaded to achieve a
>> multiple of that running on multiple cores.
> 
> Indeed.  It seems likely that with modern hardware, the  linear write speed
> would be the limiting factor for spinning-rust drives.

Parity array rebuilds are read-modify-write operations.  The main
difference from normal operation RMWs is that the write is always to the
same disk.  As long as the stripe reads and chunk reconstruction outrun
the write throughput then the rebuild speed should be as fast as a
mirror rebuild.  But this doesn't appear to be what people are
experiencing.  Parity rebuilds would seem to take much longer.

I have always surmised that the culprit is rotational latency, because
we're not able to get a real sector-by-sector streaming read from each
drive.  If even only one disk in the array has to wait for the platter
to come round again, the entire stripe read is slowed down by an
additional few milliseconds.  For example, in an 8 drive array let's say
each stripe read is slowed 5ms by only one of the 7 drives due to
rotational latency, maybe acoustical management, or some other firmware
hiccup in the drive.  This slows down the entire stripe read because we
can't do parity reconstruction until all chunks are in.  An 8x 2TB array
with 512KB chunk has 4 million stripes of 4MB each.  Reading 4M stripes,
that extra 5ms per stripe read costs us

(4,000,000 * 0.005)/3600 = 5.56 hours

Now consider that arrays typically have a few years on them before the
first drive failure.  During our rebuild it's likely that some drives
will take a few rotations to return a sector that's marginal.  So  this
might slow down a stripe read by dozens of milliseconds, maybe a full
second.  If this happens to multiple drives many times throughout the
rebuild it will add even more elapsed time, possibly additional hours.

Reading stripes asynchronously or in parallel, which I assume we already
do to some degree, can mitigate these latencies to some extent.  But I
think in the overall picture, things of this nature are what is driving
parity rebuilds to dozens of hours for many people.  And as I stated
previously, when drives reach 10-20TB, this becomes far worse because
we're reading 2-10x as many stripes.  And the more drives per array the
greater the odds of incurring latency during a stripe read.

With a mirror reconstruction we can stream the reads.  Though we can't
avoid all of the drive issues above, the total number of hiccups causing
latency will be at most 1/7th those of the parity 8 drive array case.

-- 
Stan

  reply	other threads:[~2013-11-24  4:03 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-18 22:08 Triple parity and beyond Andrea Mazzoleni
2013-11-18 22:12 ` H. Peter Anvin
2013-11-18 22:35   ` Andrea Mazzoleni
2013-11-18 23:25     ` H. Peter Anvin
2013-11-19 10:16       ` David Brown
2013-11-19 17:36         ` Andrea Mazzoleni
2013-11-19 22:51           ` Drew
2013-11-20  0:54             ` Chris Murphy
2013-11-20  1:23               ` John Williams
2013-11-20 10:35                 ` David Brown
2013-11-20 10:31           ` David Brown
2013-11-20 18:09             ` John Williams
2013-11-20 18:44               ` Andrea Mazzoleni
2013-11-21  6:15                 ` Stan Hoeppner
2013-11-21  8:32               ` David Brown
2013-11-20 18:34             ` Andrea Mazzoleni
2013-11-20 18:43               ` H. Peter Anvin
2013-11-20 18:56                 ` Andrea Mazzoleni
2013-11-20 18:59                   ` H. Peter Anvin
2013-11-20 21:21                     ` Andrea Mazzoleni
2013-11-20 19:00                   ` H. Peter Anvin
2013-11-20 21:04                     ` Andrea Mazzoleni
2013-11-20 21:06                       ` H. Peter Anvin
2013-11-21  8:36               ` David Brown
2013-11-19 17:28       ` Andrea Mazzoleni
2013-11-19 20:29         ` Ric Wheeler
2013-11-20 16:16           ` James Plank
2013-11-20 19:05             ` Andrea Mazzoleni
2013-11-20 19:10               ` H. Peter Anvin
2013-11-20 20:30                 ` James Plank
2013-11-20 21:23                   ` Andrea Mazzoleni
2013-11-27  2:50                     ` ronnie sahlberg
2013-11-20 21:28                   ` H. Peter Anvin
2013-11-21  1:28             ` Stan Hoeppner
2013-11-21  2:46               ` John Williams
2013-11-21  6:52                 ` Stan Hoeppner
2013-11-21  7:05                   ` John Williams
2013-11-21 22:57                     ` Stan Hoeppner
2013-11-21 23:38                       ` John Williams
2013-11-22  9:35                         ` Stan Hoeppner
2013-11-22 15:01                           ` John Williams
2013-11-22 22:28                             ` Stan Hoeppner
2013-11-22 23:07                       ` NeilBrown
2013-11-23  3:46                         ` Stan Hoeppner
2013-11-23  5:04                           ` NeilBrown
2013-11-23  5:34                             ` John Williams
2013-11-23  7:12                               ` NeilBrown
2013-11-24  4:03                                 ` Stan Hoeppner [this message]
2013-11-24  5:14                                   ` John Williams
2013-11-24 21:13                                     ` Stan Hoeppner
2013-11-24 23:28                                       ` Rudy Zijlstra
     [not found]                                       ` <l6u3h9$l72$2@ger.gmane.org>
2013-11-25  2:04                                         ` Stan Hoeppner
2013-11-25  9:15                                       ` David Brown
2013-11-24  5:19                                   ` Russell Coker
2013-11-24 21:44                                     ` Stan Hoeppner
2013-11-24 22:31                                       ` Mark Knecht
2013-11-25  2:14                                       ` Russell Coker
2013-11-25  9:20                                         ` David Brown
2013-11-21  8:08               ` joystick
2013-11-22  0:30                 ` Stan Hoeppner
2013-11-22  0:33                   ` H. Peter Anvin
2013-11-22  0:45                   ` David Brown
2013-11-21  9:07               ` David Brown
2013-11-21  9:54                 ` Adam Goryachev
2013-11-21 10:32                   ` David Brown
2013-11-22  8:12                   ` Russell Coker
2013-11-25 18:23                     ` Pasi Kärkkäinen
2013-11-22  8:13                 ` Stan Hoeppner
2013-11-22 13:15                   ` David Brown
2013-11-22 16:07                   ` Stan Hoeppner
2013-11-22 22:59                     ` NeilBrown
2013-11-23 17:39                       ` David Brown
2013-11-22 16:50                   ` Mark Knecht
2013-11-22 19:51                     ` Duncan
2013-11-22  8:38                 ` Stan Hoeppner
2013-11-22 13:24                   ` David Brown
2013-11-28  7:16                     ` Stan Hoeppner
2013-11-28  7:36                       ` Russell Coker
2013-11-28  9:56                       ` David Brown
2013-11-21 19:56               ` Piergiorgio Sartor
2013-11-19 18:12 ` Piergiorgio Sartor
2013-11-20 10:44   ` David Brown
2013-11-20 21:59     ` Piergiorgio Sartor
2013-11-21 10:13       ` David Brown
2013-11-21 17:37         ` Goffredo Baroncelli
2013-11-21 20:05         ` Piergiorgio Sartor
2013-11-21 20:31           ` David Brown
2013-11-21 20:52             ` Piergiorgio Sartor
2013-11-22  0:32               ` David Brown
2013-11-22 20:32                 ` Piergiorgio Sartor
2013-11-26 18:10             ` joystick
2013-11-20 21:38   ` Andrea Mazzoleni
2013-11-20 22:29 ` Piergiorgio Sartor
2013-11-23  7:55   ` Andrea Mazzoleni
2013-11-23 22:10     ` Piergiorgio Sartor
2013-11-24  9:39       ` Andrea Mazzoleni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52917AA8.1040300@hardwarefreak.com \
    --to=stan@hardwarefreak.com \
    --cc=amadvance@gmail.com \
    --cc=creamyfish@gmail.com \
    --cc=david.brown@hesbynett.no \
    --cc=hpa@zytor.com \
    --cc=jwilliams4200@gmail.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=plank@cs.utk.edu \
    --cc=rwheeler@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).