From: Bernd Schubert <bernd.schubert@fastmail.fm>
To: stan@hardwarefreak.com,
	Jeff Allison <jeff.allison@allygray.2y.net>,
	linux-raid@vger.kernel.org
Subject: Re: raid resync speed
Date: Thu, 27 Mar 2014 17:08:38 +0100
Message-ID: <53344D06.4090401@fastmail.fm>
In-Reply-To: <532B372A.7090802@hardwarefreak.com>

Sorry for the late reply; I've been busy with work...

On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>> Yes.  The article gives 16384 and 32768 as examples for
>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>> addition, high values eat huge amounts of memory.  The formula is:
>
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>
> I won't discuss "should" as that makes this a subjective discussion.
> I'll discuss this objectively, discuss what md does, not what it
> "should" do or could do.
>
> I'll answer your question with a question:  Why does the total stripe
> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
> and 16 drives, to maintain the same per drive throughput?
>
> The answer to both this question and your question is the same answer.
> As the total write bandwidth of the array increases, so must the total
> stripe cache buffer space.  stripe_cache_size of 1024 is usually optimal
> for SATA drives with measured 100MB/s throughput, and 4096 is usually
> optimal for SSDs with 400MB/s measured write throughput.  The bandwidth
> numbers include parity block writes.

Did you also consider that you simply need more stripe-heads (struct 
stripe_head) to get complete stripes with more drives?
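
To illustrate what I mean by "complete stripes" (the chunk size and
drive count below are assumptions for the example, not numbers from
this thread): a stripe_head covers one PAGE_SIZE-tall slice across all
member devices, and the data pages that fill such a slice are a whole
chunk apart in the logical address space.

/* Back-of-the-envelope sketch, not kernel code.  Chunk size and drive
 * count are assumptions for the example, not values from this thread.
 */
#include <stdio.h>

int main(void)
{
	const unsigned page_size  = 4096;        /* STRIPE_SIZE == PAGE_SIZE */
	const unsigned chunk_size = 512 * 1024;  /* assumed 512 KiB chunk */
	const unsigned data_disks = 10;          /* assumed 12-disk RAID6 */

	/* stripe_heads that make up one chunk-sized stripe */
	unsigned heads_per_stripe = chunk_size / page_size;

	/* logical address span that has to be written before all of
	 * those stripe_heads are complete (full-stripe writes) */
	unsigned span = data_disks * chunk_size;

	printf("%u stripe_heads per chunk-stripe, completed only after\n"
	       "%u KiB of sequential writes have been attached to them\n",
	       heads_per_stripe, span / 1024);
	return 0;
}

So a single sequential stream already has to keep a chunk-stripe worth
of stripe_heads, holding several MiB of data pages, around until they
become complete; with more data disks that window grows, and with
multiple streams or reordered writeback it multiplies. None of that
depends on whether the drives are SSDs or rotating disks.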

>
> array(s)		bandwidth MB/s	stripe_cache_size	cache MB
>
> 12x 100MB/s Rust	1200		1024			 48
> 16x 100MB/s Rust	1600		1024			 64
> 32x 100MB/s Rust	3200		1024			128
>
> 3x  400MB/s SSD		1200		4096			 48
> 4x  400MB/s SSD		1600		4096			 64
> 8x  400MB/s SSD		3200		4096			128
>
> As is clearly demonstrated, there is a direct relationship between cache
> size and total write bandwidth.  The number of drives and drive type is
> irrelevant.  It's the aggregate write bandwidth that matters.

What is the meaning of "cache MB"? It does not seem to come from this
calculation in raid5.c:

> 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
> 		 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;

...

> 		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
> 		       mdname(mddev), memory);
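
Plugging the first row of your table into that formula, with guessed
struct sizes (the real ones depend on kernel version and config), gives
a number in the same ballpark but not the same value:

/* Rough evaluation of the raid5.c formula quoted above.
 * sizeof(struct stripe_head) and sizeof(struct bio) are guesses.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size          = 4096;
	const unsigned long sizeof_stripe_head = 512;   /* guess */
	const unsigned long sizeof_bio         = 200;   /* guess */
	const unsigned long max_nr_stripes     = 1024;  /* stripe_cache_size */
	const unsigned long max_disks          = 12;

	unsigned long memory = max_nr_stripes * (sizeof_stripe_head +
			max_disks * (sizeof_bio + page_size)) / 1024;

	printf("allocated %lukB (~%lu MB)\n", memory, memory >> 10);
	return 0;
}

With these guesses that prints roughly 50 MB rather than 48 MB, so
either the table ignores the per-stripe struct and bio overhead or the
number comes from somewhere else entirely.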


>
> Whether this "should" be this way is something for developers to debate.
>   I'm simply demonstrating how it "is" currently.

Well, I only see two different stripe-cache size values in your
numbers. The given bandwidth also seems to be a theoretical value,
based on num-drives * performance-per-drive, and redundancy drives are
missing from that calculation. The meaning of "cache MB" is also
unclear. So I'm sorry, but I don't see anything being "simply
demonstrated" here.
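
To give one concrete example of what I mean about the redundancy
drives (assuming RAID6 here, which the table does not state): a
12-drive RAID6 has only 10 data drives, so even in theory the
user-visible sequential write rate is at most 10 * 100 MB/s =
1000 MB/s, not 1200 MB/s; the remaining 200 MB/s go to parity.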


>
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
>
> This may be true at the limits, but going from 512 to 1024 to 2048 to
> 4096 with a 3 disk rust array isn't going to peak the CPU.  And
> somewhere with this setup, usually between 1024 and 2048, throughput
> will begin to tail off, even with plenty of CPU and memory B/W remaining.

Sorry, that does not match my experience, so it would be interesting to
see real measured values. Then again, I have never tested RAID6 with 3
drives, as that provides only a single data drive.

>
>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for different kinds of drives.
>
> Because higher aggregate throughputs require higher stripe_cache_size
> values, and some drive types (SSDs) have significantly higher throughput
> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
> for PCIe SSDs.

As I said, it would be interesting to see real numbers and profiling data.
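
In case it helps, this is roughly the kind of measurement I have in
mind; only a minimal sketch, assuming an idle test array /dev/md0 whose
contents may be destroyed (device name, I/O size and the list of cache
values are arbitrary), with the mdX_raidY thread profiled alongside,
e.g. with perf top:

/* Sketch of a stripe_cache_size sweep: for each value, do a timed
 * O_DIRECT sequential write to the md device and print MB/s.
 * WARNING: this overwrites /dev/md0 - test arrays only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static void set_stripe_cache(const char *md, unsigned val)
{
	char path[128], buf[16];
	snprintf(path, sizeof(path), "/sys/block/%s/md/stripe_cache_size", md);
	int fd = open(path, O_WRONLY);
	if (fd < 0) { perror(path); exit(1); }
	int len = snprintf(buf, sizeof(buf), "%u\n", val);
	if (write(fd, buf, len) != len) { perror("write sysfs"); exit(1); }
	close(fd);
}

int main(void)
{
	const char *md = "md0";             /* assumed test array */
	const size_t bs = 1 << 20;          /* 1 MiB per write() */
	const size_t total = (size_t)4 << 30;  /* 4 GiB per run */
	const unsigned sizes[] = { 256, 512, 1024, 2048, 4096, 8192 };
	char dev[64];
	void *buf;

	snprintf(dev, sizeof(dev), "/dev/%s", md);
	if (posix_memalign(&buf, 4096, bs)) { perror("posix_memalign"); return 1; }
	memset(buf, 0xab, bs);

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		set_stripe_cache(md, sizes[i]);

		int fd = open(dev, O_WRONLY | O_DIRECT);
		if (fd < 0) { perror(dev); return 1; }

		struct timespec t0, t1;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (size_t done = 0; done < total; done += bs)
			if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); return 1; }
		fsync(fd);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		close(fd);

		double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("stripe_cache_size=%-5u  %6.0f MB/s\n",
		       sizes[i], total / sec / 1e6);
	}
	return 0;
}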


Cheers,
Bernd

Thread overview: 11+ messages
2014-03-20  1:12 raid resync speed Jeff Allison
2014-03-20 14:35 ` Stan Hoeppner
2014-03-20 15:35   ` Bernd Schubert
2014-03-20 15:36     ` Bernd Schubert
2014-03-20 16:19       ` Eivind Sarto
2014-03-20 16:22         ` Bernd Schubert
2014-03-20 18:44     ` Stan Hoeppner
2014-03-27 16:08       ` Bernd Schubert [this message]
2014-03-28  8:03         ` Stan Hoeppner
2014-03-20 17:46 ` Bernd Schubert
2014-03-21  0:44   ` Jeff Allison
