Re: raid resync speed - Bernd Schubert

From: Bernd Schubert <bernd.schubert@fastmail.fm>
To: stan@hardwarefreak.com,
	Jeff Allison <jeff.allison@allygray.2y.net>,
	linux-raid@vger.kernel.org
Subject: Re: raid resync speed
Date: Thu, 27 Mar 2014 17:08:38 +0100
Message-ID: <53344D06.4090401@fastmail.fm>
In-Reply-To: <532B372A.7090802@hardwarefreak.com>

Sorry for the late reply, I've been busy with work.

On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>> Yes.  The article gives 16384 and 32768 as examples for
>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>> addition, high values eat huge amounts of memory.  The formula is:
>
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>
> I won't discuss "should" as that makes this a subjective discussion.
> I'll discuss this objectively, discuss what md does, not what it
> "should" do or could do.
>
> I'll answer your question with a question:  Why does the total stripe
> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
> and 16 drives, to maintain the same per drive throughput?
>
> The answer to both this question and your question is the same answer.
> As the total write bandwidth of the array increases, so must the total
> stripe cache buffer space.  stripe_cache_size of 1024 is usually optimal
> for SATA drives with measured 100MB/s throughput, and 4096 is usually
> optimal for SSDs with 400MB/s measured write throughput.  The bandwidth
> numbers include parity block writes.

Did you also consider that with more drives you simply need more
stripe heads (struct stripe_head) to assemble complete stripes?

>
> array(s)           bandwidth MB/s   stripe_cache_size   cache MB
>
> 12x 100MB/s Rust   1200             1024                 48
> 16x 100MB/s Rust   1600             1024                 64
> 32x 100MB/s Rust   3200             1024                128
>
> 3x  400MB/s SSD    1200             4096                 48
> 4x  400MB/s SSD    1600             4096                 64
> 8x  400MB/s SSD    3200             4096                128
>
> As is clearly demonstrated, there is a direct relationship between cache
> size and total write bandwidth.  The number of drives and drive type is
> irrelevant.  It's the aggregate write bandwidth that matters.

What is the meaning of "cache MB"? It does not seem to come from this 
calculation:

> 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
> 		 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;

...

> 		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
> 		       mdname(mddev), memory);
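
For anyone following along, the quoted estimate can be evaluated outside
the kernel. The following is a minimal userspace sketch, not the kernel
code itself: the function name raid5_cache_kb and its parameters are made
up for illustration, and the two struct sizes have to be supplied by the
caller because sizeof(struct stripe_head) and sizeof(struct bio) depend on
the kernel version and configuration.

```c
#include <stdint.h>

/*
 * Userspace restatement of the quoted estimate, result in kB.
 * stripe_head_size and bio_size are caller-supplied assumptions;
 * the real values come from the running kernel's struct layouts.
 */
uint64_t raid5_cache_kb(uint64_t max_nr_stripes,
                        uint64_t stripe_head_size,
                        uint64_t bio_size,
                        uint64_t page_size,
                        uint64_t max_disks)
{
    /* Same shape as the kernel expression: per-stripe overhead plus
     * one bio and one page per disk, summed over all cached stripes. */
    return max_nr_stripes *
           (stripe_head_size + max_disks * (bio_size + page_size)) / 1024;
}
```

With both struct sizes set to 0 and 4 KiB pages,
raid5_cache_kb(1024, 0, 0, 4096, 12) gives 49152 kB, i.e. 48 MB of page
buffers alone. The figure the kernel prints would be somewhat larger
because of the per-stripe and per-bio overhead.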


>
> Whether this "should" be this way is something for developers to debate.
>   I'm simply demonstrating how it "is" currently.

Well, I only see two different stripe-cache size values in your
numbers. The given bandwidth also seems to be a theoretical value, based
on num-drives * performance-per-drive, and redundancy drives are
missing from that calculation. The meaning of "cache MB" is unclear
as well. So I'm sorry, but I don't see any "simply demonstrating".
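
The columns of the quoted table can be reproduced with plain arithmetic,
which is consistent with them being derived rather than measured. A rough
sketch, assuming 4 KiB pages and assuming (this is an inference from the
numbers, not something stated in the thread) that "cache MB" is
stripe_cache_size * page size * number of drives. All function names are
hypothetical.

```c
#include <stdint.h>

/* Theoretical aggregate bandwidth: every drive counted, parity included. */
uint64_t aggregate_mbs(uint64_t drives, uint64_t per_drive_mbs)
{
    return drives * per_drive_mbs;
}

/* Bandwidth left for user data once redundancy drives are subtracted. */
uint64_t data_mbs(uint64_t drives, uint64_t parity_drives,
                  uint64_t per_drive_mbs)
{
    if (parity_drives >= drives)
        return 0; /* no data drives left */
    return (drives - parity_drives) * per_drive_mbs;
}

/* Page buffers only: one page per drive per cached stripe, in MiB. */
uint64_t cache_mib(uint64_t stripe_cache_size, uint64_t page_size,
                   uint64_t drives)
{
    return stripe_cache_size * page_size * drives / (1024 * 1024);
}
```

For the first table row, aggregate_mbs(12, 100) is 1200 and
cache_mib(1024, 4096, 12) is 48, matching the quoted values. For a
12-drive raid6, data_mbs(12, 2, 100) is 1000, which is the number the
table does not show.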


>
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
>
> This may be true at the limits, but going from 512 to 1024 to 2048 to
> 4096 with a 3 disk rust array isn't going to peak the CPU.  And
> somewhere with this setup, usually between 1024 and 2048, throughput
> will begin to tail off, even with plenty of CPU and memory B/W remaining.

Sorry, not in my experience. So it would be interesting to see real
measured values. But then, I definitely never tested raid6 with 3 drives,
as this only provides a single data drive.

>
>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for different kinds of drives.
>
> Because higher aggregate throughputs require higher stripe_cache_size
> values, and some drive types (SSDs) have significantly higher throughput
> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
> for PCIe SSDs.

As I said, it would be interesting to see real numbers and profiling data.


Cheers,
Bernd
