linux-raid.vger.kernel.org archive mirror
* raid resync speed
@ 2014-03-20  1:12 Jeff Allison
  2014-03-20 14:35 ` Stan Hoeppner
  2014-03-20 17:46 ` Bernd Schubert
  0 siblings, 2 replies; 11+ messages in thread
From: Jeff Allison @ 2014-03-20  1:12 UTC (permalink / raw)
  To: linux-raid

The gist of my question is what kind of resync speed should I expect?

I have a HP N54L Microserver running centos 6.5.

In this box I have a 3x2TB disk raid 5 array, which I am in the
process of extending to a 4x2TB raid 5 array.

I've added the new disk --> mdadm --add /dev/md0 /dev/sdb

And grown the array --> mdadm --grow /dev/md0 --raid-devices=4

Now the problem: the resync speed is very slow. It refuses to rise
above 5MB, and in general it sits at 4M.
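
The reshape progress and the current speed can be checked while it
runs with, e.g.:

cat /proc/mdstat
mdadm --detail /dev/md0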

From looking at glances it would appear that writing to the new disk
is the bottleneck; /dev/sdb is the new disk.

Disk I/O    In/s    Out/s
md0            0        0
sda1           0        0
sda2           0       1K
sdb1       3.92M        0
sdc1       24.2M    54.7M
sdd1       11.2M    54.7M
sde1       16.3M    54.7M

I partitioned the disk with --> parted -a optimal /dev/sdb

[root@nas ~]# parted -a optimal /dev/sdb
GNU Parted 2.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ATA ST2000DM001-1E61 (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary  ntfs         raid

There is no ntfs filesystem on the disk; I've still not worked out how
to remove that flag.
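
Presumably something like the following would at least show which
stale signature parted is picking up (with no options wipefs only
lists signatures, it does not erase anything):

wipefs /dev/sdb1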

I've followed the article here -->
http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
to attempt to speed it up, but no joy.
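
For reference, the knobs I tried are roughly what the article suggests
(example values only, not a recommendation):

# raise the kernel's resync/reshape bandwidth limits
echo 50000  > /proc/sys/dev/raid/speed_limit_min
echo 200000 > /proc/sys/dev/raid/speed_limit_max

# enlarge the stripe cache for md0
echo 32768 > /sys/block/md0/md/stripe_cache_size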

Any ideas what I've done wrong?

parted output

[root@nas ~]# parted -l
Model: ATA ST31000528AS (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  525MB   524MB   primary  ext4         boot
 2      525MB   1000GB  1000GB  primary  lvm

Model: ATA ST2000DM001-1E61 (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary  ntfs         raid

Model: ATA ST2000DM001-9YN1 (scsi)
Disk /dev/sdc: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary               raid

Model: ATA WDC WD25EZRS-00J (scsi)
Disk /dev/sdd: 2500GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary  ntfs         raid

Model: ATA ST2000DL001-9VT1 (scsi)
Disk /dev/sde: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  2000GB  2000GB  primary               raid


* Re: raid resync speed
  2014-03-20  1:12 raid resync speed Jeff Allison
@ 2014-03-20 14:35 ` Stan Hoeppner
  2014-03-20 15:35   ` Bernd Schubert
  2014-03-20 17:46 ` Bernd Schubert
  1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-03-20 14:35 UTC (permalink / raw)
  To: Jeff Allison, linux-raid

On 3/19/2014 8:12 PM, Jeff Allison wrote:
...
> In this box I have a 3x2TB disk raid 5 array, which I am in the
> process of extending to a 4x2TB raid 5 array.
> 
> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
> 
> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
> 
> Now the problem the resync speed is v slow, it refuses to rise above
> 5MB, in general it sits at 4M.
...
> Model: ATA ST2000DM001-1E61 (scsi)
> Disk /dev/sdb: 2000GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: msdos

Seagate Advanced Format 512e disk drive

> Number Start End Size Type File system Flags
> 1 1049kB 2000GB 2000GB primary ntfs raid

Offset is 1049kB, i.e. 1MiB or 256 physical sectors, so it's aligned
and won't incur RMW internally.  So no performance hit here.

...
> I've followed the article here -->
> http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
> to attempt to speed it up but no joy.
> 
> Any Ideas what I've done wrong?

Yes.  The article gives 16384 and 32768 as examples for
stripe_cache_size.  Such high values tend to reduce throughput instead
of increasing it.  Never use a value above 2048 with rust, and 1024 is
usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
addition, high values eat huge amounts of memory.  The formula is:

stripe_cache_size * 4096 bytes * drive_count = RAM usage

(32768*4096) * 4 = 512MB of RAM consumed by the stripe cache
(16384*4096) * 4 = 256MB of RAM consumed by the stripe cache

 (2048*4096) * 4 =  32MB of RAM consumed by the stripe cache
 (1024*4096) * 4 =  16MB of RAM consumed by the stripe cache
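
To check or set it on a 4-drive array like this one, something along
these lines is a sane starting point:

cat /sys/block/md0/md/stripe_cache_size          # current value
echo 1024 > /sys/block/md0/md/stripe_cache_size  # ~16MB with 4 drives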


Cheers,

-- 
Stan


* Re: raid resync speed
  2014-03-20 14:35 ` Stan Hoeppner
@ 2014-03-20 15:35   ` Bernd Schubert
  2014-03-20 15:36     ` Bernd Schubert
  2014-03-20 18:44     ` Stan Hoeppner
  0 siblings, 2 replies; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 15:35 UTC (permalink / raw)
  To: stan, Jeff Allison, linux-raid

>
> Yes.  The article gives 16384 and 32768 as examples for
> stripe_cache_size.  Such high values tend to reduce throughput instead
> of increasing it.  Never use a value above 2048 with rust, and 1024 is
> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
> addition, high values eat huge amounts of memory.  The formula is:
>

Why should the stripe-cache size differ between SSDs and rotating disks? 
Did you ever try to figure out yourself why it got slower with higher 
values? I profiled that in the past and it was a CPU/memory limitation - 
the md thread went to 100%, searching for stripe-heads.

So I really wonder how you got the impression that the stripe cache size 
should have different values for different kinds of drives.


Cheers,
Bernd



* Re: raid resync speed
  2014-03-20 15:35   ` Bernd Schubert
@ 2014-03-20 15:36     ` Bernd Schubert
  2014-03-20 16:19       ` Eivind Sarto
  2014-03-20 18:44     ` Stan Hoeppner
  1 sibling, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 15:36 UTC (permalink / raw)
  To: stan, Jeff Allison, linux-raid

On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>>
>> Yes.  The article gives 16384 and 32768 as examples for
>> stripe_cache_size.  Such high values tend to reduce throughput instead
>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>> addition, high values eat huge amounts of memory.  The formula is:
>>
>
> Why should the stripe-cache size differ between SSDs and rotating disks?
> Did you ever try to figure out yourself why it got slower with higher
> values? I profiled that in the past and it was a CPU/memory limitation -
> the md thread went to 100%, searching for stripe-heads.

Sorry, I forgot to write 'cpu usage', so it went to 100% cpu usage.

>
> So I really wonder how you got the impression that the stripe cache size
> should have different values for different kinds of drives.
>
>
> Cheers,
> Bernd
>



* Re: raid resync speed
  2014-03-20 15:36     ` Bernd Schubert
@ 2014-03-20 16:19       ` Eivind Sarto
  2014-03-20 16:22         ` Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Eivind Sarto @ 2014-03-20 16:19 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: stan, Jeff Allison, linux-raid


On Mar 20, 2014, at 8:36 AM, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:

> On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>>> 
>>> Yes.  The article gives 16384 and 32768 as examples for
>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>> addition, high values eat huge amounts of memory.  The formula is:
>>> 
>> 
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
> 
> Sorry, I forgot to write 'cpu usage', so it went to 100% cpu usage.
> 
>> 
>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for different kinds of drives.
>> 
>> 
>> Cheers,
>> Bernd
>> 
> 

The hash chains for the stripe cache become long if you increase the stripe cache.  There are only 256
hash buckets.  With 32K stripe cache entries, the average length of a hash chain will be 128 and that will
increase contention for the lock protecting the chain.

-eivind



* Re: raid resync speed
  2014-03-20 16:19       ` Eivind Sarto
@ 2014-03-20 16:22         ` Bernd Schubert
  0 siblings, 0 replies; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 16:22 UTC (permalink / raw)
  To: Eivind Sarto; +Cc: stan, Jeff Allison, linux-raid

On 03/20/2014 05:19 PM, Eivind Sarto wrote:
>
> On Mar 20, 2014, at 8:36 AM, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
>> On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>>>>
>>>> Yes.  The article gives 16384 and 32768 as examples for
>>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>>> addition, high values eat huge amounts of memory.  The formula is:
>>>>
>>>
>>> Why should the stripe-cache size differ between SSDs and rotating disks?
>>> Did you ever try to figure out yourself why it got slower with higher
>>> values? I profiled that in the past and it was a CPU/memory limitation -
>>> the md thread went to 100%, searching for stripe-heads.
>>
>> Sorry, I forgot to write 'cpu usage', so it went to 100% cpu usage.
>>
>>>
>>> So I really wonder how you got the impression that the stripe cache size
>>> should have different values for different kinds of drives.
>>>

> The hash chains for the stripe cache become long if you increase the stripe cache.  There are only 256
> hash buckets.  With 32K stripe cache entries, the average length of a hash chain will be 128 and that will
> increase contention for the lock protecting the chain.
>

Yes, this is an implementation detail. But does that make a difference
between SSDs and rotating disks...? (Which was my point here.)



* Re: raid resync speed
  2014-03-20  1:12 raid resync speed Jeff Allison
  2014-03-20 14:35 ` Stan Hoeppner
@ 2014-03-20 17:46 ` Bernd Schubert
  2014-03-21  0:44   ` Jeff Allison
  1 sibling, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 17:46 UTC (permalink / raw)
  To: Jeff Allison, linux-raid

On 03/20/2014 02:12 AM, Jeff Allison wrote:
> The gist of my question is what kind of resync speed should I expect?
>
> I have a HP N54L Microserver running centos 6.5.
>
> In this box I have a 3x2TB disk raid 5 array, which I am in the
> process of extending to a 4x2TB raid 5 array.
>
> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>
> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>
> Now the problem the resync speed is v slow, it refuses to rise above
> 5MB, in general it sits at 4M.

Per second?

>
> from looking at glances it would appear that writing to the new disk
> is the bottle neck, /dev/sdb is the new disk.
>
> Disk I/O In/s Out/s
> md0 0 0
> sda1 0 0
> sda2 0 1K
> sdb1 3.92M 0
> sdc1 24.2M 54.7M
> sdd1 11.2M 54.7M
> sde1 16.3M 54.7M

Could you please send the output of 'iostat -xm 1'? Also, do you see anything in 
'top' that takes 100% CPU?

Thanks,
Bernd




* Re: raid resync speed
  2014-03-20 15:35   ` Bernd Schubert
  2014-03-20 15:36     ` Bernd Schubert
@ 2014-03-20 18:44     ` Stan Hoeppner
  2014-03-27 16:08       ` Bernd Schubert
  1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-03-20 18:44 UTC (permalink / raw)
  To: Bernd Schubert, Jeff Allison, linux-raid

On 3/20/2014 10:35 AM, Bernd Schubert wrote:
> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>> Yes.  The article gives 16384 and 32768 as examples for
>> stripe_cache_size.  Such high values tend to reduce throughput instead
>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>> addition, high values eat huge amounts of memory.  The formula is:

> Why should the stripe-cache size differ between SSDs and rotating disks?

I won't discuss "should" as that makes this a subjective discussion.
I'll discuss this objectively, discuss what md does, not what it
"should" do or could do.

I'll answer your question with a question:  Why does the total stripe
cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
and 16 drives, to maintain the same per drive throughput?

The answer to both this question and your question is the same answer.
As the total write bandwidth of the array increases, so must the total
stripe cache buffer space.  stripe_cache_size of 1024 is usually optimal
for SATA drives with measured 100MB/s throughput, and 4096 is usually
optimal for SSDs with 400MB/s measured write throughput.  The bandwidth
numbers include parity block writes.

array(s)		bandwidth MB/s	stripe_cache_size	cache MB

12x 100MB/s Rust	1200		1024			 48
16x 100MB/s Rust	1600		1024			 64
32x 100MB/s Rust	3200		1024			128

3x  400MB/s SSD		1200		4096			 48
4x  400MB/s SSD		1600		4096			 64
8x  400MB/s SSD		3200		4096			128

As is clearly demonstrated, there is a direct relationship between cache
size and total write bandwidth.  The number of drives and drive type is
irrelevant.  It's the aggregate write bandwidth that matters.

Whether this "should" be this way is something for developers to debate.
 I'm simply demonstrating how it "is" currently.

> Did you ever try to figure out yourself why it got slower with higher
> values? I profiled that in the past and it was a CPU/memory limitation -
> the md thread went to 100%, searching for stripe-heads.

This may be true at the limits, but going from 512 to 1024 to 2048 to
4096 with a 3 disk rust array isn't going to peak the CPU.  And
somewhere with this setup, usually between 1024 and 2048, throughput
will begin to tail off, even with plenty of CPU and memory B/W remaining.

> So I really wonder how you got the impression that the stripe cache size
> should have different values for different kinds of drives.

Because higher aggregate throughputs require higher stripe_cache_size
values, and some drive types (SSDs) have significantly higher throughput
than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
for PCIe SSDs.

Cheers,

Stan


* Re: raid resync speed
  2014-03-20 17:46 ` Bernd Schubert
@ 2014-03-21  0:44   ` Jeff Allison
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff Allison @ 2014-03-21  0:44 UTC (permalink / raw)
  To: Bernd Schubert, linux-raid

I don't think it's the RAID code. I've dropped the disk out of the
array and I still cannot get any more than 4MB/sec out of it...

[jeff@nas ~]$ dd if=/dev/zero of=/mnt/sdj/bonnie/test.tmp bs=4k count=2000000 && sync && \
  dd if=/dev/zero of=/mnt/sdd/bonnie/test.tmp bs=4k count=2000000 && sync

2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 231.778 s, 35.3 MB/s <-- WD Green
RMA I got back yesterday.

2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 1818.18 s, 4.5 MB/s <-- Dud one.

Perhaps it's time to RMA the RMA.

On 21 March 2014 04:46, Bernd Schubert
<bernd.schubert@itwm.fraunhofer.de> wrote:
> On 03/20/2014 02:12 AM, Jeff Allison wrote:
>>
>> The gist of my question is what kind of resync speed should I expect?
>>
>> I have a HP N54L Microserver running centos 6.5.
>>
>> In this box I have a 3x2TB disk raid 5 array, which I am in the
>> process of extending to a 4x2TB raid 5 array.
>>
>> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>>
>> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>>
>> Now the problem the resync speed is v slow, it refuses to rise above
>> 5MB, in general it sits at 4M.
>
>
> Per second?
>
>
>>
>> from looking at glances it would appear that writing to the new disk
>> is the bottle neck, /dev/sdb is the new disk.
>>
>> Disk I/O In/s Out/s
>> md0 0 0
>> sda1 0 0
>> sda2 0 1K
>> sdb1 3.92M 0
>> sdc1 24.2M 54.7M
>> sdd1 11.2M 54.7M
>> sde1 16.3M 54.7M
>
>
> Could you please send the output of 'iostat -xm 1'? Also, do you see anything in
> 'top' that takes 100% CPU?
>
> Thanks,
> Bernd
>
>


* Re: raid resync speed
  2014-03-20 18:44     ` Stan Hoeppner
@ 2014-03-27 16:08       ` Bernd Schubert
  2014-03-28  8:03         ` Stan Hoeppner
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2014-03-27 16:08 UTC (permalink / raw)
  To: stan, Jeff Allison, linux-raid

Sorry for the late reply, I'm busy with work...

On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>> Yes.  The article gives 16384 and 32768 as examples for
>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>> addition, high values eat huge amounts of memory.  The formula is:
>
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>
> I won't discuss "should" as that makes this a subjective discussion.
> I'll discuss this objectively, discuss what md does, not what it
> "should" do or could do.
>
> I'll answer your question with a question:  Why does the total stripe
> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
> and 16 drives, to maintain the same per drive throughput?
>
> The answer to both this question and your question is the same answer.
> As the total write bandwidth of the array increases, so must the total
> stripe cache buffer space.  stripe_cache_size of 1024 is usually optimal
> for SATA drives with measured 100MB/s throughput, and 4096 is usually
> optimal for SSDs with 400MB/s measured write throughput.  The bandwidth
> numbers include parity block writes.

Did you also consider that you simply need more stripe-heads (struct 
stripe_head) to get complete stripes with more drives?

>
> array(s)		bandwidth MB/s	stripe_cache_size	cache MB
>
> 12x 100MB/s Rust	1200		1024			 48
> 16x 100MB/s Rust	1600		1024			 64
> 32x 100MB/s Rust	3200		1024			128
>
> 3x  400MB/s SSD		1200		4096			 48
> 4x  400MB/s SSD		1600		4096			 64
> 8x  400MB/s SSD		3200		4096			128
>
> As is clearly demonstrated, there is a direct relationship between cache
> size and total write bandwidth.  The number of drives and drive type is
> irrelevant.  It's the aggregate write bandwidth that matters.

What is the meaning of "cache MB"? It does not seem to come from this 
calculation:

> 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
> 		 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;

...

> 		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
> 		       mdname(mddev), memory);


>
> Whether this "should" be this way is something for developers to debate.
>   I'm simply demonstrating how it "is" currently.

Well, somehow I only see two different stripe-cache size values in your 
numbers. Then the given bandwidth seems to be a theoretical value, based 
on num-drives * performance-per-drive. Redundancy drives are also 
missing in that calculation.  And then the value of "cache MB" is also 
unclear. So I'm sorry, but I don't see any "simply demonstrating".


>
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
>
> This may be true at the limits, but going from 512 to 1024 to 2048 to
> 4096 with a 3 disk rust array isn't going to peak the CPU.  And
> somewhere with this setup, usually between 1024 and 2048, throughput
> will begin to tail off, even with plenty of CPU and memory B/W remaining.

Sorry, not in my experience. So it would be interesting to see real 
measured values. But then I definitely never tested raid6 with 3 drives, 
as this only provides a single data drive.

>
>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for different kinds of drives.
>
> Because higher aggregate throughputs require higher stripe_cache_size
> values, and some drive types (SSDs) have significantly higher throughput
> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
> for PCIe SSDs.

As I said, it would be interesting to see real numbers and profiling data.


Cheers,
Bernd


* Re: raid resync speed
  2014-03-27 16:08       ` Bernd Schubert
@ 2014-03-28  8:03         ` Stan Hoeppner
  0 siblings, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2014-03-28  8:03 UTC (permalink / raw)
  To: Bernd Schubert, Jeff Allison, linux-raid

On 3/27/2014 11:08 AM, Bernd Schubert wrote:
> Sorry for the late reply, I'm busy with work...
> 
> On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
>> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>>> Yes.  The article gives 16384 and 32768 as examples for
>>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>>> addition, high values eat huge amounts of memory.  The formula is:
>>
>>> Why should the stripe-cache size differ between SSDs and rotating disks?
>>
>> I won't discuss "should" as that makes this a subjective discussion.
>> I'll discuss this objectively, discuss what md does, not what it
>> "should" do or could do.
>>
>> I'll answer your question with a question:  Why does the total stripe
>> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
>> and 16 drives, to maintain the same per drive throughput?
>>
>> The answer to both this question and your question is the same answer.
>> As the total write bandwidth of the array increases, so must the total
>> stripe cache buffer space.  stripe_cache_size of 1024 is usually optimal
>> for SATA drives with measured 100MB/s throughput, and 4096 is usually
>> optimal for SSDs with 400MB/s measured write throughput.  The bandwidth
>> numbers include parity block writes.
> 
> Did you also consider that you simply need more stripe-heads (struct stripe_head) to get complete stripes with more drives?

It has nothing to do with what we're discussing.  You get complete stripes with the default value, which is IIRC 256, though md.txt still says 128 as of 3.13.6 (and that it only applies to RAID5).  Maybe md.txt should be updated.

  stripe_cache_size  (currently raid5 only)
      number of entries in the stripe cache.  This is writable, but
      there are upper and lower limits (32768, 16).  Default is 128.

>> array(s)        	bandwidth MB/s    stripe_cache_size    cache MB
>>
>> 12x 100MB/s Rust     1200        	  1024                  48
>> 16x 100MB/s Rust     1600        	  1024                  64
>> 32x 100MB/s Rust     3200        	  1024                 128
>>
>> 3x  400MB/s SSD      1200        	  4096                  48
>> 4x  400MB/s SSD      1600        	  4096                  64
>> 8x  400MB/s SSD      3200        	  4096                 128
>>
>> As is clearly demonstrated, there is a direct relationship between cache
>> size and total write bandwidth.  The number of drives and drive type is
>> irrelevant.  It's the aggregate write bandwidth that matters.
> 
> What is the meaning of "cache MB"? It does not seem to come from this calculation:
> 
>>     memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
>>          max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
...
> 
>>         printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
>>                mdname(mddev), memory);

No, it is not derived from the source code, but from the formula I stated previously in this thread:

stripe_cache_size * 4096 bytes * drive_count = RAM usage
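
e.g. for the table rows:

 (1024*4096) * 12 =  48MB
 (4096*4096) *  8 = 128MB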

>> Whether this "should" be this way is something for developers to debate.
>>   I'm simply demonstrating how it "is" currently.
> 
> Well, somehow I only see two different stripe-cache size values in your numbers. 

Only two are required to demonstrate the md RAID5/6 behavior in question.

> Then the given bandwidth seems to be a theoretical value, based on num-drives * performance-per-drive. 

The values in the table are not theoretical, but are derived from test data, and are very close to what one will see with such a real world configuration.

> Redundancy drives are also missing in that calculation.  

No, this is included.  Read the sentence directly preceding the table.

> And then the value of "cache MB" is also unclear. 

It is unambiguous.

> So I'm sorry, but I don't see any "simply demonstrating".

...

>>> Did you ever try to figure out yourself why it got slower with higher
>>> values? I profiled that in the past and it was a CPU/memory limitation -
>>> the md thread went to 100%, searching for stripe-heads.
>>
>> This may be true at the limits, but going from 512 to 1024 to 2048 to
>> 4096 with a 3 disk rust array isn't going to peak the CPU.  And
>> somewhere with this setup, usually between 1024 and 2048, throughput
>> will begin to tail off, even with plenty of CPU and memory B/W remaining.
> 
> Sorry, not in my experience. 

This is the behavior everyone sees, because this is how md behaves.  If your experience is different then you should demonstrate it.  

> So it would be interesting to see real measured values.  But then I definitely never tested raid6 with 3 drives, as this only provides a single data drive.

The point above is that an md write thread won't saturate the processor, regardless of the size of the stripe cache, with a small count rust array.  I simply chose a very low number to make the point clear.  I didn't state a RAID level here.  Whether it's RAID5 or 6 is irrelevant to the point.

>>> So I really wonder how you got the impression that the stripe cache size
>>> should have different values for different kinds of drives.
>>
>> Because higher aggregate throughputs require higher stripe_cache_size
>> values, and some drive types (SSDs) have significantly higher throughput
>> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
>> for PCIe SSDs.
> 
> As I said, it would be interesting to see real numbers and profiling data.

Here are numbers for an md RAID5 SSD array, 64KB chunk.

5 x Intel 520s MLC 480G SATA3
Intel Xeon E3-1230V2 quad core, 1MB L2, 8MB L3, 3.3GHz/3.7GHz turbo
2x DDR3 = 21 GB/s memory bandwidth
Debian 6 kernel 3.2

Parallel FIO throughput
16 threads, 256KB block size, O_DIRECT, libaio, queue depth 16, 8 GB/thread, 128 GB total written:

stripe_cache_size = 256
    READ: io=131072MB, aggrb=2496MB/s, minb=2556MB/s, maxb=2556MB/s, mint=52508msec, maxt=52508msec
   WRITE: io=131072MB, aggrb=928148KB/s, minb=950424KB/s, maxb=950424KB/s, mint=144608msec, maxt=144608msec

stripe_cache_size = 512
    READ: io=131072MB, aggrb=2497MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52484msec, maxt=52484msec
   WRITE: io=131072MB, aggrb=978170KB/s, minb=978MB/s, maxb=978MB/s, mint=137213msec, maxt=137213msec

stripe_cache_size = 2048
    READ: io=131072MB, aggrb=2502MB/s, minb=2562MB/s, maxb=2562MB/s, mint=52382msec, maxt=52382msec
   WRITE: io=131072MB, aggrb=996MB/s, minb=1020MB/s, maxb=1020MB/s, mint=131631msec, maxt=131631msec

stripe_cache_size = 4096
    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
   WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec

stripe_cache_size = 8192
    READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
   WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec

stripe_cache_size = 16384
    READ: io=131072MB, aggrb=2482MB/s, minb=2542MB/s, maxb=2542MB/s, mint=52807msec, maxt=52807msec
   WRITE: io=131072MB, aggrb=1377MB/s, minb=1410MB/s, maxb=1410MB/s, mint=95191msec, maxt=95191msec

stripe_cache_size = 32768
    READ: io=131072MB, aggrb=2498MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52481msec, maxt=52481msec
   WRITE: io=131072MB, aggrb=1139MB/s, minb=1166MB/s, maxb=1166MB/s, mint=115102msec, maxt=115102msec
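
For reference, the write pass was along these lines (reconstructed
from the parameters above, not the exact job file; the target
directory is just a placeholder):

fio --name=seqwrite --directory=/mnt/md0 --rw=write --bs=256k \
    --direct=1 --ioengine=libaio --iodepth=16 --numjobs=16 \
    --size=8g --group_reporting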


The effect I described is clearly demonstrated here: increasing stripe_cache_size beyond the optimal value causes write throughput to decrease.  With this SSD array a value of 4096 achieves peak sequential application write throughput of 1.6 GB/s.  Throughput with parity is 2 GB/s, or 400 MB/s per drive.  Note what I said previously, above, when I described the table figures:  "...4096 is usually optimal for SSDs with 400MB/s measured write throughput."  Thus, those figures are not "theoretical" as you claimed, but are based on actual testing.  The same is true for rust, though I haven't performed such testing on rust.  Others on this list have submitted rust numbers, but not with testing quite as thorough as the above.  I invite you to perform FIO testing on your rust array and submit your results.  They should confirm what I stated in the table above.


On 3/20/2014 10:35 AM, Bernd Schubert wrote:
> Why should the stripe-cache size differ between SSDs and rotating
> disks? Did you ever try to figure out yourself why it got slower with
> higher values? I profiled that in the past and it was a CPU/memory
> limitation - the md thread went to 100%, searching for stripe-heads.


The results above do not seem to corroborate your claim.  The decrease in throughput from 1.63 GB/s to 1.16 GB/s, when increasing stripe_cache_size from 4096 to 32768, is a slope, not a cliff.  If CPU/DRAM starvation were the problem, I would think this would be a cliff and not a slope.

As I stated previously, I am simply characterizing the behavior of stripe_cache_size values and their real world impact on throughput and memory consumption.  I have not speculated to this point as to the cause of the observed behavior.  I have not profiled execution.  I don't know the code.  I am not a kernel hacker.  I am not a programmer.  What I have observed in reports on this list and in testing is that there is a direct correlation between optimal stripe_cache_size and device write throughput.

Cheers,

Stan


end of thread, other threads:[~2014-03-28  8:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-20  1:12 raid resync speed Jeff Allison
2014-03-20 14:35 ` Stan Hoeppner
2014-03-20 15:35   ` Bernd Schubert
2014-03-20 15:36     ` Bernd Schubert
2014-03-20 16:19       ` Eivind Sarto
2014-03-20 16:22         ` Bernd Schubert
2014-03-20 18:44     ` Stan Hoeppner
2014-03-27 16:08       ` Bernd Schubert
2014-03-28  8:03         ` Stan Hoeppner
2014-03-20 17:46 ` Bernd Schubert
2014-03-21  0:44   ` Jeff Allison
