* raid resync speed
@ 2014-03-20 1:12 Jeff Allison
2014-03-20 14:35 ` Stan Hoeppner
2014-03-20 17:46 ` Bernd Schubert
0 siblings, 2 replies; 11+ messages in thread
From: Jeff Allison @ 2014-03-20 1:12 UTC (permalink / raw)
To: linux-raid
The gist of my question is what kind of resync speed should I expect?
I have a HP N54L Microserver running centos 6.5.
In this box I have a 3x2TB disk raid 5 array, which I am in the
process of extending to a 4x2TB raid 5 array.
I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
Now the problem: the resync speed is very slow; it refuses to rise above
5MB, and in general it sits at 4M.
From looking at Glances it would appear that writing to the new disk
is the bottleneck; /dev/sdb is the new disk.
Disk I/O In/s Out/s
md0 0 0
sda1 0 0
sda2 0 1K
sdb1 3.92M 0
sdc1 24.2M 54.7M
sdd1 11.2M 54.7M
sde1 16.3M 54.7M
I partitioned the disk with --> parted -a optimal /dev/sdb
[root@nas ~]# parted -a optimal /dev/sdb
GNU Parted 2.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: ATA ST2000DM001-1E61 (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 2000GB 2000GB primary ntfs raid
There is no ntfs filesystem on the disk; I've still not worked out how
to remove that flag.
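(As an aside, a stale signature like that can be listed with wipefs; run with no
options it only reports what it finds and changes nothing. Actually erasing
signatures on a partition that is now an md member needs care, so treat this
only as a pointer.)
  wipefs /dev/sdb1    # list leftover filesystem/raid signatures, read-only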
I've followed the article here -->
http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
to attempt to speed it up but no joy.
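(For reference, the knobs such articles adjust are roughly the following; the
values shown are illustrative examples, not recommendations.)
  cat /proc/mdstat                            # reshape progress and current speed
  sysctl dev.raid.speed_limit_min             # resync floor, KB/s per device (default 1000)
  sysctl -w dev.raid.speed_limit_min=50000    # raise the floor
  sysctl -w dev.raid.speed_limit_max=500000   # raise the ceiling
  blockdev --setra 4096 /dev/md0              # read-ahead, in 512-byte sectors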
Any ideas what I've done wrong?
parted output
[root@nas ~]# parted -l
Model: ATA ST31000528AS (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 525MB 524MB primary ext4 boot
2 525MB 1000GB 1000GB primary lvm
Model: ATA ST2000DM001-1E61 (scsi)
Disk /dev/sdb: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 2000GB 2000GB primary ntfs raid
Model: ATA ST2000DM001-9YN1 (scsi)
Disk /dev/sdc: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 2000GB 2000GB primary raid
Model: ATA WDC WD25EZRS-00J (scsi)
Disk /dev/sdd: 2500GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 2000GB 2000GB primary ntfs raid
Model: ATA ST2000DL001-9VT1 (scsi)
Disk /dev/sde: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 2000GB 2000GB primary raid
* Re: raid resync speed
2014-03-20 1:12 raid resync speed Jeff Allison
@ 2014-03-20 14:35 ` Stan Hoeppner
2014-03-20 15:35 ` Bernd Schubert
2014-03-20 17:46 ` Bernd Schubert
1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-03-20 14:35 UTC (permalink / raw)
To: Jeff Allison, linux-raid
On 3/19/2014 8:12 PM, Jeff Allison wrote:
...
> In this box I have a 3x2TB disk raid 5 array, which I am in the
> process of extending to a 4x2TB raid 5 array.
>
> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>
> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>
> Now the problem the resync speed is v slow, it refuses to rise above
> 5MB, in general it sits at 4M.
...
> Model: ATA ST2000DM001-1E61 (scsi)
> Disk /dev/sdb: 2000GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: msdos
Seagate Advanced Format 512e disk drive
> Number Start End Size Type File system Flags
> 1 1049kB 2000GB 2000GB primary ntfs raid
The offset is 1 MiB (256 physical sectors), which is aligned, so it won't
incur RMW internally. No performance hit here.
...
> I've followed the article here -->
> http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
> to attempt to speed it up but no joy.
>
> Any Ideas what I've done wrong?
Yes. The article gives 16384 and 32768 as examples for
stripe_cache_size. Such high values tend to reduce throughput instead
of increasing it. Never use a value above 2048 with rust, and 1024 is
usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
addition, high values eat huge amounts of memory. The formula is:
stripe_cache_size * 4096 bytes * drive_count = RAM usage
(32768*4096) * 4 = 512MB of RAM consumed by the stripe cache
(16384*4096) * 4 = 256MB of RAM consumed by the stripe cache
(2048*4096) * 4 = 32MB of RAM consumed by the stripe cache
(1024*4096) * 4 = 16MB of RAM consumed by the stripe cache
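(Illustration only, assuming the array is md0; the cache is set via sysfs and
the comment shows the memory use from the formula above for a 4-drive array.)
  cat /sys/block/md0/md/stripe_cache_size           # current number of entries
  echo 1024 > /sys/block/md0/md/stripe_cache_size   # 1024 * 4096 B * 4 drives = 16MB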
Cheers,
--
Stan
* Re: raid resync speed
2014-03-20 14:35 ` Stan Hoeppner
@ 2014-03-20 15:35 ` Bernd Schubert
2014-03-20 15:36 ` Bernd Schubert
2014-03-20 18:44 ` Stan Hoeppner
0 siblings, 2 replies; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 15:35 UTC (permalink / raw)
To: stan, Jeff Allison, linux-raid
>
> Yes. The article gives 16384 and 32768 as examples for
> stripe_cache_size. Such high values tend to reduce throughput instead
> of increasing it. Never use a value above 2048 with rust, and 1024 is
> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
> addition, high values eat huge amounts of memory. The formula is:
>
Why should the stripe-cache size differ between SSDs and rotating disks?
Did you ever try to figure out yourself why it got slower with higher
values? I profiled that in the past and it was a CPU/memory limitation -
the md thread went to 100%, searching for stripe-heads.
So I really wonder how you got the impression that the stripe cache size
should have different values for different kinds of drives.
Cheers,
Bernd
* Re: raid resync speed
2014-03-20 15:35 ` Bernd Schubert
@ 2014-03-20 15:36 ` Bernd Schubert
2014-03-20 16:19 ` Eivind Sarto
2014-03-20 18:44 ` Stan Hoeppner
1 sibling, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 15:36 UTC (permalink / raw)
To: stan, Jeff Allison, linux-raid
On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>>
>> Yes. The article gives 16384 and 32768 as examples for
>> stripe_cache_size. Such high values tend to reduce throughput instead
>> of increasing it. Never use a value above 2048 with rust, and 1024 is
>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>> addition, high values eat huge amounts of memory. The formula is:
>>
>
> Why should the stripe-cache size differ between SSDs and rotating disks?
> Did you ever try to figure out yourself why it got slower with higher
> values? I profiled that in the past and it was a CPU/memory limitation -
> the md thread went to 100%, searching for stripe-heads.
Sorry, I forgot to write 'cpu usage', so it went to 100% cpu usage.
>
> So I really wonder how you got the impression that the stripe cache size
> should have different values for differnt kinds of drives.
>
>
> Cheers,
> Bernd
>
* Re: raid resync speed
2014-03-20 15:36 ` Bernd Schubert
@ 2014-03-20 16:19 ` Eivind Sarto
2014-03-20 16:22 ` Bernd Schubert
0 siblings, 1 reply; 11+ messages in thread
From: Eivind Sarto @ 2014-03-20 16:19 UTC (permalink / raw)
To: Bernd Schubert; +Cc: stan, Jeff Allison, linux-raid
On Mar 20, 2014, at 8:36 AM, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>>>
>>> Yes. The article gives 16384 and 32768 as examples for
>>> stripe_cache_size. Such high values tend to reduce throughput instead
>>> of increasing it. Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>>> addition, high values eat huge amounts of memory. The formula is:
>>>
>>
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
>
> Sorry, I forgot to write 'cpu usage', so it went to 100% cpu usage.
>
>>
>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for differnt kinds of drives.
>>
>>
>> Cheers,
>> Bernd
>>
>
The hash chains for the stripe cache become long if you increase the stripe cache. There are only 256
hash buckets. With 32K stripe cache entries, the average length of a hash chain will be 128 and that will
increase contention for the lock protecting the chain.
-eivind
* Re: raid resync speed
2014-03-20 16:19 ` Eivind Sarto
@ 2014-03-20 16:22 ` Bernd Schubert
0 siblings, 0 replies; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 16:22 UTC (permalink / raw)
To: Eivind Sarto; +Cc: stan, Jeff Allison, linux-raid
On 03/20/2014 05:19 PM, Eivind Sarto wrote:
>
> On Mar 20, 2014, at 8:36 AM, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
>> On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>>>>
>>>> Yes. The article gives 16384 and 32768 as examples for
>>>> stripe_cache_size. Such high values tend to reduce throughput instead
>>>> of increasing it. Never use a value above 2048 with rust, and 1024 is
>>>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>>>> addition, high values eat huge amounts of memory. The formula is:
>>>>
>>>
>>> Why should the stripe-cache size differ between SSDs and rotating disks?
>>> Did you ever try to figure out yourself why it got slower with higher
>>> values? I profiled that in the past and it was a CPU/memory limitation -
>>> the md thread went to 100%, searching for stripe-heads.
>>
>> Sorry, I forgot to write 'cpu usage', so it went to 100% cpu usage.
>>
>>>
>>> So I really wonder how you got the impression that the stripe cache size
>>> should have different values for differnt kinds of drives.
>>>
> The hash chains for the stripe cache become long if you increase the stripe cache. There are only 256
> hash buckets. With 32K stripe cache entries, the average length of a hash chain will be 128 and that will
> increase contention for the lock protection the chain.
>
Yes, this is an implementation detail. But it does not differ between
SSDs and rotating disks... (which was my point here).
* Re: raid resync speed
2014-03-20 1:12 raid resync speed Jeff Allison
2014-03-20 14:35 ` Stan Hoeppner
@ 2014-03-20 17:46 ` Bernd Schubert
2014-03-21 0:44 ` Jeff Allison
1 sibling, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2014-03-20 17:46 UTC (permalink / raw)
To: Jeff Allison, linux-raid
On 03/20/2014 02:12 AM, Jeff Allison wrote:
> The gist of my question is what kind of resync speed should I expect?
>
> I have a HP N54L Microserver running centos 6.5.
>
> In this box I have a 3x2TB disk raid 5 array, which I am in the
> process of extending to a 4x2TB raid 5 array.
>
> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>
> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>
> Now the problem the resync speed is v slow, it refuses to rise above
> 5MB, in general it sits at 4M.
Per second?
>
> from looking at glances it would appear that writing to the new disk
> is the bottle neck, /dev/sdb is the new disk.
>
> Disk I/O In/s Out/s
> md0 0 0
> sda1 0 0
> sda2 0 1K
> sdb1 3.92M 0
> sdc1 24.2M 54.7M
> sdd1 11.2M 54.7M
> sde1 16.3M 54.7M
Could you please send the output of 'iostat -xm 1'? Also, do you see anything
in 'top' that takes 100% CPU?
Thanks,
Bernd
* Re: raid resync speed
2014-03-20 15:35 ` Bernd Schubert
2014-03-20 15:36 ` Bernd Schubert
@ 2014-03-20 18:44 ` Stan Hoeppner
2014-03-27 16:08 ` Bernd Schubert
1 sibling, 1 reply; 11+ messages in thread
From: Stan Hoeppner @ 2014-03-20 18:44 UTC (permalink / raw)
To: Bernd Schubert, Jeff Allison, linux-raid
On 3/20/2014 10:35 AM, Bernd Schubert wrote:
> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>> Yes. The article gives 16384 and 32768 as examples for
>> stripe_cache_size. Such high values tend to reduce throughput instead
>> of increasing it. Never use a value above 2048 with rust, and 1024 is
>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>> addition, high values eat huge amounts of memory. The formula is:
> Why should the stripe-cache size differ between SSDs and rotating disks?
I won't discuss "should" as that makes this a subjective discussion.
I'll discuss this objectively, discuss what md does, not what it
"should" do or could do.
I'll answer your question with a question: Why does the total stripe
cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
and 16 drives, to maintain the same per drive throughput?
The answer to both this question and your question is the same answer.
As the total write bandwidth of the array increases, so must the total
stripe cache buffer space. stripe_cache_size of 1024 is usually optimal
for SATA drives with measured 100MB/s throughput, and 4096 is usually
optimal for SSDs with 400MB/s measured write throughput. The bandwidth
numbers include parity block writes.
array(s)            bandwidth MB/s   stripe_cache_size   cache MB
12x 100MB/s Rust    1200             1024                48
16x 100MB/s Rust    1600             1024                64
32x 100MB/s Rust    3200             1024                128
 3x 400MB/s SSD     1200             4096                48
 4x 400MB/s SSD     1600             4096                64
 8x 400MB/s SSD     3200             4096                128
As is clearly demonstrated, there is a direct relationship between cache
size and total write bandwidth. The number of drives and drive type is
irrelevant. It's the aggregate write bandwidth that matters.
Whether this "should" be this way is something for developers to debate.
I'm simply demonstrating how it "is" currently.
> Did you ever try to figure out yourself why it got slower with higher
> values? I profiled that in the past and it was a CPU/memory limitation -
> the md thread went to 100%, searching for stripe-heads.
This may be true at the limits, but going from 512 to 1024 to 2048 to
4096 with a 3 disk rust array isn't going to peak the CPU. And
somewhere with this setup, usually between 1024 and 2048, throughput
will begin to tail off, even with plenty of CPU and memory B/W remaining.
> So I really wonder how you got the impression that the stripe cache size
> should have different values for differnt kinds of drives.
Because higher aggregate throughputs require higher stripe_cache_size
values, and some drive types (SSDs) have significantly higher throughput
than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
for PCIe SSDs.
Cheers,
Stan
* Re: raid resync speed
2014-03-20 17:46 ` Bernd Schubert
@ 2014-03-21 0:44 ` Jeff Allison
0 siblings, 0 replies; 11+ messages in thread
From: Jeff Allison @ 2014-03-21 0:44 UTC (permalink / raw)
To: Bernd Schubert, linux-raid
I don't think it's the RAID code. I've dropped the disk out of the
array and I still cannot get any more than 4MB/sec out of it...
[jeff@nas ~]$dd if=/dev/zero of=/mnt/sdj/bonnie/test.tmp bs=4k
count=2000000 && sync && dd if=/dev/zero of=/mnt/sdd/bonnie/test.tmp
bs=4k count=2000000 && sync
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 231.778 s, 35.3 MB/s <-- WD Green
RMA I got back yesterday.
2000000+0 records in
2000000+0 records out
8192000000 bytes (8.2 GB) copied, 1818.18 s, 4.5 MB/s <-- Dud one.
Perhaps it's time to RMA the RMA.
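(Side note: a direct-I/O read from the raw device, with the drive out of the
array, takes the filesystem and page cache out of the picture, and SMART data
may show why the drive is slow. /dev/sdX below stands in for the dud drive.)
  dd if=/dev/sdX1 of=/dev/null bs=1M count=4096 iflag=direct   # raw sequential read
  smartctl -a /dev/sdX                                         # check reallocated/pending sectors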
On 21 March 2014 04:46, Bernd Schubert
<bernd.schubert@itwm.fraunhofer.de> wrote:
> On 03/20/2014 02:12 AM, Jeff Allison wrote:
>>
>> The gist of my question is what kind of resync speed should I expect?
>>
>> I have a HP N54L Microserver running centos 6.5.
>>
>> In this box I have a 3x2TB disk raid 5 array, which I am in the
>> process of extending to a 4x2TB raid 5 array.
>>
>> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>>
>> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>>
>> Now the problem the resync speed is v slow, it refuses to rise above
>> 5MB, in general it sits at 4M.
>
>
> Per second?
>
>
>>
>> from looking at glances it would appear that writing to the new disk
>> is the bottle neck, /dev/sdb is the new disk.
>>
>> Disk I/O In/s Out/s
>> md0 0 0
>> sda1 0 0
>> sda2 0 1K
>> sdb1 3.92M 0
>> sdc1 24.2M 54.7M
>> sdd1 11.2M 54.7M
>> sde1 16.3M 54.7M
>
>
> Could you please send output of 'iostat -xm 1'? Also, do you anything in
> 'top' that takes 100% CPU?
>
> Thanks,
> Bernd
>
>
* Re: raid resync speed
2014-03-20 18:44 ` Stan Hoeppner
@ 2014-03-27 16:08 ` Bernd Schubert
2014-03-28 8:03 ` Stan Hoeppner
0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2014-03-27 16:08 UTC (permalink / raw)
To: stan, Jeff Allison, linux-raid
Sorry for the late reply, I'm busy with work...
On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>> Yes. The article gives 16384 and 32768 as examples for
>>> stripe_cache_size. Such high values tend to reduce throughput instead
>>> of increasing it. Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>>> addition, high values eat huge amounts of memory. The formula is:
>
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>
> I won't discuss "should" as that makes this a subjective discussion.
> I'll discuss this objectively, discuss what md does, not what it
> "should" do or could do.
>
> I'll answer your question with a question: Why does the total stripe
> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
> and 16 drives, to maintain the same per drive throughput?
>
> The answer to both this question and your question is the same answer.
> As the total write bandwidth of the array increases, so must the total
> stripe cache buffer space. stripe_cache_size of 1024 is usually optimal
> for SATA drives with measured 100MB/s throughput, and 4096 is usually
> optimal for SSDs with 400MB/s measured write throughput. The bandwidth
> numbers include parity block writes.
Did you also consider that you simply need more stripe-heads (struct
stripe_head) to get complete stripes with more drives?
>
> array(s) bandwidth MB/s stripe_cache_size cache MB
>
> 12x 100MB/s Rust 1200 1024 48
> 16x 100MB/s Rust 1600 1024 64
> 32x 100MB/s Rust 3200 1024 128
>
> 3x 400MB/s SSD 1200 4096 48
> 4x 400MB/s SSD 1600 4096 64
> 8x 400MB/s SSD 3200 4096 128
>
> As is clearly demonstrated, there is a direct relationship between cache
> size and total write bandwidth. The number of drives and drive type is
> irrelevant. It's the aggregate write bandwidth that matters.
What is the meaning of "cache MB"? It does not seem to come from this
calculation:
> memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
> max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
...
> printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
> mdname(mddev), memory);
>
> Whether this "should" be this way is something for developers to debate.
> I'm simply demonstrating how it "is" currently.
Well, somehow I only see two different stripe-cache size values in your
numbers. Then the given bandwidth seems to be a theoretical value, based
on num-drives * performance-per-drive. Redundancy drives are also
missing in that calculation. And then the value of "cache MB" is also
unclear. So I'm sorry, but I don't see any "simply demonstrating".
>
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
>
> This may be true at the limits, but going from 512 to 1024 to 2048 to
> 4096 with a 3 disk rust array isn't going to peak the CPU. And
> somewhere with this setup, usually between 1024 and 2048, throughput
> will begin to tail off, even with plenty of CPU and memory B/W remaining.
Sorry, not in my experience. So it would be interesting to see real
measured values. But then I definitely never tested raid6 with 3 drives,
as this only provides a single data drive.
>
>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for differnt kinds of drives.
>
> Because higher aggregate throughputs require higher stripe_cache_size
> values, and some drive types (SSDs) have significantly higher throughput
> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
> for PCIe SSDs.
As I said, it would be interesting to see real numbers and profiling data.
Cheers,
Bernd
* Re: raid resync speed
2014-03-27 16:08 ` Bernd Schubert
@ 2014-03-28 8:03 ` Stan Hoeppner
0 siblings, 0 replies; 11+ messages in thread
From: Stan Hoeppner @ 2014-03-28 8:03 UTC (permalink / raw)
To: Bernd Schubert, Jeff Allison, linux-raid
On 3/27/2014 11:08 AM, Bernd Schubert wrote:
> Sorry for the late reply, I'm busy with work...
>
> On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
>> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>>> Yes. The article gives 16384 and 32768 as examples for
>>>> stripe_cache_size. Such high values tend to reduce throughput instead
>>>> of increasing it. Never use a value above 2048 with rust, and 1024 is
>>>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>>>> addition, high values eat huge amounts of memory. The formula is:
>>
>>> Why should the stripe-cache size differ between SSDs and rotating disks?
>>
>> I won't discuss "should" as that makes this a subjective discussion.
>> I'll discuss this objectively, discuss what md does, not what it
>> "should" do or could do.
>>
>> I'll answer your question with a question: Why does the total stripe
>> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
>> and 16 drives, to maintain the same per drive throughput?
>>
>> The answer to both this question and your question is the same answer.
>> As the total write bandwidth of the array increases, so must the total
>> stripe cache buffer space. stripe_cache_size of 1024 is usually optimal
>> for SATA drives with measured 100MB/s throughput, and 4096 is usually
>> optimal for SSDs with 400MB/s measured write throughput. The bandwidth
>> numbers include parity block writes.
>
> Did you also consider that you simply need more stripe-heads (struct stripe_head) to get complete stripes with more drives?
It has nothing to do with what we're discussing. You get complete stripes with the default value, which is IIRC 256, though md.txt still says 128 as of 3.13.6 and states that it only applies to RAID5. Maybe md.txt should be updated.
stripe_cache_size (currently raid5 only)
number of entries in the stripe cache. This is writable, but
there are upper and lower limits (32768, 16). Default is 128.
>> array(s) bandwidth MB/s stripe_cache_size cache MB
>>
>> 12x 100MB/s Rust 1200 1024 48
>> 16x 100MB/s Rust 1600 1024 64
>> 32x 100MB/s Rust 3200 1024 128
>>
>> 3x 400MB/s SSD 1200 4096 48
>> 4x 400MB/s SSD 1600 4096 64
>> 8x 400MB/s SSD 3200 4096 128
>>
>> As is clearly demonstrated, there is a direct relationship between cache
>> size and total write bandwidth. The number of drives and drive type is
>> irrelevant. It's the aggregate write bandwidth that matters.
>
> What is the meaning of "cache MB"? It does not seem to come from this calculation:
>
>> memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
>> max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
...
>
>> printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
>> mdname(mddev), memory);
No, it is not derived from the source code, but from the formula I stated previously in this thread:
stripe_cache_size * 4096 bytes * drive_count = RAM usage
>> Whether this "should" be this way is something for developers to debate.
>> I'm simply demonstrating how it "is" currently.
>
> Well, somehow I only see two different stripe-cache size values in your numbers.
Only two are required to demonstrate the md RAID5/6 behavior in question.
> Then the given bandwidth seems to be theoretical value, based on num-drives * performance-per-drive.
The values in the table are not theoretical, but are derived from test data, and are very close to what one will see with such a real world configuration.
> Redundancy drives are also missing in that calculation.
No, this is included. Read the sentence directly preceding the table.
> And then the value of "cache MB" is also unclear.
It is unambiguous.
> So I'm sorry, but don't see any "simply demonstrating".
...
>>> Did you ever try to figure out yourself why it got slower with higher
>>> values? I profiled that in the past and it was a CPU/memory limitation -
>>> the md thread went to 100%, searching for stripe-heads.
>>
>> This may be true at the limits, but going from 512 to 1024 to 2048 to
>> 4096 with a 3 disk rust array isn't going to peak the CPU. And
>> somewhere with this setup, usually between 1024 and 2048, throughput
>> will begin to tail off, even with plenty of CPU and memory B/W remaining.
>
> Sorry, not in my experience.
This is the behavior everyone sees, because this is how md behaves. If your experience is different then you should demonstrate it.
> So it would be interesting to see real measused values. But then I definitely never tested raid6 with 3 drives, as this only provides a single data drive.
The point above is that an md write thread won't saturate the processor, regardless of the size of the stripe cache, with a small count rust array. I simply chose a very low number to make the point clear. I didn't state a RAID level here. Whether it's RAID5 or 6 is irrelevant to the point.
>>> So I really wonder how you got the impression that the stripe cache size
>>> should have different values for differnt kinds of drives.
>>
>> Because higher aggregate throughputs require higher stripe_cache_size
>> values, and some drive types (SSDs) have significantly higher throughput
>> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
>> for PCIe SSDs.
>
> As I said, it would be interesting to see real numbers and profiling data.
Here are numbers for an md RAID5 SSD array, 64KB chunk.
5 x Intel 520s MLC 480G SATA3
Intel Xeon E3-1230V2 quad core, 1MB L2, 8MB L3, 3.3GHz/3.7GHz turbo
2x DDR3 = 21 GB/s memory bandwidth
Debian 6 kernel 3.2
Parallel FIO throughput
16 threads, 256KB block size, O_DIRECT, libaio, queue depth 16, 8 GB/thread, 128 GB total written:
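(A fio invocation along these lines should approximate that workload; it is a
sketch only, and the job name and target directory are assumptions rather than
the exact command used.)
  fio --name=seqwrite --directory=/mnt/test --ioengine=libaio --direct=1 \
      --bs=256k --iodepth=16 --numjobs=16 --size=8g --rw=write --group_reporting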
stripe_cache_size = 256
READ: io=131072MB, aggrb=2496MB/s, minb=2556MB/s, maxb=2556MB/s, mint=52508msec, maxt=52508msec
WRITE: io=131072MB, aggrb=928148KB/s, minb=950424KB/s, maxb=950424KB/s, mint=144608msec, maxt=144608msec
stripe_cache_size = 512
READ: io=131072MB, aggrb=2497MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52484msec, maxt=52484msec
WRITE: io=131072MB, aggrb=978170KB/s, minb=978MB/s, maxb=978MB/s, mint=137213msec, maxt=137213msec
stripe_cache_size = 2048
READ: io=131072MB, aggrb=2502MB/s, minb=2562MB/s, maxb=2562MB/s, mint=52382msec, maxt=52382msec
WRITE: io=131072MB, aggrb=996MB/s, minb=1020MB/s, maxb=1020MB/s, mint=131631msec, maxt=131631msec
stripe_cache_size = 4096
READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec
stripe_cache_size = 8192
READ: io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec
stripe_cache_size = 16384
READ: io=131072MB, aggrb=2482MB/s, minb=2542MB/s, maxb=2542MB/s, mint=52807msec, maxt=52807msec
WRITE: io=131072MB, aggrb=1377MB/s, minb=1410MB/s, maxb=1410MB/s, mint=95191msec, maxt=95191msec
stripe_cache_size = 32768
READ: io=131072MB, aggrb=2498MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52481msec, maxt=52481msec
WRITE: io=131072MB, aggrb=1139MB/s, minb=1166MB/s, maxb=1166MB/s, mint=115102msec, maxt=115102msec
The effect I described is clearly demonstrated here: increasing stripe_cache_size beyond the optimal value causes write throughput to decrease. With this SSD array a value of 4096 achieves peak sequential application write throughput of 1.6 GB/s. Throughput with parity is 2 GB/s, or 400 MB/s per drive. Note what I said previously, above, when I described the table figures: "...4096 is usually optimal for SSDs with 400MB/s measured write throughput." Thus, those figures are not "theoretical" as you claimed, but are based on actual testing. The same is true for rust, though I haven't performed such testing on rust. Others on this list have submitted rust numbers, but not with testing quite as thorough as the above. I invite you to perform FIO testing on your rust array and submit your results. They should confirm what I stated in the table above.
On 3/20/2014 10:35 AM, Bernd Schubert wrote:
> Why should the stripe-cache size differ between SSDs and rotating
> disks? Did you ever try to figure out yourself why it got slower with
> higher values? I profiled that in the past and it was a CPU/memory
> limitation - the md thread went to 100%, searching for stripe-heads.
The results above do not seem to corroborate your claim. The decrease in throughput from 1.63 GB/s to 1.16 GB/s, when increasing stripe_cache_size from 4096 to 32768, is a slope, not a cliff. If CPU/DRAM starvation were the problem, I would expect a cliff rather than a slope.
As I stated previously, I am simply characterizing the behavior of stripe_cache_size values and their real world impact on throughput and memory consumption. I have not speculated to this point as to the cause of the observed behavior. I have not profiled execution. I don't know the code. I am not a kernel hacker. I am not a programmer. What I have observed in reports on this list and in testing is that there is a direct correlation between optimal stripe_cache_size and device write throughput.
Cheers,
Stan
End of thread.
Thread overview: 11+ messages
2014-03-20 1:12 raid resync speed Jeff Allison
2014-03-20 14:35 ` Stan Hoeppner
2014-03-20 15:35 ` Bernd Schubert
2014-03-20 15:36 ` Bernd Schubert
2014-03-20 16:19 ` Eivind Sarto
2014-03-20 16:22 ` Bernd Schubert
2014-03-20 18:44 ` Stan Hoeppner
2014-03-27 16:08 ` Bernd Schubert
2014-03-28 8:03 ` Stan Hoeppner
2014-03-20 17:46 ` Bernd Schubert
2014-03-21 0:44 ` Jeff Allison