* major performance drop on raid5 due to context switches caused by small max_hw_sectors
@ 2007-04-20 21:06 Pallai Roland
[not found] ` <5d96567b0704202247s60e4f2f1x19511f790f597ea0@mail.gmail.com>
0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-20 21:06 UTC (permalink / raw)
To: Linux RAID Mailing List
Hi!
I made a software RAID5 array from 8 disks on top of a HPT2320 card driven by
HighPoint's proprietary driver, in which max_hw_sectors is 64KB. I began to
test it with a simple sequential read by 100 threads with an adjusted
readahead size (2048KB; total RAM is 1GB; I call posix_fadvise DONTNEED after
reads). Bad news: I noticed very weak performance on this array compared to
another array built from 7 disks on the motherboard's AHCI controllers. I dug
deeper and found the root of the problem: if I lower max_sectors_kb on my
AHCI disks, the same thing happens there too!
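
The ./readtest program referred to later in this thread is never shown; purely
as an illustration of the workload described above, one of the 100 reader
processes could look roughly like the sketch below. The 64KB block size and
the per-read DONTNEED call are assumptions based on the description, not the
author's actual code.

/* Hypothetical sketch of one reader: read a file sequentially and drop the
 * already-consumed pages from the page cache, as described above. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[64 * 1024];
	off_t done = 0;
	ssize_t n;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		done += n;
		/* tell the kernel we will not need the pages read so far again */
		posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
	}
	close(fd);
	return 0;
}

Spawning 100 of these in parallel, one per file, would reproduce the access
pattern behind the vmstat samples below.
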
dap:/sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
dap:/sys/block# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd  free buff  cache si so     bi bo   in    cs us sy id wa
 0 14    0  8216    0 791056  0  0 103304  0 2518 57340  0 41  0 59
 3 12    0  7420    0 791856  0  0 117264  0 2600 55709  0 44  0 56

thrashed readahead pages: 123363
dap:/sys/block# for i in sd*; do echo 512 >$i/queue/max_sectors_kb; done
dap:/sys/block# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
 0 100    0 182876    0 762560  0  0 350944  0 1299 1484  1 14  0 86
 0 100    0 129492    0 815460  0  0 265864  0 1432 2045  0 10  0 90
 0 100    0 112812    0 832504  0  0 290084  0 1366 1807  1 11  0 89

thrashed readahead pages: 4605
Isn't it possible to reduce the number of context switches here? Why do the
context switches cause readahead thrashing? And why does only RAID5 suffer
from the small max_sectors_kb -- why doesn't it happen if I just run a lot of
'badblocks' processes?
thanks,
--
dap

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  [not found] ` <5d96567b0704202247s60e4f2f1x19511f790f597ea0@mail.gmail.com>
@ 2007-04-21 19:32 ` Pallai Roland
  2007-04-22  0:18   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-21 19:32 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List

On Saturday 21 April 2007 07:47:49 you wrote:
> On 4/21/07, Pallai Roland <dap@mail.index.hu> wrote:
> > I made a software RAID5 array from 8 disks on top of a HPT2320 card
> > driven by HighPoint's proprietary driver, in which max_hw_sectors is
> > 64KB. I began to test it with a simple sequential read by 100 threads
> > with an adjusted readahead size (2048KB; total RAM is 1GB; I call
> > posix_fadvise DONTNEED after reads). Bad news: I noticed very weak
> > performance on this array compared to another array built from 7 disks
> > on the motherboard's AHCI controllers. I dug deeper and found the root
> > of the problem: if I lower max_sectors_kb on my AHCI disks, the same
> > thing happens there too!
> >
> > dap:/sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>
> 3. what is the raid configuration? did you increase the stripe_cache_size?

 Thanks! It works fine if chunk size < max_hw_sectors! But when that's not
the case, the very high number of context switches kills the performance.


RAID5, chunk size 128k:

# mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
  (waiting for sync, then mkfs, mount, etc)
# blockdev --setra 4096 /dev/md/0
# ./readtest &
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
 0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
 - context switch storm, only 10 of 100 processes are working, lots of
   thrashed readahead pages. I'm sure you can reproduce this with 64KB
   max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
   chunk size > max_sectors_kb:
 for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
 for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done


RAID5, chunk size 64k (equal to max_hw_sectors):

# mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
  (waiting for sync, then mkfs, mount, etc)
# blockdev --setra 4096 /dev/md/0
# ./readtest &
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
 1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
 1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
 0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
 - YES! It's MUCH better now! :)


All in all, I use a 64KB chunk now and I'm happy, but I think this is
definitely a software bug. The sata_mv driver also doesn't offer a bigger
max_sectors_kb on Marvell chips, so this is a performance killer for every
Marvell user running 128KB or bigger chunks on RAID5. At the very least the
kernel should print a warning, if it's not a bug but just a limitation.


bye,
--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-21 19:32 ` major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved] Pallai Roland
@ 2007-04-22  0:18 ` Justin Piszcz
  2007-04-22  0:42   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 0:18 UTC (permalink / raw)
To: Pallai Roland; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List

On Sat, 21 Apr 2007, Pallai Roland wrote:

>
> On Saturday 21 April 2007 07:47:49 you wrote:
>> On 4/21/07, Pallai Roland <dap@mail.index.hu> wrote:
>>> I made a software RAID5 array from 8 disks on top of a HPT2320 card
>>> driven by HighPoint's proprietary driver, in which max_hw_sectors is
>>> 64KB. I began to test it with a simple sequential read by 100 threads
>>> with an adjusted readahead size (2048KB; total RAM is 1GB; I call
>>> posix_fadvise DONTNEED after reads). Bad news: I noticed very weak
>>> performance on this array compared to another array built from 7 disks
>>> on the motherboard's AHCI controllers. I dug deeper and found the root
>>> of the problem: if I lower max_sectors_kb on my AHCI disks, the same
>>> thing happens there too!
>>>
>>> dap:/sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>>
>> 3. what is the raid configuration? did you increase the stripe_cache_size?
> Thanks! It works fine if chunk size < max_hw_sectors! But when that's not
> the case, the very high number of context switches kills the performance.
>
>
> RAID5, chunk size 128k:
>
> # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>   (waiting for sync, then mkfs, mount, etc)
> # blockdev --setra 4096 /dev/md/0
> # ./readtest &
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
> 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
>  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
> 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
> - context switch storm, only 10 of 100 processes are working, lots of
> thrashed readahead pages. I'm sure you can reproduce this with 64KB
> max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
> chunk size > max_sectors_kb:
> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
>
> RAID5, chunk size 64k (equal to max_hw_sectors):
>
> # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>   (waiting for sync, then mkfs, mount, etc)
> # blockdev --setra 4096 /dev/md/0
> # ./readtest &
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
>  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
>  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
>  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
> - YES! It's MUCH better now! :)
>
>
> All in all, I use a 64KB chunk now and I'm happy, but I think this is
> definitely a software bug. The sata_mv driver also doesn't offer a bigger
> max_sectors_kb on Marvell chips, so this is a performance killer for every
> Marvell user running 128KB or bigger chunks on RAID5. At the very least the
> kernel should print a warning, if it's not a bug but just a limitation.
>
>
> bye,
> --
>  d
>

How did you run your read test?

$ sudo dd if=/dev/md3 of=/dev/null
Password:
18868881+0 records in
18868880+0 records out
9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
 2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
 1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
 1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
 1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0

However--

$ sudo dd if=/dev/md3 of=/dev/null bs=8M
763+0 records in
762+0 records out
6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
 0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
 1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
 1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
 1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13

Justin.

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  0:18 ` Justin Piszcz
@ 2007-04-22  0:42 ` Pallai Roland
  2007-04-22  8:47   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 0:42 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> On Sat, 21 Apr 2007, Pallai Roland wrote:
> >
> > RAID5, chunk size 128k:
> >
> > # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
> >   (waiting for sync, then mkfs, mount, etc)
> > # blockdev --setra 4096 /dev/md/0
> > # ./readtest &
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> >  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
> > 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
> >  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
> > 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
> > - context switch storm, only 10 of 100 processes are working, lots of
> > thrashed readahead pages. I'm sure you can reproduce this with 64KB
> > max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
> > chunk size > max_sectors_kb:
> > for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> > for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >
> >
> > RAID5, chunk size 64k (equal to max_hw_sectors):
> >
> > # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
> >   (waiting for sync, then mkfs, mount, etc)
> > # blockdev --setra 4096 /dev/md/0
> > # ./readtest &
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> >  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
> >  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
> >  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
> >  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
> > - YES! It's MUCH better now! :)
> >
> >
> > All in all, I use a 64KB chunk now and I'm happy, but I think this is
> > definitely a software bug. The sata_mv driver also doesn't offer a bigger
> > max_sectors_kb on Marvell chips, so this is a performance killer for every
> > Marvell user running 128KB or bigger chunks on RAID5. At the very least
> > the kernel should print a warning, if it's not a bug but just a
> > limitation.
>
> How did you run your read test?
>
> $ sudo dd if=/dev/md3 of=/dev/null
> Password:
> 18868881+0 records in
> 18868880+0 records out
> 9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>  2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
>  1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
>  1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
>  1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0
>
> However--
>
> $ sudo dd if=/dev/md3 of=/dev/null bs=8M
> 763+0 records in
> 762+0 records out
> 6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>  0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
>  1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
>  1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
>  1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13

 I ran 100 parallel reader processes (dd) on top of an XFS file system; try
this:
 for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
 for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done

and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
 /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done

I also set 2048/4096 readahead sectors with blockdev --setra.

You need 50-100 reader processes to hit this issue, I think. My kernel
version is 2.6.20.3.


--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  0:42 ` Pallai Roland
@ 2007-04-22  8:47 ` Justin Piszcz
  2007-04-22  9:52   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 8:47 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>> On Sat, 21 Apr 2007, Pallai Roland wrote:
>>>
>>> RAID5, chunk size 128k:
>>>
>>> # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>>>   (waiting for sync, then mkfs, mount, etc)
>>> # blockdev --setra 4096 /dev/md/0
>>> # ./readtest &
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
>>> 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
>>>  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
>>> 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
>>> - context switch storm, only 10 of 100 processes are working, lots of
>>> thrashed readahead pages. I'm sure you can reproduce this with 64KB
>>> max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
>>> chunk size > max_sectors_kb:
>>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>>
>>> RAID5, chunk size 64k (equal to max_hw_sectors):
>>>
>>> # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>>>   (waiting for sync, then mkfs, mount, etc)
>>> # blockdev --setra 4096 /dev/md/0
>>> # ./readtest &
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
>>>  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
>>>  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
>>>  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
>>> - YES! It's MUCH better now! :)
>>>
>>>
>>> All in all, I use a 64KB chunk now and I'm happy, but I think this is
>>> definitely a software bug. The sata_mv driver also doesn't offer a bigger
>>> max_sectors_kb on Marvell chips, so this is a performance killer for every
>>> Marvell user running 128KB or bigger chunks on RAID5. At the very least
>>> the kernel should print a warning, if it's not a bug but just a
>>> limitation.
>>
>> How did you run your read test?
>>
>> $ sudo dd if=/dev/md3 of=/dev/null
>> Password:
>> 18868881+0 records in
>> 18868880+0 records out
>> 9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>>  2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
>>  1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
>>  1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
>>  1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0
>>
>> However--
>>
>> $ sudo dd if=/dev/md3 of=/dev/null bs=8M
>> 763+0 records in
>> 762+0 records out
>> 6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>>  0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
>>  1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
>>  1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
>>  1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13
>
> I ran 100 parallel reader processes (dd) on top of an XFS file system; try
> this:
>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>
> I also set 2048/4096 readahead sectors with blockdev --setra.
>
> You need 50-100 reader processes to hit this issue, I think. My kernel
> version is 2.6.20.3.
>
>
> --
>  d
>

In one xterm:
for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done

In another:
for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done

Results:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free   buff   cache si so    bi     bo   in   cs us sy id wa
 1 79  104  54908 169772 3096180  0  0   340 120944 3062 9220  0 37  0 63
 9 72  104  70588 155356 3095512  0  0   812 238550 3506 6692  1 80  0 20
 7 72  104  75352 154252 3092896  0  0  4172  73020 3262 1952  0 31  0 69
13 89  104  70036 157160 3092988  0  0  5224 410088 3754 6068  1 66  0 34
 1 10  104  66232 157160 3096784  0  0   128  10668 2834 1227  0 11  0 89
 0 83  104  69608 156008 3094468  0  0     0 227888 3017 1843  0 43  0 57
35 91  104  72556 153832 3094608  0  0     0 148968 3141 4651  0 45  0 55
 0 84  104  71332 152300 3097304  0  0     4 192556 3345 5716  1 45  0 55
 4 79  104  76288 150324 3096168  0  0   200  63940 3201 1518  0 34  0 66
 2 84  104  18796 210492 3093256  0  0 86360 276956 3500 4899  1 67  0 33
 1 89  104  21208 207296 3093372  0  0   148 219696 3473 3774  0 41  0 59
86 22  104  15236 210080 3097008  0  0  4968  91508 3313 9608  0 34  4 62
 0 81  104  17048 208536 3096756  0  0  2304 148384 3066  929  0 31  0 69
28 60  104  25696 204100 3093156  0  0   136 159520 3394 6210  0 41  0 59
 7 57  104  21788 207620 3095812  0  0 15808  64888 2880 3992  1 44  0 56
 0 85  104  23952 206576 3092812  0  0 24304 383716 4067 3535  0 66  0 34
 3 81  104  22468 204888 3096768  0  0   620 164092 3160 5136  0 37  0 63

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  8:47 ` Justin Piszcz
@ 2007-04-22  9:52 ` Pallai Roland
  2007-04-22 10:23   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 9:52 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>
> >> How did you run your read test?
> >>
> >
> > I ran 100 parallel reader processes (dd) on top of an XFS file system; try
> > this:
> >  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >
> > and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
> >  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >
> > I also set 2048/4096 readahead sectors with blockdev --setra.
> >
> > You need 50-100 reader processes to hit this issue, I think. My kernel
> > version is 2.6.20.3.
>
> In one xterm:
> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>
> In another:
> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done

 Write and read files on top of XFS, not on the block device. $i isn't a
typo: you should write 100 files, then read them back with 100 threads in
parallel once the writes are done. I have 1GB of RAM; maybe you should use
the mem= kernel parameter on boot.

 1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
 2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done


--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  9:52 ` Pallai Roland
@ 2007-04-22 10:23 ` Justin Piszcz
  2007-04-22 11:38   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 10:23 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>>>>
>>>> How did you run your read test?
>>>>
>>>
>>> I ran 100 parallel reader processes (dd) on top of an XFS file system; try
>>> this:
>>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
>>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>>>
>>> I also set 2048/4096 readahead sectors with blockdev --setra.
>>>
>>> You need 50-100 reader processes to hit this issue, I think. My kernel
>>> version is 2.6.20.3.
>>
>> In one xterm:
>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>
>> In another:
>> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
>
> Write and read files on top of XFS, not on the block device. $i isn't a
> typo: you should write 100 files, then read them back with 100 threads in
> parallel once the writes are done. I have 1GB of RAM; maybe you should use
> the mem= kernel parameter on boot.
>
>  1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
>  2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
>
> --
> d
>

I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
chipset. My max_sectors_kb is 128KB and my chunk size is 128KB; why do you
set max_sectors_kb less than the chunk size?

For read-ahead, there are some good benchmarks by SGI(?), I believe, and some
others that state 16MB is the best value; above that you lose on either reads
or writes, so 16MB appears to be the optimal overall value. Do these values
look good to you?

Read 100 files on XFS simultaneously:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 104 146680 460 3179400 0 0 138 17 77 161 1 0 99 0
0 0 104 146680 460 3179400 0 0 0 0 1095 190 0 0 100 0
16 84 104 149024 460 3162880 0 0 57492 25 1481 1284 2 10 88 1
0 88 104 148728 396 2982628 0 0 296904 8 2816 4728 1 48 0 51
0 88 104 147796 40 2894752 0 0 204080 0 2815 3507 0 8 0 92
0 87 104 149272 40 2845688 0 0 116200 0 2781 2836 0 6 0 94
0 87 104 147872 40 2770888 0 0 188260 0 2948 3043 0 9 0 91
0 87 104 148244 40 2583700 0 0 218440 0 2856 3218 0 12 0 88
1 87 104 147528 40 2385668 0 0 181352 0 3225 4211 1 9 0 90
0 87 104 147784 40 2343868 0 0 225716 0 3205 3725 0 9 0 91
0 88 104 146112 40 2233272 0 0 202968 8 3956 5669 0 9 0 91
0 88 104 146452 40 2325332 0 0 241196 0 4374 4927 0 7 0 93
0 88 104 147792 40 2463600 0 0 239788 0 3216 1139 1 5 0 94
1 88 104 148600 40 2468352 0 0 187488 0 3107 2883 0 6 0 94
0 88 104 147240 40 2553940 0 0 168236 0 3559 2795 0 5 0 95
0 87 104 147772 40 2682932 0 0 278092 0 5531 7507 0 9 0 91
0 87 104 148200 40 2813376 0 0 201324 0 4346 5726 0 6 0 94
1 87 104 148196 40 2774932 0 0 193620 0 5433 11595 0 10 0 90
0 87 104 148488 40 2785256 0 0 227568 0 4081 6121 0 11 0 89
0 87 104 148240 40 2798024 0 0 201800 0 5064 10857 1 11 0 89
0 87 104 148684 40 2825632 0 0 206564 20 5244 9451 0 10 0 90
2 86 104 149060 40 2899464 0 0 182584 0 6610 15984 0 10 0 90
0 86 104 148004 40 2925044 0 0 167044 0 5725 13011 0 13 0 87
0 84 104 148344 40 3025508 0 0 149640 0 6860 18340 0 9 0 91
0 81 104 146152 40 3074460 0 0 167780 0 7572 19434 1 16 0 84
0 79 104 149172 40 3127748 0 0 191064 0 5373 11928 0 10 0 90
0 77 104 146972 40 3101440 0 0 135968 0 9232 28208 1 18 0 82
0 70 104 148896 40 3081860 0 0 144372 0 7790 22857 0 16 0 84
0 69 104 148492 40 3173532 0 0 154024 0 8002 22211 0 10 0 90
1 68 104 145804 40 3192788 0 0 111412 0 9612 30604 0 14 0 86
0 66 104 147324 40 3173004 0 0 137800 0 9858 31132 0 13 0 87
0 65 104 145456 40 3121436 0 0 141348 0 6521 17443 1 13 0 87
1 64 104 148408 40 3154056 0 0 149308 0 9575 29755 0 13 0 87
0 61 104 145396 40 3187312 0 0 161976 0 9709 29021 0 13 0 87
0 61 104 147860 40 3189972 0 0 154844 0 7345 20033 0 10 0 90
2 61 104 148264 40 3162276 0 0 135804 0 10865 33643 0 15 1 85
0 59 104 147264 40 3203068 0 0 70464 0 9340 29345 0 11 0 89
1 58 104 147344 40 3203468 0 0 108072 48 12206 39011 0 15 0 85
0 58 104 146696 40 3204660 0 0 102492 32 12800 43578 0 15 0 85
1 58 104 145712 40 3201916 0 0 198728 8 10537 34621 0 15 0 85
2 56 104 148732 40 3195160 0 0 119700 0 10647 35077 1 15 0 84
0 54 104 148628 40 3203856 0 0 92588 0 8757 26257 0 12 0 88
0 54 104 147128 40 3203956 0 0 207224 0 9706 29997 0 14 0 86
1 54 104 146788 40 3165980 0 0 150120 36 10611 36721 0 17 0 83
1 52 104 146984 40 3121696 0 0 160080 8 8714 28264 0 14 0 86
0 51 104 147556 40 3123932 0 0 223940 0 6088 15013 0 13 0 87
0 50 104 147268 40 3172724 0 0 154592 0 8949 26645 0 13 0 87
0 46 104 148692 40 3135296 0 0 153708 0 7146 19844 1 15 0 85
0 45 104 148116 40 3168180 0 0 165740 20 7884 22111 0 14 0 86
0 43 104 147388 40 3209160 0 0 179588 8 9244 27501 0 14 0 86
0 41 104 147760 40 3208860 0 0 131216 24 8205 24867 0 10 0 90
0 41 104 148868 40 3122048 0 0 188260 0 9812 30872 0 14 0 86
1 40 104 146832 40 3210452 0 0 168388 0 8798 27378 0 12 0 88
1 39 104 146388 40 3211436 0 0 96044 0 13197 45162 0 15 2 83
0 37 104 144716 40 3213412 0 0 151044 12 9877 30263 1 14 0 85
0 37 104 146056 40 3212088 0 0 95988 27 13179 44933 0 15 5 80
3 36 104 146776 40 3211552 0 0 66580 0 13651 46636 0 16 9 76
1 37 104 146504 40 3212192 0 0 77196 0 16353 57091 0 18 0 82
1 37 104 146864 40 3211648 0 0 135312 0 13350 45859 0 17 1 83
2 35 104 148924 40 3209796 0 0 81228 0 13459 43748 0 15 0 85
1 36 104 147792 40 3211332 0 0 107248 0 15935 57007 0 21 0 79
2 35 104 146916 40 3212444 0 0 84328 0 16471 57611 0 18 0 82
0 35 104 146500 40 3212944 0 0 98524 0 16049 55775 0 19 0 81
0 34 104 148448 40 3211220 0 0 114116 0 14166 51140 1 17 2 81
0 32 104 145220 40 3215024 0 0 96068 0 16008 58012 0 21 0 79
1 29 104 145376 40 3215416 0 0 102604 0 12294 41704 0 16 1 84
2 27 104 146764 40 3214384 0 0 132004 0 13076 44026 1 15 3 82
0 27 104 148888 40 3212460 0 0 118908 0 16100 57546 0 19 0 81
0 26 104 144932 40 3216260 0 0 101220 24 15687 55236 0 20 2 78
1 24 104 147848 40 3214344 0 0 125152 8 13737 48996 0 19 0 81
2 21 104 147024 40 3215676 0 0 71300 0 18828 65151 0 22 0 78
1 22 104 145124 40 3217852 0 0 85228 0 19031 65807 1 20 0 80
2 21 104 147472 40 3215532 0 0 96188 0 16644 60505 0 20 0 80
2 18 104 146064 40 3217544 0 0 83316 0 16788 60762 1 21 0 79
6 13 104 145824 40 3218832 0 0 56188 0 15100 51145 0 16 1 83
0 6 104 147988 40 3218048 0 0 75836 20 19925 69456 0 22 0 78
0 0 104 148592 40 3218292 0 0 20328 0 6190 17793 0 6 70 24
0 0 104 148796 40 3218392 0 0 36 0 1199 372 1 0 98 1
0 0 104 148796 40 3218392 0 0 0 0 1090 192 0 0 100 0
0 0 104 149044 40 3218420 0 0 16 0 1110 349 0 0 100 0
0 0 104 149044 40 3218420 0 0 0 17 1105 212 0 0 100 0
0 0 104 149308 40 3218436 0 0 0 0 1143 317 0 0 100 0

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 10:23 ` Justin Piszcz
@ 2007-04-22 11:38 ` Pallai Roland
  2007-04-22 11:42   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 11:38 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]

On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> >> On Sun, 22 Apr 2007, Pallai Roland wrote:
> >>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>>> How did you run your read test?
> >>>
> >>> I ran 100 parallel reader processes (dd) on top of an XFS file system;
> >>> try this:
> >>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >>>
> >>> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
> >>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >>>
> >>> I also set 2048/4096 readahead sectors with blockdev --setra.
> >>>
> >>> You need 50-100 reader processes to hit this issue, I think. My kernel
> >>> version is 2.6.20.3.
> >>
> >> In one xterm:
> >> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>
> >> In another:
> >> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
> >
> > Write and read files on top of XFS, not on the block device. $i isn't a
> > typo: you should write 100 files, then read them back with 100 threads in
> > parallel once the writes are done. I have 1GB of RAM; maybe you should use
> > the mem= kernel parameter on boot.
> >
> >  1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100
> >     2>/dev/null; done
> >  2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
> chipset. My max_sectors_kb is 128KB and my chunk size is 128KB; why do you
> set max_sectors_kb less than the chunk size?

 It's the maximum on Marvell SATA chips under Linux -- maybe a hardware
limitation. I would have used a 128KB chunk, but I hit this issue.

> For read-ahead, there are some good benchmarks by SGI(?), I believe, and
> some others that state 16MB is the best value; above that you lose on
> either reads or writes, so 16MB appears to be the optimal overall value.
> Do these values look good to you?

 Where can I find this benchmark? I have done some tests on this topic, too.
I think the optimal readahead size always depends on the number of
sequentially reading processes and the available RAM. If you have 100
processes and 1GB of RAM, the maximum useful readahead is about 5-6MB; if you
set it bigger, that turns into readahead thrashing and undesirable context
switches. Anyway, I tried 16MB now, but the readahead size doesn't matter for
this bug: the same context switch storm appears with any readahead window
size.

> Read 100 files on XFS simultaneously:

 Is max_sectors_kb 128KB here? I think so. I see some anomaly, but maybe you
just have too big a readahead window for so many processes; it's not the bug
I'm talking about in my original post. Your interrupt and context switch
counts build up slowly, which may be a sign of readahead thrashing. In my
case the CS storm began in the first second and there was no high interrupt
count:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free buff  cache si so     bi bo   in    cs us sy  id wa
 0  0    0   7220    0 940972  0  0      0  0  256    20  0  0 100  0
 0 13    0 383636    0 535520  0  0 144904 32 2804 63834  1 42   0 57
24 20    0 353312    0 558200  0  0 121524  0 2669 67604  1 40   0 59
15 21    0 314808    0 557068  0  0  91300 33 2572 53442  0 29   0 71

 I have attached a small kernel patch; with it you can measure the readahead
thrashing ratio (see the tail of /proc/vmstat). I think it's a handy tool for
finding the optimal readahead size. And if you're interested in the bug I'm
talking about, set max_sectors_kb to 64KB.


--
 d

[-- Attachment #2: 01_2618+_rathr-d1.diff --]
[-- Type: text/x-diff, Size: 868 bytes --]

--- linux-2.6.18.2/include/linux/vmstat.h.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/include/linux/vmstat.h	2006-11-06 02:09:25.000000000 +0100
@@ -30,6 +30,7 @@
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		RATHRASHED,
 		NR_VM_EVENT_ITEMS
 };
--- linux-2.6.18.2/mm/vmstat.c.orig	2006-11-06 01:55:58.000000000 +0100
+++ linux-2.6.18.2/mm/vmstat.c	2006-11-06 02:05:14.000000000 +0100
@@ -502,6 +502,8 @@
 	"allocstall",
 
 	"pgrotated",
+
+	"rathrashed",
 #endif
 };
--- linux-2.6.18.2/mm/readahead.c.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/mm/readahead.c	2006-11-06 02:13:12.000000000 +0100
@@ -568,6 +568,7 @@
 	ra->flags |= RA_FLAG_MISS;
 	ra->flags &= ~RA_FLAG_INCACHE;
 	ra->cache_hit = 0;
+	count_vm_event(RATHRASHED);
 }
 
 /*
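
With a patch like the one above applied, the new counter appears as an
ordinary name/value line in /proc/vmstat. As a rough illustration of how it
could be read back (the parsing is generic; only the "rathrashed" name comes
from the patch, and this helper is not part of the thread):

/* Print the "rathrashed" counter exposed by the patch above.
 * Assumes the usual "name value" line format of /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[64];
	unsigned long long value;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &value) == 2) {
		if (strcmp(name, "rathrashed") == 0)
			printf("thrashed readahead pages: %llu\n", value);
	}
	fclose(f);
	return 0;
}

Sampling this before and after a test run gives the "thrashed readahead
pages" figures quoted earlier in the thread.
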

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 11:38 ` Pallai Roland
@ 2007-04-22 11:42 ` Justin Piszcz
  2007-04-22 14:38   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 11:42 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>> On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
>>>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>>>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>>>>>> How did you run your read test?
>>>>>
>>>>> I ran 100 parallel reader processes (dd) on top of an XFS file system;
>>>>> try this:
>>>>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>>>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>>>
>>>>> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
>>>>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>>>>>
>>>>> I also set 2048/4096 readahead sectors with blockdev --setra.
>>>>>
>>>>> You need 50-100 reader processes to hit this issue, I think. My kernel
>>>>> version is 2.6.20.3.
>>>>
>>>> In one xterm:
>>>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>>>
>>>> In another:
>>>> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
>>>
>>> Write and read files on top of XFS, not on the block device. $i isn't a
>>> typo: you should write 100 files, then read them back with 100 threads in
>>> parallel once the writes are done. I have 1GB of RAM; maybe you should use
>>> the mem= kernel parameter on boot.
>>>
>>>  1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
>>>  2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>
>> I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
>> chipset. My max_sectors_kb is 128KB and my chunk size is 128KB; why do you
>> set max_sectors_kb less than the chunk size?
>
> It's the maximum on Marvell SATA chips under Linux -- maybe a hardware
> limitation. I would have used a 128KB chunk, but I hit this issue.
>
>> For read-ahead, there are some good benchmarks by SGI(?), I believe, and
>> some others that state 16MB is the best value; above that you lose on
>> either reads or writes, so 16MB appears to be the optimal overall value.
>> Do these values look good to you?
>
> Where can I find this benchmark?

http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
Check page 13 of 20.

> I have done some tests on this topic, too. I think the optimal readahead
> size always depends on the number of sequentially reading processes and the
> available RAM. If you have 100 processes and 1GB of RAM, the maximum useful
> readahead is about 5-6MB; if you set it bigger, that turns into readahead
> thrashing and undesirable context switches. Anyway, I tried 16MB now, but
> the readahead size doesn't matter for this bug: the same context switch
> storm appears with any readahead window size.
>
>> Read 100 files on XFS simultaneously:
>
> Is max_sectors_kb 128KB here? I think so. I see some anomaly, but maybe you
> just have too big a readahead window for so many processes; it's not the
> bug I'm talking about in my original post. Your interrupt and context
> switch counts build up slowly, which may be a sign of readahead thrashing.
> In my case the CS storm began in the first second and there was no high
> interrupt count:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd   free buff  cache si so     bi bo   in    cs us sy  id wa
>  0  0    0   7220    0 940972  0  0      0  0  256    20  0  0 100  0
>  0 13    0 383636    0 535520  0  0 144904 32 2804 63834  1 42   0 57
> 24 20    0 353312    0 558200  0  0 121524  0 2669 67604  1 40   0 59
> 15 21    0 314808    0 557068  0  0  91300 33 2572 53442  0 29   0 71
>
> I have attached a small kernel patch; with it you can measure the readahead
> thrashing ratio (see the tail of /proc/vmstat). I think it's a handy tool
> for finding the optimal readahead size. And if you're interested in the bug
> I'm talking about, set max_sectors_kb to 64KB.
>
>
> --
> d
>

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 11:42 ` Justin Piszcz
@ 2007-04-22 14:38 ` Pallai Roland
  2007-04-22 14:48   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 14:38 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 1334 bytes --]

On Sunday 22 April 2007 13:42:43 Justin Piszcz wrote:
> http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
> Check page 13 of 20.

 Thanks, an interesting presentation. I'm working in the same area now: big
media files and many clients. I spent some days building a low-cost,
high-performance server. In my experience, some results of this presentation
can't be applied to recent kernels.

 It's off-topic in this thread, sorry, but I like to show off what can be
done with Linux! :)

ASUS P5B-E Plus, P4 641, 1024MB RAM, 6 disks on the 965P's south bridge,
1 disk on the JMicron (both driven by the AHCI driver), 1 disk on a Silicon
Image 3132, 8 disks on a HPT2320 (HighPoint's driver). 16x Seagate 500GB,
16MB cache.
 kernel 2.6.20.3
 anticipatory scheduler
 chunk size 64KB
 XFS file system
 file size is 400MB, I read 200 of them in each test

The yellow points mark the thrashing thresholds, which I computed from the
number of processes and the RAM size. It's not an exact threshold.

 - now see the attached picture :)

Awesome performance, near disk-platter speed with a big readahead! It gets
even better, by about 15%, if I use the -mm tree with the new adaptive
readahead! Bigger files and a bigger chunk also help, but in my case those
are fixed (unfortunately).

The rule for readahead size is simple: the more the better, as long as there
is no thrashing.


--
 d

[-- Attachment #2: r1.png --]
[-- Type: image/png, Size: 8277 bytes --]
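
The exact formula behind those thresholds is not given in the thread. One
plausible back-of-the-envelope estimate, consistent with the earlier remark
that 100 processes and 1GB of RAM allow roughly 5-6MB of readahead, is to
divide part of the cacheable RAM among the concurrent readers; the halving
used below is an assumption, not the author's actual calculation.

#include <stdio.h>

/* Rough, assumed estimate of the largest per-stream readahead (in bytes)
 * that still fits in the page cache when nreaders streams are active. */
static unsigned long long ra_thrash_threshold(unsigned long long cache_bytes,
                                              unsigned int nreaders)
{
	/* keep half the cache as headroom, split the rest among the readers */
	return cache_bytes / 2 / nreaders;
}

int main(void)
{
	/* ~1GB of RAM and 100 readers -> about 5MB per stream, in the same
	 * ballpark as the 5-6MB figure quoted earlier in the thread */
	printf("%llu bytes\n", ra_thrash_threshold(1024ULL << 20, 100));
	return 0;
}
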

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 14:38 ` Pallai Roland
@ 2007-04-22 14:48 ` Justin Piszcz
  2007-04-22 15:09   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 14:48 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 13:42:43 Justin Piszcz wrote:
>> http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
>> Check page 13 of 20.
> Thanks, an interesting presentation. I'm working in the same area now: big
> media files and many clients. I spent some days building a low-cost,
> high-performance server. In my experience, some results of this presentation
> can't be applied to recent kernels.
>
> It's off-topic in this thread, sorry, but I like to show off what can be
> done with Linux! :)
>
> ASUS P5B-E Plus, P4 641, 1024MB RAM, 6 disks on the 965P's south bridge,
> 1 disk on the JMicron (both driven by the AHCI driver), 1 disk on a Silicon
> Image 3132, 8 disks on a HPT2320 (HighPoint's driver). 16x Seagate 500GB,
> 16MB cache.
>  kernel 2.6.20.3
>  anticipatory scheduler
>  chunk size 64KB
>  XFS file system
>  file size is 400MB, I read 200 of them in each test
>
> The yellow points mark the thrashing thresholds, which I computed from the
> number of processes and the RAM size. It's not an exact threshold.
>
> - now see the attached picture :)
>
> Awesome performance, near disk-platter speed with a big readahead! It gets
> even better, by about 15%, if I use the -mm tree with the new adaptive
> readahead! Bigger files and a bigger chunk also help, but in my case those
> are fixed (unfortunately).
>
> The rule for readahead size is simple: the more the better, as long as there
> is no thrashing.
>
>
> --
> d
>

Have you also optimized your stripe cache for writes?

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 14:48 ` Justin Piszcz
@ 2007-04-22 15:09 ` Pallai Roland
  2007-04-22 15:53   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 15:09 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
> Have you also optimized your stripe cache for writes?

 Not yet. Is it worth it?


--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 15:09 ` Pallai Roland
@ 2007-04-22 15:53 ` Justin Piszcz
  2007-04-22 19:01   ` Mr. James W. Laferriere
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 15:53 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
>> Have you also optimized your stripe cache for writes?
> Not yet. Is it worth it?
>
>
> --
> d
>

Yes, it is -- well, if write speed is important to you, that is. Each of
these write tests is averaged over three runs:

128k_stripe:   69.2MB/s
256k_stripe:  105.3MB/s
512k_stripe:  142.0MB/s
1024k_stripe: 144.6MB/s
2048k_stripe: 208.3MB/s
4096k_stripe: 223.6MB/s
8192k_stripe: 226.0MB/s
16384k_stripe: 215.0MB/s

What is your /sys/block/md[0-9]/md/stripe_cache_size?

Justin.
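
For anyone scripting this tunable, stripe_cache_size is an ordinary writable
sysfs file. A minimal sketch of setting it from C follows; the md0 device
name and the value 8192 are placeholders, and the memory-cost comment assumes
4KB pages and an 8-disk array rather than anything stated in this thread.

/* Write a new value into /sys/block/md0/md/stripe_cache_size. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/md0/md/stripe_cache_size";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* 8192 entries; each entry costs one page per member disk, so on an
	 * 8-disk array with 4KB pages this is roughly 256MB of RAM. */
	fprintf(f, "8192\n");
	fclose(f);
	return 0;
}

The usual trade-off is that a larger stripe cache helps sequential RAID5
writes (as in the numbers above) at the cost of pinned kernel memory.
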

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 15:53 ` Justin Piszcz
@ 2007-04-22 19:01 ` Mr. James W. Laferriere
  2007-04-22 20:35   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Mr. James W. Laferriere @ 2007-04-22 19:01 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Pallai Roland, Linux RAID Mailing List

	Hello Justin,

On Sun, 22 Apr 2007, Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
>> On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
>>> Have you also optimized your stripe cache for writes?
>> Not yet. Is it worth it?
>> --
>> d
> Yes, it is -- well, if write speed is important to you, that is. Each of
> these write tests is averaged over three runs:
>
> 128k_stripe:   69.2MB/s
> 256k_stripe:  105.3MB/s
> 512k_stripe:  142.0MB/s
> 1024k_stripe: 144.6MB/s
> 2048k_stripe: 208.3MB/s
> 4096k_stripe: 223.6MB/s
> 8192k_stripe: 226.0MB/s
> 16384k_stripe: 215.0MB/s
>
> What is your /sys/block/md[0-9]/md/stripe_cache_size?

	On which filesystem type? ie: ext3, xfs, reiser, ...

	Tia, JimL
--
+----------------------------------------------------------------+
| James W. Laferriere     | System Techniques    | Give me VMS   |
| Network Engineer        | 663 Beaumont Blvd    | Give me Linux |
| babydr@baby-dragons.com | Pacifica, CA. 94044  | only on AXP   |
+----------------------------------------------------------------+

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 19:01 ` Mr. James W. Laferriere
@ 2007-04-22 20:35 ` Justin Piszcz
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 20:35 UTC (permalink / raw)
To: Mr. James W. Laferriere; +Cc: Pallai Roland, Linux RAID Mailing List

On Sun, 22 Apr 2007, Mr. James W. Laferriere wrote:

> 	Hello Justin,
>
> On Sun, 22 Apr 2007, Justin Piszcz wrote:
>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>> On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
>>>> Have you also optimized your stripe cache for writes?
>>> Not yet. Is it worth it?
>>> --
>>> d
>> Yes, it is -- well, if write speed is important to you, that is. Each of
>> these write tests is averaged over three runs:
>>
>> 128k_stripe:   69.2MB/s
>> 256k_stripe:  105.3MB/s
>> 512k_stripe:  142.0MB/s
>> 1024k_stripe: 144.6MB/s
>> 2048k_stripe: 208.3MB/s
>> 4096k_stripe: 223.6MB/s
>> 8192k_stripe: 226.0MB/s
>> 16384k_stripe: 215.0MB/s
>>
>> What is your /sys/block/md[0-9]/md/stripe_cache_size?
>
> 	On which filesystem type? ie: ext3, xfs, reiser, ...
>
> 	Tia, JimL
> --
> +----------------------------------------------------------------+
> | James W. Laferriere     | System Techniques    | Give me VMS   |
> | Network Engineer        | 663 Beaumont Blvd    | Give me Linux |
> | babydr@baby-dragons.com | Pacifica, CA. 94044  | only on AXP   |
> +----------------------------------------------------------------+
>

XFS
Thread overview: 15+ messages
-- links below jump to the message on this page --
2007-04-20 21:06 major performance drop on raid5 due to context switches caused by small max_hw_sectors Pallai Roland
[not found] ` <5d96567b0704202247s60e4f2f1x19511f790f597ea0@mail.gmail.com>
2007-04-21 19:32 ` major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved] Pallai Roland
2007-04-22 0:18 ` Justin Piszcz
2007-04-22 0:42 ` Pallai Roland
2007-04-22 8:47 ` Justin Piszcz
2007-04-22 9:52 ` Pallai Roland
2007-04-22 10:23 ` Justin Piszcz
2007-04-22 11:38 ` Pallai Roland
2007-04-22 11:42 ` Justin Piszcz
2007-04-22 14:38 ` Pallai Roland
2007-04-22 14:48 ` Justin Piszcz
2007-04-22 15:09 ` Pallai Roland
2007-04-22 15:53 ` Justin Piszcz
2007-04-22 19:01 ` Mr. James W. Laferriere
2007-04-22 20:35 ` Justin Piszcz