* major performance drop on raid5 due to context switches caused by small max_hw_sectors
@ 2007-04-20 21:06 Pallai Roland
[not found] ` <5d96567b0704202247s60e4f2f1x19511f790f597ea0@mail.gmail.com>
0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-20 21:06 UTC (permalink / raw)
To: Linux RAID Mailing List
Hi!
I made a software RAID5 array from 8 disks on top of a HPT2320 card driven by
HighPoint's proprietary driver, in which max_hw_sectors is 64KB. I began to
test it with a simple sequential read by 100 threads with an adjusted
readahead size (2048KB; total RAM is 1GB; I call posix_fadvise DONTNEED after
reads). Bad news: I noticed very weak performance on this array compared to
another array built from 7 disks on the motherboard's AHCI controllers. I dug
deeper and found the root of the problem: if I lower max_sectors_kb on my
AHCI disks, the same thing happens there too!
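
The ./readtest program referred to later in this thread is never shown; purely
as an illustration of the workload described above, one of the 100 reader
processes could look roughly like the sketch below. The 64KB block size and
the per-read DONTNEED call are assumptions based on the description, not the
author's actual code.

/* Hypothetical sketch of one reader: read a file sequentially and drop the
 * already-consumed pages from the page cache, as described above. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[64 * 1024];
	off_t done = 0;
	ssize_t n;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		done += n;
		/* tell the kernel we will not need the pages read so far again */
		posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
	}
	close(fd);
	return 0;
}

Spawning 100 of these in parallel, one per file, would reproduce the access
pattern behind the vmstat samples below.
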
dap:/sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
dap:/sys/block# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd  free buff  cache si so     bi bo   in    cs us sy id wa
 0 14    0  8216    0 791056  0  0 103304  0 2518 57340  0 41  0 59
 3 12    0  7420    0 791856  0  0 117264  0 2600 55709  0 44  0 56

thrashed readahead pages: 123363
dap:/sys/block# for i in sd*; do echo 512 >$i/queue/max_sectors_kb; done
dap:/sys/block# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
 0 100    0 182876    0 762560  0  0 350944  0 1299 1484  1 14  0 86
 0 100    0 129492    0 815460  0  0 265864  0 1432 2045  0 10  0 90
 0 100    0 112812    0 832504  0  0 290084  0 1366 1807  1 11  0 89

thrashed readahead pages: 4605
Isn't it possible to reduce the number of context switches here? Why do the
context switches cause readahead thrashing? And why does only RAID5 suffer
from the small max_sectors_kb -- why doesn't it happen if I just run a lot of
'badblocks' processes?
thanks,
--
dap

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  [not found] ` <5d96567b0704202247s60e4f2f1x19511f790f597ea0@mail.gmail.com>
@ 2007-04-21 19:32 ` Pallai Roland
  2007-04-22  0:18   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-21 19:32 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List

On Saturday 21 April 2007 07:47:49 you wrote:
> On 4/21/07, Pallai Roland <dap@mail.index.hu> wrote:
> > I made a software RAID5 array from 8 disks on top of a HPT2320 card
> > driven by HighPoint's proprietary driver, in which max_hw_sectors is
> > 64KB. I began to test it with a simple sequential read by 100 threads
> > with an adjusted readahead size (2048KB; total RAM is 1GB; I call
> > posix_fadvise DONTNEED after reads). Bad news: I noticed very weak
> > performance on this array compared to another array built from 7 disks
> > on the motherboard's AHCI controllers. I dug deeper and found the root
> > of the problem: if I lower max_sectors_kb on my AHCI disks, the same
> > thing happens there too!
> >
> > dap:/sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>
> 3. what is the raid configuration? did you increase the stripe_cache_size?

 Thanks! It works fine if chunk size < max_hw_sectors! But when that's not
the case, the very high number of context switches kills the performance.


RAID5, chunk size 128k:

# mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
  (waiting for sync, then mkfs, mount, etc)
# blockdev --setra 4096 /dev/md/0
# ./readtest &
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
 0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
 - context switch storm, only 10 of 100 processes are working, lots of
   thrashed readahead pages. I'm sure you can reproduce this with 64KB
   max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
   chunk size > max_sectors_kb:
 for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
 for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done


RAID5, chunk size 64k (equal to max_hw_sectors):

# mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
  (waiting for sync, then mkfs, mount, etc)
# blockdev --setra 4096 /dev/md/0
# ./readtest &
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
 1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
 1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
 0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
 - YES! It's MUCH better now! :)


All in all, I use a 64KB chunk now and I'm happy, but I think this is
definitely a software bug. The sata_mv driver also doesn't offer a bigger
max_sectors_kb on Marvell chips, so this is a performance killer for every
Marvell user running 128KB or bigger chunks on RAID5. At the very least the
kernel should print a warning, if it's not a bug but just a limitation.


bye,
--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-21 19:32 ` major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved] Pallai Roland
@ 2007-04-22  0:18 ` Justin Piszcz
  2007-04-22  0:42   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 0:18 UTC (permalink / raw)
To: Pallai Roland; +Cc: Raz Ben-Jehuda(caro), Linux RAID Mailing List

On Sat, 21 Apr 2007, Pallai Roland wrote:

>
> On Saturday 21 April 2007 07:47:49 you wrote:
>> On 4/21/07, Pallai Roland <dap@mail.index.hu> wrote:
>>> I made a software RAID5 array from 8 disks on top of a HPT2320 card
>>> driven by HighPoint's proprietary driver, in which max_hw_sectors is
>>> 64KB. I began to test it with a simple sequential read by 100 threads
>>> with an adjusted readahead size (2048KB; total RAM is 1GB; I call
>>> posix_fadvise DONTNEED after reads). Bad news: I noticed very weak
>>> performance on this array compared to another array built from 7 disks
>>> on the motherboard's AHCI controllers. I dug deeper and found the root
>>> of the problem: if I lower max_sectors_kb on my AHCI disks, the same
>>> thing happens there too!
>>>
>>> dap:/sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>>
>> 3. what is the raid configuration? did you increase the stripe_cache_size?
> Thanks! It works fine if chunk size < max_hw_sectors! But when that's not
> the case, the very high number of context switches kills the performance.
>
>
> RAID5, chunk size 128k:
>
> # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>   (waiting for sync, then mkfs, mount, etc)
> # blockdev --setra 4096 /dev/md/0
> # ./readtest &
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
> 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
>  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
> 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
> - context switch storm, only 10 of 100 processes are working, lots of
> thrashed readahead pages. I'm sure you can reproduce this with 64KB
> max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
> chunk size > max_sectors_kb:
> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
>
> RAID5, chunk size 64k (equal to max_hw_sectors):
>
> # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>   (waiting for sync, then mkfs, mount, etc)
> # blockdev --setra 4096 /dev/md/0
> # ./readtest &
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
>  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
>  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
>  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
> - YES! It's MUCH better now! :)
>
>
> All in all, I use a 64KB chunk now and I'm happy, but I think this is
> definitely a software bug. The sata_mv driver also doesn't offer a bigger
> max_sectors_kb on Marvell chips, so this is a performance killer for every
> Marvell user running 128KB or bigger chunks on RAID5. At the very least the
> kernel should print a warning, if it's not a bug but just a limitation.
>
>
> bye,
> --
>  d
>

How did you run your read test?

$ sudo dd if=/dev/md3 of=/dev/null
Password:
18868881+0 records in
18868880+0 records out
9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
 2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
 1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
 1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
 1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0

However--

$ sudo dd if=/dev/md3 of=/dev/null bs=8M
763+0 records in
762+0 records out
6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
 0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
 1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
 1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
 1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13

Justin.

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  0:18 ` Justin Piszcz
@ 2007-04-22  0:42 ` Pallai Roland
  2007-04-22  8:47   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 0:42 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> On Sat, 21 Apr 2007, Pallai Roland wrote:
> >
> > RAID5, chunk size 128k:
> >
> > # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
> >   (waiting for sync, then mkfs, mount, etc)
> > # blockdev --setra 4096 /dev/md/0
> > # ./readtest &
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> >  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
> > 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
> >  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
> > 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
> > - context switch storm, only 10 of 100 processes are working, lots of
> > thrashed readahead pages. I'm sure you can reproduce this with 64KB
> > max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
> > chunk size > max_sectors_kb:
> > for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> > for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >
> >
> > RAID5, chunk size 64k (equal to max_hw_sectors):
> >
> > # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
> >   (waiting for sync, then mkfs, mount, etc)
> > # blockdev --setra 4096 /dev/md/0
> > # ./readtest &
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> >  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
> >  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
> >  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
> >  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
> > - YES! It's MUCH better now! :)
> >
> >
> > All in all, I use a 64KB chunk now and I'm happy, but I think this is
> > definitely a software bug. The sata_mv driver also doesn't offer a bigger
> > max_sectors_kb on Marvell chips, so this is a performance killer for every
> > Marvell user running 128KB or bigger chunks on RAID5. At the very least
> > the kernel should print a warning, if it's not a bug but just a
> > limitation.
>
> How did you run your read test?
>
> $ sudo dd if=/dev/md3 of=/dev/null
> Password:
> 18868881+0 records in
> 18868880+0 records out
> 9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>  2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
>  1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
>  1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
>  1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0
>
> However--
>
> $ sudo dd if=/dev/md3 of=/dev/null bs=8M
> 763+0 records in
> 762+0 records out
> 6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>  0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
>  1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
>  1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
>  1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13

 I ran 100 parallel reader processes (dd) on top of an XFS file system; try
this:
 for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
 for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done

and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
 /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done

I also set 2048/4096 readahead sectors with blockdev --setra.

You need 50-100 reader processes to hit this issue, I think. My kernel
version is 2.6.20.3.


--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  0:42 ` Pallai Roland
@ 2007-04-22  8:47 ` Justin Piszcz
  2007-04-22  9:52   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 8:47 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>> On Sat, 21 Apr 2007, Pallai Roland wrote:
>>>
>>> RAID5, chunk size 128k:
>>>
>>> # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>>>   (waiting for sync, then mkfs, mount, etc)
>>> # blockdev --setra 4096 /dev/md/0
>>> # ./readtest &
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
>>> 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
>>>  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
>>> 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
>>> - context switch storm, only 10 of 100 processes are working, lots of
>>> thrashed readahead pages. I'm sure you can reproduce this with 64KB
>>> max_sectors_kb and 2.6.20.x on *any* 8-disk-wide RAID5 array if
>>> chunk size > max_sectors_kb:
>>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>>
>>> RAID5, chunk size 64k (equal to max_hw_sectors):
>>>
>>> # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>>>   (waiting for sync, then mkfs, mount, etc)
>>> # blockdev --setra 4096 /dev/md/0
>>> # ./readtest &
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
>>>  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
>>>  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
>>>  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
>>> - YES! It's MUCH better now! :)
>>>
>>>
>>> All in all, I use a 64KB chunk now and I'm happy, but I think this is
>>> definitely a software bug. The sata_mv driver also doesn't offer a bigger
>>> max_sectors_kb on Marvell chips, so this is a performance killer for every
>>> Marvell user running 128KB or bigger chunks on RAID5. At the very least
>>> the kernel should print a warning, if it's not a bug but just a
>>> limitation.
>>
>> How did you run your read test?
>>
>> $ sudo dd if=/dev/md3 of=/dev/null
>> Password:
>> 18868881+0 records in
>> 18868880+0 records out
>> 9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>>  2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
>>  1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
>>  1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
>>  1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0
>>
>> However--
>>
>> $ sudo dd if=/dev/md3 of=/dev/null bs=8M
>> 763+0 records in
>> 762+0 records out
>> 6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>>  0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
>>  1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
>>  1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
>>  1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13
>
> I ran 100 parallel reader processes (dd) on top of an XFS file system; try
> this:
>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>
> I also set 2048/4096 readahead sectors with blockdev --setra.
>
> You need 50-100 reader processes to hit this issue, I think. My kernel
> version is 2.6.20.3.
>
>
> --
>  d
>

In one xterm:
for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done

In another:
for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done

Results:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free   buff   cache si so    bi     bo   in   cs us sy id wa
 1 79  104  54908 169772 3096180  0  0   340 120944 3062 9220  0 37  0 63
 9 72  104  70588 155356 3095512  0  0   812 238550 3506 6692  1 80  0 20
 7 72  104  75352 154252 3092896  0  0  4172  73020 3262 1952  0 31  0 69
13 89  104  70036 157160 3092988  0  0  5224 410088 3754 6068  1 66  0 34
 1 10  104  66232 157160 3096784  0  0   128  10668 2834 1227  0 11  0 89
 0 83  104  69608 156008 3094468  0  0     0 227888 3017 1843  0 43  0 57
35 91  104  72556 153832 3094608  0  0     0 148968 3141 4651  0 45  0 55
 0 84  104  71332 152300 3097304  0  0     4 192556 3345 5716  1 45  0 55
 4 79  104  76288 150324 3096168  0  0   200  63940 3201 1518  0 34  0 66
 2 84  104  18796 210492 3093256  0  0 86360 276956 3500 4899  1 67  0 33
 1 89  104  21208 207296 3093372  0  0   148 219696 3473 3774  0 41  0 59
86 22  104  15236 210080 3097008  0  0  4968  91508 3313 9608  0 34  4 62
 0 81  104  17048 208536 3096756  0  0  2304 148384 3066  929  0 31  0 69
28 60  104  25696 204100 3093156  0  0   136 159520 3394 6210  0 41  0 59
 7 57  104  21788 207620 3095812  0  0 15808  64888 2880 3992  1 44  0 56
 0 85  104  23952 206576 3092812  0  0 24304 383716 4067 3535  0 66  0 34
 3 81  104  22468 204888 3096768  0  0   620 164092 3160 5136  0 37  0 63

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  8:47 ` Justin Piszcz
@ 2007-04-22  9:52 ` Pallai Roland
  2007-04-22 10:23   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 9:52 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>
> >> How did you run your read test?
> >>
> >
> > I ran 100 parallel reader processes (dd) on top of an XFS file system; try
> > this:
> >  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >
> > and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
> >  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >
> > I also set 2048/4096 readahead sectors with blockdev --setra.
> >
> > You need 50-100 reader processes to hit this issue, I think. My kernel
> > version is 2.6.20.3.
>
> In one xterm:
> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>
> In another:
> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done

 Write and read files on top of XFS, not on the block device. $i isn't a
typo: you should write 100 files, then read them back with 100 threads in
parallel once the writes are done. I have 1GB of RAM; maybe you should use
the mem= kernel parameter on boot.

 1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
 2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done


--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22  9:52 ` Pallai Roland
@ 2007-04-22 10:23 ` Justin Piszcz
  2007-04-22 11:38   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 10:23 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>>>>
>>>> How did you run your read test?
>>>>
>>>
>>> I ran 100 parallel reader processes (dd) on top of an XFS file system; try
>>> this:
>>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
>>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>>>
>>> I also set 2048/4096 readahead sectors with blockdev --setra.
>>>
>>> You need 50-100 reader processes to hit this issue, I think. My kernel
>>> version is 2.6.20.3.
>>
>> In one xterm:
>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>
>> In another:
>> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
>
> Write and read files on top of XFS, not on the block device. $i isn't a
> typo: you should write 100 files, then read them back with 100 threads in
> parallel once the writes are done. I have 1GB of RAM; maybe you should use
> the mem= kernel parameter on boot.
>
>  1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
>  2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
>
> --
> d
>

I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
chipset. My max_sectors_kb is 128KB and my chunk size is 128KB; why do you
set max_sectors_kb less than the chunk size?

For read-ahead, there are some good benchmarks by SGI(?), I believe, and some
others that state 16MB is the best value; above that you lose on either reads
or writes, so 16MB appears to be the optimal overall value. Do these values
look good to you?

Read 100 files on XFS simultaneously:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 104 146680 460 3179400 0 0 138 17 77 161 1 0 99 0
0 0 104 146680 460 3179400 0 0 0 0 1095 190 0 0 100 0
16 84 104 149024 460 3162880 0 0 57492 25 1481 1284 2 10 88 1
0 88 104 148728 396 2982628 0 0 296904 8 2816 4728 1 48 0 51
0 88 104 147796 40 2894752 0 0 204080 0 2815 3507 0 8 0 92
0 87 104 149272 40 2845688 0 0 116200 0 2781 2836 0 6 0 94
0 87 104 147872 40 2770888 0 0 188260 0 2948 3043 0 9 0 91
0 87 104 148244 40 2583700 0 0 218440 0 2856 3218 0 12 0 88
1 87 104 147528 40 2385668 0 0 181352 0 3225 4211 1 9 0 90
0 87 104 147784 40 2343868 0 0 225716 0 3205 3725 0 9 0 91
0 88 104 146112 40 2233272 0 0 202968 8 3956 5669 0 9 0 91
0 88 104 146452 40 2325332 0 0 241196 0 4374 4927 0 7 0 93
0 88 104 147792 40 2463600 0 0 239788 0 3216 1139 1 5 0 94
1 88 104 148600 40 2468352 0 0 187488 0 3107 2883 0 6 0 94
0 88 104 147240 40 2553940 0 0 168236 0 3559 2795 0 5 0 95
0 87 104 147772 40 2682932 0 0 278092 0 5531 7507 0 9 0 91
0 87 104 148200 40 2813376 0 0 201324 0 4346 5726 0 6 0 94
1 87 104 148196 40 2774932 0 0 193620 0 5433 11595 0 10 0 90
0 87 104 148488 40 2785256 0 0 227568 0 4081 6121 0 11 0 89
0 87 104 148240 40 2798024 0 0 201800 0 5064 10857 1 11 0 89
0 87 104 148684 40 2825632 0 0 206564 20 5244 9451 0 10 0 90
2 86 104 149060 40 2899464 0 0 182584 0 6610 15984 0 10 0 90
0 86 104 148004 40 2925044 0 0 167044 0 5725 13011 0 13 0 87
0 84 104 148344 40 3025508 0 0 149640 0 6860 18340 0 9 0 91
0 81 104 146152 40 3074460 0 0 167780 0 7572 19434 1 16 0 84
0 79 104 149172 40 3127748 0 0 191064 0 5373 11928 0 10 0 90
0 77 104 146972 40 3101440 0 0 135968 0 9232 28208 1 18 0 82
0 70 104 148896 40 3081860 0 0 144372 0 7790 22857 0 16 0 84
0 69 104 148492 40 3173532 0 0 154024 0 8002 22211 0 10 0 90
1 68 104 145804 40 3192788 0 0 111412 0 9612 30604 0 14 0 86
0 66 104 147324 40 3173004 0 0 137800 0 9858 31132 0 13 0 87
0 65 104 145456 40 3121436 0 0 141348 0 6521 17443 1 13 0 87
1 64 104 148408 40 3154056 0 0 149308 0 9575 29755 0 13 0 87
0 61 104 145396 40 3187312 0 0 161976 0 9709 29021 0 13 0 87
0 61 104 147860 40 3189972 0 0 154844 0 7345 20033 0 10 0 90
2 61 104 148264 40 3162276 0 0 135804 0 10865 33643 0 15 1 85
0 59 104 147264 40 3203068 0 0 70464 0 9340 29345 0 11 0 89
1 58 104 147344 40 3203468 0 0 108072 48 12206 39011 0 15 0 85
0 58 104 146696 40 3204660 0 0 102492 32 12800 43578 0 15 0 85
1 58 104 145712 40 3201916 0 0 198728 8 10537 34621 0 15 0 85
2 56 104 148732 40 3195160 0 0 119700 0 10647 35077 1 15 0 84
0 54 104 148628 40 3203856 0 0 92588 0 8757 26257 0 12 0 88
0 54 104 147128 40 3203956 0 0 207224 0 9706 29997 0 14 0 86
1 54 104 146788 40 3165980 0 0 150120 36 10611 36721 0 17 0 83
1 52 104 146984 40 3121696 0 0 160080 8 8714 28264 0 14 0 86
0 51 104 147556 40 3123932 0 0 223940 0 6088 15013 0 13 0 87
0 50 104 147268 40 3172724 0 0 154592 0 8949 26645 0 13 0 87
0 46 104 148692 40 3135296 0 0 153708 0 7146 19844 1 15 0 85
0 45 104 148116 40 3168180 0 0 165740 20 7884 22111 0 14 0 86
0 43 104 147388 40 3209160 0 0 179588 8 9244 27501 0 14 0 86
0 41 104 147760 40 3208860 0 0 131216 24 8205 24867 0 10 0 90
0 41 104 148868 40 3122048 0 0 188260 0 9812 30872 0 14 0 86
1 40 104 146832 40 3210452 0 0 168388 0 8798 27378 0 12 0 88
1 39 104 146388 40 3211436 0 0 96044 0 13197 45162 0 15 2 83
0 37 104 144716 40 3213412 0 0 151044 12 9877 30263 1 14 0 85
0 37 104 146056 40 3212088 0 0 95988 27 13179 44933 0 15 5 80
3 36 104 146776 40 3211552 0 0 66580 0 13651 46636 0 16 9 76
1 37 104 146504 40 3212192 0 0 77196 0 16353 57091 0 18 0 82
1 37 104 146864 40 3211648 0 0 135312 0 13350 45859 0 17 1 83
2 35 104 148924 40 3209796 0 0 81228 0 13459 43748 0 15 0 85
1 36 104 147792 40 3211332 0 0 107248 0 15935 57007 0 21 0 79
2 35 104 146916 40 3212444 0 0 84328 0 16471 57611 0 18 0 82
0 35 104 146500 40 3212944 0 0 98524 0 16049 55775 0 19 0 81
0 34 104 148448 40 3211220 0 0 114116 0 14166 51140 1 17 2 81
0 32 104 145220 40 3215024 0 0 96068 0 16008 58012 0 21 0 79
1 29 104 145376 40 3215416 0 0 102604 0 12294 41704 0 16 1 84
2 27 104 146764 40 3214384 0 0 132004 0 13076 44026 1 15 3 82
0 27 104 148888 40 3212460 0 0 118908 0 16100 57546 0 19 0 81
0 26 104 144932 40 3216260 0 0 101220 24 15687 55236 0 20 2 78
1 24 104 147848 40 3214344 0 0 125152 8 13737 48996 0 19 0 81
2 21 104 147024 40 3215676 0 0 71300 0 18828 65151 0 22 0 78
1 22 104 145124 40 3217852 0 0 85228 0 19031 65807 1 20 0 80
2 21 104 147472 40 3215532 0 0 96188 0 16644 60505 0 20 0 80
2 18 104 146064 40 3217544 0 0 83316 0 16788 60762 1 21 0 79
6 13 104 145824 40 3218832 0 0 56188 0 15100 51145 0 16 1 83
0 6 104 147988 40 3218048 0 0 75836 20 19925 69456 0 22 0 78
0 0 104 148592 40 3218292 0 0 20328 0 6190 17793 0 6 70 24
0 0 104 148796 40 3218392 0 0 36 0 1199 372 1 0 98 1
0 0 104 148796 40 3218392 0 0 0 0 1090 192 0 0 100 0
0 0 104 149044 40 3218420 0 0 16 0 1110 349 0 0 100 0
0 0 104 149044 40 3218420 0 0 0 17 1105 212 0 0 100 0
0 0 104 149308 40 3218436 0 0 0 0 1143 317 0 0 100 0

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 10:23 ` Justin Piszcz
@ 2007-04-22 11:38 ` Pallai Roland
  2007-04-22 11:42   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 11:38 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]

On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> >> On Sun, 22 Apr 2007, Pallai Roland wrote:
> >>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>>> How did you run your read test?
> >>>
> >>> I ran 100 parallel reader processes (dd) on top of an XFS file system;
> >>> try this:
> >>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >>>
> >>> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
> >>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >>>
> >>> I also set 2048/4096 readahead sectors with blockdev --setra.
> >>>
> >>> You need 50-100 reader processes to hit this issue, I think. My kernel
> >>> version is 2.6.20.3.
> >>
> >> In one xterm:
> >> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>
> >> In another:
> >> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
> >
> > Write and read files on top of XFS, not on the block device. $i isn't a
> > typo: you should write 100 files, then read them back with 100 threads in
> > parallel once the writes are done. I have 1GB of RAM; maybe you should use
> > the mem= kernel parameter on boot.
> >
> >  1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100
> >     2>/dev/null; done
> >  2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
> chipset. My max_sectors_kb is 128KB and my chunk size is 128KB; why do you
> set max_sectors_kb less than the chunk size?

 It's the maximum on Marvell SATA chips under Linux -- maybe a hardware
limitation. I would have used a 128KB chunk, but I hit this issue.

> For read-ahead, there are some good benchmarks by SGI(?), I believe, and
> some others that state 16MB is the best value; above that you lose on
> either reads or writes, so 16MB appears to be the optimal overall value.
> Do these values look good to you?

 Where can I find this benchmark? I have done some tests on this topic, too.
I think the optimal readahead size always depends on the number of
sequentially reading processes and the available RAM. If you have 100
processes and 1GB of RAM, the maximum useful readahead is about 5-6MB; if you
set it bigger, that turns into readahead thrashing and undesirable context
switches. Anyway, I tried 16MB now, but the readahead size doesn't matter for
this bug: the same context switch storm appears with any readahead window
size.

> Read 100 files on XFS simultaneously:

 Is max_sectors_kb 128KB here? I think so. I see some anomaly, but maybe you
just have too big a readahead window for so many processes; it's not the bug
I'm talking about in my original post. Your interrupt and context switch
counts build up slowly, which may be a sign of readahead thrashing. In my
case the CS storm began in the first second and there was no high interrupt
count:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free buff  cache si so     bi bo   in    cs us sy  id wa
 0  0    0   7220    0 940972  0  0      0  0  256    20  0  0 100  0
 0 13    0 383636    0 535520  0  0 144904 32 2804 63834  1 42   0 57
24 20    0 353312    0 558200  0  0 121524  0 2669 67604  1 40   0 59
15 21    0 314808    0 557068  0  0  91300 33 2572 53442  0 29   0 71

 I have attached a small kernel patch; with it you can measure the readahead
thrashing ratio (see the tail of /proc/vmstat). I think it's a handy tool for
finding the optimal readahead size. And if you're interested in the bug I'm
talking about, set max_sectors_kb to 64KB.


--
 d

[-- Attachment #2: 01_2618+_rathr-d1.diff --]
[-- Type: text/x-diff, Size: 868 bytes --]

--- linux-2.6.18.2/include/linux/vmstat.h.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/include/linux/vmstat.h	2006-11-06 02:09:25.000000000 +0100
@@ -30,6 +30,7 @@
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		RATHRASHED,
 		NR_VM_EVENT_ITEMS
 };
--- linux-2.6.18.2/mm/vmstat.c.orig	2006-11-06 01:55:58.000000000 +0100
+++ linux-2.6.18.2/mm/vmstat.c	2006-11-06 02:05:14.000000000 +0100
@@ -502,6 +502,8 @@
 	"allocstall",
 
 	"pgrotated",
+
+	"rathrashed",
 #endif
 };
--- linux-2.6.18.2/mm/readahead.c.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/mm/readahead.c	2006-11-06 02:13:12.000000000 +0100
@@ -568,6 +568,7 @@
 	ra->flags |= RA_FLAG_MISS;
 	ra->flags &= ~RA_FLAG_INCACHE;
 	ra->cache_hit = 0;
+	count_vm_event(RATHRASHED);
 }
 
 /*
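
With a patch like the one above applied, the new counter appears as an
ordinary name/value line in /proc/vmstat. As a rough illustration of how it
could be read back (the parsing is generic; only the "rathrashed" name comes
from the patch, and this helper is not part of the thread):

/* Print the "rathrashed" counter exposed by the patch above.
 * Assumes the usual "name value" line format of /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[64];
	unsigned long long value;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &value) == 2) {
		if (strcmp(name, "rathrashed") == 0)
			printf("thrashed readahead pages: %llu\n", value);
	}
	fclose(f);
	return 0;
}

Sampling this before and after a test run gives the "thrashed readahead
pages" figures quoted earlier in the thread.
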

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 11:38 ` Pallai Roland
@ 2007-04-22 11:42 ` Justin Piszcz
  2007-04-22 14:38   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 11:42 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>> On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
>>>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>>>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>>>>>> How did you run your read test?
>>>>>
>>>>> I ran 100 parallel reader processes (dd) on top of an XFS file system;
>>>>> try this:
>>>>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>>>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>>>
>>>>> and don't forget to set max_sectors_kb below the chunk size (eg. 64/128KB):
>>>>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>>>>>
>>>>> I also set 2048/4096 readahead sectors with blockdev --setra.
>>>>>
>>>>> You need 50-100 reader processes to hit this issue, I think. My kernel
>>>>> version is 2.6.20.3.
>>>>
>>>> In one xterm:
>>>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>>>
>>>> In another:
>>>> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
>>>
>>> Write and read files on top of XFS, not on the block device. $i isn't a
>>> typo: you should write 100 files, then read them back with 100 threads in
>>> parallel once the writes are done. I have 1GB of RAM; maybe you should use
>>> the mem= kernel parameter on boot.
>>>
>>>  1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
>>>  2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>
>> I use a combination of 4 Silicon Image controllers (SiI) and the Intel 965
>> chipset. My max_sectors_kb is 128KB and my chunk size is 128KB; why do you
>> set max_sectors_kb less than the chunk size?
>
> It's the maximum on Marvell SATA chips under Linux -- maybe a hardware
> limitation. I would have used a 128KB chunk, but I hit this issue.
>
>> For read-ahead, there are some good benchmarks by SGI(?), I believe, and
>> some others that state 16MB is the best value; above that you lose on
>> either reads or writes, so 16MB appears to be the optimal overall value.
>> Do these values look good to you?
>
> Where can I find this benchmark?

http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
Check page 13 of 20.

> I have done some tests on this topic, too. I think the optimal readahead
> size always depends on the number of sequentially reading processes and the
> available RAM. If you have 100 processes and 1GB of RAM, the maximum useful
> readahead is about 5-6MB; if you set it bigger, that turns into readahead
> thrashing and undesirable context switches. Anyway, I tried 16MB now, but
> the readahead size doesn't matter for this bug: the same context switch
> storm appears with any readahead window size.
>
>> Read 100 files on XFS simultaneously:
>
> Is max_sectors_kb 128KB here? I think so. I see some anomaly, but maybe you
> just have too big a readahead window for so many processes; it's not the
> bug I'm talking about in my original post. Your interrupt and context
> switch counts build up slowly, which may be a sign of readahead thrashing.
> In my case the CS storm began in the first second and there was no high
> interrupt count:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b swpd   free buff  cache si so     bi bo   in    cs us sy  id wa
>  0  0    0   7220    0 940972  0  0      0  0  256    20  0  0 100  0
>  0 13    0 383636    0 535520  0  0 144904 32 2804 63834  1 42   0 57
> 24 20    0 353312    0 558200  0  0 121524  0 2669 67604  1 40   0 59
> 15 21    0 314808    0 557068  0  0  91300 33 2572 53442  0 29   0 71
>
> I have attached a small kernel patch; with it you can measure the readahead
> thrashing ratio (see the tail of /proc/vmstat). I think it's a handy tool
> for finding the optimal readahead size. And if you're interested in the bug
> I'm talking about, set max_sectors_kb to 64KB.
>
>
> --
> d
>

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 11:42 ` Justin Piszcz
@ 2007-04-22 14:38 ` Pallai Roland
  2007-04-22 14:48   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 14:38 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

[-- Attachment #1: Type: text/plain, Size: 1334 bytes --]

On Sunday 22 April 2007 13:42:43 Justin Piszcz wrote:
> http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
> Check page 13 of 20.

 Thanks, an interesting presentation. I'm working in the same area now: big
media files and many clients. I spent some days building a low-cost,
high-performance server. In my experience, some results of this presentation
can't be applied to recent kernels.

 It's off-topic in this thread, sorry, but I like to show off what can be
done with Linux! :)

ASUS P5B-E Plus, P4 641, 1024MB RAM, 6 disks on the 965P's south bridge,
1 disk on the JMicron (both driven by the AHCI driver), 1 disk on a Silicon
Image 3132, 8 disks on a HPT2320 (HighPoint's driver). 16x Seagate 500GB,
16MB cache.
 kernel 2.6.20.3
 anticipatory scheduler
 chunk size 64KB
 XFS file system
 file size is 400MB, I read 200 of them in each test

The yellow points mark the thrashing thresholds, which I computed from the
number of processes and the RAM size. It's not an exact threshold.

 - now see the attached picture :)

Awesome performance, near disk-platter speed with a big readahead! It gets
even better, by about 15%, if I use the -mm tree with the new adaptive
readahead! Bigger files and a bigger chunk also help, but in my case those
are fixed (unfortunately).

The rule for readahead size is simple: the more the better, as long as there
is no thrashing.


--
 d

[-- Attachment #2: r1.png --]
[-- Type: image/png, Size: 8277 bytes --]
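
The exact formula behind those thresholds is not given in the thread. One
plausible back-of-the-envelope estimate, consistent with the earlier remark
that 100 processes and 1GB of RAM allow roughly 5-6MB of readahead, is to
divide part of the cacheable RAM among the concurrent readers; the halving
used below is an assumption, not the author's actual calculation.

#include <stdio.h>

/* Rough, assumed estimate of the largest per-stream readahead (in bytes)
 * that still fits in the page cache when nreaders streams are active. */
static unsigned long long ra_thrash_threshold(unsigned long long cache_bytes,
                                              unsigned int nreaders)
{
	/* keep half the cache as headroom, split the rest among the readers */
	return cache_bytes / 2 / nreaders;
}

int main(void)
{
	/* ~1GB of RAM and 100 readers -> about 5MB per stream, in the same
	 * ballpark as the 5-6MB figure quoted earlier in the thread */
	printf("%llu bytes\n", ra_thrash_threshold(1024ULL << 20, 100));
	return 0;
}
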

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 14:38 ` Pallai Roland
@ 2007-04-22 14:48 ` Justin Piszcz
  2007-04-22 15:09   ` Pallai Roland
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 14:48 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 13:42:43 Justin Piszcz wrote:
>> http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
>> Check page 13 of 20.
> Thanks, an interesting presentation. I'm working in the same area now: big
> media files and many clients. I spent some days building a low-cost,
> high-performance server. In my experience, some results of this presentation
> can't be applied to recent kernels.
>
> It's off-topic in this thread, sorry, but I like to show off what can be
> done with Linux! :)
>
> ASUS P5B-E Plus, P4 641, 1024MB RAM, 6 disks on the 965P's south bridge,
> 1 disk on the JMicron (both driven by the AHCI driver), 1 disk on a Silicon
> Image 3132, 8 disks on a HPT2320 (HighPoint's driver). 16x Seagate 500GB,
> 16MB cache.
>  kernel 2.6.20.3
>  anticipatory scheduler
>  chunk size 64KB
>  XFS file system
>  file size is 400MB, I read 200 of them in each test
>
> The yellow points mark the thrashing thresholds, which I computed from the
> number of processes and the RAM size. It's not an exact threshold.
>
> - now see the attached picture :)
>
> Awesome performance, near disk-platter speed with a big readahead! It gets
> even better, by about 15%, if I use the -mm tree with the new adaptive
> readahead! Bigger files and a bigger chunk also help, but in my case those
> are fixed (unfortunately).
>
> The rule for readahead size is simple: the more the better, as long as there
> is no thrashing.
>
>
> --
> d
>

Have you also optimized your stripe cache for writes?

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 14:48 ` Justin Piszcz
@ 2007-04-22 15:09 ` Pallai Roland
  2007-04-22 15:53   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Pallai Roland @ 2007-04-22 15:09 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Linux RAID Mailing List

On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
> Have you also optimized your stripe cache for writes?

 Not yet. Is it worth it?


--
 d

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 15:09 ` Pallai Roland
@ 2007-04-22 15:53 ` Justin Piszcz
  2007-04-22 19:01   ` Mr. James W. Laferriere
  0 siblings, 1 reply; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 15:53 UTC (permalink / raw)
To: Pallai Roland; +Cc: Linux RAID Mailing List

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
>> Have you also optimized your stripe cache for writes?
> Not yet. Is it worth it?
>
>
> --
> d
>

Yes, it is -- well, if write speed is important to you, that is. Each of
these write tests is averaged over three runs:

128k_stripe:   69.2MB/s
256k_stripe:  105.3MB/s
512k_stripe:  142.0MB/s
1024k_stripe: 144.6MB/s
2048k_stripe: 208.3MB/s
4096k_stripe: 223.6MB/s
8192k_stripe: 226.0MB/s
16384k_stripe: 215.0MB/s

What is your /sys/block/md[0-9]/md/stripe_cache_size?

Justin.
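
For anyone scripting this tunable, stripe_cache_size is an ordinary writable
sysfs file. A minimal sketch of setting it from C follows; the md0 device
name and the value 8192 are placeholders, and the memory-cost comment assumes
4KB pages and an 8-disk array rather than anything stated in this thread.

/* Write a new value into /sys/block/md0/md/stripe_cache_size. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/md0/md/stripe_cache_size";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* 8192 entries; each entry costs one page per member disk, so on an
	 * 8-disk array with 4KB pages this is roughly 256MB of RAM. */
	fprintf(f, "8192\n");
	fclose(f);
	return 0;
}

The usual trade-off is that a larger stripe cache helps sequential RAID5
writes (as in the numbers above) at the cost of pinned kernel memory.
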

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 15:53 ` Justin Piszcz
@ 2007-04-22 19:01 ` Mr. James W. Laferriere
  2007-04-22 20:35   ` Justin Piszcz
  0 siblings, 1 reply; 15+ messages in thread
From: Mr. James W. Laferriere @ 2007-04-22 19:01 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Pallai Roland, Linux RAID Mailing List

	Hello Justin,

On Sun, 22 Apr 2007, Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
>> On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
>>> Have you also optimized your stripe cache for writes?
>> Not yet. Is it worth it?
>> --
>> d
> Yes, it is -- well, if write speed is important to you, that is. Each of
> these write tests is averaged over three runs:
>
> 128k_stripe:   69.2MB/s
> 256k_stripe:  105.3MB/s
> 512k_stripe:  142.0MB/s
> 1024k_stripe: 144.6MB/s
> 2048k_stripe: 208.3MB/s
> 4096k_stripe: 223.6MB/s
> 8192k_stripe: 226.0MB/s
> 16384k_stripe: 215.0MB/s
>
> What is your /sys/block/md[0-9]/md/stripe_cache_size?

	On which filesystem type? ie: ext3, xfs, reiser, ...

	Tia, JimL
--
+----------------------------------------------------------------+
| James W. Laferriere     | System Techniques    | Give me VMS   |
| Network Engineer        | 663 Beaumont Blvd    | Give me Linux |
| babydr@baby-dragons.com | Pacifica, CA. 94044  | only on AXP   |
+----------------------------------------------------------------+

* Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
  2007-04-22 19:01 ` Mr. James W. Laferriere
@ 2007-04-22 20:35 ` Justin Piszcz
  0 siblings, 0 replies; 15+ messages in thread
From: Justin Piszcz @ 2007-04-22 20:35 UTC (permalink / raw)
To: Mr. James W. Laferriere; +Cc: Pallai Roland, Linux RAID Mailing List

On Sun, 22 Apr 2007, Mr. James W. Laferriere wrote:

> 	Hello Justin,
>
> On Sun, 22 Apr 2007, Justin Piszcz wrote:
>> On Sun, 22 Apr 2007, Pallai Roland wrote:
>>> On Sunday 22 April 2007 16:48:11 Justin Piszcz wrote:
>>>> Have you also optimized your stripe cache for writes?
>>> Not yet. Is it worth it?
>>> --
>>> d
>> Yes, it is -- well, if write speed is important to you, that is. Each of
>> these write tests is averaged over three runs:
>>
>> 128k_stripe:   69.2MB/s
>> 256k_stripe:  105.3MB/s
>> 512k_stripe:  142.0MB/s
>> 1024k_stripe: 144.6MB/s
>> 2048k_stripe: 208.3MB/s
>> 4096k_stripe: 223.6MB/s
>> 8192k_stripe: 226.0MB/s
>> 16384k_stripe: 215.0MB/s
>>
>> What is your /sys/block/md[0-9]/md/stripe_cache_size?
>
> 	On which filesystem type? ie: ext3, xfs, reiser, ...
>
> 	Tia, JimL
> --
> +----------------------------------------------------------------+
> | James W. Laferriere     | System Techniques    | Give me VMS   |
> | Network Engineer        | 663 Beaumont Blvd    | Give me Linux |
> | babydr@baby-dragons.com | Pacifica, CA. 94044  | only on AXP   |
> +----------------------------------------------------------------+
>

XFS
Thread overview: 15+ messages
-- links below jump to the message on this page --
2007-04-20 21:06 major performance drop on raid5 due to context switches caused by small max_hw_sectors Pallai Roland
[not found] ` <5d96567b0704202247s60e4f2f1x19511f790f597ea0@mail.gmail.com>
2007-04-21 19:32 ` major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved] Pallai Roland
2007-04-22 0:18 ` Justin Piszcz
2007-04-22 0:42 ` Pallai Roland
2007-04-22 8:47 ` Justin Piszcz
2007-04-22 9:52 ` Pallai Roland
2007-04-22 10:23 ` Justin Piszcz
2007-04-22 11:38 ` Pallai Roland
2007-04-22 11:42 ` Justin Piszcz
2007-04-22 14:38 ` Pallai Roland
2007-04-22 14:48 ` Justin Piszcz
2007-04-22 15:09 ` Pallai Roland
2007-04-22 15:53 ` Justin Piszcz
2007-04-22 19:01 ` Mr. James W. Laferriere
2007-04-22 20:35 ` Justin Piszcz