From mboxrd@z Thu Jan 1 00:00:00 1970
From: Justin Piszcz
Subject: Re: major performance drop on raid5 due to context switches caused by small max_hw_sectors [partially resolved]
Date: Sun, 22 Apr 2007 04:47:59 -0400 (EDT)
Message-ID: 
References: <200704202306.14880.dap@mail.index.hu>
	<200704212132.13775.dap@mail.index.hu>
	<200704220242.42285.dap@mail.index.hu>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Return-path: 
In-Reply-To: <200704220242.42285.dap@mail.index.hu>
Sender: linux-raid-owner@vger.kernel.org
To: Pallai Roland
Cc: Linux RAID Mailing List
List-Id: linux-raid.ids

On Sun, 22 Apr 2007, Pallai Roland wrote:

>
> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
>> On Sat, 21 Apr 2007, Pallai Roland wrote:
>>>
>>> RAID5, chunk size 128k:
>>>
>>> # mdadm -C -n8 -l5 -c128 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>>> (waiting for sync, then mount, mkfs, etc)
>>> # blockdev --setra 4096 /dev/md/0
>>> # ./readtest &
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r  b swpd   free buff  cache si so    bi bo   in    cs us sy id wa
>>> 91 10    0 432908    0 436572  0  0 99788 40 2925 50358  2 36  0 63
>>>  0 11    0 444184    0 435992  0  0 89996 32 4252 49303  1 31  0 68
>>> 45 11    0 446924    0 441024  0  0 88584  0 5748 58197  0 30  2 67
>>> - a context switch storm: only 10 of the 100 processes are working, and a
>>> lot of readahead pages are thrashed. I'm sure you can reproduce this with
>>> 64Kb max_sectors_kb and 2.6.20.x on *any* 8 disk-wide RAID5 array if the
>>> chunk size > max_sectors_kb:
>>> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
>>> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>>>
>>>
>>> RAID5, chunk size 64k (equal to max_hw_sectors):
>>>
>>> # mdadm -C -n8 -l5 -c64 -z 12000000 /dev/md/0 /dev/sd[ijklmnop]
>>> (waiting for sync, then mount, mkfs, etc)
>>> # blockdev --setra 4096 /dev/md/0
>>> # ./readtest &
>>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>>  r   b swpd   free buff  cache si so     bi bo   in   cs us sy id wa
>>>  1  99    0 309260    0 653000  0  0 309620  0 4521 2897  0 17  0 82
>>>  1  99    0 156436    0 721452  0  0 258072  0 4640 3168  0 14  0 86
>>>  0 100    0 244088    0 599888  0  0 258856  0 4703 3986  1 17  0 82
>>> - YES! It's MUCH better now! :)
>>>
>>>
>>> All in all, I use a 64Kb chunk now and I'm happy, but I think this is
>>> definitely a software bug. The sata_mv driver also doesn't allow a bigger
>>> max_sectors_kb on Marvell chips, so this is a performance killer for every
>>> Marvell user running 128k or bigger chunks on RAID5. If it's not a bug but
>>> just a limitation, the kernel should at least print a warning.
>>>
>>>
>>
>> How did you run your read test?
>>
>> $ sudo dd if=/dev/md3 of=/dev/null
>> Password:
>> 18868881+0 records in
>> 18868880+0 records out
>> 9660866560 bytes (9.7 GB) copied, 36.661 seconds, 264 MB/s
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>>  2  0    0 3007612 251068 86372  0  0 243732  0 3109  541 15 38 47  0
>>  1  0    0 3007724 282444 86344  0  0 260636  0 3152  619 14 38 48  0
>>  1  0    0 3007472 282600 86400  0  0 262188  0 3153  339 15 38 48  0
>>  1  0    0 3007432 282792 86360  0  0 262160 67 3197 1066 14 38 47  0
>>
>> However--
>>
>> $ sudo dd if=/dev/md3 of=/dev/null bs=8M
>> 763+0 records in
>> 762+0 records out
>> 6392119296 bytes (6.4 GB) copied, 14.0555 seconds, 455 MB/s
>>
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b swpd    free   buff cache si so     bi bo   in   cs us sy id wa
>>  0  1    0 2999592 282408 86388  0  0 434208  0 4556 1514  0 43 43 15
>>  1  0    0 2999892 262928 86552  0  0 439816 68 4568 2412  0 43 43 14
>>  1  1    0 2999952 281832 86532  0  0 444992  0 4604 1486  0 43 43 14
>>  1  1    0 2999708 282148 86456  0  0 458752  0 4642 1694  0 45 42 13
>
> I ran 100 parallel reader processes (dd) on top of an XFS file system; try this:
> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> And don't forget to set max_sectors_kb below the chunk size (e.g. 64/128Kb):
> /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
>
> I also set 2048/4096 readahead sectors with blockdev --setra.
>
> You need 50-100 reader processes to trigger this issue, I think. My kernel
> version is 2.6.20.3
>
>
> --
> d
>

In one xterm:
for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done

In another:
for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done

Results:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd   free   buff   cache si so    bi     bo   in   cs us sy id wa
 1 79  104  54908 169772 3096180  0  0   340 120944 3062 9220  0 37  0 63
 9 72  104  70588 155356 3095512  0  0   812 238550 3506 6692  1 80  0 20
 7 72  104  75352 154252 3092896  0  0  4172  73020 3262 1952  0 31  0 69
13 89  104  70036 157160 3092988  0  0  5224 410088 3754 6068  1 66  0 34
 1 10  104  66232 157160 3096784  0  0   128  10668 2834 1227  0 11  0 89
 0 83  104  69608 156008 3094468  0  0     0 227888 3017 1843  0 43  0 57
35 91  104  72556 153832 3094608  0  0     0 148968 3141 4651  0 45  0 55
 0 84  104  71332 152300 3097304  0  0     4 192556 3345 5716  1 45  0 55
 4 79  104  76288 150324 3096168  0  0   200  63940 3201 1518  0 34  0 66
 2 84  104  18796 210492 3093256  0  0 86360 276956 3500 4899  1 67  0 33
 1 89  104  21208 207296 3093372  0  0   148 219696 3473 3774  0 41  0 59
86 22  104  15236 210080 3097008  0  0  4968  91508 3313 9608  0 34  4 62
 0 81  104  17048 208536 3096756  0  0  2304 148384 3066  929  0 31  0 69
28 60  104  25696 204100 3093156  0  0   136 159520 3394 6210  0 41  0 59
 7 57  104  21788 207620 3095812  0  0 15808  64888 2880 3992  1 44  0 56
 0 85  104  23952 206576 3092812  0  0 24304 383716 4067 3535  0 66  0 34
 3 81  104  22468 204888 3096768  0  0   620 164092 3160 5136  0 37  0 63
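
For anyone who wants to check an existing array for this mismatch before it
bites, below is a rough, untested sketch along the lines of the commands
above. The array name (md0) and the member list are placeholders taken from
the example; it also assumes your kernel exposes the chunk size in bytes at
/sys/block/<md>/md/chunk_size and the per-disk limit at
/sys/block/<disk>/queue/max_sectors_kb:

#!/bin/sh
# Sketch: warn when the md chunk size is larger than a member disk's
# max_sectors_kb (the condition discussed in this thread).
# MD and MEMBERS are hypothetical placeholders; adjust them to your setup.
MD=md0
MEMBERS="sdi sdj sdk sdl sdm sdn sdo sdp"

chunk_bytes=`cat /sys/block/$MD/md/chunk_size`    # md reports bytes here
chunk_kb=`expr $chunk_bytes / 1024`

for d in $MEMBERS; do
    max_kb=`cat /sys/block/$d/queue/max_sectors_kb`
    if [ $chunk_kb -gt $max_kb ]; then
        echo "WARNING: $MD chunk ${chunk_kb}k > $d max_sectors_kb ${max_kb}k"
        echo "         consider a chunk size of ${max_kb}k or smaller"
    fi
done

Something like this could be run by hand right after creating the array,
until (or unless) the kernel itself grows the warning asked for above.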