From: Neil Brown <neilb@suse.de>
To: Joe Williams <jwilliams315@gmail.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: increasing stripe_cache_size decreases RAID-6 read throughput
Date: Tue, 27 Apr 2010 16:41:26 +1000
Message-ID: <20100427164126.2765f9e0@notabene.brown>
In-Reply-To: <h2y11f0870e1004241636z1f3e302g913be494ec0aefa5@mail.gmail.com>
On Sat, 24 Apr 2010 16:36:20 -0700
Joe Williams <jwilliams315@gmail.com> wrote:
> I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
> TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
> parameters, including a 512 KB chunk size. It took about 6 hours to
> initialize, then I created an XFS filesystem:
>
> # mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
> meta-data=/dev/md0              isize=256    agcount=32, agsize=45776384 blks
>          =                      sectsz=512   attr=2
> data     =                      bsize=4096   blocks=1464843648, imaxpct=5
>          =                      sunit=128    swidth=384 blks
> naming   =version 2             bsize=4096   ascii-ci=0
> log      =internal log          bsize=4096   blocks=521728, version=2
>          =                      sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                  extsz=4096   blocks=0, rtextents=0
>
> Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
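(Just to spell out where those numbers come from: su is the chunk size and
sw is the number of data disks, which for RAID-6 is total disks minus 2.
Schematically, for an N-disk RAID-6 with chunk size C - C, N and /dev/mdX
being placeholders here, not a tested command - that is:

  # mkfs.xfs -d su=${C}k,sw=$((N - 2)) -l su=256k -l lazy-count=1 /dev/mdX

Your command above already has exactly that: su=512k, sw=5-2=3.)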
>
> Then it was time to optimize the performance. First I ran a benchmark
> with the default settings (from a recent Arch linux install) for the
> following parameters:
>
> # cat /sys/block/md0/md/stripe_cache_size
> 256
>
> # cat /sys/block/md0/queue/read_ahead_kb
> 3072
2 full stripes - that is right.
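(The arithmetic: 3 data disks x 512K chunk = 1536K per full stripe, and
2 x 1536K = 3072K.  A quick shell check - data_disks and chunk_kb are just
placeholder names:

  # data_disks=3; chunk_kb=512
  # echo $(( 2 * data_disks * chunk_kb ))
  3072
)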
> # cat /sys/block/sdb/queue/read_ahead_kb
> 128
This number is completely irrelevant. Only the read_ahead_kb of the device
that the filesystem sees is used.
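So if you want to experiment with read-ahead, change it on the md device
itself.  An untested sketch - the value is only an example (4 full stripes):

  # echo 6144 > /sys/block/md0/queue/read_ahead_kb

or equivalently via blockdev, which takes 512-byte sectors:

  # blockdev --setra 12288 /dev/md0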
>
> # cat /sys/block/md0/queue/scheduler
> none
> # cat /sys/block/sdb/queue/scheduler
> noop deadline [cfq]
>
> # cat /sys/block/md0/queue/nr_requests
> 128
> # cat /sys/block/sdb/queue/nr_requests
> 128
>
> # cat /sys/block/md0/device/queue_depth
> cat: /sys/block/md0/device/queue_depth: No such file or directory
> # cat /sys/block/sdb/device/queue_depth
> 31
>
> # cat /sys/block/md0/queue/max_sectors_kb
> 127
> # cat /sys/block/sdb/queue/max_sectors_kb
> 512
>
> Note that sdb is one of the 5 drives for the RAID volume, and the
> other 4 have the same settings.
>
> First question, is it normal for the md0 scheduler to be "none"? I
> cannot change it by writing, eg., "deadline" into the file.
>
Because a software-RAID device is not a disk drive, it does not use an
elevator and so does not use a scheduler.
The whole 'queue' directory really shouldn't appear for md devices but for
some very boring reasons it does.
> Next question, is it normal for md0 to have no queue_depth setting?
Yes.  The stripe_cache_size is conceptually a similar thing, but only
at a very abstract level.
>
> Are there any other parameters that are important to performance that
> I should be looking at?
No.
>
> I started the kernel with mem=1024M so that the buffer cache wasn't
> too large (this machine has 72G of RAM), and ran an iozone benchmark:
>
> Iozone: Performance Test of File I/O
> Version $Revision: 3.338 $
> Compiled for 64 bit mode.
> Build: linux-AMD64
>
> Auto Mode
> Using Minimum Record Size 64 KB
> Using Maximum Record Size 16384 KB
> File size set to 4194304 KB
> Include fsync in write timing
> Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
> Output is in Kbytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>                                                                random   random
>               KB  reclen    write  rewrite     read   reread     read    write
>          4194304      64   133608   114920   191367   191559     7772    14718
>          4194304     128   142748   113722   165832   161023    14055    20728
>          4194304     256   127493   108110   165142   175396    24156    23300
>          4194304     512   136022   112711   171146   165466    36147    25698
>          4194304    1024   140618   110196   153134   148925    57498    39864
>          4194304    2048   137110   108872   177201   193416    98759    50106
>          4194304    4096   138723   113352   130858   129940    78636    64615
>          4194304    8192   140100   114089   175240   168807   109858    84656
>          4194304   16384   130633   116475   131867   142958   115147   102795
>
>
> I was expecting a little faster sequential reads, but 191 MB/s is not
> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> record sizes.
I don't know why it would decrease either.  For sequential reads, read-ahead
should be scheduling all the read requests and the actual reads should just
be waiting for the read-ahead to complete.  So there shouldn't be any
variability, but clearly there is.  I wonder if it is an XFS thing...
care to try a different filesystem for comparison? ext3?
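Something like the following would do for that comparison - completely
untested, it will of course destroy the existing XFS filesystem so only do
it on a scratch array, and the stride/stripe-width values simply mirror
your 512K chunk and 3 data disks, expressed in 4K blocks:

  # mkfs.ext3 -E stride=128,stripe-width=384 /dev/md0
  # mount /dev/md0 /mnt/test
  # iozone -a -y64K -q16M -s4G -e -f /mnt/test/iotest -i0 -i1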
>
> But the writes were disappointing. So the first thing I tried tuning
> was stripe_cache_size
>
> # echo 16384 > /sys/block/md0/md/stripe_cache_size
>
> I re-ran the iozone benchmark:
>
>                                                                random   random
>               KB  reclen    write  rewrite     read   reread     read    write
>          4194304      64   219206   264113   104751   108118     7240    12372
>          4194304     128   232713   255337   153990   142872    13209    21979
>          4194304     256   229446   242155   132753   131009    20858    32286
>          4194304     512   236389   245713   144280   149283    32024    44119
>          4194304    1024   234205   243135   141243   141604    53539    70459
>          4194304    2048   219163   224379   134043   131765    84428    90394
>          4194304    4096   226037   225588   143682   146620    60171   125360
>          4194304    8192   214487   231506   135311   140918    78868   156935
>          4194304   16384   210671   215078   138466   129098    96340   178073
>
> And now the sequential writes are quite satisfactory, but the reads
> are low. Next I tried 2560 for stripe_cache_size, since that is the
> 512KB x 5 stripe width.
That is very weird, as reads don't use the stripe cache at all - provided
the array is not degraded and there are no overlapping writes in flight.
Also, stripe_cache_size is measured in pages-per-device.  So 2560 means
2560*4K for each device.  With 3 data devices that covers 30720K of data,
or 20 full stripes.
When you set stripe_cache_size to 16384, it would have consumed
  16384*5*4K == 320Meg
or about 1/3 of your available RAM.  This might have affected throughput,
I'm not sure.
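For reference, the memory cost is roughly
  stripe_cache_size * page_size * number_of_devices
so from the shell (assuming 4K pages and your 5-device array):

  # echo $(( 256   * 4 * 5 ))   # the default: 5120K, about 5M
  # echo $(( 2560  * 4 * 5 ))   # 51200K, about 50M
  # echo $(( 16384 * 4 * 5 ))   # 327680K, about 320M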
> So the sequential reads at 200+ MB/s look okay (although I do not
> understand the huge throughput variability with record size), but the
> writes are not as high as with 16MB stripe cache. This may be the
> setting that I decide to stick with, but I would like to understand
> what is going on.
> Why did increasing the stripe cache from 256 KB to 16 MB decrease the
> sequential read speeds?
The only reason I can guess at is that you actually changed it from
5M to 320M, and maybe that affected available buffer memory?
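If you want to test that theory, watching memory while the read pass runs
should show it.  Untested sketch - the vmstat interval and iozone record
size are arbitrary:

  # echo 16384 > /sys/block/md0/md/stripe_cache_size
  # vmstat 5 > vmstat-16384.log &
  # iozone -s4G -r1024 -e -i0 -i1 -f iotest

then repeat with stripe_cache_size back at 256 and compare the 'free' and
'cache' columns between the two runs.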
NeilBrown