* increasing stripe_cache_size decreases RAID-6 read throughput
From: Joe Williams @ 2010-04-24 23:36 UTC (permalink / raw)
To: linux-raid
I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
parameters, including a 512 KB chunk size. It took about 6 hours to
initialize, then I created an XFS filesystem:
# mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=45776384 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1464843648, imaxpct=5
         =                       sunit=128    swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
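(For reference, the su/sw values above just mirror the md geometry: su is the 512 KB chunk size and sw the number of data disks, i.e. 5 drives minus 2 for parity = 3. Something like this confirms the geometry:)
# mdadm --detail /dev/md0 | grep -E 'Raid Devices|Chunk Size'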
Then it was time to optimize the performance. First I ran a benchmark
with the default settings (from a recent Arch Linux install) for the
following parameters:
# cat /sys/block/md0/md/stripe_cache_size
256
# cat /sys/block/md0/queue/read_ahead_kb
3072
# cat /sys/block/sdb/queue/read_ahead_kb
128
# cat /sys/block/md0/queue/scheduler
none
# cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
# cat /sys/block/md0/queue/nr_requests
128
# cat /sys/block/sdb/queue/nr_requests
128
# cat /sys/block/md0/device/queue_depth
cat: /sys/block/md0/device/queue_depth: No such file or directory
# cat /sys/block/sdb/device/queue_depth
31
# cat /sys/block/md0/queue/max_sectors_kb
127
# cat /sys/block/sdb/queue/max_sectors_kb
512
Note that sdb is one of the 5 drives for the RAID volume, and the
other 4 have the same settings.
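(For convenience, a quick loop dumps the same settings for every member in one pass -- here I am assuming the five drives are sdb through sdf:)
# for d in sdb sdc sdd sde sdf; do echo "== $d =="; \
    cat /sys/block/$d/queue/read_ahead_kb \
        /sys/block/$d/queue/nr_requests \
        /sys/block/$d/queue/max_sectors_kb \
        /sys/block/$d/device/queue_depth; done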
First question, is it normal for the md0 scheduler to be "none"? I
cannot change it by writing, e.g., "deadline" into the file.
Next question, is it normal for md0 to have no queue_depth setting?
Are there any other parameters that are important to performance that
I should be looking at?
I started the kernel with mem=1024M so that the buffer cache wasn't
too large (this machine has 72G of RAM), and ran an iozone benchmark:
Iozone: Performance Test of File I/O
Version $Revision: 3.338 $
Compiled for 64 bit mode.
Build: linux-AMD64
Auto Mode
Using Minimum Record Size 64 KB
Using Maximum Record Size 16384 KB
File size set to 4194304 KB
Include fsync in write timing
Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random
KB reclen write rewrite read reread read write
4194304 64 133608 114920 191367 191559 7772 14718
4194304 128 142748 113722 165832 161023 14055 20728
4194304 256 127493 108110 165142 175396 24156 23300
4194304 512 136022 112711 171146 165466 36147 25698
4194304 1024 140618 110196 153134 148925 57498 39864
4194304 2048 137110 108872 177201 193416 98759 50106
4194304 4096 138723 113352 130858 129940 78636 64615
4194304 8192 140100 114089 175240 168807 109858 84656
4194304 16384 130633 116475 131867 142958 115147 102795
I was expecting a little faster sequential reads, but 191 MB/s is not
too bad. I'm not sure why it decreases to 130-131 MB/s at larger
record sizes.
But the writes were disappointing. So the first thing I tried tuning
was stripe_cache_size:
# echo 16384 > /sys/block/md0/md/stripe_cache_size
I re-ran the iozone benchmark:
random random
KB reclen write rewrite read reread read write
4194304 64 219206 264113 104751 108118 7240 12372
4194304 128 232713 255337 153990 142872 13209 21979
4194304 256 229446 242155 132753 131009 20858 32286
4194304 512 236389 245713 144280 149283 32024 44119
4194304 1024 234205 243135 141243 141604 53539 70459
4194304 2048 219163 224379 134043 131765 84428 90394
4194304 4096 226037 225588 143682 146620 60171 125360
4194304 8192 214487 231506 135311 140918 78868 156935
4194304 16384 210671 215078 138466 129098 96340 178073
And now the sequential writes are quite satisfactory, but the reads
are low. Next I tried 2560 for stripe_cache_size, since that is
random
random
KB reclen write rewrite read reread read write
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Joe Williams @ 2010-04-24 23:45 UTC (permalink / raw)
To: linux-raid
[end of last message was truncated, here is the rest]
Next I tried 2560 for stripe_cache_size, since that is the 512KB x 5
stripe width.
random random
KB reclen write rewrite read reread read write
4194304 64 201919 141025 139386 134252 7421 13327
4194304 128 194337 123513 237911 237901 13002 22758
4194304 256 181426 142159 256929 252772 21986 30099
4194304 512 183168 175516 234975 234090 32614 40375
4194304 1024 169051 163818 220393 233060 54738 58653
4194304 2048 173281 141452 237993 234881 95969 77678
4194304 4096 162690 142784 208838 211268 90016 96876
4194304 8192 151361 125652 197484 197278 124009 112708
4194304 16384 138971 106200 183750 183659 135876 121704
So the sequential reads at 200+ MB/s look okay (although I do not
understand the huge throughput variability with record size), but the
writes are not as high as with 16MB stripe cache. This may be the
setting that I decide to stick with, but I would like to understand
what is going on.
Why did increasing the stripe cache from 256 KB to 16 MB decrease the
sequential read speeds?
Also, let me know what other parameters I should tune during my optimizations.
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Neil Brown @ 2010-04-27 6:41 UTC (permalink / raw)
To: Joe Williams; +Cc: linux-raid
On Sat, 24 Apr 2010 16:36:20 -0700
Joe Williams <jwilliams315@gmail.com> wrote:
> I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
> TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
> parameters, including a 512 KB chunk size. It took about 6 hours to
> initialize, then I created an XFS filesystem:
>
> # mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
> meta-data=/dev/md0 isize=256 agcount=32, agsize=45776384 blks
> = sectsz=512 attr=2
> data = bsize=4096 blocks=1464843648, imaxpct=5
> = sunit=128 swidth=384 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal log bsize=4096 blocks=521728, version=2
> = sectsz=512 sunit=64 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
>
> Then it was time to optimize the performance. First I ran a benchmark
> with the default settings (from a recent Arch linux install) for the
> following parameters:
>
> # cat /sys/block/md0/md/stripe_cache_size
> 256
>
> # cat /sys/block/md0/queue/read_ahead_kb
> 3072
2 full stripes (3 data disks x 512K chunk = 1536K per stripe) - that is right.
> # cat /sys/block/sdb/queue/read_ahead_kb
> 128
This number is completely irrelevant. Only the read_ahead_kb of the device
that the filesystem sees is used.
>
> # cat /sys/block/md0/queue/scheduler
> none
> # cat /sys/block/sdb/queue/scheduler
> noop deadline [cfq]
>
> # cat /sys/block/md0/queue/nr_requests
> 128
> # cat /sys/block/sdb/queue/nr_requests
> 128
>
> # cat /sys/block/md0/device/queue_depth
> cat: /sys/block/md0/device/queue_depth: No such file or directory
> # cat /sys/block/sdb/device/queue_depth
> 31
>
> # cat /sys/block/md0/queue/max_sectors_kb
> 127
> # cat /sys/block/sdb/queue/max_sectors_kb
> 512
>
> Note that sdb is one of the 5 drives for the RAID volume, and the
> other 4 have the same settings.
>
> First question, is it normal for the md0 scheduler to be "none"? I
> cannot change it by writing, eg., "deadline" into the file.
>
Because software RAID is not a disk drive, it does not use an elevator and so
does not use a scheduler.
The whole 'queue' directory really shouldn't appear for md devices but for
some very boring reasons it does.
> Next question, is it normal for md0 to have no queue_depth setting?
Yes. The stripe_cache_size is conceptually a similar thing, but only
at a very abstract level.
>
> Are there any other parameters that are important to performance that
> I should be looking at?
No.
>
> I started the kernel with mem=1024M so that the buffer cache wasn't
> too large (this machine has 72G of RAM), and ran an iozone benchmark:
>
> Iozone: Performance Test of File I/O
> Version $Revision: 3.338 $
> Compiled for 64 bit mode.
> Build: linux-AMD64
>
> Auto Mode
> Using Minimum Record Size 64 KB
> Using Maximum Record Size 16384 KB
> File size set to 4194304 KB
> Include fsync in write timing
> Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
> Output is in Kbytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
> random random
> KB reclen write rewrite read reread read write
> 4194304 64 133608 114920 191367 191559 7772 14718
> 4194304 128 142748 113722 165832 161023 14055 20728
> 4194304 256 127493 108110 165142 175396 24156 23300
> 4194304 512 136022 112711 171146 165466 36147 25698
> 4194304 1024 140618 110196 153134 148925 57498 39864
> 4194304 2048 137110 108872 177201 193416 98759 50106
> 4194304 4096 138723 113352 130858 129940 78636 64615
> 4194304 8192 140100 114089 175240 168807 109858 84656
> 4194304 16384 130633 116475 131867 142958 115147 102795
>
>
> I was expecting a little faster sequential reads, but 191 MB/s is not
> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> record sizes.
I don't know why it would decrease either. For sequential reads, read-ahead
should be scheduling all the read requests and the actual reads should just
be waiting for the read-ahead to complete. So there shouldn't be any
variability - clearly there is. I wonder if it is an XFS thing....
care to try a different filesystem for comparison? ext3?
>
> But the writes were disappointing. So the first thing I tried tuning
> was stripe_cache_size
>
> # echo 16384 > /sys/block/md0/md/stripe_cache_size
>
> I re-ran the iozone benchmark:
>
> random random
> KB reclen write rewrite read reread read write
> 4194304 64 219206 264113 104751 108118 7240 12372
> 4194304 128 232713 255337 153990 142872 13209 21979
> 4194304 256 229446 242155 132753 131009 20858 32286
> 4194304 512 236389 245713 144280 149283 32024 44119
> 4194304 1024 234205 243135 141243 141604 53539 70459
> 4194304 2048 219163 224379 134043 131765 84428 90394
> 4194304 4096 226037 225588 143682 146620 60171 125360
> 4194304 8192 214487 231506 135311 140918 78868 156935
> 4194304 16384 210671 215078 138466 129098 96340 178073
>
> And now the sequential writes are quite satisfactory, but the reads
> are low. Next I tried 2560 for stripe_cache_size, since that is the 512KB x 5
> stripe width.
That is very weird, as reads don't use the stripe cache at all - when
the array is not degraded and no overlapping writes are happening.
And the stripe_cache is measured in pages-per-device. So 2560 means
2560*4k for each device. There are 3 data devices, so 30720K of data, or 20
full stripes (60 chunks).
When you set stripe_cache_size to 16384, it would have consumed
16384*5*4K == 320Meg
or 1/3 of your available RAM. This might have affected throughput,
I'm not sure.
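Spelled out as shell arithmetic (4K pages assumed):
# echo $((2560 * 4 * 3))KB      # 30720KB of cached data across the 3 data disks
# echo $((16384 * 4 * 5))KB     # 327680KB, i.e. ~320MB of RAM for all 5 devices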
> So the sequential reads at 200+ MB/s look okay (although I do not
> understand the huge throughput variability with record size), but the
> writes are not as high as with 16MB stripe cache. This may be the
> setting that I decide to stick with, but I would like to understand
> what is going on.
> Why did increasing the stripe cache from 256 KB to 16 MB decrease the
> sequential read speeds?
The only reason I can guess at is that you actually changed it
from 5M to 320M, and maybe that affected available buffer memory?
NeilBrown
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Joe Williams @ 2010-04-27 17:18 UTC (permalink / raw)
To: linux-raid
On Mon, Apr 26, 2010 at 11:41 PM, Neil Brown <neilb@suse.de> wrote:
> On Sat, 24 Apr 2010 16:36:20 -0700 Joe Williams <jwilliams315@gmail.com> wrote:
>
> The whole 'queue' directory really shouldn't appear for md devices but for
> some very boring reasons it does.
But read_ahead for md0 is in the queue directory:
/sys/block/md0/queue/read_ahead_kb
I know you said read_ahead is irrelevant for the individual disk
devices like sdb, but I thought it was implied the read_ahead for md0
is significant.
>
>
>> Next question, is it normal for md0 to have no queue_depth setting?
>
> Yes. The stripe_cache_size is conceptually a similar thing, but only
> at a very abstract level.
>
>>
>> Are there any other parameters that are important to performance that
>> I should be looking at?
>
>> I was expecting a little faster sequential reads, but 191 MB/s is not
>> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
>> record sizes.
>
> I don't know why it would decrease either. For sequential reads, read-ahead
> should be scheduling all the read requests and that actual reads should just
> be waiting for the read-ahead to complete. So there shouldn't be any
> variability - clearly there is. I wonder if it is an XFS thing....
> care to try a different filesystem for comparison? ext3?
I can try ext3. When I run mkfs.ext3, are there any parameters that I
should set to other than the default values?
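(I was thinking of matching the array geometry explicitly, e.g. stride = 512K chunk / 4K block = 128 and stripe-width = 128 x 3 data disks = 384, so roughly:
# mkfs.ext3 -E stride=128,stripe-width=384 -L raidvol /dev/md0
but I will start with plain defaults unless you suggest otherwise.)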
>
> That is very weird, as reads don't use the stripe cache at all - when
> the array is not degraded and no overlapping writes are happening.
>
> And the stripe_cache is measured in pages-per-device. So 2560 means
> 2560*4k for each device. There are 3 data devices, so 30720K of data, or 20
> full stripes (60 chunks).
>
> When you set stripe_cache_size to 16384, it would have consumed
> 16384*5*4K == 320Meg
> or 1/3 of your available RAM. This might have affected throughput,
> I'm not sure.
Ah, thanks for explaining that! I set the stripe cache much larger
than I intended to. But I am a little confused about your
calculations. First you multiply 2560 x 4K x 3 data devices to get the
total stripe_cache_size. But then you multiply 16384 x 4K x 5 devices
to get the RAM usage. Why multiply by 3 in the first case, and 5 in
the second? Does the stripe cache only cache data devices, or does it
cache all the devices in the array?
What stripe_cache_size value or values would you suggest I try to
optimize write throughput?
The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
per device, which would be two stripes, I think (you commented to that
effect earlier). But somehow the default setting was not optimal for
sequential write throughput. When I increased stripe_cache_size, the
sequential write throughput improved. Does that make sense? Why would
it be necessary to cache more than 2 stripes to get optimal sequential
write performance?
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Neil Brown @ 2010-04-27 21:24 UTC (permalink / raw)
To: Joe Williams; +Cc: linux-raid
On Tue, 27 Apr 2010 10:18:36 -0700
Joe Williams <jwilliams315@gmail.com> wrote:
> On Mon, Apr 26, 2010 at 11:41 PM, Neil Brown <neilb@suse.de> wrote:
> > On Sat, 24 Apr 2010 16:36:20 -0700 Joe Williams <jwilliams315@gmail.com> wrote:
> >
> > The whole 'queue' directory really shouldn't appear for md devices but for
> > some very boring reasons it does.
>
> But read_ahead for md0 is in the queue directory:
>
> /sys/block/md0/queue/read_ahead_kb
True - but it shouldn't be. read_ahead has nothing to do with
the queue.
/sys/block/md0/bdi/read_ahead_kb
is (in my mind) the more natural way to access that information.
Actually it should be
/sys/class/block/md0/bdi/read_ahead_kb
but that is a little off-topic.
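For example (this should touch the same setting as the queue path above):
# cat /sys/block/md0/bdi/read_ahead_kb
# echo 3072 > /sys/block/md0/bdi/read_ahead_kb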
>
> I know you said read_ahead is irrelevant for the individual disk
> devices like sdb, but I thought it was implied the read_ahead for md0
> is significant.
>
NeilBrown
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Joe Williams @ 2010-04-28 20:40 UTC (permalink / raw)
To: linux-raid
I did some tests, starting with the default values of 256 for
stripe_cache_size and 3072 for read_ahead_kb, and doubling them both
until performance stopped improving. Here are the best results that I
saw:
# echo 2048 > /sys/block/md0/md/stripe_cache_size
# echo 24576 > /sys/block/md0/queue/read_ahead_kb
# iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
random random
KB reclen write rewrite read reread read write
4194304 64 241087 259892 243478 248102 7745 16161
4194304 128 259503 261886 244612 247157 13417 26812
4194304 256 260438 268077 240211 238916 21884 37527
4194304 512 243511 250004 252507 252276 34694 48868
4194304 1024 244744 253905 258920 250495 52351 76356
4194304 2048 240910 250500 253800 265361 79848 100131
4194304 4096 244283 253516 271940 272117 101737 137386
4194304 8192 239110 246370 262118 269687 103437 164715
4194304 16384 240698 249182 239378 253896 119437 198276
250 MB/s reads and writes is quite nice for a 5 drive RAID-6.
But I still do not understand why it is necessary to increase the
stripe_cache_size to 16 full stripes in order to optimize sequential
write speed.
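P.S. For reference, the doubling sweep itself was essentially this loop, run from the filesystem mounted on md0 (the loop bound and output filenames are just illustrative):
# scs=256; ra=3072
# while [ $scs -le 16384 ]; do \
    echo $scs > /sys/block/md0/md/stripe_cache_size; \
    echo $ra > /sys/block/md0/queue/read_ahead_kb; \
    iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2 > iozone.$scs.$ra.txt; \
    scs=$((scs * 2)); ra=$((ra * 2)); \
  done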
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Neil Brown @ 2010-04-29 4:34 UTC (permalink / raw)
To: Joe Williams; +Cc: linux-raid
On Tue, 27 Apr 2010 10:18:36 -0700
Joe Williams <jwilliams315@gmail.com> wrote:
> >
> >
> >> Next question, is it normal for md0 to have no queue_depth setting?
> >
> > Yes. The stripe_cache_size is conceptually a similar thing, but only
> > at a very abstract level.
> >
> >>
> >> Are there any other parameters that are important to performance that
> >> I should be looking at?
> >
>
> >> I was expecting a little faster sequential reads, but 191 MB/s is not
> >> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> >> record sizes.
> >
> > I don't know why it would decrease either. For sequential reads, read-ahead
> > should be scheduling all the read requests and that actual reads should just
> > be waiting for the read-ahead to complete. So there shouldn't be any
> > variability - clearly there is. I wonder if it is an XFS thing....
> > care to try a different filesystem for comparison? ext3?
>
> I can try ext3. When I run mkfs.ext3, are there any parameters that I
> should set to other than the default values?
>
No, the defaults are normally fine. There might be room for small
improvements through tuning, but for now we are really looking for big
effects.
> >
>
> > That is very weird, as reads don't use the stripe cache at all - when
> > the array is not degraded and no overlapping writes are happening.
> >
> > And the stripe_cache is measured in pages-per-device. So 2560 means
> > 2560*4k for each device. There are 3 data devices, so 30720K of data, or 20
> > full stripes (60 chunks).
> >
> > When you set stripe_cache_size to 16384, it would have consumed
> > 16384*5*4K == 320Meg
> > or 1/3 of your available RAM. This might have affected throughput,
> > I'm not sure.
>
> Ah, thanks for explaining that! I set the stripe cache much larger
> than I intended to. But I am a little confused about your
> calculations. First you multiply 2560 x 4K x 3 data devices to get the
> total stripe_cache_size. But then you multiply 16384 x 4K x 5 devices
> to get the RAM usage. Why multiply by 3 in the first case, and 5 in
> the second? Does the stripe cache only cache data devices, or does it
> cache all the devices in the array?
I multiply by 3 when I'm calculating storage space in the array.
I multiply by 5 when I'm calculating the amount of RAM consumed.
The cache holds content for each device, whether data or parity.
We do all the parity calculations in the cache, so it has to store everything.
>
> What stripe_cache_size value or values would you suggest I try to
> optimize write throughput?
No idea. It is dependent on load and hardware characteristics.
Try lots of different numbers and draw a graph.
>
> The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
> per device, which would be two stripes, I think (you commented to that
> effect earlier). But somehow the default setting was not optimal for
> sequential write throughput. When I increased stripe_cache_size, the
> sequential write throughput improved. Does that make sense? Why would
> it be necessary to cache more than 2 stripes to get optimal sequential
> write performance?
The individual devices have some optimal write size - possibly one
track or one cylinder (if we pretend those words mean something useful these
days).
To be able to fill that you really need that much cache for each device.
Maybe your drives work best when they are sent 8M (16 stripes, as you say in
a subsequent email) before expecting the first write to complete.
You say you get about 250MB/sec, so that is about 80MB/sec per drive
(3 drives' worth of data).
Rotational speed is what? 10K? That is 166 revs per second.
So about 500K per revolution.
I imagine you would need at least 3 revolutions worth of data in the cache,
one that is currently being written, one that is ready to be written next
(so the drive knows it can just keep writing) and one that you are in the
process of filling up.
You find that you need about 16 revolutions (it seems to be about one
revolution per stripe). That is more than I would expect .... maybe there is
some extra latency somewhere.
NeilBrown
* Re: increasing stripe_cache_size decreases RAID-6 read throughput
From: Joe Williams @ 2010-05-04 0:06 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Wed, Apr 28, 2010 at 9:34 PM, Neil Brown <neilb@suse.de> wrote:
> On Tue, 27 Apr 2010 10:18:36 -0700
> Joe Williams <jwilliams315@gmail.com> wrote:
>> The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
>> per device, which would be two stripes, I think (you commented to that
>> effect earlier). But somehow the default setting was not optimal for
>> sequential write throughput. When I increased stripe_cache_size, the
>> sequential write throughput improved. Does that make sense? Why would
>> it be necessary to cache more than 2 stripes to get optimal sequential
>> write performance?
>
> The individual devices have some optimal write size - possibly one
> track or one cylinder (if we pretend those words mean something useful these
> days).
> To be able to fill that you really need that much cache for each device.
> Maybe your drives work best when they are sent 8M (16 stripes, as you say in
> a subsequent email) before expecting the first write to complete.
>
> You say you get about 250MB/sec, so that is about 80MB/sec per drive
> (3 drives' worth of data).
> Rotational speed is what? 10K? That is 166 revs per second.
Actually, 5400rpm.
> So about 500K per revolution.
About twice that, about 1 MB per revolution.
> I imagine you would need at least 3 revolutions worth of data in the cache,
> one that is currently being written, one that is ready to be written next
> (so the drive knows it can just keep writing) and one that you are in the
> process of filling up.
> You find that you need about 16 revolutions (it seems to be about one
> revolution per stripe). That is more than I would expect .... maybe there is
> some extra latency somewhere.
So about 8 revolutions in the cache. 2 to 3 times what might be
expected to be needed for optimal performance. Hmmm.
16 stripes comes to 16*512KB per drive, or about 8MB per drive. At
about 100MB/s, that is about 80 msec worth of writing. I don't see
where 80 msec of latency might come from.
Could it be a quirk of NCQ? I think each HDD has an NCQ depth of 31. But 31
512-byte sectors is only about 16 KB. That does not seem relevant.
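Spelling out the arithmetic I am using:
# echo $((5400 / 60))         # 90 revolutions per second
# echo $((80000 / 90))        # ~888 KB written per revolution per drive, call it 1 MB
# echo $((16 * 512))          # 8192 KB buffered per drive at 16 stripes
# echo $((8192 / 1024))       # ~8 revolutions' worth sitting in the cache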