linux-raid.vger.kernel.org archive mirror
* increasing stripe_cache_size decreases RAID-6 read throughput
@ 2010-04-24 23:36 Joe Williams
  2010-04-24 23:45 ` Joe Williams
  2010-04-27  6:41 ` Neil Brown
  0 siblings, 2 replies; 8+ messages in thread
From: Joe Williams @ 2010-04-24 23:36 UTC (permalink / raw)
  To: linux-raid

I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
parameters, including a 512 KB chunk size. It took about 6 hours to
initialize, then I created an XFS filesystem:

# mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=45776384 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1464843648, imaxpct=5
         =                       sunit=128    swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
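Those sunit/swidth values (reported in 4 KiB filesystem blocks) can be cross-checked against the array geometry, since five drives in RAID-6 leave three data disks:

```shell
# Cross-check mkfs.xfs's sunit/swidth against the array geometry:
# 512 KiB chunk, 5 drives in RAID-6 => 3 data disks.
CHUNK_KB=512; DATA_DISKS=3; BLOCK_KB=4
echo "sunit=$(( CHUNK_KB / BLOCK_KB )) blks"
echo "swidth=$(( CHUNK_KB * DATA_DISKS / BLOCK_KB )) blks"
```

This reproduces the sunit=128, swidth=384 figures in the mkfs output above.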

Then it was time to optimize the performance. First I ran a benchmark
with the default settings (from a recent Arch Linux install) for the
following parameters:

# cat /sys/block/md0/md/stripe_cache_size
256

# cat /sys/block/md0/queue/read_ahead_kb
3072
# cat /sys/block/sdb/queue/read_ahead_kb
128

# cat /sys/block/md0/queue/scheduler
none
# cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]

# cat /sys/block/md0/queue/nr_requests
128
# cat /sys/block/sdb/queue/nr_requests
128

# cat /sys/block/md0/device/queue_depth
cat: /sys/block/md0/device/queue_depth: No such file or directory
# cat /sys/block/sdb/device/queue_depth
31

# cat /sys/block/md0/queue/max_sectors_kb
127
# cat /sys/block/sdb/queue/max_sectors_kb
512

Note that sdb is one of the 5 drives for the RAID volume, and the
other 4 have the same settings.

First question, is it normal for the md0 scheduler to be "none"? I
cannot change it by writing, e.g., "deadline" into the file.

Next question, is it normal for md0 to have no queue_depth setting?

Are there any other parameters that are important to performance that
I should be looking at?
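For reference, all of the settings above can be snapshotted in one pass with a small script like this (the md0 name matches my array; adjust as needed):

```shell
#!/bin/sh
# Print the md tuning knobs discussed above in one pass.
# "md0" is an example device name; files that do not exist are noted.
MD=md0
for f in md/stripe_cache_size queue/read_ahead_kb queue/scheduler \
         queue/nr_requests queue/max_sectors_kb; do
    printf '%s: %s\n' "$f" \
        "$(cat "/sys/block/$MD/$f" 2>/dev/null || echo '(not present)')"
done
```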

I started the kernel with mem=1024M so that the buffer cache wasn't
too large (this machine has 72G of RAM), and ran an iozone benchmark:

    Iozone: Performance Test of File I/O
            Version $Revision: 3.338 $
        Compiled for 64 bit mode.
        Build: linux-AMD64

    Auto Mode
    Using Minimum Record Size 64 KB
    Using Maximum Record Size 16384 KB
    File size set to 4194304 KB
    Include fsync in write timing
    Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         4194304      64  133608  114920   191367   191559    7772   14718
         4194304     128  142748  113722   165832   161023   14055   20728
         4194304     256  127493  108110   165142   175396   24156   23300
         4194304     512  136022  112711   171146   165466   36147   25698
         4194304    1024  140618  110196   153134   148925   57498   39864
         4194304    2048  137110  108872   177201   193416   98759   50106
         4194304    4096  138723  113352   130858   129940   78636   64615
         4194304    8192  140100  114089   175240   168807  109858   84656
         4194304   16384  130633  116475   131867   142958  115147  102795


I was expecting a little faster sequential reads, but 191 MB/s is not
too bad. I'm not sure why it decreases to 130-131 MB/s at larger
record sizes.

But the writes were disappointing. So the first thing I tried tuning
was stripe_cache_size

# echo 16384 > /sys/block/md0/md/stripe_cache_size

I re-ran the iozone benchmark:

                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         4194304      64  219206  264113   104751   108118    7240   12372
         4194304     128  232713  255337   153990   142872   13209   21979
         4194304     256  229446  242155   132753   131009   20858   32286
         4194304     512  236389  245713   144280   149283   32024   44119
         4194304    1024  234205  243135   141243   141604   53539   70459
         4194304    2048  219163  224379   134043   131765   84428   90394
         4194304    4096  226037  225588   143682   146620   60171  125360
         4194304    8192  214487  231506   135311   140918   78868  156935
         4194304   16384  210671  215078   138466   129098   96340  178073

And now the sequential writes are quite satisfactory, but the reads
are low. Next I tried 2560 for stripe_cache_size, since that is


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-24 23:36 increasing stripe_cache_size decreases RAID-6 read throughput Joe Williams
@ 2010-04-24 23:45 ` Joe Williams
  2010-04-27  6:41 ` Neil Brown
  1 sibling, 0 replies; 8+ messages in thread
From: Joe Williams @ 2010-04-24 23:45 UTC (permalink / raw)
  To: linux-raid

[end of last message was truncated, here is the rest]

Next I tried 2560 for stripe_cache_size, since that is the 512KB x 5
stripe width.

                                                            random  random
              KB  reclen   write rewrite    read    reread    read   write
         4194304      64  201919  141025   139386   134252    7421   13327
         4194304     128  194337  123513   237911   237901   13002   22758
         4194304     256  181426  142159   256929   252772   21986   30099
         4194304     512  183168  175516   234975   234090   32614   40375
         4194304    1024  169051  163818   220393   233060   54738   58653
         4194304    2048  173281  141452   237993   234881   95969   77678
         4194304    4096  162690  142784   208838   211268   90016   96876
         4194304    8192  151361  125652   197484   197278  124009  112708
         4194304   16384  138971  106200   183750   183659  135876  121704

So the sequential reads at 200+ MB/s look okay (although I do not
understand the huge throughput variability with record size), but the
writes are not as high as with 16MB stripe cache. This may be the
setting that I decide to stick with, but I would like to understand
what is going on.

Why did increasing the stripe cache from 256 KB to 16 MB decrease the
sequential read speeds?

Also, let me know what other parameters I should tune during my optimizations.
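One way to explore this systematically is a sweep over candidate values, benchmarking each one. A sketch (device path and iozone flags as used above; needs root, an installed iozone, and a quiescent array, so it is guarded to do nothing otherwise):

```shell
#!/bin/sh
# Sweep stripe_cache_size (in 4 KiB pages per device) and benchmark
# each setting with the same iozone flags used earlier in this thread.
SCS_FILE=/sys/block/md0/md/stripe_cache_size
if [ ! -w "$SCS_FILE" ]; then
    echo "md0 not present or not writable; nothing to do"
else
    for scs in 256 512 1024 2048 4096 8192 16384; do
        echo "$scs" > "$SCS_FILE"
        echo "=== stripe_cache_size=$scs ==="
        iozone -y64K -q16M -s4G -e -f iotest -i0 -i1
    done
fi
```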

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-24 23:36 increasing stripe_cache_size decreases RAID-6 read throughput Joe Williams
  2010-04-24 23:45 ` Joe Williams
@ 2010-04-27  6:41 ` Neil Brown
  2010-04-27 17:18   ` Joe Williams
  1 sibling, 1 reply; 8+ messages in thread
From: Neil Brown @ 2010-04-27  6:41 UTC (permalink / raw)
  To: Joe Williams; +Cc: linux-raid

On Sat, 24 Apr 2010 16:36:20 -0700
Joe Williams <jwilliams315@gmail.com> wrote:

> I am new to mdadm, and I just set up an mdadm v3.1.2 RAID-6 of five 2
> TB Samsung Spinpoint F3EGs. I created the RAID-6 with the default
> parameters, including a 512 KB chunk size. It took about 6 hours to
> initialize, then I created an XFS filesystem:
> 
> # mkfs.xfs -f -d su=512k,sw=3 -l su=256k -l lazy-count=1 -L raidvol /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=45776384 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=1464843648, imaxpct=5
>          =                       sunit=128    swidth=384 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=64 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> Note that 256k is the maximum allowed by mkfs.xfs for the log stripe unit.
> 
> Then it was time to optimize the performance. First I ran a benchmark
> with the default settings (from a recent Arch linux install) for the
> following parameters:
> 
> # cat /sys/block/md0/md/stripe_cache_size
> 256
> 
> # cat /sys/block/md0/queue/read_ahead_kb
> 3072

2 full stripes - that is right.

> # cat /sys/block/sdb/queue/read_ahead_kb
> 128

This number is completely irrelevant.  Only the read_ahead_kb of the device
that the filesystem sees is used.

> 
> # cat /sys/block/md0/queue/scheduler
> none
> # cat /sys/block/sdb/queue/scheduler
> noop deadline [cfq]
> 
> # cat /sys/block/md0/queue/nr_requests
> 128
> # cat /sys/block/sdb/queue/nr_requests
> 128
> 
> # cat /sys/block/md0/device/queue_depth
> cat: /sys/block/md0/device/queue_depth: No such file or directory
> # cat /sys/block/sdb/device/queue_depth
> 31
> 
> # cat /sys/block/md0/queue/max_sectors_kb
> 127
> # cat /sys/block/sdb/queue/max_sectors_kb
> 512
> 
> Note that sdb is one of the 5 drives for the RAID volume, and the
> other 4 have the same settings.
> 
> First question, is it normal for the md0 scheduler to be "none"? I
> cannot change it by writing, eg., "deadline" into the file.
> 

Because software RAID is not a disk drive, it does not use an elevator and so
does not use a scheduler.
The whole 'queue' directory really shouldn't appear for md devices, but for
some very boring reasons it does.


> Next question, is it normal for md0 to have no queue_depth setting?

Yes.  The stripe_cache_size is conceptually a similar thing, but only
at a very abstract level.

> 
> Are there any other parameters that are important to performance that
> I should be looking at?

No.

> 
> I started the kernel with mem=1024M so that the buffer cache wasn't
> too large (this machine has 72G of RAM), and ran an iozone benchmark:
> 
>     Iozone: Performance Test of File I/O
>             Version $Revision: 3.338 $
>         Compiled for 64 bit mode.
>         Build: linux-AMD64
> 
>     Auto Mode
>     Using Minimum Record Size 64 KB
>     Using Maximum Record Size 16384 KB
>     File size set to 4194304 KB
>     Include fsync in write timing
>     Command line used: iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2
>     Output is in Kbytes/sec
>     Time Resolution = 0.000001 seconds.
>     Processor cache size set to 1024 Kbytes.
>     Processor cache line size set to 32 bytes.
>     File stride size set to 17 * record size.
>                                                             random  random
>               KB  reclen   write rewrite    read    reread    read   write
>          4194304      64  133608  114920   191367   191559    7772   14718
>          4194304     128  142748  113722   165832   161023   14055   20728
>          4194304     256  127493  108110   165142   175396   24156   23300
>          4194304     512  136022  112711   171146   165466   36147   25698
>          4194304    1024  140618  110196   153134   148925   57498   39864
>          4194304    2048  137110  108872   177201   193416   98759   50106
>          4194304    4096  138723  113352   130858   129940   78636   64615
>          4194304    8192  140100  114089   175240   168807  109858   84656
>          4194304   16384  130633  116475   131867   142958  115147  102795
> 
> 
> I was expecting a little faster sequential reads, but 191 MB/s is not
> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> record sizes.

I don't know why it would decrease either.  For sequential reads, read-ahead
should be scheduling all the read requests and the actual reads should just
be waiting for the read-ahead to complete.  So there shouldn't be any
variability - clearly there is.  I wonder if it is an XFS thing....
care to try a different filesystem for comparison?  ext3?


> 
> But the writes were disappointing. So the first thing I tried tuning
> was stripe_cache_size
> 
> # echo 16384 > /sys/block/md0/md/stripe_cache_size
> 
> I re-ran the iozone benchmark:
> 
>                                                             random  random
>               KB  reclen   write rewrite    read    reread    read   write
>          4194304      64  219206  264113   104751   108118    7240   12372
>          4194304     128  232713  255337   153990   142872   13209   21979
>          4194304     256  229446  242155   132753   131009   20858   32286
>          4194304     512  236389  245713   144280   149283   32024   44119
>          4194304    1024  234205  243135   141243   141604   53539   70459
>          4194304    2048  219163  224379   134043   131765   84428   90394
>          4194304    4096  226037  225588   143682   146620   60171  125360
>          4194304    8192  214487  231506   135311   140918   78868  156935
>          4194304   16384  210671  215078   138466   129098   96340  178073
> 
> And now the sequential writes are quite satisfactory, but the reads
> are low. Next I tried 2560 for stipe_cache_size, since that is the 512KB x 5
> stripe width.


That is very weird, as reads don't use the stripe cache at all - when
the array is not degraded and no overlapping writes are happening.

And the stripe_cache is measured in pages-per-device.  So 2560 means
2560*4k for each device. There are 3 data devices, so 30720K or 60 stripes.

When you set stripe_cache_size to 16384, it would have consumed
 16384*5*4K == 320Meg
or 1/3 of your available RAM.  This might have affected throughput,
I'm not sure.
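Neil's figure can be verified with simple integer arithmetic (assuming the usual 4 KiB page size):

```shell
# stripe_cache_size is in 4 KiB pages per member device, and the cache
# covers all 5 devices (data and parity alike).
PAGES=16384; DEVICES=5; PAGE_KB=4
CACHE_MB=$(( PAGES * DEVICES * PAGE_KB / 1024 ))
echo "${CACHE_MB}M"    # 320M
```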


> So the sequential reads at 200+ MB/s look okay (although I do not
> understand the huge throughput variability with record size), but the
> writes are not as high as with 16MB stripe cache. This may be the
> setting that I decide to stick with, but I would like to understand
> what is going on.

> Why did increasing the stripe cache from 256 KB to 16 MB decrease the
> sequential read speeds?

The only reason I can guess at is that you actually changed it
from 5M to 320M, and maybe that affected available buffer memory?

NeilBrown


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-27  6:41 ` Neil Brown
@ 2010-04-27 17:18   ` Joe Williams
  2010-04-27 21:24     ` Neil Brown
  2010-04-29  4:34     ` Neil Brown
  0 siblings, 2 replies; 8+ messages in thread
From: Joe Williams @ 2010-04-27 17:18 UTC (permalink / raw)
  To: linux-raid

On Mon, Apr 26, 2010 at 11:41 PM, Neil Brown <neilb@suse.de> wrote:
> On Sat, 24 Apr 2010 16:36:20 -0700 Joe Williams <jwilliams315@gmail.com> wrote:
>
> The whole 'queue' directory really shouldn't appear for md devices but for
> some very boring reasons it does.

But read_ahead for md0 is in the queue directory:

/sys/block/md0/queue/read_ahead_kb

I know you said read_ahead is irrelevant for the individual disk
devices like sdb, but I thought it was implied the read_ahead for md0
is significant.

>
>
>> Next question, is it normal for md0 to have no queue_depth setting?
>
> Yes.  The stripe_cache_size is conceptually a similar thing, but only
> at a very abstract level.
>
>>
>> Are there any other parameters that are important to performance that
>> I should be looking at?
>

>> I was expecting a little faster sequential reads, but 191 MB/s is not
>> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
>> record sizes.
>
> I don't know why it would decrease either.  For sequential reads, read-ahead
> should be scheduling all the read requests and the actual reads should just
> be waiting for the read-ahead to complete.  So there shouldn't be any
> variability - clearly there is.  I wonder if it is an XFS thing....
> care to try a different filesystem for comparison?  ext3?

I can try ext3. When I run mkfs.ext3, are there any parameters that I
should set to other than the default values?

>

> That is very weird, as reads don't use the stripe cache at all - when
> the array is not degraded and no overlapping writes are happening.
>
> And the stripe_cache is measured in pages-per-device.  So 2560 means
> 2560*4k for each device. There are 3 data devices, so 30720K or 60 stripes.
>
> When you set stripe_cache_size to 16384, it would have consumed
>  16384*5*4K == 320Meg
> or 1/3 of your available RAM.  This might have affected throughput,
> I'm not sure.

Ah, thanks for explaining that! I set the stripe cache much larger
than I intended to. But I am a little confused about your
calculations. First you multiply 2560 x 4K x 3 data devices to get the
total stripe_cache_size. But then you multiply 16384 x 4K x 5 devices
to get the RAM usage. Why multiply by 3 in the first case, and 5 in
the second? Does the stripe cache only cache data devices, or does it
cache all the devices in the array?

What stripe_cache_size value or values would you suggest I try to
optimize write throughput?

The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
per device, which would be two stripes, I think (you commented to that
effect earlier). But somehow the default setting was not optimal for
sequential write throughput. When I increased stripe_cache_size, the
sequential write throughput improved. Does that make sense? Why would
it be necessary to cache more than 2 stripes to get optimal sequential
write performance?
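The default-setting arithmetic in that last paragraph, spelled out:

```shell
# Default stripe cache: 256 pages/device x 4 KiB = 1 MiB per device,
# i.e. two 512 KiB chunks (two stripes' worth) per device.
PAGES=256; PAGE_KB=4; CHUNK_KB=512
echo "$(( PAGES * PAGE_KB / CHUNK_KB )) chunks per device"   # 2
```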

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-27 17:18   ` Joe Williams
@ 2010-04-27 21:24     ` Neil Brown
  2010-04-28 20:40       ` Joe Williams
  2010-04-29  4:34     ` Neil Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Neil Brown @ 2010-04-27 21:24 UTC (permalink / raw)
  To: Joe Williams; +Cc: linux-raid

On Tue, 27 Apr 2010 10:18:36 -0700
Joe Williams <jwilliams315@gmail.com> wrote:

> On Mon, Apr 26, 2010 at 11:41 PM, Neil Brown <neilb@suse.de> wrote:
> > On Sat, 24 Apr 2010 16:36:20 -0700 Joe Williams <jwilliams315@gmail.com> wrote:
> >
> > The whole 'queue' directory really shouldn't appear for md devices but for
> > some very boring reasons it does.
> 
> But read_ahead for md0 is in the queue directory:
> 
> /sys/block/md0/queue/read_ahead_kb

True - but it shouldn't be.  read_ahead has nothing to do with 
the queue.  
   /sys/block/md0/bdi/read_ahead_kb
is (in my mind) the more natural way to access that information.
Actually it should be

   /sys/class/block/md0/bdi/read_ahead_kb

but that is a little off-topic.

> 
> I know you said read_ahead is irrelevant for the individual disk
> devices like sdb, but I thought it was implied the read_ahead for md0
> is significant.
> 

NeilBrown

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-27 21:24     ` Neil Brown
@ 2010-04-28 20:40       ` Joe Williams
  0 siblings, 0 replies; 8+ messages in thread
From: Joe Williams @ 2010-04-28 20:40 UTC (permalink / raw)
  To: linux-raid

I did some tests, starting with the default values of 256 for
stripe_cache_size and 3072 for read_ahead_kb, and doubling them both
until performance stopped improving. Here are the best results that I
saw:

# echo 2048 > /sys/block/md0/md/stripe_cache_size
# echo 24576 > /sys/block/md0/queue/read_ahead_kb
# iozone -a -y64K -q16M -s4G -e -f iotest -i0 -i1 -i2

                                                    random  random
      KB  reclen   write rewrite    read    reread    read   write
 4194304      64  241087  259892   243478   248102    7745   16161
 4194304     128  259503  261886   244612   247157   13417   26812
 4194304     256  260438  268077   240211   238916   21884   37527
 4194304     512  243511  250004   252507   252276   34694   48868
 4194304    1024  244744  253905   258920   250495   52351   76356
 4194304    2048  240910  250500   253800   265361   79848  100131
 4194304    4096  244283  253516   271940   272117  101737  137386
 4194304    8192  239110  246370   262118   269687  103437  164715
 4194304   16384  240698  249182   239378   253896  119437  198276


250 MB/s reads and writes is quite nice for a 5-drive RAID-6.

But I still do not understand why it is necessary to increase the
stripe_cache_size to 16 full stripes in order to optimize sequential
write speed.
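The 16-stripes figure follows directly from the settings above (512 KiB chunk, 3 data disks assumed, as earlier in the thread):

```shell
# Best setting: 2048 pages/device x 4 KiB = 8 MiB of cache per device;
# at a 512 KiB chunk that is 16 stripes' worth per device.
PAGES=2048; PAGE_KB=4; CHUNK_KB=512; DATA_DISKS=3
echo "$(( PAGES * PAGE_KB / CHUNK_KB )) stripes of cache"       # 16
# read_ahead_kb=24576 likewise covers 16 full data stripes:
echo "$(( 24576 / (CHUNK_KB * DATA_DISKS) )) stripes of RA"     # 16
```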

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-27 17:18   ` Joe Williams
  2010-04-27 21:24     ` Neil Brown
@ 2010-04-29  4:34     ` Neil Brown
  2010-05-04  0:06       ` Joe Williams
  1 sibling, 1 reply; 8+ messages in thread
From: Neil Brown @ 2010-04-29  4:34 UTC (permalink / raw)
  To: Joe Williams; +Cc: linux-raid

On Tue, 27 Apr 2010 10:18:36 -0700
Joe Williams <jwilliams315@gmail.com> wrote:

> >
> >
> >> Next question, is it normal for md0 to have no queue_depth setting?
> >
> > Yes.  The stripe_cache_size is conceptually a similar thing, but only
> > at a very abstract level.
> >
> >>
> >> Are there any other parameters that are important to performance that
> >> I should be looking at?
> >
> 
> >> I was expecting a little faster sequential reads, but 191 MB/s is not
> >> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> >> record sizes.
> >
> > I don't know why it would decrease either.  For sequential reads, read-ahead
> > should be scheduling all the read requests and the actual reads should just
> > be waiting for the read-ahead to complete.  So there shouldn't be any
> > variability - clearly there is.  I wonder if it is an XFS thing....
> > care to try a different filesystem for comparison?  ext3?
> 
> I can try ext3. When I run mkfs.ext3, are there any parameters that I
> should set to other than the default values?
> 

No, the defaults are normally fine.  There might be room for small
improvements through tuning, but for now we are really looking for big
effects.

> >
> 
> > That is very weird, as reads don't use the stripe cache at all - when
> > the array is not degraded and no overlapping writes are happening.
> >
> > And the stripe_cache is measured in pages-per-device.  So 2560 means
> > 2560*4k for each device. There are 3 data devices, so 30720K or 60 stripes.
> >
> > When you set stripe_cache_size to 16384, it would have consumed
> >  16384*5*4K == 320Meg
> > or 1/3 of your available RAM.  This might have affected throughput,
> > I'm not sure.
> 
> Ah, thanks for explaining that! I set the stripe cache much larger
> than I intended to. But I am a little confused about your
> calculations. First you multiply 2560 x 4K x 3 data devices to get the
> total stripe_cache_size. But then you multiply 16384 x 4K x 5 devices
> to get the RAM usage. Why multiply by 3 in the first case, and 5 in
> the second? Does the stripe cache only cache data devices, or does it
> cache all the devices in the array?

I multiply by 3 when I'm calculating storage space in the array.
I multiply by 5 when I'm calculating the amount of RAM consumed.

The cache holds content for each device, whether data or parity.
We do all the parity calculations in the cache, so it has to store everything.

> 
> What stripe_cache_size value or values would you suggest I try to
> optimize write throughput?

No idea.  It is dependent on load and hardware characteristics.
Try lots of different numbers and draw a graph.


> 
> The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
> per device, which would be two stripes, I think (you commented to that
> effect earlier). But somehow the default setting was not optimal for
> sequential write throughput. When I increased stripe_cache_size, the
> sequential write throughput improved. Does that make sense? Why would
> it be necessary to cache more than 2 stripes to get optimal sequential
> write performance?

The individual devices have some optimal write size - possibly one
track or one cylinder (if we pretend those words mean something useful these
days).
To be able to fill that you really need that much cache for each device.
Maybe your drives work best when they are sent 8M (16 stripes, as you say in
a subsequent email) before expecting the first write to complete.

You say you get about 250MB/sec, so that is about 80MB/sec per drive
(3 drives worth of data).
Rotational speed is what?  10K?  That is 166revs-per-second.
So about 500K per revolution.
I imagine you would need at least 3 revolutions worth of data in the cache,
one that is currently being written, one that is ready to be written next
(so the drive knows it can just keep writing) and one that you are in the
process of filling up.
You find that you need about 16 revolutions (it seems to be about one
revolution per stripe).  That is more than I would expect .... maybe there is
some extra latency somewhere.
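That estimate, spelled out in integer arithmetic (the 10K rpm figure is Neil's assumption, which Joe corrects in his reply):

```shell
# Per-drive throughput and data-per-revolution, per Neil's estimate.
TOTAL_MB=250; DATA_DRIVES=3; RPM=10000
PER_DRIVE_MB=$(( TOTAL_MB / DATA_DRIVES ))        # ~83 MB/s per drive
RPS=$(( RPM / 60 ))                               # ~166 revolutions/s
echo "$(( PER_DRIVE_MB * 1024 / RPS ))K per revolution"   # ~512K
```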

NeilBrown


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: increasing stripe_cache_size decreases RAID-6 read throughput
  2010-04-29  4:34     ` Neil Brown
@ 2010-05-04  0:06       ` Joe Williams
  0 siblings, 0 replies; 8+ messages in thread
From: Joe Williams @ 2010-05-04  0:06 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Wed, Apr 28, 2010 at 9:34 PM, Neil Brown <neilb@suse.de> wrote:
> On Tue, 27 Apr 2010 10:18:36 -0700
> Joe Williams <jwilliams315@gmail.com> wrote:

>> The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
>> per device, which would be two stripes, I think (you commented to that
>> effect earlier). But somehow the default setting was not optimal for
>> sequential write throughput. When I increased stripe_cache_size, the
>> sequential write throughput improved. Does that make sense? Why would
>> it be necessary to cache more than 2 stripes to get optimal sequential
>> write performance?
>
> The individual devices have some optimal write size - possible one
> track or one cylinder (if we pretend those words mean something useful these
> days).
> To be able to fill that you really need that much cache for each device.
> Maybe your drives work best when they are sent 8M (16 stripes, as you say in
> a subsequent email) before expecting the first write to complete..
>
> You say you get about 250MB/sec, so that is about 80MB/sec per drive
> (3 drives worth of data).
> Rotational speed is what?  10K?  That is 166revs-per-second.

Actually, 5400rpm.

> So about 500K per revolution.

About twice that, about 1 MB per revolution.

> I imagine you would need at least 3 revolutions worth of data in the cache,
> one that is currently being written, one that is ready to be written next
> (so the drive knows it can just keep writing) and one that you are in the
> process of filling up.
> You find that you need about 16 revolutions (it seems to be about one
> revolution per stripe).  That is more than I would expect .... maybe there is
> some extra latency somewhere.

So about 8 revolutions' worth in the cache, 2 to 3 times what might be
expected to be needed for optimal performance. Hmmm.

16 stripes comes to 16*512KB per drive, or about 8MB per drive. At
about 100MB/s, that is about 80 msec worth of writing. I don't see
where 80 msec of latency might come from.

Could it be a quirk of NCQ? I think each HDD has an NCQ depth of 31. But
31 512-byte sectors is only about 16KB. That does not seem relevant.
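The arithmetic behind that NCQ aside, for completeness:

```shell
# If each of the 31 queued NCQ commands were a single 512-byte sector:
echo "$(( 31 * 512 )) bytes"    # 15872 bytes, roughly 16 KB
```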

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-05-04  0:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-24 23:36 increasing stripe_cache_size decreases RAID-6 read throughput Joe Williams
2010-04-24 23:45 ` Joe Williams
2010-04-27  6:41 ` Neil Brown
2010-04-27 17:18   ` Joe Williams
2010-04-27 21:24     ` Neil Brown
2010-04-28 20:40       ` Joe Williams
2010-04-29  4:34     ` Neil Brown
2010-05-04  0:06       ` Joe Williams
