From: Neil Brown <neilb@suse.de>
To: Joe Williams <jwilliams315@gmail.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: increasing stripe_cache_size decreases RAID-6 read throughput
Date: Thu, 29 Apr 2010 14:34:03 +1000	[thread overview]
Message-ID: <20100429143403.44bef7a1@notabene.brown> (raw)
In-Reply-To: <n2x11f0870e1004271018t742e933tba7e0d37428271f2@mail.gmail.com>

On Tue, 27 Apr 2010 10:18:36 -0700
Joe Williams <jwilliams315@gmail.com> wrote:

> >
> >
> >> Next question, is it normal for md0 to have no queue_depth setting?
> >
> > Yes.  The stripe_cache_size is conceptually a similar thing, but only
> > at a very abstract level.
> >
> >>
> >> Are there any other parameters that are important to performance that
> >> I should be looking at?
> >
> 
> >> I was expecting a little faster sequential reads, but 191 MB/s is not
> >> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> >> record sizes.
> >
> > I don't know why it would decrease either.  For sequential reads, read-ahead
> > should be scheduling all the read requests and the actual reads should just
> > be waiting for the read-ahead to complete.  So there shouldn't be any
> > variability - but clearly there is.  I wonder if it is an XFS thing...
> > care to try a different filesystem for comparison?  ext3?
> 
> I can try ext3. When I run mkfs.ext3, are there any parameters that I
> should set to other than the default values?
> 

No, the defaults are normally fine.  There might be room for small
improvements through tuning, but for now we are really looking for big
effects.
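
If you want to script the comparison run, an untested sketch along these
lines is all it should take (/dev/md0 and the mount point are my guesses,
and mkfs.ext3 will of course destroy whatever is on the device):

# Reformat the array with ext3 defaults for a comparison run against XFS.
import subprocess

DEVICE = "/dev/md0"
MOUNTPOINT = "/mnt/test"

subprocess.run(["mkfs.ext3", DEVICE], check=True)           # defaults only
subprocess.run(["mount", DEVICE, MOUNTPOINT], check=True)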

> >
> 
> > That is very weird, as reads don't use the stripe cache at all - when
> > the array is not degraded and no overlapping writes are happening.
> >
> > And the stripe_cache is measured in pages-per-device.  So 2560 means
> > 2560*4k for each device. There are 3 data devices, so 30720K or 60 stripes.
> >
> > When you set stripe_cache_size to 16384, it would have consumed
> >  16384*5*4K == 320Meg
> > or 1/3 of your available RAM.  This might have affected throughput,
> > I'm not sure.
> 
> Ah, thanks for explaining that! I set the stripe cache much larger
> than I intended to. But I am a little confused about your
> calculations. First you multiply 2560 x 4K x 3 data devices to get the
> total stripe_cache_size. But then you multiply 16384 x 4K x 5 devices
> to get the RAM usage. Why multiply by 3 in the first case, and 5 in
> the second? Does the stripe cache only cache data devices, or does it
> cache all the devices in the array?

I multiply by 3 when I'm calculating storage space in the array.
I multiply by 5 when I'm calculating the amount of RAM consumed.

The cache holds content for each device, whether data or parity.
We do all the parity calculations in the cache, so it has to store everything.
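
To put numbers on it - just a back-of-the-envelope in Python, taking the 4K
page size and the 3-data + 2-parity layout from your array:

PAGE = 4 * 1024
DATA_DISKS = 3      # only these hold file data
TOTAL_DISKS = 5     # the cache keeps pages for every device, parity included

def cache_numbers(stripe_cache_size):
    per_device = stripe_cache_size * PAGE    # bytes cached per device
    data_span = per_device * DATA_DISKS      # array data that fits in it
    ram_used = per_device * TOTAL_DISKS      # RAM actually consumed
    return data_span, ram_used

for n in (256, 2560, 16384):
    data, ram = cache_numbers(n)
    print("stripe_cache_size=%-5d covers %6dK of data, uses %3dM of RAM"
          % (n, data // 1024, ram // (1024 * 1024)))

That reproduces the 30720K and 320Meg figures above.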

> 
> What stripe_cache_size value or values would you suggest I try to
> optimize write throughput?

No idea.  It is dependent on load and hardware characteristics.
Try lots of different numbers and draw a graph.
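
If you want to automate the sweep, something like this untested sketch would
collect the data points (the md0 name, the test file location and the plain
dd write are all assumptions - substitute your real workload):

# Sweep stripe_cache_size and record sequential write throughput.
# Needs root for the sysfs write; dd with oflag=direct bypasses the
# page cache so the numbers reflect the array, not RAM.
import subprocess
import time

SYSFS = "/sys/block/md0/md/stripe_cache_size"
TESTFILE = "/mnt/test/sweep.dat"
SIZE_MB = 4096                       # write 4G per data point

for size in (256, 512, 1024, 2048, 4096, 8192):
    with open(SYSFS, "w") as f:
        f.write(str(size))
    start = time.time()
    subprocess.run(["dd", "if=/dev/zero", "of=" + TESTFILE,
                    "bs=1M", "count=%d" % SIZE_MB, "oflag=direct"],
                   check=True)
    elapsed = time.time() - start
    print("%d\t%.1f MB/s" % (size, SIZE_MB / elapsed))

Plot the two columns and look for where the curve flattens out.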


> 
> The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
> per device, which would be two stripes, I think (you commented to that
> effect earlier). But somehow the default setting was not optimal for
> sequential write throughput. When I increased stripe_cache_size, the
> sequential write throughput improved. Does that make sense? Why would
> it be necessary to cache more than 2 stripes to get optimal sequential
> write performance?

The individual devices have some optimal write size - possibly one
track or one cylinder (if we pretend those words still mean something useful
these days).
To be able to fill that, you really need that much cache for each device.
Maybe your drives work best when they are sent 8M (16 stripes, as you say in
a subsequent email) before expecting the first write to complete.

You say you get about 250MB/sec, so that is about 80MB/sec per drive
(3 drives' worth of data).
Rotational speed is what?  10K?  That is about 166 revs per second.
So about 500K per revolution.
I imagine you would need at least 3 revolutions' worth of data in the cache:
one that is currently being written, one that is ready to be written next
(so the drive knows it can just keep writing), and one that you are in the
process of filling up.
You find that you need about 16 revolutions (it seems to be about one
revolution per stripe).  That is more than I would expect... maybe there is
some extra latency somewhere.
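
Spelling that out - only a rough estimate, and the 10K RPM figure is a guess:

# 250MB/s aggregate over 3 data drives, 10K RPM assumed, 4K pages.
total_rate = 250e6                   # bytes/sec across the array
per_drive = total_rate / 3           # ~83 MB/sec per data drive
revs_per_sec = 10000 / 60.0          # 10K RPM -> ~167 revs per second
per_rev = per_drive / revs_per_sec   # ~500K written per revolution

print("per drive: %.0f MB/s, per revolution: %.0fK"
      % (per_drive / 1e6, per_rev / 1e3))

# 3 revolutions of cache per drive is what I'd naively expect;
# you seem to need about 16.
for revs in (3, 16):
    cache_bytes = revs * per_rev
    print("%2d revolutions -> ~%.1f MB per drive, stripe_cache_size ~%d"
          % (revs, cache_bytes / 1e6, cache_bytes / 4096))

The 16-revolution case works out to roughly 2000 pages per device, which is
at least in the same ballpark as the 2560 you quoted.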

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Thread overview: 8+ messages
2010-04-24 23:36 increasing stripe_cache_size decreases RAID-6 read throughput Joe Williams
2010-04-24 23:45 ` Joe Williams
2010-04-27  6:41 ` Neil Brown
2010-04-27 17:18   ` Joe Williams
2010-04-27 21:24     ` Neil Brown
2010-04-28 20:40       ` Joe Williams
2010-04-29  4:34     ` Neil Brown [this message]
2010-05-04  0:06       ` Joe Williams
