Re: the '--setra 65536' mistery, analysis and WTF?

linux-raid.vger.kernel.org archive mirror

From: pg_lxra@lxra.for.sabi.co.UK (Peter Grandi)
To: Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: the '--setra 65536' mistery, analysis and WTF?
Date: Thu, 20 Mar 2008 08:12:26 +0000
Message-ID: <18402.7274.174605.611654@tree.ty.sabi.co.uk>
In-Reply-To: <18400.13488.178782.156775@tree.ty.sabi.co.uk>

[ ... on large read-ahead being needed for reasonable Linux RAID
read performance ... ]

>> * Most revealingly, when I used read-ahead values that were
>> powers of 10, the number of blocks/s reported by 'vmstat 1'
>> was also a multiple of that power of 10.

> Most disturbingly, this seems to indicate that not only does the
> Linux block IO subsystem issue IO operations in multiples of
> the read-ahead size, but it does so a fixed number of times per
> second that is a multiple of 10.

> Which leads me to suspect that the queueing of IO requests on
> the driver's queue, or even the issuing of requests from the
> driver to the device, may end up being driven by the clock tick
> interrupt frequency, not the device interrupt frequency.

Which led me to think about elevators; these were also mentioned
in some recent (and otherwise less interesting :->) comments,
since some elevators dispatch requests periodically.

So I have done a quick test with the 'anticipatory' elevator
instead of the RHEL4 default, CFQ: large read-aheads are not
necessary, and I get 260MB/s writing and 520MB/s reading with an
8-sector read-ahead on the same 4*(1+1) RAID0 f2 used previously.
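For scale: 'blockdev --setra' takes its argument in 512-byte
sectors, so the 8-sector read-ahead above is 4 KiB, while the
'--setra 65536' of the subject line is 32 MiB, 8192 times larger.
A trivial Python sketch of the conversion (the helper name is made
up for illustration):

```python
SECTOR_BYTES = 512  # unit used by 'blockdev --setra' / '--getra'


def setra_to_kib(sectors: int) -> float:
    """Convert a 'blockdev --setra' value (512-byte sectors) to KiB."""
    return sectors * SECTOR_BYTES / 1024


if __name__ == "__main__":
    for ra in (8, 65536):
        print(f"--setra {ra:>6} = {setra_to_kib(ra):>8.0f} KiB")
```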

In theory the elevator should have no influence on a strictly
sequential read test with strictly increasing read addresses,
as there is nothing to reorder.

However, RHEL4, which was mentioned by other people reporting the
use of very large read-aheads, comes with an old version of the
elevator subsystem (in which the elevator can only be changed for
all block devices at once, and only at boot time).
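On later 2.6 kernels (not the RHEL4 one) the elevator can instead
be inspected and switched per device at run time through
/sys/block/<dev>/queue/scheduler, which lists the available
elevators with the active one in brackets, e.g. "noop anticipatory
deadline [cfq]"; the read-ahead is visible in
/sys/block/<dev>/queue/read_ahead_kb. A minimal Python sketch for
reading both, assuming that sysfs layout (the function names and
the example device "sda" are hypothetical):

```python
from pathlib import Path
from typing import Optional


def active_scheduler(line: str) -> Optional[str]:
    """Return the bracketed (active) entry of a sysfs 'scheduler' line,
    e.g. 'noop anticipatory deadline [cfq]' -> 'cfq'."""
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None  # e.g. devices that report just 'none'


def read_queue_settings(dev: str) -> dict:
    """Read the active elevator and read-ahead (KiB) of a block device,
    if the running kernel exposes them in sysfs; missing files are skipped."""
    queue = Path("/sys/block") / dev / "queue"
    settings = {}
    sched = queue / "scheduler"
    if sched.exists():
        settings["scheduler"] = active_scheduler(sched.read_text())
    ra = queue / "read_ahead_kb"
    if ra.exists():
        settings["read_ahead_kb"] = int(ra.read_text())
    return settings


if __name__ == "__main__":
    print(read_queue_settings("sda"))  # 'sda' is only an example name
```

Switching is then a matter of writing an elevator name to the
same 'scheduler' file as root, rather than using the global
'elevator=' boot parameter.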

Perhaps the CFQ version in RHEL4 inserts pauses into the stream
of read requests, which then have to be amortized over long
streams of read requests; and perhaps the variability in
performance depends on resonances between the length of the
read-ahead at the RAID block device level and the interval
between pauses at the underlying disk level.
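The amortization part of this guess can be put into numbers with
a deliberately crude model (an assumption, not a measurement):
each read-ahead batch is transferred at a fixed streaming rate and
is then followed by a fixed idle gap. The 520 figure reuses the
read rate above as a hypothetical streaming rate, and the 4 ms gap
is arbitrary; the model does not capture the resonance effect.

```python
def effective_mib_s(ra_kib: float, bw_mib_s: float, gap_s: float) -> float:
    """Throughput when each read-ahead batch of ra_kib KiB is transferred at
    bw_mib_s MiB/s and is followed by a fixed idle gap of gap_s seconds.
    Hypothetical model of stop-and-go dispatch, not of any real elevator."""
    ra_mib = ra_kib / 1024
    return ra_mib / (ra_mib / bw_mib_s + gap_s)


if __name__ == "__main__":
    for ra_kib in (4, 64, 1024, 32768):
        rate = effective_mib_s(ra_kib, bw_mib_s=520, gap_s=0.004)
        print(f"{ra_kib:>6} KiB read-ahead -> {rate:6.1f} MiB/s")
```

Under these assumptions throughput goes from about 1 MiB/s with a
4 KiB read-ahead to nearly 490 MiB/s with 32 MiB, i.e. a per-batch
pause makes huge read-aheads look necessary, while with no gap the
rate equals the streaming rate whatever the read-ahead.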

I have used 'anticipatory' in my test above because it is known
to favour sequential access patterns. Unfortunately it does so a
bit too much and also leads to poor latency with multiple streams,
which is probably why the default is CFQ. Again, the version
of CFQ in RHEL4 is old, so it has few tweakables, but perhaps it
can be tweaked to be less stop-and-go.

Anyhow, the elevator seems to be why there are pauses in the
stream of read operations (but not [much] with write ones...).
It still seems to me that the block IO subsystem structures IO
in units of the read-ahead size, which is not good, but at least
tolerable if the read-ahead is rather small (a few KiB), as it
should be.
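One way to check the "IO in units of the read-ahead size" idea
against 'vmstat 1' output (whose bi/bo columns count 1 KiB
blocks) is to take the greatest common divisor of the per-second
samples: if IO is only ever issued in whole batches of some size,
that size divides every sample and therefore divides the GCD. A
Python sketch; the sample values below are invented for
illustration, and on real, noisy data a large GCD is suggestive
rather than proof:

```python
from functools import reduce
from math import gcd
from typing import Iterable


def common_io_unit(samples_kib_s: Iterable[int]) -> int:
    """Largest unit (KiB) dividing every per-second sample.
    Any fixed batch size used for all IO must divide this value."""
    return reduce(gcd, samples_kib_s, 0)


if __name__ == "__main__":
    # Hypothetical 'vmstat 1' bi samples with a 1000 KiB read-ahead in effect.
    samples = [37000, 52000, 120000, 41000]
    print(common_io_unit(samples))
```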

Finally, I am getting a bit skeptical about elevators in general;
several tests show that using no elevator is not significantly
worse than, and sometimes better than, any elevator. I suspect
that elevators as currently designed have pathological cases
that are too common, as their designers may not have been careful
enough to ensure that their influence was small and robust...

Thread overview: 8+ messages
2008-03-05 17:21 the '--setra 65536' mistery, analysis and WTF? pg_mh, Peter Grandi
2008-03-13 13:02 ` Nat Makarevitch
2008-03-18 21:31 ` Peter Grandi
2008-03-20  8:12   ` Peter Grandi [this message]
2008-03-21 15:12     ` Nat Makarevitch
2008-03-25 12:05 ` Why are MD block IO requests subject to 'plugging'? Peter Grandi
2008-03-25 19:39   ` Peter Grandi
2008-03-27  4:07   ` Neil Brown
