linux-raid.vger.kernel.org archive mirror
* the '--setra 65536' mystery, analysis and WTF?
@ 2008-03-05 17:21 pg_mh, Peter Grandi
  2008-03-13 13:02 ` Nat Makarevitch
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: pg_mh, Peter Grandi @ 2008-03-05 17:21 UTC (permalink / raw)
  To: Linux RAID

I was recently testing a nice Dell 2900 with two MD1000 array
enclosures with 4-5 drives each (a mixture of SAS and SATA...)
attached to an LSI MegaRAID, but run as a sw RAID10 4x(1+1)
(with '-p f2' to get extra read bandwidth at the expense of
write performance). Running RHEL4 (patched 2.6.9 kernel).

I was greatly perplexed to find that I could get a clean 250MB/s
when writing, but usually only 50-70MB/s when reading, and then
sometimes 350-400MB/s. So I tested each disk individually and I
could get the expected 85MB/s (SATA) to 103MB/s (SAS).

Note: this was done with very crude bulk sequential transfer
  tests using 'dd bs=4k' (and 'sysctl vm/drop_caches=3') or with
  Bonnie 1.4 (either with very large files or with '-o_direct').
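
For concreteness, a minimal sketch of that kind of test (the device
name and the 'count' are illustrative; dd prints the transfer rate
when it finishes):

  sysctl vm/drop_caches=3                           # flush the page cache first
  dd bs=4k count=2000000 if=/dev/md0 of=/dev/null   # ~8GB bulk sequential read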

Highly perplexing, and then I remembered some people reporting
that they had to do 'blockdev --setra 65536' to get good
streaming performance in similar circumstances, and indeed this
applied to my case too (when set on the '/dev/md0' device).
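
For reference, the setting in question (the value is in 512-byte
sectors, so 65536 means 32MiB of read-ahead):

  blockdev --getra /dev/md0          # current read-ahead, in 512B sectors
  blockdev --setra 65536 /dev/md0    # 65536 sectors = 32MiB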

So I tried several combinations of 'dd' block size, 'blockdev'
read-ahead, and disk and RAID device targets (a sketch of the
sweep follows the list below), and I noticed this rather
worrying combination of details:

* The 'dd' block size had really little influence on the
  outcome.

* A read-ahead of up to 64 (32KiB) on the disk block device, when
  reading an individual disk, resulted in an increasing transfer
  rate, and the transfer rate reached the nominal top one for the
  disk at 64.

* A read-ahead of less than 65536 on '/dev/md0' resulted in
  increasing but erratic performance (the read-ahead on the
  individual disks seemed not to matter when reading from
  '/dev/md0').

* In both the disk and '/dev/md0' cases I watched instantaneous
  transfer rates with 'vmstat 1' and 'watch iostat 1 2'.

  I noticed that interrupts/s seemed exactly inversely
  proportional to the read-ahead, with lots of interrupts/s for a
  small read-ahead, and few with a large read-ahead.

  When reading from the '/dev/md0' the load was usually spread
  equally between the 8 array disks with 65536, but rather
  unevenly with the smaller values.

* Most revealingly, when I used values of read-ahead which were
  powers of 10, the number of blocks/s reported by 'vmstat 1'
  was also a multiple of that power of 10.
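
The sweep was done along these lines (a sketch; the device name and
the list of read-ahead values are illustrative, with 'vmstat 1' and
'watch iostat 1 2' running in another terminal):

  for ra in 8 16 32 64 128 256 1024 65536; do
      blockdev --setra $ra /dev/md0
      sysctl vm/drop_caches=3
      echo "read-ahead: $ra sectors"
      dd bs=4k count=1000000 if=/dev/md0 of=/dev/null
  done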

All these (which happen also under 2.6.23 on my laptop's disk
and other workstations) seem to point to the following
conclusions:

* Quite astonishingly, the Linux block device subsystem does not
  do mailboxing/queueing of IO requests, but turns the
  read-ahead for the device into a blocking factor, and always
  issues read requests to the device driver for a strip of N
  blocks, where N is the read-ahead, then waits for completion
  of each strip before issuing the next request.

* This half-duplex logic with dire implications for performance
  is used even if the host adapter is capable of mailboxing and
  tagged queueing (verified also on a 3ware host adapter).

All this seems awful enough, because it results in streaming
pauses unless the read-ahead (thus the number of blocks read at
once from devices) is large, but it is more worrying that while
a read-ahead of 64 already results in infrequent enough pauses
for single disk drives, it does not for RAID block devices.

  For writes queueing and streaming seem to be happening
  naturally as written pages accumulate in the page cache.

The read-ahead on the RAID10 has to be a lot larger (apparently
32MiB) to deliver the expected level of streaming read speed.
This is very bad for everything except bulk streaming. It is
hard to imagine why that is needed, unless the calculation is
wrong. Also, with smaller values the read rates are erratic:
sometimes high for a while, then slower.

I had a look at the code: in the block subsystem the code
dealing with 'ra_pages' is opaque, but there is nothing that
screams that it is doing blocked reads instead of streaming
reads.

In 'drivers/md/raid10.c' there is one of the usual awful
practices of overriding the user's chosen value (to at least two
stripes) without actually telling the user ('--getra' does not
return the actual value used), but nothing overtly suspicious.

Before I do some debugging and tracing of where things go wrong,
it would be nice if someone more familiar with the vagaries of
the block subsystem and of the MD RAID code had a look and
guessed at where the problems above arise...


* Re: the '--setra 65536' mystery, analysis and WTF?
  2008-03-05 17:21 the '--setra 65536' mystery, analysis and WTF? pg_mh, Peter Grandi
@ 2008-03-13 13:02 ` Nat Makarevitch
  2008-03-18 21:31 ` Peter Grandi
  2008-03-25 12:05 ` Why are MD block IO requests subject to 'plugging'? Peter Grandi
  2 siblings, 0 replies; 8+ messages in thread
From: Nat Makarevitch @ 2008-03-13 13:02 UTC (permalink / raw)
  To: linux-raid

Disclaimer: I may be no more "familiar with the vagaries of the block subsystem and
of the MD RAID code" than you :-)

> * The 'dd' block size had really little influence on the outcome.

Is it the classic 'dd', doing purely 'classic' (read/write) blocking I/O?

> * A read-ahead up to 64 (32KiB) on the disk block device reading
>   the individual disk resulted in an increasing transfer rate,
>   and then the transfer rate reached the nominal top one for the
>   disk with 64.

The optimal value is very context-dependent. The disk's integrated cache size, for
example, is AFAIK not negligible.

> * A read-ahead of less than 65536 on '/dev/md0' resulted in increasing
>   but erratic performance

The 'erratic' part of your account seems weird to me. You use different disk
models; that may be part of an explanation.

Are you sure that your RAID was fully built and not in 'degraded' mode (check
with mdadm -D /dev/RAIDDeviceName)?
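
e.g. something along these lines ('/dev/md0' assumed):

  mdadm -D /dev/md0 | egrep 'State :|Devices :'
  cat /proc/mdstat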

> (the read-ahead on the individual
>   disks seemed not to matter when reading from '/dev/md0').

Same here

> * In both disk and '/dev/md0' cases I watched instantaneous
>   transfer rates with 'vmstat 1' and 'watch iostat 1 2'.

Various disk internal-housekeeping processes may distort too short a benchmark.
Let it run for at least 60 seconds, then calculate the average (dd and sdd can
help). Moreover, invoke them via 'time' to check the CPU load. Any hint on
checking the bus load will be welcome!

>   I noticed that interrupts/s seemed exactly inversely
>   proportional to read ahead, with lots of interrupts/s for
>   small read-ahead, and few with large read-ahead.

This seems normal to me: interrupts only occur upon controller
work; they don't occur when the requested block is in the buffer cache. With
enough read-ahead each disk read feeds the buffer cache with many blocks,
therefore reducing the 'interrupt pressure'.

>   When reading from the '/dev/md0' the load was usually spread
>   equally between the 8 array disks with 65536, but rather
>   unevenly with the smaller values.

Maybe because a smaller value forbids parallelization (reading a single stripe
is sufficient)

> * Most revealingly, when I used values of read ahead which were
>   powers of 10, the numbers of block/s reported by 'vmstat 1'
>   was also a multiple of that power of 10.

Some weird observations have led me to think that vmstat may be somewhat inadequate.

> * Quite astonishingly, the Linux block device subsystem does not
>   do mailboxing/queueing of IO requests, but turns the
>   read-ahead for the device into a blocking factor, and always
>   issues read requests to the device driver for a strip of N
>   blocks, where N is the read-ahead, then waits for completion
>   of each strip before issuing the next request.

On purely sequential I/O it seems OK to me. Is it also true on random I/O? Is it
true with deadline and CFQ? Is it true when you saturate the system with async
I/O (if each request blocks there is no way for the kernel to further optimize),
with multiple (simultaneously requesting I/O) threads, or with processes?

As you probably already know: when requests are issued in parallel, the involved
parties (CPU and libc + kernel) are able to generate and accept a huge number of
requests and then group them (at the elevator level) before really sending them
to the controller. Try using the deadline IO scheduler and reducing its ability
to group, by playing with /sys/block/DeviceName/queue/iosched/read_expire, and
please let us know the results.
Try 'randomio' or 'tiobench' (see the URL below)
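
A sketch of that experiment (it assumes a kernel new enough to switch
the scheduler per device; the device name and values are illustrative):

  cat /sys/block/sda/queue/scheduler                # list of available elevators
  echo deadline > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/queue/iosched/read_expire      # deadline read expiry, in ms (default 500)
  echo 100 > /sys/block/sda/queue/iosched/read_expire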

> * This half-duplex logic with dire implications for performance
>   is used even if the host adapter is capable of mailboxing and
>   tagged queueing (verified also on a 3ware host adapter).

3ware is no longer on my list (see http://www.makarevitch.org/rant/raid/ )

>   For writes queueing and streaming seem to be happening
>   naturally as written pages accumulate in the page cache.

Writes are fundamentally different when they can be cached for write-back

> The read-ahead on the RAID10 has to be a lot larger (apparently
> 32MiB) to deliver the expected level of streaming read speed.
> This is very bad except for bulk streaming. It is hard to
> imagine why that is needed, unless the calculation is wrong.

Maybe because the relevant data ("next to be requested") is just under the
disk's head after read-ahead has kicked in and "extended" the read. Without any
read-ahead these data will not be in the buffer cache, resulting in a cache
'miss' when the next request (from the same sequential read set of requests)
arrives. System and disk logic induce very small latencies, but the disk
platters revolve continuously, therefore the needed data is sometimes already
behind the head by the time your code has received the previous data and
requests the next blocks. The disk will only be able to read it after a
near-complete platter rotation. This is a huge delay by CPU and DMA standards.
In other words, read-ahead reduces the ratio of platter rotations to useful
data read.
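
To put rough numbers on it (a back-of-the-envelope sketch; 7200 RPM
and ~80MB/s streaming are assumed figures):

  awk 'BEGIN {
      rpm = 7200; mb_s = 80        # assumed spindle speed and streaming rate
      rot_ms = 60000 / rpm         # one full platter rotation, in ms
      printf "one rotation: %.1f ms; ~%.0f KiB pass under the head per rotation\n", rot_ms, mb_s * 1024 * rot_ms / 1000
  }'

So each missed revolution costs several hundred KiB of potential
sequential transfer.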







* Re: the '--setra 65536' mystery, analysis and WTF?
  2008-03-05 17:21 the '--setra 65536' mystery, analysis and WTF? pg_mh, Peter Grandi
  2008-03-13 13:02 ` Nat Makarevitch
@ 2008-03-18 21:31 ` Peter Grandi
  2008-03-20  8:12   ` Peter Grandi
  2008-03-25 12:05 ` Why are MD block IO requests subject to 'plugging'? Peter Grandi
  2 siblings, 1 reply; 8+ messages in thread
From: Peter Grandi @ 2008-03-18 21:31 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> * Most revealingly, when I used values of read ahead which
>   were powers of 10, the numbers of block/s reported by
>   'vmstat 1' was also a multiple of that power of 10.

Most disturbingly, this seems to indicate that not only the
Linux block IO subsystem issues IO operations in multiples of
the read-ahead size, but does so at a fixed number of times per
second that is a multiple of 10.

Which leads me to suspect that the queueing of IO requests on
the driver's queue, or even the issuing of requests from the
driver to the device, may end up being driven by the clock tick
interrupt frequency, not the device interrupt frequency.

Eventually in my copious free time :-) I shall put in a few
traces to see what is the timewise flow of read requests...

[ ... ]


* Re: the '--setra 65536' mystery, analysis and WTF?
  2008-03-18 21:31 ` Peter Grandi
@ 2008-03-20  8:12   ` Peter Grandi
  2008-03-21 15:12     ` Nat Makarevitch
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Grandi @ 2008-03-20  8:12 UTC (permalink / raw)
  To: Linux RAID

[ ... on large read-ahead being needed for reasonable Linux RAID
read performance ... ]

>> * Most revealingly, when I used values of read ahead which
>> were powers of 10, the numbers of block/s reported by 'vmstat
>> 1' was also a multiple of that power of 10.

> Most disturbingly, this seems to indicate that not only the
> Linux block IO subsystem issues IO operations in multiples of
> the read-ahead size, but does so at a fixed number of times per
> second that is a multiple of 10.

> Which leads me to suspect that the queueing of IO requests on
> the driver's queue, or even the issuing of requests from the
> driver to the device, may end up being driven by the clock tick
> interrupt frequency, not the device interrupt frequency.

Which led me to think about elevators, which were also mentioned
in some recent (and otherwise less interesting :->) comments, as
some elevators do it periodically.

So I have done a quick test with the 'anticipatory' elevator
instead of the RHEL4 default CFQ: large read-aheads are not
necessary, and I get 260MB/s writing and 520MB/s reading with an
8-sector read-ahead on the same 4x(1+1) RAID10 f2 used previously.
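
For reference, how one would switch elevators (a sketch; the device
name is illustrative, and the per-device switch needs a newer kernel
than RHEL4's 2.6.9, where the elevator is chosen for all devices at
boot):

  # newer 2.6 kernels: per device, at run time
  cat /sys/block/sda/queue/scheduler
  echo anticipatory > /sys/block/sda/queue/scheduler

  # RHEL4 (2.6.9): kernel boot parameter, e.g.
  #   elevator=as    (or elevator=cfq, elevator=deadline, elevator=noop)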

In theory the elevator should have no influence on a strictly
sequential reading test with strictly increasing read addresses,
as there is nothing to reorder.

However RHEL4, which was mentioned by other people reporting the
use of very large read-aheads, comes with an old version of the
elevator subsystem (which can only change the elevator for all
block devices at once, and only at boot).

Perhaps the CFQ version in RHEL4 inserts pauses in the stream of
read requests which have to be amortized over large read request
streams, and perhaps the variability in performance depends on
resonances between the length of the read-ahead at the RAID
block device level and the interval between pauses at the
underlying disk level.

I have used 'anticipatory' in my test above because it is known
to favour sequential access patterns. Unfortunately it does so a
bit too much and also leads to poor latency with multiple streams,
probably the reason why the default is CFQ. Again, the version
of CFQ in RHEL4 is old, so it has few tweakables, but perhaps it
can be tweaked to be less stop-and-go.

Anyhow, the elevator seems to be why there are pauses in the
stream of read operations (but not [much] with write ones...).
It still seems the case to me that the block IO subsystem
structures IO into chunks of read-ahead sectors, which is not
good, but at least not too bad if the read-ahead is rather small
(a few KiB), as it should be.

Finally, I am getting a bit skeptical about elevators in general;
several tests show no elevator as not being significantly worse,
and sometimes better, than any elevator. I suspect that elevators
as currently designed hit pathological cases too often, as their
designers may not have been careful enough to ensure that their
influence is small and robust...


* Re: the '--setra 65536' mystery, analysis and WTF?
  2008-03-20  8:12   ` Peter Grandi
@ 2008-03-21 15:12     ` Nat Makarevitch
  0 siblings, 0 replies; 8+ messages in thread
From: Nat Makarevitch @ 2008-03-21 15:12 UTC (permalink / raw)
  To: linux-raid

Peter Grandi <pg_lxra <at> lxra.for.sabi.co.UK> writes:

> this seems to indicate that not only the Linux block IO subsystem issues
> IO operations in multiples of the read-ahead size, but does so at a fixed
> number of times per second that is a multiple of 10.

Is it also the case with a classic controller (not a 3ware)?

http://forums.storagereview.net/index.php?showtopic=25923 may be of
interest.

> Which lead me to think about elevators, which was also mentioned
> in some recent (and otherwise less interesting :->)

Sorry, I don't understand this.

> It still seems the case to me that the block IO subsystem
> structures IO in lots of read-ahead sectors

Isn't it specific to the 3ware driver and/or hardware, even in
SINGLE (JBOD) mode?

> Finally, I am getting a bit skeptical about elevators in general;

> several tests show no-elevator as not being significantly worse
> and sometimes better than any elevator

Isn't it specific to intelligent controllers?




* Why are MD block IO requests subject to 'plugging'?
  2008-03-05 17:21 the '--setra 65536' mystery, analysis and WTF? pg_mh, Peter Grandi
  2008-03-13 13:02 ` Nat Makarevitch
  2008-03-18 21:31 ` Peter Grandi
@ 2008-03-25 12:05 ` Peter Grandi
  2008-03-25 19:39   ` Peter Grandi
  2008-03-27  4:07   ` Neil Brown
  2 siblings, 2 replies; 8+ messages in thread
From: Peter Grandi @ 2008-03-25 12:05 UTC (permalink / raw)
  To: Linux RAID

[ ... RAID10 setup on fast system, poor read rates ... ]

> I was greatly perplexed to find that I could get 250MB/s
> clean when writing, [ ... ]

It seems that this is because the intelligent host adapter
signals writes as completed immediately, thus removing the
scheduling from the Linux block IO system, and not because the
Linux MD and block IO subsystems handle writes better than
reads: I tried the same on another system with a slow, dumb host
adapter and 4 drives, and write performance was not good.

> Highly perplexing, and then I remembered some people reporting
> that they had to do 'blockdev --setra 65536' to get good
> streaming performance in similar circumstances, and indeed this
> applied to my case too (when set on the '/dev/md0' device).

[ ... ]

>   I noticed that interrupts/s seemed exactly inversely
>   proportional to read ahead, with lots of interrupts/s for
>   small read-ahead, and few with large read-ahead.

> * Most revealingly, when I used values of read ahead which were
>   powers of 10, the numbers of block/s reported by 'vmstat 1'
>   was also a multiple of that power of 10.

More precisely, it seems that the throughput is an exact multiple
of the read-ahead and of the interrupts per second. For example,
on a single hard disk, reading it 32KiB at a time with a
read-ahead of 1000 512B sectors:

  soft# blockdev --getra /dev/hda; vmstat 1
  1000
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   0  1      0   6304 468292   5652    0    0  1300    66  121   39  0  1 96  3  0
   0  1      0   6328 468320   5668    0    0 41000     0  422  720  0 19  0 81  0
   1  1      0   6264 468348   5688    0    0 41500     0  429  731  0 14  0 86  0
   1  1      0   6076 468556   5656    0    0 41500     0  427  731  0 15  0 85  0
   1  1      0   6012 468584   5660    0    0 41500     0  428  730  0 19  0 81  0
   0  1      0   6460 468112   5660    0    0 41016     0  433  730  0 16  0 84  0
   0  0      0   6420 468112   5696    0    0    20     0  114   23  0  0 100  0  0
   0  0      0   6420 468112   5696    0    0     0     0  104    9  0  0 100  0  0
   0  0      0   6420 468112   5696    0    0     0     0  103   11  0  0 100  0  0
   0  0      0   6420 468112   5696    0    0     0     0  104    9  0  0 100  0  0

The 'bi' column is in 1KiB blocks, not 512B sectors. It is an
exact multiple of 500. If one looks at the number of ''idle''
interrupts (around 100-110/s), it seems as if there are 410-415
IO interrupts per second and on each exactly 1000 512B sectors
are read. Amazing coincidence!

> * Quite astonishingly, the Linux block device subsystem does
>   not do mailboxing/queueing of IO requests, but turns the
>   read-ahead for the device into a blocking factor, and always
>   issues read requests to the device driver for a strip of N
>   blocks, where N is the read-ahead, then waits for completion
>   of each strip before issuing the next request.

Indeed, and I have noticed that the number of interrupts/s per MiB
of read rate varies considerably as one changes the read-ahead
size, so I started suspecting some really dumb logic. But it does
not seem to affect physical block devices that much, as despite it
they still seem to be able to issue back-to-back requests to the
host adapter, even if excessively quantized; but it seems to
affect MD a lot.

Another interesting detail is that while real disk devices have
queueing parameters, MD devices don't:

  # ls /sys/block/{md0,hda}/queue
  ls: cannot access /sys/block/md0/queue: No such file or directory
  /sys/block/hda/queue:
  iosched  max_hw_sectors_kb  max_sectors_kb  nr_requests  read_ahead_kb  scheduler

One can set 'blockdev --setra' on an MD device (which should be
the same as setting 'queue/read_ahead_kb' to half the value), and
that does have an effect, but then the read-ahead on all the
block devices in the MD array is ignored.
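
That is, for a plain disk the two interfaces differ only in units
(the MD device, as shown above, has no 'queue' directory in this
kernel):

  blockdev --setra 65536 /dev/md0           # units: 512B sectors
  cat /sys/block/hda/queue/read_ahead_kb    # units: KiB, so '--setra N' corresponds to N/2 here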

Another interesting detail is that in my usual setup I get 50MB/s
with a read-ahead of 64, 250MB/s with 128, and 50MB/s with 256,
which strongly suggests some "resonance" at work between
quantization factors.

So I started digging around for the obvious: some scheme for
quantizing/batching requests in the block IO subsystem, and indeed
if one looks at the details of the ''plugging'' logic:

http://www.gelato.unsw.edu.au/lxr/source/block/ll_rw_blk.c#L1605

«* generic_unplug_device - fire a request queue
 * @q:    The &struct request_queue in question
 *
 * Description:
 *   Linux uses plugging to build bigger requests queues before letting
 *   the device have at them. If a queue is plugged, the I/O scheduler
 *   is still adding and merging requests on the queue. Once the queue
 *   gets unplugged, the request_fn defined for the queue is invoked and
 *   transfers started.»

and reads further confessions here:

http://www.linuxsymposium.org/proceedings/reprints/Reprint-Axboe-OLS2004.pdf

 «For the longest time, the Linux block layer has used a
  technique dubbed plugging to increase IO throughput. In its
  simplicity, plugging works sort of like the plug in your tub
  drain—when IO is queued on an initially empty queue, the queue
  is plugged.
  Only when someone asks for the completion of some of the queued
  IO is the plug yanked out, and io is allowed to drain from the
  queue. So instead of submitting the first immediately to the
  driver, the block layer allows a small buildup of requests.
  There’s nothing wrong with the principle of plugging, and it
  has been shown to work well for a number of workloads.»

BTW, this statement is naive advocacy for a gross impropriety:
it is absolutely only the business of the device-specific part
of IO (e.g. the host adapter driver) how to translate logical IO
requests into physical IO requests, and whether coalescing or
even splitting them is good; there is amazingly little
justification for rearranging a stream of logical IO requests
at the logical IO level.

Conversely, quite properly elevators (another request stream
restructuring) apply to physical devices, not to partitions or MD
devices, and one can have different elevators on different devices
(even if having them different for MD slave devices in most cases
is very dubious).

 «2.6 also contains some additional logic to unplug a given queue
  once it reaches the point where waiting longer doesn’t make much
  sense. So where 2.4 will always wait for an explicit unplug, 2.6
  can trigger an unplug when one of two conditions are met:

  1. The number of queued requests reach a certain limit,
     q->unplug_thresh. This is device tweakable and defaults to 4.»

It not only defaults to 4, it is 4, as it is never changed from
the default:

  $ pwd
  /usr/local/src/linux-2.6.23
  $ find * -name '*.[ch]' | xargs egrep unplug_thresh
  block/elevator.c:               if (nrq >= q->unplug_thresh)
  block/ll_rw_blk.c:      q->unplug_thresh = 4;           /* hmm */
  include/linux/blkdev.h: int                     unplug_thresh;  /* After this many requests */

But more ominously, there is some (allegedly rarely triggered)
timeout on unplugging a plugged queue:

 «2. When the queue has been idle for q->unplug_delay. Also
     device tweakable, and defaults to 3 milliseconds.

  The idea is that once a certain number of requests have
  accumulated in the queue, it doesn’t make much sense to
  continue waiting for more—there is already an adequate number
  available to keep the disk happy. The time limit is really a
  last resort, and should rarely trigger in real life.

  Observations on various work loads have verified this. More
  than a handful or two timer unplugs per minute usually
  indicates a kernel bug.»

So I had a look at how the MD subsystem handles unplugging,
because of a terrible suspicion that it does two-level
unplugging, and wonder what:

http://www.gelato.unsw.edu.au/lxr/source/drivers/md/raid10.c#L599

  static void raid10_unplug(struct request_queue *q)
  {
	  mddev_t *mddev = q->queuedata;

	  unplug_slaves(q->queuedata);
	  md_wakeup_thread(mddev->thread);
  }

Can some MD developer justify the lines above?

Can some MD developer also explain why should MD engage in double
level request queueing/unplugging at both the MD and slave level?

Can some MD developer then give some very good reason why the MD
layer should be subject to plugging *at all*?

This before I spend a bit of time doing a bit of 'blktrace' work
to see how unplugging "helps" MD and perhaps setting 'unplug_thresh'
globally to 1 "just for fun" :-).
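
(For the record, the kind of tracing meant -- a minimal blktrace
invocation; it assumes CONFIG_BLK_DEV_IO_TRACE, and the device name
is illustrative:)

  mount -t debugfs none /sys/kernel/debug    # if not already mounted
  blktrace -d /dev/md0 -o - | blkparse -i -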


* Re: Why are MD block IO requests subject to 'plugging'?
  2008-03-25 12:05 ` Why are MD block IO requests subject to 'plugging'? Peter Grandi
@ 2008-03-25 19:39   ` Peter Grandi
  2008-03-27  4:07   ` Neil Brown
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Grandi @ 2008-03-25 19:39 UTC (permalink / raw)
  To: Linux RAID

[ ... low low read rates unless enormous read-aheads are used
... ]

>> * Most revealingly, when I used values of read ahead which
>>   were powers of 10, the numbers of block/s reported by
>>   'vmstat 1' was also a multiple of that power of 10.

> More precisely it seems that the throughput is an exact multiple
> of the read-ahead and interrupts per second. For example on a single
> hard disk, reading it 32KiB at a time with a read ahead of
> 1000 512B sectors: [ ... ]

Well, I have now setup an old PC I have with a test RAID, and it
is an otherwise totally quiescent system, so I can observe
things a bit more precisely.

It shows that the problem exists not just on MD devices, but on
'hd' and 'sd' devices too.

It is pretty ridiculous in the sense that the PC does exactly
101 interrupts per second, and if I run for example something
like one of:

  dd bs=NNk iflag=direct if=/dev/hdX of=/dev/null

  blockdev --setra NN /dev/hdX && sysctl vm/drop_caches=1 \
    && dd bs=32k if=/dev/hdX of=/dev/null

The number of blocks/s reported by 'vmstat 1' is exactly a
multiple of 100 or 101, e.g. 6464/s or 12800/s or 130256/s,
where the apparent request issue rate can sort of halve wrt
100Hz but not exceed it. This happens with the 'noop' elevator
too, so it must be some absurd thing elsewhere in the block IO
path.

> This before I spend a bit of time doing a bit of 'blktrace'
> work to see how unplugging "helps" MD

Seems ever more likely that I need to have a look at 'blktrace',
but it is not an MD specific issue.

> and perhaps setting 'unplug_thresh' globally to 1 "just for
> fun" :-).

Uhm I have exported both 'unplug_thresh' and 'unplug_delay' and
defaulted them both to 1 in the appended patch, and I am trying
also out of curiosity to figure out how to make the 'queue'
object/entry appear under '/sys/block/md0/md/'...

--- block/ll_rw_blk.c-dist	2007-11-17 17:22:41.484066984 +0000
+++ block/ll_rw_blk.c	2008-03-25 15:50:11.110010883 +0000
@@ -217,8 +217,8 @@
 	blk_queue_congestion_threshold(q);
 	q->nr_batching = BLK_BATCH_REQ;
 
-	q->unplug_thresh = 4;		/* hmm */
-	q->unplug_delay = (3 * HZ) / 1000;	/* 3 milliseconds */
+	q->unplug_thresh = 1;		/* hmm */
+	q->unplug_delay = (1 * HZ) / 1000;	/* 1 millisecond */
 	if (q->unplug_delay == 0)
 		q->unplug_delay = 1;
 
@@ -3997,6 +3997,54 @@
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_unplug_thresh_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->unplug_thresh, (page));
+}
+
+static ssize_t
+queue_unplug_thresh_store(struct request_queue *q, const char *page, size_t count)
+{
+	unsigned long unplug_thresh;
+	ssize_t ret = queue_var_store(&unplug_thresh, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	q->unplug_thresh = unplug_thresh;
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
+
+static ssize_t queue_unplug_delay_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->unplug_delay, (page));
+}
+
+static ssize_t
+queue_unplug_delay_store(struct request_queue *q, const char *page, size_t count)
+{
+	unsigned long unplug_delay;
+	ssize_t ret = queue_var_store(&unplug_delay, page, count);
+
+	spin_lock_irq(q->queue_lock);
+	q->unplug_delay = unplug_delay;
+	spin_unlock_irq(q->queue_lock);
+
+	return ret;
+}
+
+
+static struct queue_sysfs_entry queue_unplug_thresh_entry = {
+	.attr = {.name = "unplug_thresh", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_unplug_thresh_show,
+	.store = queue_unplug_thresh_store,
+};
+
+static struct queue_sysfs_entry queue_unplug_delay_entry = {
+	.attr = {.name = "unplug_delay", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_unplug_delay_show,
+	.store = queue_unplug_delay_store,
+};
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4028,6 +4076,8 @@
 };
 
 static struct attribute *default_attrs[] = {
+	&queue_unplug_thresh_entry.attr,
+	&queue_unplug_delay_entry.attr,
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
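
With the patch applied, the new knobs can then be poked per device
like the existing queue attributes (device name illustrative):

  echo 1 > /sys/block/hda/queue/unplug_thresh
  echo 1 > /sys/block/hda/queue/unplug_delay    # stored raw, i.e. in jiffies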


* Re: Why are MD block IO requests subject to 'plugging'?
  2008-03-25 12:05 ` Why are MD block IO requests subject to 'plugging'? Peter Grandi
  2008-03-25 19:39   ` Peter Grandi
@ 2008-03-27  4:07   ` Neil Brown
  1 sibling, 0 replies; 8+ messages in thread
From: Neil Brown @ 2008-03-27  4:07 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

On Tuesday March 25, pg_lxra@lxra.for.sabi.co.UK wrote:
> Another interesting detail is that while real disk devices have
> queueing parameter and MD device don't:
> 
>   # ls /sys/block/{md0,hda}/queue
>   ls: cannot access /sys/block/md0/queue: No such file or directory
>   /sys/block/hda/queue:
>   iosched  max_hw_sectors_kb  max_sectors_kb  nr_requests  read_ahead_kb  scheduler

No.  An md device doesn't have a queue, at least not in the same way
that IDE/SCSI drivers do.  Most of those values are completely
meaningless for an md device.

> 
> As one can set 'blockdev --setra' on an MD device (which should be
> the same as setting 'queue/read_head_kb' to half the value), and
> that does have an effect, but then the readhead on all the block
> devices in the MD array are ignored.

I'm not sure what you mean by that last line...

readahead is performed by the filesystem.  Each device has a
read-ahead number which serves as advice to the filesystem to suggest
how much readahead might be appropriate.
Components of an md device do not have a filesystem on them, so
readahead setting for those devices is not relevant.

When md sets the readahead for the md array, it does take some notice
of the read-ahead setting for the component devices.

> 
> Conversely, quite properly elevators (another request stream
> restructuring) apply to physical devices, not to partitions or MD
> devices, and one can have different elevators on different devices
> (even if having them different for MD slave devices in most cases
> is very dubious).
> 

This is exactly correct and exactly what Linux does.
There is no elevator above MD.  There is a distinct elevator above
each IDE/SCSI/etc device that is (or appears to Linux to be) a plain
device.

> 
> So I had a look at how the MD subsystem handles unplugging,
> because of a terrible suspicion that it does two-level
> unplugging, and wonder what:
> 
> http://www.gelato.unsw.edu.au/lxr/source/drivers/md/raid10.c#L599
> 
>   static void raid10_unplug(struct request_queue *q)
>   {
> 	  mddev_t *mddev = q->queuedata;
> 
> 	  unplug_slaves(q->queuedata);
> 	  md_wakeup_thread(mddev->thread);
>   }
> 
> Can some MD developer justify the lines above?

You need to look more deeply.

No md personality ever plugs read requests.
Sometimes write requests are plugged, in order to delay the requests
slightly.
There are two main reasons for this:

1/ When using a write-intent-bitmap we plug write requests so as to
  gather lots of them together, to reduce the number of updates to the
  bitmap.

2/ For raid5/raid6 we plug writes in the hope of gathering a full
  stripe of writes so we can avoid pre-reading.  As soon as a stripe can
  be processed without any pre-reading, it bypasses the plug.

> 
> Can some MD developer also explain why should MD engage in double
> level request queueing/unplugging at both the MD and slave level?

different reasons for small delays.

> 
> Can some MD developer then give some very good reason why the MD
> layer should be subject to plugging *at all*?

The MD layer doesn't do plugging.  Some MD personalities do, for
reasons I have explained above.

NeilBrown

