From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: Andrea Arcangeli <andrea@suse.de>,
akpm@linux-foundation.org, Ingo Molnar <mingo@elte.hu>,
Linus Torvalds <torvalds@linux-foundation.org>,
linux-kernel@vger.kernel.org, ltt-dev@lists.casi.polymtl.ca
Subject: Re: [RFC PATCH] block: Fix bio merge induced high I/O latency
Date: Sun, 18 Jan 2009 16:27:12 -0500
Message-ID: <20090118212712.GA5817@Krystal>
In-Reply-To: <20090118211234.GA4913@Krystal>
* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Jens Axboe (jens.axboe@oracle.com) wrote:
> > On Sat, Jan 17 2009, Mathieu Desnoyers wrote:
> > > A long-standing I/O regression (present since 2.6.18 and still there
> > > today) recently hit Slashdot:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=12309
> > > http://it.slashdot.org/article.pl?sid=09/01/15/049201
> > >
> > > I've taken a trace reproducing the wrong behavior on my machine and I
> > > think it's getting us somewhere.
> > >
> > > LTTng 0.83, kernel 2.6.28
> > > Machine: Intel Xeon E5405 dual quad-core, 16GB RAM
> > > (I just created a new block-trace.c LTTng probe which is not released
> > > yet. It basically replaces blktrace.)
> > >
> > >
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > lttctl -C -w /tmp/trace -o channel.mm.bufnum=8 -o channel.block.bufnum=64 trace
> > >
> > > dd if=/dev/zero of=/tmp/newfile bs=1M count=1M
> > > cp -ax music /tmp (copying 1.1GB of mp3)
> > >
> > > ls (takes 15 seconds to get the directory listing!)
> > >
> > > lttctl -D trace
> > >
> > > I looked at the trace (especially at the ls surroundings), and bash is
> > > waiting for a few seconds for I/O in the exec system call (to exec ls).
> > >
> > > While this happens, we have dd doing lots and lots of bio_queue. There
> > > is a bio_backmerge after each bio_queue event. This is reasonable,
> > > because dd is writing to a contiguous file.
> > >
> > > However, I wonder if this is not the actual problem. dd has the head
> > > request in the elevator request queue and progresses steadily by
> > > periodically plugging/unplugging the device, so it gets its work done.
> > > However, because requests are being dequeued at the same rate that
> > > others are being merged, I suspect it stays at the top of the queue
> > > and does not let the other, unrelated requests run.
> > >
> > > There is a test in blk-merge.c which makes sure that merged requests
> > > do not get bigger than a certain size. However, if the request is
> > > steadily dequeued, I think this test does not accomplish anything.
> > >
> > >
> > > This patch implements a basic test to make sure we never merge more
> > > than 128 bios into the same request if it is the "last_merge"
> > > request. I have not been able to trigger the problem again with the
> > > fix applied. It might not be in a perfect state: there may be better
> > > solutions to the problem, but I think it helps point out where the
> > > culprit lies.
> >
> > To be painfully honest, I have no idea what you are attempting to solve
> > with this patch. First of all, Linux has always merged any request
> > possible. The one-hit cache is just that, a one hit cache frontend for
> > merging. We'll be hitting the merge hash and doing the same merge if it
> > fails. Since we even cap the size of the request, the merging is also
> > bounded.
> >
>
> Hi Jens,
>
> I was mostly trying to poke around and try to figure out what was going
> on in the I/O elevator. Sorry if my first attempts did not make much
> sense. Following your advice, I've looked more deeply into the test
> cases.
>
> > Furthermore, the request being merged is not considered for IO yet. It
> > has not been dispatched by the io scheduler. IOW, I'm surprised your
> > patch makes any difference at all. Especially with your 128 limit, since
> > 4kb x 128 is 512kb, which is the default max merge size anyway. These
> > sorts of test cases tend to be very sensitive and exhibit different
> > behaviour across runs, so call me a bit skeptical and consider that an
> > encouragement to do more directed testing. You could use fio for
> > instance. Have two jobs in your job file. One is a dd type process that
> > just writes a huge file, the other job starts eg 10 seconds later and
> > does a 4kb read of a file.
> >
>
> I looked at the "ls" behavior (while doing a dd) within my LTTng trace
> to create a fio job file. That behavior is appended below as "Part
> 1 - ls I/O behavior". Note that the original "ls" test case was done
> with the anticipatory I/O scheduler, which was active by default on my
> Debian system with a custom vanilla 2.6.28 kernel. Also note that I am
> running this on a RAID-1, but I have experienced the same problem on a
> standard partition I created on the same machine.
>
> I created the fio job file appended as "Part 2 - dd+ls fio job file". It
> consists of one dd-like job and many small jobs reading as much data as
> ls did. I used the small test script to batch-run this ("Part 3 - batch
> test").
>
> The results for the ls-like jobs are interesting:
>
> I/O scheduler        runt-min (msec)    runt-max (msec)
> noop                      41                10563
> anticipatory              63                 8185
> deadline                  52                33387
> cfq                       43                 1420
>
>
> > As a quick test, could you try and increase the slice_idle to eg 20ms?
> > Sometimes I've seen timing being slightly off, which makes us miss the
> > sync window for the ls (in your case) process. Then you get a mix of
> > async and sync IO all the time, which very much slows down the sync
> > process.
> >
>
> Just to confirm, the quick test you are talking about would be:
>
> ---
> block/cfq-iosched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6-lttng/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6-lttng.orig/block/cfq-iosched.c 2009-01-18 15:17:32.000000000 -0500
> +++ linux-2.6-lttng/block/cfq-iosched.c 2009-01-18 15:46:38.000000000 -0500
> @@ -26,7 +26,7 @@ static const int cfq_back_penalty = 2;
> static const int cfq_slice_sync = HZ / 10;
> static int cfq_slice_async = HZ / 25;
> static const int cfq_slice_async_rq = 2;
> -static int cfq_slice_idle = HZ / 125;
> +static int cfq_slice_idle = 20;
>
> /*
> * offset from end of service tree
>
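
For what it's worth, the same experiment can be run without rebuilding the
kernel: cfq exports slice_idle (in milliseconds) through sysfs. A sketch,
with the device name assumed to be sda:

```shell
# Inspect the current cfq idle window for sda, in milliseconds.
cat /sys/block/sda/queue/iosched/slice_idle

# Try the suggested 20 ms value for the duration of the test.
echo 20 > /sys/block/sda/queue/iosched/slice_idle
```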
>
> It does not make much difference compared to the standard cfq test:
>
> I/O scheduler            runt-min (msec)    runt-max (msec)
> cfq (standard)                43                 1420
> cfq (20ms slice_idle)         31                 1573
>
>
> So, I guess a 1.5s delay to run ls on a directory when the cache is cold
> with the cfq I/O scheduler is somewhat acceptable, but I doubt the 8, 10
> and 33s response times for the anticipatory, noop and deadline I/O
> schedulers are. I wonder why on earth the anticipatory I/O scheduler is
> activated by default in my kernel, given that it results in such poor
> interactive behavior when doing large I/O.
>
I found out why: I had an old pre-2.6.18 .config hanging around in
/boot on _many_ of my systems and upgraded to a newer vanilla kernel
using those defaults. make oldconfig left
CONFIG_DEFAULT_IOSCHED="anticipatory" in place.
Changing it to CONFIG_DEFAULT_IOSCHED="cfq" makes everything run better
under heavy I/O. I bet I'm not the only one in this situation.
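
Whether a given system actually runs the intended scheduler can be checked
at runtime: the active elevator is the bracketed entry in
/sys/block/<dev>/queue/scheduler. The sketch below parses a sample string;
on a live system, substitute the contents of the sysfs file (device name
assumed):

```shell
# Sample sysfs contents; on a real machine use:
#   line=$(cat /sys/block/sda/queue/scheduler)
line="noop anticipatory deadline [cfq]"

# The bracketed entry is the scheduler currently in use.
active=$(printf '%s\n' "$line" | sed 's/.*\[\(.*\)\].*/\1/')
echo "active scheduler: $active"
```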
Mathieu
> Thanks for the advice,
>
> Mathieu
>
>
>
> * Part 1 - ls I/O behavior
>
> lttv -m textDump -t /traces/block-backmerge \
> -e "state.pid=4145&event.subname=bio_queue"
>
> block.bio_queue: 662.707321959 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, SYSCALL { sector = 327680048, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 10, not_uptodate = 0 }
> block.bio_queue: 662.707331445 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, SYSCALL { sector = 349175018, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.968214766 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, SYSCALL { sector = 327696968, size = 16384, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.968222110 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, SYSCALL { sector = 349191938, size = 16384, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971662800 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 327697032, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971670417 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 349192002, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971684184 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 327697040, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971689854 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 349192010, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971695762 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 327697048, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971701135 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 349192018, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971706301 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 327697056, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971711698 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 349192026, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971723359 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 327697064, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.971729035 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 349192034, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.999391873 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 327697072, size = 53248, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 662.999397864 (/traces/block-backmerge/block_2), 4145, 4145, bash, , 4063, 0x0, TRAP { sector = 349192042, size = 53248, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 670.809328737 (/traces/block-backmerge/block_7), 4145, 4145, /bin/ls, , 4063, 0x0, TRAP { sector = 327697000, size = 16384, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 670.809337500 (/traces/block-backmerge/block_7), 4145, 4145, /bin/ls, , 4063, 0x0, TRAP { sector = 349191970, size = 16384, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 671.161036834 (/traces/block-backmerge/block_5), 4145, 4145, /bin/ls, , 4063, 0x0, SYSCALL { sector = 360714880, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 10, not_uptodate = 0 }
> block.bio_queue: 671.161047247 (/traces/block-backmerge/block_5), 4145, 4145, /bin/ls, , 4063, 0x0, SYSCALL { sector = 382209850, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
> block.bio_queue: 671.653601399 (/traces/block-backmerge/block_7), 4145, 4145, /bin/ls, , 4063, 0x0, SYSCALL { sector = 360712184, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 10, not_uptodate = 0 }
> block.bio_queue: 671.653611077 (/traces/block-backmerge/block_7), 4145, 4145, /bin/ls, , 4063, 0x0, SYSCALL { sector = 382207154, size = 4096, rw(FAILFAST_DRIVER,FAILFAST_TRANSPORT, = 0, not_uptodate = 0 }
>
>
> * Part 2 - dd+ls fio job file (test.job5)
>
> [job1]
> rw=write
> size=10240m
> direct=0
> blocksize=1024k
>
> [global]
> rw=randread
> size=96k
> filesize=30m
> direct=0
> bsrange=4k-52k
>
> [file1]
> startdelay=0
>
> [file2]
> startdelay=4
>
> [file3]
> startdelay=8
>
> [file4]
> startdelay=12
>
> [file5]
> startdelay=16
>
> [file6]
> startdelay=20
>
> [file7]
> startdelay=24
>
> [file8]
> startdelay=28
>
> [file9]
> startdelay=32
>
> [file10]
> startdelay=36
>
> [file11]
> startdelay=40
>
> [file12]
> startdelay=44
>
> [file13]
> startdelay=48
>
> [file14]
> startdelay=52
>
> [file15]
> startdelay=56
>
> [file16]
> startdelay=60
>
> [file17]
> startdelay=64
>
> [file18]
> startdelay=68
>
> [file19]
> startdelay=72
>
> [file20]
> startdelay=76
>
> [file21]
> startdelay=80
>
> [file22]
> startdelay=84
>
> [file23]
> startdelay=88
>
> [file24]
> startdelay=92
>
> [file25]
> startdelay=96
>
> [file26]
> startdelay=100
>
> [file27]
> startdelay=104
>
> [file28]
> startdelay=108
>
> [file29]
> startdelay=112
>
> [file30]
> startdelay=116
>
> [file31]
> startdelay=120
>
> [file32]
> startdelay=124
>
> [file33]
> startdelay=128
>
> [file34]
> startdelay=132
>
> [file35]
> startdelay=134
>
> [file36]
> startdelay=138
>
> [file37]
> startdelay=142
>
> [file38]
> startdelay=146
>
> [file39]
> startdelay=150
>
> [file40]
> startdelay=200
>
> [file41]
> startdelay=260
>
>
> * Part 3 - batch test (do-tests.sh)
>
> #!/bin/sh
>
> TESTS="anticipatory noop deadline cfq"
>
> for TEST in ${TESTS}; do
> echo "Running ${TEST}"
>
> rm -f file*.0 job*.0
>
> echo ${TEST} > /sys/block/sda/queue/scheduler
> echo ${TEST} > /sys/block/sdb/queue/scheduler
> sync
> echo 3 > /proc/sys/vm/drop_caches
> sleep 5
>
> ./fio test.job5 --output test.result.${TEST}
> done
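
A small helper can pull the runt-min/runt-max numbers out of the
test.result.* files the script produces. The `runt=` token below assumes
the latency format fio printed at the time; adjust the pattern for other
fio versions. The demo input file is fabricated for illustration:

```shell
#!/bin/sh
# Report the minimum and maximum "runt=<N>msec" values found in a
# fio result file (format assumed; see lead-in).
extract_runt() {
    grep -o 'runt=[ ]*[0-9]*msec' "$1" |
    sed -e 's/runt=[ ]*//' -e 's/msec$//' |
    awk 'NR==1 {min=$1+0; max=$1+0}
         {v=$1+0; if (v<min) min=v; if (v>max) max=v}
         END {printf "runt-min=%d msec, runt-max=%d msec\n", min, max}'
}

# Demo with a fabricated stand-in for test.result.cfq:
cat > /tmp/demo.result <<'EOF'
  read : io=96KB, bw=2KB/s, iops=0, runt=  43msec
  read : io=96KB, bw=1KB/s, iops=0, runt=1420msec
EOF
extract_runt /tmp/demo.result
```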
>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68