From: Mathieu Desnoyers <compudj@krystal.dyndns.org>
To: Jens Axboe <jens.axboe@oracle.com>
Cc: akpm@linux-foundation.org, ltt-dev@lists.casi.polymtl.ca,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org
Subject: Re: [ltt-dev] [RFC PATCH] block: Fix bio merge induced high I/O latency
Date: Tue, 20 Jan 2009 09:22:23 -0500	[thread overview]
Message-ID: <20090120142223.GD22421@Krystal> (raw)
In-Reply-To: <20090120122855.GF30821@kernel.dk>

* Jens Axboe (jens.axboe@oracle.com) wrote:
> On Tue, Jan 20 2009, Jens Axboe wrote:
> > On Mon, Jan 19 2009, Mathieu Desnoyers wrote:
> > > * Jens Axboe (jens.axboe@oracle.com) wrote:
> > > > On Sun, Jan 18 2009, Mathieu Desnoyers wrote:
> > > > > I looked at the "ls" behavior (while doing a dd) within my LTTng trace
> > > > > to create a fio job file.  Said behavior is appended below as "Part
> > > > > 1 - ls I/O behavior". Note that the original "ls" test case was done
> > > > > with the anticipatory I/O scheduler, which was active by default on my
> > > > > Debian system with a custom vanilla 2.6.28 kernel. Also note that I am
> > > > > running this on a raid-1, but have experienced the same problem on a
> > > > > standard partition I created on the same machine.
> > > > > 
> > > > > I created the fio job file appended as "Part 2 - dd+ls fio job file". It
> > > > > consists of one dd-like job and many small jobs reading as much data as
> > > > > ls did. I used the small test script to batch run this ("Part 3 - batch
> > > > > test").
> > > > > 
> > > > > The results for the ls-like jobs are interesting:
> > > > > 
> > > > > I/O scheduler        runt-min (msec)   runt-max (msec)
> > > > > noop                       41             10563
> > > > > anticipatory               63              8185
> > > > > deadline                   52             33387
> > > > > cfq                        43              1420
> > > > 
> > > 
> > > Extra note : I have a HZ=250 on my system. Changing to 100 or 1000 did
> > > not make much difference (also tried with NO_HZ enabled).
> > > 
> > > > Do you have queuing enabled on your drives? You can check that in
> > > > /sys/block/sdX/device/queue_depth. Try setting those to 1 and retest all
> > > > schedulers, would be good for comparison.
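> > > > 
> > > > For example (sdX being a placeholder for the actual drive):
> > > > 
> > > > cat /sys/block/sdX/device/queue_depth
> > > > echo 1 > /sys/block/sdX/device/queue_depth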
> > > > 
> > > 
> > > Here are the tests with a queue_depth of 1:
> > > 
> > > I/O scheduler        runt-min (msec)   runt-max (msec)
> > > noop                       43             38235
> > > anticipatory               44              8728
> > > deadline                   51             19751
> > > cfq                        48               427
> > > 
> > > 
> > > Overall, I wouldn't say it makes much difference.
> > 
> > 0.5 seconds vs 1.5 seconds isn't much of a difference?
> > 
> > > > raid personalities or dm complicates matters, since it introduces a
> > > > disconnect between 'ls' and the io scheduler at the bottom...
> > > > 
> > > 
> > > Yes, ideally I should re-run those directly on the disk partitions.
> > 
> > At least for comparison.
> > 
> > > I am also tempted to create a fio job file which acts like an ssh server
> > > receiving a connection after it has been pruned from the cache while the
> > > system is doing heavy I/O. "ssh", in this case, seems to be doing much
> > > more I/O than a simple "ls", and I think we might want to see if cfq
> > > behaves correctly in such a case. Most of this I/O is coming from page
> > > faults (identified as traps in the trace), probably because the ssh
> > > executable has been thrown out of the cache by
> > > 
> > > echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > The behavior of an incoming ssh connection after clearing the cache is
> > > appended below (Part 1 - LTTng trace for incoming ssh connection). The
> > > job file created (Part 2) reads, for each job, a 2MB file with random
> > > reads of 4k-44k each. The results are very interesting for cfq:
> > > 
> > > I/O scheduler        runt-min (msec)   runt-max (msec)
> > > noop                       586           110242
> > > anticipatory               531            26942
> > > deadline                   561           108772
> > > cfq                        523            28216
> > > 
> > > So, basically, with ssh out of the cache it can take 28s to answer an
> > > incoming ssh connection even with the cfq scheduler. This is not exactly
> > > what I would call an acceptable latency.
> > 
> > At some point, you have to stop and consider what is acceptable
> > performance for a given IO pattern. Your ssh test case is purely random
> > IO, and neither CFQ nor AS would do any idling for that. We can make
> > this test case faster for sure, the hard part is making sure that we
> > don't regress on async throughput at the same time.
> > 
> > Also remember that with your raid1, it's not entirely reasonable to
> > blame all performance issues on the IO scheduler as per my previous
> > mail. It would be a lot more fair to view the disk numbers individually.
> > 
> > Can you retry this job with 'quantum' set to 1 and 'slice_async_rq' set
> > to 1 as well?
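> > 
> > For cfq on this box that would be something like:
> > 
> > echo 1 > /sys/block/sda/queue/iosched/quantum
> > echo 1 > /sys/block/sda/queue/iosched/slice_async_rq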
> > 
> > However, I think we should be doing somewhat better at this test case.
> 
> Mathieu, does this improve anything for you?
> 

I got this message when running with your patch applied:
cfq: forced dispatching is broken (nr_sorted=4294967275), please report this
(message appeared 10 times in a job run)

Here is the result:

ssh test done on /dev/sda directly

queue_depth=31 (default)
/sys/block/sda/queue/iosched/slice_async_rq = 2 (default)
/sys/block/sda/queue/iosched/quantum = 4 (default)

I/O scheduler        runt-min (msec)   runt-max (msec)
cfq (default)             523              6637
cfq (patched)             564              7195

Pretty much the same.

Here is the test done on raid1 :
queue_depth=31 (default)
/sys/block/sd{a,b}/queue/iosched/slice_async_rq = 2 (default)
/sys/block/sd{a,b}/queue/iosched/quantum = 4 (default)

I/O scheduler        runt-min (msec)   runt-max (msec)
cfq (default, raid1)       523            28216
cfq (patched, raid1)       540            16454

The worst case is lower, but remains in the same order of magnitude.

Mathieu


> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index e8525fa..a556512 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1765,6 +1765,32 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  }
>  
>  /*
> + * Pull dispatched requests from 'cfqq' back into the scheduler
> + */
> +static void cfq_pull_dispatched_requests(struct cfq_data *cfqd,
> +					 struct cfq_queue *cfqq)
> +{
> +	struct request_queue *q = cfqd->queue;
> +	struct request *rq, *tmp;
> +
> +	list_for_each_entry_safe(rq, tmp, &q->queue_head, queuelist) {
> +		if ((rq->cmd_flags & REQ_STARTED) || RQ_CFQQ(rq) != cfqq)
> +			continue;
> +
> +		/*
> +		 * Pull off the dispatch list and put it back into the cfqq
> +		 */
> +		list_del(&rq->queuelist);
> +		cfqq->dispatched--;
> +		if (cfq_cfqq_sync(cfqq))
> +			cfqd->sync_flight--;
> +
> +		list_add_tail(&rq->queuelist, &cfqq->fifo);
> +		cfq_add_rq_rb(rq);
> +	}
> +}
> +
> +/*
>   * Check if new_cfqq should preempt the currently active queue. Return 0 for
>   * no or if we aren't sure, a 1 will cause a preempt.
>   */
> @@ -1820,8 +1846,14 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>   */
>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
> +	struct cfq_queue *old_cfqq = cfqd->active_queue;
> +
>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
> -	cfq_slice_expired(cfqd, 1);
> +
> +	if (old_cfqq) {
> +		__cfq_slice_expired(cfqd, old_cfqq, 1);
> +		cfq_pull_dispatched_requests(cfqd, old_cfqq);
> +	}
>  
>  	/*
>  	 * Put the new queue at the front of the of the current list,
> 
> -- 
> Jens Axboe
> 
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
