Re: [RFC] Workload type Vs Groups (Was: Re: [PATCH 02/20] blkio: Change CFQ to use CFS like queue time stamps)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Vivek Goyal <vgoyal@redhat.com>
To: Corrado Zoccolo <czoccolo@gmail.com>
Cc: linux-kernel@vger.kernel.org, jens.axboe@oracle.com,
	nauman@google.com, dpshah@google.com, lizf@cn.fujitsu.com,
	ryov@valinux.co.jp, fernando@oss.ntt.co.jp,
	s-uchida@ap.jp.nec.com, taka@valinux.co.jp,
	guijianfeng@cn.fujitsu.com, jmoyer@redhat.com,
	balbir@linux.vnet.ibm.com, righi.andrea@gmail.com,
	m-ikeda@ds.jp.nec.com, akpm@linux-foundation.org,
	riel@redhat.com, kamezawa.hiroyu@jp.fujitsu.com
Subject: Re: [RFC] Workload type Vs Groups (Was: Re: [PATCH 02/20] blkio: Change CFQ to use CFS like queue time stamps)
Date: Tue, 10 Nov 2009 09:12:46 -0500	[thread overview]
Message-ID: <20091110141246.GB1083@redhat.com> (raw)
In-Reply-To: <20091110133113.GA1083@redhat.com>

On Tue, Nov 10, 2009 at 08:31:13AM -0500, Vivek Goyal wrote:
> On Tue, Nov 10, 2009 at 12:29:30PM +0100, Corrado Zoccolo wrote:
> > On Tue, Nov 10, 2009 at 12:12 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > I thought it was reverse. For sync-noidle workoads (typically seeky), we
> > > do lot less IO and size of IO is not the right measure otherwise most of
> > > the disk time we will be giving to this sync-noidle queue/group and
> > > sync-idle queues will be heavily punished in other groups.
> > 
> > This happens only if you try to measure both sequential and seeky with
> > the same metric.
> 
> Ok, we seem to be discussing many things. I will try to pull it back on
> core points.
> 
> To me there are only two key questions.
> 
> - Whether workload type should be on topmost layer or groups should be on
>   topmost layer.
> 
> - How to define fairness in case of NCQ SSD where idling hurts and we
>   don't choose to idle.
> 
> 
> For the first issue, if we keep workoad type on top, then we weaken the
> isolation between groups. We provide isolation between only same kind of
> workload type and not across the workloads types.
> 
> So if a group is running only sequential readers and other group is runnig
> random seeky reaeder, then share of second group is not determined by the
> group weight but the number of queues in first group.
> 
> Hence as we increase number of queues in first group, share of second
> group keep on coming down. This kind of implies that sequential reads
> in first group are more important as comapred to random seeky reader in
> second group. But in this case the relative importance of workload is
> specifed by the user with the help of cgroups and weights and IO scheduler
> should honor that.
> 
> So to me, groups on topmost layer makes more sense than having workload
> type on topmost layer.
> 
> > >
> > > time based fairness generally should work better on seeky media. As the
> > > seek cost starts to come down, size of IO also starts making sense.
> > >
> > > In fact on SSD, we do queue switching so fast and don't idle on the queue,
> > > doing time accounting and providing fairness in terms of time is hard, for
> > > the groups which are not continuously backlogged.
> > The mechanism in place still gives fairness in terms of I/Os for SSDs.
> > One queue is not even nearly backlogged, then there is no point in
> > enforcing fairness for it so that the backlogged one gets lower
> > bandwidth, but the not backlogged one doesn't get higher (since it is
> > limited by its think time).
> > 
> > For me fairness for SSDs should happen only when the total BW required
> > by all the queues is more than the one the disk can deliver, or the
> > total number of active queues is more than NCQ depth. Otherwise, each
> > queue will get exactly the bandwidth it wants, without affecting the
> > others, so no idling should happen. In the mentioned cases, instead,
> > no idling needs to be added, since the contention for resource will
> > already introduce delays.
> > 
> 
> Ok, above is pertinent for the second issue of not idling on NCQ SSDs as
> it hurts and brings down the overall throughput. I tend to agree here,
> that idling on queues limited by think time does not make much sense on
> NCQ SSD. In this case probably fairness will be defined by how many a
> times a group got scheduled in for dispatch. If group has higher weight
> then it should be able to dispatch more times (in proportionate ratio),
> as compared to group lower weight group.
> 
> We should be able to achieve this without idling hence overall thoughtput
> of the system should also be good. The only catch here is that it will be
> hard to achieve this behavior if group is not continuously backlogged.
> 
> You seem to be suggesting that current CFQ formula for calculating slice
> offset provides take care of that. Looking at the formula I can't
> understand how does it enable dispatch from a queue in proportion to
> weight or priority. I will do some experiments on my NCQ SSD and do
> more discussion on this aspect later.
> 

Ok, I ran some simple tests on my NCQ SSD. I had pulled the Jen's branch
few days back and it has your patches in it.

I am running three direct sequential readers or prio 0, 4 and 7
respectively using fio for 10 seconds and then monitoring who got how
much job done.

Following is my fio job file

****************************************************************
[global]
ioengine=sync
runtime=10
size=1G
rw=read
directory=/mnt/sdc/fio/
direct=1
bs=4K
exec_prerun="echo 3 > /proc/sys/vm/drop_caches"

[seqread0]
prio=0

[seqread4]
prio=4

[seqread7]
prio=7
************************************************************************

Following are the results of 4 runs. Every run lists three jobs of prio0,
prio4 and prio7 respectively.

First run
=========
read : io=75,996KB, bw=7,599KB/s, iops=1,899, runt= 10001msec
read : io=95,920KB, bw=9,591KB/s, iops=2,397, runt= 10001msec
read : io=21,068KB, bw=2,107KB/s, iops=526, runt= 10001msec

Second run
==========
read : io=103MB, bw=10,540KB/s, iops=2,635, runt= 10001msec
read : io=102MB, bw=10,479KB/s, iops=2,619, runt= 10001msec
read : io=720KB, bw=73,728B/s, iops=18, runt= 10000msec

Third Run
=========
read : io=103MB, bw=10,532KB/s, iops=2,632, runt= 10001msec
read : io=85,728KB, bw=8,572KB/s, iops=2,142, runt= 10001msec
read : io=19,696KB, bw=1,969KB/s, iops=492, runt= 10001msec

Fourth Run
==========
read : io=50,060KB, bw=5,005KB/s, iops=1,251, runt= 10001msec
read : io=102MB, bw=10,409KB/s, iops=2,602, runt= 10001msec
read : io=54,844KB, bw=5,484KB/s, iops=1,370, runt= 10001msec

I can't see fairness being provided to processes of diff prio levels. In
first run prio4 got more BW than prio0 process.

In second run prio 7 process got completely starved. Based on slice
calculation, the difference between prio 0 and prio 7 should be 180/40=4.5

Third run is still better.

In fourth run again prio 4 got double the BW of prio 0.

So I can't see how are you achieving fariness on NCQ SSD?

One more important thing to notice is that throughput of SSD has come down
significantly. If I just run one job then I get 73MB/s. With these tree
jobs running, we are achieving close to 19 MB/s. 

I think this is happening because of seeks happening almost after every
dispatch and that brings down the overall throughput. If we had idled
here, I think probably overall throughput would have been better.

Thanks
Vivek

next prev parent reply	other threads:[~2009-11-10 14:13 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-03 23:43 [RFC] Block IO Controller V1 Vivek Goyal
2009-11-03 23:43 ` [PATCH 01/20] blkio: Documentation Vivek Goyal
2009-11-04 13:37   ` Jeff Moyer
2009-11-04 17:21   ` Balbir Singh
2009-11-04 17:52     ` Vivek Goyal
2009-11-04 23:36       ` Balbir Singh
2009-11-03 23:43 ` [PATCH 02/20] blkio: Change CFQ to use CFS like queue time stamps Vivek Goyal
2009-11-04 14:30   ` Jeff Moyer
2009-11-04 16:37     ` Vivek Goyal
2009-11-04 17:59       ` Corrado Zoccolo
2009-11-04 18:54         ` Vivek Goyal
2009-11-05  2:44       ` Divyesh Shah
2009-11-05 14:39         ` Vivek Goyal
2009-11-04 21:18   ` Corrado Zoccolo
2009-11-04 22:25     ` Vivek Goyal
2009-11-05  8:36       ` Corrado Zoccolo
2009-11-04 23:22     ` Vivek Goyal
2009-11-05  8:27       ` Corrado Zoccolo
2009-11-05  0:05     ` Vivek Goyal
2009-11-06 22:22     ` [RFC] Workload type Vs Groups (Was: Re: [PATCH 02/20] blkio: Change CFQ to use CFS like queue time stamps) Vivek Goyal
2009-11-09 17:33       ` Nauman Rafique
2009-11-09 21:47       ` Corrado Zoccolo
2009-11-09 23:12         ` Vivek Goyal
2009-11-10 11:29           ` Corrado Zoccolo
2009-11-10 13:31             ` Vivek Goyal
2009-11-10 14:12               ` Vivek Goyal [this message]
2009-11-10 18:05                 ` Corrado Zoccolo
2009-11-10 19:15                   ` Vivek Goyal
2009-11-12  8:53                     ` Corrado Zoccolo
2009-11-11  0:48   ` [PATCH 02/20] blkio: Change CFQ to use CFS like queue time stamps Gui Jianfeng
2009-11-12 23:07     ` Vivek Goyal
2009-11-13  0:59       ` Gui Jianfeng
2009-11-13  1:24         ` Vivek Goyal
2009-11-13  2:05           ` Gui Jianfeng
2009-11-03 23:43 ` [PATCH 03/20] blkio: Introduce the notion of weights Vivek Goyal
2009-11-04 15:06   ` Jeff Moyer
2009-11-04 15:41     ` Vivek Goyal
2009-11-04 17:07       ` Divyesh Shah
2009-11-04 19:00         ` Vivek Goyal
2009-11-04 19:15       ` Jeff Moyer
2009-11-03 23:43 ` [PATCH 04/20] blkio: Introduce the notion of cfq entity Vivek Goyal
2009-11-03 23:43 ` [PATCH 05/20] blkio: Introduce the notion of cfq groups Vivek Goyal
2009-11-03 23:43 ` [PATCH 06/20] blkio: Introduce cgroup interface Vivek Goyal
2009-11-04 15:23   ` Jeff Moyer
2009-11-04 16:47     ` Vivek Goyal
2009-11-03 23:43 ` [PATCH 07/20] blkio: Provide capablity to enqueue/dequeue group entities Vivek Goyal
2009-11-04 15:34   ` Jeff Moyer
2009-11-04 16:54     ` Vivek Goyal
2009-11-03 23:43 ` [PATCH 08/20] blkio: Add support for dynamic creation of cfq_groups Vivek Goyal
2009-11-04 16:01   ` Jeff Moyer
2009-11-03 23:43 ` [PATCH 09/20] blkio: Porpogate blkio cgroup weight or ioprio class updation to cfq groups Vivek Goyal
2009-11-05  5:35   ` Gui Jianfeng
2009-11-05 14:42     ` Vivek Goyal
2009-11-03 23:43 ` [PATCH 10/20] blkio: Implement cfq group deletion and reference counting support Vivek Goyal
2009-11-04 18:44   ` Jeff Moyer
2009-11-04 19:00     ` Vivek Goyal
2009-11-03 23:43 ` [PATCH 11/20] blkio: Some CFQ debugging Aid Vivek Goyal
2009-11-04 18:52   ` Jeff Moyer
2009-11-04 19:12     ` Vivek Goyal
2009-11-04 19:25       ` Jeff Moyer
2009-11-05  3:10   ` Divyesh Shah
2009-11-05 14:42     ` Vivek Goyal
2009-11-06  0:56       ` Divyesh Shah
2009-11-03 23:43 ` [PATCH 12/20] blkio: Export disk time and sectors dispatched from cgroup interface Vivek Goyal
2009-11-03 23:43 ` [PATCH 13/20] blkio: Add a group dequeue interface in cgroup for debugging Vivek Goyal
2009-11-03 23:43 ` [PATCH 14/20] blkio: Do not allow request merging across cfq groups Vivek Goyal
2009-11-03 23:43 ` [PATCH 15/20] blkio: Take care of preemptions across groups Vivek Goyal
2009-11-04 19:00   ` Jeff Moyer
2009-11-04 19:27     ` Vivek Goyal
2009-11-04 19:30       ` Jeff Moyer
2009-11-06  7:55   ` Gui Jianfeng
2009-11-06 22:10     ` Vivek Goyal
2009-11-09  7:41       ` Gui Jianfeng
2009-11-03 23:43 ` [PATCH 16/20] blkio: do not select co-operating queues from different cfq groups Vivek Goyal
2009-11-03 23:43 ` [PATCH 17/20] blkio: Wait for queue to get backlogged before it expires Vivek Goyal
2009-11-03 23:43 ` [PATCH 18/20] blkio: arm idle timer even if think time is great then time slice left Vivek Goyal
2009-11-04 19:04   ` Jeff Moyer
2009-11-04 19:17     ` Vivek Goyal
2009-11-03 23:43 ` [PATCH 19/20] blkio: Arm slice timer even if there are requests in driver Vivek Goyal
2009-11-03 23:43 ` [PATCH 20/20] blkio: Drop the reference to queue once the task changes cgroup Vivek Goyal
2009-11-04 19:09   ` Jeff Moyer
2009-11-04 19:18     ` Vivek Goyal
2009-11-04  7:43 ` [RFC] Block IO Controller V1 Jens Axboe
2009-11-04 13:39   ` Vivek Goyal
2009-11-04 19:12 ` Jeff Moyer
2009-11-04 19:19   ` Vivek Goyal
2009-11-04 19:27     ` Jeff Moyer
2009-11-04 19:38       ` Vivek Goyal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20091110141246.GB1083@redhat.com \
    --to=vgoyal@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=czoccolo@gmail.com \
    --cc=dpshah@google.com \
    --cc=fernando@oss.ntt.co.jp \
    --cc=guijianfeng@cn.fujitsu.com \
    --cc=jens.axboe@oracle.com \
    --cc=jmoyer@redhat.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=m-ikeda@ds.jp.nec.com \
    --cc=nauman@google.com \
    --cc=riel@redhat.com \
    --cc=righi.andrea@gmail.com \
    --cc=ryov@valinux.co.jp \
    --cc=s-uchida@ap.jp.nec.com \
    --cc=taka@valinux.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.