Re: Block IO Controller V4

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-kernel@vger.kernel.org, jens.axboe@oracle.com,
	nauman@google.com, dpshah@google.com, lizf@cn.fujitsu.com,
	ryov@valinux.co.jp, fernando@oss.ntt.co.jp,
	s-uchida@ap.jp.nec.com, taka@valinux.co.jp, jmoyer@redhat.com,
	righi.andrea@gmail.com, m-ikeda@ds.jp.nec.com,
	czoccolo@gmail.com, Alan.Brunelle@hp.com
Subject: Re: Block IO Controller V4
Date: Mon, 07 Dec 2009 09:35:05 +0800	[thread overview]
Message-ID: <4B1C5BC9.3010001@cn.fujitsu.com> (raw)
In-Reply-To: <20091203143641.GA3887@redhat.com>

Vivek Goyal wrote:
> On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> Hi Jens,
>>>>>
>>>>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
>>>>> of block tree.
>>>>>
>>>>> A consolidated patch can be found here:
>>>>>
>>>>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>>>>>
>>>> Hi Vivek,
>>>>
>>>> It seems this version doesn't work very well for "direct(O_DIRECT) sequence read" mode.
>>>> For example, you can create group A and group B, then assign weight 100 to group A and
>>>> weight 400 to group B, and you run "direct sequence read" workload in group A and B 
>>>> simultaneously. Ideally, we should see 1:4 disk time differentiation for group A and B. 
>>>> But actually, I see almost 1:2 disk time differentiation for group A and B. I'm looking
>>>> into this issue.
>>>> BTW, V3 works well for this case.
>>> Hi Gui,
>>>
>>> In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seems to
>>> be working fine.
>>>
>>> http://lkml.org/lkml/2009/12/1/367
>>>
>>> I suspect that in some case we choose not to idle on the group and it gets
>>> deleted from service tree hence we loose share. Can you have a look at
>>> blkio.dequeue files. If there are excessive deletions, that will signify
>>> that we are loosing share because we chose not to idle.
>>>
>>> If yes, please also run blktrace to see in what cases we chose not to
>>> idle.
>>>
>>> In V3, I had a stronger check to idle on the group if it is empty using
>>> wait_busy() function. In V4 I have removed that and trying to wait busy
>>> on a queue by extending its slice if it has consumed its allocated slice.
>> Hi Vivek,
>>
>> I ckecked the blktrace output, it seems that io group was deleted all the time,
>> because we don't have group idle any more. I pulled the wait_busy code back to
>> V4, and retest it, problem seems disappeared.
>>
>> So i suggest that we need to retain the wait_busy code.
> 
> Hi Gui,
> 
> We need to figure out why the existing code is not working on your system.
> In V4, I introduced the functionality to extend the slice by slice_idle
> so that we will arm slice idle timer and wait for new request to come in
> and then expire the queue. Following is the code to extend the slice.
> 
>                 /*
>                  * If this queue consumed its slice and this is last queue
>                  * in the group, wait for next request before we expire
>                  * the queue
>                  */
>                 if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
>                         cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
>                         cfq_mark_cfqq_wait_busy(cfqq);
>                 }
> 
> One loop hole I see is that, I extend the slice only if current slice has
> been used. If if we on the boundary and slice has not been used yet, then
> I will not extend the slice. We also might not arm the timer thinking that
> remaining slice is less than think time of process and that can lead to
> expiry of queue. To rule out this possibility, can you remove following
> code in arm_slice_timer() and try it again.
> 
>         /*
>          * If our average think time is larger than the remaining time
>          * slice, then don't idle. This avoids overrunning the allotted
>          * time slice.
>          */
>         if (sample_valid(cic->ttime_samples) &&
>             (cfqq->slice_end - jiffies < cic->ttime_mean))
>                 return;
> 
> The other possiblity is that at the request completion time slice has not
> expired hence we don't extend the slice and arm the timer. But then
> select_queue() hits and by that time slice has expired and we expire the
> queue. I thought this will not happen very frequently.
> 
> Can you figure out what is happening on your system. Why we are not doing
> wait busy on the queue/group (new queue wait_busy and wait_busy_done
> flags) and instead expiring the queue and hence group.

Hi Vivek,

Sorry for the late reply.
In V4, we don't have wait_busy() in select_queue(), so if there isn't any 
request on this queue and no cooperator queue available, this queue will
expire immediately. We don't have a chance to get that queue backlogged
again. So group will get removed frequently.


> You can send your blktrace logs to me also. I can also try figuring out
> what is happening.

I think here is the most significant part of blktrace output for this issue.

  8,16   0     4024     0.642072068  3924  Q   R 320708977 + 8 [rwio]
  8,16   0     4025     0.642078523  3924  G   R 320708977 + 8 [rwio]
  8,16   0     4026     0.642082632  3924  I   R 320708977 + 8 [rwio]
  8,16   0        0     0.642084075     0  m   N cfq3924S /test1 insert_request
  8,16   0        0     0.642087062     0  m   N cfq3924S /test1 dispatch_insert
  8,16   0        0     0.642088250     0  m   N cfq3924S /test1 dispatched a request
  8,16   0        0     0.642089242     0  m   N cfq3924S /test1 activate rq, drv=1
  8,16   0     4027     0.642089573  3924  D   R 320708977 + 8 [rwio]
  8,16   0        0     0.642185679     0  m   N cfq3924S /test1 slice expired t=0   <= I think this happens in select_queue()
  8,16   0        0     0.642187132     0  m   N cfq3924S /test1 sl_used=60 sect=2056
  8,16   0        0     0.642189007     0  m   N /test1 served: vt=276536888 min_vt=275308088
  8,16   0        0     0.642190265     0  m   N cfq3924S /test1 del_from_rr
  8,16   0        0     0.642190941     0  m   N /test1 del_from_rr group
  8,16   0        0     0.642192600     0  m   N cfq3925S /test2 set_active
  8,16   0        0     0.642194414     0  m   N cfq3925S /test2 fifo=(null)
  8,16   0        0     0.642195296     0  m   N cfq3925S /test2 dispatch_insert
  8,16   0        0     0.642196709     0  m   N cfq3925S /test2 dispatched a request
  8,16   0        0     0.642197737     0  m   N cfq3925S /test2 activate rq, drv=2
  8,16   0     4028     0.642198102  3924  D   R 324900545 + 8 [rwio]
  8,16   0     4029     0.642204612  3924  U   N [rwio] 2



> 
> Thanks
> Vivek
> 
> 
> 

-- 
Regards
Gui Jianfeng

next prev parent reply	other threads:[~2009-12-07  1:39 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-30  2:59 Block IO Controller V4 Vivek Goyal
2009-11-30  2:59 ` [PATCH 01/21] blkio: Set must_dispatch only if we decided to not dispatch the request Vivek Goyal
2009-12-02 14:06   ` Jeff Moyer
2009-11-30  2:59 ` [PATCH 02/21] blkio: Introduce the notion of cfq groups Vivek Goyal
2009-11-30  2:59 ` [PATCH 03/21] blkio: Implement macro to traverse each idle tree in group Vivek Goyal
2009-11-30 20:13   ` Divyesh Shah
2009-11-30 22:24     ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 04/21] blkio: Keep queue on service tree until we expire it Vivek Goyal
2009-11-30  2:59 ` [PATCH 05/21] blkio: Introduce the root service tree for cfq groups Vivek Goyal
2009-11-30 23:55   ` Divyesh Shah
2009-12-02 15:42     ` Vivek Goyal
2009-12-02 15:49   ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 06/21] blkio: Introduce blkio controller cgroup interface Vivek Goyal
2009-12-01  0:04   ` Divyesh Shah
2009-12-02 15:27     ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 07/21] blkio: Introduce per cfq group weights and vdisktime calculations Vivek Goyal
2009-12-02 15:50   ` Vivek Goyal
2009-11-30  2:59 ` [PATCH 08/21] blkio: Implement per cfq group latency target and busy queue avg Vivek Goyal
2009-11-30  2:59 ` [PATCH 09/21] blkio: Group time used accounting and workload context save restore Vivek Goyal
2009-11-30  2:59 ` [PATCH 10/21] blkio: Dynamic cfq group creation based on cgroup tasks belongs to Vivek Goyal
2009-11-30  2:59 ` [PATCH 11/21] blkio: Take care of cgroup deletion and cfq group reference counting Vivek Goyal
2009-11-30  2:59 ` [PATCH 12/21] blkio: Some debugging aids for CFQ Vivek Goyal
2009-11-30  2:59 ` [PATCH 13/21] blkio: Export disk time and sectors used by a group to user space Vivek Goyal
2009-11-30  2:59 ` [PATCH 14/21] blkio: Provide some isolation between groups Vivek Goyal
2009-11-30  2:59 ` [PATCH 15/21] blkio: Drop the reference to queue once the task changes cgroup Vivek Goyal
2009-11-30  2:59 ` [PATCH 16/21] blkio: Propagate cgroup weight updation to cfq groups Vivek Goyal
2009-11-30  2:59 ` [PATCH 17/21] blkio: Wait for cfq queue to get backlogged if group is empty Vivek Goyal
2009-11-30  2:59 ` [PATCH 18/21] blkio: Determine async workload length based on total number of queues Vivek Goyal
2009-11-30  2:59 ` [PATCH 19/21] blkio: Implement group_isolation tunable Vivek Goyal
2009-11-30  2:59 ` [PATCH 20/21] blkio: Wait on sync-noidle queue even if rq_noidle = 1 Vivek Goyal
2009-11-30  2:59 ` [PATCH 21/21] blkio: Documentation Vivek Goyal
2009-11-30 15:34 ` Block IO Controller V4 Corrado Zoccolo
2009-11-30 16:00   ` Vivek Goyal
2009-11-30 21:34     ` Corrado Zoccolo
2009-11-30 21:58       ` Vivek Goyal
2009-11-30 22:00       ` Alan D. Brunelle
2009-11-30 22:56         ` Vivek Goyal
2009-11-30 23:50           ` Alan D. Brunelle
2009-12-02 19:12             ` Vivek Goyal
2009-12-08 15:17           ` Alan D. Brunelle
2009-12-08 16:32             ` Vivek Goyal
2009-12-08 18:05               ` Alan D. Brunelle
2009-12-10  3:44                 ` Vivek Goyal
2009-12-01 22:27 ` Vivek Goyal
2009-12-02  1:51 ` Gui Jianfeng
2009-12-02 14:25   ` Vivek Goyal
2009-12-03  8:41     ` Gui Jianfeng
2009-12-03 14:36       ` Vivek Goyal
2009-12-03 18:10         ` Vivek Goyal
2009-12-03 23:51           ` Vivek Goyal
2009-12-07  8:45             ` Gui Jianfeng
2009-12-07 15:25               ` Vivek Goyal
2009-12-07  1:35         ` Gui Jianfeng [this message]
2009-12-07  8:41           ` Gui Jianfeng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4B1C5BC9.3010001@cn.fujitsu.com \
    --to=guijianfeng@cn.fujitsu.com \
    --cc=Alan.Brunelle@hp.com \
    --cc=czoccolo@gmail.com \
    --cc=dpshah@google.com \
    --cc=fernando@oss.ntt.co.jp \
    --cc=jens.axboe@oracle.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=m-ikeda@ds.jp.nec.com \
    --cc=nauman@google.com \
    --cc=righi.andrea@gmail.com \
    --cc=ryov@valinux.co.jp \
    --cc=s-uchida@ap.jp.nec.com \
    --cc=taka@valinux.co.jp \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox