From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754256Ab1AXSvw (ORCPT );
	Mon, 24 Jan 2011 13:51:52 -0500
Received: from mx1.redhat.com ([209.132.183.28]:14957 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754154Ab1AXSvt (ORCPT );
	Mon, 24 Jan 2011 13:51:49 -0500
Date: Mon, 24 Jan 2011 13:51:33 -0500
From: Vivek Goyal
To: Gui Jianfeng
Cc: Jens Axboe, linux kernel mailing list, Corrado Zoccolo,
	Chad Talbott, Nauman Rafique, Divyesh Shah,
	jmoyer@redhat.com, Shaohua Li
Subject: Re: [PATCH 3/6 v3] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
Message-ID: <20110124185133.GD9420@redhat.com>
References: <4D180ECE.4000305@cn.fujitsu.com>
	<4D185374.6090300@cn.fujitsu.com>
	<20110119225855.GH13887@redhat.com>
	<4D37B2DA.1060908@cn.fujitsu.com>
	<20110120110952.GA16184@redhat.com>
	<4D3D03DE.20604@cn.fujitsu.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4D3D03DE.20604@cn.fujitsu.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jan 24, 2011 at 12:45:18PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Thu, Jan 20, 2011 at 11:58:18AM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> On Mon, Dec 27, 2010 at 04:51:00PM +0800, Gui Jianfeng wrote:
> >>>> Introduce vdisktime and io weight for CFQ queue scheduling. Currently, io priority
> >>>> maps to a range [100,1000]. It also gets rid of the cfq_slice_offset() logic and
> >>>> makes use of the same scheduling algorithm as CFQ groups do. This helps with
> >>>> scheduling CFQ queues and groups on the same service tree.
> >>>>
> >>>> Signed-off-by: Gui Jianfeng
> >>> [..]
> >>>> @@ -1246,47 +1278,71 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> >>>>
> >>>>  	service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
> >>>>  					cfqq_type(cfqq));
> >>>> +	/*
> >>>> +	 * For the time being, put the newly added CFQ queue at the end of the
> >>>> +	 * service tree.
> >>>> +	 */
> >>>> +	if (RB_EMPTY_NODE(&cfqe->rb_node)) {
> >>>> +		/*
> >>>> +		 * If this CFQ queue moves to another group, the original
> >>>> +		 * vdisktime makes no sense any more, reset the vdisktime
> >>>> +		 * here.
> >>>> +		 */
> >>>> +		parent = rb_last(&service_tree->rb);
> >>>> +		if (parent) {
> >>>> +			u64 boost;
> >>>> +			s64 __vdisktime;
> >>>> +
> >>>> +			__cfqe = rb_entry_entity(parent);
> >>>> +			cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
> >>>> +
> >>>> +			/* Give some vdisktime boost according to its weight */
> >>>> +			boost = cfq_get_boost(cfqd, cfqe);
> >>>> +			__vdisktime = cfqe->vdisktime - boost;
> >>>> +			if (__vdisktime > service_tree->min_vdisktime)
> >>>> +				cfqe->vdisktime = __vdisktime;
> >>>> +			else
> >>>> +				cfqe->vdisktime = service_tree->min_vdisktime;
> >>>> +		} else
> >>>> +			cfqe->vdisktime = service_tree->min_vdisktime;
> >>> Hi Gui,
> >>>
> >>> Is the above logic actually working? I suspect that most of the time all
> >>> the new queues will end up getting the same vdisktime, namely
> >>> st->min_vdisktime, and there will be no service differentiation across the
> >>> various prio levels.
> >>>
> >>> The reason is that on SSD, idling is disabled. So very few (or no) queues
> >>> will consume their slice and follow the requeue path, and every queue will
> >>> be new. Now you are doing the following:
> >>>
> >>> cfqe->vdisktime = vdisktime_of_parent + IDLE_DELAY - boost
> >>>
> >>> Assume vdisktime_of_parent == st->min_vdisktime; then (IDLE_DELAY - boost)
> >>> is always going to be a negative number and hence cfqe->vdisktime will
> >>> default to st->min_vdisktime. (IDLE_DELAY=200 while boost should be a huge
> >>> value due to the SERVICE_SHIFT thing.)
> >> Vivek,
> >>
> >> Actually, I tested on a rotational disk with idling disabled, and I saw
> >> service differentiation between two tasks with different ioprio.
> >> I don't have an SSD on hand, but I'll get one and do more tests.
> >
> > Ok, I am really not sure how it worked for you, because boost will turn
> > out to be a huge number in comparison to IDLE_DELAY or the vdisktime of
> > the parent. If you have some blktrace data, or if you can explain how it
> > is working, that will help in understanding it better.
>
> Vivek,
>
> I think something must have been misconfigured in my previous tests.
> I tested again with the patch applied, and this time I don't see service
> differentiation.
> But when I ran the same test on a vanilla kernel, I didn't see service
> differentiation either. I dug into it. It seems that when I set slice_idle
> and group_idle to zero, I am closing the idle window manually. In this case
> the IO tasks become SYNC NOIDLE workloads and they preempt each other
> mutually, so the cfqq is always put at the front of the service tree. That
> is why we don't see service differentiation.
>
> Actually, we don't want preemption to happen in such a case, right? So I
> think we should refine the preemption logic.
> Thoughts?

Gui,

That preemption only happens when we are idling on an existing sync-noidle
queue and a new queue gets backlogged. It makes sense from an overall
throughput perspective, so I think getting rid of it is not a good idea. But
agreed that it does not play well with the ioprio thing.

I would suggest that instead of two threads, you try 4-5 threads doing IO and
keep the queue depth at 1. That way you should end up in a case where more
than one thread is backlogged on the sync-noidle service tree.
Thanks
Vivek

>
> Below is my fio job file:
>
> [global]
> directory=/hdb3
> ioengine=sync
> runtime=30
> time_based=1
> direct=1
> bs=128K
> size=64M
> numjobs=1
> exec_prerun='echo 3 > /proc/sys/vm/drop_caches'
>
> [task-prio-0]
> description=task-prio-0
> rw=read
> prio=0
> filename=file1
>
> [task-prio-7]
> description=task-prio-7
> rw=read
> prio=7
> filename=file2
>
> Thanks,
> Gui
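[Editor's note: one way to realize Vivek's 4-5 thread suggestion with Gui's job file is a tweak along the following lines. This is an untested sketch: numjobs values are illustrative, and the sync ioengine already keeps the per-thread queue depth at 1.]

```ini
; Bump numjobs so several readers are backlogged on the sync-noidle
; service tree at the same time; everything else as in Gui's job file.
[task-prio-0]
description=task-prio-0
rw=read
prio=0
numjobs=5
filename=file1

[task-prio-7]
description=task-prio-7
rw=read
prio=7
numjobs=5
filename=file2
```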