Re: [PATCH 01/10] Documentation

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>,
	nauman@google.com, dpshah@google.com, lizf@cn.fujitsu.com,
	mikew@google.com, fchecconi@gmail.com, paolo.valente@unimore.it,
	jens.axboe@oracle.com, ryov@valinux.co.jp,
	fernando@intellilink.co.jp, s-uchida@ap.jp.nec.com,
	taka@valinux.co.jp, arozansk@redhat.com, jmoyer@redhat.com,
	oz-kernel@redhat.com, balbir@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org,
	containers@lists.linux-foundation.org, akpm@linux-foundation.org,
	menage@google.com, peterz@infradead.org
Subject: Re: [PATCH 01/10] Documentation
Date: Thu, 19 Mar 2009 11:38:06 +0800	[thread overview]
Message-ID: <49C1BE1E.9060707@cn.fujitsu.com> (raw)
In-Reply-To: <20090318215529.GA3338@redhat.com>

Vivek Goyal wrote:
> On Wed, Mar 18, 2009 at 03:23:29PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>>> Hi Vivek,
>>>>
>>>> I would be interested in knowing if these are the results expected?
>>>>
>>> Hi Dhaval, 
>>>
>>> Good question. Keeping current expectation in mind, yes these are expected
>>> results. To begin with, current expectations are that try to emulate
>>> cfq behavior and the kind of service differentiation we get between
>>> threads of different priority, same kind of service differentiation we
>>> should get from different cgroups.
>>>  
>>> Having said that, in theory a more accurate estimate should be amount 
>>> of actual disk time a queue/cgroup got. I have put a tracing message
>>> to keep track of total service received by a queue. If you run "blktrace"
>>> then you can see that. Ideally, total service received by two threads
>>> over a period of time should be in same proportion as their cgroup
>>> weights.
>>>
>>> It will not be easy to achive it given the constraints we have got in
>>> terms of how to accurately we can account for disk time actually used by a
>>> queue in certain situations. So to begin with I am targetting that
>>> try to meet same kind of service differentation between cgroups as
>>> cfq provides between threads and then slowly refine it to see how
>>> close one can come to get accurate numbers in terms of "total_serivce"
>>> received by each queue.
>>   Hi Vivek,
>>
>>   I simply tested with blktrace opened. I create two groups and set ioprio
>>   4 and 7 respectively(the corresponding weight should 4:1, right?),
> 
> Hi Gui,
> 
> Thanks for testing. You are right about weight proportions.
> 
>> and 
>>   start two dd concurrently. UUIC, Ideally, the proportion of service two 
>>   dd got should be 4:1 in a period of time when they are running. I extract 
>>   *served* value from blktrace output and sum them up. I found the proportion 
>>   of the sum of *served* value is about 1.7:1
>>   Am i missing something?
> 
> Actually getting the service proportion in same ratio as weight proportion
> is quite hard for sync queues. The biggest issue is that many a times sync
> queues are not continuously backlogged and they do some IO and then dispatch
> a next round of requests.
> 
> Most of the time idling seems to be the solution for giving an impression
> that sync queue is continuously backlogged but it also has potential to
> reduce throughput on faster hardware.
> 
> Anyway, can you please send me your complete blkparse output. There are
> many a places where code has been designed to favor throughput than
> fairness. Looking at your blkparse output, will give me better idea what's
> the issue in your setup.
> 
> Also please try the attached patch. I have experimented with waiting for
> new request to come before sync queue is expired. It helps me in getting
> the fairness numbers at least with noop on non-queueing rotational media.
> 
> I also have introduced a new tunable "fairness". Above code will kick in
> only if this variable is set to 1. Many a places where we favor throughput
> over fairness, I plan to use this variable as condition to let user
> decide whether to choose fairness over throughput. I am not sure at how many
> places it really makes sense, but it atleast gives us something to play and
> compare the throughput in two cases.
> 
> This patch applies on my current tree after removing tomost patceh
> "anticipatory scheduling changes". My code has changed a bit since the
> posting, so you might have to message this patch a bit.

  Hi Vivek,

  This time I run two dd with pure sync read like this:
  dd if=/mnt/500M.1 of=/dev/null, and the proportion of service got by each
  is very close to the proportion of their weight.
  Previously, I run concurrent dd like this:
  dd if=/mnt/500M.1 of=/mnt/500M.2

  I'd like to try this patch out.

> 
> Thanks
> Vivek
> 
> 
> DESC
> io-controller: idle for sometime on sync queue before expiring it
> EDESC 
> 
> o When a sync queue expires, in many cases it might be empty and then
>   it will be deleted from the active tree. This will lead to a scenario
>   where out of two competing queues, only one is on the tree and when a
>   new queue is selected, vtime jump takes place and we don't see services
>   provided in proportion to weight.
> 
> o In general this is a fundamental problem with fairness of sync queues
>   where queues are not continuously backlogged. Looks like idling is
>   only solution to make sure such kind of queues can get some decent amount
>   of disk bandwidth in the face of competion from continusouly backlogged
>   queues. But excessive idling has potential to reduce performance on SSD
>   and disks with commnad queuing.
> 
> o This patch experiments with waiting for next request to come before a
>   queue is expired after it has consumed its time slice. This can ensure
>   more accurate fairness numbers in some cases.
> 
> o Introduced a tunable "fairness". If set, io-controller will put more
>   focus on getting fairness right than getting throughput right. 
> 
> 
> ---
>  block/blk-sysfs.c   |    7 ++++
>  block/elevator-fq.c |   85 +++++++++++++++++++++++++++++++++++++++++++++-------
>  block/elevator-fq.h |   12 +++++++
>  3 files changed, 94 insertions(+), 10 deletions(-)
> 
> Index: linux1/block/elevator-fq.h
> ===================================================================
> --- linux1.orig/block/elevator-fq.h	2009-03-18 17:34:46.000000000 -0400
> +++ linux1/block/elevator-fq.h	2009-03-18 17:34:53.000000000 -0400
> @@ -318,6 +318,13 @@ struct elv_fq_data {
>  	unsigned long long rate_sampling_start; /*sampling window start jifies*/
>  	/* number of sectors finished io during current sampling window */
>  	unsigned long rate_sectors_current;
> +
> +	/*
> +	 * If set to 1, will disable many optimizations done for boost
> +	 * throughput and focus more on providing fairness for sync
> +	 * queues.
> +	 */
> +	int fairness;
>  };
>  
>  extern int elv_slice_idle;
> @@ -340,6 +347,7 @@ enum elv_queue_state_flags {
>  	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
>  	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
>  	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
> +	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
>  	ELV_QUEUE_FLAG_NR,
>  };
>  
> @@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
>  ELV_IO_QUEUE_FLAG_FNS(wait_request)
>  ELV_IO_QUEUE_FLAG_FNS(idle_window)
>  ELV_IO_QUEUE_FLAG_FNS(slice_new)
> +ELV_IO_QUEUE_FLAG_FNS(wait_busy)
>  
>  static inline struct io_service_tree *
>  io_entity_service_tree(struct io_entity *entity)
> @@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
>  extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
>  extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
>  						size_t count);
> +extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
> +extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +						size_t count);
>  
>  /* Functions used by elevator.c */
>  extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
> Index: linux1/block/elevator-fq.c
> ===================================================================
> --- linux1.orig/block/elevator-fq.c	2009-03-18 17:34:46.000000000 -0400
> +++ linux1/block/elevator-fq.c	2009-03-18 17:34:53.000000000 -0400
> @@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
>  			ioq->total_service);
>  }
>  
> +/* Functions to show and store fairness value through sysfs */
> +ssize_t elv_fairness_show(struct request_queue *q, char *name)
> +{
> +	struct elv_fq_data *efqd;
> +	unsigned int data;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	efqd = &q->elevator->efqd;
> +	data = efqd->fairness;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +	return sprintf(name, "%d\n", data);
> +}
> +
> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +			  size_t count)
> +{
> +	struct elv_fq_data *efqd;
> +	unsigned int data;
> +	unsigned long flags;
> +
> +	char *p = (char *)name;
> +
> +	data = simple_strtoul(p, &p, 10);
> +
> +	if (data < 0)
> +		data = 0;
> +	else if (data > INT_MAX)
> +		data = INT_MAX;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	efqd = &q->elevator->efqd;
> +	efqd->fairness = data;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +	return count;
> +}
> +
>  /* Functions to show and store elv_idle_slice value through sysfs */
>  ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
>  {
> @@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
>  	assert_spin_locked(q->queue_lock);
>  	elv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
>  
> -	if (elv_ioq_wait_request(ioq))
> +	if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
>  		del_timer(&efqd->idle_slice_timer);
>  
>  	elv_clear_ioq_wait_request(ioq);
> +	elv_clear_ioq_wait_busy(ioq);
>  
>  	/*
>  	 * if ioq->slice_end = 0, that means a queue was expired before first
> @@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
>  		 * immediately and flag that we must not expire this queue
>  		 * just now
>  		 */
> -		if (elv_ioq_wait_request(ioq)) {
> +		if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
>  			del_timer(&efqd->idle_slice_timer);
> +			elv_clear_ioq_wait_busy(ioq);
>  			blk_start_queueing(q);
>  		}
>  	} else if (elv_should_preempt(q, ioq, rq)) {
> @@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long 
>  
>  	if (ioq) {
>  
> +		if (elv_ioq_wait_busy(ioq))
> +			goto expire;
> +
>  		/*
>  		 * expired
>  		 */
> @@ -2546,7 +2589,7 @@ out_cont:
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  
> -void elv_ioq_arm_slice_timer(struct request_queue *q)
> +void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
>  {
>  	struct elv_fq_data *efqd = &q->elevator->efqd;
>  	struct io_queue *ioq = elv_active_ioq(q->elevator);
> @@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
>  		return;
>  
>  	/*
> -	 * still requests with the driver, don't idle
> +	 * idle is disabled, either manually or by past process history
>  	 */
> -	if (efqd->rq_in_driver)
> +	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>  		return;
>  
>  	/*
> -	 * idle is disabled, either manually or by past process history
> +	 * This queue has consumed its time slice. We are waiting only for
> +	 * it to become busy before we select next queue for dispatch.
>  	 */
> -	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
> +	if (efqd->fairness && wait_for_busy) {
> +		elv_mark_ioq_wait_busy(ioq);
> +		sl = efqd->elv_slice_idle;
> +		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
> +		elv_log(efqd, "arm idle: %lu wait busy=1", sl);
> +		return;
> +	}
> +
> +	/*
> +	 * still requests with the driver, don't idle
> +	 */
> +	if (efqd->rq_in_driver)
>  		return;
>  
>  	/*
> @@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
>  		}
>  	}
>  
> +	/* We are waiting for this queue to become busy before it expires.*/
> +	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
> +		ioq = NULL;
> +		goto keep_queue;
> +	}
> +
>  	/*
>  	 * The active queue has run out of time, expire it and select new.
>  	 */
> @@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
>  			elv_ioq_set_prio_slice(q, ioq);
>  			elv_clear_ioq_slice_new(ioq);
>  		}
> -		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
> +		if (elv_ioq_class_idle(ioq))
>  			elv_ioq_slice_expired(q, 1);
> -		else if (sync && !ioq->nr_queued)
> -			elv_ioq_arm_slice_timer(q);
> +		else if (sync && !ioq->nr_queued) {
> +			if (elv_ioq_slice_used(ioq))
> +				elv_ioq_arm_slice_timer(q, 1);
> +			else
> +				elv_ioq_arm_slice_timer(q, 0);
> +		}
>  	}
>  
>  	if (!efqd->rq_in_driver)
> Index: linux1/block/blk-sysfs.c
> ===================================================================
> --- linux1.orig/block/blk-sysfs.c	2009-03-18 17:34:28.000000000 -0400
> +++ linux1/block/blk-sysfs.c	2009-03-18 17:34:53.000000000 -0400
> @@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
>  	.show = elv_slice_idle_show,
>  	.store = elv_slice_idle_store,
>  };
> +
> +static struct queue_sysfs_entry queue_fairness_entry = {
> +	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> +	.show = elv_fairness_show,
> +	.store = elv_fairness_store,
> +};
>  #endif
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
> @@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
>  	&queue_iostats_entry.attr,
>  #ifdef CONFIG_ELV_FAIR_QUEUING
>  	&queue_slice_idle_entry.attr,
> +	&queue_fairness_entry.attr,
>  #endif
>  	NULL,
>  };
> 
> 
> 

-- 
Regards
Gui Jianfeng

next prev parent reply	other threads:[~2009-03-19  3:40 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-03-12  1:56 [RFC] IO Controller Vivek Goyal
2009-03-12  1:56 ` [PATCH 01/10] Documentation Vivek Goyal
2009-03-12  7:11   ` Andrew Morton
2009-03-12 10:07     ` Ryo Tsuruta
2009-03-12 18:01     ` Vivek Goyal
2009-03-16  8:40       ` Ryo Tsuruta
2009-03-16 13:39         ` Vivek Goyal
2009-04-05 15:15       ` Andrea Righi
2009-04-06  6:50         ` Nauman Rafique
2009-04-07  6:40         ` Vivek Goyal
2009-04-08 20:37           ` Andrea Righi
2009-04-16 18:37             ` Vivek Goyal
2009-04-17  5:35               ` Dhaval Giani
2009-04-17 13:49                 ` IO Controller discussion (Was: Re: [PATCH 01/10] Documentation) Vivek Goyal
2009-04-17  9:37               ` [PATCH 01/10] Documentation Andrea Righi
2009-04-17 14:13                 ` IO controller discussion (Was: Re: [PATCH 01/10] Documentation) Vivek Goyal
2009-04-17 18:09                   ` Nauman Rafique
2009-04-18  8:13                     ` Andrea Righi
2009-04-19 12:59                     ` Vivek Goyal
2009-04-19 13:08                     ` Vivek Goyal
2009-04-17 22:38                   ` Andrea Righi
2009-04-19 13:21                     ` Vivek Goyal
2009-04-18 13:19                   ` Balbir Singh
2009-04-19 13:45                     ` Vivek Goyal
2009-04-19 15:53                       ` Andrea Righi
2009-04-21  1:16                         ` KAMEZAWA Hiroyuki
2009-04-19  4:35                   ` Nauman Rafique
2009-03-12  7:45   ` [PATCH 01/10] Documentation Yang Hongyang
2009-03-12 13:51     ` Vivek Goyal
2009-03-12 10:00   ` Dhaval Giani
2009-03-12 14:04     ` Vivek Goyal
2009-03-12 14:48       ` Fabio Checconi
2009-03-12 15:03         ` Vivek Goyal
2009-03-18  7:23       ` Gui Jianfeng
2009-03-18 21:55         ` Vivek Goyal
2009-03-19  3:38           ` Gui Jianfeng [this message]
2009-03-24  5:32           ` Nauman Rafique
2009-03-24 12:58             ` Vivek Goyal
2009-03-24 18:14               ` Nauman Rafique
2009-03-24 18:29                 ` Vivek Goyal
2009-03-24 18:41                   ` Fabio Checconi
2009-03-24 18:35                     ` Vivek Goyal
2009-03-24 18:49                       ` Nauman Rafique
2009-03-24 19:04                       ` Fabio Checconi
2009-03-12 10:24   ` Peter Zijlstra
2009-03-12 14:09     ` Vivek Goyal
2009-04-06 14:35   ` Balbir Singh
2009-04-06 22:00     ` Nauman Rafique
2009-04-07  5:59     ` Gui Jianfeng
2009-04-13 13:40     ` Vivek Goyal
2009-05-01 22:04       ` IKEDA, Munehiro
2009-05-01 22:45         ` IO Controller per cgroup request descriptors (Re: [PATCH 01/10] Documentation) Vivek Goyal
2009-05-01 23:39           ` Nauman Rafique
2009-05-04 17:18             ` IKEDA, Munehiro
2009-03-12  1:56 ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-03-19  6:27   ` Gui Jianfeng
2009-03-27  8:30   ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
2009-03-27 13:52     ` Vivek Goyal
2009-04-02  4:06   ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
2009-04-02 13:52     ` Vivek Goyal
2009-03-12  1:56 ` [PATCH 03/10] Modify cfq to make use of flat elevator fair queuing Vivek Goyal
2009-03-12  1:56 ` [PATCH 04/10] Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
2009-03-12  1:56 ` [PATCH 05/10] cfq changes to use " Vivek Goyal
2009-04-16  5:25   ` [PATCH] IO-Controller: Fix kernel panic after moving a task Gui Jianfeng
2009-04-16 19:15     ` Vivek Goyal
2009-03-12  1:56 ` [PATCH 06/10] Separate out queue and data Vivek Goyal
2009-03-12  1:56 ` [PATCH 07/10] Prepare elevator layer for single queue schedulers Vivek Goyal
2009-03-12  1:56 ` [PATCH 08/10] noop changes for hierarchical fair queuing Vivek Goyal
2009-03-12  1:56 ` [PATCH 09/10] deadline " Vivek Goyal
2009-03-12  1:56 ` [PATCH 10/10] anticipatory " Vivek Goyal
2009-03-27  6:58   ` [PATCH] IO Controller: No need to stop idling in as Gui Jianfeng
2009-03-27 14:05     ` Vivek Goyal
2009-03-30  1:09       ` Gui Jianfeng
2009-03-12  3:27 ` [RFC] IO Controller Takuya Yoshikawa
2009-03-12  6:40   ` anqin
2009-03-12  6:55     ` Li Zefan
2009-03-12  7:11       ` anqin
2009-03-12 14:57         ` Vivek Goyal
2009-03-12 13:46     ` Vivek Goyal
2009-03-12 13:43   ` Vivek Goyal
2009-04-02  6:39 ` Gui Jianfeng
2009-04-02 14:00   ` Vivek Goyal
2009-04-07  1:40     ` Gui Jianfeng
2009-04-07  6:40       ` Gui Jianfeng
2009-04-10  9:33 ` Gui Jianfeng
2009-04-10 17:49   ` Nauman Rafique
2009-04-13 13:09   ` Vivek Goyal
2009-04-22  3:04     ` Gui Jianfeng
2009-04-22  3:10       ` Nauman Rafique
2009-04-22 13:23       ` Vivek Goyal
2009-04-30 19:38         ` Nauman Rafique
2009-05-05  3:18           ` Gui Jianfeng
2009-05-01  1:25 ` Divyesh Shah
2009-05-01  2:45   ` Vivek Goyal
2009-05-01  3:00     ` Divyesh Shah

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49C1BE1E.9060707@cn.fujitsu.com \
    --to=guijianfeng@cn.fujitsu.com \
    --cc=akpm@linux-foundation.org \
    --cc=arozansk@redhat.com \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=containers@lists.linux-foundation.org \
    --cc=dhaval@linux.vnet.ibm.com \
    --cc=dpshah@google.com \
    --cc=fchecconi@gmail.com \
    --cc=fernando@intellilink.co.jp \
    --cc=jens.axboe@oracle.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=menage@google.com \
    --cc=mikew@google.com \
    --cc=nauman@google.com \
    --cc=oz-kernel@redhat.com \
    --cc=paolo.valente@unimore.it \
    --cc=peterz@infradead.org \
    --cc=ryov@valinux.co.jp \
    --cc=s-uchida@ap.jp.nec.com \
    --cc=taka@valinux.co.jp \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).