From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1760194AbcAUVBD (ORCPT <rfc822;w@1wt.eu>);
	Thu, 21 Jan 2016 16:01:03 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:10151 "EHLO
	mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1750939AbcAUVA7 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 21 Jan 2016 16:00:59 -0500
Date: Thu, 21 Jan 2016 13:00:44 -0800
From: Shaohua Li <shli@fb.com>
To: Vivek Goyal <vgoyal@redhat.com>
CC: <linux-kernel@vger.kernel.org>, <axboe@kernel.dk>, <tj@kernel.org>,
        <jmoyer@redhat.com>, <Kernel-team@fb.com>
Subject: Re: [RFC 2/3] blk-throttling: weight based throttling
Message-ID: <20160121210043.GA3696942@devbig084.prn1.facebook.com>
References: <cover.1453308862.git.shli@fb.com>
 <9ea3761d048bf6b5f36c3d57e3eecbce2780b6c3.1453308862.git.shli@fb.com>
 <20160121203332.GD8379@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <20160121203332.GD8379@redhat.com>
User-Agent: Mutt/1.5.20 (2009-12-10)
X-Originating-IP: [192.168.52.123]
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-01-21_07:,,
 signatures=0
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 21, 2016 at 03:33:32PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 09:49:18AM -0800, Shaohua Li wrote:
> > We know total bandwidth of a disk and can calculate cgroup's bandwidth
> > percentage against disk bandwidth according to its weight. We can easily
> > calculate cgroup bandwidth.
> > 
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 134 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 2149a1d..b3f847d 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -12,6 +12,9 @@
> >  #include <linux/blk-cgroup.h>
> >  #include "blk.h"
> >  
> > +#define MAX_WEIGHT (1000)
> > +#define WEIGHT_RATIO_SHIFT (12)
> > +#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT)
> >  /* Max dispatch from a group in 1 round */
> >  static int throtl_grp_quantum = 8;
> >  
> > @@ -74,6 +77,10 @@ struct throtl_service_queue {
> >  	unsigned int		nr_pending;	/* # queued in the tree */
> >  	unsigned long		first_pending_disptime;	/* disptime of the first tg */
> >  	struct timer_list	pending_timer;	/* fires on first_pending_disptime */
> > +
> > +	unsigned int		weight;
> > +	unsigned int		children_weight;
> > +	unsigned int		ratio;
> 
> Will it be better to call it "share" instead of "ratio". It is basically
> a measure of % disk share of the group and share seems more intuitive.

Ok
> 
> [..]
> > +static void tg_update_bps(struct throtl_grp *tg)
> > +{
> > +	struct throtl_service_queue *sq, *parent_sq;
> > +
> > +	sq = &tg->service_queue;
> > +	parent_sq = sq->parent_sq;
> > +
> > +	if (!tg->td->weight_based || !parent_sq)
> > +		return;
> > +	sq->ratio = max_t(unsigned int,
> > +		parent_sq->ratio * sq->weight / parent_sq->children_weight,
> > +		1);
> > +
> 
> It might be good to decouple updation of "share/ratio" and updation of
> bps. Change of share can happen any time either weight is changed or
> an active group is queue/dequeued and we don't have to do it every time
> a bio is submitted.

Ok
> > +	tg->bps[READ] = max_t(uint64_t,
> > +		(queue_bandwidth(tg->td, READ) * sq->ratio) >>
> > +			WEIGHT_RATIO_SHIFT,
> > +		1024);
> > +	tg->bps[WRITE] = max_t(uint64_t,
> > +		(queue_bandwidth(tg->td, WRITE) * sq->ratio) >>
> > +			WEIGHT_RATIO_SHIFT,
> > +		1024);
> > +}
> > +
> > +static void tg_update_ratio(struct throtl_grp *tg)
> > +{
> > +	struct throtl_data *td = tg->td;
> > +	struct cgroup_subsys_state *pos_css;
> > +	struct blkcg_gq *blkg;
> > +
> > +	blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
> 
> Is it possible to traverse only the affected subtree instead of whole
> tree of groups. Because if weight is updated on a group, then we just
> need to traverse the subtree under that group's parent.

makes sense
> [..]
> > @@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
> >  
> >  	sq = &tg->service_queue;
> >  
> > +	tg_update_bps(tg);
> 
> Updating bps for every bio submitted sounds like a lot. We probably could
> do it when first bio gets queued in the group and then refresh it at
> some regular interval. Say when next set of dispatch happens from group
> we could update bandwidth of group after dispatch.

That calculation isn't very heavy. I'll revisit this if it's a problem.

Thanks,
Shaohua