From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760194AbcAUVBD (ORCPT ); Thu, 21 Jan 2016 16:01:03 -0500 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:10151 "EHLO mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750939AbcAUVA7 (ORCPT ); Thu, 21 Jan 2016 16:00:59 -0500 Date: Thu, 21 Jan 2016 13:00:44 -0800 From: Shaohua Li To: Vivek Goyal CC: , , , , Subject: Re: [RFC 2/3] blk-throttling: weight based throttling Message-ID: <20160121210043.GA3696942@devbig084.prn1.facebook.com> References: <9ea3761d048bf6b5f36c3d57e3eecbce2780b6c3.1453308862.git.shli@fb.com> <20160121203332.GD8379@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: <20160121203332.GD8379@redhat.com> User-Agent: Mutt/1.5.20 (2009-12-10) X-Originating-IP: [192.168.52.123] X-Proofpoint-Spam-Reason: safe X-FB-Internal: Safe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2016-01-21_07:,, signatures=0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jan 21, 2016 at 03:33:32PM -0500, Vivek Goyal wrote: > On Wed, Jan 20, 2016 at 09:49:18AM -0800, Shaohua Li wrote: > > We know total bandwidth of a disk and can calculate cgroup's bandwidth > > percentage against disk bandwidth according to its weight. We can easily > > calculate cgroup bandwidth. > > > > Signed-off-by: Shaohua Li > > --- > > block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 134 insertions(+), 1 deletion(-) > > > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c > > index 2149a1d..b3f847d 100644 > > --- a/block/blk-throttle.c > > +++ b/block/blk-throttle.c > > @@ -12,6 +12,9 @@ > > #include > > #include "blk.h" > > > > +#define MAX_WEIGHT (1000) > > +#define WEIGHT_RATIO_SHIFT (12) > > +#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT) > > /* Max dispatch from a group in 1 round */ > > static int throtl_grp_quantum = 8; > > > > @@ -74,6 +77,10 @@ struct throtl_service_queue { > > unsigned int nr_pending; /* # queued in the tree */ > > unsigned long first_pending_disptime; /* disptime of the first tg */ > > struct timer_list pending_timer; /* fires on first_pending_disptime */ > > + > > + unsigned int weight; > > + unsigned int children_weight; > > + unsigned int ratio; > > Will it be better to call it "share" instead of "ratio". It is basically > a measure of % disk share of the group and share seems more intuitive. Ok > > [..] > > +static void tg_update_bps(struct throtl_grp *tg) > > +{ > > + struct throtl_service_queue *sq, *parent_sq; > > + > > + sq = &tg->service_queue; > > + parent_sq = sq->parent_sq; > > + > > + if (!tg->td->weight_based || !parent_sq) > > + return; > > + sq->ratio = max_t(unsigned int, > > + parent_sq->ratio * sq->weight / parent_sq->children_weight, > > + 1); > > + > > It might be good to decouple updation of "share/ratio" and updation of > bps. Change of share can happen any time either weight is changed or > an active group is queue/dequeued and we don't have to do it every time > a bio is submitted. Ok > > + tg->bps[READ] = max_t(uint64_t, > > + (queue_bandwidth(tg->td, READ) * sq->ratio) >> > > + WEIGHT_RATIO_SHIFT, > > + 1024); > > + tg->bps[WRITE] = max_t(uint64_t, > > + (queue_bandwidth(tg->td, WRITE) * sq->ratio) >> > > + WEIGHT_RATIO_SHIFT, > > + 1024); > > +} > > + > > +static void tg_update_ratio(struct throtl_grp *tg) > > +{ > > + struct throtl_data *td = tg->td; > > + struct cgroup_subsys_state *pos_css; > > + struct blkcg_gq *blkg; > > + > > + blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) { > > Is it possible to traverse only the affected subtree instead of whole > tree of groups. Because if weight is updated on a group, then we just > need to traverse the subtree under that group's parent. makes sense > [..] > > @@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, > > > > sq = &tg->service_queue; > > > > + tg_update_bps(tg); > > Updating bps for every bio submitted sounds like a lot. We probably could > do it when first bio gets queued in the group and then refresh it at > some regular interval. Say when next set of dispatch happens from group > we could update bandwidth of group after dispatch. That calculation isn't very heavy. I'll revisit this if it's a problem. Thanks, Shaohua