linux-kernel.vger.kernel.org archive mirror
From: Vivek Goyal <vgoyal@redhat.com>
To: Shaohua Li <shli@fb.com>
Cc: linux-kernel@vger.kernel.org, axboe@kernel.dk, tj@kernel.org,
	jmoyer@redhat.com, Kernel-team@fb.com,
	linux-block@vger.kernel.org
Subject: Re: [RFC 0/3] block: proportional based blk-throttling
Date: Wed, 20 Jan 2016 16:11:00 -0500	[thread overview]
Message-ID: <20160120211100.GE10553@redhat.com> (raw)
In-Reply-To: <20160120194327.GA519886@devbig084.prn1.facebook.com>

On Wed, Jan 20, 2016 at 11:43:27AM -0800, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 02:40:13PM -0500, Vivek Goyal wrote:
> > On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> > > On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > > > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> > > > > Hi,
> > > > > 
> > > > > Currently we have two iocontrollers: blk-throttling is bandwidth based,
> > > > > while CFQ is weight based. It would be great to have a unified
> > > > > iocontroller covering both. Also, blk-mq doesn't support ioschedulers,
> > > > > leaving blk-throttling as the only option for blk-mq. It's time to have a
> > > > > scalable iocontroller supporting both bandwidth and weight based control
> > > > > and working with blk-mq.
> > > > > 
> > > > > blk-throttling is a good candidate: it works for both blk-mq and the
> > > > > legacy queue. It has a global lock, which is worrying for scalability,
> > > > > but it's not terrible in practice. In my test, NVMe IOPS can reach 1M/s
> > > > > with all CPUs running IO. Enabling blk-throttle costs around 2~3% of IOPS
> > > > > and 10% of CPU utilization. I'd expect this isn't a big problem for
> > > > > today's workloads. This patchset then tries to build a unified
> > > > > iocontroller by leveraging blk-throttling.
> > > > > 
> > > > > The idea is pretty simple. If we know the disk's total bandwidth, we can
> > > > > calculate each cgroup's bandwidth according to its weight, and
> > > > > blk-throttling can use the calculated bandwidth to throttle the cgroup.
> > > > > Total disk bandwidth changes dramatically with the IO pattern, so long
> > > > > history is meaningless; the simple estimation algorithm in patch 1 works
> > > > > pretty well when the IO pattern changes.
> > > > > 
> > > > > This is a feedback system. If we underestimate the disk's total
> > > > > bandwidth, we assign less bandwidth to each cgroup; the cgroups then
> > > > > dispatch less IO, and an even lower total disk bandwidth is estimated. To
> > > > > break this loop, the cgroup bandwidth calculation always uses
> > > > > (1 + 1/8) * disk_bandwidth. Another issue is that a cgroup could be
> > > > > inactive. If an inactive cgroup is accounted in, the other cgroups are
> > > > > assigned less bandwidth and so dispatch less IO, and the estimated total
> > > > > disk bandwidth drops further. To avoid this, we periodically check
> > > > > cgroups and exclude inactive ones.
> > > > > 
> > > > > To test this, create two fio jobs and assign them different weights. You
> > > > > will see the jobs get different bandwidth, roughly in proportion to
> > > > > their weights.
> > > > 
> > > > Patches look pretty small. Nice to see an implementation which will work
> > > > with faster devices and gets away from the dependency on CFQ.
> > > > 
> > > > How does one switch between weight based and bandwidth based throttling,
> > > > and what's the default?
> > > > 
> > > > So this has been implemented at the throttling layer. Is weight based
> > > > throttling enabled by default, or does one need to enable it explicitly?
> > > 
> > > So in the current implementation, only one of weight/bandwidth can be
> > > enabled. After one is enabled, switching to the other is forbidden. It
> > > should not be hard to allow switching, but mixing the two in one
> > > hierarchy does not sound trivial.
> > 
> > So is this selection per device? It would be good if you also provided
> > steps to test it. I am going through the code now and will figure it out
> > eventually, but if you give steps, it makes it a little easier.
> 
> Just use:
> echo "8:16 200" > $TEST_CG/blkio.throttle.weight
> 
> 200 is the weight
> 

It would be nice if you also updated the documentation. What are the max
and min weight values? What does it mean if one group has weight 200
while the others have not been configured? What % of the disk share will
that cgroup get?

I am still wrapping my head around the patches, but it looks like this is
a way of automatically coming up with a bandwidth limit for a cgroup
based on its weight. So the user does not have to configure absolute
values for read/write bandwidth; they can configure the weight, and that
will automatically control the cgroup's bandwidth dynamically.

What I am not clear on is, once I apply a weight to one cgroup, what
happens to the rest of the peer cgroups which are still not configured.
If I don't apply rules to them, then adding a weight to one cgroup does
not mean much.

Ideally, it might help to assign default weights to cgroups and have a
per-device switch to enable the weight based controller. That way user
space can enable it per device as needed, and all the cgroups get their
fair share without any extra configuration. If the overhead of this
mechanism is ultra low, then a global switch enabling it by default for
all devices would be useful too. That way user space has to toggle just
that one switch, and by default all IO cgroups on all block devices get
their fair share.

Thanks
Vivek
