* [RFC 0/3] block: proportional based blk-throttling
@ 2016-01-20 17:49 Shaohua Li
  2016-01-20 17:49 ` [RFC 1/3] block: estimate disk bandwidth Shaohua Li
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Shaohua Li @ 2016-01-20 17:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, tj, vgoyal, jmoyer, Kernel-team

Hi,

Currently we have 2 iocontrollers. blk-throttling is bandwidth based, while
CFQ is weight based. It would be great to have a unified iocontroller for the
two. Also, blk-mq doesn't support an ioscheduler, which leaves blk-throttling
as the only option for blk-mq. It's time to have a scalable iocontroller that
supports both bandwidth and weight based control and works with blk-mq.

blk-throttling is a good candidate: it works for both blk-mq and the legacy
queue. It has a global lock, which is scary for scalability, but it's not
terrible in practice. In my test, NVMe IOPS can reach 1M/s with all CPUs
running IO. Enabling blk-throttle costs around 2~3% of IOPS and 10% of CPU
utilization. I'd expect this isn't a big problem for today's workloads. This
patchset therefore tries to build a unified iocontroller by leveraging
blk-throttling.

The idea is pretty simple. If we know the disk's total bandwidth, we can
calculate each cgroup's bandwidth according to its weight, and blk-throttling
can use the calculated bandwidth to throttle the cgroup. Disk total bandwidth
changes dramatically with the IO pattern, so a long history is meaningless.
The simple estimation algorithm in patch 1 works pretty well when the IO
pattern changes.
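
For illustration, here is a simplified sketch of such a short-window
bandwidth estimate. The identifiers, the window length and the exact
bookkeeping are made up for this example; patch 1 has the real algorithm.

#include <linux/jiffies.h>
#include <linux/math64.h>

#define BW_WINDOW_MS	100	/* short window: long history is meaningless */

struct bw_est {
	u64 bytes;		/* bytes dispatched in the current window */
	unsigned long stamp;	/* jiffies when the window started */
	u64 bw;			/* latest estimate, in bytes/sec */
};

/* called for every dispatched request */
static void bw_est_account(struct bw_est *est, u64 bytes)
{
	est->bytes += bytes;
	if (time_after(jiffies,
		       est->stamp + msecs_to_jiffies(BW_WINDOW_MS))) {
		unsigned long elapsed = jiffies - est->stamp;

		/* bytes per jiffy, scaled by HZ to get bytes/sec */
		est->bw = div_u64(est->bytes * HZ, elapsed);
		est->bytes = 0;
		est->stamp = jiffies;
	}
}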

This is a feedback system. If we underestimate the disk's total bandwidth, we
assign less bandwidth to the cgroups, the cgroups dispatch less IO, and an
even lower total bandwidth is estimated. To break the loop, the cgroup
bandwidth calculation always uses (1 + 1/8) * disk_bandwidth. Another issue is
that a cgroup could be inactive. If inactive cgroups were accounted for, the
other cgroups would be assigned less bandwidth and so dispatch less IO, and
the estimated disk total bandwidth would drop further. To avoid this, we
periodically check the cgroups and exclude inactive ones.
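
Putting the two rules together, the per-cgroup limit is roughly the
following (again a simplified sketch with made-up identifiers, not the
patch code):

#include <linux/math64.h>

/*
 * Weight-proportional share of the estimated disk bandwidth, with 1/8
 * headroom so an underestimate can't feed back into an even lower
 * estimate. Inactive cgroups are already excluded from
 * @active_weight_sum, so their share goes to the active ones.
 */
static u64 tg_bps_limit(u64 disk_bw, unsigned int weight,
			unsigned int active_weight_sum)
{
	u64 bw = disk_bw + (disk_bw >> 3);	/* (1 + 1/8) * disk_bw */

	return div_u64(bw * weight, active_weight_sum);
}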

To test this, create two fio jobs and assign them different weights. You will
see the jobs achieve bandwidth roughly proportional to their weights.
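
For example, something like the commands below. The weight knob name is
made up here; use whatever interface patch 2 actually exposes. fio's
cgroup= option puts a job into the named blkio cgroup.

  mkdir /sys/fs/cgroup/blkio/fast /sys/fs/cgroup/blkio/slow
  echo 500 > /sys/fs/cgroup/blkio/fast/blkio.throttle.weight  # hypothetical knob
  echo 100 > /sys/fs/cgroup/blkio/slow/blkio.throttle.weight  # hypothetical knob
  fio --name=fast --filename=/dev/nvme0n1 --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --cgroup=fast &
  fio --name=slow --filename=/dev/nvme0n1 --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=32 --cgroup=slow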

Comments and benchmarks are welcome!

Thanks,
Shaohua

Shaohua Li (3):
  block: estimate disk bandwidth
  blk-throttling: weight based throttling
  blk-throttling: detect inactive cgroup

 block/blk-core.c       |  49 ++++++++++++
 block/blk-sysfs.c      |  13 ++++
 block/blk-throttle.c   | 198 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/blkdev.h |   4 +
 4 files changed, 263 insertions(+), 1 deletion(-)

-- 
2.4.6

Thread overview: 30+ messages
2016-01-20 17:49 [RFC 0/3] block: proportional based blk-throttling Shaohua Li
2016-01-20 17:49 ` [RFC 1/3] block: estimate disk bandwidth Shaohua Li
2016-01-20 17:49 ` [RFC 2/3] blk-throttling: weight based throttling Shaohua Li
2016-01-21 20:33   ` Vivek Goyal
2016-01-21 21:00     ` Shaohua Li
2016-01-20 17:49 ` [RFC 3/3] blk-throttling: detect inactive cgroup Shaohua Li
2016-01-21 20:44   ` Vivek Goyal
2016-01-21 21:05     ` Shaohua Li
2016-01-21 21:09       ` Vivek Goyal
2016-01-20 19:05 ` [RFC 0/3] block: proportional based blk-throttling Vivek Goyal
2016-01-20 19:34   ` Shaohua Li
2016-01-20 19:40     ` Vivek Goyal
2016-01-20 19:43       ` Shaohua Li
2016-01-20 19:54         ` Vivek Goyal
2016-01-20 21:11         ` Vivek Goyal
2016-01-20 21:34           ` Shaohua Li
2016-01-21 21:10 ` Tejun Heo
2016-01-21 22:24   ` Shaohua Li
2016-01-21 22:41     ` Tejun Heo
2016-01-22  0:00       ` Shaohua Li
2016-01-22 14:48         ` Tejun Heo
2016-01-22 15:52           ` Vivek Goyal
2016-01-22 18:00             ` Shaohua Li
2016-01-22 19:09               ` Vivek Goyal
2016-01-22 19:45                 ` Shaohua Li
2016-01-22 20:04                   ` Vivek Goyal
2016-01-22 17:57           ` Shaohua Li
2016-01-22 18:08             ` Tejun Heo
2016-01-22 19:11               ` Shaohua Li
2016-01-22 14:43       ` Vivek Goyal
