From: Shaohua Li
Subject: [RFC 0/3] block: proportional based blk-throttling
Date: Wed, 20 Jan 2016 09:49:16 -0800
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

Currently we have two iocontrollers: blk-throttling is bandwidth based,
CFQ is weight based. It would be great if there were a unified
iocontroller covering both. Moreover, blk-mq doesn't support an
ioscheduler, which leaves blk-throttling as the only option for blk-mq.
It's time for a scalable iocontroller that supports both bandwidth and
weight based control and works with blk-mq.

blk-throttling is a good candidate: it works for both blk-mq and the
legacy queue. Its global lock looks scary for scalability, but it isn't
terrible in practice. In my test, NVMe IOPS can reach 1M/s with every
CPU issuing IO, and enabling blk-throttle costs around 2~3% IOPS and
10% CPU utilization. I'd expect this not to be a big problem for
today's workloads.

This patchset tries to make a unified iocontroller by leveraging
blk-throttling. The idea is pretty simple: if we know the disk's total
bandwidth, we can calculate each cgroup's bandwidth according to its
weight, and blk-throttling can use the calculated bandwidth to throttle
the cgroup. Total disk bandwidth changes dramatically with the IO
pattern, so long history is meaningless; the simple estimation
algorithm in patch 1 works pretty well when the IO pattern changes.

This is a feedback system. If we underestimate the disk's total
bandwidth, we assign less bandwidth to each cgroup, the cgroups
dispatch less IO, and an even lower total bandwidth is estimated. To
break the loop, the cgroup bandwidth calculation always uses
(1 + 1/8) * disk_bandwidth.

Another issue is that a cgroup can be inactive. If an inactive cgroup
is accounted in, the other cgroups are assigned less bandwidth, so they
dispatch less IO and the estimated total disk bandwidth drops further.
To avoid this, we periodically check the cgroups and exclude inactive
ones.

To test this, create two fio jobs and assign them different weights.
You will see the jobs get different bandwidth, roughly in proportion to
their weights.

Comments and benchmarks are welcome! Rough userspace C sketches of the
three ideas (bandwidth estimation, the weight calculation, and the
inactive-cgroup check) follow after the diffstat.

Thanks,
Shaohua

Shaohua Li (3):
  block: estimate disk bandwidth
  blk-throttling: weight based throttling
  blk-throttling: detect inactive cgroup

 block/blk-core.c       |  49 ++++++++++++
 block/blk-sysfs.c      |  13 ++++
 block/blk-throttle.c   | 198 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/blkdev.h |   4 +
 4 files changed, 263 insertions(+), 1 deletion(-)

-- 
2.4.6
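
The first sketch is NOT the estimator from patch 1 (see the patch for
the real code); it only illustrates the idea that recent history should
dominate the estimate. The slice length, field names, and the 7/8 blend
are all assumptions for illustration.

/*
 * Hypothetical short-history bandwidth estimator; slice length and the
 * sample blending are illustrative assumptions, not the patch 1 code.
 */
#include <stdint.h>

#define BW_SLICE_MS 100			/* assumed sampling window */

struct bw_estimator {
	uint64_t slice_start_ms;	/* start of the current window */
	uint64_t slice_bytes;		/* bytes dispatched in the window */
	uint64_t bw;			/* estimate, bytes per ms */
};

/* Account dispatched bytes; refresh the estimate once a window ends. */
static void bw_account(struct bw_estimator *e, uint64_t now_ms,
		       uint64_t bytes)
{
	e->slice_bytes += bytes;
	if (now_ms - e->slice_start_ms < BW_SLICE_MS)
		return;

	uint64_t sample = e->slice_bytes / (now_ms - e->slice_start_ms);
	/* Weight fresh samples heavily: long history is meaningless. */
	e->bw = e->bw ? (e->bw + 7 * sample) / 8 : sample;
	e->slice_start_ms = now_ms;
	e->slice_bytes = 0;
}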
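
The weight calculation with the (1 + 1/8) headroom can be written as
below; the function name is made up, but the formula is the one
described above.

/*
 * Per-cgroup share: proportional to weight, scaled by 9/8 so an
 * underestimated disk bandwidth can't feed back into an ever-shrinking
 * estimate. Name is illustrative, not from the patches.
 */
static uint64_t cgroup_bps(uint64_t disk_bw, unsigned int weight,
			   unsigned int active_weight_sum)
{
	return (disk_bw + disk_bw / 8) * weight / active_weight_sum;
}

For example, with an estimated 800MB/s disk and two active cgroups
weighted 200 and 100, the caps come out to 600MB/s and 300MB/s.
Together they may dispatch 9/8 of the estimate, which lets the estimate
recover if it was low.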
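
Finally, continuing the same sketch file, the periodic inactive-cgroup
check; the struct and field names are assumptions. A cgroup that
dispatched nothing since the last scan stops contributing its weight,
so idle cgroups can't depress everyone else's share.

/*
 * Hypothetical periodic scan: a group that dispatched no IO since the
 * last scan is excluded from the weight sum until it dispatches again.
 */
struct tg {				/* stand-in for a throttle group */
	unsigned int weight;
	uint64_t dispatched;		/* bytes since the last scan */
	int active;
};

static unsigned int refresh_active_weight(struct tg *tgs, int nr)
{
	unsigned int sum = 0;

	for (int i = 0; i < nr; i++) {
		tgs[i].active = tgs[i].dispatched > 0;
		if (tgs[i].active)
			sum += tgs[i].weight;
		tgs[i].dispatched = 0;	/* fresh observation window */
	}
	return sum;	/* feed this into the share calculation above */
}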