* [RFC 0/3] block: proportional based blk-throttling
From: Shaohua Li @ 2016-01-20 17:49 UTC
To: linux-kernel; +Cc: axboe, tj, vgoyal, jmoyer, Kernel-team
Hi,
Currently we have 2 iocontrollers: blk-throttling is bandwidth based, CFQ is
weight based. It would be great if there were a unified iocontroller covering
both. And blk-mq doesn't support an ioscheduler, leaving blk-throttling the
only option for blk-mq. It's time to have a scalable iocontroller supporting
both bandwidth and weight based control and working with blk-mq.

blk-throttling is a good candidate: it works for both blk-mq and the legacy
queue. It has a global lock, which is scary for scalability, but it's not
terrible in practice. In my test, NVMe IOPS can reach 1M/s with all CPUs
issuing IO. Enabling blk-throttle costs around 2~3% IOPS and 10% CPU
utilization. I'd expect this isn't a big problem for today's workloads. This
patchset then tries to make a unified iocontroller, leveraging blk-throttling.
The idea is pretty simple. If we know the disk's total bandwidth, we can
calculate each cgroup's bandwidth according to its weight, and blk-throttling
can use the calculated bandwidth to throttle the cgroup. Total disk bandwidth
changes dramatically with IO pattern, so a long history is meaningless; the
simple estimation algorithm in patch 1 tracks pattern changes pretty well.
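To give a feel for it, the estimator is basically an exponential moving
average with a 1/8 step, sampled every 200ms. A minimal userspace sketch of
the idea (illustrative only, not the kernel code):

/* userspace sketch of the patch 1 estimator; units are 512-byte sectors/s */
#include <stdio.h>
#include <stdint.h>

static uint64_t avg_bw = 10ULL * 1024 * 1024 * 2;	/* seeded at 10GB/s */

static void update_bw(uint64_t sample)
{
	/* move 1/8 of the way toward the latest 200ms sample */
	if (avg_bw < sample)
		avg_bw += (sample - avg_bw) >> 3;
	else
		avg_bw -= (avg_bw - sample) >> 3;
}

int main(void)
{
	int i;

	/* the device suddenly sustains only ~200MB/s */
	for (i = 0; i < 60; i++)
		update_bw(200ULL * 1024 * 2);
	/* the estimate has decayed to roughly the new rate */
	printf("estimate: %llu MB/s\n",
	       (unsigned long long)(avg_bw * 512 >> 20));
	return 0;
}

With the 1/8 step, samples older than a few seconds carry almost no weight,
which is why the estimate tracks IO pattern changes quickly.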
This is a feedback system: if we underestimate the disk's total bandwidth, we
assign less bandwidth to each cgroup, the cgroups dispatch less IO, and an
even lower total bandwidth is estimated. To break the loop, the cgroup
bandwidth calculation always uses (1 + 1/8) * disk_bandwidth. Another issue is
that a cgroup can be inactive. If inactive cgroups were accounted in, the
other cgroups would be assigned less bandwidth, dispatch less IO, and drive
the estimate down further. To avoid this, we periodically check cgroups and
exclude inactive ones.
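As a concrete example of how limits fall out of weights (a sketch of the
arithmetic only; the constant matches WEIGHT_RATIO_SHIFT in patch 2):

/* sketch of the weight -> bps arithmetic from patch 2; illustrative only */
#include <stdio.h>
#include <stdint.h>

#define WEIGHT_RATIO_SHIFT 12

int main(void)
{
	uint64_t disk_bw = 400ULL << 20;	/* estimated bandwidth: 400MB/s */
	unsigned int weight[2] = { 200, 100 };	/* two sibling cgroups */
	unsigned int children_weight = 300;
	unsigned int parent_share = 1 << WEIGHT_RATIO_SHIFT; /* root: 100% */
	int i;

	disk_bw += disk_bw >> 3;	/* the (1 + 1/8) headroom */
	for (i = 0; i < 2; i++) {
		unsigned int share = parent_share * weight[i] / children_weight;

		/* prints ~300MB/s and ~150MB/s: a 2:1 split with headroom */
		printf("cgroup%d: %llu MB/s\n", i, (unsigned long long)
		       ((disk_bw * share >> WEIGHT_RATIO_SHIFT) >> 20));
	}
	return 0;
}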
To test this, create two fio jobs and assign them different weights. You will
see the jobs get bandwidth roughly proportional to their weights.
Comments and benchmarks are welcome!
Thanks,
Shaohua
Shaohua Li (3):
block: estimate disk bandwidth
blk-throttling: weight based throttling
blk-throttling: detect inactive cgroup
block/blk-core.c | 49 ++++++++++++
block/blk-sysfs.c | 13 ++++
block/blk-throttle.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/blkdev.h | 4 +
4 files changed, 263 insertions(+), 1 deletion(-)
--
2.4.6
* [RFC 1/3] block: estimate disk bandwidth
From: Shaohua Li @ 2016-01-20 17:49 UTC
To: linux-kernel; +Cc: axboe, tj, vgoyal, jmoyer, Kernel-team
Estimate each queue's read and write bandwidth with an exponential moving
average, sampled at most every HZ/5. Weight based blk-throttling can use the
estimated bandwidth to calculate each cgroup's bandwidth.
Signed-off-by: Shaohua Li <shli@fb.com>
---
block/blk-core.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
block/blk-sysfs.c | 13 +++++++++++++
include/linux/blkdev.h | 4 ++++
3 files changed, 66 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index 33e2f62..8c85bb0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -753,6 +753,12 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (blkcg_init_queue(q))
goto fail_ref;
+ /*
+ * assign a big initial bandwidth (10GB/s), so blk-throttle doesn't start
+ * slowly
+ */
+ q->avg_bw[READ] = 10 * 1024 * 1024 * 2;
+ q->avg_bw[WRITE] = 10 * 1024 * 1024 * 2;
return q;
fail_ref:
@@ -1909,6 +1915,46 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
return 0;
}
+static void blk_update_bandwidth(struct request_queue *q,
+ struct hd_struct *p)
+{
+ unsigned long now = jiffies;
+ unsigned long last = q->bw_timestamp;
+ sector_t bw;
+ sector_t read_sect, write_sect, tmp_sect;
+
+ if (time_before(now, last + HZ / 5))
+ return;
+
+ if (cmpxchg(&q->bw_timestamp, last, now) != last)
+ return;
+
+ tmp_sect = part_stat_read(p, sectors[READ]);
+ read_sect = tmp_sect - q->last_sects[READ];
+ q->last_sects[READ] = tmp_sect;
+ tmp_sect = part_stat_read(p, sectors[WRITE]);
+ write_sect = tmp_sect - q->last_sects[WRITE];
+ q->last_sects[WRITE] = tmp_sect;
+
+ if (now - last > HZ)
+ return;
+ if (now == last)
+ return;
+
+ bw = read_sect * HZ;
+ sector_div(bw, now - last);
+ if (q->avg_bw[READ] < bw)
+ q->avg_bw[READ] += (bw - q->avg_bw[READ]) >> 3;
+ if (q->avg_bw[READ] > bw)
+ q->avg_bw[READ] -= (q->avg_bw[READ] - bw) >> 3;
+ bw = write_sect * HZ;
+ sector_div(bw, now - last);
+ if (q->avg_bw[WRITE] < bw)
+ q->avg_bw[WRITE] += (bw - q->avg_bw[WRITE]) >> 3;
+ if (q->avg_bw[WRITE] > bw)
+ q->avg_bw[WRITE] -= (q->avg_bw[WRITE] - bw) >> 3;
+}
+
static noinline_for_stack bool
generic_make_request_checks(struct bio *bio)
{
@@ -1981,6 +2027,9 @@ generic_make_request_checks(struct bio *bio)
*/
create_io_context(GFP_ATOMIC, q->node);
+ blk_update_bandwidth(q,
+ part->partno ? &part_to_disk(part)->part0 : part);
+
if (!blkcg_bio_issue_check(q, bio))
return false;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e140cc4..419f6bd 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -348,6 +348,13 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}
+static ssize_t queue_avg_perf_show(struct request_queue *q, char *page)
+{
+ return sprintf(page, "%llu %llu\n",
+ (unsigned long long)q->avg_bw[READ] * 512,
+ (unsigned long long)q->avg_bw[WRITE] * 512);
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -479,6 +486,11 @@ static struct queue_sysfs_entry queue_poll_entry = {
.store = queue_poll_store,
};
+static struct queue_sysfs_entry queue_avg_perf_entry = {
+ .attr = {.name = "average_perf", .mode = S_IRUGO },
+ .show = queue_avg_perf_show,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -504,6 +516,7 @@ static struct attribute *default_attrs[] = {
&queue_iostats_entry.attr,
&queue_random_entry.attr,
&queue_poll_entry.attr,
+ &queue_avg_perf_entry.attr,
NULL,
};
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c70e358..7e6b8ed 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -464,6 +464,10 @@ struct request_queue {
struct bio_set *bio_split;
bool mq_sysfs_init_done;
+
+ unsigned long bw_timestamp;
+ sector_t avg_bw[2];
+ sector_t last_sects[2];
};
#define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */
--
2.4.6
* [RFC 2/3] blk-throttling: weight based throttling
From: Shaohua Li @ 2016-01-20 17:49 UTC
To: linux-kernel; +Cc: axboe, tj, vgoyal, jmoyer, Kernel-team
We know the total bandwidth of a disk and can calculate each cgroup's share
of it according to the cgroup's weight: a group's share is its parent's share
scaled by weight / children_weight. Multiplying the estimated disk bandwidth
by that share yields the bps limit used to throttle the group.
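For the hierarchical case, a child's share is its parent's share scaled by
the child's fraction of children_weight, so shares multiply down the tree. A
quick illustrative sketch of that arithmetic (not kernel code):

/* sketch: shares multiply down the cgroup hierarchy; illustrative only */
#include <stdio.h>

#define WEIGHT_RATIO_SHIFT 12

static unsigned int child_share(unsigned int parent_share,
				unsigned int weight,
				unsigned int children_weight)
{
	return parent_share * weight / children_weight;
}

int main(void)
{
	unsigned int root = 1 << WEIGHT_RATIO_SHIFT;	/* 100% of the disk */
	/* a parent group holding 500 of 1000 total weight under root */
	unsigned int parent = child_share(root, 500, 1000);
	/* a leaf holding 200 of 400 total weight under that parent */
	unsigned int leaf = child_share(parent, 200, 400);

	printf("parent: %u%%, leaf: %u%%\n",
	       parent * 100 >> WEIGHT_RATIO_SHIFT,
	       leaf * 100 >> WEIGHT_RATIO_SHIFT);	/* 50% and 25% */
	return 0;
}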
Signed-off-by: Shaohua Li <shli@fb.com>
---
block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 134 insertions(+), 1 deletion(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2149a1d..b3f847d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -12,6 +12,9 @@
#include <linux/blk-cgroup.h>
#include "blk.h"
+#define MAX_WEIGHT (1000)
+#define WEIGHT_RATIO_SHIFT (12)
+#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT)
/* Max dispatch from a group in 1 round */
static int throtl_grp_quantum = 8;
@@ -74,6 +77,10 @@ struct throtl_service_queue {
unsigned int nr_pending; /* # queued in the tree */
unsigned long first_pending_disptime; /* disptime of the first tg */
struct timer_list pending_timer; /* fires on first_pending_disptime */
+
+ unsigned int weight;
+ unsigned int children_weight;
+ unsigned int ratio;
};
enum tg_state_flags {
@@ -152,6 +159,9 @@ struct throtl_data
/* Work for dispatching throttled bios */
struct work_struct dispatch_work;
+
+ bool bw_based;
+ bool weight_based;
};
static void throtl_pending_timer_fn(unsigned long arg);
@@ -203,6 +213,15 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
return container_of(sq, struct throtl_data, service_queue);
}
+static inline uint64_t queue_bandwidth(struct throtl_data *td, int rw)
+{
+ uint64_t bw = td->queue->avg_bw[rw] * 512;
+
+ /* give extra bw, so cgroup can dispatch enough IO */
+ bw += bw >> 3;
+ return bw;
+}
+
/**
* throtl_log - log debug message via blktrace
* @sq: the service_queue being reported
@@ -371,6 +390,7 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
sq->parent_sq = &td->service_queue;
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
+ sq->parent_sq->children_weight += sq->weight;
tg->td = td;
}
@@ -386,7 +406,8 @@ static void tg_update_has_rules(struct throtl_grp *tg)
for (rw = READ; rw <= WRITE; rw++)
tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
- (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+ (tg->bps[rw] != -1 || tg->iops[rw] != -1 ||
+ tg->service_queue.weight);
}
static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -401,6 +422,10 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
static void throtl_pd_free(struct blkg_policy_data *pd)
{
struct throtl_grp *tg = pd_to_tg(pd);
+ struct throtl_service_queue *sq = &tg->service_queue;
+
+ if (sq->parent_sq)
+ sq->parent_sq->children_weight -= sq->weight;
del_timer_sync(&tg->service_queue.pending_timer);
kfree(tg);
@@ -898,6 +923,48 @@ static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
}
+static void tg_update_bps(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *sq, *parent_sq;
+
+ sq = &tg->service_queue;
+ parent_sq = sq->parent_sq;
+
+ if (!tg->td->weight_based || !parent_sq)
+ return;
+ sq->ratio = max_t(unsigned int,
+ parent_sq->ratio * sq->weight / parent_sq->children_weight,
+ 1);
+
+ tg->bps[READ] = max_t(uint64_t,
+ (queue_bandwidth(tg->td, READ) * sq->ratio) >>
+ WEIGHT_RATIO_SHIFT,
+ 1024);
+ tg->bps[WRITE] = max_t(uint64_t,
+ (queue_bandwidth(tg->td, WRITE) * sq->ratio) >>
+ WEIGHT_RATIO_SHIFT,
+ 1024);
+}
+
+static void tg_update_ratio(struct throtl_grp *tg)
+{
+ struct throtl_data *td = tg->td;
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+
+ blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
+ struct throtl_service_queue *sq;
+
+ tg = blkg_to_tg(blkg);
+ sq = &tg->service_queue;
+
+ if (!sq->parent_sq)
+ continue;
+
+ tg_update_bps(tg);
+ }
+}
+
static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1202,12 +1269,65 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
v = -1;
tg = blkg_to_tg(ctx.blkg);
+ if (tg->td->weight_based) {
+ ret = -EBUSY;
+ goto out_finish;
+ }
if (is_u64)
*(u64 *)((void *)tg + of_cft(of)->private) = v;
else
*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
+ tg->td->bw_based = true;
+
+ tg_conf_updated(tg);
+ ret = 0;
+out_finish:
+ blkg_conf_finish(&ctx);
+ return ret ?: nbytes;
+}
+
+static ssize_t tg_set_weight(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct blkcg *blkcg = css_to_blkcg(of_css(of));
+ struct blkg_conf_ctx ctx;
+ struct throtl_grp *tg;
+ int ret;
+ u64 v;
+ int old_weight;
+
+ ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
+ if (ret)
+ return ret;
+ ret = -EINVAL;
+ if (sscanf(ctx.body, "%llu", &v) != 1)
+ goto out_finish;
+ if (v > MAX_WEIGHT)
+ v = MAX_WEIGHT;
+ if (v == 0)
+ v = 1;
+
+ tg = blkg_to_tg(ctx.blkg);
+ if (tg->td->bw_based) {
+ ret = -EBUSY;
+ goto out_finish;
+ }
+ tg->td->weight_based = true;
+
+ old_weight = tg->service_queue.weight;
+
+ tg->service_queue.weight = v;
+ if (tg->service_queue.parent_sq) {
+ struct throtl_service_queue *psq = tg->service_queue.parent_sq;
+ if (v > old_weight)
+ psq->children_weight += v - old_weight;
+ else if (v < old_weight)
+ psq->children_weight -= old_weight - v;
+ }
+
+ tg_update_ratio(tg);
tg_conf_updated(tg);
ret = 0;
out_finish:
@@ -1229,6 +1349,12 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
static struct cftype throtl_legacy_files[] = {
{
+ .name = "throttle.weight",
+ .private = offsetof(struct throtl_grp, service_queue.weight),
+ .seq_show = tg_print_conf_uint,
+ .write = tg_set_weight,
+ },
+ {
.name = "throttle.read_bps_device",
.private = offsetof(struct throtl_grp, bps[READ]),
.seq_show = tg_print_conf_u64,
@@ -1313,6 +1439,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
return ret;
tg = blkg_to_tg(ctx.blkg);
+ if (tg->td->weight_based) {
+ ret = -EBUSY;
+ goto out_finish;
+ }
v[0] = tg->bps[READ];
v[1] = tg->bps[WRITE];
@@ -1358,6 +1488,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
tg->bps[WRITE] = v[1];
tg->iops[READ] = v[2];
tg->iops[WRITE] = v[3];
+ tg->td->bw_based = true;
tg_conf_updated(tg);
ret = 0;
@@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
sq = &tg->service_queue;
+ tg_update_bps(tg);
while (true) {
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
@@ -1563,6 +1695,7 @@ int blk_throtl_init(struct request_queue *q)
INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
+ td->service_queue.ratio = WEIGHT_RATIO;
q->td = td;
td->queue = q;
--
2.4.6
* [RFC 3/3] blk-throttling: detect inactive cgroup
From: Shaohua Li @ 2016-01-20 17:49 UTC
To: linux-kernel; +Cc: axboe, tj, vgoyal, jmoyer, Kernel-team
If a cgroup has been inactive for some time, it should be excluded from the
bandwidth calculation so that the active cgroups get its share.
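Concretely: each bio refreshes the issuing group's timestamp, and a periodic
scan (every CGCHECK_TIME) deactivates groups that have been quiet and have
nothing queued, returning their weight to the parent. A simplified,
illustrative sketch of the rule (not the kernel code):

/* simplified sketch of the inactivity rule; illustrative, not kernel code */
#include <stdio.h>
#include <stdbool.h>

#define CGCHECK_TIME 20 /* ms; must be less than the bandwidth update interval */

struct grp {
	unsigned long active_timestamp;	/* last time the group issued IO */
	unsigned int nr_queued;		/* bios still throttled in the group */
	unsigned int weight;
	bool active;
};

/* run for every group on each periodic check */
static void check_inactive(struct grp *g, unsigned int *children_weight,
			   unsigned long now)
{
	if (g->active && !g->nr_queued &&
	    now - g->active_timestamp > CGCHECK_TIME) {
		g->active = false;
		*children_weight -= g->weight; /* peers' shares grow */
	}
}

int main(void)
{
	struct grp g = { .active_timestamp = 0, .nr_queued = 0,
			 .weight = 200, .active = true };
	unsigned int children_weight = 300;

	check_inactive(&g, &children_weight, 50 /* ms since g's last IO */);
	printf("active=%d children_weight=%u\n", g.active, children_weight);
	return 0;
}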
Signed-off-by: Shaohua Li <shli@fb.com>
---
block/blk-throttle.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 67 insertions(+), 4 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b3f847d..5c11270 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -15,6 +15,9 @@
#define MAX_WEIGHT (1000)
#define WEIGHT_RATIO_SHIFT (12)
#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT)
+/* must be less than the interval at which we update bandwidth */
+#define CGCHECK_TIME (msecs_to_jiffies(20))
+
/* Max dispatch from a group in 1 round */
static int throtl_grp_quantum = 8;
@@ -81,6 +84,9 @@ struct throtl_service_queue {
unsigned int weight;
unsigned int children_weight;
unsigned int ratio;
+
+ unsigned long active_timestamp;
+ bool active;
};
enum tg_state_flags {
@@ -162,6 +168,7 @@ struct throtl_data
bool bw_based;
bool weight_based;
+ unsigned long last_check_timestamp;
};
static void throtl_pending_timer_fn(unsigned long arg);
@@ -390,7 +397,6 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
sq->parent_sq = &td->service_queue;
if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
- sq->parent_sq->children_weight += sq->weight;
tg->td = td;
}
@@ -424,7 +430,7 @@ static void throtl_pd_free(struct blkg_policy_data *pd)
struct throtl_grp *tg = pd_to_tg(pd);
struct throtl_service_queue *sq = &tg->service_queue;
- if (sq->parent_sq)
+ if (sq->active && sq->parent_sq)
sq->parent_sq->children_weight -= sq->weight;
del_timer_sync(&tg->service_queue.pending_timer);
@@ -930,7 +936,7 @@ static void tg_update_bps(struct throtl_grp *tg)
sq = &tg->service_queue;
parent_sq = sq->parent_sq;
- if (!tg->td->weight_based || !parent_sq)
+ if (!tg->td->weight_based || !parent_sq || !sq->active)
return;
sq->ratio = max_t(unsigned int,
parent_sq->ratio * sq->weight / parent_sq->children_weight,
@@ -965,6 +971,26 @@ static void tg_update_ratio(struct throtl_grp *tg)
}
}
+static void tg_update_active_time(struct throtl_grp *tg)
+{
+ struct throtl_service_queue *sq = &tg->service_queue;
+ bool update_ratio = false;
+ unsigned long now = jiffies;
+
+ while (sq->parent_sq) {
+ sq->active_timestamp = now;
+ if (!sq->active) {
+ sq->parent_sq->children_weight += sq->weight;
+ sq->active = true;
+ update_ratio = true;
+ }
+ sq = sq->parent_sq;
+ }
+
+ if (update_ratio)
+ tg_update_ratio(tg);
+}
+
static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -984,6 +1010,8 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
throtl_charge_bio(tg, bio);
+ tg_update_active_time(tg);
+
/*
* If our parent is another tg, we just need to transfer @bio to
* the parent using throtl_add_bio_tg(). If our parent is
@@ -1319,7 +1347,7 @@ static ssize_t tg_set_weight(struct kernfs_open_file *of,
old_weight = tg->service_queue.weight;
tg->service_queue.weight = v;
- if (tg->service_queue.parent_sq) {
+ if (tg->service_queue.active && tg->service_queue.parent_sq) {
struct throtl_service_queue *psq = tg->service_queue.parent_sq;
if (v > old_weight)
psq->children_weight += v - old_weight;
@@ -1524,6 +1552,39 @@ static struct blkcg_policy blkcg_policy_throtl = {
.pd_free_fn = throtl_pd_free,
};
+static void detect_inactive_cg(struct throtl_grp *tg)
+{
+ struct throtl_data *td = tg->td;
+ struct throtl_service_queue *sq = &tg->service_queue;
+ unsigned long now = jiffies;
+ struct cgroup_subsys_state *pos_css;
+ struct blkcg_gq *blkg;
+ bool update_ratio = false;
+
+ tg_update_active_time(tg);
+
+ if (time_before(now, td->last_check_timestamp))
+ return;
+ td->last_check_timestamp = now + CGCHECK_TIME;
+
+ blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+ tg = blkg_to_tg(blkg);
+ sq = &tg->service_queue;
+ if (sq->parent_sq &&
+ time_before(sq->active_timestamp + CGCHECK_TIME, now) &&
+ !(sq->nr_queued[READ] || sq->nr_queued[WRITE])) {
+ if (sq->active && sq->parent_sq) {
+ sq->active = false;
+ sq->parent_sq->children_weight -= sq->weight;
+ update_ratio = true;
+ }
+ }
+ }
+
+ if (update_ratio)
+ tg_update_ratio(tg);
+}
+
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
struct bio *bio)
{
@@ -1546,6 +1607,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
sq = &tg->service_queue;
+ detect_inactive_cg(tg);
tg_update_bps(tg);
while (true) {
/* throtl is FIFO - if bios are already queued, should queue */
@@ -1696,6 +1758,7 @@ int blk_throtl_init(struct request_queue *q)
INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
throtl_service_queue_init(&td->service_queue);
td->service_queue.ratio = WEIGHT_RATIO;
+ td->service_queue.active = true;
q->td = td;
td->queue = q;
--
2.4.6
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Vivek Goyal @ 2016-01-20 19:05 UTC
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team
On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
Patches look pretty small. Nice to see an implementation which will work
with faster devices and gets away from the dependency on CFQ.

How does one switch between weight based and bandwidth based throttling?
What's the default?

So this has been implemented at the throttling layer. Is weight based
throttling enabled by default, or does one need to enable it explicitly?

What's the performance impact of the new weight based throttling?
Thanks
Vivek
>
> Comments and benchmarks are welcome!
>
> Thanks,
> Shaohua
>
> Shaohua Li (3):
> block: estimate disk bandwidth
> blk-throttling: weight based throttling
> blk-throttling: detect inactive cgroup
>
> block/blk-core.c | 49 ++++++++++++
> block/blk-sysfs.c | 13 ++++
> block/blk-throttle.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/blkdev.h | 4 +
> 4 files changed, 263 insertions(+), 1 deletion(-)
>
> --
> 2.4.6
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Shaohua Li @ 2016-01-20 19:34 UTC
To: Vivek Goyal; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team
On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
>
> How does one switch between weight based and bandwidth based throttling?
> What's the default?
>
> So this has been implemented at the throttling layer. Is weight based
> throttling enabled by default, or does one need to enable it explicitly?
So in the current implementation, only one of weight/bandwidth based
control can be enabled; once one is enabled, switching to the other is
forbidden. It should not be hard to allow switching, but mixing the two in
one hierarchy sounds nontrivial.

> What's the performance impact of the new weight based throttling?

I haven't benchmarked it yet, but this doesn't add much code, so I'd
expect performance to be unchanged. I'll run a test soon.
Thanks,
Shaohua
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Vivek Goyal @ 2016-01-20 19:40 UTC
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team
On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
>
> So in the current implementation, only one of weight/bandwidth based
> control can be enabled; once one is enabled, switching to the other is
> forbidden. It should not be hard to allow switching, but mixing the two
> in one hierarchy sounds nontrivial.
So is this selection per device? It would be good if you also provided
steps to test it. I am going through the code now and will figure it out
eventually, but steps would make it a little easier.

Is this one-way selection system-wide or per device?
Thanks
Vivek
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Shaohua Li @ 2016-01-20 19:43 UTC
To: Vivek Goyal; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team
On Wed, Jan 20, 2016 at 02:40:13PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> > On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
>
> So is this selection per device? It would be good if you also provided
> steps to test it. I am going through the code now and will figure it out
> eventually, but steps would make it a little easier.
Just use:

echo "8:16 200" > $TEST_CG/blkio.throttle.weight

where 8:16 is the device number and 200 is the weight.

> Is this one-way selection system-wide or per device?

It's per device currently.
Thanks,
Shaohua
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Vivek Goyal @ 2016-01-20 19:54 UTC
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team, linux-block
On Wed, Jan 20, 2016 at 11:43:27AM -0800, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 02:40:13PM -0500, Vivek Goyal wrote:
> > On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> > > On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > > > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
>
> Just use:
>
> echo "8:16 200" > $TEST_CG/blkio.throttle.weight
>
> where 8:16 is the device number and 200 is the weight.
Ok. So by default this mechanism is off, and the moment I assign a weight
to any cgroup on a device, the weight based mechanism kicks in? And what
happens to other cgroups that are doing IO but have not been assigned any
weight?

I am cc'ing the linux-block mailing list as well.
Thanks
Vivek
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Vivek Goyal @ 2016-01-20 21:11 UTC
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team, linux-block
On Wed, Jan 20, 2016 at 11:43:27AM -0800, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 02:40:13PM -0500, Vivek Goyal wrote:
> > On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> > > On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > > > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
>
> Just use:
>
> echo "8:16 200" > $TEST_CG/blkio.throttle.weight
>
> where 8:16 is the device number and 200 is the weight.
It would be nice if you also updated the documentation. What are the max
and min weight values? What does it mean if a group has weight 200 while
others have not been configured; what % share of the disk will this
cgroup get?

I am still wrapping my head around the patches, but it looks like this is
a way of automatically coming up with a bandwidth limit for a cgroup
based on its weight. So the user does not have to configure absolute
values for read/write bandwidth; they can configure the weight, and that
will control the cgroup's bandwidth dynamically.

What I am not clear on is: once I apply a weight to one cgroup, what
happens to the rest of the peer cgroups which are still not configured?
If I don't apply rules to them, then adding a weight to one cgroup does
not mean much.

Ideally, it might help to assign default weights to cgroups and have a
per-device switch to enable the weight based controller. That way user
space can enable it per device as needed, and all the cgroups get their
fair share without any extra configuration. If the overhead of this
mechanism is ultra low, then a global switch to enable it by default for
all devices would be useful too. That way user space has to toggle just
that one switch, and by default all IO cgroups on all block devices get
their fair share.
Thanks
Vivek
* Re: [RFC 0/3] block: proportional based blk-throttling
From: Shaohua Li @ 2016-01-20 21:34 UTC
To: Vivek Goyal; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team, linux-block
On Wed, Jan 20, 2016 at 04:11:00PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 11:43:27AM -0800, Shaohua Li wrote:
> > On Wed, Jan 20, 2016 at 02:40:13PM -0500, Vivek Goyal wrote:
> > > On Wed, Jan 20, 2016 at 11:34:48AM -0800, Shaohua Li wrote:
> > > > On Wed, Jan 20, 2016 at 02:05:35PM -0500, Vivek Goyal wrote:
> > > > > On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
[..]
>
> It would be nice if you also updated the documentation. What are the max
> and min weight values? What does it mean if a group has weight 200 while
> others have not been configured; what % share of the disk will this
> cgroup get?
>
> I am still wrapping my head around the patches, but it looks like this is
> a way of automatically coming up with a bandwidth limit for a cgroup
> based on its weight. So the user does not have to configure absolute
> values for read/write bandwidth; they can configure the weight, and that
> will control the cgroup's bandwidth dynamically.
>
> What I am not clear on is: once I apply a weight to one cgroup, what
> happens to the rest of the peer cgroups which are still not configured?
> If I don't apply rules to them, then adding a weight to one cgroup does
> not mean much.
>
> Ideally, it might help to assign default weights to cgroups and have a
> per-device switch to enable the weight based controller. That way user
> space can enable it per device as needed, and all the cgroups get their
> fair share without any extra configuration. If the overhead of this
> mechanism is ultra low, then a global switch to enable it by default for
> all devices would be useful too. That way user space has to toggle just
> that one switch, and by default all IO cgroups on all block devices get
> their fair share.
I haven't thought about the interface too much yet; this version mainly
demonstrates the idea. Your suggestions look reasonable. A single control
to enable weight/bandwidth based control with proper default settings is
convenient. I will add it in the next post.
Thanks,
Shaohua
* Re: [RFC 2/3] blk-throttling: weight based throttling
From: Vivek Goyal @ 2016-01-21 20:33 UTC
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team
On Wed, Jan 20, 2016 at 09:49:18AM -0800, Shaohua Li wrote:
[..]
> @@ -12,6 +12,9 @@
> #include <linux/blk-cgroup.h>
> #include "blk.h"
>
> +#define MAX_WEIGHT (1000)
> +#define WEIGHT_RATIO_SHIFT (12)
> +#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT)
> /* Max dispatch from a group in 1 round */
> static int throtl_grp_quantum = 8;
>
> @@ -74,6 +77,10 @@ struct throtl_service_queue {
> unsigned int nr_pending; /* # queued in the tree */
> unsigned long first_pending_disptime; /* disptime of the first tg */
> struct timer_list pending_timer; /* fires on first_pending_disptime */
> +
> + unsigned int weight;
> + unsigned int children_weight;
> + unsigned int ratio;
Would it be better to call it "share" instead of "ratio"? It is basically
a measure of the group's % share of the disk, and "share" seems more
intuitive.
[..]
> +static void tg_update_bps(struct throtl_grp *tg)
> +{
> + struct throtl_service_queue *sq, *parent_sq;
> +
> + sq = &tg->service_queue;
> + parent_sq = sq->parent_sq;
> +
> + if (!tg->td->weight_based || !parent_sq)
> + return;
> + sq->ratio = max_t(unsigned int,
> + parent_sq->ratio * sq->weight / parent_sq->children_weight,
> + 1);
> +
It might be good to decouple updating the "share/ratio" from updating the
bps. A share change can only happen when a weight is changed or an active
group is queued/dequeued, so we don't have to recompute it every time a
bio is submitted.
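IOW, something like this (a rough, untested sketch on top of this patch's
structures; I haven't tried to compile it):

static void tg_update_share(struct throtl_grp *tg)
{
	struct throtl_service_queue *sq = &tg->service_queue;
	struct throtl_service_queue *parent_sq = sq->parent_sq;

	if (!tg->td->weight_based || !parent_sq)
		return;
	/* only called on weight change or group activation/deactivation */
	sq->ratio = max_t(unsigned int,
			  parent_sq->ratio * sq->weight /
			  parent_sq->children_weight, 1);
}

static void tg_update_bps(struct throtl_grp *tg)
{
	/* cheap enough for the IO path: just scale the cached share */
	tg->bps[READ] = max_t(uint64_t,
			      (queue_bandwidth(tg->td, READ) *
			       tg->service_queue.ratio) >> WEIGHT_RATIO_SHIFT,
			      1024);
	tg->bps[WRITE] = max_t(uint64_t,
			       (queue_bandwidth(tg->td, WRITE) *
				tg->service_queue.ratio) >> WEIGHT_RATIO_SHIFT,
			       1024);
}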
> + tg->bps[READ] = max_t(uint64_t,
> + (queue_bandwidth(tg->td, READ) * sq->ratio) >>
> + WEIGHT_RATIO_SHIFT,
> + 1024);
> + tg->bps[WRITE] = max_t(uint64_t,
> + (queue_bandwidth(tg->td, WRITE) * sq->ratio) >>
> + WEIGHT_RATIO_SHIFT,
> + 1024);
> +}
> +
> +static void tg_update_ratio(struct throtl_grp *tg)
> +{
> + struct throtl_data *td = tg->td;
> + struct cgroup_subsys_state *pos_css;
> + struct blkcg_gq *blkg;
> +
> + blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
Is it possible to traverse only the affected subtree instead of the whole
tree of groups? If the weight is updated on a group, we only need to
traverse the subtree under that group's parent.
[..]
> @@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
>
> sq = &tg->service_queue;
>
> + tg_update_bps(tg);
Updating bps for every submitted bio sounds like a lot. We could probably
do it when the first bio gets queued in the group and then refresh it at
some regular interval; say, when the next set of dispatches happens from
the group, we could update the group's bandwidth after the dispatch.
Thanks
Vivek
* Re: [RFC 3/3] blk-throttling: detect inactive cgroup
From: Vivek Goyal @ 2016-01-21 20:44 UTC
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team, linux-block
On Wed, Jan 20, 2016 at 09:49:19AM -0800, Shaohua Li wrote:
> If a cgroup is inactive for some time, it should be excluded from
> bandwidth calculation.
I am not sure why we require this patch. If a group is inactive, it will
not be on the service tree and will not contribute to the weight, hence
will not contribute to the share.
Thanks
Vivek
* Re: [RFC 2/3] blk-throttling: weight based throttling
From: Shaohua Li @ 2016-01-21 21:00 UTC
To: Vivek Goyal; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team
On Thu, Jan 21, 2016 at 03:33:32PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 09:49:18AM -0800, Shaohua Li wrote:
[..]
> > @@ -74,6 +77,10 @@ struct throtl_service_queue {
> > unsigned int nr_pending; /* # queued in the tree */
> > unsigned long first_pending_disptime; /* disptime of the first tg */
> > struct timer_list pending_timer; /* fires on first_pending_disptime */
> > +
> > + unsigned int weight;
> > + unsigned int children_weight;
> > + unsigned int ratio;
>
> Would it be better to call it "share" instead of "ratio"? It is basically
> a measure of the group's % share of the disk, and "share" seems more
> intuitive.
Ok
>
> [..]
> > +static void tg_update_bps(struct throtl_grp *tg)
> > +{
> > + struct throtl_service_queue *sq, *parent_sq;
> > +
> > + sq = &tg->service_queue;
> > + parent_sq = sq->parent_sq;
> > +
> > + if (!tg->td->weight_based || !parent_sq)
> > + return;
> > + sq->ratio = max_t(unsigned int,
> > + parent_sq->ratio * sq->weight / parent_sq->children_weight,
> > + 1);
> > +
>
> It might be good to decouple updating the "share/ratio" from updating the
> bps. A share change can only happen when a weight is changed or an active
> group is queued/dequeued, so we don't have to recompute it every time a
> bio is submitted.
Ok
[..]
> > +
> > +static void tg_update_ratio(struct throtl_grp *tg)
> > +{
> > + struct throtl_data *td = tg->td;
> > + struct cgroup_subsys_state *pos_css;
> > + struct blkcg_gq *blkg;
> > +
> > + blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
>
> Is it possible to traverse only the affected subtree instead of the whole
> tree of groups? If the weight is updated on a group, we only need to
> traverse the subtree under that group's parent.
makes sense
> [..]
> > @@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
> >
> > sq = &tg->service_queue;
> >
> > + tg_update_bps(tg);
>
> Updating bps for every submitted bio sounds like a lot. We could probably
> do it when the first bio gets queued in the group and then refresh it at
> some regular interval; say, when the next set of dispatches happens from
> the group, we could update the group's bandwidth after the dispatch.
That calculation isn't very heavy. I'll revisit this if it's a problem.
Thanks,
Shaohua
* Re: [RFC 3/3] blk-throttling: detect inactive cgroup
From: Shaohua Li @ 2016-01-21 21:05 UTC
To: Vivek Goyal; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team, linux-block
On Thu, Jan 21, 2016 at 03:44:05PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 09:49:19AM -0800, Shaohua Li wrote:
> > If a cgroup is inactive for some time, it should be excluded from
> > bandwidth calculation.
>
> I am not sure why we require this patch. If a group is inactive, it
> will not be on the service tree and will not contribute to the weight,
> hence will not contribute to the share.
The share calculation is based on existing cgroups (with this patch,
existing active cgroups). A cgroup is on the service tree only when it has
pending bios, right? Not currently having pending bios isn't a reason to
exclude a cgroup.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 3/3] blk-throttling: detect inactive cgroup
2016-01-21 21:05 ` Shaohua Li
@ 2016-01-21 21:09 ` Vivek Goyal
0 siblings, 0 replies; 30+ messages in thread
From: Vivek Goyal @ 2016-01-21 21:09 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, tj, jmoyer, Kernel-team, linux-block
On Thu, Jan 21, 2016 at 01:05:43PM -0800, Shaohua Li wrote:
> On Thu, Jan 21, 2016 at 03:44:05PM -0500, Vivek Goyal wrote:
> > On Wed, Jan 20, 2016 at 09:49:19AM -0800, Shaohua Li wrote:
> > > If a cgroup is inactive for some time, it should be excluded from
> > > bandwidth calculation.
> >
> > I am not sure why we require this patch. If a group is inactive, it
> > will not be on the service tree and will not contribute to the weight,
> > hence will not contribute to the share.
>
> The share calculation is based on existing cgroups (with this patch,
> existing active cgroups). A cgroup is on the service tree only when it has
> pending bios, right? Not currently having pending bios isn't a reason to
> exclude a cgroup.
If a cgroup is not doing IO and is not active, then it should not be part
of the disk share calculation. Once a cgroup becomes active, it should get
its fair share and reduce the share of its peers.
Thanks
Vivek
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-20 17:49 [RFC 0/3] block: proportional based blk-throttling Shaohua Li
` (3 preceding siblings ...)
2016-01-20 19:05 ` [RFC 0/3] block: proportional based blk-throttling Vivek Goyal
@ 2016-01-21 21:10 ` Tejun Heo
2016-01-21 22:24 ` Shaohua Li
4 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-01-21 21:10 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
Hello, Shaohua.
On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> Currently we have 2 iocontrollers. blk-throttling is bandwidth based. CFQ is
Just a nit. blk-throttle is both bw and iops based.
> weight based. It would be great there is a unified iocontroller for the two.
> And blk-mq doesn't support ioscheduler, leaving blk-throttling the only option
> for blk-mq. It's time to have a scalable iocontroller supporting both
> bandwidth/weight based control and working with blk-mq.
>
> blk-throttling is a good candidate, it works for both blk-mq and legacy queue.
> It has a global lock which is scaring for scalability, but it's not terrible in
> practice. In my test, the NVMe IOPS can reach 1M/s and I have all CPU run IO. Enabling
> blk-throttle has around 2~3% IOPS and 10% cpu utilization impact. I'd expect
> this isn't a big problem for today's workload. This patchset then try to make a
> unified iocontroller. I'm leveraging blk-throttling.
Have you tried with some level, say 5, of nesting? IIRC, how it
implements hierarchical control is rather braindead (and yeah I'm
responsible for the damage).
> The idea is pretty simple. If we know disk total bandwidth, we can calculate
> cgroup bandwidth according to its weight. blk-throttling can use the calculated
> bandwidth to throttle cgroup. Disk total bandwidth changes dramatically per IO
> pattern. Long history is meaningless. The simple algorithm in patch 1 works
> pretty well when IO pattern changes.
So, that part is fine but I don't think it makes sense to make weight
based control either bandwidth or iops based. The fundamental problem
is that it's a false choice. It's like asking someone who wants a car
to choose between accelerator and brake. It's a choice without a good
answer. Both are wrong. Also note that there's an inherent
difference from the currently implemented absolute limits. Absolute
limits can be combined. Weights based on different metrics can't be.
Even with modern SSDs, both iops and bandwidth play major roles in
deciding how costly each IO is and I'm fairly confident that this is
fundamental enough to be the case for quite a while. I *think* the
cost model can be approximated from measurements. Devices are
becoming more and more predictable in their behaviors after all. For
weight based distribution, the unit of distribution should be IO time,
not bandwidth or iops.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-21 21:10 ` Tejun Heo
@ 2016-01-21 22:24 ` Shaohua Li
2016-01-21 22:41 ` Tejun Heo
0 siblings, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2016-01-21 22:24 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
On Thu, Jan 21, 2016 at 04:10:02PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> > Currently we have 2 iocontrollers. blk-throttling is bandwidth based. CFQ is
>
> Just a nit. blk-throttle is both bw and iops based.
>
> > weight based. It would be great there is a unified iocontroller for the two.
> > And blk-mq doesn't support ioscheduler, leaving blk-throttling the only option
> > for blk-mq. It's time to have a scalable iocontroller supporting both
> > bandwidth/weight based control and working with blk-mq.
> >
> > blk-throttling is a good candidate, it works for both blk-mq and legacy queue.
> > It has a global lock which is scaring for scalability, but it's not terrible in
> > practice. In my test, the NVMe IOPS can reach 1M/s and I have all CPU run IO. Enabling
> > blk-throttle has around 2~3% IOPS and 10% cpu utilization impact. I'd expect
> > this isn't a big problem for today's workload. This patchset then try to make a
> > unified iocontroller. I'm leveraging blk-throttling.
>
> Have you tried with some level, say 5, of nesting? IIRC, how it
> implements hierarchical control is rather braindead (and yeah I'm
> responsible for the damage).
Not yet. I agree nesting increases the locking time, but my test is
already an extreme case: I had 32 threads across 2 nodes running IO at 1M
IOPS. I don't think a real workload will act like this. The locking issue
definitely should be revisited in the future though.
> > The idea is pretty simple. If we know disk total bandwidth, we can calculate
> > cgroup bandwidth according to its weight. blk-throttling can use the calculated
> > bandwidth to throttle cgroup. Disk total bandwidth changes dramatically per IO
> > pattern. Long history is meaningless. The simple algorithm in patch 1 works
> > pretty well when IO pattern changes.
>
> So, that part is fine but I don't think it makes sense to make weight
> based control either bandwidth or iops based. The fundamental problem
> is that it's a false choice. It's like asking someone who wants a car
> to choose between accelerator and brake. It's a choice without a good
> answer. Both are wrong. Also note that there's an inherent
> difference from the currently implemented absolute limits. Absolute
> limits can be combined. Weights based on different metrics can't be.
>
> Even with modern SSDs, both iops and bandwidth play major roles in
> deciding how costly each IO is and I'm fairly confident that this is
> fundamental enough to be the case for quite a while. I *think* the
> cost model can be approximated from measurements. Devices are
> becoming more and more predictable in their behaviors after all. For
> weight based distribution, the unit of distribution should be IO time,
> not bandwidth or iops.
I disagree that IO time is a better choice. Actually I think IO time is
the last thing we should consider for SSDs. Ideally, if we knew each IO's
cost and the total disk capability, things would be easy. Unfortunately
there is no way to know the IO cost. Bandwidth isn't perfect, but it might
be the best we have.
I don't know why you think devices are predictable. SSDs are never
predictable, and I'm not sure how you would measure IO time. A modern SSD
has a large queue depth (blk-mq supports a 10k queue depth), which means
we can send 10k IOs in a few ns. Measuring IO start/finish time doesn't
help either: a 4k IO at IO depth 1 might take 10us, while a 4k IO at IO
depth 100 might take more than 100us. IO time increases with IO depth.
The fundamental problem is that a disk with a large queue depth can buffer
a nearly unbounded number of IO requests. I think IO time only works for a
queue depth 1 disk.
On the other hand, how do you use IO time? If we use an algorithm similar
to this patch set's (e.g., cgroup's IO time slice = cgroup_share /
all_cgroup_share * disk_IO_time_capability), how do you get
disk_IO_time_capability? Or do we use the CFQ algorithm (e.g., switch
cgroups when a cgroup uses up its IO time slice)? But CFQ is known not to
work well with NCQ unless the disk is idled, because a disk with a large
queue depth can dispatch all cgroups' IO immediately. And idling should of
course be avoided for high speed storage.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-21 22:24 ` Shaohua Li
@ 2016-01-21 22:41 ` Tejun Heo
2016-01-22 0:00 ` Shaohua Li
2016-01-22 14:43 ` Vivek Goyal
0 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2016-01-21 22:41 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
Hello, Shaohua.
On Thu, Jan 21, 2016 at 02:24:51PM -0800, Shaohua Li wrote:
> > Have you tried with some level, say 5, of nesting? IIRC, how it
> > implements hierarchical control is rather braindead (and yeah I'm
> > responsible for the damage).
>
> Not yet. I agree nesting increases the locking time, but my test is
> already an extreme case: I had 32 threads across 2 nodes running IO at 1M
> IOPS. I don't think a real workload will act like this. The locking issue
> definitely should be revisited in the future though.
The thing is that most of the possible contention can be removed by
implementing a per-cpu cache, which shouldn't be too difficult. 10%
extra cost on current gen hardware is already pretty high.
> I disagree that IO time is a better choice. Actually I think IO time is
If IO time isn't the right term, let's call it IO cost. Whatever the
term, the actual fraction of cost that each IO is incurring.
> the last thing we should consider for SSDs. Ideally, if we knew each IO's
> cost and the total disk capability, things would be easy. Unfortunately
> there is no way to know the IO cost. Bandwidth isn't perfect, but it might
> be the best we have.
>
> I don't know why you think devices are predictable. SSDs are never
> predictable, and I'm not sure how you would measure IO time. A modern SSD
> has a large queue depth (blk-mq supports a 10k queue depth), which means
> we can send 10k IOs in a few ns. Measuring IO start/finish time doesn't
> help either: a 4k IO at IO depth 1 might take 10us, while a 4k IO at IO
> depth 100 might take more than 100us. IO time increases with IO depth.
> The fundamental problem is that a disk with a large queue depth can buffer
> a nearly unbounded number of IO requests. I think IO time only works for a
> queue depth 1 disk.
They're way more predictable than rotational devices when measured
over a period. I don't think we'll be able to measure anything
meaningful at individual command level but aggregate numbers should be
fairly stable. A simple approximation of IO cost such as fixed cost
per IO + cost proportional to IO size would do a far better job than
just depending on bandwidth or iops and that requires approximating
two variables over time. I'm not sure how easy / feasible that
actually would be tho.
> On the other hand, how do you use IO time? If we use an algorithm similar
> to this patch set's (e.g., cgroup's IO time slice = cgroup_share /
> all_cgroup_share * disk_IO_time_capability), how do you get
> disk_IO_time_capability? Or do we use the CFQ algorithm (e.g., switch
> cgroups when a cgroup uses up its IO time slice)? But CFQ is known not to
> work well with NCQ unless the disk is idled, because a disk with a large
> queue depth can dispatch all cgroups' IO immediately. And idling should of
> course be avoided for high speed storage.
I wasn't talking about time slicing as in CFQ but rather approximating
the cost of each IO. I don't think it makes sense to implement
bandwidth based weight control when the cost of IOs can vary
significantly depending on IO direction and size. The approximation
doesn't have to be perfect but we should be able to land somewhere near
the ballpark.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-21 22:41 ` Tejun Heo
@ 2016-01-22 0:00 ` Shaohua Li
2016-01-22 14:48 ` Tejun Heo
2016-01-22 14:43 ` Vivek Goyal
1 sibling, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2016-01-22 0:00 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
Hi,
On Thu, Jan 21, 2016 at 05:41:57PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Thu, Jan 21, 2016 at 02:24:51PM -0800, Shaohua Li wrote:
> > > Have you tried with some level, say 5, of nesting? IIRC, how it
> > > implements hierarchical control is rather braindead (and yeah I'm
> > > responsible for the damage).
> >
> > Not yet. I agree nesting increases the locking time, but my test is
> > already an extreme case: I had 32 threads across 2 nodes running IO at 1M
> > IOPS. I don't think a real workload will act like this. The locking issue
> > definitely should be revisited in the future though.
>
> The thing is that most of the possible contention can be removed by
> implementing a per-cpu cache, which shouldn't be too difficult. 10%
> extra cost on current gen hardware is already pretty high.
I did think about this. A per-cpu cache does sound straightforward, but
it could severely impact fairness. For example, say we give each cpu a
1MB budget; if a cgroup doesn't use up the 1MB budget, we don't take the
lock. But if we have 128 CPUs, the cgroup can use up to 128 * 1MB of
extra budget, which badly breaks fairness. I have no idea how this can be
fixed.
> > I disagree that IO time is a better choice. Actually I think IO time is
>
> If IO time isn't the right term, let's call it IO cost. Whatever the
> term, the actual fraction of cost that each IO is incurring.
>
> > the last thing we should consider for SSDs. Ideally, if we knew each IO's
> > cost and the total disk capability, things would be easy. Unfortunately
> > there is no way to know the IO cost. Bandwidth isn't perfect, but it might
> > be the best we have.
> >
> > I don't know why you think devices are predictable. SSDs are never
> > predictable, and I'm not sure how you would measure IO time. A modern SSD
> > has a large queue depth (blk-mq supports a 10k queue depth), which means
> > we can send 10k IOs in a few ns. Measuring IO start/finish time doesn't
> > help either: a 4k IO at IO depth 1 might take 10us, while a 4k IO at IO
> > depth 100 might take more than 100us. IO time increases with IO depth.
> > The fundamental problem is that a disk with a large queue depth can buffer
> > a nearly unbounded number of IO requests. I think IO time only works for a
> > queue depth 1 disk.
>
> They're way more predictable than rotational devices when measured
> over a period. I don't think we'll be able to measure anything
> meaningful at individual command level but aggregate numbers should be
> fairly stable. A simple approximation of IO cost such as fixed cost
> per IO + cost proportional to IO size would do a far better job than
> just depending on bandwidth or iops and that requires approximating
> two variables over time. I'm not sure how easy / feasible that
> actually would be tho.
It still sounds like IO time, otherwise I can't imagine how we can measure
the cost. If we use some sort of aggregate number, it's like a variation
of bandwidth, e.g. cost = bandwidth/ios.
I understand you probably want something like: get the disk's total
resources, predict the resource cost of each IO, and then use that info
to arbitrate among cgroups. I don't know how that's possible. A disk
which has used up all its resources can still accept newly queued IO.
Maybe someday a fancy device can export that info.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-21 22:41 ` Tejun Heo
2016-01-22 0:00 ` Shaohua Li
@ 2016-01-22 14:43 ` Vivek Goyal
1 sibling, 0 replies; 30+ messages in thread
From: Vivek Goyal @ 2016-01-22 14:43 UTC (permalink / raw)
To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, axboe, jmoyer, Kernel-team
On Thu, Jan 21, 2016 at 05:41:57PM -0500, Tejun Heo wrote:
[..]
> A simple approximation of IO cost such as fixed cost
> per IO + cost proportional to IO size would do a far better job than
> just depending on bandwidth or iops and that requires approximating
> two variables over time. I'm not sure how easy / feasible that
> actually would be tho.
Hi Tejun,
"A fixed cost per IO sounds" like iops and "cost proportional to IO size"
sounds like bandwidth. I am wondering can we dynamically control both
bps and iops rate of cgroup based on cgroup weight and average bw/iops of
device queue. That way a cgroup can not get unfair share of disk neither
by throwing lots of small IOs, nor by sending down a small number of large
IOs.
Will that be good enough.
Thanks
Vivek
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 0:00 ` Shaohua Li
@ 2016-01-22 14:48 ` Tejun Heo
2016-01-22 15:52 ` Vivek Goyal
2016-01-22 17:57 ` Shaohua Li
0 siblings, 2 replies; 30+ messages in thread
From: Tejun Heo @ 2016-01-22 14:48 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
Hello, Shaohua.
On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > The thing is that most of the possible contention can be removed by
> > implementing a per-cpu cache, which shouldn't be too difficult. 10%
> > extra cost on current gen hardware is already pretty high.
>
> I did think about this. A per-cpu cache does sound straightforward, but
> it could severely impact fairness. For example, say we give each cpu a
> 1MB budget; if a cgroup doesn't use up the 1MB budget, we don't take the
> lock. But if we have 128 CPUs, the cgroup can use up to 128 * 1MB of
> extra budget, which badly breaks fairness. I have no idea how this can be
> fixed.
Let's say per-cgroup buffer budget B is calculated as, say, 100ms
worth of IO cost (or bandwidth or iops) available to the cgroup. In
practice, this may have to be adjusted down depending on the number of
cgroups performing active IOs. For a given cgroup, B can be
distributed among the CPUs that are actively issuing IOs in that
cgroup. It will degenerate to round robin of small budget if there
are too many active for the budget available but for most cases this
will cut down most of cross-CPU traffic.
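To put rough numbers on it (all invented): if the device is doing 400MB/s,
B would be 100ms worth, i.e. 40MB of budget; with four CPUs actively
issuing IOs in that cgroup, each CPU could consume a 10MB slice locklessly
before going back to the shared pool.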
> > They're way more predictable than rotational devices when measured
> > over a period. I don't think we'll be able to measure anything
> > meaningful at individual command level but aggregate numbers should be
> > fairly stable. A simple approximation of IO cost such as fixed cost
> > per IO + cost proportional to IO size would do a far better job than
> > just depending on bandwidth or iops and that requires approximating
> > two variables over time. I'm not sure how easy / feasible that
> > actually would be tho.
>
> It still sounds like IO time, otherwise I can't imagine how we can measure
> the cost. If we use some sort of aggregate number, it's like a variation
> of bandwidth, e.g. cost = bandwidth/ios.
I think the cost of an IO can be approximated by a fixed per-IO cost +
cost proportional to the size, so
cost = F + R * size
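As a rough illustration of fitting the two variables from aggregate
measurements (a userspace sketch with invented numbers, not kernel code,
which would need fixed-point math): if the device is kept saturated in two
measurement windows with different average IO sizes, each window gives one
equation F + R * avg_size = 1 / iops, and the pair can be solved for F and
R.

#include <stdio.h>

struct sample {
	double iops;		/* observed IOs per second */
	double avg_size;	/* observed average IO size, bytes */
};

/*
 * Each saturated window satisfies iops * (F + R * avg_size) == 1,
 * i.e. F + R * avg_size == 1 / iops. Two windows, two unknowns.
 */
static int solve_cost(struct sample a, struct sample b, double *F, double *R)
{
	double ca = 1.0 / a.iops, cb = 1.0 / b.iops;

	if (a.avg_size == b.avg_size)
		return -1;		/* need two distinct IO sizes */
	*R = (ca - cb) / (a.avg_size - b.avg_size);
	*F = ca - *R * a.avg_size;
	return 0;
}

int main(void)
{
	/* invented numbers: 200k iops at 4k, 50k iops at 64k */
	struct sample s4k = { 200000.0, 4096.0 };
	struct sample s64k = { 50000.0, 65536.0 };
	double F, R;

	if (!solve_cost(s4k, s64k, &F, &R))
		printf("F = %.3e sec/IO, R = %.3e sec/byte\n", F, R);
	return 0;
}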
> I understand you probably want something like: get the disk's total
> resources, predict the resource cost of each IO, and then use that info
> to arbitrate among cgroups. I don't know how that's possible. A disk
> which has used up all its resources can still accept newly queued IO.
> Maybe someday a fancy device can export that info.
I don't know exactly how either; however, I don't want a situation
where we implement something just because it's easy regardless of
whether it's actually useful. We've done that multiple times in
cgroup and they tend to become useless baggage which gets in the way
of proper solutions. Things don't have to be perfect from the
beginning but at least the abstractions and interfaces we expose must
be relevant to the capability that userland wants.
It isn't uncommon for devices to have close to or over an order of
magnitude difference in bandwidth between 4k random and sequential IO
patterns. What the userland wants is proportional distribution of IO
resources. I can't see how lumping together numbers that differ by an
order of magnitude would be able to represent that, or anything, really.
I understand that it is a difficult and nasty problem but we'll just
have to solve it. I'll think more about it too.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 14:48 ` Tejun Heo
@ 2016-01-22 15:52 ` Vivek Goyal
2016-01-22 18:00 ` Shaohua Li
2016-01-22 17:57 ` Shaohua Li
1 sibling, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2016-01-22 15:52 UTC (permalink / raw)
To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, axboe, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> [..]
>
> I think the cost of an IO can be approximated by a fixed per-IO cost +
> cost proportional to the size, so
>
> cost = F + R * size
>
Hi Tejun,
Maybe we can also throw in a cost differentiation for IO direction here.
This still will not take care of cost based on IO pattern, but that's
another level of complexity which can be added later, tracking a cgroup's
IO pattern and bumping up its cost accordingly.
Here are some random thoughts, basically adding some more details to your
idea. I am not sure whether it makes sense or how difficult it would be
to implement.
Assume we ensure fairness in a time interval T and have a total of N
tokens for IO in that interval. When a new interval starts, we distribute
these N tokens to the pending cgroups based on their weight and
proportional share, and keep on distributing N tokens after each time
interval.
We will have to come up with some sort of cost matrix to determine how many
tokens should be charged per IO (cost per IO). And how to adjust that cost
dynamically.
Both N and T will be variable and will have to be adjusted continuously.
For N we could start with some initial number. If we distributed more
tokens than the device can handle in time T, then in the next cycle we
will have to reduce the value of N and distribute fewer tokens. If we
distributed too few tokens and the device is fast and finished in less
time than T, then we can start the next cycle sooner and distribute more
tokens for the next cycle. So based on device throughput in a certain
time interval, the number of tokens issued for the next cycle will vary.
Initially I guess the cost could be fixed also. That is, say, 5 tokens
for each IO plus 1 token for each 4KB of IO size. If we underestimate the
cost of an IO, then N tokens will not be consumed in time T and next time
we will distribute fewer tokens. If we overestimate the cost of an IO,
then N tokens will finish fast and next time we will give more. So the
exact cost of an IO might not be a huge factor.
I thought I'd write this down, irrespective of whether it makes much
sense or not.
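A minimal sketch of that feedback loop (userspace illustration; all
constants and thresholds are invented):

#include <stdio.h>

#define N_MIN 100		/* never starve the system completely */
#define N_MAX 10000000

/*
 * tokens_left: tokens still unconsumed when the interval T expired
 * finished_early: all N tokens were consumed before T elapsed
 */
static long adjust_tokens(long N, long tokens_left, int finished_early)
{
	if (finished_early)
		N += N / 4;		/* device was underestimated: grow N */
	else if (tokens_left > 0)
		N -= tokens_left / 2;	/* overestimated: shrink toward usage */
	if (N < N_MIN)
		N = N_MIN;
	if (N > N_MAX)
		N = N_MAX;
	return N;
}

int main(void)
{
	long N = 1000;

	N = adjust_tokens(N, 0, 1);	/* finished early: N becomes 1250 */
	printf("next N = %ld\n", N);
	N = adjust_tokens(N, 400, 0);	/* 400 unused: N becomes 1050 */
	printf("next N = %ld\n", N);
	return 0;
}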
Thanks
Vivek
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 14:48 ` Tejun Heo
2016-01-22 15:52 ` Vivek Goyal
@ 2016-01-22 17:57 ` Shaohua Li
2016-01-22 18:08 ` Tejun Heo
1 sibling, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2016-01-22 17:57 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> [..]
>
> Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> worth of IO cost (or bandwidth or iops) available to the cgroup. In
> practice, this may have to be adjusted down depending on the number of
> cgroups performing active IOs. For a given cgroup, B can be
> distributed among the CPUs that are actively issuing IOs in that
> cgroup. It will degenerate to round robin of small budget if there
> are too many active for the budget available but for most cases this
> will cut down most of cross-CPU traffic.
The cgroup could be a single thread. It uses B - 1 of cpu0's per-cpu
budget, moves to cpu1 and uses another B - 1, and so on.
> [..]
>
> I think the cost of an IO can be approximated by a fixed per-IO cost +
> cost proportional to the size, so
>
> cost = F + R * size
F could be IOPS, and the real cost becomes R. How do you get R? We can't
simply use R(4k) = 1, R(8k) = 2, .... I tried the idea several years ago:
https://lwn.net/Articles/474164/
The idea is the same. But the reality is we can't get R. I don't want
random math that works for one SSD but not for another.
One possible solution is to benchmark the device at startup and derive
the cost as a function of IO size. That would only work for reads,
though. And how to choose the benchmark is another challenge.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 15:52 ` Vivek Goyal
@ 2016-01-22 18:00 ` Shaohua Li
2016-01-22 19:09 ` Vivek Goyal
0 siblings, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2016-01-22 18:00 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Tejun Heo, linux-kernel, axboe, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> > [..]
> > I think the cost of an IO can be approximated by a fixed per-IO cost +
> > cost proportional to the size, so
> >
> > cost = F + R * size
> >
>
> Hi Tejun,
>
> Maybe we can also throw in a cost differentiation for IO direction here.
> This still will not take care of cost based on IO pattern, but that's
> another level of complexity which can be added later, tracking a cgroup's
> IO pattern and bumping up its cost accordingly.
>
> Here are some random thoughts, basically adding some more details to your
> idea. I am not sure whether it makes sense or how difficult it would be
> to implement.
>
> Assume we ensure fairness in a time interval T and have a total of N
> tokens for IO in that interval. When a new interval starts, we distribute
> these N tokens to the pending cgroups based on their weight and
> proportional share, and keep on distributing N tokens after each time
> interval.
>
> We will have to come up with some sort of cost matrix to determine how many
> tokens should be charged per IO (cost per IO). And how to adjust that cost
> dynamically.
>
> Both N and T will be variable and will have to be adjusted continuously.
> For N we could start with some initial number. If we distributed more
> tokens than the device can handle in time T, then in the next cycle we
> will have to reduce the value of N and distribute fewer tokens. If we
> distributed too few tokens and the device is fast and finished in less
> time than T, then we can start the next cycle sooner and distribute more
> tokens for the next cycle. So based on device throughput in a certain
> time interval, the number of tokens issued for the next cycle will vary.
Note, we don't know if we dispatched too many or too few tokens. A device
with a large queue depth can accept all requests. If the queue depth were
1, things would be easy.
> Initially I guess the cost could be fixed also. That is, say, 5 tokens
> for each IO plus 1 token for each 4KB of IO size. If we underestimate the
> cost of an IO, then N tokens will not be consumed in time T and next time
> we will distribute fewer tokens. If we overestimate the cost of an IO,
> then N tokens will finish fast and next time we will give more. So the
> exact cost of an IO might not be a huge factor.
We still need to know R. Any idea for this?
Thanks,
SHaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 17:57 ` Shaohua Li
@ 2016-01-22 18:08 ` Tejun Heo
2016-01-22 19:11 ` Shaohua Li
0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2016-01-22 18:08 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
Hello, Shaohua.
On Fri, Jan 22, 2016 at 09:57:10AM -0800, Shaohua Li wrote:
> > Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> > worth of IO cost (or bandwidth or iops) available to the cgroup. In
> > practice, this may have to be adjusted down depending on the number of
> > cgroups performing active IOs. For a given cgroup, B can be
> > distributed among the CPUs that are actively issuing IOs in that
> > cgroup. It will degenerate to round robin of small budget if there
> > are too many active for the budget available but for most cases this
> > will cut down most of cross-CPU traffic.
>
> The cgroup could be a single thread. It uses B - 1 of cpu0's per-cpu
> budget, moves to cpu1 and uses another B - 1, and so on.
Sure, just ensure that the total cached amount is bounded by B and expires
if not used over a certain amount of time. The thing is, as long as we
can go through the percpu cache most of the time, it's all fine. We can
spend a lot of processing budget on the corner cases.
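Something along these lines, perhaps (a userspace sketch; the names, slice
sizes and expiry policy are all invented, and the real thing would take
the queue lock in the refill/expire paths):

#include <stdio.h>

#define NR_CPUS		4
#define EXPIRE_TICKS	10	/* reclaim a slice idle for this long */

struct cgroup_budget {
	long total;		/* shared pool, bounded by B; lock-protected */
	long cache[NR_CPUS];	/* per-cpu slices, charged locklessly */
	long last_use[NR_CPUS];	/* tick of last charge, for expiry */
};

/* Pull a slice from the shared pool; the only path needing the lock. */
static int refill(struct cgroup_budget *b, int cpu, long slice)
{
	if (b->total < slice)
		slice = b->total;	/* pool nearly empty: hand out less */
	if (!slice)
		return -1;		/* out of budget until the next period */
	b->total -= slice;
	b->cache[cpu] += slice;
	return 0;
}

/* Charge one IO against the local cache; no lock on the fast path. */
static int charge(struct cgroup_budget *b, int cpu, long cost, long now)
{
	while (b->cache[cpu] < cost)
		if (refill(b, cpu, 4 * cost))
			return -1;	/* throttle until tokens return */
	b->cache[cpu] -= cost;
	b->last_use[cpu] = now;
	return 0;
}

/*
 * Return idle slices to the pool, so a thread hopping between CPUs can
 * never accumulate more than B in total.
 */
static void expire(struct cgroup_budget *b, long now)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (b->cache[cpu] && now - b->last_use[cpu] > EXPIRE_TICKS) {
			b->total += b->cache[cpu];
			b->cache[cpu] = 0;
		}
}

int main(void)
{
	struct cgroup_budget b = { .total = 1000 };	/* B = 1000 tokens */

	charge(&b, 0, 10, 0);	/* cpu0 pulls a 40-token slice, uses 10 */
	expire(&b, 100);	/* idle slice flows back to the pool */
	printf("pool %ld, cpu0 cache %ld\n", b.total, b.cache[0]);
	return 0;
}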
> > cost = F + R * size
>
> F could be IOPS, and the real cost becomes R. How do you get R? We can't
> simply use R(4k) = 1, R(8k) = 2, .... I tried the idea several years ago:
> https://lwn.net/Articles/474164/
> The idea is the same. But the reality is we can't get R. I don't want
> random math that works for one SSD but not for another.
Yeah, it'll have to be adaptive. We can't use fixed values; however,
note that using bandwidth means that we assume F == 0 and R == 1,
which wouldn't be appropriate for most devices.
> One possible solution is to benchmark the device at startup and derive
> the cost as a function of IO size. That would only work for reads,
> though. And how to choose the benchmark is another challenge.
Hmmm... yeah, that can be one option although I think it'd still have
to be adjusted dynamically. Let's think more about it.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 18:00 ` Shaohua Li
@ 2016-01-22 19:09 ` Vivek Goyal
2016-01-22 19:45 ` Shaohua Li
0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2016-01-22 19:09 UTC (permalink / raw)
To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, axboe, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote:
> On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> > > [..]
> > > I think the cost of an IO can be approximated by a fixed per-IO cost +
> > > cost proportional to the size, so
> > >
> > > cost = F + R * size
> > >
> >
> > Hi Tejun,
> >
> > Maybe we can also throw in a cost differentiation for IO direction here.
> > This still will not take care of cost based on IO pattern, but that's
> > another level of complexity which can be added later, tracking a cgroup's
> > IO pattern and bumping up its cost accordingly.
> >
> > Here are some random thoughts, basically adding some more details to your
> > idea. I am not sure whether it makes sense or how difficult it would be
> > to implement.
> >
> > Assume we ensure fairness in a time interval T and have a total of N
> > tokens for IO in that interval. When a new interval starts, we distribute
> > these N tokens to the pending cgroups based on their weight and
> > proportional share, and keep on distributing N tokens after each time
> > interval.
> >
> > We will have to come up with some sort of cost matrix to determine how many
> > tokens should be charged per IO (cost per IO). And how to adjust that cost
> > dynamically.
> >
> > Both N and T will be variable and will have to be adjusted continuously.
> > For N we could start with some initial number. If we distributed more
> > tokens than the device can handle in time T, then in the next cycle we
> > will have to reduce the value of N and distribute fewer tokens. If we
> > distributed too few tokens and the device is fast and finished in less
> > time than T, then we can start the next cycle sooner and distribute more
> > tokens for the next cycle. So based on device throughput in a certain
> > time interval, the number of tokens issued for the next cycle will vary.
>
> Note, we don't know if we dispatched too many or too few tokens. A device
> with a large queue depth can accept all requests. If the queue depth were
> 1, things would be easy.
If the device accepts too many requests then we will keep on increasing
the tokens and cgroups will keep on submitting IOs in proportion to their
weight. Once the queue is full we will hit a wall and start decreasing
the number of tokens. So I guess this should still work.
One problem with deep queue depths, though, is that a heavy writer will
be able to fill up the queue in a very short interval and block small
readers behind it. I guess until and unless devices start doing some
prioritization of IO, this problem will be hard to solve. Driving a
smaller queue depth is not an option as it makes the bandwidth drop.
>
> > Initially I guess the cost could be fixed also. That is, say, 5 tokens
> > for each IO plus 1 token for each 4KB of IO size. If we underestimate the
> > cost of an IO, then N tokens will not be consumed in time T and next time
> > we will distribute fewer tokens. If we overestimate the cost of an IO,
> > then N tokens will finish fast and next time we will give more. So the
> > exact cost of an IO might not be a huge factor.
>
> We still need to know R. Any idea for this?
Hmm..., thinking out loud. Would the following work?
Can we keep track of the average bandwidth and average IOPS of the queue,
and then use those to come up with a per-IO cost and a bandwidth cost?
Say the average queue bandwidth is ABW and the average IOPS is AIOPS.
So in interval T, all cgroups cumulatively can dispatch T * ABW worth of IO.
A cgroup's fractional cost of an IO = IO_size/(T * ABW)
As we are supposed to dispatch N tokens in time T, a cgroup's cost of an
IO in terms of tokens will be
Cgroup_cost_BW = (N * IO_size)/(T * ABW)
Similarly, a cgroup's per-IO cost based on IOPS will be
Cgroup_cost_IOPS = (N * 1)/(T * AIOPS)
So per IO, per cgroup, we could charge the following tokens:
Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IOPS
As we are charging the cgroup twice (once based on bandwidth and once
based on IOPS), maybe we can halve the effective cost:
Effectively_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IOPS)/2
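For concreteness (all numbers invented): with T = 100ms, N = 1000 tokens,
ABW = 400MB/s and AIOPS = 100k, a 4KB IO costs about 0.1 token from the
bandwidth term and 0.1 token from the IOPS term, i.e. roughly 0.1 token
effectively, while a 1MB IO costs about 25 tokens from the bandwidth term
and still 0.1 from the IOPS term, i.e. about 12.5 tokens effectively.
Small IOs end up charged mostly per-IO and large IOs mostly by size.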
Does it make sense?
Thanks
Vivek
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 18:08 ` Tejun Heo
@ 2016-01-22 19:11 ` Shaohua Li
0 siblings, 0 replies; 30+ messages in thread
From: Shaohua Li @ 2016-01-22 19:11 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, axboe, vgoyal, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 01:08:44PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Fri, Jan 22, 2016 at 09:57:10AM -0800, Shaohua Li wrote:
> > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> > > worth of IO cost (or bandwidth or iops) available to the cgroup. In
> > > practice, this may have to be adjusted down depending on the number of
> > > cgroups performing active IOs. For a given cgroup, B can be
> > > distributed among the CPUs that are actively issuing IOs in that
> > > cgroup. It will degenerate to round robin of small budget if there
> > > are too many active for the budget available but for most cases this
> > > will cut down most of cross-CPU traffic.
> >
> > The cgroup could be a single thread. It uses B - 1 of cpu0's per-cpu
> > budget, moves to cpu1 and uses another B - 1, and so on.
>
> Sure, just ensure that the total cached amount is bounded by B and expires
> if not used over a certain amount of time. The thing is, as long as we
> can go through the percpu cache most of the time, it's all fine. We can
> spend a lot of processing budget on the corner cases.
>
> > > cost = F + R * size
> >
> > F could be IOPS, and the real cost becomes R. How do you get R? We can't
> > simply use R(4k) = 1, R(8k) = 2, .... I tried the idea several years ago:
> > https://lwn.net/Articles/474164/
> > The idea is the same. But the reality is we can't get R. I don't want
> > random math that works for one SSD but not for another.
>
> Yeah, it'll have to be adaptive. We can't use fixed values; however,
> note that using bandwidth means that we assume F == 0 and R == 1,
> which wouldn't be appropriate for most devices.
It's true that bandwidth-based control means R == 1, but it is adaptive
in a way: cgroup bandwidth == share * disk_bandwidth, and disk_bandwidth
is adaptive. It might not work well if cgroups have completely different
IO patterns though.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 19:09 ` Vivek Goyal
@ 2016-01-22 19:45 ` Shaohua Li
2016-01-22 20:04 ` Vivek Goyal
0 siblings, 1 reply; 30+ messages in thread
From: Shaohua Li @ 2016-01-22 19:45 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Tejun Heo, linux-kernel, axboe, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 02:09:10PM -0500, Vivek Goyal wrote:
> On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote:
> > On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> > > [..]
> >
> > Note, we don't know if we dispatched too many or too few tokens. A device
> > with a large queue depth can accept all requests. If the queue depth were
> > 1, things would be easy.
>
> If the device accepts too many requests then we will keep on increasing
> the tokens and cgroups will keep on submitting IOs in proportion to their
> weight. Once the queue is full we will hit a wall and start decreasing
> the number of tokens. So I guess this should still work.
The queue will never be full. A typical application drives an IO depth
< 32, while an NVMe SSD queue can have a 64k queue depth for each
hardware queue according to the spec.
> One problem with deep queue depths, though, is that a heavy writer will
> be able to fill up the queue in a very short interval and block small
> readers behind it. I guess until and unless devices start doing some
> prioritization of IO, this problem will be hard to solve. Driving a
> smaller queue depth is not an option as it makes the bandwidth drop.
Yep, this is the problem. The disk accepts new requests even when the
already-pending requests have exhausted its resources.
> > > Initially I guess the cost could be fixed also. That is, say, 5 tokens
> > > for each IO plus 1 token for each 4KB of IO size. If we underestimate the
> > > cost of an IO, then N tokens will not be consumed in time T and next time
> > > we will distribute fewer tokens. If we overestimate the cost of an IO,
> > > then N tokens will finish fast and next time we will give more. So the
> > > exact cost of an IO might not be a huge factor.
> >
> > We still need to know R. Any idea for this?
>
> Hmm..., thinking out loud. Would the following work?
>
> Can we keep track of the average bandwidth and average IOPS of the queue,
> and then use those to come up with a per-IO cost and a bandwidth cost?
>
> Say the average queue bandwidth is ABW and the average IOPS is AIOPS.
>
> So in interval T, all cgroups cumulatively can dispatch T * ABW worth of IO.
>
> A cgroup's fractional cost of an IO = IO_size/(T * ABW)
>
> As we are supposed to dispatch N tokens in time T, a cgroup's cost of an
> IO in terms of tokens will be
>
> Cgroup_cost_BW = (N * IO_size)/(T * ABW)
>
> Similarly, a cgroup's per-IO cost based on IOPS will be
>
> Cgroup_cost_IOPS = (N * 1)/(T * AIOPS)
>
> So per IO, per cgroup, we could charge the following tokens:
>
> Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IOPS
>
> As we are charging the cgroup twice (once based on bandwidth and once
> based on IOPS), maybe we can halve the effective cost:
>
> Effectively_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IOPS)/2
So the cost = Cgroup_cost_BW * A + Cgroup_cost_IOPS * B
bandwidth based: A = 1, B = 0
IOPS based: A = 0, B = 1
The proposal: A = 1/2, B = 1/2
I'm sure people will invent other A/B combinations. It's hard to say
which one is better. Maybe we really should have the simple ones first,
e.g. either bandwidth based or IOPS based, with a knob to choose. That
pretty much shows the powerlessness of the kernel side, but it's
something we can offer.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC 0/3] block: proportional based blk-throttling
2016-01-22 19:45 ` Shaohua Li
@ 2016-01-22 20:04 ` Vivek Goyal
0 siblings, 0 replies; 30+ messages in thread
From: Vivek Goyal @ 2016-01-22 20:04 UTC (permalink / raw)
To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, axboe, jmoyer, Kernel-team
On Fri, Jan 22, 2016 at 11:45:51AM -0800, Shaohua Li wrote:
> On Fri, Jan 22, 2016 at 02:09:10PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote:
> > > On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> > > > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> > > > > Hello, Shaohua.
> > > > >
> > > > > On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > > > > > > The thing is that most of the possible contentions can be removed by
> > > > > > > implementing per-cpu cache which shouldn't be too difficult. 10%
> > > > > > > extra cost on current gen hardware is already pretty high.
> > > > > >
> > > > > > I did think about this. per-cpu cache does sound straightforward, but it
> > > > > > could severely impact fairness. For example, we give each cpu a budget,
> > > > > > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the lock.
> > > > > > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, which
> > > > > > breaks fairness very much. I have no idea how this can be fixed.
> > > > >
> > > > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> > > > > worth of IO cost (or bandwidth or iops) available to the cgroup. In
> > > > > practice, this may have to be adjusted down depending on the number of
> > > > > cgroups performing active IOs. For a given cgroup, B can be
> > > > > distributed among the CPUs that are actively issuing IOs in that
> > > > > cgroup. It will degenerate to round robin of small budget if there
> > > > > are too many active for the budget available but for most cases this
> > > > > will cut down most of cross-CPU traffic.
> > > > >
> > > > > > > They're way more predictable than rotational devices when measured
> > > > > > > over a period. I don't think we'll be able to measure anything
> > > > > > > meaningful at individual command level but aggregate numbers should be
> > > > > > > fairly stable. A simple approximation of IO cost such as fixed cost
> > > > > > > per IO + cost proportional to IO size would do a far better job than
> > > > > > > just depending on bandwidth or iops and that requires approximating
> > > > > > > two variables over time. I'm not sure how easy / feasible that
> > > > > > > actually would be tho.
> > > > > >
> > > > > > It still sounds like IO time, otherwise I can't imagine we can measure
> > > > > > the cost. If we use some sort of aggregate number, it likes a variation
> > > > > > of bandwidth. eg cost = bandwidth/ios.
> > > > >
> > > > > I think cost of an IO can be approxmiated by a fixed per-IO cost +
> > > > > cost proportional to the size, so
> > > > >
> > > > > cost = F + R * size
> > > > >
> > > >
> > > > Hi Tejun,
> > > >
> > > > May be we can throw in a cost differentiation for IO direction also here.
> > > > This still will not take care of cost based on IO pattern, but that's
> > > > another level of complexity which can be added to keep track of IO pattern
> > > > of cgroup and bump up cost accordingly.
> > > >
> > > > Here are some random thoughts basically adding some more details to your idea.
> > > > I am not sure whether it makes sense or not or how difficult it is to
> > > > implement it.
> > > >
> > > > Assume we ensure fairness in a time interval of T and have total of N
> > > > tokens for IO in that time interval T. When a new inteval starts, we
> > > > distribute these N tokens to the pending cgroups based on their weight and
> > > > proportional share. And keep on distributing N tokens after each time
> > > > interval.
> > > >
> > > > We will have to come up with some sort of cost matrix to determine how many
> > > > tokens should be charged per IO (cost per IO). And how to adjust that cost
> > > > dynamically.
> > > >
> > > > Both N and T will be variable and will have to be adjusted continuously.
> > > > For N we could start with some initial number. If we distributed more
> > > > tokens than the device can handle in time T, then in the next cycle we
> > > > will have to reduce the value of N and distribute fewer tokens. If we
> > > > distributed too few tokens and the device is fast and finished them in
> > > > less time than T, then we can start the next cycle sooner and distribute
> > > > more tokens. So based on device throughput in a given time interval, the
> > > > number of tokens issued for the next cycle will vary.
> > >
> > > Note, we don't know whether we dispatched too many or too few tokens. A
> > > device with a large queue depth can accept all requests. If the queue
> > > depth were 1, things would be easy.
> >
> > If the device accepts too many requests then we will keep increasing the
> > tokens and cgroups will keep submitting IOs in proportion to their weight.
> > Once the queue is full, we will hit a wall and start decreasing the number
> > of tokens. So I guess this should still work.
>
> The queue will never be full. A typical application drives an IO depth of
> less than 32, while an NVMe SSD queue can have a 64k queue depth for each
> hardware queue according to the spec.
These are really deep queues. Hmm..., so what's the solution? Should we
drive shallower queue depths in an attempt to achieve fairness? I guess
that will not fly and will backfire.
If we can't fill up the queue, then yes, we will keep increasing the
number of tokens, every cgroup will get to submit whatever requests it
has, and we will effectively get no service differentiation.
That's the age-old problem we have. Trying to achieve fairness without
compromising bandwidth only works in limited use cases. And queue depths
like 64K just make it worse.
In that case either we drive smaller queue depths (possibly compromising
on throughput), or the service differentiation mechanism needs to be in
hardware and software just tags each IO with a relative priority. I can't
think of what other options there are.
> > One problem with deep queue depths, though, will be that a heavy writer
> > can fill up the queue in a very short interval and block small readers
> > behind it. I guess until and unless devices start doing some
> > prioritization of IO, this problem will be hard to solve. Driving a
> > smaller queue depth is not an option as it makes the bandwidth drop.
>
> yep, this is the problem. The disk accepts new requests even when the
> pending requests have already exhausted its resources.
> > > > Initially I guess the cost could be fixed, too. That is, say, 5 tokens
> > > > for each IO plus 1 token for each 4KB of IO size. If we underestimate
> > > > the cost of an IO, then the N tokens will not be consumed in time T and
> > > > next time we will distribute fewer tokens. If we overestimate the cost,
> > > > then the N tokens will run out fast and next time we will give more. So
> > > > the exact cost of an IO might not be a huge factor.
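Tying that back to the IO-direction idea from earlier, the starting cost
matrix could be as simple as the sketch below (all numbers are
placeholders; as argued above, the feedback on N corrects for bad guesses):

#include <stdint.h>

enum io_dir { IO_READ, IO_WRITE };

/* Illustrative starting costs: fixed tokens per IO plus tokens per 4KB,
 * optionally differentiated by IO direction. */
static const struct { uint64_t fixed, per_4k; } io_cost_tab[] = {
	[IO_READ]  = { .fixed = 5, .per_4k = 1 },
	[IO_WRITE] = { .fixed = 5, .per_4k = 1 },	/* bump if writes cost more */
};

static uint64_t io_charge(enum io_dir dir, uint64_t bytes)
{
	return io_cost_tab[dir].fixed + io_cost_tab[dir].per_4k * (bytes >> 12);
}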
> > >
> > > we still need to know R. Any idea for this?
> >
> > Hmm..., thinking out loud. Will the following work?
> >
> > Can we keep track of the average bandwidth and average iops of the queue,
> > and then use those to come up with a per-IO cost and a bandwidth cost?
> >
> > Say average queue bandwidth is ABW and average IOPS is AIOPS.
> >
> > So in interval T, all cgroups cumulatively can dispatch T * ABW worth of IO.
> >
> > A cgroup's fractional cost of an IO = IO_size / (T * ABW)
> >
> > As we are supposed to dispatch N tokens in time T, a cgroup's cost of an
> > IO in terms of tokens will be
> >
> > Cgroup_cost_BW = (N * IO_size) / (T * ABW)
> >
> > Similarly, a cgroup's per-IO cost based on IOPS will be
> >
> > Cgroup_cost_IOPS = (N * 1) / (T * AIOPS)
> >
> > So per IO, per cgroup, we could charge the following tokens:
> >
> > Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IOPS
> >
> > As we are charging the cgroup twice (once based on bandwidth and once
> > based on IOPS), maybe we can halve the effective cost:
> >
> > Effectively_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IOPS) / 2
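In code form, that charging would be something like the sketch below (the
bare integer divisions truncate badly, so a real version would need
fixed-point scaling first; that caveat is mine):

#include <stdint.h>

/*
 * Charged_tokens per the formulas above.  ABW in bytes/sec, AIOPS in
 * IOs/sec, T in seconds, io_size in bytes.
 */
static uint64_t charged_tokens(uint64_t N, uint64_t T,
			       uint64_t ABW, uint64_t AIOPS,
			       uint64_t io_size)
{
	uint64_t cost_bw   = N * io_size / (T * ABW);	/* bandwidth share */
	uint64_t cost_iops = N / (T * AIOPS);		/* iops share */

	return (cost_bw + cost_iops) / 2;	/* average of the two */
}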
>
> So the cost = Cgroup_cost_BW * A + Cgroup_cost_IOPS * B
>
> bandwidth based: A = 1, B = 0
> IOPS based: A = 0, B = 1
> The proposal: A = 1/2, B = 1/2
>
> I'm sure people will invent other A/B combinations. It's hard to say
> which one is better. Maybe we really should have the simple ones first,
> e.g. either bandwidth based or IOPS based, and have a knob to choose.
> That pretty much shows the powerlessness of the kernel side, but it's
> something we can offer.
Actually I kind of like adding the iops and bw costs and dividing by two,
instead of asking the user to choose between the two. As Tejun said, there
are no good answers, so that is the wrong question to ask.
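For what it's worth, the halved sum is just the A = B = 1/2 point of the
generalized form above; a sketch with fixed-point weights out of 256 (the
scaling is my own choice, purely illustrative):

#include <stdint.h>

/* cost = Cgroup_cost_BW * A + Cgroup_cost_IOPS * B, with A and B as
 * fixed-point weights out of 256.  A = B = 128 gives the halved sum;
 * A = 256, B = 0 is pure bandwidth; A = 0, B = 256 is pure IOPS. */
static uint64_t blended_cost(uint64_t cost_bw, uint64_t cost_iops,
			     uint32_t A, uint32_t B)
{
	return (cost_bw * A + cost_iops * B) >> 8;
}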
Thanks
Vivek
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2016-01-22 20:04 UTC]
Thread overview: 30+ messages
2016-01-20 17:49 [RFC 0/3] block: proportional based blk-throttling Shaohua Li
2016-01-20 17:49 ` [RFC 1/3] block: estimate disk bandwidth Shaohua Li
2016-01-20 17:49 ` [RFC 2/3] blk-throttling: weight based throttling Shaohua Li
2016-01-21 20:33 ` Vivek Goyal
2016-01-21 21:00 ` Shaohua Li
2016-01-20 17:49 ` [RFC 3/3] blk-throttling: detect inactive cgroup Shaohua Li
2016-01-21 20:44 ` Vivek Goyal
2016-01-21 21:05 ` Shaohua Li
2016-01-21 21:09 ` Vivek Goyal
2016-01-20 19:05 ` [RFC 0/3] block: proportional based blk-throttling Vivek Goyal
2016-01-20 19:34 ` Shaohua Li
2016-01-20 19:40 ` Vivek Goyal
2016-01-20 19:43 ` Shaohua Li
2016-01-20 19:54 ` Vivek Goyal
2016-01-20 21:11 ` Vivek Goyal
2016-01-20 21:34 ` Shaohua Li
2016-01-21 21:10 ` Tejun Heo
2016-01-21 22:24 ` Shaohua Li
2016-01-21 22:41 ` Tejun Heo
2016-01-22 0:00 ` Shaohua Li
2016-01-22 14:48 ` Tejun Heo
2016-01-22 15:52 ` Vivek Goyal
2016-01-22 18:00 ` Shaohua Li
2016-01-22 19:09 ` Vivek Goyal
2016-01-22 19:45 ` Shaohua Li
2016-01-22 20:04 ` Vivek Goyal
2016-01-22 17:57 ` Shaohua Li
2016-01-22 18:08 ` Tejun Heo
2016-01-22 19:11 ` Shaohua Li
2016-01-22 14:43 ` Vivek Goyal