[PATCH 1/2] blk-throttle: simplify logic by token bucket algorithm
From: Hong Zhiguo @ 2014-04-15 7:18 UTC
To: cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
tj-DgEjT+Ai2ygdnm+yROfE0A
Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, Hong Zhiguo
From: Hong Zhiguo <zhiguohong-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
The token bucket algorithm (http://en.wikipedia.org/wiki/Token_bucket)
is very simple and widely used for rate limiting and shaping.
It is well suited to blk-throttle and naturally works for
hierarchical cgroups. In this patch, the original time
slicing/trimming/extending logic is replaced by a simple token
bucket algorithm.
The advantages are simplicity, reliability and less fluctuation, and
it is a base for future features such as HTB scheduling.
More than 200 lines of time-slicing code in blk-throttle.c are
replaced with about 50 lines of token bucket code.
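For reference, the idea can be sketched as plain userspace C,
independent of the kernel code below (the names, the nanosecond
clock and the bucket_dispatch() helper are illustrative, not part of
the patch):

/* A minimal token-bucket sketch (illustrative only). */
#include <stdbool.h>
#include <stdint.h>

struct bucket {
	uint64_t rate;		/* refill rate, bytes per second */
	uint64_t burst;		/* bucket depth, bytes */
	uint64_t tokens;	/* bytes currently available */
	uint64_t last_ns;	/* time of the last refill, ns */
};

/* Add the tokens generated since the last check-point. */
static void bucket_update(struct bucket *b, uint64_t now_ns)
{
	b->tokens += b->rate * (now_ns - b->last_ns) / 1000000000ULL;
	b->last_ns = now_ns;
}

/* Dispatch size bytes if possible; charge and cap the bucket. */
static bool bucket_dispatch(struct bucket *b, uint64_t size, uint64_t now_ns)
{
	bucket_update(b, now_ns);
	if (b->tokens < size)
		return false;	/* not enough tokens, caller must wait */
	b->tokens -= size;
	if (b->tokens > b->burst)
		b->tokens = b->burst;
	return true;
}

In the patch below, tg_update_token() plays the role of
bucket_update(), tg_with_in_bps_limit()/tg_with_in_iops_limit() do
the availability check, and throtl_charge_bio() charges and caps the
bucket.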
I've tested this patch with fio using rw=randread and rw=randrw. It
behaves almost the same as the original time-slicing implementation,
with less fluctuation. It has also been tested on a hierarchical setup.
Signed-off-by: Hong Zhiguo <zhiguohong-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
Tested-by: Hong Zhiguo <zhiguohong-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
---
block/blk-throttle.c | 283 +++++++++------------------------------------------
1 file changed, 48 insertions(+), 235 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 033745c..6274734 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2,6 +2,8 @@
* Interface for controlling IO bandwidth on a request queue
*
* Copyright (C) 2010 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * Simplify the logic with token bucket, Hong Zhiguo <zhiguohong-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
*/
#include <linux/module.h>
@@ -18,9 +20,6 @@ static int throtl_grp_quantum = 8;
/* Total max dispatch from all groups in one round */
static int throtl_quantum = 32;
-/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10; /* 100 ms */
-
static struct blkcg_policy blkcg_policy_throtl;
/* A workqueue to queue throttle related work */
@@ -91,6 +90,10 @@ struct tg_stats_cpu {
struct blkg_rwstat serviced;
};
+/* Depth of iops token bucket */
+#define THROTL_BURST_IO 8
+/* Depth of bps token bucket */
+#define THROTL_BURST_BYTES (4096 * 8)
struct throtl_grp {
/* must be the first member */
struct blkg_policy_data pd;
@@ -133,14 +136,13 @@ struct throtl_grp {
/* IOPS limits */
unsigned int iops[2];
- /* Number of bytes disptached in current slice */
- uint64_t bytes_disp[2];
- /* Number of bio's dispatched in current slice */
- unsigned int io_disp[2];
+ /* Token for throttling by bps */
+ uint64_t bytes_token[2];
+ /* Token for throttling by iops */
+ unsigned int io_token[2];
- /* When did we start a new slice */
- unsigned long slice_start[2];
- unsigned long slice_end[2];
+ /* Time check-point */
+ unsigned long t_c[2];
/* Per cpu stats pointer */
struct tg_stats_cpu __percpu *stats_cpu;
@@ -690,171 +692,39 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
return false;
}
-static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
- bool rw, unsigned long start)
+static void tg_update_token(struct throtl_grp *tg, bool rw)
{
- tg->bytes_disp[rw] = 0;
- tg->io_disp[rw] = 0;
-
- /*
- * Previous slice has expired. We must have trimmed it after last
- * bio dispatch. That means since start of last slice, we never used
- * that bandwidth. Do try to make use of that bandwidth while giving
- * credit.
- */
- if (time_after_eq(start, tg->slice_start[rw]))
- tg->slice_start[rw] = start;
+ unsigned long tdiff = jiffies - tg->t_c[rw];
+ u64 token;
- tg->slice_end[rw] = jiffies + throtl_slice;
- throtl_log(&tg->service_queue,
- "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
- rw == READ ? 'R' : 'W', tg->slice_start[rw],
- tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
-{
- tg->bytes_disp[rw] = 0;
- tg->io_disp[rw] = 0;
- tg->slice_start[rw] = jiffies;
- tg->slice_end[rw] = jiffies + throtl_slice;
- throtl_log(&tg->service_queue,
- "[%c] new slice start=%lu end=%lu jiffies=%lu",
- rw == READ ? 'R' : 'W', tg->slice_start[rw],
- tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
- unsigned long jiffy_end)
-{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-}
-
-static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
- unsigned long jiffy_end)
-{
- tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
- throtl_log(&tg->service_queue,
- "[%c] extend slice start=%lu end=%lu jiffies=%lu",
- rw == READ ? 'R' : 'W', tg->slice_start[rw],
- tg->slice_end[rw], jiffies);
-}
-
-/* Determine if previously allocated or extended slice is complete or not */
-static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
-{
- if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
- return 0;
-
- return 1;
-}
-
-/* Trim the used slices and adjust slice start accordingly */
-static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
-{
- unsigned long nr_slices, time_elapsed, io_trim;
- u64 bytes_trim, tmp;
-
- BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
-
- /*
- * If bps are unlimited (-1), then time slice don't get
- * renewed. Don't try to trim the slice if slice is used. A new
- * slice will start when appropriate.
- */
- if (throtl_slice_used(tg, rw))
- return;
-
- /*
- * A bio has been dispatched. Also adjust slice_end. It might happen
- * that initially cgroup limit was very low resulting in high
- * slice_end, but later limit was bumped up and bio was dispached
- * sooner, then we need to reduce slice_end. A high bogus slice_end
- * is bad because it does not allow new slice to start.
- */
-
- throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
-
- time_elapsed = jiffies - tg->slice_start[rw];
-
- nr_slices = time_elapsed / throtl_slice;
-
- if (!nr_slices)
+ if (!tdiff)
return;
- tmp = tg->bps[rw] * throtl_slice * nr_slices;
- do_div(tmp, HZ);
- bytes_trim = tmp;
-
- io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
-
- if (!bytes_trim && !io_trim)
- return;
-
- if (tg->bytes_disp[rw] >= bytes_trim)
- tg->bytes_disp[rw] -= bytes_trim;
- else
- tg->bytes_disp[rw] = 0;
- if (tg->io_disp[rw] >= io_trim)
- tg->io_disp[rw] -= io_trim;
- else
- tg->io_disp[rw] = 0;
+ token = (u64)tg->iops[rw] * tdiff;
+ do_div(token, HZ);
+ tg->io_token[rw] += token;
- tg->slice_start[rw] += nr_slices * throtl_slice;
+ token = (u64)tg->bps[rw] * tdiff;
+ do_div(token, HZ);
+ tg->bytes_token[rw] += token;
- throtl_log(&tg->service_queue,
- "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
- rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
- tg->slice_start[rw], tg->slice_end[rw], jiffies);
+ tg->t_c[rw] = jiffies;
}
static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
unsigned long *wait)
{
bool rw = bio_data_dir(bio);
- unsigned int io_allowed;
- unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
- u64 tmp;
-
- jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
-
- /* Slice has just started. Consider one slice interval */
- if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
-
- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
- /*
- * jiffy_elapsed_rnd should not be a big value as minimum iops can be
- * 1 then at max jiffy elapsed should be equivalent of 1 second as we
- * will allow dispatch after 1 second and after that slice should
- * have been trimmed.
- */
-
- tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
- do_div(tmp, HZ);
-
- if (tmp > UINT_MAX)
- io_allowed = UINT_MAX;
- else
- io_allowed = tmp;
-
- if (tg->io_disp[rw] + 1 <= io_allowed) {
+ if (tg->io_token[rw]) {
if (wait)
*wait = 0;
return 1;
}
/* Calc approx time to dispatch */
- jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
-
- if (jiffy_wait > jiffy_elapsed)
- jiffy_wait = jiffy_wait - jiffy_elapsed;
- else
- jiffy_wait = 1;
-
if (wait)
- *wait = jiffy_wait;
+ *wait = HZ / tg->iops[rw] ?: 1;
return 0;
}
@@ -862,41 +732,21 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
unsigned long *wait)
{
bool rw = bio_data_dir(bio);
- u64 bytes_allowed, extra_bytes, tmp;
- unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-
- jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
+ u64 extra_bytes;
+ unsigned long jiffy_wait;
- /* Slice has just started. Consider one slice interval */
- if (!jiffy_elapsed)
- jiffy_elapsed_rnd = throtl_slice;
-
- jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
-
- tmp = tg->bps[rw] * jiffy_elapsed_rnd;
- do_div(tmp, HZ);
- bytes_allowed = tmp;
-
- if (tg->bytes_disp[rw] + bio->bi_iter.bi_size <= bytes_allowed) {
+ if (bio->bi_iter.bi_size <= tg->bytes_token[rw]) {
if (wait)
*wait = 0;
return 1;
}
/* Calc approx time to dispatch */
- extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
- jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
-
- if (!jiffy_wait)
- jiffy_wait = 1;
-
- /*
- * This wait time is without taking into consideration the rounding
- * up we did. Add that time also.
- */
- jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
- if (wait)
- *wait = jiffy_wait;
+ if (wait) {
+ extra_bytes = bio->bi_iter.bi_size - tg->bytes_token[rw];
+ jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+ *wait = jiffy_wait ?: 1;
+ }
return 0;
}
@@ -926,17 +776,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
return 1;
}
- /*
- * If previous slice expired, start a new one otherwise renew/extend
- * existing slice to make sure it is at least throtl_slice interval
- * long since now.
- */
- if (throtl_slice_used(tg, rw))
- throtl_start_new_slice(tg, rw);
- else {
- if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
- throtl_extend_slice(tg, rw, jiffies + throtl_slice);
- }
+ tg_update_token(tg, rw);
if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
tg_with_in_iops_limit(tg, bio, &iops_wait)) {
@@ -950,9 +790,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
if (wait)
*wait = max_wait;
- if (time_before(tg->slice_end[rw], jiffies + max_wait))
- throtl_extend_slice(tg, rw, jiffies + max_wait);
-
return 0;
}
@@ -986,9 +823,16 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
{
bool rw = bio_data_dir(bio);
- /* Charge the bio to the group */
- tg->bytes_disp[rw] += bio->bi_iter.bi_size;
- tg->io_disp[rw]++;
+ tg_update_token(tg, rw);
+
+ /* Charge the bio to the group and trim the bucket */
+ tg->bytes_token[rw] -= bio->bi_iter.bi_size;
+ if (tg->bytes_token[rw] > THROTL_BURST_BYTES)
+ tg->bytes_token[rw] = THROTL_BURST_BYTES;
+
+ tg->io_token[rw]--;
+ if (tg->io_token[rw] > THROTL_BURST_IO)
+ tg->io_token[rw] = THROTL_BURST_IO;
/*
* REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1065,16 +909,6 @@ static void tg_update_disptime(struct throtl_grp *tg)
tg->flags &= ~THROTL_TG_WAS_EMPTY;
}
-static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
- struct throtl_grp *parent_tg, bool rw)
-{
- if (throtl_slice_used(parent_tg, rw)) {
- throtl_start_new_slice_with_credit(parent_tg, rw,
- child_tg->slice_start[rw]);
- }
-
-}
-
static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
{
struct throtl_service_queue *sq = &tg->service_queue;
@@ -1103,7 +937,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
*/
if (parent_tg) {
throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
- start_parent_slice_with_credit(tg, parent_tg, rw);
} else {
throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
&parent_sq->queued[rw]);
@@ -1111,8 +944,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
tg->td->nr_queued[rw]--;
}
- throtl_trim_slice(tg, rw);
-
if (tg_to_put)
blkg_put(tg_to_blkg(tg_to_put));
}
@@ -1391,13 +1222,8 @@ static int tg_set_conf(struct cgroup_subsys_state *css, struct cftype *cft,
* We're already holding queue_lock and know @tg is valid. Let's
* apply the new config directly.
*
- * Restart the slices for both READ and WRITES. It might happen
- * that a group's limit are dropped suddenly and we don't want to
- * account recently dispatched IO with new low rate.
+ * Update dispatch time.
*/
- throtl_start_new_slice(tg, 0);
- throtl_start_new_slice(tg, 1);
-
if (tg->flags & THROTL_TG_PENDING) {
tg_update_disptime(tg);
throtl_schedule_next_dispatch(sq->parent_sq, true);
@@ -1528,19 +1354,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
throtl_charge_bio(tg, bio);
/*
- * We need to trim slice even when bios are not being queued
- * otherwise it might happen that a bio is not queued for
- * a long time and slice keeps on extending and trim is not
- * called for a long time. Now if limits are reduced suddenly
- * we take into account all the IO dispatched so far at new
- * low rate and * newly queued IO gets a really long dispatch
- * time.
- *
- * So keep on trimming slice even if bio is not queued.
- */
- throtl_trim_slice(tg, rw);
-
- /*
* @bio passed through this layer without being throttled.
* Climb up the ladder. If we''re already at the top, it
* can be executed directly.
@@ -1553,10 +1366,10 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
}
/* out-of-limit, queue to @tg */
- throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+ throtl_log(sq, "[%c] bio. btk=%llu sz=%u bps=%llu iotk=%u iops=%u queued=%d/%d",
rw == READ ? 'R' : 'W',
- tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
- tg->io_disp[rw], tg->iops[rw],
+ tg->bytes_token[rw], bio->bi_iter.bi_size, tg->bps[rw],
+ tg->io_token[rw], tg->iops[rw],
sq->nr_queued[READ], sq->nr_queued[WRITE]);
bio_associate_current(bio);
--
1.8.1.2

[PATCH 2/2] blk-throttle: trim tokens generated for an idle tree
From: Hong Zhiguo @ 2014-04-15 7:18 UTC
To: cgroups-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
tj-DgEjT+Ai2ygdnm+yROfE0A
Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, Hong Zhiguo
From: Hong Zhiguo <zhiguohong-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
Why
====
Pointed out by Vivek: tokens generated during an idle period should
be trimmed. Otherwise a huge bio may be permitted immediately, and
over-limit behaviour may be observed during a short I/O throughput
test.
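For example (illustrative numbers): with bps limited to 1 MB/s and a
group idle for 10 seconds, roughly 10 MB of tokens would accumulate,
so a single 10 MB bio could be dispatched at once instead of being
paced over about 10 seconds. Trimming the idle bucket down to the
burst depth (THROTL_BURST_BYTES, 32 KiB in patch 1) avoids that.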
Vivek also pointed out: we should not over-trim for hierarchical
groups. Suppose a subtree of groups is idle when a big bio comes. The
tokens of the child group are trimmed but are not enough, so the bio
is queued on the child group. After some jiffies the child group has
reserved enough tokens and the bio climbs up. If we trim the parent
group again at this moment, the bio will wait much longer than
expected.
Analysis
========
When the bio is queued on the child group, all its ancestor groups
become non-idle. They should start to reserve tokens for that bio
from this moment, and the tokens they reserved before should be
trimmed at this moment.
How
====
service_queue now has a new member, nr_queued_tree[2], representing
the number of bios waiting on the subtree rooted at this sq.
When a bio is queued on the hierarchy for the first time,
nr_queued_tree of the child group itself and of all its ancestors is
increased. When a bio climbs up, nr_queued_tree of the child group is
decreased.
When nr_queued_tree turns from zero to one, the tokens reserved
before are trimmed. After this transition the group is not trimmed
again (until the subtree goes idle once more), so it keeps reserving
tokens for the bio waiting on its descendant group.
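For example, suppose an idle chain parent <- child receives a big
bio at the child: nr_queued_tree goes from zero to one on both the
child and the parent, each is trimmed once at that moment, and
neither is trimmed again while the bio waits, so both keep
accumulating the tokens the bio needs to climb up.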
Signed-off-by: Hong Zhiguo <zhiguohong-1Nz4purKYjRBDgjK7y7TUQ@public.gmane.org>
---
block/blk-throttle.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6274734..749e583 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -63,6 +63,10 @@ struct throtl_service_queue {
*/
struct list_head queued[2]; /* throtl_qnode [READ/WRITE] */
unsigned int nr_queued[2]; /* number of queued bios */
+ /*
+ * number of bios queued on this subtree
+ */
+ unsigned int nr_queued_tree[2];
/*
* RB tree of active children throtl_grp's, which are sorted by
@@ -708,9 +712,40 @@ static void tg_update_token(struct throtl_grp *tg, bool rw)
do_div(token, HZ);
tg->bytes_token[rw] += token;
+ /*
+ * trim token if the whole subtree is idle.
+ * trim of io_token[rw] is not necessary since trimming it or not
+ * doesn't change the result of tg_may_dispatch
+ */
+ if (!tg->service_queue.nr_queued_tree[rw] &&
+ tg->bytes_token[rw] > THROTL_BURST_BYTES)
+ tg->bytes_token[rw] = THROTL_BURST_BYTES;
+
tg->t_c[rw] = jiffies;
}
+/**
+ * tg_update_nr_queue_tree - update nr_queued_tree of all ancestors
+ * @sq: the target service_queue
+ *
+ * Traverse from @sq to the top throtl_grp, increase their
+ * nr_queued_tree. If one sq has zero nr_queued_tree and now turns to
+ * one, trim the tokens generated before. From now on it will never
+ * get trimmed until the same thing happens next time.
+ */
+static void tg_update_nr_queue_tree(struct throtl_service_queue *sq, bool rw)
+{
+ struct throtl_grp *tg;
+
+ while (sq && (tg = sq_to_tg(sq))) {
+ if (!sq->nr_queued_tree[rw])
+ tg_update_token(tg, rw);
+
+ sq->nr_queued_tree[rw]++;
+ sq = sq->parent_sq;
+ }
+}
+
static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
unsigned long *wait)
{
@@ -925,6 +960,8 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
*/
bio = throtl_pop_queued(&sq->queued[rw], &tg_to_put);
sq->nr_queued[rw]--;
+ /* Here's the only place nr_queued_tree get decreased */
+ sq->nr_queued_tree[rw]--;
throtl_charge_bio(tg, bio);
@@ -1374,6 +1411,10 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
bio_associate_current(bio);
tg->td->nr_queued[rw]++;
+
+ /* Here's the only place nr_queued_tree get increased */
+ tg_update_nr_queue_tree(sq, rw);
+
throtl_add_bio_tg(bio, qn, tg);
throttled = true;
--
1.8.1.2