All of lore.kernel.org

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH V4 09/15] blk-throttle: make bandwidth change smooth
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

When cgroups all reach high limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO but others not, and even some
cgroups could dispatch less IO than their high limit. For example, cg1
high limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is
120M/s for the workload. Their bps could something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach high limit, so they can dispatch more IO later.
Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At
T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO
than its high limit 80M/s, we downgrade the queue from LIMIT_MAX to
LIMIT_HIGH, then all cgroups are throttled to their high limit (T3). cg2
will have bandwidth below its high limit at most time.

The big problem here is we don't know the maximum bandwidth of the
workload, so we can't make smart decision to avoid the situation. This
patch makes cgroup bandwidth change smooth. After disk upgrades from
LIMIT_HIGH to LIMIT_MAX, we don't allow cgroups use all bandwidth upto
their max limit immediately. Their bandwidth limit will be increased
gradually to avoid above situation. So above example will became
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 10/80

In this way cgroups bandwidth will be above their limit in majority
time, this still doesn't fully utilize disk bandwidth, but that's
something we pay for sharing.

Note this doesn't completely avoid cgroup running under its high limit.
The best way to guarantee cgroup doesn't run under its limit is to set
max limit. For example, if we set cg1 max limit to 40, cg2 will never
run under its high limit.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 42 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 32cc6ec..45a28c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -170,6 +170,8 @@ struct throtl_data

 	unsigned long high_upgrade_time;
 	unsigned long high_downgrade_time;
+
+	unsigned int scale;
 };

 static void throtl_pending_timer_fn(unsigned long arg);
@@ -224,12 +226,27 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	uint64_t ret;

 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return -1;
-	ret = tg->bps[rw][tg->td->limit_index];
-	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+	td = tg->td;
+	ret = tg->bps[rw][td->limit_index];
+	if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_HIGH] != -1) {
+		uint64_t increase;
+
+		if (td->scale < 4096 && time_after_eq(jiffies,
+		    td->high_upgrade_time + td->scale * td->throtl_slice)) {
+			unsigned int time = jiffies - td->high_upgrade_time;
+
+			td->scale = time / td->throtl_slice;
+		}
+		increase = (tg->bps[rw][LIMIT_HIGH] >> 1) * td->scale;
+		ret = min(tg->bps[rw][LIMIT_MAX],
+			tg->bps[rw][LIMIT_HIGH] + increase);
+	}
+	if (ret == -1 && td->limit_index == LIMIT_HIGH)
 		return tg->bps[rw][LIMIT_MAX];

 	return ret;
@@ -238,12 +255,28 @@ static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	unsigned int ret;

 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return -1;
-	ret = tg->iops[rw][tg->td->limit_index];
-	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+	td = tg->td;
+	ret = tg->iops[rw][td->limit_index];
+	if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_HIGH] != -1) {
+		uint64_t increase;
+
+		if (td->scale < 4096 && time_after_eq(jiffies,
+		    td->high_upgrade_time + td->scale * td->throtl_slice)) {
+			unsigned int time = jiffies - td->high_upgrade_time;
+
+			td->scale = time / td->throtl_slice;
+		}
+
+		increase = (tg->iops[rw][LIMIT_HIGH] >> 1) * td->scale;
+		ret = min(tg->iops[rw][LIMIT_MAX],
+			tg->iops[rw][LIMIT_HIGH] + (unsigned int)increase);
+	}
+	if (ret == -1 && td->limit_index == LIMIT_HIGH)
 		return tg->iops[rw][LIMIT_MAX];
 	return ret;
 }
@@ -1676,6 +1709,7 @@ static void throtl_upgrade_state(struct throtl_data *td)

 	td->limit_index = LIMIT_MAX;
 	td->high_upgrade_time = jiffies;
+	td->scale = 0;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
-- 
2.9.3

^ permalink raw reply related

* [PATCH V4 08/15] blk-throttle: detect completed idle cgroup
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the
cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
we can't move the state machine to higher level and all cgroups will be
throttled to thier lower limit, so we waste bandwidth. Detecting idle
cgroup is hard. This patch handles a simple case, a cgroup doesn't
dispatch any IO. We ignore such cgroup's limit, so other cgroups can use
the bandwidth.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e85b2b6..32cc6ec 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -144,6 +144,8 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
+	unsigned long last_dispatch_time[2];
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -438,11 +440,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
+	struct throtl_grp *tg = pd_to_tg(pd);
 	/*
 	 * We don't want new groups to escape the limits of its ancestors.
 	 * Update has_rules[] after a new group is brought online.
 	 */
-	tg_update_has_rules(pd_to_tg(pd));
+	tg_update_has_rules(tg);
+	tg->last_dispatch_time[READ] = jiffies;
+	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1611,6 +1616,12 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 	if (write_limit && sq->nr_queued[WRITE] &&
 	    (!read_limit || sq->nr_queued[READ]))
 		return true;
+
+	if (time_after_eq(jiffies,
+	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+	    time_after_eq(jiffies,
+	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		return true;
 	return false;
 }
 
@@ -1691,6 +1702,11 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
+	if (time_after_eq(now, tg->last_dispatch_time[READ] +
+					td->throtl_slice) &&
+	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
+					td->throtl_slice))
+		return false;
 	/*
 	 * If cgroup is below high limit, consider downgrade and throttle other
 	 * cgroups
@@ -1811,6 +1827,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_high_overflow_time[rw] == 0)
 			tg->last_high_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 14/15] blk-throttle: add interface for per-cgroup target latency
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

Add interface for per-cgroup target latency. This latency is for 4k
request.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a05d351..ac4d9ea 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -147,6 +147,8 @@ struct throtl_grp {
 	unsigned long last_check_time;
 
 	int upgrade_check_batch;
+
+	u64 latency_target;
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -463,6 +465,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 			tg->iops[rw][index] = -1;
 		}
 	}
+	/* target latency default 0, eg, always not meet */
 
 	return &tg->pd;
 }
@@ -1572,6 +1575,64 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static u64 tg_prfill_latency_target(struct seq_file *sf,
+	struct blkg_policy_data *pd, int off)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+	const char *dname = blkg_dev_name(pd->blkg);
+
+	if (!dname)
+		return 0;
+	if (tg->latency_target == 0)
+		return 0;
+
+	seq_printf(sf, "%s 4k_lat=%llu\n", dname, tg->latency_target);
+	return 0;
+}
+
+static int tg_print_latency_target(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+		tg_prfill_latency_target, &blkcg_policy_throtl,
+		seq_cft(sf)->private, false);
+	return 0;
+}
+
+static ssize_t tg_set_latency_target(struct kernfs_open_file *of,
+				     char *buf, size_t nbytes, loff_t off)
+{
+	struct blkcg *blkcg = css_to_blkcg(of_css(of));
+	struct blkg_conf_ctx ctx;
+	struct throtl_grp *tg;
+	int ret = -EINVAL;
+	char tok[27];
+	char *p;
+	u64 val;
+
+	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
+	if (ret)
+		return ret;
+
+	tg = blkg_to_tg(ctx.blkg);
+
+	if (sscanf(ctx.body, "%26s", tok) != 1)
+		goto out_finish;
+
+	p = tok;
+	strsep(&p, "=");
+	if (!p || kstrtou64(p, 10, &val))
+		goto out_finish;
+
+	if (strcmp(tok, "4k_lat"))
+		goto out_finish;
+
+	tg->latency_target = val;
+	ret = 0;
+out_finish:
+	blkg_conf_finish(&ctx);
+	return ret ?: nbytes;
+}
+
 static struct cftype throtl_files[] = {
 	{
 		.name = "high",
@@ -1587,6 +1648,12 @@ static struct cftype throtl_files[] = {
 		.write = tg_set_limit,
 		.private = LIMIT_MAX,
 	},
+	{
+		.name = "latency_target",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_latency_target,
+		.write = tg_set_latency_target,
+	},
 	{ }	/* terminate */
 };
 
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 12/15] blk-throttle: ignore idle cgroup limit
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

Last patch introduces a way to detect idle cgroup. We use it to make
upgrade/downgrade decision. And the new algorithm can detect completely
idle cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e403e88..01b494d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -146,8 +146,7 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
-	unsigned long last_dispatch_time[2];
-
+	int upgrade_check_batch;
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -487,8 +486,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	 * Update has_rules[] after a new group is brought online.
 	 */
 	tg_update_has_rules(tg);
-	tg->last_dispatch_time[READ] = jiffies;
-	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1667,9 +1664,8 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 		return true;
 
 	if (time_after_eq(jiffies,
-	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
-	    time_after_eq(jiffies,
-	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		tg_last_high_overflow_time(tg) + tg->td->throtl_slice) &&
+	    throtl_tg_is_idle(tg))
 		return true;
 	return false;
 }
@@ -1718,6 +1714,24 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	return true;
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+	if (tg->td->limit_index != LIMIT_HIGH)
+		return;
+
+	if (!time_after_eq(jiffies,
+	     __tg_last_high_overflow_time(tg) + tg->td->throtl_slice))
+		return;
+
+	tg->upgrade_check_batch++;
+	if (tg->upgrade_check_batch < 16)
+		return;
+	tg->upgrade_check_batch = 0;
+
+	if (throtl_can_upgrade(tg->td, NULL))
+		throtl_upgrade_state(tg->td);
+}
+
 static void throtl_upgrade_state(struct throtl_data *td)
 {
 	struct cgroup_subsys_state *pos_css;
@@ -1752,18 +1766,15 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
-	if (time_after_eq(now, tg->last_dispatch_time[READ] +
-					td->throtl_slice) &&
-	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
-					td->throtl_slice))
-		return false;
 	/*
 	 * If cgroup is below high limit, consider downgrade and throttle other
 	 * cgroups
 	 */
 	if (time_after_eq(now, td->high_upgrade_time + td->throtl_slice) &&
 	    time_after_eq(now, tg_last_high_overflow_time(tg) +
-					td->throtl_slice))
+					td->throtl_slice) &&
+	    (!throtl_tg_is_idle(tg) ||
+	     !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
 		return true;
 	return false;
 }
@@ -1902,10 +1913,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
-		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_high_overflow_time[rw] == 0)
 			tg->last_high_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
+		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 11/15] blk-throttle: add interface to configure think time threshold
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

Add interface to configure the threshold

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-sysfs.c    |  7 +++++++
 block/blk-throttle.c | 25 +++++++++++++++++++++++++
 block/blk.h          |  4 ++++
 3 files changed, 36 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e284e4..f15aeed 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -532,6 +532,12 @@ static struct queue_sysfs_entry throtl_slice_entry = {
 	.show = blk_throtl_slice_show,
 	.store = blk_throtl_slice_store,
 };
+
+static struct queue_sysfs_entry throtl_idle_threshold_entry = {
+	.attr = {.name = "throttling_idle_threshold", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_throtl_idle_threshold_show,
+	.store = blk_throtl_idle_threshold_store,
+};
 #endif
 
 static struct attribute *default_attrs[] = {
@@ -563,6 +569,7 @@ static struct attribute *default_attrs[] = {
 	&queue_dax_entry.attr,
 #ifdef CONFIG_BLK_DEV_THROTTLING
 	&throtl_slice_entry.attr,
+	&throtl_idle_threshold_entry.attr,
 #endif
 	NULL,
 };
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cb5fd85..e403e88 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2139,6 +2139,31 @@ ssize_t blk_throtl_slice_store(struct request_queue *q,
 	return count;
 }
 
+ssize_t blk_throtl_idle_threshold_show(struct request_queue *q, char *page)
+{
+	u64 threshold = q->td->idle_ttime_threshold;
+
+	if (!q->td)
+		return -EINVAL;
+	do_div(threshold, 1000);
+	return sprintf(page, "%lluus\n", threshold);
+}
+
+ssize_t blk_throtl_idle_threshold_store(struct request_queue *q,
+	const char *page, size_t count)
+{
+	unsigned long v;
+
+	if (!q->td)
+		return -EINVAL;
+	if (kstrtoul(page, 10, &v))
+		return -EINVAL;
+	if (v == 0)
+		return -EINVAL;
+	q->td->idle_ttime_threshold = v * 1000;
+	return count;
+}
+
 static int __init throtl_init(void)
 {
 	kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index b433f35..2ebde12 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,6 +292,10 @@ extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_slice_store(struct request_queue *q,
 	const char *page, size_t count);
+extern ssize_t blk_throtl_idle_threshold_show(struct request_queue *q,
+	char *page);
+extern ssize_t blk_throtl_idle_threshold_store(struct request_queue *q,
+	const char *page, size_t count);
 extern void blk_throtl_bio_endio(struct bio *bio);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH 4/4] selinux: Convert isec->lock into a spinlock
From: Paul Moore @ 2016-11-14 22:22 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: Stephen Smalley, Eric Paris, selinux
In-Reply-To: <1478812710-17190-5-git-send-email-agruenba@redhat.com>

On Thu, Nov 10, 2016 at 4:18 PM, Andreas Gruenbacher
<agruenba@redhat.com> wrote:
> Convert isec->lock from a mutex into a spinlock.  Instead of holding the
> lock while sleeping in inode_doinit_with_dentry, set isec->initialized
> to LABEL_PENDING and release the lock.  Then, when the sid has been
> determined, re-acquire the lock.  If isec->initialized is still set to
> LABEL_PENDING, set isec->sid; otherwise, the sid has been set by another
> task (LABEL_INITIALIZED) or invalidated (LABEL_INVALID) in the meantime.
>
> This fixes a deadlock on gfs2 where
>
>  * one task is in inode_doinit_with_dentry -> gfs2_getxattr, holds
>    isec->lock, and tries to acquire the inode's glock, and
>
>  * another task is in do_xmote -> inode_go_inval ->
>    selinux_inode_invalidate_secctx, holds the inode's glock, and tries
>    to acquire isec->lock.
>
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  security/selinux/hooks.c          | 108 ++++++++++++++++++++++++--------------
>  security/selinux/include/objsec.h |   5 +-
>  2 files changed, 72 insertions(+), 41 deletions(-)

We shouldn't need the spinlocks on the socket_post_create() and the
socket_accept() hooks as the callers should still have exclusive
access to the socket/inode at that point.

I didn't check all the callers of the inode_init_security(), but it
looks like the same idea applies.

> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index cf5067e..4af31f1 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -231,7 +231,7 @@ static int inode_alloc_security(struct inode *inode)
>         if (!isec)
>                 return -ENOMEM;
>
> -       mutex_init(&isec->lock);
> +       spin_lock_init(&isec->lock);
>         INIT_LIST_HEAD(&isec->list);
>         isec->inode = inode;
>         isec->sid = SECINITSID_UNLABELED;
> @@ -1381,7 +1381,8 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>  {
>         struct superblock_security_struct *sbsec = NULL;
>         struct inode_security_struct *isec = inode->i_security;
> -       u32 sid;
> +       u32 task_sid, sid = 0;
> +       u16 sclass;
>         struct dentry *dentry;
>  #define INITCONTEXTLEN 255
>         char *context = NULL;
> @@ -1391,7 +1392,7 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>         if (isec->initialized == LABEL_INITIALIZED)
>                 return 0;
>
> -       mutex_lock(&isec->lock);
> +       spin_lock(&isec->lock);
>         if (isec->initialized == LABEL_INITIALIZED)
>                 goto out_unlock;
>
> @@ -1410,12 +1411,18 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                 goto out_unlock;
>         }
>
> +       sclass = isec->sclass;
> +       task_sid = isec->task_sid;
> +       sid = isec->sid;
> +       isec->initialized = LABEL_PENDING;
> +       spin_unlock(&isec->lock);
> +
>         switch (sbsec->behavior) {
>         case SECURITY_FS_USE_NATIVE:
>                 break;
>         case SECURITY_FS_USE_XATTR:
>                 if (!(inode->i_opflags & IOP_XATTR)) {
> -                       isec->sid = sbsec->def_sid;
> +                       sid = sbsec->def_sid;
>                         break;
>                 }
>                 /* Need a dentry, since the xattr API requires one.
> @@ -1437,7 +1444,7 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                          * inode_doinit with a dentry, before these inodes could
>                          * be used again by userspace.
>                          */
> -                       goto out_unlock;
> +                       goto out;
>                 }
>
>                 len = INITCONTEXTLEN;
> @@ -1445,7 +1452,7 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                 if (!context) {
>                         rc = -ENOMEM;
>                         dput(dentry);
> -                       goto out_unlock;
> +                       goto out;
>                 }
>                 context[len] = '\0';
>                 rc = __vfs_getxattr(dentry, inode, XATTR_NAME_SELINUX, context, len);
> @@ -1456,14 +1463,14 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                         rc = __vfs_getxattr(dentry, inode, XATTR_NAME_SELINUX, NULL, 0);
>                         if (rc < 0) {
>                                 dput(dentry);
> -                               goto out_unlock;
> +                               goto out;
>                         }
>                         len = rc;
>                         context = kmalloc(len+1, GFP_NOFS);
>                         if (!context) {
>                                 rc = -ENOMEM;
>                                 dput(dentry);
> -                               goto out_unlock;
> +                               goto out;
>                         }
>                         context[len] = '\0';
>                         rc = __vfs_getxattr(dentry, inode, XATTR_NAME_SELINUX, context, len);
> @@ -1475,7 +1482,7 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                                        "%d for dev=%s ino=%ld\n", __func__,
>                                        -rc, inode->i_sb->s_id, inode->i_ino);
>                                 kfree(context);
> -                               goto out_unlock;
> +                               goto out;
>                         }
>                         /* Map ENODATA to the default file SID */
>                         sid = sbsec->def_sid;
> @@ -1505,28 +1512,25 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                         }
>                 }
>                 kfree(context);
> -               isec->sid = sid;
>                 break;
>         case SECURITY_FS_USE_TASK:
> -               isec->sid = isec->task_sid;
> +               sid = task_sid;
>                 break;
>         case SECURITY_FS_USE_TRANS:
>                 /* Default to the fs SID. */
> -               isec->sid = sbsec->sid;
> +               sid = sbsec->sid;
>
>                 /* Try to obtain a transition SID. */
> -               rc = security_transition_sid(isec->task_sid, sbsec->sid,
> -                                            isec->sclass, NULL, &sid);
> +               rc = security_transition_sid(task_sid, sid, sclass, NULL, &sid);
>                 if (rc)
> -                       goto out_unlock;
> -               isec->sid = sid;
> +                       goto out;
>                 break;
>         case SECURITY_FS_USE_MNTPOINT:
> -               isec->sid = sbsec->mntpoint_sid;
> +               sid = sbsec->mntpoint_sid;
>                 break;
>         default:
>                 /* Default to the fs superblock SID. */
> -               isec->sid = sbsec->sid;
> +               sid = sbsec->sid;
>
>                 if ((sbsec->flags & SE_SBGENFS) && !S_ISLNK(inode->i_mode)) {
>                         /* We must have a dentry to determine the label on
> @@ -1549,21 +1553,29 @@ static int inode_doinit_with_dentry(struct inode *inode, struct dentry *opt_dent
>                          * could be used again by userspace.
>                          */
>                         if (!dentry)
> -                               goto out_unlock;
> -                       rc = selinux_genfs_get_sid(dentry, isec->sclass,
> -                                                  sbsec->flags, &sid);
> +                               goto out;
> +                       rc = selinux_genfs_get_sid(dentry, sclass, sbsec->flags, &sid);
>                         dput(dentry);
>                         if (rc)
> -                               goto out_unlock;
> -                       isec->sid = sid;
> +                               goto out;
>                 }
>                 break;
>         }
>
> -       isec->initialized = LABEL_INITIALIZED;
> +out:
> +       spin_lock(&isec->lock);
> +       if (isec->initialized == LABEL_PENDING) {
> +               if (!sid || rc) {
> +                       isec->initialized = LABEL_INVALID;
> +                       goto out_unlock;
> +               }
> +
> +               isec->initialized = LABEL_INITIALIZED;
> +               isec->sid = sid;
> +       }
>
>  out_unlock:
> -       mutex_unlock(&isec->lock);
> +       spin_unlock(&isec->lock);
>         return rc;
>  }
>
> @@ -2889,9 +2901,11 @@ static int selinux_inode_init_security(struct inode *inode, struct inode *dir,
>         /* Possibly defer initialization to selinux_complete_init. */
>         if (sbsec->flags & SE_SBINITIALIZED) {
>                 struct inode_security_struct *isec = inode->i_security;
> +               spin_lock(&isec->lock);
>                 isec->sclass = inode_mode_to_security_class(inode->i_mode);
>                 isec->sid = newsid;
>                 isec->initialized = LABEL_INITIALIZED;
> +               spin_unlock(&isec->lock);
>         }
>
>         if (!ss_initialized || !(sbsec->flags & SBLABEL_MNT))
> @@ -3194,9 +3208,11 @@ static void selinux_inode_post_setxattr(struct dentry *dentry, const char *name,
>         }
>
>         isec = backing_inode_security(dentry);
> +       spin_lock(&isec->lock);
>         isec->sclass = inode_mode_to_security_class(inode->i_mode);
>         isec->sid = newsid;
>         isec->initialized = LABEL_INITIALIZED;
> +       spin_unlock(&isec->lock);
>
>         return;
>  }
> @@ -3289,9 +3305,11 @@ static int selinux_inode_setsecurity(struct inode *inode, const char *name,
>         if (rc)
>                 return rc;
>
> +       spin_lock(&isec->lock);
>         isec->sclass = inode_mode_to_security_class(inode->i_mode);
>         isec->sid = newsid;
>         isec->initialized = LABEL_INITIALIZED;
> +       spin_unlock(&isec->lock);
>         return 0;
>  }
>
> @@ -3952,9 +3970,11 @@ static void selinux_task_to_inode(struct task_struct *p,
>         struct inode_security_struct *isec = inode->i_security;
>         u32 sid = task_sid(p);
>
> +       spin_lock(&isec->lock);
>         isec->sclass = inode_mode_to_security_class(inode->i_mode);
>         isec->sid = sid;
>         isec->initialized = LABEL_INITIALIZED;
> +       spin_unlock(&isec->lock);
>  }
>
>  /* Returns error only if unable to parse addresses */
> @@ -4273,24 +4293,26 @@ static int selinux_socket_post_create(struct socket *sock, int family,
>         const struct task_security_struct *tsec = current_security();
>         struct inode_security_struct *isec = inode_security_novalidate(SOCK_INODE(sock));
>         struct sk_security_struct *sksec;
> +       u16 sclass = socket_type_to_security_class(family, type, protocol);
> +       u32 sid = SECINITSID_KERNEL;
>         int err = 0;
>
> -       isec->sclass = socket_type_to_security_class(family, type, protocol);
> -
> -       if (kern)
> -               isec->sid = SECINITSID_KERNEL;
> -       else {
> -               err = socket_sockcreate_sid(tsec, isec->sclass, &(isec->sid));
> +       if (!kern) {
> +               err = socket_sockcreate_sid(tsec, sclass, &sid);
>                 if (err)
>                         return err;
>         }
>
> +       spin_lock(&isec->lock);
> +       isec->sclass = sclass;
> +       isec->sid = sid;
>         isec->initialized = LABEL_INITIALIZED;
> +       spin_unlock(&isec->lock);
>
>         if (sock->sk) {
>                 sksec = sock->sk->sk_security;
> -               sksec->sid = isec->sid;
> -               sksec->sclass = isec->sclass;
> +               sksec->sclass = sclass;
> +               sksec->sid = sid;
>                 err = selinux_netlbl_socket_post_create(sock->sk, family);
>         }
>
> @@ -4466,17 +4488,25 @@ static int selinux_socket_accept(struct socket *sock, struct socket *newsock)
>         int err;
>         struct inode_security_struct *isec;
>         struct inode_security_struct *newisec;
> +       u16 sclass;
> +       u32 sid;
>
>         err = sock_has_perm(current, sock->sk, SOCKET__ACCEPT);
>         if (err)
>                 return err;
>
> -       newisec = inode_security_novalidate(SOCK_INODE(newsock));
> -
>         isec = inode_security_novalidate(SOCK_INODE(sock));
> -       newisec->sclass = isec->sclass;
> -       newisec->sid = isec->sid;
> +       spin_lock(&isec->lock);
> +       sclass = isec->sclass;
> +       sid = isec->sid;
> +       spin_unlock(&isec->lock);
> +
> +       newisec = inode_security_novalidate(SOCK_INODE(newsock));
> +       spin_lock(&newisec->lock);
> +       newisec->sclass = sclass;
> +       newisec->sid = sid;
>         newisec->initialized = LABEL_INITIALIZED;
> +       spin_unlock(&newisec->lock);
>
>         return 0;
>  }
> @@ -5978,9 +6008,9 @@ static void selinux_inode_invalidate_secctx(struct inode *inode)
>  {
>         struct inode_security_struct *isec = inode->i_security;
>
> -       mutex_lock(&isec->lock);
> +       spin_lock(&isec->lock);
>         isec->initialized = LABEL_INVALID;
> -       mutex_unlock(&isec->lock);
> +       spin_unlock(&isec->lock);
>  }
>
>  /*
> diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
> index c21e135..7e770e9 100644
> --- a/security/selinux/include/objsec.h
> +++ b/security/selinux/include/objsec.h
> @@ -39,7 +39,8 @@ struct task_security_struct {
>
>  enum label_initialized {
>         LABEL_INVALID,          /* invalid or not initialized */
> -       LABEL_INITIALIZED       /* initialized */
> +       LABEL_INITIALIZED,      /* initialized */
> +       LABEL_PENDING
>  };
>
>  struct inode_security_struct {
> @@ -52,7 +53,7 @@ struct inode_security_struct {
>         u32 sid;                /* SID of this object */
>         u16 sclass;             /* security class of this object */
>         unsigned char initialized;      /* initialization flag */
> -       struct mutex lock;
> +       struct spinlock lock;
>  };
>
>  struct file_security_struct {
> --
> 2.7.4
>



-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH V4 10/15] blk-throttle: add a simple idle detection
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

A cgroup gets assigned a high limit, but the cgroup could never dispatch
enough IO to cross the high limit. In such case, the queue state machine
will remain in LIMIT_HIGH state and all other cgroups will be throttled
according to high limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to higher state.

We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its high limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_HIGH. But
when queue gets upgraded to higher state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its high limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.

Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.

We treat a cgroup idle if its think time is above a threshold (by
default 50us for SSD and 1ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio.c               |  2 ++
 block/blk-throttle.c      | 72 ++++++++++++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  2 ++
 include/linux/blk_types.h |  1 +
 4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index db85c57..7baa86d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>
 
 #include <trace/events/block.h>
+#include "blk.h"
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1759,6 +1760,7 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 45a28c4..cb5fd85 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,8 @@ static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 #define DFL_THROTL_SLICE (HZ / 10)
 #define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -149,6 +151,10 @@ struct throtl_grp {
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
+
+	u64 last_finish_time;
+	u64 checked_last_finish_time;
+	u64 avg_ttime;
 };
 
 struct throtl_data
@@ -172,6 +178,8 @@ struct throtl_data
 	unsigned long high_downgrade_time;
 
 	unsigned int scale;
+
+	u64 idle_ttime_threshold;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -1629,6 +1637,14 @@ static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+	/* cgroup is idle if average think time is more than threshold */
+	return ktime_get_ns() - tg->last_finish_time >
+		4 * tg->td->idle_ttime_threshold ||
+	       tg->avg_ttime > tg->td->idle_ttime_threshold;
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1837,6 +1853,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 	tg->last_io_disp[WRITE] = 0;
 }
 
+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+	u64 now = ktime_get_ns();
+	u64 last_finish_time = tg->last_finish_time;
+
+	if (now <= last_finish_time || last_finish_time == 0 ||
+	    last_finish_time == tg->checked_last_finish_time)
+		return;
+
+	tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+	tg->checked_last_finish_time = last_finish_time;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1848,6 +1877,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
+	if (tg->td->idle_ttime_threshold == -1) {
+		if (blk_queue_nonrot(q))
+			tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+		else
+			tg->td->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+	}
+
 	/* see throtl_charge_bio() */
 	if ((bio->bi_opf & REQ_THROTTLED) || !tg->has_rules[rw])
 		goto out;
@@ -1857,6 +1893,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
+	bio_associate_current(bio);
+	bio->bi_cg_private = q;
+
+	blk_throtl_update_ttime(tg);
+
 	sq = &tg->service_queue;
 
 again:
@@ -1917,7 +1958,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_high_overflow_time[rw] = jiffies;
 
-	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
@@ -1946,6 +1986,34 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	return throttled;
 }
 
+void blk_throtl_bio_endio(struct bio *bio)
+{
+	struct blkcg *blkcg;
+	struct blkcg_gq *blkg;
+	struct throtl_grp *tg;
+	struct request_queue *q;
+
+	q = bio->bi_cg_private;
+	if (!q)
+		return;
+	bio->bi_cg_private = NULL;
+
+	rcu_read_lock();
+	blkcg = bio_blkcg(bio);
+	if (!blkcg)
+		goto end;
+	blkg = blkg_lookup(blkcg, q);
+	if (!blkg)
+		goto end;
+
+	tg = blkg_to_tg(blkg ?: q->root_blkg);
+
+	tg->last_finish_time = ktime_get_ns();
+
+end:
+	rcu_read_unlock();
+}
+
 /*
  * Dispatch all bios from all children tg's queued on @parent_sq.  On
  * return, @parent_sq is guaranteed to not have any active children tg's
@@ -2030,6 +2098,8 @@ int blk_throtl_init(struct request_queue *q)
 	td->limit_index = LIMIT_MAX;
 	td->high_upgrade_time = jiffies;
 	td->high_downgrade_time = jiffies;
+
+	td->idle_ttime_threshold = -1;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
diff --git a/block/blk.h b/block/blk.h
index 39c14dd..b433f35 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -292,10 +292,12 @@ extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_slice_store(struct request_queue *q,
 	const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..ff8dd24 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -59,6 +59,7 @@ struct bio {
 	 */
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
+	void *bi_cg_private;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 05/15] blk-throttle: add downgrade logic
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

When queue state machine is in LIMIT_MAX state, but a cgroup is below
its high limit for some time, the queue should be downgraded to lower
state as one cgroup's high limit isn't met.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 34a75e5..d177252 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -136,6 +136,13 @@ struct throtl_grp {
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
 
+	unsigned long last_high_overflow_time[2];
+
+	uint64_t last_bytes_disp[2];
+	unsigned int last_io_disp[2];
+
+	unsigned long last_check_time;
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -155,6 +162,9 @@ struct throtl_data
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
 	bool limit_valid[LIMIT_CNT];
+
+	unsigned long high_upgrade_time;
+	unsigned long high_downgrade_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -896,6 +906,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	/* Charge the bio to the group */
 	tg->bytes_disp[rw] += bio->bi_iter.bi_size;
 	tg->io_disp[rw]++;
+	tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->last_io_disp[rw]++;
 
 	/*
 	 * REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1510,6 +1522,65 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static unsigned long __tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+	unsigned long rtime = -1, wtime = -1;
+
+	if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+	    tg->iops[READ][LIMIT_HIGH] != -1 ||
+	    tg->bps[READ][LIMIT_MAX] != -1 ||
+	    tg->iops[READ][LIMIT_MAX] != -1)
+		rtime = tg->last_high_overflow_time[READ];
+	if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+	    tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+	    tg->bps[WRITE][LIMIT_MAX] != -1 ||
+	    tg->iops[WRITE][LIMIT_MAX] != -1)
+		wtime = tg->last_high_overflow_time[WRITE];
+	return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *parent_sq;
+	struct throtl_grp *parent = tg;
+	unsigned long ret = __tg_last_high_overflow_time(tg);
+
+	while (true) {
+		parent_sq = parent->service_queue.parent_sq;
+		parent = sq_to_tg(parent_sq);
+		if (!parent)
+			break;
+		if (((parent->bps[READ][LIMIT_HIGH] != -1 &&
+		      parent->bps[READ][LIMIT_HIGH] >=
+		       tg->bps[READ][LIMIT_HIGH]) ||
+		     (parent->bps[READ][LIMIT_HIGH] == -1 &&
+		      parent->bps[READ][LIMIT_MAX] >=
+		       tg->bps[READ][LIMIT_HIGH])) &&
+		    ((parent->bps[WRITE][LIMIT_HIGH] != -1 &&
+		      parent->bps[WRITE][LIMIT_HIGH] >=
+		       tg->bps[WRITE][LIMIT_HIGH]) ||
+		     (parent->bps[WRITE][LIMIT_HIGH] == -1 &&
+		      parent->bps[WRITE][LIMIT_MAX] >=
+		       tg->bps[WRITE][LIMIT_HIGH])) &&
+		    ((parent->iops[READ][LIMIT_HIGH] != -1 &&
+		      parent->iops[READ][LIMIT_HIGH] >=
+		       tg->iops[READ][LIMIT_HIGH]) ||
+		     (parent->iops[READ][LIMIT_HIGH] == -1 &&
+		      parent->iops[READ][LIMIT_MAX] >=
+		       tg->iops[READ][LIMIT_HIGH])) &&
+		    ((parent->iops[WRITE][LIMIT_HIGH] != -1 &&
+		      parent->iops[WRITE][LIMIT_HIGH] >=
+		       tg->iops[WRITE][LIMIT_HIGH]) ||
+		     (parent->iops[WRITE][LIMIT_HIGH] == -1 &&
+		      parent->iops[WRITE][LIMIT_MAX] >=
+		       tg->iops[WRITE][LIMIT_HIGH])))
+			break;
+		if (time_after(__tg_last_high_overflow_time(parent), ret))
+			ret = __tg_last_high_overflow_time(parent);
+	}
+	return ret;
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1557,6 +1628,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_HIGH)
 		return false;
 
+	if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+		return false;
+
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1580,6 +1654,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	struct blkcg_gq *blkg;
 
 	td->limit_index = LIMIT_MAX;
+	td->high_upgrade_time = jiffies;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1595,6 +1670,111 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+	td->limit_index = new;
+	td->high_downgrade_time = jiffies;
+}
+
+static bool throtl_downgrade_check_one(struct throtl_grp *tg)
+{
+	struct throtl_data *td = tg->td;
+	unsigned long now = jiffies;
+
+	/*
+	 * If cgroup is below high limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (time_after_eq(now, td->high_upgrade_time + throtl_slice) &&
+	    time_after_eq(now, tg_last_high_overflow_time(tg) + throtl_slice))
+		return true;
+	return false;
+}
+
+static bool throtl_downgrade_check_hierarchy(struct throtl_grp *tg)
+{
+	if (!throtl_downgrade_check_one(tg))
+		return false;
+	while (true) {
+		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+			    !tg_to_blkg(tg)->parent))
+			break;
+
+		if (!throtl_downgrade_check_one(tg))
+			return false;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+	}
+	return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+	uint64_t bps;
+	unsigned int iops;
+	unsigned long elapsed_time;
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_MAX ||
+	    !tg->td->limit_valid[LIMIT_HIGH])
+		return;
+	if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+		return;
+	if (time_after(tg->last_check_time + throtl_slice, now))
+		return;
+
+	if (time_before(now, tg_last_high_overflow_time(tg) + throtl_slice))
+		return;
+
+	elapsed_time = now - tg->last_check_time;
+	tg->last_check_time = now;
+
+	if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+	    tg->bps[READ][LIMIT_MAX] != -1) {
+		bps = tg->last_bytes_disp[READ] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[READ][LIMIT_HIGH] ||
+		    bps >= tg->bps[READ][LIMIT_MAX])
+			tg->last_high_overflow_time[READ] = now;
+	}
+
+	if (tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+	    tg->bps[WRITE][LIMIT_MAX] != -1) {
+		bps = tg->last_bytes_disp[WRITE] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[WRITE][LIMIT_HIGH] ||
+		    bps >= tg->bps[WRITE][LIMIT_MAX])
+			tg->last_high_overflow_time[WRITE] = now;
+	}
+
+	if (tg->iops[READ][LIMIT_HIGH] != -1 ||
+	    tg->iops[READ][LIMIT_MAX] != -1) {
+		iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+		if (iops >= tg->iops[READ][LIMIT_HIGH] ||
+		    iops >= tg->iops[READ][LIMIT_MAX])
+			tg->last_high_overflow_time[READ] = now;
+	}
+
+	if (tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+	    tg->iops[WRITE][LIMIT_MAX] != -1) {
+		iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+		if (iops >= tg->iops[WRITE][LIMIT_HIGH] ||
+		    iops >= tg->iops[WRITE][LIMIT_MAX])
+			tg->last_high_overflow_time[WRITE] = now;
+	}
+
+	/*
+	 * If cgroup is below high limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (throtl_downgrade_check_hierarchy(tg))
+		throtl_downgrade_state(tg->td, LIMIT_HIGH);
+
+	tg->last_bytes_disp[READ] = 0;
+	tg->last_bytes_disp[WRITE] = 0;
+	tg->last_io_disp[READ] = 0;
+	tg->last_io_disp[WRITE] = 0;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1619,12 +1799,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		if (tg->last_high_overflow_time[rw] == 0)
+			tg->last_high_overflow_time[rw] = jiffies;
+		throtl_downgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
+			tg->last_high_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(tg->td, tg)) {
 				throtl_upgrade_state(tg->td);
 				goto again;
@@ -1668,6 +1852,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
+	tg->last_high_overflow_time[rw] = jiffies;
+
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
@@ -1778,6 +1964,8 @@ int blk_throtl_init(struct request_queue *q)
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
+	td->high_upgrade_time = jiffies;
+	td->high_downgrade_time = jiffies;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 04/15] blk-throttle: add upgrade logic for LIMIT_HIGH state
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

When queue is in LIMIT_HIGH state and all cgroups with high limit cross
the bps/iops limitation, we will upgrade queue's state to
LIMIT_MAX

For a cgroup hierarchy, there are two cases. Children has lower high
limit than parent. Parent's high limit is meaningless. If children's
bps/iops cross high limit, we can upgrade queue state. The other case is
children has higher high limit than parent. Children's high limit is
meaningless. As long as parent's bps/iops cross high limit, we can
upgrade queue state.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ec53671..34a75e5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -456,6 +456,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
 		td->limit_valid[LIMIT_HIGH] = false;
 }
 
+static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -467,9 +468,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
 	blk_throtl_update_valid_limit(tg->td);
 
-	if (tg->td->limit_index == LIMIT_HIGH &&
-	    !tg->td->limit_valid[LIMIT_HIGH])
-		tg->td->limit_index = LIMIT_MAX;
+	if (!tg->td->limit_valid[tg->td->limit_index])
+		throtl_upgrade_state(tg->td);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1077,6 +1077,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 	return nr_disp;
 }
 
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg);
 /**
  * throtl_pending_timer_fn - timer function for service_queue->pending_timer
  * @arg: the throtl_service_queue being serviced
@@ -1103,6 +1105,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
 	int ret;
 
 	spin_lock_irq(q->queue_lock);
+	if (throtl_can_upgrade(td, NULL))
+		throtl_upgrade_state(td);
+
 again:
 	parent_sq = sq->parent_sq;
 	dispatched = false;
@@ -1505,6 +1510,91 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static bool throtl_upgrade_check_one(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	bool read_limit, write_limit;
+
+	/* if cgroup reaches high/max limit, it's ok to next limit */
+	read_limit = tg->bps[READ][LIMIT_HIGH] != -1 ||
+		     tg->iops[READ][LIMIT_HIGH] != -1 ||
+		     tg->bps[READ][LIMIT_MAX] != -1 ||
+		     tg->iops[READ][LIMIT_MAX] != -1;
+	write_limit = tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+		      tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+		      tg->bps[WRITE][LIMIT_MAX] != -1 ||
+		      tg->iops[WRITE][LIMIT_MAX] != -1;
+	if (read_limit && sq->nr_queued[READ] &&
+	    (!write_limit || sq->nr_queued[WRITE]))
+		return true;
+	if (write_limit && sq->nr_queued[WRITE] &&
+	    (!read_limit || sq->nr_queued[READ]))
+		return true;
+	return false;
+}
+
+static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg)
+{
+	if (throtl_upgrade_check_one(tg))
+		return true;
+	while (true) {
+		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+				!tg_to_blkg(tg)->parent))
+			return false;
+		if (throtl_upgrade_check_one(tg))
+			return true;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+	}
+	return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	if (td->limit_index != LIMIT_HIGH)
+		return false;
+
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg == this_tg)
+			continue;
+		if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+			continue;
+		if (!throtl_upgrade_check_hierarchy(tg)) {
+			rcu_read_unlock();
+			return false;
+		}
+	}
+	rcu_read_unlock();
+	return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	td->limit_index = LIMIT_MAX;
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+		struct throtl_service_queue *sq = &tg->service_queue;
+
+		tg->disptime = jiffies - 1;
+		throtl_select_dispatch(sq);
+		throtl_schedule_next_dispatch(sq, false);
+	}
+	rcu_read_unlock();
+	throtl_select_dispatch(&td->service_queue);
+	throtl_schedule_next_dispatch(&td->service_queue, false);
+	queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1527,14 +1617,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+again:
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
-		if (!tg_may_dispatch(tg, bio, NULL))
+		if (!tg_may_dispatch(tg, bio, NULL)) {
+			if (throtl_can_upgrade(tg->td, tg)) {
+				throtl_upgrade_state(tg->td);
+				goto again;
+			}
 			break;
+		}
 
 		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 00/15] blk-throttle: add .high limit
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal

Hi,

The background is we don't have an ioscheduler for blk-mq yet, so we can't
prioritize processes/cgroups. This patch set tries to add basic arbitration
between cgroups with blk-throttle. It adds a new limit io.high for
blk-throttle. It's only for cgroup2.

io.max is a hard limit throttling. cgroups with a max limit never dispatch more
IO than their max limit. While io.high is a best effort throttling. cgroups
with high limit can run above their high limit at appropriate time.
Specifically, if all cgroups reach their high limit, all cgroups can run above
their high limit. If any cgroup runs under its high limit, all other cgroups
will run according to their high limit.

An example usage is we have a high prio cgroup with high high limit and a low
prio cgroup with low high limit. If the high prio cgroup isn't running, the low
prio can run above its high limit, so we don't waste the bandwidth. When the
high prio cgroup runs and is below its high limit, low prio cgroup will run
under its high limit. This will protect high prio cgroup to get more resources.
If both cgroups reach their high limit, both can run above their high limit
(eg, fully utilize disk bandwidth). All these can't be done with io.max limit.

The implementation is simple. The disk queue has 2 states LIMIT_HIGH and
LIMIT_MAX. In each disk state, we throttle cgroups according to the limit of
the state. That is io.high limit for LIMIT_HIGH state, io.max limit for
LIMIT_MAX. The disk state can be upgraded/downgraded between
LIMIT_HIGH/LIMIT_MAX according to the rule above. Initially disk state is
LIMIT_MAX. And if no cgroup sets io.high, the disk state will remain in
LIMIT_MAX state. Users with only io.max set will find nothing changed with the
patches.

The first 8 patches implement the basic framework. Add interface, handle
upgrade and downgrade logic. The patch 8 detects a special case a cgroup is
completely idle. In this case, we ignore the cgroup's limit. The patch 9-15
adds more heuristics.

The basic framework has 2 major issues.
1. fluctuation. When the state is upgraded from LIMIT_HIGH to LIMIT_MAX, the
cgroup's bandwidth can change dramatically, sometimes in a way not expected.
For example, one cgroup's bandwidth will drop below its io.high limit very soon
after a upgrade. patch 9 has more details about the issue.
2. idle cgroup. cgroup with a io.high limit doesn't always dispatch enough IO.
In above upgrade rule, the disk will remain in LIMIT_HIGH state and all other
cgroups can't dispatch more IO above their high limit. Hence this is a waste of
disk bandwidth. patch 10 has more details about the issue.

For issue 1, we make cgroup bandwidth increase smoothly after a upgrade. This
will reduce the chance a cgroup's bandwidth drop under its high limit rapidly.
The smoothness means we could waste some bandwidth in the transition though.
But we must pay something for sharing.

The issue 2 is very hard to solve. The patch 10 uses the 'think time check'
idea borrowed from CFQ to detect idle cgroup. It's not perfect, eg, not works
well for high IO depth workloads.  But it's the best I tried so far and in
practice works well. This definitively needs more tuning.

The big change in this version comes from patch 13 - 15. We add a latency
target for each cgroup. The goal is to solve issue 2. If a cgroup's average io
latency exceeds its latency target, the cgroup is considered as busy.

Please review, test and consider merge.

Thanks,
Shaohua

V3->V4:
- Add latency target for cgroup
- Fix bugs

V2->V3:
- Rebase
- Fix several bugs
- Make harddisk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop io.low interface for simplicity and the interface isn't a must-have to
  prioritize cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2

Shaohua Li (15):
  blk-throttle: prepare support multiple limits
  blk-throttle: add .high interface
  blk-throttle: configure bps/iops limit for cgroup in high limit
  blk-throttle: add upgrade logic for LIMIT_HIGH state
  blk-throttle: add downgrade logic
  blk-throttle: make sure expire time isn't too big
  blk-throttle: make throtl_slice tunable
  blk-throttle: detect completed idle cgroup
  blk-throttle: make bandwidth change smooth
  blk-throttle: add a simple idle detection
  blk-throttle: add interface to configure think time threshold
  blk-throttle: ignore idle cgroup limit
  blk-throttle: add a mechanism to estimate IO latency
  blk-throttle: add interface for per-cgroup target latency
  blk-throttle: add latency target support

 block/bio.c               |    2 +
 block/blk-sysfs.c         |   18 +
 block/blk-throttle.c      | 1035 ++++++++++++++++++++++++++++++++++++++++++---
 block/blk.h               |    9 +
 include/linux/blk_types.h |    4 +
 5 files changed, 1001 insertions(+), 67 deletions(-)

-- 
2.9.3

^ permalink raw reply

* [PATCH V4 02/15] blk-throttle: add .high interface
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

Add high limit for cgroup and corresponding cgroup interface.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 132 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 103 insertions(+), 29 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 925aa1ed..a564215 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
+	LIMIT_HIGH,
 	LIMIT_MAX,
 	LIMIT_CNT,
 };
@@ -414,6 +415,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+	bool high_valid = false;
+
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+		    tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+		    tg->iops[READ][LIMIT_HIGH] != -1 ||
+		    tg->iops[WRITE][LIMIT_HIGH] != -1)
+			high_valid = true;
+	}
+	rcu_read_unlock();
+
+	if (high_valid)
+		td->limit_valid[LIMIT_HIGH] = true;
+	else
+		td->limit_valid[LIMIT_HIGH] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+
+	tg->bps[READ][LIMIT_HIGH] = -1;
+	tg->bps[WRITE][LIMIT_HIGH] = -1;
+	tg->iops[READ][LIMIT_HIGH] = -1;
+	tg->iops[WRITE][LIMIT_HIGH] = -1;
+
+	blk_throtl_update_valid_limit(tg->td);
+
+	if (tg->td->limit_index == LIMIT_HIGH &&
+	    !tg->td->limit_valid[LIMIT_HIGH])
+		tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1283,7 +1324,7 @@ static struct cftype throtl_legacy_files[] = {
 	{ }	/* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 			 int off)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1292,36 +1333,32 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
-	    tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
+	if (tg->bps[READ][off] == -1 && tg->bps[WRITE][off] == -1 &&
+	    tg->iops[READ][off] == -1 && tg->iops[WRITE][off] == -1)
 		return 0;
 
-	if (tg->bps[READ][LIMIT_MAX] != -1)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-			tg->bps[READ][LIMIT_MAX]);
-	if (tg->bps[WRITE][LIMIT_MAX] != -1)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
-			tg->bps[WRITE][LIMIT_MAX]);
-	if (tg->iops[READ][LIMIT_MAX] != -1)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u",
-			tg->iops[READ][LIMIT_MAX]);
-	if (tg->iops[WRITE][LIMIT_MAX] != -1)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u",
-			tg->iops[WRITE][LIMIT_MAX]);
+	if (tg->bps[READ][off] != -1)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ][off]);
+	if (tg->bps[WRITE][off] != -1)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE][off]);
+	if (tg->iops[READ][off] != -1)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ][off]);
+	if (tg->iops[WRITE][off] != -1)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE][off]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
 	return 0;
 }
 
-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
 {
-	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
 			  &blkcg_policy_throtl, seq_cft(sf)->private, false);
 	return 0;
 }
 
-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			  char *buf, size_t nbytes, loff_t off)
 {
 	struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1329,6 +1366,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	int ret;
+	int index = of_cft(of)->private;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1336,10 +1374,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ][LIMIT_MAX];
-	v[1] = tg->bps[WRITE][LIMIT_MAX];
-	v[2] = tg->iops[READ][LIMIT_MAX];
-	v[3] = tg->iops[WRITE][LIMIT_MAX];
+	v[0] = tg->bps[READ][index];
+	v[1] = tg->bps[WRITE][index];
+	v[2] = tg->iops[READ][index];
+	v[3] = tg->iops[WRITE][index];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1376,11 +1414,37 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ][LIMIT_MAX] = v[0];
-	tg->bps[WRITE][LIMIT_MAX] = v[1];
-	tg->iops[READ][LIMIT_MAX] = v[2];
-	tg->iops[WRITE][LIMIT_MAX] = v[3];
-
+	if (index == LIMIT_MAX) {
+		if ((v[0] < tg->bps[READ][LIMIT_HIGH] &&
+		       tg->bps[READ][LIMIT_HIGH] != -1) ||
+		    (v[1] < tg->bps[WRITE][LIMIT_HIGH] &&
+		       tg->bps[WRITE][LIMIT_HIGH] != -1) ||
+		    (v[2] < tg->iops[READ][LIMIT_HIGH] &&
+		       tg->iops[READ][LIMIT_HIGH] != -1) ||
+		    (v[3] < tg->iops[WRITE][LIMIT_HIGH] &&
+		       tg->iops[WRITE][LIMIT_HIGH] != -1)) {
+			ret = -EINVAL;
+			goto out_finish;
+		}
+	} else if (index == LIMIT_HIGH) {
+		if ((v[0] > tg->bps[READ][LIMIT_MAX] && v[0] != -1) ||
+		    (v[1] > tg->bps[WRITE][LIMIT_MAX] && v[1] != -1) ||
+		    (v[2] > tg->iops[READ][LIMIT_MAX] && v[2] != -1) ||
+		    (v[3] > tg->iops[WRITE][LIMIT_MAX] && v[3] != -1)) {
+			ret = -EINVAL;
+			goto out_finish;
+		}
+	}
+	tg->bps[READ][index] = v[0];
+	tg->bps[WRITE][index] = v[1];
+	tg->iops[READ][index] = v[2];
+	tg->iops[WRITE][index] = v[3];
+
+	if (index == LIMIT_HIGH) {
+		blk_throtl_update_valid_limit(tg->td);
+		if (tg->td->limit_valid[LIMIT_HIGH])
+			tg->td->limit_index = LIMIT_HIGH;
+	}
 	tg_conf_updated(tg);
 	ret = 0;
 out_finish:
@@ -1390,10 +1454,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 static struct cftype throtl_files[] = {
 	{
+		.name = "high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_HIGH,
+	},
+	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
-		.seq_show = tg_print_max,
-		.write = tg_set_max,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_MAX,
 	},
 	{ }	/* terminate */
 };
@@ -1412,6 +1484,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_alloc_fn		= throtl_pd_alloc,
 	.pd_init_fn		= throtl_pd_init,
 	.pd_online_fn		= throtl_pd_online,
+	.pd_offline_fn		= throtl_pd_offline,
 	.pd_free_fn		= throtl_pd_free,
 };
 
@@ -1591,6 +1664,7 @@ int blk_throtl_init(struct request_queue *q)
 	td->queue = q;
 
 	td->limit_valid[LIMIT_MAX] = true;
+	td->limit_index = LIMIT_MAX;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 07/15] blk-throttle: make throtl_slice tunable
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

throtl_slice is important for blk-throttling. A lot of stuffes depend on
it, for example, throughput measurement. It has 100ms default value,
which is not appropriate for all disks. For example, for SSD we might
use a smaller value to make the throughput smoother. This patch makes it
tunable.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-sysfs.c    | 11 ++++++++
 block/blk-throttle.c | 77 +++++++++++++++++++++++++++++++++++++---------------
 block/blk.h          |  3 ++
 3 files changed, 69 insertions(+), 22 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9cc8d7c..3e284e4 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -526,6 +526,14 @@ static struct queue_sysfs_entry queue_dax_entry = {
 	.show = queue_dax_show,
 };
 
+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_slice_entry = {
+	.attr = {.name = "throttling_slice", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_throtl_slice_show,
+	.store = blk_throtl_slice_store,
+};
+#endif
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -553,6 +561,9 @@ static struct attribute *default_attrs[] = {
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+	&throtl_slice_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index eff3120..e85b2b6 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
 static int throtl_quantum = 32;
 
 /* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10;	/* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -158,6 +159,8 @@ struct throtl_data
 	/* Total Number of queued bios on READ and WRITE lists */
 	unsigned int nr_queued[2];
 
+	unsigned int throtl_slice;
+
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
@@ -589,7 +592,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
-	unsigned long max_expire = jiffies + 8 * throtl_slice;
+	unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
 
 	if (time_after(expires, max_expire))
 		expires = max_expire;
@@ -650,7 +653,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 	if (time_after_eq(start, tg->slice_start[rw]))
 		tg->slice_start[rw] = start;
 
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -662,7 +665,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -672,13 +675,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
 					unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 }
 
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
 				       unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 	throtl_log(&tg->service_queue,
 		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -718,19 +721,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	 * is bad because it does not allow new slice to start.
 	 */
 
-	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
 
 	time_elapsed = jiffies - tg->slice_start[rw];
 
-	nr_slices = time_elapsed / throtl_slice;
+	nr_slices = time_elapsed / tg->td->throtl_slice;
 
 	if (!nr_slices)
 		return;
-	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
+	io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) /
+		HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -745,7 +749,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	else
 		tg->io_disp[rw] = 0;
 
-	tg->slice_start[rw] += nr_slices * throtl_slice;
+	tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;
 
 	throtl_log(&tg->service_queue,
 		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -765,9 +769,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
 	/*
 	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -814,9 +818,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
 	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
@@ -881,8 +885,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
 		throtl_start_new_slice(tg, rw);
 	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+		if (time_before(tg->slice_end[rw],
+		    jiffies + tg->td->throtl_slice))
+			throtl_extend_slice(tg, rw,
+				jiffies + tg->td->throtl_slice);
 	}
 
 	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1632,7 +1638,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_HIGH)
 		return false;
 
-	if (time_before(jiffies, td->high_downgrade_time + throtl_slice))
+	if (time_before(jiffies, td->high_downgrade_time + td->throtl_slice))
 		return false;
 
 	rcu_read_lock();
@@ -1689,8 +1695,9 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
 	 * If cgroup is below high limit, consider downgrade and throttle other
 	 * cgroups
 	 */
-	if (time_after_eq(now, td->high_upgrade_time + throtl_slice) &&
-	    time_after_eq(now, tg_last_high_overflow_time(tg) + throtl_slice))
+	if (time_after_eq(now, td->high_upgrade_time + td->throtl_slice) &&
+	    time_after_eq(now, tg_last_high_overflow_time(tg) +
+					td->throtl_slice))
 		return true;
 	return false;
 }
@@ -1723,10 +1730,11 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 		return;
 	if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
 		return;
-	if (time_after(tg->last_check_time + throtl_slice, now))
+	if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
 		return;
 
-	if (time_before(now, tg_last_high_overflow_time(tg) + throtl_slice))
+	if (time_before(now, tg_last_high_overflow_time(tg) +
+			tg->td->throtl_slice))
 		return;
 
 	elapsed_time = now - tg->last_check_time;
@@ -1965,6 +1973,7 @@ int blk_throtl_init(struct request_queue *q)
 
 	q->td = td;
 	td->queue = q;
+	td->throtl_slice = DFL_THROTL_SLICE;
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
@@ -1985,6 +1994,30 @@ void blk_throtl_exit(struct request_queue *q)
 	kfree(q->td);
 }
 
+ssize_t blk_throtl_slice_show(struct request_queue *q, char *page)
+{
+	if (!q->td)
+		return -EINVAL;
+	return sprintf(page, "%ums\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_slice_store(struct request_queue *q,
+	const char *page, size_t count)
+{
+	unsigned long v;
+	unsigned long t;
+
+	if (!q->td)
+		return -EINVAL;
+	if (kstrtoul(page, 10, &v))
+		return -EINVAL;
+	t = msecs_to_jiffies(v);
+	if (t == 0 || t > MAX_THROTL_SLICE)
+		return -EINVAL;
+	q->td->throtl_slice = t;
+	return count;
+}
+
 static int __init throtl_init(void)
 {
 	kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index 74444c4..39c14dd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -289,6 +289,9 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
 extern void blk_throtl_drain(struct request_queue *q);
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
+extern ssize_t blk_throtl_slice_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_slice_store(struct request_queue *q,
+	const char *page, size_t count);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 01/15] blk-throttle: prepare support multiple limits
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

We are going to support high/max limit, each cgroup will have 2 limits
after that. This patch prepares for the multiple limits change.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 110 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 69 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a3ea826..925aa1ed 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
+enum {
+	LIMIT_MAX,
+	LIMIT_CNT,
+};
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
 	bool has_rules[2];
 
 	/* bytes per second rate limits */
-	uint64_t bps[2];
+	uint64_t bps[2][LIMIT_CNT];
 
 	/* IOPS limits */
-	unsigned int iops[2];
+	unsigned int iops[2][LIMIT_CNT];
 
 	/* Number of bytes disptached in current slice */
 	uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data
 
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
+	unsigned int limit_index;
+	bool limit_valid[LIMIT_CNT];
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->iops[rw][tg->td->limit_index];
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
 	struct throtl_grp *tg;
-	int rw;
+	int rw, index;
 
 	tg = kzalloc_node(sizeof(*tg), gfp, node);
 	if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = -1;
-	tg->bps[WRITE] = -1;
-	tg->iops[READ] = -1;
-	tg->iops[WRITE] = -1;
+	for (rw = READ; rw <= WRITE; rw++) {
+		for (index = 0; index < LIMIT_CNT; index++) {
+			tg->bps[rw][index] = -1;
+			tg->iops[rw][index] = -1;
+		}
+	}
 
 	return &tg->pd;
 }
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
 	struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+	struct throtl_data *td = tg->td;
 	int rw;
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-				    (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+			(td->limit_valid[td->limit_index] &&
+			 (tg_bps_limit(tg, rw) != -1 ||
+			  tg_iops_limit(tg, rw) != -1));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	if (!nr_slices)
 		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	 * have been trimmed.
 	 */
 
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+	tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 
 	if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;
 
 	if (jiffy_wait > jiffy_elapsed)
 		jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
@@ -736,7 +758,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Calc approx time to dispatch */
 	extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+	jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));
 
 	if (!jiffy_wait)
 		jiffy_wait = 1;
@@ -771,7 +793,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+	if (tg_bps_limit(tg, rw) == -1 && tg_iops_limit(tg, rw) == -1) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1148,8 +1170,8 @@ static void tg_conf_updated(struct throtl_grp *tg)
 
 	throtl_log(&tg->service_queue,
 		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
-		   tg->bps[READ], tg->bps[WRITE],
-		   tg->iops[READ], tg->iops[WRITE]);
+		   tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+		   tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));
 
 	/*
 	 * Update has_rules[] flags for the updated tg's subtree.  A tg is
@@ -1226,25 +1248,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
 static struct cftype throtl_legacy_files[] = {
 	{
 		.name = "throttle.read_bps_device",
-		.private = offsetof(struct throtl_grp, bps[READ]),
+		.private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.write_bps_device",
-		.private = offsetof(struct throtl_grp, bps[WRITE]),
+		.private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.read_iops_device",
-		.private = offsetof(struct throtl_grp, iops[READ]),
+		.private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
 	{
 		.name = "throttle.write_iops_device",
-		.private = offsetof(struct throtl_grp, iops[WRITE]),
+		.private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
@@ -1270,18 +1292,22 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
-	    tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+	if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
+	    tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
 		return 0;
 
-	if (tg->bps[READ] != -1)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != -1)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != -1)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != -1)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+	if (tg->bps[READ][LIMIT_MAX] != -1)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+			tg->bps[READ][LIMIT_MAX]);
+	if (tg->bps[WRITE][LIMIT_MAX] != -1)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+			tg->bps[WRITE][LIMIT_MAX]);
+	if (tg->iops[READ][LIMIT_MAX] != -1)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u",
+			tg->iops[READ][LIMIT_MAX]);
+	if (tg->iops[WRITE][LIMIT_MAX] != -1)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u",
+			tg->iops[WRITE][LIMIT_MAX]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1310,10 +1336,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ];
-	v[1] = tg->bps[WRITE];
-	v[2] = tg->iops[READ];
-	v[3] = tg->iops[WRITE];
+	v[0] = tg->bps[READ][LIMIT_MAX];
+	v[1] = tg->bps[WRITE][LIMIT_MAX];
+	v[2] = tg->iops[READ][LIMIT_MAX];
+	v[3] = tg->iops[WRITE][LIMIT_MAX];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1350,10 +1376,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ] = v[0];
-	tg->bps[WRITE] = v[1];
-	tg->iops[READ] = v[2];
-	tg->iops[WRITE] = v[3];
+	tg->bps[READ][LIMIT_MAX] = v[0];
+	tg->bps[WRITE][LIMIT_MAX] = v[1];
+	tg->iops[READ][LIMIT_MAX] = v[2];
+	tg->iops[WRITE][LIMIT_MAX] = v[3];
 
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1451,8 +1477,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_disp[rw], bio->bi_iter.bi_size,
+		   tg_bps_limit(tg, rw),
+		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
@@ -1563,6 +1590,7 @@ int blk_throtl_init(struct request_queue *q)
 	q->td = td;
 	td->queue = q;
 
+	td->limit_valid[LIMIT_MAX] = true;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3


^ permalink raw reply related

* [PATCH V4 06/15] blk-throttle: make sure expire time isn't too big
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

cgroup could be throttled to a limit but when all cgroups cross high
limit, queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttle and other cgroups don't dispatch IO any more. In this case,
nobody can trigger current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wakeup the cgroup if other cgroups are
idle or make sure this cgroup doesn't sleep too long. Setting up a timer
means we must change the timer very frequently. This patch chooses the
latter. Making cgroup sleep time not too big wouldn't change cgroup
bps/iops, but could make it wakeup more frequently, which isn't a big
issue because throtl_slice * 8 is already quite big.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d177252..eff3120 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -589,6 +589,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
+	unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+	if (time_after(expires, max_expire))
+		expires = max_expire;
 	mod_timer(&sq->pending_timer, expires);
 	throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
 		   expires - jiffies, jiffies);
-- 
2.9.3

^ permalink raw reply related

* [PATCH V4 03/15] blk-throttle: configure bps/iops limit for cgroup in high limit
From: Shaohua Li @ 2016-11-14 22:22 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Kernel-team, axboe, tj, vgoyal
In-Reply-To: <cover.1479161136.git.shli@fb.com>

each queue will have a state machine. Initially queue is in LIMIT_HIGH
state, which means all cgroups will be throttled according to their high
limit. After all cgroups with high limit cross the limit, the queue state
gets upgraded to LIMIT_MAX state.
cgroups without high limit will use max limit for their high limit.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a564215..ec53671 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -208,12 +208,29 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
-	return tg->bps[rw][tg->td->limit_index];
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	uint64_t ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return -1;
+	ret = tg->bps[rw][tg->td->limit_index];
+	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+		return tg->bps[rw][LIMIT_MAX];
+
+	return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
-	return tg->iops[rw][tg->td->limit_index];
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	unsigned int ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return -1;
+	ret = tg->iops[rw][tg->td->limit_index];
+	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+		return tg->iops[rw][LIMIT_MAX];
+	return ret;
 }
 
 /**
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH v6 6/9] tpm: fix the missing .owner in tpm_bios_measurements_ops
From: Jarkko Sakkinen @ 2016-11-14 22:22 UTC (permalink / raw)
  To: Nayna Jain
  Cc: tpmdd-devel, peterhuewe, tpmdd, jgunthorpe, linux-kernel,
	linux-security-module
In-Reply-To: <1479117656-12403-7-git-send-email-nayna@linux.vnet.ibm.com>

On Mon, Nov 14, 2016 at 05:00:53AM -0500, Nayna Jain wrote:
> This patch fixes the missing .owner field in
> tpm_bios_measurements_ops definition.
> 
> Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
> Signed-off-by: Nayna Jain <nayna@linux.vnet.ibm.com>

Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

/Jarkko

> ---
>  drivers/char/tpm/tpm_eventlog.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/char/tpm/tpm_eventlog.c b/drivers/char/tpm/tpm_eventlog.c
> index f8c42fe..5575ffc 100644
> --- a/drivers/char/tpm/tpm_eventlog.c
> +++ b/drivers/char/tpm/tpm_eventlog.c
> @@ -349,6 +349,7 @@ static int tpm_bios_measurements_open(struct inode *inode,
>  }
>  
>  static const struct file_operations tpm_bios_measurements_ops = {
> +	.owner = THIS_MODULE,
>  	.open = tpm_bios_measurements_open,
>  	.read = seq_read,
>  	.llseek = seq_lseek,
> -- 
> 2.5.0
> 

^ permalink raw reply

* Re: [PATCH nft 0/3] src: add nft log flags support
From: Pablo Neira Ayuso @ 2016-11-14 22:21 UTC (permalink / raw)
  To: Liping Zhang; +Cc: netfilter-devel, Liping Zhang
In-Reply-To: <1474794421-5365-1-git-send-email-zlpnobody@163.com>

[-- Attachment #1: Type: text/plain, Size: 1099 bytes --]

On Sun, Sep 25, 2016 at 05:06:58PM +0800, Liping Zhang wrote:
> From: Liping Zhang <liping.zhang@spreadtrum.com>
> 
> After NF_LOG_XXX is exposed to the userspace, we can set log flags to
> log more things. The following iptables rule:
>   # iptables -A OUTPUT -j LOG --log-tcp-sequence --log-tcp-options \
>   --log-ip-options --log-uid --log-macdecode
> is equal to the following nft rule:
>   # nft add rule filter OUTPUT log tcpseq,tcpopt,ipopt,uid,macdecode

Sorry, I wanted to have a closer look at this but time has been
running up and I didn't manage to get back to this.

So basically, I would like to explore different syntax for this, eg.

        log flags tcp sequence,options
        log flags ip options
        log flags skuid
        log flags ether

I think syntax would be larger, but it would look more consistent to
what we have. Worst case is to get them all set. We can provide a
compact version for this:

        log flags all

Please, see sketch patch attached for brainstorming.

Would you have a look into this? Thanks and again sorry for not
getting any sooner on this.

[-- Attachment #2: x.patch --]
[-- Type: text/x-diff, Size: 1416 bytes --]

diff --git a/src/parser_bison.y b/src/parser_bison.y
index 91955c187f3f..286290341ffb 100644
--- a/src/parser_bison.y
+++ b/src/parser_bison.y
@@ -201,6 +201,8 @@ static void location_update(struct location *loc, struct location *rhs, int n)
 %token EXPORT			"export"
 %token MONITOR			"monitor"
 
+%token ALL			"all"
+
 %token ACCEPT			"accept"
 %token DROP			"drop"
 %token CONTINUE			"continue"
@@ -268,6 +270,8 @@ static void location_update(struct location *loc, struct location *rhs, int n)
 %token GATEWAY			"gateway"
 %token MTU			"mtu"
 
+%token OPTIONS			"options"
+
 %token IP6			"ip6"
 %token PRIORITY			"priority"
 %token FLOWLABEL		"flowlabel"
@@ -1530,6 +1534,25 @@ log_arg			:	PREFIX			string
 				$<stmt>0->log.level	= $2;
 				$<stmt>0->log.flags 	|= STMT_LOG_LEVEL;
 			}
+			|	FLAGS			log_flags
+			{
+				;
+			}
+			;
+
+log_flags		:	TCP	log_flags_tcp
+			|	IP	OPTIONS
+			|	SKUID
+			|	ETHER
+			|	ALL
+			;
+
+log_flags_tcp		:	log_flags_tcp	COMMA	log_flag_tcp
+			|	log_flag_tcp
+			;
+
+log_flag_tcp		:	SEQUENCE
+			|	OPTIONS
 			;
 
 level_type		:	string
diff --git a/src/scanner.l b/src/scanner.l
index cd7398b4e534..625023f5257c 100644
--- a/src/scanner.l
+++ b/src/scanner.l
@@ -469,6 +469,9 @@ addrstring	({macaddr}|{ip4addr}|{ip6addr})
 
 "notrack"		{ return NOTRACK; }
 
+"options"		{ return OPTIONS; }
+"all"			{ return ALL; }
+
 "xml"			{ return XML; }
 "json"			{ return JSON; }
 

^ permalink raw reply related

* Re: Mic detection on Mackbook Pro 12,1 (CS4208)
From: Paulo Fidalgo @ 2016-11-14 22:21 UTC (permalink / raw)
  To: Takashi Iwai; +Cc: alsa-devel
In-Reply-To: <s5h1syhg3h9.wl-tiwai@suse.de>

Thanks Takashi for point out, but that document is overwhelming.
I understand that developing drivers for unknown hardware should be
like running in the woods with our eyes covered, but I don't feel I
have the enough knowledge and time to be able to do it.

If helps, when I connect the headset running hda-analyzer tool in
monitor mode I only have events related to the OUTPUT:

Watching 2 cards
======================================
Diff for codec 1/0 (0x10134208):
---
+++
@@ -7,17 +7,17 @@
 No Modem Function Group found
 Default PCM:
     rates [0x0]:
     bits [0x0]:
     formats [0x0]:
 Default Amp-In caps: N/A
 Default Amp-Out caps: N/A
 GPIO: io=6, o=2, i=0, unsolicited=1, wake=1
-  IO[0]: enable=1, dir=1, wake=0, sticky=0, data=1, unsol=0
+  IO[0]: enable=1, dir=1, wake=0, sticky=0, data=0, unsol=0
   IO[1]: enable=0, dir=0, wake=0, sticky=0, data=0, unsol=0
   IO[2]: enable=0, dir=0, wake=0, sticky=0, data=0, unsol=0
   IO[3]: enable=0, dir=0, wake=0, sticky=0, data=0, unsol=0
   IO[4]: enable=0, dir=0, wake=0, sticky=0, data=0, unsol=0
   IO[5]: enable=0, dir=0, wake=0, sticky=0, data=0, unsol=0
 Node 0x02 [Audio Output] wcaps 0xd043d: Stereo Amp-Out Stripe
   Control: name="Headphone Playback Volume", index=0, device=0
     ControlAmp: chs=3, dir=1, idx=0, ofs=0
@@ -225,26 +225,24 @@
   Power: setting=D3, actual=D3
   Connection: 1
      0x02
 Node 0x12 [Pin Complex] wcaps 0x400501: Stereo
   Pincap 0x00000050: OUT Balance
   Pin Default 0x90170011: [Fixed] Speaker at Int N/A
     Conn = Analog, Color = Unknown
     DefAssociation = 0x1, Sequence = 0x1
-  Pin-ctls: 0x40: OUT
   Power: setting=D3, actual=D3
   Connection: 1
      0x03
 Node 0x13 [Pin Complex] wcaps 0x400501: Stereo
   Pincap 0x00000050: OUT Balance
   Pin Default 0x90170012: [Fixed] Speaker at Int N/A
     Conn = Analog, Color = Unknown
     DefAssociation = 0x1, Sequence = 0x2
-  Pin-ctls: 0x40: OUT
   Power: setting=D3, actual=D3
   Connection: 1
      0x04
 Node 0x14 [Pin Complex] wcaps 0x400501: Stereo
   Pincap 0x00000050: OUT Balance
   Pin Default 0x90170014: [Fixed] Speaker at Int N/A
     Conn = Analog, Color = Unknown
     DefAssociation = 0x1, Sequence = 0x4


Is there any other info I can provide to help?

Kind regards,

Paulo Fidalgo

On 11 November 2016 at 20:34, Takashi Iwai <tiwai@suse.de> wrote:
> On Thu, 10 Nov 2016 22:52:52 +0100,
> Paulo Fidalgo wrote:
>>
>> Hi,
>>
>> After reporting the issue on bugzilla
>> (https://bugzilla.kernel.org/show_bug.cgi?id=155601) I felt I could do
>> more in order to try to get a patch to fix the issue.
>>
>> Basically when I plug an headset(mic+headphone), it only detects the
>> headphone not the mic.
>>
>> Since I assumed the "memory address" of the 0x18 pin is wrong, I've
>> tried to change some values in order to get it working.
>
> The value you're referring to isn't a "memory address" at all.
> It's a "pin configuration" value, containing the information like
> the location, purpose, sequence number of the given pin, and it's
> encoded in 32bit integer.  Basically it's merely a hint for drivers to
> configure the system properly.
>
> Refer to the HD-audio specification document for more details.
>
>
> Takashi

^ permalink raw reply

* Re: [RESEND/PATCH v6 3/3] clk: qcom: Add A53 clock driver
From: Stephen Boyd @ 2016-11-14 22:21 UTC (permalink / raw)
  To: Georgi Djakov
  Cc: Bjorn Andersson, mturquette, linux-clk, linux-kernel,
	linux-arm-msm, devicetree, Rob Herring
In-Reply-To: <549f87fe-7be9-14b4-8e34-86f7f8dad94e@linaro.org>

On 11/11, Georgi Djakov wrote:
> On 11/03/2016 08:28 PM, Bjorn Andersson wrote:
> >On Wed 02 Nov 15:55 PDT 2016, Stephen Boyd wrote:
> >
> >>On 11/02, Bjorn Andersson wrote:
> >>>On Thu 27 Oct 18:54 PDT 2016, Stephen Boyd wrote:
> >>>
> >>>>On 10/19, Georgi Djakov wrote:
> >>>>>Add a driver for the A53 Clock Controller. It is a hardware block that
> >>>>>implements a combined mux and half integer divider functionality. It can
> >>>>>choose between a fixed-rate clock or the dedicated A53 PLL. The source
> >>>>>and the divider can be set both at the same time.
> >>>>>
> >>>>>This is required for enabling CPU frequency scaling on platforms like
> >>>>>MSM8916.
> >>>>>
> >>>>
> >>>>Please Cc DT reviewers.
> >>>>
> >>>>>Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
> >>>>>---
> >>>>> .../devicetree/bindings/clock/qcom,a53cc.txt       |  22 +++
> >>>>> drivers/clk/qcom/Kconfig                           |   8 ++
> >>>>> drivers/clk/qcom/Makefile                          |   1 +
> >>>>> drivers/clk/qcom/a53cc.c                           | 155 +++++++++++++++++++++
> >>>>> 4 files changed, 186 insertions(+)
> >>>>> create mode 100644 Documentation/devicetree/bindings/clock/qcom,a53cc.txt
> >>>>> create mode 100644 drivers/clk/qcom/a53cc.c
> >>>>>
> >>>>>diff --git a/Documentation/devicetree/bindings/clock/qcom,a53cc.txt b/Documentation/devicetree/bindings/clock/qcom,a53cc.txt
> >>>>>new file mode 100644
> >>>>>index 000000000000..a025f062f177
> >>>>>--- /dev/null
> >>>>>+++ b/Documentation/devicetree/bindings/clock/qcom,a53cc.txt
> >>>>>@@ -0,0 +1,22 @@
> >>>>>+Qualcomm A53 CPU Clock Controller Binding
> >>>>>+------------------------------------------------
> >>>>>+The A53 CPU Clock Controller is hardware, which provides a combined
> >>>>>+mux and divider functionality for the CPU clocks. It can choose between
> >>>>>+a fixed rate clock and the dedicated A53 PLL.
> >>>>>+
> >>>>>+Required properties :
> >>>>>+- compatible : shall contain:
> >>>>>+
> >>>>>+			"qcom,a53cc"
> >>>>>+
> >>>>>+- reg : shall contain base register location and length
> >>>>>+	of the APCS region
> >>>>>+- #clock-cells : shall contain 1
> >>>>>+
> >>>>>+Example:
> >>>>>+
> >>>>>+	apcs: syscon@b011000 {
> >>>>>+		compatible = "qcom,a53cc", "syscon";
> >>>>
> >>>>Why is it a syscon? Is that part used?
> >>>>
> >>>
> >>>I use the register at offset 8 for interrupting the other subsystems, so
> >>>this must be available as something I can poke.
> >>>
> >>>Which makes me think that this should be described as a "simple-mfd" and
> >>>"syscon" with the a53cc node as a child - grabbing the regmap of the
> >>>syscon parent, rather then ioremapping the same region again.
> >>>
> >>
> >>That's sort of a question for DT reviewers. The register space
> >>certainly seems like a free for all with a tilt toward power
> >>management of the CPU, similar to how this was done on Krait
> >>based designs.
> >>
> >
> >Right. But this kind of mashup blocks was the reason why simple-mfd was
> >put in place.
> >
> 
> Ok, thanks for the comments. Then i will make it look like this:
> 
> 	apcs: syscon@b011000 {
> 		compatible = "syscon", "simple-mfd";
> 		reg = <0x0b011000 0x1000>;
> 
> 		a53mux: clock {
> 			compatible = "qcom,msm8916-a53cc";
> 			#clock-cells = <1>;
> 		};
> 	};
> 
> Thanks,
> Georgi
> 
> >>I wonder why we didn't make up some provider/consumer binding for
> >>the "kicking" feature used by SMD/RPM code. Then this could be a
> >>clock provider and a "kick" provider (haha #kick-cells) and the
> >>usage of syscon/regmap wouldn't be mandatory.
> >>
> >
> >I did consider doing that, but had enough dependencies to put in place
> >as it was.
> >
> >I'm in favour of us inventing a kicker API and it's found outside out
> >use cases as well (e.g. virtio/rpmsg).
> >

I'd rather we did this kicker API as well. That way we don't need
to make a syscon and a simple-mfd to get software to work
properly. Don't other silicon vendors need a kicker API as well?
How are they kicking remote processors in other places? GPIOs?

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [RESEND/PATCH v6 3/3] clk: qcom: Add A53 clock driver
From: Stephen Boyd @ 2016-11-14 22:21 UTC (permalink / raw)
  To: Georgi Djakov
  Cc: Bjorn Andersson, mturquette-rdvid1DuHRBWk0Htik3J/w,
	linux-clk-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-msm-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Rob Herring
In-Reply-To: <549f87fe-7be9-14b4-8e34-86f7f8dad94e-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

On 11/11, Georgi Djakov wrote:
> On 11/03/2016 08:28 PM, Bjorn Andersson wrote:
> >On Wed 02 Nov 15:55 PDT 2016, Stephen Boyd wrote:
> >
> >>On 11/02, Bjorn Andersson wrote:
> >>>On Thu 27 Oct 18:54 PDT 2016, Stephen Boyd wrote:
> >>>
> >>>>On 10/19, Georgi Djakov wrote:
> >>>>>Add a driver for the A53 Clock Controller. It is a hardware block that
> >>>>>implements a combined mux and half integer divider functionality. It can
> >>>>>choose between a fixed-rate clock or the dedicated A53 PLL. The source
> >>>>>and the divider can be set both at the same time.
> >>>>>
> >>>>>This is required for enabling CPU frequency scaling on platforms like
> >>>>>MSM8916.
> >>>>>
> >>>>
> >>>>Please Cc DT reviewers.
> >>>>
> >>>>>Signed-off-by: Georgi Djakov <georgi.djakov-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
> >>>>>---
> >>>>> .../devicetree/bindings/clock/qcom,a53cc.txt       |  22 +++
> >>>>> drivers/clk/qcom/Kconfig                           |   8 ++
> >>>>> drivers/clk/qcom/Makefile                          |   1 +
> >>>>> drivers/clk/qcom/a53cc.c                           | 155 +++++++++++++++++++++
> >>>>> 4 files changed, 186 insertions(+)
> >>>>> create mode 100644 Documentation/devicetree/bindings/clock/qcom,a53cc.txt
> >>>>> create mode 100644 drivers/clk/qcom/a53cc.c
> >>>>>
> >>>>>diff --git a/Documentation/devicetree/bindings/clock/qcom,a53cc.txt b/Documentation/devicetree/bindings/clock/qcom,a53cc.txt
> >>>>>new file mode 100644
> >>>>>index 000000000000..a025f062f177
> >>>>>--- /dev/null
> >>>>>+++ b/Documentation/devicetree/bindings/clock/qcom,a53cc.txt
> >>>>>@@ -0,0 +1,22 @@
> >>>>>+Qualcomm A53 CPU Clock Controller Binding
> >>>>>+------------------------------------------------
> >>>>>+The A53 CPU Clock Controller is hardware, which provides a combined
> >>>>>+mux and divider functionality for the CPU clocks. It can choose between
> >>>>>+a fixed rate clock and the dedicated A53 PLL.
> >>>>>+
> >>>>>+Required properties :
> >>>>>+- compatible : shall contain:
> >>>>>+
> >>>>>+			"qcom,a53cc"
> >>>>>+
> >>>>>+- reg : shall contain base register location and length
> >>>>>+	of the APCS region
> >>>>>+- #clock-cells : shall contain 1
> >>>>>+
> >>>>>+Example:
> >>>>>+
> >>>>>+	apcs: syscon@b011000 {
> >>>>>+		compatible = "qcom,a53cc", "syscon";
> >>>>
> >>>>Why is it a syscon? Is that part used?
> >>>>
> >>>
> >>>I use the register at offset 8 for interrupting the other subsystems, so
> >>>this must be available as something I can poke.
> >>>
> >>>Which makes me think that this should be described as a "simple-mfd" and
> >>>"syscon" with the a53cc node as a child - grabbing the regmap of the
> >>>syscon parent, rather then ioremapping the same region again.
> >>>
> >>
> >>That's sort of a question for DT reviewers. The register space
> >>certainly seems like a free for all with a tilt toward power
> >>management of the CPU, similar to how this was done on Krait
> >>based designs.
> >>
> >
> >Right. But this kind of mashup blocks was the reason why simple-mfd was
> >put in place.
> >
> 
> Ok, thanks for the comments. Then i will make it look like this:
> 
> 	apcs: syscon@b011000 {
> 		compatible = "syscon", "simple-mfd";
> 		reg = <0x0b011000 0x1000>;
> 
> 		a53mux: clock {
> 			compatible = "qcom,msm8916-a53cc";
> 			#clock-cells = <1>;
> 		};
> 	};
> 
> Thanks,
> Georgi
> 
> >>I wonder why we didn't make up some provider/consumer binding for
> >>the "kicking" feature used by SMD/RPM code. Then this could be a
> >>clock provider and a "kick" provider (haha #kick-cells) and the
> >>usage of syscon/regmap wouldn't be mandatory.
> >>
> >
> >I did consider doing that, but had enough dependencies to put in place
> >as it was.
> >
> >I'm in favour of us inventing a kicker API and it's found outside out
> >use cases as well (e.g. virtio/rpmsg).
> >

I'd rather we did this kicker API as well. That way we don't need
to make a syscon and a simple-mfd to get software to work
properly. Don't other silicon vendors need a kicker API as well?
How are they kicking remote processors in other places? GPIOs?

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v6 2/9] tpm: replace symbolic permission with octal for securityfs files
From: Jarkko Sakkinen @ 2016-11-14 22:21 UTC (permalink / raw)
  To: Nayna Jain
  Cc: tpmdd-devel, peterhuewe, tpmdd, jgunthorpe, linux-kernel,
	linux-security-module
In-Reply-To: <1479117656-12403-3-git-send-email-nayna@linux.vnet.ibm.com>

On Mon, Nov 14, 2016 at 05:00:49AM -0500, Nayna Jain wrote:
> checkpatch.pl flags warning for symbolic permissions and suggests
> to replace with octal value.
> 
> This patch changes securityfs pseudo files permission
> to octal values in tpm_bios_log_setup().
> 
> Signed-off-by: Nayna Jain <nayna@linux.vnet.ibm.com>

Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>

/Jarkko

> ---
>  drivers/char/tpm/tpm_eventlog.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm_eventlog.c b/drivers/char/tpm/tpm_eventlog.c
> index 42b49c4..9467e31 100644
> --- a/drivers/char/tpm/tpm_eventlog.c
> +++ b/drivers/char/tpm/tpm_eventlog.c
> @@ -378,7 +378,7 @@ struct dentry **tpm_bios_log_setup(const char *name)
>  
>  	bin_file =
>  	    securityfs_create_file("binary_bios_measurements",
> -				   S_IRUSR | S_IRGRP, tpm_dir,
> +				   0440, tpm_dir,
>  				   (void *)&tpm_binary_b_measurements_seqops,
>  				   &tpm_bios_measurements_ops);
>  	if (is_bad(bin_file))
> @@ -386,7 +386,7 @@ struct dentry **tpm_bios_log_setup(const char *name)
>  
>  	ascii_file =
>  	    securityfs_create_file("ascii_bios_measurements",
> -				   S_IRUSR | S_IRGRP, tpm_dir,
> +				   0440, tpm_dir,
>  				   (void *)&tpm_ascii_b_measurements_seqops,
>  				   &tpm_bios_measurements_ops);
>  	if (is_bad(ascii_file))
> -- 
> 2.5.0
> 

^ permalink raw reply

* Re: [PATCH v6 2/9] tpm: replace symbolic permission with octal for securityfs files
From: Jarkko Sakkinen @ 2016-11-14 22:21 UTC (permalink / raw)
  To: Nayna Jain
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <1479117656-12403-3-git-send-email-nayna-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

On Mon, Nov 14, 2016 at 05:00:49AM -0500, Nayna Jain wrote:
> checkpatch.pl flags warning for symbolic permissions and suggests
> to replace with octal value.
> 
> This patch changes securityfs pseudo files permission
> to octal values in tpm_bios_log_setup().
> 
> Signed-off-by: Nayna Jain <nayna-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>

/Jarkko

> ---
>  drivers/char/tpm/tpm_eventlog.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm_eventlog.c b/drivers/char/tpm/tpm_eventlog.c
> index 42b49c4..9467e31 100644
> --- a/drivers/char/tpm/tpm_eventlog.c
> +++ b/drivers/char/tpm/tpm_eventlog.c
> @@ -378,7 +378,7 @@ struct dentry **tpm_bios_log_setup(const char *name)
>  
>  	bin_file =
>  	    securityfs_create_file("binary_bios_measurements",
> -				   S_IRUSR | S_IRGRP, tpm_dir,
> +				   0440, tpm_dir,
>  				   (void *)&tpm_binary_b_measurements_seqops,
>  				   &tpm_bios_measurements_ops);
>  	if (is_bad(bin_file))
> @@ -386,7 +386,7 @@ struct dentry **tpm_bios_log_setup(const char *name)
>  
>  	ascii_file =
>  	    securityfs_create_file("ascii_bios_measurements",
> -				   S_IRUSR | S_IRGRP, tpm_dir,
> +				   0440, tpm_dir,
>  				   (void *)&tpm_ascii_b_measurements_seqops,
>  				   &tpm_bios_measurements_ops);
>  	if (is_bad(ascii_file))
> -- 
> 2.5.0
> 

------------------------------------------------------------------------------

^ permalink raw reply

* Re: [PATCH v15 02/27] bisect: rewrite `check_term_format` shell function in C
From: Stephan Beyer @ 2016-11-14 22:20 UTC (permalink / raw)
  To: Pranit Bauva; +Cc: git
In-Reply-To: <01020157c38b1a74-ad2bcaff-0c92-4af4-9aa0-72d98f4945fc-000000@eu-west-1.amazonses.com>

Hi,

I saw in the recent "What's cooking" mail that this is still waiting
for review, so I thought I could interfere and help reviewing it from a
non-git-developer point of view.
But only two commits for today. The first one seems fine. The second
one makes me write this mail ;-)

On 10/14/2016 04:14 PM, Pranit Bauva wrote:
> +static int check_term_format(const char *term, const char *orig_term)
> +{
[...]
> +	if (one_of(term, "help", "start", "skip", "next", "reset",
> +			"visualize", "replay", "log", "run", NULL))
[... vs ...]
> -check_term_format () {
> -	term=$1
> -	git check-ref-format refs/bisect/"$term" ||
> -	die "$(eval_gettext "'\$term' is not a valid term")"
> -	case "$term" in
> -	help|start|terms|skip|next|reset|visualize|replay|log|run)

Is there a reasons why "terms" has been dropped from the list?

Best
Stephan

^ permalink raw reply

* Re: ceph v10.2.4 QE validation status
From: Yuri Weinstein @ 2016-11-14 22:20 UTC (permalink / raw)
  To: Ceph Development, Abhishek Varshney, Dachary, Loic, Dachary, Loic,
	Weil, Sage, Dillaman, Jason, Nathan Cutler, Ilya Dryomov,
	John Spray, Durgin, Josh
In-Reply-To: <CAMMFjmEkMUBCnZFqMr_kcfZgB5jm3CjVNfXpSAXT7ZCwMLQEfg@mail.gmail.com>

See updated status - http://tracker.ceph.com/issues/17487#note-32

Outstanding issues:

knfs - http://tracker.ceph.com/issues/16397 (same as in v10.2.3, Greg
pls review/approve, assumed Approved ?)

upgrade/hammer-x (jewel) - http://tracker.ceph.com/issues/17847 ((Sage
pls review/approve, seems persistent, but maybe not a showstopper)

upgrade/infernalis-x (jewel) - deprecated (Nathan is still working
to make it pass, see issues in the tacker summary above)

Sage, jewel 10.2.4 can be released as soon as you agree with the
findings/summary.

Thx
YuriW

On Fri, Nov 11, 2016 at 8:47 AM, Yuri Weinstein <yweinste@redhat.com> wrote:
> Detailed summary of the QE Validation can be found here
> http://tracker.ceph.com/issues/17487#note-32
>
> The following suites were in scope of this point release validation:
>
> rados (subset 35/50 297 jobs)
> rbd
> rgw
> fs
> krbd
> kcephfs
> knfs
> rest
> hadoop
> samba
> ceph-deploy
> ceph-disk
> upgrade/client-upgrade
> upgrade/hammer-x (jewel)
> upgrade/infernalis-x (jewel) - deprecated
> powercycle
> (please let me know if any suites are missing from this ^ list)
>
> ==============================
> Issues requiring approval/decision:
>
> fs - http://tracker.ceph.com/issues/17832 (Greg pls review/approve)
>
> krbd - http://tracker.ceph.com/issues/17221 (Jason, Ilya pls review/approve)
>
> knfs - http://tracker.ceph.com/issues/16397 (same as in v10.2.3, Greg
> pls review/approve)
>
> upgrade/hammer-x (jewel) - http://tracker.ceph.com/issues/17847 ((Sage
> pls review/approve)
>
> upgrade/infernalis-x (jewel) - deprecated, but Nathan is still working
> to make it pass (Sage pls review/approve)
>
> powercycle - in progress
>
> Thx
> YuriW

^ permalink raw reply

* [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.
From: dwilder @ 2016-11-14 22:20 UTC (permalink / raw)
  To: mtk.manpages; +Cc: linux-man, netdev

The example code in netlink(7) (for reading netlink message) suggests 
using
a 4k read buffer with recvmsg.  This can cause truncated messages on 
systems
using a page size is >4096.  Please see:
linux/include/linux/netlink.h (in the kernel source)

<snip>
/*
  *      skb should fit one page. This choice is good for headerless 
malloc.
  *      But we should limit to 8K so that userspace does not have to
  *      use enormous buffer sizes on recvmsg() calls just to avoid
  *      MSG_TRUNC when PAGE_SIZE is very large.
  */
#if PAGE_SIZE < 8192UL
#define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(PAGE_SIZE)
#else
#define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(8192UL)
#endif

#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
<snip>

I was troubleshooting some up-stream code on a ppc64le system
(page:size of 64k) This code had duplicated the example from netlink(7) 
and
was using a 4k buffer.  On x86-64 with a 4k page size this is not a 
problem,
however on the 64k page system some messages were truncated.  Using an 
8k buffer
as implied in netlink.h prevents problems with any page size.

Lets change the example so others don't propagate the problem further.

Signed-off-by David Wilder <dwilder@us.ibm.com>

--- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800
+++ man7/netlink.7      2016-11-14 13:30:51.002086354 -0800
@@ -511,7 +511,7 @@
  .in +4n
  .nf
  int len;
-char buf[4096];
+char buf[8192];
  struct iovec iov = { buf, sizeof(buf) };
  struct sockaddr_nl sa;
  struct msghdr msg;

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.