public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/6] CFS Bandwidth Control
@ 2010-04-28 11:16 Paul Turner
  2010-04-28 11:16 ` [PATCH v2 1/6] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

Hi all,

Please find attached v2 of our proposed approach for bandwidth provisioning
under CFS.  Bharata's original RFC motivating discussion on this topic can be
found at: http://lkml.org/lkml/2009/6/4/24

This is an evolution of our previous posting: http://lkml.org/lkml/2010/2/12/393
The improvements herein are incremental: hierarchical task tracking for better
load-balance under throttle conditions, statistics export for decision
guidance in user-space control systems, minor bugs fixed, and some code
clean-up.

The skeleton of our approach is as follows:
- We maintain a global (per-tg) pool of unassigned quota.  On it we track the
  bandwidth period, quota per period, and runtime remaining in the current
  period.  As bandwidth is used within a period it is decremented from
  runtime.  Runtime is currently synchronized using a spinlock.  In the current
  implementation there's no reason this couldn't be done using atomic ops
  instead; however, the spinlock allows for a little more flexibility in
  experimentation with other schemes.
- When a cfs_rq participating in a bandwidth-constrained task_group executes,
  it acquires time from the global pool in chunks of
  sysctl_sched_cfs_bandwidth_slice (default currently 10ms).  This synchronizes
  under rq->lock and is part of the update_curr path.
- Throttled entities are dequeued immediately.  Throttled entities are gated
  from participating in the tree at the {enqueue, dequeue}_entity level.
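
To make the pool/slice relationship above concrete, here is a minimal
user-space sketch.  It is not the kernel code: struct bw_pool and both helpers
are hypothetical names, locking is omitted, and it assumes a period refresh
simply resets the unclaimed runtime to the full quota.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of the per-tg pool; not the kernel structure. */
struct bw_pool {
	uint64_t quota_ns;	/* quota granted per period */
	uint64_t runtime_ns;	/* runtime still unclaimed in the current period */
};

/*
 * Hand out at most one slice from the pool.  The real code would do this
 * under a lock; the model is single-threaded.
 */
static uint64_t pool_take_slice(struct bw_pool *pool, uint64_t slice_ns)
{
	uint64_t grant;

	assert(slice_ns > 0);	/* a zero slice would never make progress */
	grant = pool->runtime_ns < slice_ns ? pool->runtime_ns : slice_ns;
	pool->runtime_ns -= grant;
	return grant;
}

/* Period boundary (assumed behaviour): unclaimed runtime returns to full quota. */
static void pool_refresh(struct bw_pool *pool)
{
	pool->runtime_ns = pool->quota_ns;
}
```

With a 25ms quota and a 10ms slice, successive requests in this model receive
10ms, 10ms, 5ms and then nothing until the next refresh.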

More details on the motivation and approach, as well as performance benchmark
results can be found in the original posting.

One caveat that bears discussion is that this leads to an alternate
specification of bandwidth versus the sched_rt case.  The defined bandwidth
becomes an absolute quantifier relative to the period and is agnostic of allowed
cpus.
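
As a worked illustration of this caveat, the sketch below expresses a
quota/period pair as a percentage of one cpu.  The helper name is hypothetical,
and the reading that a quota larger than the period simply means more than one
cpu's worth of time, regardless of how many cpus the group may run on, is our
interpretation rather than something the patches state in these terms.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bandwidth as a percentage of a single cpu.
 * Values above 100 mean more than one cpu's worth of time per period.
 */
static uint64_t bw_cpu_percent(uint64_t quota_us, uint64_t period_us)
{
	assert(period_us > 0);	/* a zero period is rejected by the interface */
	return quota_us * 100 / period_us;
}
```

For example, 250000us of quota over a 500000us period is 50% of one cpu, while
1000000us over the same period is 200%, whether the group is allowed on two
cpus or on sixteen.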

Open-questions:
- Is there any value in having the slice be tunable at the task-group level?
- I suspect 5ms may be a better default slice value; however, I have not had
  the opportunity to verify this yet.  There's also room for some dynamic range
  here.

Acknowledgements:
We would like to thank Bharata B Rao and Dhaval Giani for discussion and their
original proposal; many elements in this patchset are directly inspired by
their original posting.  Bharata has also been integral in the preparation of
this second version, providing valuable feedback and review.

Ken Chen also provided early review and comments.

Thanks,

- Paul and Nikhil
---

Nikhil Rao (1):
      sched: add exports tracking cfs bandwidth control statistics

Paul Turner (5):
      sched: introduce primitives to account for CFS bandwidth tracking
      sched: accumulate per-cfs_rq cpu usage
      sched: throttle cfs_rq entities which exceed their local quota
      sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
      sched: hierarchical task accounting for FAIR_GROUP_SCHED


 include/linux/sched.h |    4 +
 init/Kconfig          |    9 +
 kernel/sched.c        |  347 +++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched_fair.c   |  240 +++++++++++++++++++++++++++++++++-
 kernel/sched_rt.c     |   24 +--
 kernel/sysctl.c       |   10 +
 6 files changed, 585 insertions(+), 49 deletions(-)



* [PATCH v2 1/6] sched: introduce primitives to account for CFS bandwidth tracking
  2010-04-28 11:16 [PATCH v2 0/6] CFS Bandwidth Control Paul Turner
@ 2010-04-28 11:16 ` Paul Turner
  2010-04-28 11:16 ` [PATCH v2 2/6] sched: accumulate per-cfs_rq cpu usage Paul Turner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

In this patch we introduce the notion of CFS bandwidth.  To account for the
realities of SMP this is partitioned into globally unassigned bandwidth and
locally claimed bandwidth:
- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rqs can allocate from.  It uses the new cfs_bandwidth
  structure.
- The local bandwidth is tracked per cfs_rq; it represents allotments from
  the global pool of bandwidth assigned to a task_group.  It is tracked using
  the new quota_assigned and quota_used fields of struct cfs_rq.

Bandwidth is managed through cgroupfs via two new files in the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.
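
The unit handling behind these two files can be sketched in user space as
follows.  This mirrors what tg_set_cfs_quota() and tg_get_cfs_quota() in the
diff below appear to do (microseconds at the interface, nanoseconds internally,
a negative quota meaning "unlimited"), but the MODEL_* constants and function
names are stand-ins chosen for the example, and the overflow check is an
addition of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_PER_USEC	1000ULL
#define MODEL_RUNTIME_INF	UINT64_MAX	/* stand-in for the kernel's RUNTIME_INF */

/* Interface value (usecs, negative = unlimited) to internal nanoseconds. */
static uint64_t quota_us_to_ns(int64_t quota_us)
{
	if (quota_us < 0)
		return MODEL_RUNTIME_INF;

	/* guard against overflow; this check is specific to the sketch */
	assert((uint64_t)quota_us < MODEL_RUNTIME_INF / MODEL_NSEC_PER_USEC);
	return (uint64_t)quota_us * MODEL_NSEC_PER_USEC;
}

/* Internal nanoseconds back to the interface value; unlimited reads as -1. */
static int64_t quota_ns_to_us(uint64_t quota_ns)
{
	if (quota_ns == MODEL_RUNTIME_INF)
		return -1;
	return (int64_t)(quota_ns / MODEL_NSEC_PER_USEC);
}
```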

A per-cfs_bandwidth timer is also introduced to handle future refresh at
period expiration.  There's some minor refactoring here so that
start_bandwidth_timer() functionality can be shared with the rt bandwidth code.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 init/Kconfig        |    9 ++
 kernel/sched.c      |  271 +++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched_fair.c |   10 ++
 3 files changed, 268 insertions(+), 22 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index eb77e8c..971bc8e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -597,6 +597,15 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define quota and period for cpu
+	  bandwidth provisioning on a per-cgroup basis.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
diff --git a/kernel/sched.c b/kernel/sched.c
index 6af210a..96db602 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -191,10 +191,28 @@ static inline int rt_bandwidth_enabled(void)
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
 
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -202,22 +220,7 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -242,6 +245,15 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+struct cfs_bandwidth {
+	raw_spinlock_t		lock;
+	ktime_t			period;
+	u64			runtime, quota;
+	struct hrtimer		period_timer;
+};
+#endif
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -267,6 +279,10 @@ struct task_group {
 	struct task_group *parent;
 	struct list_head siblings;
 	struct list_head children;
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct cfs_bandwidth cfs_bandwidth;
+#endif
 };
 
 #define root_task_group init_task_group
@@ -404,9 +420,76 @@ struct cfs_rq {
 	 */
 	unsigned long rq_weight;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	u64 quota_assigned, quota_used;
+#endif
 #endif
 };
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = cfs_b->runtime = quota;
+	cfs_b->period = ns_to_ktime(period);
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
+}
+
+static
+void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->quota_used = 0;
+	if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
+		cfs_rq->quota_assigned = RUNTIME_INF;
+	else
+		cfs_rq->quota_assigned = 0;
+}
+
+static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	if (hrtimer_active(&cfs_b->period_timer))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
+#endif
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -1823,6 +1906,14 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 static const struct sched_class rt_sched_class;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.5s
+ */
+static u64 sched_cfs_bandwidth_period = 500000000ULL;
+#endif
+
 #define sched_class_highest (&rt_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
@@ -7620,6 +7711,9 @@ static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_rq_quota(cfs_rq);
+#endif
 	if (add)
 		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
@@ -7765,6 +7859,10 @@ void __init sched_init(void)
 		 * We achieve this by letting init_task_group's tasks sit
 		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 		 */
+#ifdef CONFIG_CFS_BANDWIDTH
+		init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
+				RUNTIME_INF, sched_cfs_bandwidth_period);
+#endif
 		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -7997,6 +8095,10 @@ static void free_fair_sched_group(struct task_group *tg)
 {
 	int i;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	destroy_cfs_bandwidth(&tg->cfs_bandwidth);
+#endif
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8024,7 +8126,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
-
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
+			sched_cfs_bandwidth_period);
+#endif
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);
 
@@ -8472,7 +8577,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8511,7 +8616,7 @@ int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8536,7 +8641,7 @@ int sched_group_set_rt_period(struct task_group *tg, long rt_period_us)
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -8743,6 +8848,116 @@ static u64 cpu_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
 
 	return (u64) tg->shares;
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &init_task_group)
+		return -EINVAL;
+
+	if (!period)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at least one tick of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (NS_TO_JIFFIES(quota) < 1)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
+	tg->cfs_bandwidth.period = ns_to_ktime(period);
+	tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
+	raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		init_cfs_rq_quota(cfs_rq);
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	if (cfs_runtime_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg->cfs_bandwidth.quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg->cfs_bandwidth.quota;
+	do_div(quota_us, NSEC_PER_USEC);
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg->cfs_bandwidth.quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8777,6 +8992,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5a5ea2c..a61bc24 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -360,6 +360,9 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	rb_link_node(&se->run_node, parent, link);
 	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+#ifdef CONFIG_CFS_BANDWIDTH
+	start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
+#endif
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1144,6 +1147,13 @@ static void yield_task_fair(struct rq *rq)
 	se->vruntime = rightmost->vruntime + 1;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	return 1;
+}
+#endif
+
 #ifdef CONFIG_SMP
 
 static void task_waking_fair(struct rq *rq, struct task_struct *p)



* [PATCH v2 2/6] sched: accumulate per-cfs_rq cpu usage
  2010-04-28 11:16 [PATCH v2 0/6] CFS Bandwidth Control Paul Turner
  2010-04-28 11:16 ` [PATCH v2 1/6] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2010-04-28 11:16 ` Paul Turner
  2010-04-28 11:17 ` [PATCH v2 3/6] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

Introduce account_cfs_rq_quota() to account bandwidth usage at the cfs_rq
level for task_groups to which bandwidth has been assigned.  Whether a group is
constrained is tracked by whether the local cfs_rq->quota_assigned is finite
or infinite (RUNTIME_INF).

For cfs_rqs that belong to a bandwidth-constrained task_group we introduce
tg_request_cfs_quota(), which attempts to allocate quota from the global pool
for use locally.  Updates involving the global pool are currently protected
by cfs_bandwidth->lock; local pools are protected by rq->lock.

This patch only attempts to assign and track quota; no action is taken in the
case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
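
The accounting described above can be modelled outside the kernel as below.
This is a simplified, single-threaded sketch: struct local_rq and
account_usage() are hypothetical names, the global pool is reduced to a bare
counter, and UINT64_MAX stands in for RUNTIME_INF.  The point it illustrates is
that quota_used and quota_assigned are cumulative counters, so usage can end up
ahead of the assignment once the global pool is empty.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-cfs_rq counters; both only ever grow. */
struct local_rq {
	uint64_t quota_assigned;
	uint64_t quota_used;
};

/* Charge delta_exec ns locally and top up from the global pool when dry. */
static void account_usage(struct local_rq *rq, uint64_t *global_runtime,
			  uint64_t slice, uint64_t delta_exec)
{
	uint64_t grant;

	assert(slice > 0);
	if (rq->quota_assigned == UINT64_MAX)	/* unconstrained group */
		return;

	rq->quota_used += delta_exec;
	if (rq->quota_used < rq->quota_assigned)
		return;

	/* local allotment exhausted: request up to one slice */
	grant = *global_runtime < slice ? *global_runtime : slice;
	*global_runtime -= grant;
	rq->quota_assigned += grant;
}
```

With 15ms in the global pool, a 10ms slice and 4ms of execution per call, the
fourth call in this model leaves quota_used at 16ms against 15ms assigned; at
this point in the series nothing reacts to that overrun.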

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
---
 include/linux/sched.h |    4 ++++
 kernel/sched.c        |   13 +++++++++++++
 kernel/sched_fair.c   |   50 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c       |   10 ++++++++++
 4 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..8603645 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1903,6 +1903,10 @@ int sched_rt_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 extern unsigned int sysctl_sched_compat_yield;
 
 #ifdef CONFIG_RT_MUTEXES
diff --git a/kernel/sched.c b/kernel/sched.c
index 96db602..3b53695 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1912,6 +1912,19 @@ static const struct sched_class rt_sched_class;
  * default: 0.5s
  */
 static u64 sched_cfs_bandwidth_period = 500000000ULL;
+
+/*
+ * default slice of quota to allocate from global tg to local cfs_rq pool on
+ * each refresh
+ * default: 10ms
+ */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
 #endif
 
 #define sched_class_highest (&rt_sched_class)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a61bc24..1db1991 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -267,6 +267,16 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec);
+#endif
+
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -546,6 +556,9 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+#ifdef CONFIG_CFS_BANDWIDTH
+	account_cfs_rq_quota(cfs_rq, delta_exec);
+#endif
 }
 
 static inline void
@@ -1148,6 +1161,43 @@ static void yield_task_fair(struct rq *rq)
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static u64 tg_request_cfs_quota(struct task_group *tg)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 delta = 0;
+
+	if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
+		raw_spin_lock(&cfs_b->lock);
+		/*
+		 * it's possible a bandwidth update has changed the global
+		 * pool.
+		 */
+		if (cfs_b->quota == RUNTIME_INF)
+			delta = sched_cfs_bandwidth_slice();
+		else {
+			delta = min(cfs_b->runtime,
+					sched_cfs_bandwidth_slice());
+			cfs_b->runtime -= delta;
+		}
+		raw_spin_unlock(&cfs_b->lock);
+	}
+	return delta;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec)
+{
+	if (cfs_rq->quota_assigned == RUNTIME_INF)
+		return;
+
+	cfs_rq->quota_used += delta_exec;
+
+	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+		return;
+
+	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	return 1;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8686b0f..d0e17ca 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -354,6 +354,16 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_PROVE_LOCKING
 	{
 		.procname	= "prove_locking",



* [PATCH v2 3/6] sched: throttle cfs_rq entities which exceed their local quota
  2010-04-28 11:16 [PATCH v2 0/6] CFS Bandwidth Control Paul Turner
  2010-04-28 11:16 ` [PATCH v2 1/6] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2010-04-28 11:16 ` [PATCH v2 2/6] sched: accumulate per-cfs_rq cpu usage Paul Turner
@ 2010-04-28 11:17 ` Paul Turner
  2010-04-28 11:17 ` [PATCH v2 4/6] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
cfs_rq's local quota and, in the event that it runs out, whether there is
global quota available to let it continue running.

This patch adds the required support for the latter case, throttling entities
until quota is available to run.  Throttling dequeues the entity in question
and sends a reschedule to the owning cpu so that it can be evicted.

The following restrictions apply to a throttled cfs_rq:
- It is dequeued from sched_entity hierarchy and restricted from being
  re-enqueued.  This means that new/waking children of this entity will be
  queued up to it, but not past it.
- It does not contribute to weight calculations in tg_shares_up.
- If the cfs_rq of the cpu we are trying to pull from is throttled, it is
  ignored by the load balancer in __load_balance_fair() and
  move_one_task_fair().
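
The weight restriction can be illustrated with a small stand-alone model.  The
struct and function here are hypothetical and only mimic the summation that
the tg_shares_up() hunk below performs: a throttled queue is counted with a
weight of zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two cfs_rq fields the summation looks at. */
struct model_cfs_rq {
	unsigned long load_weight;
	int throttled;
};

/* Sum per-cpu weights for a group, treating throttled queues as weightless. */
static unsigned long group_rq_weight(const struct model_cfs_rq *rqs, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(rqs != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += rqs[i].throttled ? 0 : rqs[i].load_weight;
	return sum;
}
```

For three queues weighted 1024, 2048 and 512 with the second one throttled,
the model yields 1536 rather than 3584.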

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   12 +++++++++-
 kernel/sched_fair.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3b53695..d072881 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -422,6 +422,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	u64 quota_assigned, quota_used;
+	int throttled;
 #endif
 #endif
 };
@@ -1647,6 +1648,8 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
 	}
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
 /*
  * Re-compute the task group their per cpu shares over the given domain.
  * This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1667,7 +1670,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
 	usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
 
 	for_each_cpu(i, sched_domain_span(sd)) {
-		weight = tg->cfs_rq[i]->load.weight;
+		/*
+		 * bandwidth throttled entities cannot contribute to load
+		 * balance
+		 */
+		if (!cfs_rq_throttled(tg->cfs_rq[i]))
+			weight = tg->cfs_rq[i]->load.weight;
+		else
+			weight = 0;
 		usd_rq_weight[i] = weight;
 
 		rq_weight += weight;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 1db1991..0e480ae 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -273,8 +273,18 @@ static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 	return &tg->cfs_bandwidth;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec);
+#else
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
 #endif
 
 
@@ -799,6 +809,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+
+	if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
+	     !group_cfs_rq(se)->nr_running))
+		return;
+
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP) {
@@ -835,6 +850,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
 	 */
 	update_curr(cfs_rq);
 
+	if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
+		return;
+
 	update_stats_dequeue(cfs_rq, se);
 	if (sleep) {
 #ifdef CONFIG_SCHEDSTATS
@@ -1086,6 +1104,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, bool head)
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		/* don't continue to enqueue if our parent is throttled */
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1105,8 +1126,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, sleep);
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight)
+		/*
+		 * Don't dequeue parent if it has other entities besides us,
+		 * or if it is throttled
+		 */
+		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
 			break;
 		sleep = 1;
 	}
@@ -1184,6 +1208,22 @@ static u64 tg_request_cfs_quota(struct task_group *tg)
 	return delta;
 }
 
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		dequeue_entity(cfs_rq, se, 1);
+		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
+			break;
+	}
+	cfs_rq->throttled = 1;
+}
+
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec)
 {
@@ -1192,10 +1232,16 @@ static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 
 	cfs_rq->quota_used += delta_exec;
 
-	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+	if (cfs_rq_throttled(cfs_rq) ||
+		cfs_rq->quota_used < cfs_rq->quota_assigned)
 		return;
 
 	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+
+	if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
+		throttle_cfs_rq(cfs_rq);
+		resched_task(cfs_rq->rq->curr);
+	}
 }
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
@@ -2057,9 +2103,10 @@ load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or throttled cfs_rq
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+				cfs_rq_throttled(busiest_cfs_rq))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;
@@ -2119,6 +2166,11 @@ static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		total_load_moved += load_moved;
 
 #ifdef CONFIG_PREEMPT
+	for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
+		/* skip throttled cfs_rq */
+		if (cfs_rq_throttled(busy_cfs_rq))
+			continue;
+
 		/*
 		 * NEWIDLE balancing is a source of latency, so preemptible
 		 * kernels will stop after the first task is pulled to minimize



* [PATCH v2 4/6] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2010-04-28 11:16 [PATCH v2 0/6] CFS Bandwidth Control Paul Turner
                   ` (2 preceding siblings ...)
  2010-04-28 11:17 ` [PATCH v2 3/6] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
@ 2010-04-28 11:17 ` Paul Turner
  2010-04-28 11:17 ` [PATCH v2 5/6] sched: add exports tracking cfs bandwidth control statistics Paul Turner
  2010-04-28 11:17 ` [PATCH v2 6/6] sched: hierarchical task accounting for FAIR_GROUP_SCHED Paul Turner
  5 siblings, 0 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

At the start of a new period there are several actions we must take:
- Refresh global bandwidth pool
- Unthrottle entities that ran out of quota, as refreshed bandwidth permits

Unthrottled entities have the cfs_rq->throttled flag cleared and are
re-enqueued into the cfs entity hierarchy.
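
One plausible reading of "as refreshed bandwidth permits" is sketched below in
user space.  Everything here is an assumption made for illustration: the names
are hypothetical, the refresh is modelled as resetting runtime to the full
quota, each throttled queue is offered at most one slice, and a queue is only
unthrottled once its cumulative usage is again below its assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cpu queue state for the model. */
struct model_rq {
	int throttled;
	uint64_t quota_assigned;
	uint64_t quota_used;
};

/*
 * Refill the pool, then walk the queues handing a slice to each throttled
 * one until the pool runs dry.  Returns how many queues were unthrottled.
 */
static size_t period_refresh(uint64_t quota, uint64_t slice, uint64_t *runtime,
			     struct model_rq *rqs, size_t n)
{
	size_t i, woken = 0;

	assert(slice > 0);
	*runtime = quota;

	for (i = 0; i < n && *runtime > 0; i++) {
		uint64_t grant;

		if (!rqs[i].throttled)
			continue;

		grant = *runtime < slice ? *runtime : slice;
		*runtime -= grant;
		rqs[i].quota_assigned += grant;

		/* only wake the queue if the grant covers its arrears */
		if (rqs[i].quota_used < rqs[i].quota_assigned) {
			rqs[i].throttled = 0;
			woken++;
		}
	}
	return woken;
}
```

With a 15ms quota, a 10ms slice and three throttled queues, the first two are
unthrottled in this model and the third stays throttled because the pool is
already empty when the walk reaches it.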

sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
since it is now shared by both cfs and rt bandwidth period timers.

The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
rd->span instead of cpu_online_mask since I think that was incorrect before
(don't want to hit cpu's outside of your root_domain for RT bandwidth).

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   16 +++++++++++++
 kernel/sched_fair.c |   63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_rt.c   |   19 +--------------
 3 files changed, 79 insertions(+), 19 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d072881..aca1d32 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1529,6 +1529,8 @@ static int tg_nop(struct task_group *tg, void *data)
 }
 #endif
 
+static inline const struct cpumask *sched_bw_period_mask(void);
+
 #ifdef CONFIG_SMP
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
@@ -1916,6 +1918,18 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 static const struct sched_class rt_sched_class;
 
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_rq(smp_processor_id())->rd->span;
+}
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_online_mask;
+}
+#endif
+
 #ifdef CONFIG_CFS_BANDWIDTH
 /*
  * default period for cfs group bandwidth.
@@ -8904,6 +8918,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		init_cfs_rq_quota(cfs_rq);
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 	mutex_unlock(&mutex);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 0e480ae..11de5de 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -268,6 +268,13 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static inline
+struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
+{
+	return container_of(cfs_b, struct task_group,
+			cfs_bandwidth)->cfs_rq[cpu];
+}
+
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 {
 	return &tg->cfs_bandwidth;
@@ -1224,6 +1231,24 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->throttled = 1;
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	cfs_rq->throttled = 0;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			break;
+
+		cfs_rq = cfs_rq_of(se);
+		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+}
+
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec)
 {
@@ -1246,8 +1271,44 @@ static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	return 1;
+	int i, idle = 1;
+	u64 delta;
+	const struct cpumask *span;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return 1;
+
+	/* reset group quota */
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->runtime = cfs_b->quota;
+	raw_spin_unlock(&cfs_b->lock);
+
+	span = sched_bw_period_mask();
+	for_each_cpu(i, span) {
+		struct rq *rq = cpu_rq(i);
+		struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
+
+		if (cfs_rq->nr_running)
+			idle = 0;
+
+		if (!cfs_rq_throttled(cfs_rq))
+			continue;
+
+		delta = tg_request_cfs_quota(cfs_rq->tg);
+
+		if (delta) {
+			raw_spin_lock(&rq->lock);
+			cfs_rq->quota_assigned += delta;
+
+			if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+				unthrottle_cfs_rq(cfs_rq);
+			raw_spin_unlock(&rq->lock);
+		}
+	}
+
+	return idle;
 }
+
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index b5b920a..15bbc45 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -241,18 +241,6 @@ static int rt_se_boosted(struct sched_rt_entity *rt_se)
 	return p->prio != p->normal_prio;
 }
 
-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-#endif
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -302,11 +290,6 @@ static inline int rt_rq_throttled(struct rt_rq *rt_rq)
 	return rt_rq->rt_throttled;
 }
 
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -524,7 +507,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun)
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return 1;
 
-	span = sched_rt_period_mask();
+	span = sched_bw_period_mask();
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);



* [PATCH v2 5/6] sched: add exports tracking cfs bandwidth control statistics
  2010-04-28 11:16 [PATCH v2 0/6] CFS Bandwidth Control Paul Turner
                   ` (3 preceding siblings ...)
  2010-04-28 11:17 ` [PATCH v2 4/6] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2010-04-28 11:17 ` Paul Turner
  2010-04-28 11:17 ` [PATCH v2 6/6] sched: hierarchical task accounting for FAIR_GROUP_SCHED Paul Turner
  5 siblings, 0 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system.  These are
added through a stat file similar to that exported by other subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	number of the above periods in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
		this group
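
For user-space consumers, a minimal sketch of how these three values might be
read follows.  It assumes the stat file is rendered as one "key value" pair
per line (the usual shape of a read_map file, but an assumption here) and
parses from an in-memory buffer so that no cgroup mount path has to be
guessed.  struct cfs_stat_sample and both helpers are hypothetical names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical snapshot of the three exported counters. */
struct cfs_stat_sample {
	uint64_t nr_periods;
	uint64_t nr_throttled;
	uint64_t throttled_time;
};

/*
 * Parse "key value" lines from text.  Unknown keys are ignored.
 * Returns 0 if all three counters were found, -1 otherwise.
 */
static int parse_cfs_stat(const char *text, struct cfs_stat_sample *out)
{
	unsigned int seen = 0;

	assert(text != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	while (*text) {
		char key[32];
		uint64_t val;
		const char *nl;

		if (sscanf(text, "%31s %" SCNu64, key, &val) == 2) {
			if (!strcmp(key, "nr_periods")) {
				out->nr_periods = val;
				seen |= 1;
			} else if (!strcmp(key, "nr_throttled")) {
				out->nr_throttled = val;
				seen |= 2;
			} else if (!strcmp(key, "throttled_time")) {
				out->throttled_time = val;
				seen |= 4;
			}
		}

		/* advance to the start of the next line, if any */
		nl = strchr(text, '\n');
		if (!nl)
			break;
		text = nl + 1;
	}
	return seen == 7 ? 0 : -1;
}

/* Share of periods in which the group was throttled, in whole percent. */
static unsigned int throttled_period_pct(const struct cfs_stat_sample *s)
{
	if (s->nr_periods == 0)
		return 0;
	return (unsigned int)(s->nr_throttled * 100 / s->nr_periods);
}
```

A control loop could sample these counters periodically and raise the group's
quota when the throttled share of periods stays high.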

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched.c      |   26 ++++++++++++++++++++++++++
 kernel/sched_fair.c |   19 ++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index aca1d32..ac74d3a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -251,6 +251,11 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			runtime, quota;
 	struct hrtimer		period_timer;
+
+	/* throttle statistics */
+	u64			nr_periods;
+	u64			nr_throttled;
+	u64			throttled_time;
 };
 #endif
 
@@ -423,6 +428,7 @@ struct cfs_rq {
 #ifdef CONFIG_CFS_BANDWIDTH
 	u64 quota_assigned, quota_used;
 	int throttled;
+	u64 throttled_timestamp;
 #endif
 #endif
 };
@@ -460,6 +466,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
 
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+
+	cfs_b->nr_periods = 0;
+	cfs_b->nr_throttled = 0;
+	cfs_b->throttled_time = 0;
 }
 
 static
@@ -8996,6 +9006,18 @@ static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9042,6 +9064,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 11de5de..edea44e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1229,15 +1229,26 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 			break;
 	}
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
 }
 
 static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct rq *rq = rq_of(cfs_rq);
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
+	/* update stats */
+	update_rq_clock(rq);
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
+	raw_spin_unlock(&cfs_b->lock);
+
 	cfs_rq->throttled = 0;
+	cfs_rq->throttled_timestamp = 0;
+
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -1271,7 +1282,7 @@ static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int i, idle = 1;
+	int i, idle = 1, num_throttled = 0;
 	u64 delta;
 	const struct cpumask *span;
 
@@ -1293,6 +1304,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 
 		if (!cfs_rq_throttled(cfs_rq))
 			continue;
+		num_throttled++;
 
 		delta = tg_request_cfs_quota(cfs_rq->tg);
 
@@ -1306,6 +1318,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 		}
 	}
 
+	/* update throttled stats */
+	cfs_b->nr_periods++;
+	if (num_throttled)
+		cfs_b->nr_throttled++;
+
 	return idle;
 }
 



* [PATCH v2 6/6] sched: hierarchical task accounting for FAIR_GROUP_SCHED
  2010-04-28 11:16 [PATCH v2 0/6] CFS Bandwidth Control Paul Turner
                   ` (4 preceding siblings ...)
  2010-04-28 11:17 ` [PATCH v2 5/6] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2010-04-28 11:17 ` Paul Turner
  5 siblings, 0 replies; 7+ messages in thread
From: Paul Turner @ 2010-04-28 11:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul Menage, Srivatsa Vaddagiri, Dhaval Giani, Gautham R Shenoy,
	Kamalesh Babulal, Herbert Poetzl, Balbir Singh, Chris Friesen,
	Avi Kivity, Bharata B Rao, Nikhil Rao, Ingo Molnar,
	Pavel Emelyanov, Mike Waychison, Vaidyanathan Srinivasan,
	Peter Zijlstra

With task entities participating in throttled sub-trees, it is possible for
task activation/de-activation not to produce root-visible changes to
rq->nr_running.  This in turn leads to incorrect idle and weight-per-task
load-balance decisions.

To allow correct accounting we move responsibility for updating rq->nr_running
to the respective sched classes.  In the fair-group case this update is
hierarchical, tracking the number of active tasks rooted at each group entity.
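
The propagation rule can be modelled in user space roughly as below.  This is
a simplified sketch, not the patch's code: each toy_cfs_rq points directly at
its parent queue instead of walking sched entities, and all toy_* names are
hypothetical.  A change walks upwards, adding delta to each ancestor's
h_nr_tasks, and stops at the first throttled group; only a walk that reaches
the root adjusts rq->nr_running.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rq. */
struct toy_rq {
	long nr_running;
};

/* Hypothetical stand-in for a group's cfs_rq on one cpu. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root queue */
	long h_nr_tasks;		/* tasks rooted at this queue */
	int throttled;
};

/* Push delta from q to its ancestors; a throttled group hides it. */
static void toy_propagate(struct toy_rq *rq, struct toy_cfs_rq *q, long delta)
{
	assert(rq != NULL && q != NULL);

	while (q->parent) {
		if (q->throttled)
			return;		/* not visible above this group */
		q = q->parent;
		q->h_nr_tasks += delta;
	}
	/* reached the root: the change is globally visible */
	rq->nr_running += delta;
}

/* Task enqueue (delta = 1) or dequeue (delta = -1) on queue q. */
static void toy_account_task(struct toy_rq *rq, struct toy_cfs_rq *q,
			     long delta)
{
	q->h_nr_tasks += delta;
	toy_propagate(rq, q, delta);
}

/* Remove the whole sub-tree's tasks from the ancestors, then throttle. */
static void toy_throttle(struct toy_rq *rq, struct toy_cfs_rq *q)
{
	toy_propagate(rq, q, -q->h_nr_tasks);
	q->throttled = 1;
}

/* Clear the flag first so the sub-tree's tasks become visible again. */
static void toy_unthrottle(struct toy_rq *rq, struct toy_cfs_rq *q)
{
	q->throttled = 0;
	toy_propagate(rq, q, q->h_nr_tasks);
}
```

In this model a task enqueued on a throttled group still bumps that group's
own h_nr_tasks, so the count restored at unthrottle time includes it.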

Note: technically this issue also exists with the existing sched_rt
throttling; however due to the nearly complete provisioning of system
resources for rt scheduling this is much less common by default.
---
 kernel/sched.c      |    9 ++++++---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_rt.c   |    5 ++++-
 3 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ac74d3a..87fb0c0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -368,7 +368,7 @@ static inline struct task_group *task_group(struct task_struct *p)
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_tasks;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1967,6 +1967,11 @@ static inline u64 sched_cfs_bandwidth_slice(void)
 
 #include "sched_stats.h"
 
+static void mod_nr_running(struct rq *rq, long delta)
+{
+	rq->nr_running += delta;
+}
+
 static void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
@@ -2042,7 +2047,6 @@ static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, wakeup, false);
-	inc_nr_running(rq);
 }
 
 /*
@@ -2054,7 +2058,6 @@ static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep)
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, sleep);
-	dec_nr_running(rq);
 }
 
 #include "sched_idletask.c"
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index edea44e..eb6ed15 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -76,6 +76,8 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
  */
 unsigned int __read_mostly sysctl_sched_compat_yield;
 
+static void account_hier_tasks(struct sched_entity *se, int delta);
+
 /*
  * SCHED_OTHER wake-up granularity.
  * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
@@ -682,6 +684,40 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	se->on_rq = 0;
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/* maintain hierarchical task counts on group entities */
+static void account_hier_tasks(struct sched_entity *se, int delta)
+{
+	struct rq *rq = rq_of(cfs_rq_of(se));
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		/* a throttled entity cannot affect its parent hierarchy */
+		if (group_cfs_rq(se) && cfs_rq_throttled(group_cfs_rq(se)))
+			break;
+
+		/* we affect our queuing entity */
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_tasks += delta;
+	}
+
+	/* account for global nr_running delta to hierarchy change */
+	if (!se)
+		mod_nr_running(rq, delta);
+}
+#else
+/*
+ * In the absence of group throttling, all operations are guaranteed to be
+ * globally visible at the root rq level.
+ */
+static void account_hier_tasks(struct sched_entity *se, int delta)
+{
+	struct rq *rq = rq_of(cfs_rq_of(se));
+
+	mod_nr_running(rq, delta);
+}
+#endif
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -1117,6 +1153,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, bool head)
 		flags = ENQUEUE_WAKEUP;
 	}
 
+	account_hier_tasks(&p->se, 1);
 	hrtick_update(rq);
 }
 
@@ -1142,6 +1179,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep)
 		sleep = 1;
 	}
 
+	account_hier_tasks(&p->se, -1);
 	hrtick_update(rq);
 }
 
@@ -1215,12 +1253,15 @@ static u64 tg_request_cfs_quota(struct task_group *tg)
 	return delta;
 }
 
+static void account_hier_tasks(struct sched_entity *se, int delta);
+
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
+	account_hier_tasks(se, -cfs_rq->h_nr_tasks);
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -1249,6 +1290,7 @@ static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->throttled = 0;
 	cfs_rq->throttled_timestamp = 0;
 
+	account_hier_tasks(se, cfs_rq->h_nr_tasks);
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 15bbc45..c908bc0 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -882,6 +882,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, bool head)
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -892,6 +894,8 @@ static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1758,4 +1762,3 @@ static void print_rt_stats(struct seq_file *m, int cpu)
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-

