* [PATCH v3 0/7] CFS Bandwidth Control
@ 2010-10-12 7:49 Bharata B Rao
From: Bharata B Rao @ 2010-10-12 7:49 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
Hi,
It's been a while since we posted the CFS hard limits (now known as CFS
bandwidth control) patches, hence a quick recap first:
- I have been working on CFS hard limits since last year and have posted
a few versions of the same (last post: http://lkml.org/lkml/2010/1/5/44)
- Paul Turner and Nikhil Rao meanwhile started working on CFS bandwidth
control and have posted a couple of versions.
(last post v2: http://lwn.net/Articles/385055/)
Paul's approach mainly changed the way the CPU hard limit is represented. After
his post, I decided to work with them and discontinue my patch series, since
their global per-group bandwidth specification appears more flexible than
the RT-type per-cpu bandwidth specification I had in my series.
Since Paul seems to be busy, I am taking the liberty of posting the next
version of his patches with a few enhancements to the slack time handling.
(more on this later)
Main changes in this post:
- Return the unused and remaining local quota at each CPU to the global
runtime pool.
- A few fixes:
- Explicitly wake up the idle cpu during unthrottle.
- Optimally handle throttling of current task within enqueue_task.
- Fix compilation break with CONFIG_PREEMPT on.
- Fix a compilation break at intermediate patch level.
- Applies on 2.6.36-rc7.
More about slack time issue
---------------------------
Bandwidth available to a group is specified by two parameters: cpu.cfs_quota_us
and cpu.cfs_period_us. cpu.cfs_quota_us is the max CPU time a group can
consume within the time period of cpu.cfs_period_us. The quota available
to a group within a period is maintained in a per-group global pool. In each
CPU, the consumption happens by obtaining a slice of this global pool.
If the local quota (obtained as slices of the global pool) isn't fully consumed
within a given period, a group can potentially get more CPU time than it is
allowed in the next interval. This is due to the slack time that may be left
over from the previous interval. More details about how this is fixed are
present in the description of patch 7/7. Here I will only show the benefit
of handling the slack time correctly through this experiment:
On a 16 CPU system, two different kinds of tasks were run as part of a group
whose quota/period was 500000/500000 [=> 500ms/500ms], which means that
the group was capped at 1 CPU worth of time every period.
Type A task: A complete CPU hog.
Type B task: Sleeps for 500ms, then runs as a CPU hog for the next 500ms;
this cycle repeats.
1 task of type A and 15 tasks of type B were run for 20s, each bound to a
different CPU. At the end of 20s, the CPU time obtained by each of them
looked like this:
-----------------------------------------------------------------------
Without returning Returning slack time
slack time to global to global pool
pool (with patch 7/7)
-----------------------------------------------------------------------
1 type A task 7.96s 10.71s
15 type B tasks 12.36s 9.79s
-----------------------------------------------------------------------
This shows the effects of slack time and the benefit of handling it correctly.
I request comments from the scheduler maintainers and others on these patches.
Regards,
Bharata.
* [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
From: Bharata B Rao @ 2010-10-12 7:50 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
sched: introduce primitives to account for CFS bandwidth tracking
From: Paul Turner <pjt@google.com>
In this patch we introduce the notion of CFS bandwidth. To account for the
realities of SMP, this is partitioned into globally unassigned bandwidth and
locally claimed bandwidth:
- The global bandwidth is per task_group; it represents a pool of unclaimed
bandwidth that cfs_rq's can allocate from. It is tracked using the new
cfs_bandwidth structure.
- The local bandwidth is tracked per-cfs_rq; it represents allotments from
the global pool of bandwidth assigned to the task_group.
Bandwidth is managed via cgroupfs through two new files in the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg is allowed
to consume over the period above.
A per-cfs_bandwidth timer is also introduced to handle future refresh at
period expiration. There is some minor refactoring here so that the
start_bandwidth_timer() functionality can be shared.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
init/Kconfig | 9 +
kernel/sched.c | 271 +++++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched_fair.c | 10 +
3 files changed, 268 insertions(+), 22 deletions(-)
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -609,6 +609,15 @@ config FAIR_GROUP_SCHED
depends on CGROUP_SCHED
default CGROUP_SCHED
+config CFS_BANDWIDTH
+ bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+ depends on EXPERIMENTAL
+ depends on FAIR_GROUP_SCHED
+ default n
+ help
+ This option allows users to define quota and period for cpu
+ bandwidth provisioning on a per-cgroup basis.
+
config RT_GROUP_SCHED
bool "Group scheduling for SCHED_RR/FIFO"
depends on EXPERIMENTAL
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -192,10 +192,28 @@ static inline int rt_bandwidth_enabled(v
return sysctl_sched_rt_runtime >= 0;
}
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
{
- ktime_t now;
+ unsigned long delta;
+ ktime_t soft, hard, now;
+
+ for (;;) {
+ if (hrtimer_active(period_timer))
+ break;
+ now = hrtimer_cb_get_time(period_timer);
+ hrtimer_forward(period_timer, now, period);
+
+ soft = hrtimer_get_softexpires(period_timer);
+ hard = hrtimer_get_expires(period_timer);
+ delta = ktime_to_ns(ktime_sub(hard, soft));
+ __hrtimer_start_range_ns(period_timer, soft, delta,
+ HRTIMER_MODE_ABS_PINNED, 0);
+ }
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
return;
@@ -203,22 +221,7 @@ static void start_rt_bandwidth(struct rt
return;
raw_spin_lock(&rt_b->rt_runtime_lock);
- for (;;) {
- unsigned long delta;
- ktime_t soft, hard;
-
- if (hrtimer_active(&rt_b->rt_period_timer))
- break;
-
- now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
- hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
- soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
- hard = hrtimer_get_expires(&rt_b->rt_period_timer);
- delta = ktime_to_ns(ktime_sub(hard, soft));
- __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
- HRTIMER_MODE_ABS_PINNED, 0);
- }
+ start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
raw_spin_unlock(&rt_b->rt_runtime_lock);
}
@@ -243,6 +246,15 @@ struct cfs_rq;
static LIST_HEAD(task_groups);
+#ifdef CONFIG_CFS_BANDWIDTH
+struct cfs_bandwidth {
+ raw_spinlock_t lock;
+ ktime_t period;
+ u64 runtime, quota;
+ struct hrtimer period_timer;
+};
+#endif
+
/* task group related information */
struct task_group {
struct cgroup_subsys_state css;
@@ -268,6 +280,10 @@ struct task_group {
struct task_group *parent;
struct list_head siblings;
struct list_head children;
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct cfs_bandwidth cfs_bandwidth;
+#endif
};
#define root_task_group init_task_group
@@ -369,9 +385,76 @@ struct cfs_rq {
*/
unsigned long rq_weight;
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ u64 quota_assigned, quota_used;
+#endif
#endif
};
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+ struct cfs_bandwidth *cfs_b =
+ container_of(timer, struct cfs_bandwidth, period_timer);
+ ktime_t now;
+ int overrun;
+ int idle = 0;
+
+ for (;;) {
+ now = hrtimer_cb_get_time(timer);
+ overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+ if (!overrun)
+ break;
+
+ idle = do_sched_cfs_period_timer(cfs_b, overrun);
+ }
+
+ return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
+{
+ raw_spin_lock_init(&cfs_b->lock);
+ cfs_b->quota = cfs_b->runtime = quota;
+ cfs_b->period = ns_to_ktime(period);
+
+ hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cfs_b->period_timer.function = sched_cfs_period_timer;
+}
+
+static
+void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+ cfs_rq->quota_used = 0;
+ if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
+ cfs_rq->quota_assigned = RUNTIME_INF;
+ else
+ cfs_rq->quota_assigned = 0;
+}
+
+static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+ if (cfs_b->quota == RUNTIME_INF)
+ return;
+
+ if (hrtimer_active(&cfs_b->period_timer))
+ return;
+
+ raw_spin_lock(&cfs_b->lock);
+ start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+ raw_spin_unlock(&cfs_b->lock);
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+ hrtimer_cancel(&cfs_b->period_timer);
+}
+#endif
+
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
@@ -1840,6 +1923,14 @@ static inline void __set_task_cpu(struct
static const struct sched_class rt_sched_class;
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.5s
+ */
+static u64 sched_cfs_bandwidth_period = 500000000ULL;
+#endif
+
#define sched_class_highest (&rt_sched_class)
#define for_each_class(class) \
for (class = sched_class_highest; class; class = class->next)
@@ -7644,6 +7735,9 @@ static void init_tg_cfs_entry(struct tas
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
cfs_rq->tg = tg;
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_rq_quota(cfs_rq);
+#endif
if (add)
list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
@@ -7789,6 +7883,10 @@ void __init sched_init(void)
* We achieve this by letting init_task_group's tasks sit
* directly in rq->cfs (i.e init_task_group->se[] = NULL).
*/
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
+ RUNTIME_INF, sched_cfs_bandwidth_period);
+#endif
init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
#endif
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -8032,6 +8130,10 @@ static void free_fair_sched_group(struct
{
int i;
+#ifdef CONFIG_CFS_BANDWIDTH
+ destroy_cfs_bandwidth(&tg->cfs_bandwidth);
+#endif
+
for_each_possible_cpu(i) {
if (tg->cfs_rq)
kfree(tg->cfs_rq[i]);
@@ -8059,7 +8161,10 @@ int alloc_fair_sched_group(struct task_g
goto err;
tg->shares = NICE_0_LOAD;
-
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
+ sched_cfs_bandwidth_period);
+#endif
for_each_possible_cpu(i) {
rq = cpu_rq(i);
@@ -8505,7 +8610,7 @@ static int __rt_schedulable(struct task_
return walk_tg_tree(tg_schedulable, tg_nop, &data);
}
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
u64 rt_period, u64 rt_runtime)
{
int i, err = 0;
@@ -8544,7 +8649,7 @@ int sched_group_set_rt_runtime(struct ta
if (rt_runtime_us < 0)
rt_runtime = RUNTIME_INF;
- return tg_set_bandwidth(tg, rt_period, rt_runtime);
+ return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}
long sched_group_rt_runtime(struct task_group *tg)
@@ -8569,7 +8674,7 @@ int sched_group_set_rt_period(struct tas
if (rt_period == 0)
return -EINVAL;
- return tg_set_bandwidth(tg, rt_period, rt_runtime);
+ return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
}
long sched_group_rt_period(struct task_group *tg)
@@ -8776,6 +8881,116 @@ static u64 cpu_shares_read_u64(struct cg
return (u64) tg->shares;
}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+ int i;
+ static DEFINE_MUTEX(mutex);
+
+ if (tg == &init_task_group)
+ return -EINVAL;
+
+ if (!period)
+ return -EINVAL;
+
+ /*
+ * Ensure we have at least one tick of bandwidth every period. This is
+ * to prevent reaching a state of large arrears when throttled via
+ * entity_tick() resulting in prolonged exit starvation.
+ */
+ if (NS_TO_JIFFIES(quota) < 1)
+ return -EINVAL;
+
+ mutex_lock(&mutex);
+ raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
+ tg->cfs_bandwidth.period = ns_to_ktime(period);
+ tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
+ raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
+
+ for_each_possible_cpu(i) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+ struct rq *rq = rq_of(cfs_rq);
+
+ raw_spin_lock_irq(&rq->lock);
+ init_cfs_rq_quota(cfs_rq);
+ raw_spin_unlock_irq(&rq->lock);
+ }
+ mutex_unlock(&mutex);
+
+ return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
+{
+ u64 quota, period;
+
+ period = ktime_to_ns(tg->cfs_bandwidth.period);
+ if (cfs_runtime_us < 0)
+ quota = RUNTIME_INF;
+ else
+ quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
+
+ return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+ u64 quota_us;
+
+ if (tg->cfs_bandwidth.quota == RUNTIME_INF)
+ return -1;
+
+ quota_us = tg->cfs_bandwidth.quota;
+ do_div(quota_us, NSEC_PER_USEC);
+ return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+ u64 quota, period;
+
+ period = (u64)cfs_period_us * NSEC_PER_USEC;
+ quota = tg->cfs_bandwidth.quota;
+
+ if (period <= 0)
+ return -EINVAL;
+
+ return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+ u64 cfs_period_us;
+
+ cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
+ do_div(cfs_period_us, NSEC_PER_USEC);
+ return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+ s64 cfs_quota_us)
+{
+ return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+ return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+ u64 cfs_period_us)
+{
+ return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
@@ -8810,6 +9025,18 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_shares_write_u64,
},
#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .name = "cfs_quota_us",
+ .read_s64 = cpu_cfs_quota_read_s64,
+ .write_s64 = cpu_cfs_quota_write_s64,
+ },
+ {
+ .name = "cfs_period_us",
+ .read_u64 = cpu_cfs_period_read_u64,
+ .write_u64 = cpu_cfs_period_write_u64,
+ },
+#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
.name = "rt_runtime_us",
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -360,6 +360,9 @@ static void __enqueue_entity(struct cfs_
rb_link_node(&se->run_node, parent, link);
rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+#ifdef CONFIG_CFS_BANDWIDTH
+ start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
+#endif
}
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1126,6 +1129,13 @@ static void yield_task_fair(struct rq *r
se->vruntime = rightmost->vruntime + 1;
}
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+ return 1;
+}
+#endif
+
#ifdef CONFIG_SMP
static void task_waking_fair(struct rq *rq, struct task_struct *p)
* [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
From: Bharata B Rao @ 2010-10-12 7:51 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
sched: accumulate per-cfs_rq cpu usage
From: Paul Turner <pjt@google.com>
Introduce account_cfs_rq_quota() to account bandwidth usage on the cfs_rq
level versus task_groups for which bandwidth has been assigned. This is
tracked by whether the local cfs_rq->quota_assigned is finite or infinite
(RUNTIME_INF).
For cfs_rq's that belong to a bandwidth constrained task_group we introduce
tg_request_cfs_quota() which attempts to allocate quota from the global pool
for use locally. Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local pools are protected by rq->lock.
This patch only attempts to assign and track quota, no action is taken in the
case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
include/linux/sched.h | 4 ++++
kernel/sched.c | 13 +++++++++++++
kernel/sched_fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 10 ++++++++++
4 files changed, 77 insertions(+)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1898,6 +1898,10 @@ int sched_rt_handler(struct ctl_table *t
void __user *buffer, size_t *lenp,
loff_t *ppos);
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
extern unsigned int sysctl_sched_compat_yield;
#ifdef CONFIG_RT_MUTEXES
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1929,6 +1929,19 @@ static const struct sched_class rt_sched
* default: 0.5s
*/
static u64 sched_cfs_bandwidth_period = 500000000ULL;
+
+/*
+ * default slice of quota to allocate from global tg to local cfs_rq pool on
+ * each refresh
+ * default: 10ms
+ */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
+
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+ return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
#endif
#define sched_class_highest (&rt_sched_class)
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -267,6 +267,16 @@ find_matching_se(struct sched_entity **s
#endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+ return &tg->cfs_bandwidth;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+ unsigned long delta_exec);
+#endif
+
/**************************************************************
* Scheduling class tree data structure manipulation methods:
@@ -547,6 +557,9 @@ static void update_curr(struct cfs_rq *c
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
+#ifdef CONFIG_CFS_BANDWIDTH
+ account_cfs_rq_quota(cfs_rq, delta_exec);
+#endif
}
static inline void
@@ -1130,6 +1143,43 @@ static void yield_task_fair(struct rq *r
}
#ifdef CONFIG_CFS_BANDWIDTH
+static u64 tg_request_cfs_quota(struct task_group *tg)
+{
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+ u64 delta = 0;
+
+ if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
+ raw_spin_lock(&cfs_b->lock);
+ /*
+ * it's possible a bandwidth update has changed the global
+ * pool.
+ */
+ if (cfs_b->quota == RUNTIME_INF)
+ delta = sched_cfs_bandwidth_slice();
+ else {
+ delta = min(cfs_b->runtime,
+ sched_cfs_bandwidth_slice());
+ cfs_b->runtime -= delta;
+ }
+ raw_spin_unlock(&cfs_b->lock);
+ }
+ return delta;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+ unsigned long delta_exec)
+{
+ if (cfs_rq->quota_assigned == RUNTIME_INF)
+ return;
+
+ cfs_rq->quota_used += delta_exec;
+
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ return;
+
+ cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+}
+
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
return 1;
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -384,6 +384,16 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_CFS_BANDWIDTH
+ {
+ .procname = "sched_cfs_bandwidth_slice_us",
+ .data = &sysctl_sched_cfs_bandwidth_slice,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
+#endif
#ifdef CONFIG_PROVE_LOCKING
{
.procname = "prove_locking",
* [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
From: Bharata B Rao @ 2010-10-12 7:52 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
sched: throttle cfs_rq entities which exceed their local quota
From: Paul Turner <pjt@google.com>
In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
cfs_rq's local quota and whether there is global quota available to continue
enabling it in the event we run out.
This patch adds the required support for the latter case, throttling entities
until quota is available to run. Throttling dequeues the entity in question
and sends a reschedule to the owning cpu so that it can be evicted.
The following restrictions apply to a throttled cfs_rq:
- It is dequeued from sched_entity hierarchy and restricted from being
re-enqueued. This means that new/waking children of this entity will be
queued up to it, but not past it.
- It does not contribute to weight calculations in tg_shares_up
- In the case that the cfs_rq of the cpu we are trying to pull from is
throttled, it is ignored by the load balancer in __load_balance_fair() and
move_one_task_fair().
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
kernel/sched.c | 12 ++++++++
kernel/sched_fair.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 76 insertions(+), 6 deletions(-)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -387,6 +387,7 @@ struct cfs_rq {
#endif
#ifdef CONFIG_CFS_BANDWIDTH
u64 quota_assigned, quota_used;
+ int throttled;
#endif
#endif
};
@@ -1668,6 +1669,8 @@ static void update_group_shares_cpu(stru
}
}
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
/*
* Re-compute the task group their per cpu shares over the given domain.
* This needs to be done in a bottom-up fashion because the rq weight of a
@@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
for_each_cpu(i, sched_domain_span(sd)) {
- weight = tg->cfs_rq[i]->load.weight;
+ /*
+ * bandwidth throttled entities cannot contribute to load
+ * balance
+ */
+ if (!cfs_rq_throttled(tg->cfs_rq[i]))
+ weight = tg->cfs_rq[i]->load.weight;
+ else
+ weight = 0;
usd_rq_weight[i] = weight;
rq_weight += weight;
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -273,8 +273,18 @@ static inline struct cfs_bandwidth *tg_c
return &tg->cfs_bandwidth;
}
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->throttled;
+}
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec);
+#else
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
#endif
@@ -787,6 +797,11 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+
+ if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
+ !group_cfs_rq(se)->nr_running))
+ return;
+
account_entity_enqueue(cfs_rq, se);
if (flags & ENQUEUE_WAKEUP) {
@@ -823,6 +838,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
*/
update_curr(cfs_rq);
+ if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
+ return;
+
update_stats_dequeue(cfs_rq, se);
if (flags & DEQUEUE_SLEEP) {
#ifdef CONFIG_SCHEDSTATS
@@ -1068,6 +1086,9 @@ enqueue_task_fair(struct rq *rq, struct
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, flags);
+ /* don't continue to enqueue if our parent is throttled */
+ if (cfs_rq_throttled(cfs_rq))
+ break;
flags = ENQUEUE_WAKEUP;
}
@@ -1087,8 +1108,11 @@ static void dequeue_task_fair(struct rq
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
- /* Don't dequeue parent if it has other entities besides us */
- if (cfs_rq->load.weight)
+ /*
+ * Don't dequeue parent if it has other entities besides us,
+ * or if it is throttled
+ */
+ if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
break;
flags |= DEQUEUE_SLEEP;
}
@@ -1166,6 +1190,35 @@ static u64 tg_request_cfs_quota(struct t
return delta;
}
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+
+ se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+ /*
+ * It's possible for the current task to block and re-wake before task
+ * switch, leading to a throttle within enqueue_task->update_curr()
+ * versus an entity that has not technically been enqueued yet.
+ *
+ * In this case, since we haven't actually done the enqueue yet, cut
+ * out and allow enqueue_entity() to short-circuit
+ */
+ if (!se->on_rq)
+ goto out_throttled;
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ dequeue_entity(cfs_rq, se, 1);
+ if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
+ break;
+ }
+
+out_throttled:
+ cfs_rq->throttled = 1;
+}
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec)
{
@@ -1174,10 +1227,16 @@ static void account_cfs_rq_quota(struct
cfs_rq->quota_used += delta_exec;
- if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ if (cfs_rq_throttled(cfs_rq) ||
+ cfs_rq->quota_used < cfs_rq->quota_assigned)
return;
cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+
+ if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
+ throttle_cfs_rq(cfs_rq);
+ resched_task(cfs_rq->rq->curr);
+ }
}
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
@@ -1995,9 +2054,10 @@ load_balance_fair(struct rq *this_rq, in
u64 rem_load, moved_load;
/*
- * empty group
+ * empty group or throttled cfs_rq
*/
- if (!busiest_cfs_rq->task_weight)
+ if (!busiest_cfs_rq->task_weight ||
+ cfs_rq_throttled(busiest_cfs_rq))
continue;
rem_load = (u64)rem_load_move * busiest_weight;
* [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
From: Bharata B Rao @ 2010-10-12 7:52 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
From: Paul Turner <pjt@google.com>
At the start of a new period there are several actions we must take:
- Refresh global bandwidth pool
- Unthrottle entities who ran out of quota as refreshed bandwidth permits
Unthrottled entities have the cfs_rq->throttled flag set and are re-enqueued
into the cfs entity hierarchy.
sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
since it is now shared by both cfs and rt bandwidth period timers.
The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
rd->span instead of cpu_online_mask since I think that was incorrect before
(don't want to hit cpu's outside of your root_domain for RT bandwidth).
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
kernel/sched.c | 16 ++++++++++++
kernel/sched_fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched_rt.c | 19 --------------
3 files changed, 84 insertions(+), 19 deletions(-)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1565,6 +1565,8 @@ static int tg_nop(struct task_group *tg,
}
#endif
+static inline const struct cpumask *sched_bw_period_mask(void);
+
#ifdef CONFIG_SMP
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
@@ -1933,6 +1935,18 @@ static inline void __set_task_cpu(struct
static const struct sched_class rt_sched_class;
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_rq(smp_processor_id())->rd->span;
+}
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+ return cpu_online_mask;
+}
+#endif
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* default period for cfs group bandwidth.
@@ -8937,6 +8951,8 @@ static int tg_set_cfs_bandwidth(struct t
raw_spin_lock_irq(&rq->lock);
init_cfs_rq_quota(cfs_rq);
+ if (cfs_rq_throttled(cfs_rq))
+ unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
mutex_unlock(&mutex);
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -268,6 +268,13 @@ find_matching_se(struct sched_entity **s
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_CFS_BANDWIDTH
+static inline
+struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
+{
+ return container_of(cfs_b, struct task_group,
+ cfs_bandwidth)->cfs_rq[cpu];
+}
+
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
{
return &tg->cfs_bandwidth;
@@ -1219,6 +1226,29 @@ out_throttled:
cfs_rq->throttled = 1;
}
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *se;
+ struct rq *rq = rq_of(cfs_rq);
+
+ se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+ cfs_rq->throttled = 0;
+ for_each_sched_entity(se) {
+ if (se->on_rq)
+ break;
+
+ cfs_rq = cfs_rq_of(se);
+ enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+ if (cfs_rq_throttled(cfs_rq))
+ break;
+ }
+
+ /* determine whether we need to wake up a potentially idle cpu */
+ if (rq->curr == rq->idle && rq->cfs.nr_running)
+ resched_task(rq->curr);
+}
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec)
{
@@ -1241,8 +1271,44 @@ static void account_cfs_rq_quota(struct
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
- return 1;
+ int i, idle = 1;
+ u64 delta;
+ const struct cpumask *span;
+
+ if (cfs_b->quota == RUNTIME_INF)
+ return 1;
+
+ /* reset group quota */
+ raw_spin_lock(&cfs_b->lock);
+ cfs_b->runtime = cfs_b->quota;
+ raw_spin_unlock(&cfs_b->lock);
+
+ span = sched_bw_period_mask();
+ for_each_cpu(i, span) {
+ struct rq *rq = cpu_rq(i);
+ struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
+
+ if (cfs_rq->nr_running)
+ idle = 0;
+
+ if (!cfs_rq_throttled(cfs_rq))
+ continue;
+
+ delta = tg_request_cfs_quota(cfs_rq->tg);
+
+ if (delta) {
+ raw_spin_lock(&rq->lock);
+ cfs_rq->quota_assigned += delta;
+
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ unthrottle_cfs_rq(cfs_rq);
+ raw_spin_unlock(&rq->lock);
+ }
+ }
+
+ return idle;
}
+
#endif
#ifdef CONFIG_SMP
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -241,18 +241,6 @@ static int rt_se_boosted(struct sched_rt
return p->prio != p->normal_prio;
}
-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-#endif
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -302,11 +290,6 @@ static inline int rt_rq_throttled(struct
return rt_rq->rt_throttled;
}
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
- return cpu_online_mask;
-}
-
static inline
struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
{
@@ -524,7 +507,7 @@ static int do_sched_rt_period_timer(stru
if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
return 1;
- span = sched_rt_period_mask();
+ span = sched_bw_period_mask();
for_each_cpu(i, span) {
int enqueue = 0;
struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v3 5/7] sched: add exports tracking cfs bandwidth control statistics
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
` (3 preceding siblings ...)
2010-10-12 7:52 ` [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Bharata B Rao
@ 2010-10-12 7:53 ` Bharata B Rao
2010-10-12 7:54 ` [PATCH v3 6/7] sched: hierarchical task accounting for FAIR_GROUP_SCHED Bharata B Rao
` (4 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-10-12 7:53 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
sched: add exports tracking cfs bandwidth control statistics
From: Nikhil Rao <ncrao@google.com>
This change introduces statistics exports for the cpu sub-system; these are
exposed through a stat file similar to those exported by other
subsystems.
The following exports are included:
nr_periods: number of periods in which execution occurred
nr_throttled: the number of periods above in which execution was throttled
throttled_time: cumulative wall-time that any cpus have been throttled for
this group
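As a rough illustration of how these counters are meant to evolve, here is a
hypothetical user-space model of the period-timer bookkeeping (this is not the
kernel code; field names mirror the patch, everything else is a simplification):

```c
#include <assert.h>

/* Simplified model of the per-group throttle statistics added by this
 * patch: nr_periods counts period-timer fires, nr_throttled counts only
 * those periods in which at least one cfs_rq was found throttled, and
 * throttled_time accumulates wall-time spent throttled. */
struct cfs_bw_stats {
	unsigned long long nr_periods;
	unsigned long long nr_throttled;
	unsigned long long throttled_time;
};

/* One period-timer tick: num_throttled is how many per-cpu runqueues
 * were throttled when the timer ran; wall_ns is throttled wall-time
 * accumulated by unthrottles during this period. */
static void period_tick(struct cfs_bw_stats *s, int num_throttled,
			unsigned long long wall_ns)
{
	s->nr_periods++;
	if (num_throttled)
		s->nr_throttled++;	/* counted once per period, not per rq */
	s->throttled_time += wall_ns;
}
```

Note that nr_throttled advances at most once per period regardless of how many
per-cpu runqueues were throttled, which matches the num_throttled check in
do_sched_cfs_period_timer() below.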
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
kernel/sched.c | 26 ++++++++++++++++++++++++++
kernel/sched_fair.c | 18 +++++++++++++++++-
2 files changed, 43 insertions(+), 1 deletion(-)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -252,6 +252,11 @@ struct cfs_bandwidth {
ktime_t period;
u64 runtime, quota;
struct hrtimer period_timer;
+
+ /* throttle statistics */
+ u64 nr_periods;
+ u64 nr_throttled;
+ u64 throttled_time;
};
#endif
@@ -388,6 +393,7 @@ struct cfs_rq {
#ifdef CONFIG_CFS_BANDWIDTH
u64 quota_assigned, quota_used;
int throttled;
+ u64 throttled_timestamp;
#endif
#endif
};
@@ -425,6 +431,10 @@ void init_cfs_bandwidth(struct cfs_bandw
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->period_timer.function = sched_cfs_period_timer;
+
+ cfs_b->nr_periods = 0;
+ cfs_b->nr_throttled = 0;
+ cfs_b->throttled_time = 0;
}
static
@@ -9029,6 +9039,18 @@ static int cpu_cfs_period_write_u64(stru
return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
}
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+ cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+ cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+ cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+ return 0;
+}
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -9075,6 +9097,10 @@ static struct cftype cpu_files[] = {
.read_u64 = cpu_cfs_period_read_u64,
.write_u64 = cpu_cfs_period_write_u64,
},
+ {
+ .name = "stat",
+ .read_map = cpu_stats_show,
+ },
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1224,16 +1224,26 @@ static void throttle_cfs_rq(struct cfs_r
out_throttled:
cfs_rq->throttled = 1;
+ cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
}
static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
struct rq *rq = rq_of(cfs_rq);
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ /* update stats */
+ update_rq_clock(rq);
+ raw_spin_lock(&cfs_b->lock);
+ cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
+ raw_spin_unlock(&cfs_b->lock);
+
cfs_rq->throttled = 0;
+ cfs_rq->throttled_timestamp = 0;
+
for_each_sched_entity(se) {
if (se->on_rq)
break;
@@ -1271,7 +1281,7 @@ static void account_cfs_rq_quota(struct
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
- int i, idle = 1;
+ int i, idle = 1, num_throttled = 0;
u64 delta;
const struct cpumask *span;
@@ -1293,6 +1303,7 @@ static int do_sched_cfs_period_timer(str
if (!cfs_rq_throttled(cfs_rq))
continue;
+ num_throttled++;
delta = tg_request_cfs_quota(cfs_rq->tg);
@@ -1306,6 +1317,11 @@ static int do_sched_cfs_period_timer(str
}
}
+ /* update throttled stats */
+ cfs_b->nr_periods++;
+ if (num_throttled)
+ cfs_b->nr_throttled++;
+
return idle;
}
* [PATCH v3 6/7] sched: hierarchical task accounting for FAIR_GROUP_SCHED
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
` (4 preceding siblings ...)
2010-10-12 7:53 ` [PATCH v3 5/7] sched: add exports tracking cfs bandwidth control statistics Bharata B Rao
@ 2010-10-12 7:54 ` Bharata B Rao
2010-10-12 7:55 ` [PATCH v3 7/7] sched: Return/expire slack quota using generation counters Bharata B Rao
` (3 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-10-12 7:54 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
sched: hierarchical task accounting for FAIR_GROUP_SCHED
From: Paul Turner <pjt@google.com>
With task entities participating in throttled sub-trees it is possible for
task activation/de-activation to not lead to root visible changes to
rq->nr_running. This in turn leads to incorrect idle and weight-per-task load
balance decisions.
To allow correct accounting we move responsibility for updating rq->nr_running
to the respective sched::classes. In the fair-group case this update is
hierarchical, tracking the number of active tasks rooted at each group entity.
Note: technically this issue also exists with the existing sched_rt
throttling; however due to the nearly complete provisioning of system
resources for rt scheduling this is much less common by default.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
kernel/sched.c | 9 ++++++---
kernel/sched_fair.c | 42 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched_rt.c | 5 ++++-
3 files changed, 52 insertions(+), 4 deletions(-)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -333,7 +333,7 @@ struct task_group init_task_group;
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
- unsigned long nr_running;
+ unsigned long nr_running, h_nr_tasks;
u64 exec_clock;
u64 min_vruntime;
@@ -1984,6 +1984,11 @@ static inline u64 sched_cfs_bandwidth_sl
#include "sched_stats.h"
+static void mod_nr_running(struct rq *rq, long delta)
+{
+ rq->nr_running += delta;
+}
+
static void inc_nr_running(struct rq *rq)
{
rq->nr_running++;
@@ -2040,7 +2045,6 @@ static void activate_task(struct rq *rq,
rq->nr_uninterruptible--;
enqueue_task(rq, p, flags);
- inc_nr_running(rq);
}
/*
@@ -2052,7 +2056,6 @@ static void deactivate_task(struct rq *r
rq->nr_uninterruptible++;
dequeue_task(rq, p, flags);
- dec_nr_running(rq);
}
#include "sched_idletask.c"
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -76,6 +76,8 @@ unsigned int sysctl_sched_child_runs_fir
*/
unsigned int __read_mostly sysctl_sched_compat_yield;
+static void account_hier_tasks(struct sched_entity *se, int delta);
+
/*
* SCHED_OTHER wake-up granularity.
* (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
@@ -683,6 +685,40 @@ account_entity_dequeue(struct cfs_rq *cf
se->on_rq = 0;
}
+#ifdef CONFIG_CFS_BANDWIDTH
+/* maintain hierarchical task counts on group entities */
+static void account_hier_tasks(struct sched_entity *se, int delta)
+{
+ struct rq *rq = rq_of(cfs_rq_of(se));
+ struct cfs_rq *cfs_rq;
+
+ for_each_sched_entity(se) {
+ /* a throttled entity cannot affect its parent hierarchy */
+ if (group_cfs_rq(se) && cfs_rq_throttled(group_cfs_rq(se)))
+ break;
+
+ /* we affect our queuing entity */
+ cfs_rq = cfs_rq_of(se);
+ cfs_rq->h_nr_tasks += delta;
+ }
+
+ /* account for global nr_running delta to hierarchy change */
+ if (!se)
+ mod_nr_running(rq, delta);
+}
+#else
+/*
+ * In the absence of group throttling, all operations are guaranteed to be
+ * globally visible at the root rq level.
+ */
+static void account_hier_tasks(struct sched_entity *se, int delta)
+{
+ struct rq *rq = rq_of(cfs_rq_of(se));
+
+ mod_nr_running(rq, delta);
+}
+#endif
+
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHEDSTATS
@@ -1099,6 +1135,7 @@ enqueue_task_fair(struct rq *rq, struct
flags = ENQUEUE_WAKEUP;
}
+ account_hier_tasks(&p->se, 1);
hrtick_update(rq);
}
@@ -1124,6 +1161,7 @@ static void dequeue_task_fair(struct rq
flags |= DEQUEUE_SLEEP;
}
+ account_hier_tasks(&p->se, -1);
hrtick_update(rq);
}
@@ -1197,6 +1235,8 @@ static u64 tg_request_cfs_quota(struct t
return delta;
}
+static void account_hier_tasks(struct sched_entity *se, int delta);
+
static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct sched_entity *se;
@@ -1214,6 +1254,7 @@ static void throttle_cfs_rq(struct cfs_r
if (!se->on_rq)
goto out_throttled;
+ account_hier_tasks(se, -cfs_rq->h_nr_tasks);
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -1244,6 +1285,7 @@ static void unthrottle_cfs_rq(struct cfs
cfs_rq->throttled = 0;
cfs_rq->throttled_timestamp = 0;
+ account_hier_tasks(se, cfs_rq->h_nr_tasks);
for_each_sched_entity(se) {
if (se->on_rq)
break;
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -882,6 +882,8 @@ enqueue_task_rt(struct rq *rq, struct ta
if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
+
+ inc_nr_running(rq);
}
static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -892,6 +894,8 @@ static void dequeue_task_rt(struct rq *r
dequeue_rt_entity(rt_se);
dequeue_pushable_task(rq, p);
+
+ dec_nr_running(rq);
}
/*
@@ -1754,4 +1758,3 @@ static void print_rt_stats(struct seq_fi
rcu_read_unlock();
}
#endif /* CONFIG_SCHED_DEBUG */
-
* [PATCH v3 7/7] sched: Return/expire slack quota using generation counters
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
` (5 preceding siblings ...)
2010-10-12 7:54 ` [PATCH v3 6/7] sched: hierarchical task accounting for FAIR_GROUP_SCHED Bharata B Rao
@ 2010-10-12 7:55 ` Bharata B Rao
2010-10-13 5:14 ` [PATCH v3 0/7] CFS Bandwidth Control KAMEZAWA Hiroyuki
` (2 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-10-12 7:55 UTC (permalink / raw)
To: linux-kernel
Cc: Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
From: Paul Turner <pjt@google.com>
sched: Return/expire slack quota using generation counters
This patch adds generation counters to track and expire quotas.
This allows for two useful semantics:
1) On voluntary dequeue quota can be returned to the global pool provided it
is still "current". In this patch we return all but one tick's worth of
quota so that workloads with high rates of turn-over do not incur
significant contention.
When returning quota to the global pool, if there are throttled runqueues
and we have more than a slice of quota available, attempt to unthrottle
them (again this is to prevent contention in the high turn over case).
2) On period expiration the generation counter is incremented, naturally
expiring outstanding slack quota in the system.
A separate hrtimer is used to drive the slack quota redistribution and
subsequent unthrottling of throttled entities.
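The generation mechanism can be sketched with a small user-space model
(hypothetical and simplified: it ignores the one-tick reserve, the slack
timer, and all locking; field names follow the patch):

```c
#include <assert.h>

/* Toy model of the generation counter: a per-cpu local pool may return
 * slack to the global pool only while its generation matches the global
 * one. Bumping the generation at period refresh implicitly expires all
 * outstanding slack without touching each cpu. */
struct global_pool { unsigned long long runtime, quota, generation; };
struct local_pool  { unsigned long long assigned, used, generation; };

static void period_refresh(struct global_pool *g)
{
	g->runtime = g->quota;
	g->generation++;	/* expires all outstanding local slack */
}

static void return_slack(struct global_pool *g, struct local_pool *l)
{
	unsigned long long slack;

	if (l->generation != g->generation || l->used > l->assigned)
		return;		/* stale or exhausted quota: nothing to return */

	slack = l->assigned - l->used;
	g->runtime += slack;
	l->assigned -= slack;
}
```

The key property is that expiry is free: no per-cpu walk is needed at period
refresh, since any pool holding quota from an old generation simply fails the
comparison the next time it tries to return or extend its quota.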
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
kernel/sched.c | 54 +++++++++++++++++++++++--
kernel/sched_fair.c | 111 ++++++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 146 insertions(+), 19 deletions(-)
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -250,8 +250,10 @@ static LIST_HEAD(task_groups);
struct cfs_bandwidth {
raw_spinlock_t lock;
ktime_t period;
- u64 runtime, quota;
+ u64 runtime, quota, generation;
+ int throttled_rqs;
struct hrtimer period_timer;
+ struct hrtimer slack_timer;
/* throttle statistics */
u64 nr_periods;
@@ -391,7 +393,7 @@ struct cfs_rq {
unsigned long rq_weight;
#endif
#ifdef CONFIG_CFS_BANDWIDTH
- u64 quota_assigned, quota_used;
+ u64 quota_assigned, quota_used, quota_generation;
int throttled;
u64 throttled_timestamp;
#endif
@@ -399,6 +401,17 @@ struct cfs_rq {
};
#ifdef CONFIG_CFS_BANDWIDTH
+
+static int do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b);
+
+static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
+{
+ struct cfs_bandwidth *cfs_b =
+ container_of(timer, struct cfs_bandwidth, slack_timer);
+ do_sched_cfs_slack_timer(cfs_b);
+ return HRTIMER_NORESTART;
+}
+
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
@@ -428,9 +441,11 @@ void init_cfs_bandwidth(struct cfs_bandw
raw_spin_lock_init(&cfs_b->lock);
cfs_b->quota = cfs_b->runtime = quota;
cfs_b->period = ns_to_ktime(period);
-
+ cfs_b->generation = 0;
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->period_timer.function = sched_cfs_period_timer;
+ hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cfs_b->slack_timer.function = sched_cfs_slack_timer;
cfs_b->nr_periods = 0;
cfs_b->nr_throttled = 0;
@@ -464,6 +479,35 @@ static void destroy_cfs_bandwidth(struct
{
hrtimer_cancel(&cfs_b->period_timer);
}
+
+
+/* Should this be a tunable ? */
+#define CFS_SLACK_PERIOD 2000000 /* 2ms */
+
+static void destroy_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+ hrtimer_cancel(&cfs_b->slack_timer);
+}
+
+static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+ if (cfs_b->quota == RUNTIME_INF)
+ return;
+
+ if (hrtimer_active(&cfs_b->slack_timer))
+ return;
+
+ raw_spin_lock(&cfs_b->lock);
+
+ /*
+ * TODO: Don't start the slack timer if the
+ * period timer is about to fire.
+ */
+ start_bandwidth_timer(&cfs_b->slack_timer,
+ ns_to_ktime(CFS_SLACK_PERIOD));
+ raw_spin_unlock(&cfs_b->lock);
+}
+
#endif
/* Real-Time classes' related field in a runqueue: */
@@ -8182,6 +8226,7 @@ static void free_fair_sched_group(struct
#ifdef CONFIG_CFS_BANDWIDTH
destroy_cfs_bandwidth(&tg->cfs_bandwidth);
+ destroy_cfs_slack_bandwidth(&tg->cfs_bandwidth);
#endif
for_each_possible_cpu(i) {
@@ -8936,6 +8981,7 @@ static u64 cpu_shares_read_u64(struct cg
static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
{
int i;
+ u64 next_generation;
static DEFINE_MUTEX(mutex);
if (tg == &init_task_group)
@@ -8956,6 +9002,7 @@ static int tg_set_cfs_bandwidth(struct t
raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
tg->cfs_bandwidth.period = ns_to_ktime(period);
tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
+ next_generation = ++tg->cfs_bandwidth.generation;
raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
for_each_possible_cpu(i) {
@@ -8964,6 +9011,7 @@ static int tg_set_cfs_bandwidth(struct t
raw_spin_lock_irq(&rq->lock);
init_cfs_rq_quota(cfs_rq);
+ cfs_rq->quota_generation = next_generation;
if (cfs_rq_throttled(cfs_rq))
unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -287,6 +287,8 @@ static inline int cfs_rq_throttled(struc
return cfs_rq->throttled;
}
+static void cfs_rq_return_unused_quota(struct cfs_rq *cfs_rq);
+
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec);
#else
@@ -912,6 +914,10 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
*/
if (!(flags & DEQUEUE_SLEEP))
se->vruntime -= cfs_rq->min_vruntime;
+#ifdef CONFIG_CFS_BANDWIDTH
+ else if (cfs_rq->quota_assigned != RUNTIME_INF)
+ cfs_rq_return_unused_quota(cfs_rq);
+#endif
}
/*
@@ -1266,6 +1272,7 @@ static void throttle_cfs_rq(struct cfs_r
out_throttled:
cfs_rq->throttled = 1;
cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
+ tg_cfs_bandwidth(cfs_rq->tg)->throttled_rqs = 1;
}
static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -1304,16 +1311,24 @@ static void unthrottle_cfs_rq(struct cfs
static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
unsigned long delta_exec)
{
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
if (cfs_rq->quota_assigned == RUNTIME_INF)
return;
cfs_rq->quota_used += delta_exec;
- if (cfs_rq_throttled(cfs_rq) ||
- cfs_rq->quota_used < cfs_rq->quota_assigned)
+ if (cfs_rq_throttled(cfs_rq))
+ return;
+
+ if (cfs_rq->quota_generation != cfs_b->generation)
+ cfs_rq->quota_assigned = min(cfs_rq->quota_used,
+ cfs_rq->quota_assigned);
+
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned)
return;
cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+ cfs_rq->quota_generation = cfs_b->generation;
if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
throttle_cfs_rq(cfs_rq);
@@ -1321,19 +1336,11 @@ static void account_cfs_rq_quota(struct
}
}
-static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+static int redistribute_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
- int i, idle = 1, num_throttled = 0;
- u64 delta;
+ int i, idle = 1, num_throttled = 0, throttled_rqs = 0;
const struct cpumask *span;
-
- if (cfs_b->quota == RUNTIME_INF)
- return 1;
-
- /* reset group quota */
- raw_spin_lock(&cfs_b->lock);
- cfs_b->runtime = cfs_b->quota;
- raw_spin_unlock(&cfs_b->lock);
+ u64 delta;
span = sched_bw_period_mask();
for_each_cpu(i, span) {
@@ -1346,27 +1353,99 @@ static int do_sched_cfs_period_timer(str
if (!cfs_rq_throttled(cfs_rq))
continue;
num_throttled++;
+ throttled_rqs++;
delta = tg_request_cfs_quota(cfs_rq->tg);
if (delta) {
raw_spin_lock(&rq->lock);
cfs_rq->quota_assigned += delta;
+ cfs_rq->quota_generation = cfs_b->generation;
- if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+ if (cfs_rq->quota_used < cfs_rq->quota_assigned) {
unthrottle_cfs_rq(cfs_rq);
+ throttled_rqs--;
+ }
raw_spin_unlock(&rq->lock);
}
}
- /* update throttled stats */
- cfs_b->nr_periods++;
if (num_throttled)
cfs_b->nr_throttled++;
+ cfs_b->throttled_rqs = throttled_rqs;
return idle;
}
+static void cfs_rq_return_unused_quota(struct cfs_rq *cfs_rq)
+{
+ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+ s64 quota_remaining;
+
+ if (cfs_rq->quota_used > cfs_rq->quota_assigned ||
+ cfs_rq->quota_generation != cfs_b->generation)
+ return;
+
+ quota_remaining = cfs_rq->quota_assigned - cfs_rq->quota_used;
+ /* hold 1 tick of quota in reserve for workloads with high turnover */
+ if (NS_TO_JIFFIES(quota_remaining) < 1)
+ return;
+
+ quota_remaining -= NSEC_PER_SEC / HZ;
+ BUG_ON(quota_remaining < 0);
+
+ if (!quota_remaining)
+ return;
+
+ raw_spin_lock(&cfs_b->lock);
+ /* the generation check above was unlocked; recheck under cfs_b->lock */
+ if (cfs_rq->quota_generation == cfs_b->generation) {
+ cfs_b->runtime += quota_remaining;
+ cfs_rq->quota_assigned -= quota_remaining;
+ }
+ raw_spin_unlock(&cfs_b->lock);
+
+ /*
+ * if we've re-accumulated more than a slice and there are throttled
+ * rq's, try to unthrottle them.
+ */
+ if (cfs_b->throttled_rqs &&
+ cfs_b->runtime > sched_cfs_bandwidth_slice())
+ start_cfs_slack_bandwidth(cfs_b);
+}
+
+
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+ int idle = 1;
+
+ if (cfs_b->quota == RUNTIME_INF)
+ return 1;
+
+ /* reset group quota */
+ raw_spin_lock(&cfs_b->lock);
+ idle = cfs_b->runtime == cfs_b->quota;
+ cfs_b->runtime = cfs_b->quota;
+ cfs_b->generation++;
+ raw_spin_unlock(&cfs_b->lock);
+
+ idle = redistribute_cfs_bandwidth(cfs_b);
+
+ /* update throttled stats */
+ cfs_b->nr_periods++;
+
+ return idle;
+}
+
+static int do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+{
+ if (cfs_b->quota == RUNTIME_INF)
+ return 0;
+
+ redistribute_cfs_bandwidth(cfs_b);
+ return 0;
+}
+
#endif
#ifdef CONFIG_SMP
* Re: [PATCH v3 0/7] CFS Bandwidth Control
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
` (6 preceding siblings ...)
2010-10-12 7:55 ` [PATCH v3 7/7] sched: Return/expire slack quota using generation counters Bharata B Rao
@ 2010-10-13 5:14 ` KAMEZAWA Hiroyuki
2010-10-13 5:44 ` Herbert Poetzl
2010-11-17 8:32 ` Lai Jiangshan
9 siblings, 0 replies; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-13 5:14 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
On Tue, 12 Oct 2010 13:19:10 +0530
Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
> Hi,
>
> It's been a while since we posted CFS hard limits (aka CFS bandwidth control
> now) patches, hence a quick recap first:
>
> - I have been working on CFS hard limits since last year and have posted
> a few versions of the same (last post: http://lkml.org/lkml/2010/1/5/44)
> - Paul Turner and Nikhil Rao meanwhile started working on CFS bandwidth
> control and have posted a couple of versions.
> (last post v2: http://lwn.net/Articles/385055/)
>
> Paul's approach mainly changed the way the CPU hard limit was represented. After
> his post, I decided to work with them and discontinue my patch series since
> their global bandwidth specification for group appears more flexible than
> the RT-type per-cpu bandwidth specification I had in my series.
>
> Since Paul seems to be busy, I am taking the freedom of posting the next
> version of his patches with a few enhancements to the slack time handling.
> (more on this later)
>
Wow, thank you. I've been looking forward to see updates.
I'll try this set.
Regards,
-Kame
* Re: [PATCH v3 0/7] CFS Bandwidth Control
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
` (7 preceding siblings ...)
2010-10-13 5:14 ` [PATCH v3 0/7] CFS Bandwidth Control KAMEZAWA Hiroyuki
@ 2010-10-13 5:44 ` Herbert Poetzl
2010-10-13 6:26 ` Paul Turner
2010-11-17 8:32 ` Lai Jiangshan
9 siblings, 1 reply; 51+ messages in thread
From: Herbert Poetzl @ 2010-10-13 5:44 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Avi Kivity, Chris Friesen, Paul Menage,
Mike Waychison, Paul Turner, Nikhil Rao
On Tue, Oct 12, 2010 at 01:19:10PM +0530, Bharata B Rao wrote:
> Hi,
> It's been a while since we posted CFS hard limits (aka CFS bandwidth
Indeed, will see that I can incorporate those in the
next experimental Linux-VServer patch for testing ...
btw, is it planned to allow for hard limits which
are temporarily disabled when the machine/cpu would
be otherwise idle (i.e. running the idle thread) or
as we solved it, can we artificially advance the
time (for the hard limits) when idle so that contexts
which have work to do can work without sacrificing
the priorization or the actual limits?
best,
Herbert
PS: from a quick glance I take it that using large
values for period and quota, while keeping the ratio
the same allows for 'burst loads'?
> control now) patches, hence a quick recap first:
> - I have been working on CFS hard limits since last year and have posted
> a few versions of the same (last post: http://lkml.org/lkml/2010/1/5/44)
> - Paul Turner and Nikhil Rao meanwhile started working on CFS bandwidth
> control and have posted a couple of versions.
> (last post v2: http://lwn.net/Articles/385055/)
>
> Paul's approach mainly changed the way the CPU hard limit was represented. After
> his post, I decided to work with them and discontinue my patch series since
> their global bandwidth specification for group appears more flexible than
> the RT-type per-cpu bandwidth specification I had in my series.
>
> Since Paul seems to be busy, I am taking the freedom of posting the next
> version of his patches with a few enhancements to the slack time handling.
> (more on this later)
>
> Main changes in this post:
>
> - Return the unused and remaining local quota at each CPU to the global
> runtime pool.
> - A few fixes:
> - Explicitly wake up the idle cpu during unthrottle.
> - Optimally handle throttling of current task within enqueue_task.
> - Fix compilation break with CONFIG_PREEMPT on.
> - Fix a compilation break at intermediate patch level.
> - Applies on 2.6.36-rc7.
>
> More about slack time issue
> ---------------------------
> Bandwidth available to a group is specified by two parameters: cpu.cfs_quota_us
> and cpu.cfs_period_us. cpu.cfs_quota_us is the max CPU time a group can
> consume within the time period of cpu.cfs_period_us. The quota available
> to a group within a period is maintained in a per-group global pool. In each
> CPU, the consumption happens by obtaining a slice of this global pool.
>
> If the local quota (obtained as slices of global pool) isn't fully consumed
> within a given period, a group can potentially get more CPU time than
> it's allowed in the next interval. This is due to the slack time that may
> be left over from the previous interval. More details about how this is fixed
> are presented in the description part of patch 7/7. Here I will only show the
> benefit of handling the slack time correctly through this experiment:
>
> On a 16 CPU system, two different kinds of tasks were run as part of a group
> which had quota/bandwidth as 500000/500000 [=> 500ms/500ms], which means that
> the group was capped at 1CPU worth of time every period.
>
> Type A task: Complete CPU hog.
> Type B task: Sleeps for 500ms and runs as CPU hog for next 500ms. And this cycle
> repeats.
>
> 1 task of type A and 15 tasks of type B were run for 20s, each bound to a
> different CPU. At the end of 20s, the CPU time obtained by each of them
> looked like this:
>
> -----------------------------------------------------------------------
> Without returning Returning slack time
> slack time to global to global pool
> pool (with patch 7/7)
> -----------------------------------------------------------------------
> 1 type A task 7.96s 10.71s
> 15 type B tasks 12.36s 9.79s
> -----------------------------------------------------------------------
>
> This shows the effects of slack time and the benefit of handling it correctly.
>
> I request the scheduler maintainers and others for comments on these patches.
>
> Regards,
> Bharata.
* Re: [PATCH v3 0/7] CFS Bandwidth Control
2010-10-13 5:44 ` Herbert Poetzl
@ 2010-10-13 6:26 ` Paul Turner
0 siblings, 0 replies; 51+ messages in thread
From: Paul Turner @ 2010-10-13 6:26 UTC (permalink / raw)
To: Bharata B Rao, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Tue, Oct 12, 2010 at 10:44 PM, Herbert Poetzl <herbert@13thfloor.at> wrote:
> On Tue, Oct 12, 2010 at 01:19:10PM +0530, Bharata B Rao wrote:
>> Hi,
>
>> It's been a while since we posted CFS hard limits (aka CFS bandwidth
>
> Indeed, will see that I can incorporate those in the
> next experimental Linux-VServer patch for testing ...
>
> btw, is it planned to allow for hard limits which
> are temporarily disabled when the machine/cpu would
> be otherwise idle (i.e. running the idle thread) or
> as we solved it, can we artificially advance the
> time (for the hard limits) when idle so that contexts
> which have work to do can work without sacrificing
> the priorization or the actual limits?
This is a fairly useful idea; I hadn't planned on it, but I can see the
applications for it.
It's somewhat complicated by the fact that throttled entities are not
in-tree, making correct selection in the case where the machine is
otherwise idle difficult. One method of supporting this could be to
maintain a separate entity for throttled entities (the same mechanism
could be extended to include SCHED_IDLE entities, as well as extending
SCHED_IDLE to group scheduling).
I think it best to first iterate on the fundamentals since this has
been baking for a long time, before we try to add extensions.
>
> best,
> Herbert
>
> PS: from a quick glance I take it that using large
> values for period and quota, while keeping the ratio
> the same allows for 'burst loads'?
>
Exactly!
Another approach we've used is to use small values for period and quota
(with fixed ratio) for latency sensitive applications. This allows
you to not blow out the tail of your latency distribution (since each
individual throttle is for a short period of time, tail latencies are
capped in exchange for a median latency increase).
>> control now) patches, hence a quick recap first:
>
>> - I have been working on CFS hard limits since last year and have posted
>> a few versions of the same (last post: http://lkml.org/lkml/2010/1/5/44)
>> - Paul Turner and Nikhil Rao meanwhile started working on CFS bandwidth
>> control and have posted a couple of versions.
>> (last post v2: http://lwn.net/Articles/385055/)
>>
>> Paul's approach mainly changed the way the CPU hard limit was represented. After
>> his post, I decided to work with them and discontinue my patch series since
>> their global bandwidth specification for group appears more flexible than
>> the RT-type per-cpu bandwidth specification I had in my series.
>>
>> Since Paul seems to be busy, I am taking the freedom of posting the next
>> version of his patches with a few enhancements to the slack time handling.
>> (more on this later)
>>
>> Main changes in this post:
>>
>> - Return the unused and remaining local quota at each CPU to the global
>> runtime pool.
>> - A few fixes:
>> - Explicitly wake up the idle cpu during unthrottle.
>> - Optimally handle throttling of current task within enqueue_task.
>> - Fix compilation break with CONFIG_PREEMPT on.
>> - Fix a compilation break at intermediate patch level.
>> - Applies on 2.6.36-rc7.
>>
>> More about slack time issue
>> ---------------------------
>> Bandwidth available to a group is specified by two parameters: cpu.cfs_quota_us
>> and cpu.cfs_period_us. cpu.cfs_quota_us is the max CPU time a group can
>> consume within the time period of cpu.cfs_period_us. The quota available
>> to a group within a period is maintained in a per-group global pool. In each
>> CPU, the consumption happens by obtaining a slice of this global pool.
>>
>> If the local quota (obtained as slices of global pool) isn't fully consumed
>> within a given period, a group can potentially get more CPU time than
>> its allowed for in the next interval. This is due to the slack time that may
>> be left over from the previous interval. More details about how this is fixed
>> is present in the description part of patch 7/7. Here I will only show the
>> benefit of handling the slack time correctly through this experiment:
>>
>> On a 16 CPU system, two different kinds of tasks were run as part of a group
>> which had quota/bandwidth as 500000/500000 [=> 500ms/500ms], which means that
>> the group was capped at 1CPU worth of time every period.
>>
>> Type A task: Complete CPU hog.
>> Type B task: Sleeps for 500ms and runs as CPU hog for next 500ms. And this cycle
>> repeats.
>>
>> 1 task of type A and 15 tasks of type B were run for 20s, each bound to a
>> different CPU. At the end of 20s, the CPU time obtained by each of them
>> looked like this:
>>
>> -----------------------------------------------------------------------
>> Without returning Returning slack time
>> slack time to global to global pool
>> pool (with patch 7/7)
>> -----------------------------------------------------------------------
>> 1 type A task 7.96s 10.71s
>> 15 type B tasks 12.36s 9.79s
>> -----------------------------------------------------------------------
>>
>> This shows the effects of slack time and the benefit of handling it correctly.
>>
>> I request the scheduler maintainers and others for comments on these patches.
>>
>> Regards,
>> Bharata.
>
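The global-pool/local-slice scheme summarized above, together with the patch 7/7 slack fix, can be sketched in miniature (plain Python; the class, method names, and the 10ms slice size are illustrative, not from the patches):

```python
# Toy model of the global-pool / local-slice scheme described above.
# Each CPU draws fixed slices from the group's global quota pool; the
# slack-time fix (patch 7/7) returns unused local quota to the pool
# instead of letting it carry over as extra runtime next period.

SLICE_US = 10_000  # illustrative slice size

class GlobalPool:
    def __init__(self, quota_us):
        self.quota_us = quota_us
        self.runtime_us = quota_us        # unassigned quota this period

    def acquire_slice(self):
        grant = min(SLICE_US, self.runtime_us)
        self.runtime_us -= grant
        return grant

    def return_slack(self, unused_us):
        # the fix: unused local quota goes back to the global pool
        self.runtime_us += unused_us

    def refresh(self):
        self.runtime_us = self.quota_us   # new period, full quota again

pool = GlobalPool(quota_us=500_000)
local = pool.acquire_slice()              # a CPU claims a 10ms slice
local -= 4_000                            # runs 4ms, then goes idle
pool.return_slack(local)                  # 6ms handed back, not hoarded
```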
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-12 7:52 ` [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota Bharata B Rao
@ 2010-10-13 6:34 ` KAMEZAWA Hiroyuki
2010-10-13 6:44 ` Paul Turner
2010-10-14 9:12 ` Peter Zijlstra
0 siblings, 2 replies; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-13 6:34 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
On Tue, 12 Oct 2010 13:22:02 +0530
Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
> sched: throttle cfs_rq entities which exceed their local quota
>
> From: Paul Turner <pjt@google.com>
>
> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
> cfs_rq's local quota and whether there is global quota available to continue
> enabling it in the event we run out.
>
> This patch adds the required support for the latter case, throttling entities
> until quota is available to run. Throttling dequeues the entity in question
> and sends a reschedule to the owning cpu so that it can be evicted.
>
> The following restrictions apply to a throttled cfs_rq:
> - It is dequeued from sched_entity hierarchy and restricted from being
> re-enqueued. This means that new/waking children of this entity will be
> queued up to it, but not past it.
> - It does not contribute to weight calculations in tg_shares_up
> - In the case that the cfs_rq of the cpu we are trying to pull from is throttled
> it is ignored by the load balancer in __load_balance_fair() and
> move_one_task_fair().
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
> kernel/sched.c | 12 ++++++++
> kernel/sched_fair.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 76 insertions(+), 6 deletions(-)
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -387,6 +387,7 @@ struct cfs_rq {
> #endif
> #ifdef CONFIG_CFS_BANDWIDTH
> u64 quota_assigned, quota_used;
> + int throttled;
> #endif
> #endif
> };
> @@ -1668,6 +1669,8 @@ static void update_group_shares_cpu(stru
> }
> }
>
> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
> +
I'm just curious: is a static inline forward declaration actually inlined?
> /*
> * Re-compute the task group their per cpu shares over the given domain.
> * This needs to be done in a bottom-up fashion because the rq weight of a
> @@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
> usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
>
> for_each_cpu(i, sched_domain_span(sd)) {
> - weight = tg->cfs_rq[i]->load.weight;
> + /*
> + * bandwidth throttled entities cannot contribute to load
> + * balance
> + */
> + if (!cfs_rq_throttled(tg->cfs_rq[i]))
> + weight = tg->cfs_rq[i]->load.weight;
> + else
> + weight = 0;
Can cpu.shares and bandwidth control be used simultaneously, or...
is this fair? I'm not familiar with the scheduler, but this seems to allow boosting this tg.
Could you add brief documentation of the spec/feature in the next post?
Thanks,
-Kame
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-13 6:34 ` KAMEZAWA Hiroyuki
@ 2010-10-13 6:44 ` Paul Turner
2010-10-13 6:47 ` Bharata B Rao
2010-10-13 7:00 ` KAMEZAWA Hiroyuki
2010-10-14 9:12 ` Peter Zijlstra
1 sibling, 2 replies; 51+ messages in thread
From: Paul Turner @ 2010-10-13 6:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Tue, Oct 12, 2010 at 11:34 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 12 Oct 2010 13:22:02 +0530
> Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
>
>> sched: throttle cfs_rq entities which exceed their local quota
>>
>> From: Paul Turner <pjt@google.com>
>>
>> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
>> cfs_rq's local quota and whether there is global quota available to continue
>> enabling it in the event we run out.
>>
>> This patch adds the required support for the latter case, throttling entities
>> until quota is available to run. Throttling dequeues the entity in question
>> and sends a reschedule to the owning cpu so that it can be evicted.
>>
>> The following restrictions apply to a throttled cfs_rq:
>> - It is dequeued from sched_entity hierarchy and restricted from being
>> re-enqueued. This means that new/waking children of this entity will be
>> queued up to it, but not past it.
>> - It does not contribute to weight calculations in tg_shares_up
>> - In the case that the cfs_rq of the cpu we are trying to pull from is throttled
>> it is ignored by the load balancer in __load_balance_fair() and
>> move_one_task_fair().
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> ---
>> kernel/sched.c | 12 ++++++++
>> kernel/sched_fair.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++----
>> 2 files changed, 76 insertions(+), 6 deletions(-)
>>
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -387,6 +387,7 @@ struct cfs_rq {
>> #endif
>> #ifdef CONFIG_CFS_BANDWIDTH
>> u64 quota_assigned, quota_used;
>> + int throttled;
>> #endif
>> #endif
>> };
>> @@ -1668,6 +1669,8 @@ static void update_group_shares_cpu(stru
>> }
>> }
>>
>> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
>> +
>
> I'm just curious: is a static inline forward declaration actually inlined?
>
Hm. This function is tiny, I should just move it up, thanks.
>> /*
>> * Re-compute the task group their per cpu shares over the given domain.
>> * This needs to be done in a bottom-up fashion because the rq weight of a
>> @@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
>> usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
>>
>> for_each_cpu(i, sched_domain_span(sd)) {
>> - weight = tg->cfs_rq[i]->load.weight;
>> + /*
>> + * bandwidth throttled entities cannot contribute to load
>> + * balance
>> + */
>> + if (!cfs_rq_throttled(tg->cfs_rq[i]))
>> + weight = tg->cfs_rq[i]->load.weight;
>> + else
>> + weight = 0;
>
> Can cpu.shares and bandwidth control be used simultaneously, or...
> is this fair? I'm not familiar with the scheduler, but this seems to allow boosting this tg.
> Could you add brief documentation of the spec/feature in the next post?
>
Bandwidth control is orthogonal to shares; shares continue to control the
distribution of bandwidth while within quota. Bandwidth control only
has a 'perceivable' effect when you exceed your reservation within a
quota period.
What the above is doing is removing any throttled entities from the
load-balancer's weight calculations (based on contribution from
throttled entities) since these entities are already dequeued and
cannot be balanced.
Or have I misunderstood your question?
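To make the hunk's intent concrete, the weight selection amounts to the following (a toy Python model; the field names are simplified stand-ins for the kernel structures):

```python
# Toy model of the tg_shares_up() hunk above: a throttled cfs_rq is already
# dequeued, so it must contribute zero weight to the load balancer's sums.

def tg_rq_weight(cfs_rqs):
    """Sum per-cpu load weights, skipping bandwidth-throttled cfs_rqs."""
    total = 0
    for rq in cfs_rqs:
        # mirrors: weight = cfs_rq_throttled(...) ? 0 : cfs_rq->load.weight
        total += 0 if rq["throttled"] else rq["load_weight"]
    return total

rqs = [
    {"load_weight": 1024, "throttled": False},
    {"load_weight": 2048, "throttled": True},   # dequeued: cannot be balanced
    {"load_weight": 512,  "throttled": False},
]
# Only the runnable cfs_rqs count: 1024 + 512
```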
> Thanks,
> -Kame
>
>
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-13 6:44 ` Paul Turner
@ 2010-10-13 6:47 ` Bharata B Rao
2010-10-13 6:52 ` Paul Turner
2010-10-13 7:00 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 51+ messages in thread
From: Bharata B Rao @ 2010-10-13 6:47 UTC (permalink / raw)
To: Paul Turner
Cc: KAMEZAWA Hiroyuki, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Tue, Oct 12, 2010 at 11:44:29PM -0700, Paul Turner wrote:
> On Tue, Oct 12, 2010 at 11:34 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 12 Oct 2010 13:22:02 +0530
> > Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
> >
> >> sched: throttle cfs_rq entities which exceed their local quota
> >>
> >> From: Paul Turner <pjt@google.com>
> >>
> >> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
> >> cfs_rq's local quota and whether there is global quota available to continue
> >> enabling it in the event we run out.
> >>
> >> This patch adds the required support for the latter case, throttling entities
> >> until quota is available to run. Throttling dequeues the entity in question
> >> and sends a reschedule to the owning cpu so that it can be evicted.
> >>
> >> The following restrictions apply to a throttled cfs_rq:
> >> - It is dequeued from sched_entity hierarchy and restricted from being
> >> re-enqueued. This means that new/waking children of this entity will be
> >> queued up to it, but not past it.
> >> - It does not contribute to weight calculations in tg_shares_up
> >> - In the case that the cfs_rq of the cpu we are trying to pull from is throttled
> >> it is ignored by the load balancer in __load_balance_fair() and
> >> move_one_task_fair().
> >>
> >> Signed-off-by: Paul Turner <pjt@google.com>
> >> Signed-off-by: Nikhil Rao <ncrao@google.com>
> >> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> >> ---
> >> kernel/sched.c | 12 ++++++++
> >> kernel/sched_fair.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >> 2 files changed, 76 insertions(+), 6 deletions(-)
> >>
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -387,6 +387,7 @@ struct cfs_rq {
> >> #endif
> >> #ifdef CONFIG_CFS_BANDWIDTH
> >> u64 quota_assigned, quota_used;
> >> + int throttled;
> >> #endif
> >> #endif
> >> };
> >> @@ -1668,6 +1669,8 @@ static void update_group_shares_cpu(stru
> >> }
> >> }
> >>
> >> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
> >> +
> >
> > I'm just curious: is a static inline forward declaration actually inlined?
> >
>
> Hm. This function is tiny, I should just move it up, thanks.
>
> >> /*
> >> * Re-compute the task group their per cpu shares over the given domain.
> >> * This needs to be done in a bottom-up fashion because the rq weight of a
> >> @@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
> >> usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
> >>
> >> for_each_cpu(i, sched_domain_span(sd)) {
> >> - weight = tg->cfs_rq[i]->load.weight;
> >> + /*
> >> + * bandwidth throttled entities cannot contribute to load
> >> + * balance
> >> + */
> >> + if (!cfs_rq_throttled(tg->cfs_rq[i]))
> >> + weight = tg->cfs_rq[i]->load.weight;
> >> + else
> >> + weight = 0;
> >
> > Can cpu.shares and bandwidth control be used simultaneously, or...
> > is this fair? I'm not familiar with the scheduler, but this seems to allow boosting this tg.
> > Could you add brief documentation of the spec/feature in the next post?
> >
>
> Bandwidth control is orthogonal to shares; shares continue to control the
> distribution of bandwidth while within quota. Bandwidth control only
> has a 'perceivable' effect when you exceed your reservation within a
> quota period.
So if a group gets throttled because it's approaching its limit, it might
not be possible to see perfect fairness between groups, since bandwidth
control effectively takes priority.
Regards,
Bharata.
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-13 6:47 ` Bharata B Rao
@ 2010-10-13 6:52 ` Paul Turner
0 siblings, 0 replies; 51+ messages in thread
From: Paul Turner @ 2010-10-13 6:52 UTC (permalink / raw)
To: bharata
Cc: KAMEZAWA Hiroyuki, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Tue, Oct 12, 2010 at 11:47 PM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Tue, Oct 12, 2010 at 11:44:29PM -0700, Paul Turner wrote:
>> On Tue, Oct 12, 2010 at 11:34 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Tue, 12 Oct 2010 13:22:02 +0530
>> > Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
>> >
>> >> sched: throttle cfs_rq entities which exceed their local quota
>> >>
>> >> From: Paul Turner <pjt@google.com>
>> >>
>> >> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
>> >> cfs_rq's local quota and whether there is global quota available to continue
>> >> enabling it in the event we run out.
>> >>
>> >> This patch adds the required support for the latter case, throttling entities
>> >> until quota is available to run. Throttling dequeues the entity in question
>> >> and sends a reschedule to the owning cpu so that it can be evicted.
>> >>
>> >> The following restrictions apply to a throttled cfs_rq:
>> >> - It is dequeued from sched_entity hierarchy and restricted from being
>> >> re-enqueued. This means that new/waking children of this entity will be
>> >> queued up to it, but not past it.
>> >> - It does not contribute to weight calculations in tg_shares_up
>> >> - In the case that the cfs_rq of the cpu we are trying to pull from is throttled
>> >> it is ignored by the load balancer in __load_balance_fair() and
>> >> move_one_task_fair().
>> >>
>> >> Signed-off-by: Paul Turner <pjt@google.com>
>> >> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> >> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> >> ---
>> >> kernel/sched.c | 12 ++++++++
>> >> kernel/sched_fair.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++----
>> >> 2 files changed, 76 insertions(+), 6 deletions(-)
>> >>
>> >> --- a/kernel/sched.c
>> >> +++ b/kernel/sched.c
>> >> @@ -387,6 +387,7 @@ struct cfs_rq {
>> >> #endif
>> >> #ifdef CONFIG_CFS_BANDWIDTH
>> >> u64 quota_assigned, quota_used;
>> >> + int throttled;
>> >> #endif
>> >> #endif
>> >> };
>> >> @@ -1668,6 +1669,8 @@ static void update_group_shares_cpu(stru
>> >> }
>> >> }
>> >>
>> >> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
>> >> +
>> >
>> > I'm just curious: is a static inline forward declaration actually inlined?
>> >
>>
>> Hm. This function is tiny, I should just move it up, thanks.
>>
>> >> /*
>> >> * Re-compute the task group their per cpu shares over the given domain.
>> >> * This needs to be done in a bottom-up fashion because the rq weight of a
>> >> @@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
>> >> usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
>> >>
>> >> for_each_cpu(i, sched_domain_span(sd)) {
>> >> - weight = tg->cfs_rq[i]->load.weight;
>> >> + /*
>> >> + * bandwidth throttled entities cannot contribute to load
>> >> + * balance
>> >> + */
>> >> + if (!cfs_rq_throttled(tg->cfs_rq[i]))
>> >> + weight = tg->cfs_rq[i]->load.weight;
>> >> + else
>> >> + weight = 0;
>> >
>> > Can cpu.shares and bandwidth control be used simultaneously, or...
>> > is this fair? I'm not familiar with the scheduler, but this seems to allow boosting this tg.
>> > Could you add brief documentation of the spec/feature in the next post?
>> >
>>
>> Bandwidth control is orthogonal to shares; shares continue to control the
>> distribution of bandwidth while within quota. Bandwidth control only
>> has a 'perceivable' effect when you exceed your reservation within a
>> quota period.
>
> So if a group gets throttled because it's approaching its limit, it might
> not be possible to see perfect fairness between groups, since bandwidth
> control effectively takes priority.
It's two-fold:
A) The shares will be released so that they can be distributed to
other cpus within the group (where there may be quota remaining)
B) The weight will be hierarchically removed to *improve* fairness,
since those entities are not actually running (this fixes the
accounting; because these calculations are lazy, we don't do it at
the time of throttle).
It shouldn't have negative implications for group:group fairness.
>
> Regards,
> Bharata.
>
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-13 6:44 ` Paul Turner
2010-10-13 6:47 ` Bharata B Rao
@ 2010-10-13 7:00 ` KAMEZAWA Hiroyuki
2010-10-13 7:13 ` Paul Turner
1 sibling, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-13 7:00 UTC (permalink / raw)
To: Paul Turner
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Tue, 12 Oct 2010 23:44:29 -0700
Paul Turner <pjt@google.com> wrote:
> On Tue, Oct 12, 2010 at 11:34 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 12 Oct 2010 13:22:02 +0530
> > Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
> >> /*
> >> * Re-compute the task group their per cpu shares over the given domain.
> >> * This needs to be done in a bottom-up fashion because the rq weight of a
> >> @@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
> >> usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
> >>
> >> for_each_cpu(i, sched_domain_span(sd)) {
> >> - weight = tg->cfs_rq[i]->load.weight;
> >> + /*
> >> + * bandwidth throttled entities cannot contribute to load
> >> + * balance
> >> + */
> >> + if (!cfs_rq_throttled(tg->cfs_rq[i]))
> >> + weight = tg->cfs_rq[i]->load.weight;
> >> + else
> >> + weight = 0;
> >
> > Can cpu.shares and bandwidth control be used simultaneously, or...
> > is this fair? I'm not familiar with the scheduler, but this seems to allow boosting this tg.
> > Could you add brief documentation of the spec/feature in the next post?
> >
>
> Bandwidth control is orthogonal to shares; shares continue to control the
> distribution of bandwidth while within quota. Bandwidth control only
> has a 'perceivable' effect when you exceed your reservation within a
> quota period.
>
> What the above is doing is removing any throttled entities from the
> load-balancer's weight calculations (based on contribution from
> throttled entities) since these entities are already dequeued and
> cannot be balanced.
>
> Or have I misunderstood your question?
>
Thank you for the quick reply.
It seems I didn't understand what "cfs_rq_throttled() == 1" means.
Thanks,
-Kame
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-13 7:00 ` KAMEZAWA Hiroyuki
@ 2010-10-13 7:13 ` Paul Turner
0 siblings, 0 replies; 51+ messages in thread
From: Paul Turner @ 2010-10-13 7:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Wed, Oct 13, 2010 at 12:00 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 12 Oct 2010 23:44:29 -0700
> Paul Turner <pjt@google.com> wrote:
>
>> On Tue, Oct 12, 2010 at 11:34 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Tue, 12 Oct 2010 13:22:02 +0530
>> > Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
>
>> >> /*
>> >> * Re-compute the task group their per cpu shares over the given domain.
>> >> * This needs to be done in a bottom-up fashion because the rq weight of a
>> >> @@ -1688,7 +1691,14 @@ static int tg_shares_up(struct task_grou
>> >> usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
>> >>
>> >> for_each_cpu(i, sched_domain_span(sd)) {
>> >> - weight = tg->cfs_rq[i]->load.weight;
>> >> + /*
>> >> + * bandwidth throttled entities cannot contribute to load
>> >> + * balance
>> >> + */
>> >> + if (!cfs_rq_throttled(tg->cfs_rq[i]))
>> >> + weight = tg->cfs_rq[i]->load.weight;
>> >> + else
>> >> + weight = 0;
>> >
>> > Can cpu.shares and bandwidth control be used simultaneously, or...
>> > is this fair? I'm not familiar with the scheduler, but this seems to allow boosting this tg.
>> > Could you add brief documentation of the spec/feature in the next post?
>> >
>>
>> Bandwidth control is orthogonal to shares; shares continue to control the
>> distribution of bandwidth while within quota. Bandwidth control only
>> has a 'perceivable' effect when you exceed your reservation within a
>> quota period.
>>
>> What the above is doing is removing any throttled entities from the
>> load-balancer's weight calculations (based on contribution from
>> throttled entities) since these entities are already dequeued and
>> cannot be balanced.
>>
>> Or have I misunderstood your question?
>>
>
> Thank you for quick reply.
> It seems I don't undestand what "cfs_rq_throttled()==1" means.
cfs_rq_throttled() == 1 correlates to cfs_rq->throttled == 1.
This is set when a cfs_rq exceeds its local bandwidth quota and there
is no quota available in the global pool to refresh it.
At that point the cfs_rq is forcibly throttled; we accomplish this by:
a) Ending its execution
b) Removing it from the entity tree (se->on_rq == 0)
c) Setting the throttled flag
The cfs_rq->throttled flag is required to track this state so that we
do not re-enqueue a throttled entity on child wake-up.
It also tells us here that we should update the accounting since these
se's, while containing runnable entities, are not actually on_rq.
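A minimal sketch of that state machine (plain Python; the class and methods are illustrative stand-ins, not the kernel implementation):

```python
# Sketch of the throttle state Paul describes: throttling dequeues the
# entity (se->on_rq == 0) and sets cfs_rq->throttled so that child
# wake-ups do not re-enqueue it until quota is refreshed.

class CfsRq:
    def __init__(self):
        self.on_rq = True        # stand-in for se->on_rq
        self.throttled = False   # stand-in for cfs_rq->throttled

    def throttle(self):
        self.on_rq = False       # removed from the entity tree
        self.throttled = True    # blocks re-enqueue on child wake-up

    def try_enqueue(self):
        # a waking child queues up to, but not past, a throttled entity
        if self.throttled:
            return False
        self.on_rq = True
        return True

    def unthrottle(self):
        self.throttled = False   # global pool refreshed our quota

rq = CfsRq()
rq.throttle()
blocked = rq.try_enqueue()       # stays dequeued while throttled
rq.unthrottle()
requeued = rq.try_enqueue()      # now eligible to run again
```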
>
> Thanks,
> -Kame
>
>
* Re: [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
2010-10-12 7:50 ` [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking Bharata B Rao
@ 2010-10-13 13:00 ` Balbir Singh
2010-10-14 5:14 ` Bharata B Rao
2010-10-14 7:52 ` Peter Zijlstra
0 siblings, 2 replies; 51+ messages in thread
From: Balbir Singh @ 2010-10-13 13:00 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
* Bharata B Rao <bharata@linux.vnet.ibm.com> [2010-10-12 13:20:23]:
> sched: introduce primitives to account for CFS bandwidth tracking
>
> From: Paul Turner <pjt@google.com>
>
> In this patch we introduce the notion of CFS bandwidth, to account for the
> realities of SMP this is partitioned into globally unassigned bandwidth, and
> locally claimed bandwidth:
> - The global bandwidth is per task_group, it represents a pool of unclaimed
> bandwidth that cfs_rq's can allocate from. It uses the new cfs_bandwidth
> structure.
> - The local bandwidth is tracked per-cfs_rq; this represents allotments from
> the global pool of bandwidth assigned to the task_group.
>
> Bandwidth is managed through cgroupfs via two new files in the cpu subsystem:
> - cpu.cfs_period_us : the bandwidth period in usecs
> - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
> to consume over the period above.
>
> A per-cfs_bandwidth timer is also introduced to handle future refresh at
> period expiration. There's some minor refactoring here so that
> start_bandwidth_timer() functionality can be shared.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
> init/Kconfig | 9 +
> kernel/sched.c | 271 +++++++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched_fair.c | 10 +
> 3 files changed, 268 insertions(+), 22 deletions(-)
>
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -609,6 +609,15 @@ config FAIR_GROUP_SCHED
> depends on CGROUP_SCHED
> default CGROUP_SCHED
>
> +config CFS_BANDWIDTH
> + bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
> + depends on EXPERIMENTAL
> + depends on FAIR_GROUP_SCHED
> + default n
> + help
> + This option allows users to define quota and period for cpu
> + bandwidth provisioning on a per-cgroup basis.
> +
> config RT_GROUP_SCHED
> bool "Group scheduling for SCHED_RR/FIFO"
> depends on EXPERIMENTAL
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -192,10 +192,28 @@ static inline int rt_bandwidth_enabled(v
> return sysctl_sched_rt_runtime >= 0;
> }
>
> -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> {
> - ktime_t now;
> + unsigned long delta;
> + ktime_t soft, hard, now;
> +
> + for (;;) {
> + if (hrtimer_active(period_timer))
> + break;
>
> + now = hrtimer_cb_get_time(period_timer);
> + hrtimer_forward(period_timer, now, period);
> +
> + soft = hrtimer_get_softexpires(period_timer);
> + hard = hrtimer_get_expires(period_timer);
> + delta = ktime_to_ns(ktime_sub(hard, soft));
> + __hrtimer_start_range_ns(period_timer, soft, delta,
> + HRTIMER_MODE_ABS_PINNED, 0);
This code can be replaced with
hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED) if we
don't care about wakeup_softirq; is there a reason we prefer to keep
wakeup as 0?
> + }
> +}
> +
> +static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +{
> if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> return;
>
> @@ -203,22 +221,7 @@ static void start_rt_bandwidth(struct rt
> return;
>
> raw_spin_lock(&rt_b->rt_runtime_lock);
> - for (;;) {
> - unsigned long delta;
> - ktime_t soft, hard;
> -
> - if (hrtimer_active(&rt_b->rt_period_timer))
> - break;
> -
> - now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
> - hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
> -
> - soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
> - hard = hrtimer_get_expires(&rt_b->rt_period_timer);
> - delta = ktime_to_ns(ktime_sub(hard, soft));
> - __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
> - HRTIMER_MODE_ABS_PINNED, 0);
> - }
> + start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
> raw_spin_unlock(&rt_b->rt_runtime_lock);
> }
>
> @@ -243,6 +246,15 @@ struct cfs_rq;
>
> static LIST_HEAD(task_groups);
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +struct cfs_bandwidth {
> + raw_spinlock_t lock;
> + ktime_t period;
> + u64 runtime, quota;
> + struct hrtimer period_timer;
> +};
> +#endif
> +
> /* task group related information */
> struct task_group {
> struct cgroup_subsys_state css;
> @@ -268,6 +280,10 @@ struct task_group {
> struct task_group *parent;
> struct list_head siblings;
> struct list_head children;
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> + struct cfs_bandwidth cfs_bandwidth;
> +#endif
> };
>
> #define root_task_group init_task_group
> @@ -369,9 +385,76 @@ struct cfs_rq {
> */
> unsigned long rq_weight;
> #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> + u64 quota_assigned, quota_used;
> +#endif
> #endif
> };
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> +
> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> +{
> + struct cfs_bandwidth *cfs_b =
> + container_of(timer, struct cfs_bandwidth, period_timer);
> + ktime_t now;
> + int overrun;
> + int idle = 0;
> +
> + for (;;) {
> + now = hrtimer_cb_get_time(timer);
> + overrun = hrtimer_forward(timer, now, cfs_b->period);
> +
> + if (!overrun)
> + break;
What is the significance of overrun? overrun is set when delta >
interval. The logic seems to be that the hrtimer is forwarded in steps
of cfs_b->period until we reach the desired time.
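For reference, the contract of hrtimer_forward() that this loop relies on can be modeled in miniature (plain Python; a simplification of the real hrtimer API, which works on ktime values):

```python
# Toy model of hrtimer_forward(): advance the timer's expiry forward in
# whole multiples of `period` until it lies beyond `now`, and return the
# number of periods (overruns) that were skipped.  The loop above calls
# do_sched_cfs_period_timer() once per batch of elapsed periods, and
# stops as soon as the timer is back in the future (overrun == 0).

def hrtimer_forward(expires_ns, now_ns, period_ns):
    if now_ns < expires_ns:
        return expires_ns, 0              # still in the future: no overrun
    missed = (now_ns - expires_ns) // period_ns + 1
    return expires_ns + missed * period_ns, missed

# Timer was due at t=0.5s with a 0.5s period, but we ran late at t=1.26s:
expires, overrun = hrtimer_forward(500_000_000, 1_260_000_000, 500_000_000)
# overrun == 2 (the periods at 0.5s and 1.0s were missed); expires is 1.5s
```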
> +
> + idle = do_sched_cfs_period_timer(cfs_b, overrun);
> + }
> +
> + return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> +}
> +
> +static
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
> +{
> + raw_spin_lock_init(&cfs_b->lock);
> + cfs_b->quota = cfs_b->runtime = quota;
> + cfs_b->period = ns_to_ktime(period);
> +
> + hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + cfs_b->period_timer.function = sched_cfs_period_timer;
> +}
> +
> +static
> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> + cfs_rq->quota_used = 0;
> + if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
> + cfs_rq->quota_assigned = RUNTIME_INF;
> + else
> + cfs_rq->quota_assigned = 0;
> +}
> +
> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> + if (cfs_b->quota == RUNTIME_INF)
> + return;
> +
> + if (hrtimer_active(&cfs_b->period_timer))
> + return;
Why the double check? start_bandwidth_timer() also checks this. Is it to
avoid doing the check under cfs_b->lock?
> +
> + raw_spin_lock(&cfs_b->lock);
> + start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> + raw_spin_unlock(&cfs_b->lock);
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> + hrtimer_cancel(&cfs_b->period_timer);
> +}
> +#endif
> +
> /* Real-Time classes' related field in a runqueue: */
> struct rt_rq {
> struct rt_prio_array active;
> @@ -1840,6 +1923,14 @@ static inline void __set_task_cpu(struct
>
> static const struct sched_class rt_sched_class;
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * default period for cfs group bandwidth.
> + * default: 0.5s
> + */
> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
> +#endif
> +
> #define sched_class_highest (&rt_sched_class)
> #define for_each_class(class) \
> for (class = sched_class_highest; class; class = class->next)
> @@ -7644,6 +7735,9 @@ static void init_tg_cfs_entry(struct tas
> tg->cfs_rq[cpu] = cfs_rq;
> init_cfs_rq(cfs_rq, rq);
> cfs_rq->tg = tg;
> +#ifdef CONFIG_CFS_BANDWIDTH
> + init_cfs_rq_quota(cfs_rq);
> +#endif
> if (add)
> list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
>
> @@ -7789,6 +7883,10 @@ void __init sched_init(void)
> * We achieve this by letting init_task_group's tasks sit
> * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> */
> +#ifdef CONFIG_CFS_BANDWIDTH
> + init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
> + RUNTIME_INF, sched_cfs_bandwidth_period);
> +#endif
> init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
> #endif
> #endif /* CONFIG_FAIR_GROUP_SCHED */
> @@ -8032,6 +8130,10 @@ static void free_fair_sched_group(struct
> {
> int i;
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> + destroy_cfs_bandwidth(&tg->cfs_bandwidth);
> +#endif
> +
> for_each_possible_cpu(i) {
> if (tg->cfs_rq)
> kfree(tg->cfs_rq[i]);
> @@ -8059,7 +8161,10 @@ int alloc_fair_sched_group(struct task_g
> goto err;
>
> tg->shares = NICE_0_LOAD;
> -
> +#ifdef CONFIG_CFS_BANDWIDTH
> + init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
> + sched_cfs_bandwidth_period);
> +#endif
> for_each_possible_cpu(i) {
> rq = cpu_rq(i);
>
> @@ -8505,7 +8610,7 @@ static int __rt_schedulable(struct task_
> return walk_tg_tree(tg_schedulable, tg_nop, &data);
> }
>
> -static int tg_set_bandwidth(struct task_group *tg,
> +static int tg_set_rt_bandwidth(struct task_group *tg,
> u64 rt_period, u64 rt_runtime)
> {
> int i, err = 0;
> @@ -8544,7 +8649,7 @@ int sched_group_set_rt_runtime(struct ta
> if (rt_runtime_us < 0)
> rt_runtime = RUNTIME_INF;
>
> - return tg_set_bandwidth(tg, rt_period, rt_runtime);
> + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> }
>
> long sched_group_rt_runtime(struct task_group *tg)
> @@ -8569,7 +8674,7 @@ int sched_group_set_rt_period(struct tas
> if (rt_period == 0)
> return -EINVAL;
>
> - return tg_set_bandwidth(tg, rt_period, rt_runtime);
> + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> }
>
> long sched_group_rt_period(struct task_group *tg)
> @@ -8776,6 +8881,116 @@ static u64 cpu_shares_read_u64(struct cg
>
> return (u64) tg->shares;
> }
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +{
> + int i;
> + static DEFINE_MUTEX(mutex);
> +
> + if (tg == &init_task_group)
> + return -EINVAL;
> +
> + if (!period)
> + return -EINVAL;
> +
> + /*
> + * Ensure we have at least one tick of bandwidth every period. This is
> + * to prevent reaching a state of large arrears when throttled via
> + * entity_tick() resulting in prolonged exit starvation.
> + */
> + if (NS_TO_JIFFIES(quota) < 1)
> + return -EINVAL;
I hope we document this in the Documentation :)
> +
> + mutex_lock(&mutex);
> + raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> + tg->cfs_bandwidth.period = ns_to_ktime(period);
> + tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> + raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> +
> + for_each_possible_cpu(i) {
Why not for_each_online_cpu()?
> + struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> + struct rq *rq = rq_of(cfs_rq);
> +
> + raw_spin_lock_irq(&rq->lock);
> + init_cfs_rq_quota(cfs_rq);
> + raw_spin_unlock_irq(&rq->lock);
> + }
> + mutex_unlock(&mutex);
> +
> + return 0;
> +}
> +
> +int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
> +{
> + u64 quota, period;
> +
> + period = ktime_to_ns(tg->cfs_bandwidth.period);
> + if (cfs_runtime_us < 0)
> + quota = RUNTIME_INF;
So -1 or -x results in infinite quota?
> + else
> + quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
> +
> + return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_quota(struct task_group *tg)
> +{
> + u64 quota_us;
> +
> + if (tg->cfs_bandwidth.quota == RUNTIME_INF)
> + return -1;
This is a little inconsistent: I can set -5, but I read back -1 when I
do a get?
> +
> + quota_us = tg->cfs_bandwidth.quota;
> + do_div(quota_us, NSEC_PER_USEC);
> + return quota_us;
> +}
> +
> +int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
> +{
> + u64 quota, period;
> +
> + period = (u64)cfs_period_us * NSEC_PER_USEC;
> + quota = tg->cfs_bandwidth.quota;
> +
> + if (period <= 0)
> + return -EINVAL;
> +
> + return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_period(struct task_group *tg)
> +{
> + u64 cfs_period_us;
> +
> + cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
> + do_div(cfs_period_us, NSEC_PER_USEC);
> + return cfs_period_us;
> +}
> +
> +static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
> +{
> + return tg_get_cfs_quota(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
> + s64 cfs_quota_us)
> +{
> + return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
> +}
> +
> +static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
> +{
> + return tg_get_cfs_period(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
> + u64 cfs_period_us)
> +{
> + return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
> +}
> +
> +#endif /* CONFIG_CFS_BANDWIDTH */
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> #ifdef CONFIG_RT_GROUP_SCHED
> @@ -8810,6 +9025,18 @@ static struct cftype cpu_files[] = {
> .write_u64 = cpu_shares_write_u64,
> },
> #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> + {
> + .name = "cfs_quota_us",
> + .read_s64 = cpu_cfs_quota_read_s64,
> + .write_s64 = cpu_cfs_quota_write_s64,
> + },
> + {
> + .name = "cfs_period_us",
> + .read_u64 = cpu_cfs_period_read_u64,
> + .write_u64 = cpu_cfs_period_write_u64,
> + },
> +#endif
> #ifdef CONFIG_RT_GROUP_SCHED
> {
> .name = "rt_runtime_us",
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -360,6 +360,9 @@ static void __enqueue_entity(struct cfs_
>
> rb_link_node(&se->run_node, parent, link);
> rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
> +#ifdef CONFIG_CFS_BANDWIDTH
> + start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
> +#endif
> }
>
> static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -1126,6 +1129,13 @@ static void yield_task_fair(struct rq *r
> se->vruntime = rightmost->vruntime + 1;
> }
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> +{
> + return 1;
Hmmm.. what is 1? true?
> +}
> +#endif
> +
> #ifdef CONFIG_SMP
>
> static void task_waking_fair(struct rq *rq, struct task_struct *p)
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-12 7:51 ` [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage Bharata B Rao
@ 2010-10-13 13:30 ` Balbir Singh
2010-10-13 13:46 ` Nikhil Rao
2010-10-14 8:57 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Balbir Singh @ 2010-10-13 13:30 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
* Bharata B Rao <bharata@linux.vnet.ibm.com> [2010-10-12 13:21:09]:
> sched: accumulate per-cfs_rq cpu usage
>
> From: Paul Turner <pjt@google.com>
>
> Introduce account_cfs_rq_quota() to account bandwidth usage on the cfs_rq
> level versus task_groups for which bandwidth has been assigned. This is
> tracked by whether the local cfs_rq->quota_assigned is finite or infinite
> (RUNTIME_INF).
>
> For cfs_rq's that belong to a bandwidth constrained task_group we introduce
> tg_request_cfs_quota() which attempts to allocate quota from the global pool
> for use locally. Updates involving the global pool are currently protected
> under cfs_bandwidth->lock, local pools are protected by rq->lock.
>
> This patch only attempts to assign and track quota, no action is taken in the
> case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
> include/linux/sched.h | 4 ++++
> kernel/sched.c | 13 +++++++++++++
> kernel/sched_fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> kernel/sysctl.c | 10 ++++++++++
> 4 files changed, 77 insertions(+)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1898,6 +1898,10 @@ int sched_rt_handler(struct ctl_table *t
> void __user *buffer, size_t *lenp,
> loff_t *ppos);
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +extern unsigned int sysctl_sched_cfs_bandwidth_slice;
> +#endif
> +
> extern unsigned int sysctl_sched_compat_yield;
>
> #ifdef CONFIG_RT_MUTEXES
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1929,6 +1929,19 @@ static const struct sched_class rt_sched
> * default: 0.5s
> */
> static u64 sched_cfs_bandwidth_period = 500000000ULL;
> +
> +/*
> + * default slice of quota to allocate from global tg to local cfs_rq pool on
> + * each refresh
> + * default: 10ms
> + */
> +unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
> +
> +static inline u64 sched_cfs_bandwidth_slice(void)
> +{
> + return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
> +}
> +
> #endif
>
> #define sched_class_highest (&rt_sched_class)
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -267,6 +267,16 @@ find_matching_se(struct sched_entity **s
>
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +{
> + return &tg->cfs_bandwidth;
> +}
> +
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> + unsigned long delta_exec);
> +#endif
> +
>
> /**************************************************************
> * Scheduling class tree data structure manipulation methods:
> @@ -547,6 +557,9 @@ static void update_curr(struct cfs_rq *c
> cpuacct_charge(curtask, delta_exec);
> account_group_exec_runtime(curtask, delta_exec);
> }
> +#ifdef CONFIG_CFS_BANDWIDTH
> + account_cfs_rq_quota(cfs_rq, delta_exec);
> +#endif
> }
>
> static inline void
> @@ -1130,6 +1143,43 @@ static void yield_task_fair(struct rq *r
> }
>
> #ifdef CONFIG_CFS_BANDWIDTH
> +static u64 tg_request_cfs_quota(struct task_group *tg)
> +{
> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> + u64 delta = 0;
> +
> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
Quick question: for cfs_b->quota == RUNTIME_INF, won't cfs_b->runtime
always be > 0?
> + raw_spin_lock(&cfs_b->lock);
> + /*
> + * it's possible a bandwidth update has changed the global
> + * pool.
> + */
> + if (cfs_b->quota == RUNTIME_INF)
> + delta = sched_cfs_bandwidth_slice();
> + else {
> + delta = min(cfs_b->runtime,
> + sched_cfs_bandwidth_slice());
> + cfs_b->runtime -= delta;
> + }
> + raw_spin_unlock(&cfs_b->lock);
> + }
> + return delta;
> +}
> +
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> + unsigned long delta_exec)
> +{
> + if (cfs_rq->quota_assigned == RUNTIME_INF)
> + return;
> +
> + cfs_rq->quota_used += delta_exec;
> +
> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> + return;
> +
> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> +}
> +
> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> {
> return 1;
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -384,6 +384,16 @@ static struct ctl_table kern_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> +#ifdef CONFIG_CFS_BANDWIDTH
> + {
> + .procname = "sched_cfs_bandwidth_slice_us",
> + .data = &sysctl_sched_cfs_bandwidth_slice,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &one,
> + },
> +#endif
> #ifdef CONFIG_PROVE_LOCKING
> {
> .procname = "prove_locking",
--
Three Cheers,
Balbir
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-13 13:30 ` Balbir Singh
@ 2010-10-13 13:46 ` Nikhil Rao
2010-10-13 13:59 ` Balbir Singh
0 siblings, 1 reply; 51+ messages in thread
From: Nikhil Rao @ 2010-10-13 13:46 UTC (permalink / raw)
To: balbir
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Paul Turner
On Wed, Oct 13, 2010 at 6:30 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Bharata B Rao <bharata@linux.vnet.ibm.com> [2010-10-12 13:21:09]:
>
>> sched: accumulate per-cfs_rq cpu usage
>>
>> From: Paul Turner <pjt@google.com>
>>
>> Introduce account_cfs_rq_quota() to account bandwidth usage on the cfs_rq
>> level versus task_groups for which bandwidth has been assigned. This is
>> tracked by whether the local cfs_rq->quota_assigned is finite or infinite
>> (RUNTIME_INF).
>>
>> For cfs_rq's that belong to a bandwidth constrained task_group we introduce
>> tg_request_cfs_quota() which attempts to allocate quota from the global pool
>> for use locally. Updates involving the global pool are currently protected
>> under cfs_bandwidth->lock, local pools are protected by rq->lock.
>>
>> This patch only attempts to assign and track quota, no action is taken in the
>> case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
>>
>> Signed-off-by: Paul Turner <pjt@google.com>
>> Signed-off-by: Nikhil Rao <ncrao@google.com>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> ---
>> include/linux/sched.h | 4 ++++
>> kernel/sched.c | 13 +++++++++++++
>> kernel/sched_fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> kernel/sysctl.c | 10 ++++++++++
>> 4 files changed, 77 insertions(+)
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1898,6 +1898,10 @@ int sched_rt_handler(struct ctl_table *t
>> void __user *buffer, size_t *lenp,
>> loff_t *ppos);
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>> +#endif
>> +
>> extern unsigned int sysctl_sched_compat_yield;
>>
>> #ifdef CONFIG_RT_MUTEXES
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -1929,6 +1929,19 @@ static const struct sched_class rt_sched
>> * default: 0.5s
>> */
>> static u64 sched_cfs_bandwidth_period = 500000000ULL;
>> +
>> +/*
>> + * default slice of quota to allocate from global tg to local cfs_rq pool on
>> + * each refresh
>> + * default: 10ms
>> + */
>> +unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
>> +
>> +static inline u64 sched_cfs_bandwidth_slice(void)
>> +{
>> + return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
>> +}
>> +
>> #endif
>>
>> #define sched_class_highest (&rt_sched_class)
>> --- a/kernel/sched_fair.c
>> +++ b/kernel/sched_fair.c
>> @@ -267,6 +267,16 @@ find_matching_se(struct sched_entity **s
>>
>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>> +{
>> + return &tg->cfs_bandwidth;
>> +}
>> +
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> + unsigned long delta_exec);
>> +#endif
>> +
>>
>> /**************************************************************
>> * Scheduling class tree data structure manipulation methods:
>> @@ -547,6 +557,9 @@ static void update_curr(struct cfs_rq *c
>> cpuacct_charge(curtask, delta_exec);
>> account_group_exec_runtime(curtask, delta_exec);
>> }
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + account_cfs_rq_quota(cfs_rq, delta_exec);
>> +#endif
>> }
>>
>> static inline void
>> @@ -1130,6 +1143,43 @@ static void yield_task_fair(struct rq *r
>> }
>>
>> #ifdef CONFIG_CFS_BANDWIDTH
>> +static u64 tg_request_cfs_quota(struct task_group *tg)
>> +{
>> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> + u64 delta = 0;
>> +
>> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
>
> Quick question for cfs_b->quota == RUNTIME_INF, won't cfs_b->runtime
> be always > 0?
Hi Balbir,
cfs_b->runtime can be 0 if the task group exhausts its quota.
cfs_b->runtime is a counter that is periodically refreshed to
cfs_b->quota, and is decremented every time a cfs_rq requests a slice.
-Thanks,
Nikhil
>
>> + raw_spin_lock(&cfs_b->lock);
>> + /*
>> + * it's possible a bandwidth update has changed the global
>> + * pool.
>> + */
>> + if (cfs_b->quota == RUNTIME_INF)
>> + delta = sched_cfs_bandwidth_slice();
>> + else {
>> + delta = min(cfs_b->runtime,
>> + sched_cfs_bandwidth_slice());
>> + cfs_b->runtime -= delta;
>> + }
>> + raw_spin_unlock(&cfs_b->lock);
>> + }
>> + return delta;
>> +}
>> +
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> + unsigned long delta_exec)
>> +{
>> + if (cfs_rq->quota_assigned == RUNTIME_INF)
>> + return;
>> +
>> + cfs_rq->quota_used += delta_exec;
>> +
>> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> + return;
>> +
>> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> +}
>> +
>> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>> {
>> return 1;
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -384,6 +384,16 @@ static struct ctl_table kern_table[] = {
>> .mode = 0644,
>> .proc_handler = proc_dointvec,
>> },
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + {
>> + .procname = "sched_cfs_bandwidth_slice_us",
>> + .data = &sysctl_sched_cfs_bandwidth_slice,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = &one,
>> + },
>> +#endif
>> #ifdef CONFIG_PROVE_LOCKING
>> {
>> .procname = "prove_locking",
>
> --
> Three Cheers,
> Balbir
>
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-13 13:46 ` Nikhil Rao
@ 2010-10-13 13:59 ` Balbir Singh
2010-10-13 14:41 ` Nikhil Rao
0 siblings, 1 reply; 51+ messages in thread
From: Balbir Singh @ 2010-10-13 13:59 UTC (permalink / raw)
To: Nikhil Rao
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Paul Turner
* Nikhil Rao <ncrao@google.com> [2010-10-13 06:46:18]:
> > Quick question for cfs_b->quota == RUNTIME_INF, won't cfs_b->runtime
> > be always > 0?
>
> Hi Balbir,
>
> cfs_b->runtime can be 0 if the task group exhausts its quota.
> cfs_b->runtime is a counter that is periodically refreshed to
> cfs_b->quota, and is decremented every time a cfs_rq requests a slice.
>
Thanks, Nikhil
I saw several checks for quota == RUNTIME_INF and saw updates not
happening. Do we track usage for cgroups with infinite quota? IOW, is
runtime updated? If so, why?
--
Three Cheers,
Balbir
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-13 13:59 ` Balbir Singh
@ 2010-10-13 14:41 ` Nikhil Rao
2010-10-14 5:39 ` Balbir Singh
0 siblings, 1 reply; 51+ messages in thread
From: Nikhil Rao @ 2010-10-13 14:41 UTC (permalink / raw)
To: balbir
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Paul Turner
On Wed, Oct 13, 2010 at 6:59 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Nikhil Rao <ncrao@google.com> [2010-10-13 06:46:18]:
>
>> > Quick question for cfs_b->quota == RUNTIME_INF, won't cfs_b->runtime
>> > be always > 0?
>>
>> Hi Balbir,
>>
>> cfs_b->runtime can be 0 if the task group exhausts its quota.
>> cfs_b->runtime is a counter that is periodically refreshed to
>> cfs_b->quota, and is decremented every time a cfs_rq requests a slice.
>>
>
> Thanks, Nikhil
>
> I saw several checks for quota == RUNTIME_INF and saw updates not
> happening. Do we track usage for cgroups with infinite quota? IOW, is
> runtime updated? If so, why?
>
We don't track usage or update runtime for cgroups with infinite
quota. Most of these RUNTIME_INF checks are usually at the top of the
functions and return immediately. The checks in the middle of
functions are mostly to prevent races when cfs_b->quota is updated.
And ... I think I might have misunderstood your original question. Let
me try answering that again. We check for quota == RUNTIME_INF (in
addition to cfs_b->runtime > 0) because it is possible that the global
pool is updated to RUNTIME_INF between the quota check in
account_cfs_rq_quota() and when the actual quota is distributed in
tg_request_cfs_quota(). I hope that answers your original question.
-Thanks,
Nikhil
> --
> Three Cheers,
> Balbir
>
* Re: [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
2010-10-13 13:00 ` Balbir Singh
@ 2010-10-14 5:14 ` Bharata B Rao
2010-10-14 7:52 ` Peter Zijlstra
1 sibling, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-10-14 5:14 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
On Wed, Oct 13, 2010 at 06:30:18PM +0530, Balbir Singh wrote:
> * Bharata B Rao <bharata@linux.vnet.ibm.com> [2010-10-12 13:20:23]:
>
> > sched: introduce primitives to account for CFS bandwidth tracking
> >
> > From: Paul Turner <pjt@google.com>
> >
> > In this patch we introduce the notion of CFS bandwidth, to account for the
> > realities of SMP this is partitioned into globally unassigned bandwidth, and
> > locally claimed bandwidth:
> > - The global bandwidth is per task_group, it represents a pool of unclaimed
> > bandwidth that cfs_rq's can allocate from. It uses the new cfs_bandwidth
> > structure.
> > - The local bandwidth is tracked per-cfs_rq, this represents allotments from
> > the global pool
> > bandwidth assigned to a task_group, this is tracked using the
> > new cfs_bandwidth structure.
> >
> > Bandwidth is managed via cgroupfs via two new files in the cpu subsystem:
> > - cpu.cfs_period_us : the bandwidth period in usecs
> > - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
> > to consume over period above.
> >
> > A per-cfs_bandwidth timer is also introduced to handle future refresh at
> > period expiration. There's some minor refactoring here so that
> > start_bandwidth_timer() functionality can be shared
> >
> > Signed-off-by: Paul Turner <pjt@google.com>
> > Signed-off-by: Nikhil Rao <ncrao@google.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > ---
> > init/Kconfig | 9 +
> > kernel/sched.c | 271 +++++++++++++++++++++++++++++++++++++++++++++++-----
> > kernel/sched_fair.c | 10 +
> > 3 files changed, 268 insertions(+), 22 deletions(-)
> >
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -609,6 +609,15 @@ config FAIR_GROUP_SCHED
> > depends on CGROUP_SCHED
> > default CGROUP_SCHED
> >
> > +config CFS_BANDWIDTH
> > + bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
> > + depends on EXPERIMENTAL
> > + depends on FAIR_GROUP_SCHED
> > + default n
> > + help
> > + This option allows users to define quota and period for cpu
> > + bandwidth provisioning on a per-cgroup basis.
> > +
> > config RT_GROUP_SCHED
> > bool "Group scheduling for SCHED_RR/FIFO"
> > depends on EXPERIMENTAL
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -192,10 +192,28 @@ static inline int rt_bandwidth_enabled(v
> > return sysctl_sched_rt_runtime >= 0;
> > }
> >
> > -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> > +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> > {
> > - ktime_t now;
> > + unsigned long delta;
> > + ktime_t soft, hard, now;
> > +
> > + for (;;) {
> > + if (hrtimer_active(period_timer))
> > + break;
> >
> > + now = hrtimer_cb_get_time(period_timer);
> > + hrtimer_forward(period_timer, now, period);
> > +
> > + soft = hrtimer_get_softexpires(period_timer);
> > + hard = hrtimer_get_expires(period_timer);
> > + delta = ktime_to_ns(ktime_sub(hard, soft));
> > + __hrtimer_start_range_ns(period_timer, soft, delta,
> > + HRTIMER_MODE_ABS_PINNED, 0);
>
> This code can be replaced with
>
> hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED) if we
> don't care about wakeup_softirq, is there a reason we prefer to keep
> wakeup as 0?
We have carried this over from the RT code. ATM I am not exactly sure
whether the wakeup parameter makes any difference to us. Need to check.
>
> > + }
> > +}
> > +
> > +static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> > +{
> > if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> > return;
> >
> > @@ -203,22 +221,7 @@ static void start_rt_bandwidth(struct rt
> > return;
> >
> > raw_spin_lock(&rt_b->rt_runtime_lock);
> > - for (;;) {
> > - unsigned long delta;
> > - ktime_t soft, hard;
> > -
> > - if (hrtimer_active(&rt_b->rt_period_timer))
> > - break;
> > -
> > - now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
> > - hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
> > -
> > - soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
> > - hard = hrtimer_get_expires(&rt_b->rt_period_timer);
> > - delta = ktime_to_ns(ktime_sub(hard, soft));
> > - __hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
> > - HRTIMER_MODE_ABS_PINNED, 0);
> > - }
> > + start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
> > raw_spin_unlock(&rt_b->rt_runtime_lock);
> > }
> >
> > @@ -243,6 +246,15 @@ struct cfs_rq;
> >
> > static LIST_HEAD(task_groups);
> >
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +struct cfs_bandwidth {
> > + raw_spinlock_t lock;
> > + ktime_t period;
> > + u64 runtime, quota;
> > + struct hrtimer period_timer;
> > +};
> > +#endif
> > +
> > /* task group related information */
> > struct task_group {
> > struct cgroup_subsys_state css;
> > @@ -268,6 +280,10 @@ struct task_group {
> > struct task_group *parent;
> > struct list_head siblings;
> > struct list_head children;
> > +
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + struct cfs_bandwidth cfs_bandwidth;
> > +#endif
> > };
> >
> > #define root_task_group init_task_group
> > @@ -369,9 +385,76 @@ struct cfs_rq {
> > */
> > unsigned long rq_weight;
> > #endif
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + u64 quota_assigned, quota_used;
> > +#endif
> > #endif
> > };
> >
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> > +
> > +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > +{
> > + struct cfs_bandwidth *cfs_b =
> > + container_of(timer, struct cfs_bandwidth, period_timer);
> > + ktime_t now;
> > + int overrun;
> > + int idle = 0;
> > +
> > + for (;;) {
> > + now = hrtimer_cb_get_time(timer);
> > + overrun = hrtimer_forward(timer, now, cfs_b->period);
> > +
> > + if (!overrun)
> > + break;
>
> What is the significance of overrun? overrun is set when delta >
> interval. The logic seems to be that hrtimer is forwarded in steps of
> cfs_b->period till we reach the desired time.
This is again carried over from the RT code, where RT accounts for timer
overruns by appropriately adjusting the runtime. We aren't currently
doing anything with the overrun count in our timer handler, but we do
need to check if and how we should handle timer overruns.
>
> > +
> > + idle = do_sched_cfs_period_timer(cfs_b, overrun);
> > + }
> > +
> > + return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > +}
> > +
> > +static
> > +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
> > +{
> > + raw_spin_lock_init(&cfs_b->lock);
> > + cfs_b->quota = cfs_b->runtime = quota;
> > + cfs_b->period = ns_to_ktime(period);
> > +
> > + hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > + cfs_b->period_timer.function = sched_cfs_period_timer;
> > +}
> > +
> > +static
> > +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
> > +{
> > + cfs_rq->quota_used = 0;
> > + if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
> > + cfs_rq->quota_assigned = RUNTIME_INF;
> > + else
> > + cfs_rq->quota_assigned = 0;
> > +}
> > +
> > +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +{
> > + if (cfs_b->quota == RUNTIME_INF)
> > + return;
> > +
> > + if (hrtimer_active(&cfs_b->period_timer))
> > + return;
>
> Why the double check, start_bandwidth_timer also checks this. Is it to
> avoid doing the check under cfs_b->lock?
Again this is carried over from the RT code. I think the first one is a
quick check to back out without having to take cfs_b->lock.
>
> > +
> > + raw_spin_lock(&cfs_b->lock);
> > + start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> > + raw_spin_unlock(&cfs_b->lock);
> > +}
> > +
> > +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +{
> > + hrtimer_cancel(&cfs_b->period_timer);
> > +}
> > +#endif
> > +
> > /* Real-Time classes' related field in a runqueue: */
> > struct rt_rq {
> > struct rt_prio_array active;
> > @@ -1840,6 +1923,14 @@ static inline void __set_task_cpu(struct
> >
> > static const struct sched_class rt_sched_class;
> >
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +/*
> > + * default period for cfs group bandwidth.
> > + * default: 0.5s
> > + */
> > +static u64 sched_cfs_bandwidth_period = 500000000ULL;
> > +#endif
> > +
> > #define sched_class_highest (&rt_sched_class)
> > #define for_each_class(class) \
> > for (class = sched_class_highest; class; class = class->next)
> > @@ -7644,6 +7735,9 @@ static void init_tg_cfs_entry(struct tas
> > tg->cfs_rq[cpu] = cfs_rq;
> > init_cfs_rq(cfs_rq, rq);
> > cfs_rq->tg = tg;
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + init_cfs_rq_quota(cfs_rq);
> > +#endif
> > if (add)
> > list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
> >
> > @@ -7789,6 +7883,10 @@ void __init sched_init(void)
> > * We achieve this by letting init_task_group's tasks sit
> > * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> > */
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + init_cfs_bandwidth(&init_task_group.cfs_bandwidth,
> > + RUNTIME_INF, sched_cfs_bandwidth_period);
> > +#endif
> > init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
> > #endif
> > #endif /* CONFIG_FAIR_GROUP_SCHED */
> > @@ -8032,6 +8130,10 @@ static void free_fair_sched_group(struct
> > {
> > int i;
> >
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + destroy_cfs_bandwidth(&tg->cfs_bandwidth);
> > +#endif
> > +
> > for_each_possible_cpu(i) {
> > if (tg->cfs_rq)
> > kfree(tg->cfs_rq[i]);
> > @@ -8059,7 +8161,10 @@ int alloc_fair_sched_group(struct task_g
> > goto err;
> >
> > tg->shares = NICE_0_LOAD;
> > -
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
> > + sched_cfs_bandwidth_period);
> > +#endif
> > for_each_possible_cpu(i) {
> > rq = cpu_rq(i);
> >
> > @@ -8505,7 +8610,7 @@ static int __rt_schedulable(struct task_
> > return walk_tg_tree(tg_schedulable, tg_nop, &data);
> > }
> >
> > -static int tg_set_bandwidth(struct task_group *tg,
> > +static int tg_set_rt_bandwidth(struct task_group *tg,
> > u64 rt_period, u64 rt_runtime)
> > {
> > int i, err = 0;
> > @@ -8544,7 +8649,7 @@ int sched_group_set_rt_runtime(struct ta
> > if (rt_runtime_us < 0)
> > rt_runtime = RUNTIME_INF;
> >
> > - return tg_set_bandwidth(tg, rt_period, rt_runtime);
> > + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> > }
> >
> > long sched_group_rt_runtime(struct task_group *tg)
> > @@ -8569,7 +8674,7 @@ int sched_group_set_rt_period(struct tas
> > if (rt_period == 0)
> > return -EINVAL;
> >
> > - return tg_set_bandwidth(tg, rt_period, rt_runtime);
> > + return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
> > }
> >
> > long sched_group_rt_period(struct task_group *tg)
> > @@ -8776,6 +8881,116 @@ static u64 cpu_shares_read_u64(struct cg
> >
> > return (u64) tg->shares;
> > }
> > +
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> > +{
> > + int i;
> > + static DEFINE_MUTEX(mutex);
> > +
> > + if (tg == &init_task_group)
> > + return -EINVAL;
> > +
> > + if (!period)
> > + return -EINVAL;
> > +
> > + /*
> > + * Ensure we have at least one tick of bandwidth every period. This is
> > + * to prevent reaching a state of large arrears when throttled via
> > + * entity_tick() resulting in prolonged exit starvation.
> > + */
> > + if (NS_TO_JIFFIES(quota) < 1)
> > + return -EINVAL;
>
> I hope we document this in the Documentation :)
Sure. This, along with documentation for the cgroup files we introduce,
will be added in our next post.
>
> > +
> > + mutex_lock(&mutex);
> > + raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> > + tg->cfs_bandwidth.period = ns_to_ktime(period);
> > + tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> > + raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> > +
> > + for_each_possible_cpu(i) {
>
> Why not for_each_online_cpu()?
Again, this is similar to what is done in RT. However, we need to re-check
whether we want to take any special action on CPU hot(un)plug.
>
> > + struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> > + struct rq *rq = rq_of(cfs_rq);
> > +
> > + raw_spin_lock_irq(&rq->lock);
> > + init_cfs_rq_quota(cfs_rq);
> > + raw_spin_unlock_irq(&rq->lock);
> > + }
> > + mutex_unlock(&mutex);
> > +
> > + return 0;
> > +}
> > +
> > +int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
> > +{
> > + u64 quota, period;
> > +
> > + period = ktime_to_ns(tg->cfs_bandwidth.period);
> > + if (cfs_runtime_us < 0)
> > + quota = RUNTIME_INF;
>
> So -1 or -x results in infinite quota?
Yes.
>
> > + else
> > + quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
> > +
> > + return tg_set_cfs_bandwidth(tg, period, quota);
> > +}
> > +
> > +long tg_get_cfs_quota(struct task_group *tg)
> > +{
> > + u64 quota_us;
> > +
> > + if (tg->cfs_bandwidth.quota == RUNTIME_INF)
> > + return -1;
>
> This is a little inconsistent, I can set -5, but I see -1 when I do a
> get?
I see your point. Maybe we should allow only -1 and positive values
for quota.
>
> > +
> > + quota_us = tg->cfs_bandwidth.quota;
> > + do_div(quota_us, NSEC_PER_USEC);
> > + return quota_us;
> > +}
> > +
> > +int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
> > +{
> > + u64 quota, period;
> > +
> > + period = (u64)cfs_period_us * NSEC_PER_USEC;
> > + quota = tg->cfs_bandwidth.quota;
> > +
> > + if (period <= 0)
> > + return -EINVAL;
> > +
> > + return tg_set_cfs_bandwidth(tg, period, quota);
> > +}
> > +
> > +long tg_get_cfs_period(struct task_group *tg)
> > +{
> > + u64 cfs_period_us;
> > +
> > + cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
> > + do_div(cfs_period_us, NSEC_PER_USEC);
> > + return cfs_period_us;
> > +}
> > +
> > +static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
> > +{
> > + return tg_get_cfs_quota(cgroup_tg(cgrp));
> > +}
> > +
> > +static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
> > + s64 cfs_quota_us)
> > +{
> > + return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
> > +}
> > +
> > +static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
> > +{
> > + return tg_get_cfs_period(cgroup_tg(cgrp));
> > +}
> > +
> > +static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
> > + u64 cfs_period_us)
> > +{
> > + return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
> > +}
> > +
> > +#endif /* CONFIG_CFS_BANDWIDTH */
> > #endif /* CONFIG_FAIR_GROUP_SCHED */
> >
> > #ifdef CONFIG_RT_GROUP_SCHED
> > @@ -8810,6 +9025,18 @@ static struct cftype cpu_files[] = {
> > .write_u64 = cpu_shares_write_u64,
> > },
> > #endif
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + {
> > + .name = "cfs_quota_us",
> > + .read_s64 = cpu_cfs_quota_read_s64,
> > + .write_s64 = cpu_cfs_quota_write_s64,
> > + },
> > + {
> > + .name = "cfs_period_us",
> > + .read_u64 = cpu_cfs_period_read_u64,
> > + .write_u64 = cpu_cfs_period_write_u64,
> > + },
> > +#endif
> > #ifdef CONFIG_RT_GROUP_SCHED
> > {
> > .name = "rt_runtime_us",
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -360,6 +360,9 @@ static void __enqueue_entity(struct cfs_
> >
> > rb_link_node(&se->run_node, parent, link);
> > rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > + start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
> > +#endif
> > }
> >
> > static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > @@ -1126,6 +1129,13 @@ static void yield_task_fair(struct rq *r
> > se->vruntime = rightmost->vruntime + 1;
> > }
> >
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> > +{
> > + return 1;
>
> Hmmm.. what is 1? true?
That was just a placeholder. This function gets its full functionality in a
later patch.
Thanks for your review.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-13 14:41 ` Nikhil Rao
@ 2010-10-14 5:39 ` Balbir Singh
0 siblings, 0 replies; 51+ messages in thread
From: Balbir Singh @ 2010-10-14 5:39 UTC (permalink / raw)
To: Nikhil Rao
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Paul Turner
* Nikhil Rao <ncrao@google.com> [2010-10-13 07:41:46]:
> On Wed, Oct 13, 2010 at 6:59 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > * Nikhil Rao <ncrao@google.com> [2010-10-13 06:46:18]:
> >
> >> > Quick question for cfs_b->quota == RUNTIME_INF, won't cfs_b->runtime
> >> > be always > 0?
> >>
> >> Hi Balbir,
> >>
> >> cfs_b->runtime can be 0 if the task group exhausts its quota.
> >> cfs_b->runtime is a counter that is periodically refreshed to
> >> cfs_b->quota, and is decremented every time a cfs_rq requests a slice.
> >>
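The refill-and-decrement behaviour Nikhil describes can be sketched as a toy model (struct and function names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the global bandwidth pool: runtime is refilled to quota
 * once per period and decremented each time a per-cpu cfs_rq pulls a
 * slice.  It therefore reaches zero when the group exhausts its quota. */
struct pool {
	uint64_t quota;
	uint64_t runtime;
};

static void period_refresh(struct pool *p)
{
	p->runtime = p->quota;
}

static uint64_t pull_slice(struct pool *p, uint64_t slice)
{
	/* grant at most what remains in the pool */
	uint64_t grant = p->runtime < slice ? p->runtime : slice;

	p->runtime -= grant;
	return grant;
}
```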
> >
> > Thanks, Nikhil
> >
> > I saw several checks for quota == RUNTIME_INF and saw updates not
> > happening. Do we track usage for cgroups with infinite quota? IOW, is
> > runtime updated? If so, why?
> >
>
> We don't track usage or update runtime for cgroups with infinite
> quota. Most of these RUNTIME_INF checks are usually at the top of the
> functions and return immediately. The checks in the middle of
> functions are mostly to prevent races when cfs_b->quota is updated.
>
> And ... I think I might have misunderstood your original question. Let
> me try answering that again. We check for quota == RUNTIME_INF (in
> addition to cfs_b->runtime > 0) because it is possible that the global
> pool is updated to RUNTIME_INF between the quota check in
> account_cfs_rq_quota() and when the actual quota is distributed in
> tg_request_cfs_quota(). I hope that answers your original question.
>
Yes, it does, thanks!
--
Three Cheers,
Balbir
* Re: [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
2010-10-13 13:00 ` Balbir Singh
2010-10-14 5:14 ` Bharata B Rao
@ 2010-10-14 7:52 ` Peter Zijlstra
2010-10-14 12:38 ` Balbir Singh
1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 7:52 UTC (permalink / raw)
To: balbir
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Wed, 2010-10-13 at 18:30 +0530, Balbir Singh wrote:
> > -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> > +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> > {
> > - ktime_t now;
> > + unsigned long delta;
> > + ktime_t soft, hard, now;
> > +
> > + for (;;) {
> > + if (hrtimer_active(period_timer))
> > + break;
> >
> > + now = hrtimer_cb_get_time(period_timer);
> > + hrtimer_forward(period_timer, now, period);
> > +
> > + soft = hrtimer_get_softexpires(period_timer);
> > + hard = hrtimer_get_expires(period_timer);
> > + delta = ktime_to_ns(ktime_sub(hard, soft));
> > + __hrtimer_start_range_ns(period_timer, soft, delta,
> > + HRTIMER_MODE_ABS_PINNED, 0);
>
> This code can be replaced with
>
> hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED) if we
> don't care about wakeup_softirq, is there a reason we prefer to keep
> wakeup as 0?
You cannot do wakeups while holding the rq->lock, can you? :-)
> > + }
> > +}
> > +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > +{
> > + struct cfs_bandwidth *cfs_b =
> > + container_of(timer, struct cfs_bandwidth, period_timer);
> > + ktime_t now;
> > + int overrun;
> > + int idle = 0;
> > +
> > + for (;;) {
> > + now = hrtimer_cb_get_time(timer);
> > + overrun = hrtimer_forward(timer, now, cfs_b->period);
> > +
> > + if (!overrun)
> > + break;
>
> What is the significance of overrun? overrun is set when delta >
> interval. The logic seems to be that hrtimer is forwarded in steps of
> cfs_b->period till we reach the desired time.
>
Overrun is the number of periods missed. The goal is to increment the
quota for each period, if the timer is late 3 periods, we still need to
increment it 3 times, that's what overrun does.
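The contract Peter describes can be modelled with a short sketch (an illustration of hrtimer_forward()'s semantics, not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of hrtimer_forward(): push the expiry forward in
 * whole periods until it lies in the future, returning how many
 * periods were missed.  The caller can then refresh quota once per
 * missed period, as sched_cfs_period_timer() does. */
static int timer_forward(uint64_t *expires, uint64_t now, uint64_t period)
{
	int overrun = 0;

	while (*expires <= now) {
		*expires += period;
		overrun++;
	}
	return overrun;
}
```

If the timer fires three periods late, the first pass sees overrun == 3 and do_sched_cfs_period_timer() can account for all three refresh intervals at once.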
> > +
> > + idle = do_sched_cfs_period_timer(cfs_b, overrun);
> > + }
> > +
> > + return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > +}
> > +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +{
> > + if (cfs_b->quota == RUNTIME_INF)
> > + return;
> > +
> > + if (hrtimer_active(&cfs_b->period_timer))
> > + return;
>
> Why the double check, start_bandwidth_timer also checks this. Is it to
> avoid doing the check under cfs_b->lock?
Basically..
> > +
> > + raw_spin_lock(&cfs_b->lock);
> > + start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> > + raw_spin_unlock(&cfs_b->lock);
> > +}
> > +
> > +#ifdef CONFIG_CFS_BANDWIDTH
> > +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> > +{
> > + int i;
> > + static DEFINE_MUTEX(mutex);
> > +
> > + if (tg == &init_task_group)
> > + return -EINVAL;
> > +
> > + if (!period)
> > + return -EINVAL;
> > +
> > + /*
> > + * Ensure we have at least one tick of bandwidth every period. This is
> > + * to prevent reaching a state of large arrears when throttled via
> > + * entity_tick() resulting in prolonged exit starvation.
> > + */
> > + if (NS_TO_JIFFIES(quota) < 1)
> > + return -EINVAL;
>
> I hope we document this in the Documentation :)
/me went and looked up arrears in a dictionary and wonders why 'debt'
wasn't good enough.
> > +
> > + mutex_lock(&mutex);
> > + raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> > + tg->cfs_bandwidth.period = ns_to_ktime(period);
> > + tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> > + raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> > +
> > + for_each_possible_cpu(i) {
>
> Why not for_each_online_cpu()?
Probably could be cured with a hotplug handler, but then you need to
track more state iirc.
> > + struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> > + struct rq *rq = rq_of(cfs_rq);
> > +
> > + raw_spin_lock_irq(&rq->lock);
> > + init_cfs_rq_quota(cfs_rq);
> > + raw_spin_unlock_irq(&rq->lock);
> > + }
> > + mutex_unlock(&mutex);
> > +
> > + return 0;
> > +}
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-12 7:51 ` [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage Bharata B Rao
2010-10-13 13:30 ` Balbir Singh
@ 2010-10-14 8:57 ` Peter Zijlstra
2010-10-14 9:07 ` Paul Turner
2010-10-14 9:01 ` Peter Zijlstra
2010-10-14 9:19 ` Peter Zijlstra
3 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 8:57 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
> +static u64 tg_request_cfs_quota(struct task_group *tg)
> +{
> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> + u64 delta = 0;
> +
> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> + raw_spin_lock(&cfs_b->lock);
> + /*
> + * it's possible a bandwidth update has changed the global
> + * pool.
> + */
> + if (cfs_b->quota == RUNTIME_INF)
> + delta = sched_cfs_bandwidth_slice();
> + else {
> + delta = min(cfs_b->runtime,
> + sched_cfs_bandwidth_slice());
> + cfs_b->runtime -= delta;
> + }
> + raw_spin_unlock(&cfs_b->lock);
> + }
> + return delta;
> +}
Since you check cfs_b->quota outside of cfs_b->lock anyway, you might as
well avoid taking the lock in that case and directly return slice.
Also, you possibly evaluate sched_cfs_bandwidth_slice() twice.
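A sketch of the suggested restructure (locking elided and noted in comments; names mirror the patch but this is illustrative, not the patch as posted):

```c
#include <assert.h>
#include <stdint.h>

#define RUNTIME_INF (~0ULL)

struct pool {
	uint64_t quota;
	uint64_t runtime;
};

/* Handle RUNTIME_INF before taking cfs_b->lock, and evaluate the
 * slice size only once. */
static uint64_t request_quota(struct pool *cfs_b, uint64_t slice)
{
	uint64_t delta = 0;

	if (cfs_b->quota == RUNTIME_INF)
		return slice;		/* no global state to update */

	/* raw_spin_lock(&cfs_b->lock) in the real code */
	if (cfs_b->runtime > 0) {
		delta = cfs_b->runtime < slice ? cfs_b->runtime : slice;
		cfs_b->runtime -= delta;
	}
	/* raw_spin_unlock(&cfs_b->lock) */
	return delta;
}
```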
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-12 7:51 ` [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage Bharata B Rao
2010-10-13 13:30 ` Balbir Singh
2010-10-14 8:57 ` Peter Zijlstra
@ 2010-10-14 9:01 ` Peter Zijlstra
2010-10-14 9:14 ` Paul Turner
2010-10-14 9:19 ` Peter Zijlstra
3 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 9:01 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> + unsigned long delta_exec)
> +{
> + if (cfs_rq->quota_assigned == RUNTIME_INF)
> + return;
> +
> + cfs_rq->quota_used += delta_exec;
> +
> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> + return;
> +
> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> +}
That looks iffy, quota_assigned is only ever incremented and can wrap.
Why not subtract delta_exec and replenish when <0? That keeps the
numbers small.
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 8:57 ` Peter Zijlstra
@ 2010-10-14 9:07 ` Paul Turner
2010-10-14 9:13 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Paul Turner @ 2010-10-14 9:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, Oct 14, 2010 at 1:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
>> +static u64 tg_request_cfs_quota(struct task_group *tg)
>> +{
>> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> + u64 delta = 0;
>> +
>> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
>> + raw_spin_lock(&cfs_b->lock);
>> + /*
>> + * it's possible a bandwidth update has changed the global
>> + * pool.
>> + */
>> + if (cfs_b->quota == RUNTIME_INF)
>> + delta = sched_cfs_bandwidth_slice();
>> + else {
>> + delta = min(cfs_b->runtime,
>> + sched_cfs_bandwidth_slice());
>> + cfs_b->runtime -= delta;
>> + }
>> + raw_spin_unlock(&cfs_b->lock);
>> + }
>> + return delta;
>> +}
>
> Since you check cfs_b->quota outside of cfs_b->lock anyway, you might as
> well avoid taking the lock in that case and directly return slice.
>
Do you mean in the RUNTIME_INF case?
I suppose we could just avoid taking the lock there (theoretically it
would be possible to slightly over-commit on a RUNTIME_INF =>
constrained transition, but this should be the supremely uncommon case
and hey, it's transitioning from unlimited bandwidth anyway).
> Also, you possibly evaluate sched_cfs_bandwidth_slice() twice.
>
Separate branch no? Should only be one evaluation. (fwiw the min
macro looks to cache the evaluation)
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-13 6:34 ` KAMEZAWA Hiroyuki
2010-10-13 6:44 ` Paul Turner
@ 2010-10-14 9:12 ` Peter Zijlstra
2010-10-14 9:50 ` KAMEZAWA Hiroyuki
2010-10-14 9:58 ` Paul Turner
1 sibling, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 9:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Wed, 2010-10-13 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
> cpu.shares and bandwidth control can't be used simultaneously, or...
> is this fair? I'm not familiar with the scheduler, but this allows boosting this tg.
> Could you add a brief documentation of the spec/feature in the next post?
Like explained, shares control the proportional distribution of time
between groups, bandwidth puts a limit on how much time a group can
take. It can cause a group to receive less than its fair share, but
never more.
There is, however, a problem with all this, and that is that all this
explicit idling of tasks can lead to a form of priority inversion.
Regular preemptive scheduling already suffers from this, but explicitly
idling tasks exacerbates the situation.
You basically get to add the longest induced idle time to all your lock
hold times.
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 9:07 ` Paul Turner
@ 2010-10-14 9:13 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 9:13 UTC (permalink / raw)
To: Paul Turner
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, 2010-10-14 at 02:07 -0700, Paul Turner wrote:
> Separate branch no?
Right, /me goes get more morning juice..
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 9:01 ` Peter Zijlstra
@ 2010-10-14 9:14 ` Paul Turner
2010-10-14 9:27 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Paul Turner @ 2010-10-14 9:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, Oct 14, 2010 at 2:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> + unsigned long delta_exec)
>> +{
>> + if (cfs_rq->quota_assigned == RUNTIME_INF)
>> + return;
>> +
>> + cfs_rq->quota_used += delta_exec;
>> +
>> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> + return;
>> +
>> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> +}
>
> That looks iffy, quota_assigned is only ever incremented and can wrap.
This can't advance at a rate faster than ~vruntime and we can't handle
wrapping there anyway (fortunately it would take something like 35k
years?)
> Why not subtract delta_exec and replenish when <0? That keeps the
> numbers small.
>
Accounting in the opposite direction allows us to catch-up in
subsequent periods when a task exceeds its bandwidth across an
interval where we are not able to immediately throttle it (e.g. costly
syscall without CONFIG_PREEMPT). Since we'll continue to accrue the
execution time in this case it will be effectively pre-charged against
the next slice received.
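The pre-charging behaviour described above can be illustrated with a small model (a sketch mirroring the patch's fields, not the actual kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define RUNTIME_INF (~0ULL)

/* Both counters only ever grow: time that ran past the local quota
 * (e.g. when the task could not be throttled immediately) stays in
 * quota_used, so the next slice granted is consumed by the backlog
 * first: the overrun is effectively pre-charged. */
struct rq_quota {
	uint64_t quota_used;
	uint64_t quota_assigned;
};

static void account(struct rq_quota *q, uint64_t delta_exec, uint64_t slice)
{
	if (q->quota_assigned == RUNTIME_INF)
		return;

	q->quota_used += delta_exec;
	if (q->quota_used < q->quota_assigned)
		return;

	q->quota_assigned += slice;	/* stands in for tg_request_cfs_quota() */
}
```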
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-12 7:51 ` [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage Bharata B Rao
` (2 preceding siblings ...)
2010-10-14 9:01 ` Peter Zijlstra
@ 2010-10-14 9:19 ` Peter Zijlstra
2010-10-14 9:27 ` Paul Turner
3 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 9:19 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
> +#ifdef CONFIG_CFS_BANDWIDTH
> + {
> + .procname = "sched_cfs_bandwidth_slice_us",
> + .data = &sysctl_sched_cfs_bandwidth_slice,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &one,
> + },
> +#endif
So this is basically your scalability knob: the larger this value, the
less frequently we have to access global state, but the less parallelism
is possible, since fewer CPUs deplete the total quota, leaving nothing
for the others.
I guess one could go try and play load-balancer games to try and
mitigate this by pulling this group's tasks to the CPU(s) that have more
bandwidth for that group, but balancing that against the regular
load-balancer goal of well balancing load, will undoubtedly be
'interesting'...
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 9:19 ` Peter Zijlstra
@ 2010-10-14 9:27 ` Paul Turner
2010-10-14 9:40 ` Bharata B Rao
0 siblings, 1 reply; 51+ messages in thread
From: Paul Turner @ 2010-10-14 9:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, Oct 14, 2010 at 2:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> + {
>> + .procname = "sched_cfs_bandwidth_slice_us",
>> + .data = &sysctl_sched_cfs_bandwidth_slice,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = &one,
>> + },
>> +#endif
>
> So this is basically your scalability knob: the larger this value, the
> less frequently we have to access global state, but the less parallelism
> is possible, since fewer CPUs deplete the total quota, leaving nothing
> for the others.
>
Exactly
> I guess one could go try and play load-balancer games to try and
> mitigate this by pulling this group's tasks to the CPU(s) that have more
> bandwidth for that group, but balancing that against the regular
> load-balancer goal of well balancing load, will undoubtedly be
> 'interesting'...
>
I considered this approach as an alternative previously, but I don't
think it can be enacted effectively:
Since quota will likely expire in a staggered fashion you're going to
get a funnel-herd effect as everything is crowded onto the cpus with
remaining quota.
It's much more easily avoided by keeping the slice small enough
(relative to the bandwidth period) that we're not potentially
stranding a significant percentage of our quota. The potential for
abuse could be eliminated/reduced here by making the slice size a
constant ratio relative to the period length. This would also make
possible parallelism more deterministic.
I also think versioning the quota so that it can be potentially
returned and redistributed on sleep is more effective/efficient in
avoiding stranded quota.
>
>
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 9:14 ` Paul Turner
@ 2010-10-14 9:27 ` Peter Zijlstra
2010-10-14 9:53 ` Paul Turner
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 9:27 UTC (permalink / raw)
To: Paul Turner
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, 2010-10-14 at 02:14 -0700, Paul Turner wrote:
> On Thu, Oct 14, 2010 at 2:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
> >> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >> + unsigned long delta_exec)
> >> +{
> >> + if (cfs_rq->quota_assigned == RUNTIME_INF)
> >> + return;
> >> +
> >> + cfs_rq->quota_used += delta_exec;
> >> +
> >> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> >> + return;
> >> +
> >> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> >> +}
> >
> > That looks iffy, quota_assigned is only ever incremented and can wrap.
>
> This can't advance at a rate faster than ~vruntime and we can't handle
> wrapping there anyway (fortunately it would take something like 35k
> years?)
You can't go faster than wall-time, vruntime can actually go a lot
faster and can deal with wrapping.
> > Why not subtract delta_exec and replenish when <0? That keeps the
> > numbers small.
> >
>
> Accounting in the opposite direction allows us to catch-up in
> subsequent periods when a task exceeds its bandwidth across an
> interval where we are not able to immediately throttle it (e.g. costly
> syscall without CONFIG_PREEMPT). Since we'll continue to accrue the
> execution time in this case it will be effectively pre-charged against
> the next slice received.
Humm, how so, that's a simply matter of the quota going negative, right?
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 9:27 ` Paul Turner
@ 2010-10-14 9:40 ` Bharata B Rao
0 siblings, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-10-14 9:40 UTC (permalink / raw)
To: Paul Turner
Cc: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, Oct 14, 2010 at 02:27:02AM -0700, Paul Turner wrote:
> On Thu, Oct 14, 2010 at 2:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> + {
> >> + .procname = "sched_cfs_bandwidth_slice_us",
> >> + .data = &sysctl_sched_cfs_bandwidth_slice,
> >> + .maxlen = sizeof(unsigned int),
> >> + .mode = 0644,
> >> + .proc_handler = proc_dointvec_minmax,
> >> + .extra1 = &one,
> >> + },
> >> +#endif
> >
> > So this is basically your scalability knob: the larger this value, the
> > less frequently we have to access global state, but the less parallelism
> > is possible, since fewer CPUs deplete the total quota, leaving nothing
> > for the others.
> >
>
> Exactly
>
> > I guess one could go try and play load-balancer games to try and
> > mitigate this by pulling this group's tasks to the CPU(s) that have more
> > bandwidth for that group, but balancing that against the regular
> > load-balancer goal of well balancing load, will undoubtedly be
> > 'interesting'...
> >
>
> I considered this approach as an alternative previously, but I don't
> think it can be enacted effectively:
>
> Since quota will likely expire in a staggered fashion you're going to
> get a funnel-herd effect as everything is crowded onto the cpus with
> remaining quota.
>
> It's much more easily avoided by keeping the slice small enough
> (relative to the bandwidth period) that we're not potentially
> stranding a significant percentage of our quota. The potential for
> abuse could be eliminated/reduced here by making the slice size a
> constant ratio relative to the period length. This would also make
> possible parallelism more deterministic.
You can see from the numbers I posted in 0/7 how the current default of a
10ms slice can lead to a large amount of stranded quota and how that
can affect the runtime obtained by the tasks. So reducing the slice
size should help, but it will still be a problem on a large system with a
huge number of CPUs, where each CPU claims a slice and does not use it fully.
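A back-of-the-envelope way to see the effect (illustrative arithmetic only): in the worst case every CPU holds one barely-used slice when the period ends, so up to ncpus * slice of the group's quota can be stranded per period:

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case quota stranded per period: each CPU claims a slice and
 * uses almost none of it.  The stranded amount is capped by the
 * group's total quota. */
static uint64_t max_stranded_us(unsigned int ncpus, uint64_t slice_us,
				uint64_t quota_us)
{
	uint64_t stranded = (uint64_t)ncpus * slice_us;

	return stranded < quota_us ? stranded : quota_us;
}
```

With the default 10ms slice, a 64-CPU machine can strand an entire 250ms quota (64 * 10ms = 640ms > 250ms), while shrinking the slice to 1ms caps the loss at 64ms.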
Regards,
Bharata.
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 9:12 ` Peter Zijlstra
@ 2010-10-14 9:50 ` KAMEZAWA Hiroyuki
2010-10-14 9:59 ` Peter Zijlstra
2010-10-14 9:58 ` Paul Turner
1 sibling, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-14 9:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Thu, 14 Oct 2010 11:12:22 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-13 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
> > cpu.shares and bandwidth control can't be used simultaneously, or...
> > is this fair? I'm not familiar with the scheduler, but this allows boosting this tg.
> > Could you add a brief documentation of the spec/feature in the next post?
>
> Like explained, shares control the proportional distribution of time
> between groups, bandwidth puts a limit on how much time a group can
> take. It can cause a group to receive less than its fair share, but
> never more.
>
> There is, however, a problem with all this, and that is that all this
> explicit idling of tasks can lead to a form of priority inversion.
> Regular preemptive scheduling already suffers from this, but explicitly
> idling tasks exacerbates the situation.
>
> You basically get to add the longest induced idle time to all your lock
> hold times.
>
What is the user-visible difference of the problem between
1) limiting shares to be very small, and
2) using throttling?
If shares are used, is the lock-holder's priority boosted?
Thanks,
-Kame
* Re: [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage
2010-10-14 9:27 ` Peter Zijlstra
@ 2010-10-14 9:53 ` Paul Turner
0 siblings, 0 replies; 51+ messages in thread
From: Paul Turner @ 2010-10-14 9:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, Oct 14, 2010 at 2:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-10-14 at 02:14 -0700, Paul Turner wrote:
>> On Thu, Oct 14, 2010 at 2:01 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, 2010-10-12 at 13:21 +0530, Bharata B Rao wrote:
>> >> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> >> + unsigned long delta_exec)
>> >> +{
>> >> + if (cfs_rq->quota_assigned == RUNTIME_INF)
>> >> + return;
>> >> +
>> >> + cfs_rq->quota_used += delta_exec;
>> >> +
>> >> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> >> + return;
>> >> +
>> >> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> >> +}
>> >
>> > That looks iffy, quota_assigned is only ever incremented and can wrap.
>>
>> This can't advance at a rate faster than ~vruntime and we can't handle
>> wrapping there anyway (fortunately it would take something like 35k
>> years?)
>
> You can't go faster than wall-time, vruntime can actually go a lot
> faster and can deal with wrapping.
Right -- I hadn't previously noticed the normalization w/ min_vruntime
in min_vruntime().
Fortunately the wall-time constraint still holds :)
>
>> > Why not subtract delta_exec and replenish when <0? That keeps the
>> > numbers small.
>> >
>>
>> Accounting in the opposite direction allows us to catch up in
>> subsequent periods when a task exceeds its bandwidth across an
>> interval where we are not able to immediately throttle it (e.g. a costly
>> syscall without CONFIG_PREEMPT). Since we'll continue to accrue the
>> execution time in this case, it will be effectively pre-charged against
>> the next slice received.
>
> Humm, how so, that's a simple matter of the quota going negative, right?
>
Then the quota is signed and the representation of infinite quota becomes more ambiguous.
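The trade-off Paul describes can be sketched with a toy userspace model (the field names mirror the patch, but SLICE, account() and deficit() are simplified stand-ins, not the kernel code): since quota_used and quota_assigned both only grow, any overshoot remains visible as a deficit and is pre-charged against the next grant.

```c
#include <assert.h>

#define SLICE 10ULL  /* illustrative per-refill grant; stands in for tg_request_cfs_quota() */

struct cfs_rq_sim {
    unsigned long long quota_used;     /* monotonic: total runtime consumed */
    unsigned long long quota_assigned; /* monotonic: total runtime granted */
};

/* mirrors the shape of account_cfs_rq_quota(): charge the runtime, and
 * request one more slice once the local grant is exhausted */
static void account(struct cfs_rq_sim *rq, unsigned long long delta_exec)
{
    rq->quota_used += delta_exec;

    if (rq->quota_used < rq->quota_assigned)
        return;

    rq->quota_assigned += SLICE;
}

/* runtime still owed after overshooting (0 when under quota) */
static unsigned long long deficit(const struct cfs_rq_sim *rq)
{
    return rq->quota_used > rq->quota_assigned ?
           rq->quota_used - rq->quota_assigned : 0;
}
```

With quota_assigned = 10, a single unthrottleable burst of delta_exec = 25 leaves quota_used = 25 against quota_assigned = 20: the 5 units of overshoot are carried forward and consumed out of the next slice, which is the catch-up behaviour a signed, decrementing counter would instead have to encode as a negative value alongside the RUNTIME_INF sentinel.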
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 9:12 ` Peter Zijlstra
2010-10-14 9:50 ` KAMEZAWA Hiroyuki
@ 2010-10-14 9:58 ` Paul Turner
1 sibling, 0 replies; 51+ messages in thread
From: Paul Turner @ 2010-10-14 9:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: KAMEZAWA Hiroyuki, bharata, linux-kernel, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Thu, Oct 14, 2010 at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-10-13 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
>> cpu.share and bandwidth control can't be used simultaneously or...
>> is this fair ? I'm not familiar with the scheduler, but this allows boosting this tg.
>> Could you add a brief documentation of the spec/feature in the next post ?
>
> As explained, shares control the proportional distribution of time
> between groups; bandwidth puts a limit on how much time a group can
> take. It can cause a group to receive less than its fair share, but
> never more.
>
> There is, however, a problem with all this, and that is that all this
> explicit idling of tasks can lead to a form of priority inversion.
> Regular preemptive scheduling already suffers from this, but explicitly
> idling tasks exacerbates the situation.
>
> You basically get to add the longest induced idle time to all your lock
> hold times.
>
This is a concern (especially for exit starvation, since the task
needs to be scheduled for reaping); the way throttling is enacted
should help to mitigate the risk of starvation here. When a group
exceeds its bandwidth we don't actively force it off the cpu, we only
set TIF_RESCHED; this defers the "enforcement" of throttling until we
drop back down through put_prev_task().
This should mean we won't extend a semaphore wait time unless someone
explicitly issues something like a cond_resched() within it [at which
point they are choosing a potentially arbitrary latency delay anyway,
although this does expand it relative to the target
sched_latency_period].
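A minimal userspace model of this deferral (illustrative names only; the real mechanics involve TIF_RESCHED and put_prev_task() in the scheduler core): exceeding quota merely sets a flag, and the actual dequeue happens only when the task can be switched out, which cannot occur while preemption is disabled, e.g. while a spinlock is held.

```c
#include <assert.h>

struct task_sim {
    int need_resched;   /* stands in for TIF_RESCHED */
    int preempt_count;  /* >0 while e.g. a spinlock is held */
    int throttled;
};

/* enforcement path: over quota -> just flag the task, never dequeue here */
static void account_over_quota(struct task_sim *t)
{
    t->need_resched = 1;
}

/* a context switch is only possible with preemption enabled; the throttle
 * is enacted from the switch-out (put_prev_task()-like) step */
static void try_schedule(struct task_sim *t)
{
    if (t->preempt_count > 0)
        return;               /* cannot switch out: e.g. spinlock held */

    if (t->need_resched) {
        t->throttled = 1;     /* the cfs_rq is dequeued here */
        t->need_resched = 0;
    }
}
```

Running this through the lock-holder scenario: while preempt_count > 0 the task keeps running despite being over quota; only after the lock is dropped (and a reschedule point is reached) does the throttle take effect.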
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 9:50 ` KAMEZAWA Hiroyuki
@ 2010-10-14 9:59 ` Peter Zijlstra
2010-10-14 10:08 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 9:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Thu, 2010-10-14 at 18:50 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Oct 2010 11:12:22 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, 2010-10-13 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
> > > cpu.share and bandwidth control can't be used simultaneously or...
> > > is this fair ? I'm not familiar with the scheduler, but this allows boosting this tg.
> > > Could you add a brief documentation of the spec/feature in the next post ?
> >
> > As explained, shares control the proportional distribution of time
> > between groups; bandwidth puts a limit on how much time a group can
> > take. It can cause a group to receive less than its fair share, but
> > never more.
> >
> > There is, however, a problem with all this, and that is that all this
> > explicit idling of tasks can lead to a form of priority inversion.
> > Regular preemptive scheduling already suffers from this, but explicitly
> > idling tasks exacerbates the situation.
> >
> > You basically get to add the longest induced idle time to all your lock
> > hold times.
> >
>
> What is the user-visible difference of the problem between
> 1) limiting the share to be very small, and
> 2) using throttling?
>
> If shares are used, is the lock-holder's priority boosted ?
No, both lead to the same problem, it's just that this adds another
dimension to it.. and I'm fairly sure people won't realise this until it
bites them in the ass.
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 9:59 ` Peter Zijlstra
@ 2010-10-14 10:08 ` KAMEZAWA Hiroyuki
2010-10-14 10:25 ` Paul Turner
2010-10-14 10:37 ` Peter Zijlstra
0 siblings, 2 replies; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-14 10:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Thu, 14 Oct 2010 11:59:55 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-10-14 at 18:50 +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Oct 2010 11:12:22 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Wed, 2010-10-13 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
> > > > cpu.share and bandwidth control can't be used simultaneously or...
> > > > is this fair ? I'm not familiar with the scheduler, but this allows boosting this tg.
> > > > Could you add a brief documentation of the spec/feature in the next post ?
> > >
> > > As explained, shares control the proportional distribution of time
> > > between groups; bandwidth puts a limit on how much time a group can
> > > take. It can cause a group to receive less than its fair share, but
> > > never more.
> > >
> > > There is, however, a problem with all this, and that is that all this
> > > explicit idling of tasks can lead to a form of priority inversion.
> > > Regular preemptive scheduling already suffers from this, but explicitly
> > > idling tasks exacerbates the situation.
> > >
> > > You basically get to add the longest induced idle time to all your lock
> > > hold times.
> > >
> >
> > What is the user-visible difference of the problem between
> > 1) limiting the share to be very small, and
> > 2) using throttling?
> >
> > If shares are used, is the lock-holder's priority boosted ?
>
> No, both lead to the same problem, it's just that this adds another
> dimension to it.. and I'm fairly sure people won't realise this until it
> bites them in the ass.
>
Hmm, then, it's an existing problem but this adds a new pitfall.
What's your recommendation to make progress on this work ?
I think the 1st step will be..
- explain the problem of priority inversion in the cgroup+cfs documentation with
!!CAUTION!!
I'm sorry, I'm not sure whether there have been attempts at fixing priority
inversion in Linux scheduler development.
To explain my motivation: one user of this feature among my customers is a virtual
machine rental service. So, some functionality like
"when a vcpu holds a spinlock in the kernel, please don't sleep.." would be nice.
Is there a patch already ?
Thanks,
-Kame
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 10:08 ` KAMEZAWA Hiroyuki
@ 2010-10-14 10:25 ` Paul Turner
2010-10-14 10:41 ` Peter Zijlstra
2010-10-14 10:37 ` Peter Zijlstra
1 sibling, 1 reply; 51+ messages in thread
From: Paul Turner @ 2010-10-14 10:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Peter Zijlstra, bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, Oct 14, 2010 at 3:08 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 14 Oct 2010 11:59:55 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, 2010-10-14 at 18:50 +0900, KAMEZAWA Hiroyuki wrote:
>> > On Thu, 14 Oct 2010 11:12:22 +0200
>> > Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > > On Wed, 2010-10-13 at 15:34 +0900, KAMEZAWA Hiroyuki wrote:
>> > > > cpu.share and bandwidth control can't be used simultaneously or...
>> > > > is this fair ? I'm not familiar with the scheduler, but this allows boosting this tg.
>> > > > Could you add a brief documentation of the spec/feature in the next post ?
>> > >
>> > > As explained, shares control the proportional distribution of time
>> > > between groups; bandwidth puts a limit on how much time a group can
>> > > take. It can cause a group to receive less than its fair share, but
>> > > never more.
>> > >
>> > > There is, however, a problem with all this, and that is that all this
>> > > explicit idling of tasks can lead to a form of priority inversion.
>> > > Regular preemptive scheduling already suffers from this, but explicitly
>> > > idling tasks exacerbates the situation.
>> > >
>> > > You basically get to add the longest induced idle time to all your lock
>> > > hold times.
>> > >
>> >
>> > What is the user-visible difference of the problem between
>> > 1) limiting the share to be very small, and
>> > 2) using throttling?
>> >
>> > If shares are used, is the lock-holder's priority boosted ?
>>
>> No, both lead to the same problem, it's just that this adds another
>> dimension to it.. and I'm fairly sure people won't realise this until it
>> bites them in the ass.
>>
> Hmm, then, it's an existing problem but this adds a new pitfall.
>
> What's your recommendation to make progress on this work ?
>
> I think the 1st step will be..
> - explain the problem of priority inversion in the cgroup+cfs documentation with
> !!CAUTION!!
>
> I'm sorry, I'm not sure whether there have been attempts at fixing priority
> inversion in Linux scheduler development.
>
> To explain my motivation: one user of this feature among my customers is a virtual
> machine rental service. So, some functionality like
> "when a vcpu holds a spinlock in the kernel, please don't sleep.." would be nice.
> Is there a patch already ?
>
Per above:
When a group exceeds its bandwidth we don't actively force it off the
cpu, we only set TIF_RESCHED; we won't process the throttling until we
drop back down to userspace and handle the flag.
This means we'll never throttle a task while it holds a spinlock.
We'll also only throttle a task holding a sleepable lock (one that doesn't
disable preemption) when it voluntarily reschedules without releasing the
lock, at which point it has chosen to open itself to an arbitrary
latency period anyway.
**
The case of a guest cpu holding spinlocks is part of a much larger
rabbit hole that is spinlock enlightenment which should occur via
pvops/etc interaction. The sane thing for this to do would be to (at
least) preempt_disable() at which point the vcpu will be protected
from throttling.
This seems somewhat orthogonal to this patchset however.
**
Agreed that priority inversion across threads and across vcpus are rather
sickly beasts; especially given how bare the curtains are on the first
case (which the second can only really build upon).
>
> Thanks,
> -Kame
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 10:08 ` KAMEZAWA Hiroyuki
2010-10-14 10:25 ` Paul Turner
@ 2010-10-14 10:37 ` Peter Zijlstra
1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 10:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Thu, 2010-10-14 at 19:08 +0900, KAMEZAWA Hiroyuki wrote:
> > No, both lead to the same problem, it's just that this adds another
> > dimension to it.. and I'm fairly sure people won't realise this until it
> > bites them in the ass.
> >
> Hmm, then, it's an existing problem but this adds a new pitfall.
>
> What's your recommendation to make progress on this work ?
>
> I think the 1st step will be..
> - explain the problem of priority inversion in the cgroup+cfs documentation with
> !!CAUTION!!
>
> I'm sorry, I'm not sure whether there have been attempts at fixing priority
> inversion in Linux scheduler development.
There's kernel/rtmutex.c, which is a mutex implementation with classic PI
that works for SCHED_FIFO/RR.
Thomas and I have been planning to completely rewrite PI support in
the kernel for some time but haven't gotten around to actually doing it.
The goal is to implement a variant of the Proxy Execution Protocol
(close to what gets described here in:
http://www.artist-embedded.org/docs/Events/2010/OSPERT/OSPERT2010-Proceedings.pdf
page 16, Timeslice donation in component based systems).
One of the advantages of PEP is that it works across all scheduling
policies, including proportional fair share schedulers.
It also deals correctly with all PI issues for deadline based schedulers
(correctly handles deadline as well as bandwidth inheritance).
> To explain my motivation: one user of this feature among my customers is a virtual
> machine rental service. So, some functionality like
> "when a vcpu holds a spinlock in the kernel, please don't sleep.." would be nice.
> Is there a patch already ?
Well, clearly you cannot sleep while holding a spinlock.. sleeping while
holding a mutex/semaphore is allowed however, but even when you'd have
full PI over all kernel locks (as -rt has), that's not enough; you have
exactly the same problem with all userspace locks.
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 10:25 ` Paul Turner
@ 2010-10-14 10:41 ` Peter Zijlstra
2010-10-14 23:30 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 10:41 UTC (permalink / raw)
To: Paul Turner
Cc: KAMEZAWA Hiroyuki, bharata, linux-kernel, Dhaval Giani,
Balbir Singh, Vaidyanathan Srinivasan, Srivatsa Vaddagiri,
Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
Avi Kivity, Chris Friesen, Paul Menage, Mike Waychison,
Nikhil Rao
On Thu, 2010-10-14 at 03:25 -0700, Paul Turner wrote:
> > "when a vcpu holds a spinlock in the kernel, please don't sleep.." would be nice.
> > Is there a patch already ?
Oh you're talking about the guest spinlock crap? I don't particularly
care about that.. that's a virt problem you see ;-)
* Re: [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
2010-10-14 7:52 ` Peter Zijlstra
@ 2010-10-14 12:38 ` Balbir Singh
2010-10-14 13:24 ` Peter Zijlstra
2010-12-06 9:02 ` Bharata B Rao
0 siblings, 2 replies; 51+ messages in thread
From: Balbir Singh @ 2010-10-14 12:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
* Peter Zijlstra <peterz@infradead.org> [2010-10-14 09:52:17]:
> On Wed, 2010-10-13 at 18:30 +0530, Balbir Singh wrote:
>
> > > -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> > > +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
> > > {
> > > - ktime_t now;
> > > + unsigned long delta;
> > > + ktime_t soft, hard, now;
> > > +
> > > + for (;;) {
> > > + if (hrtimer_active(period_timer))
> > > + break;
> > >
> > > + now = hrtimer_cb_get_time(period_timer);
> > > + hrtimer_forward(period_timer, now, period);
> > > +
> > > + soft = hrtimer_get_softexpires(period_timer);
> > > + hard = hrtimer_get_expires(period_timer);
> > > + delta = ktime_to_ns(ktime_sub(hard, soft));
> > > + __hrtimer_start_range_ns(period_timer, soft, delta,
> > > + HRTIMER_MODE_ABS_PINNED, 0);
> >
> > This code can be replaced with
> >
> > hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED) if we
> > don't care about wakeup_softirq, is there a reason we prefer to keep
> > wakeup as 0?
>
> You cannot do wakeups while holding the rq->lock, can you? :-)
>
:-) Yes, given that we wake up the softirq only for the current CPU. There
is scope for some code reuse anyway; I'll see if I can send a patch.
> > > + }
> > > +}
>
>
> > > +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > > +{
> > > + struct cfs_bandwidth *cfs_b =
> > > + container_of(timer, struct cfs_bandwidth, period_timer);
> > > + ktime_t now;
> > > + int overrun;
> > > + int idle = 0;
> > > +
> > > + for (;;) {
> > > + now = hrtimer_cb_get_time(timer);
> > > + overrun = hrtimer_forward(timer, now, cfs_b->period);
> > > +
> > > + if (!overrun)
> > > + break;
> >
> > What is the significance of overrun? overrun is set when delta >
> > interval. The logic seems to be that hrtimer is forwarded in steps of
> > cfs_b->period till we reach the desired time.
> >
>
> Overrun is the number of periods missed. The goal is to increment the
> quota for each period: if the timer is late by 3 periods, we still need to
> increment it 3 times; that's what overrun does.
>
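To illustrate the overrun semantics described above, a simplified userspace stand-in for hrtimer_forward() (the real function computes the count without looping): the timer is advanced in whole periods until its expiry lies in the future again, and each skipped period contributes one quota refresh.

```c
#include <assert.h>

/* advance *expires past 'now' in steps of 'period'; return the number of
 * periods skipped (the "overrun") */
static unsigned long timer_forward(unsigned long long *expires,
                                   unsigned long long now,
                                   unsigned long long period)
{
    unsigned long overrun = 0;

    while (*expires <= now) {
        *expires += period;
        overrun++;
    }
    return overrun;
}
```

If the timer was due at t=100 with a period of 100 but only fires at t=350, the overrun is 3, so the period handler gets to perform three missed quota refreshes in one go before the timer is re-armed for t=400.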
> > > +
> > > + idle = do_sched_cfs_period_timer(cfs_b, overrun);
> > > + }
> > > +
> > > + return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> > > +}
>
> > > +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > > +{
> > > + if (cfs_b->quota == RUNTIME_INF)
> > > + return;
> > > +
> > > + if (hrtimer_active(&cfs_b->period_timer))
> > > + return;
> >
> > Why the double check, start_bandwidth_timer also checks this. Is it to
> > avoid doing the check under cfs_b->lock?
>
> Basically..
>
> > > +
> > > + raw_spin_lock(&cfs_b->lock);
> > > + start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> > > + raw_spin_unlock(&cfs_b->lock);
> > > +}
> > > +
>
> > > +#ifdef CONFIG_CFS_BANDWIDTH
> > > +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> > > +{
> > > + int i;
> > > + static DEFINE_MUTEX(mutex);
> > > +
> > > + if (tg == &init_task_group)
> > > + return -EINVAL;
> > > +
> > > + if (!period)
> > > + return -EINVAL;
> > > +
> > > + /*
> > > + * Ensure we have at least one tick of bandwidth every period. This is
> > > + * to prevent reaching a state of large arrears when throttled via
> > > + * entity_tick() resulting in prolonged exit starvation.
> > > + */
> > > + if (NS_TO_JIFFIES(quota) < 1)
> > > + return -EINVAL;
> >
> > I hope we document this in the Documentation :)
>
> /me went and looked up arrears in a dictionary and wonders why 'debt'
> wasn't good enough.
>
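Concretely, the minimum-quota check quoted above can be modelled in userspace as follows (assuming HZ=1000; the macro definitions are simplified stand-ins for the kernel's, not the real implementations): any quota amounting to less than one scheduler tick per period is rejected.

```c
#include <assert.h>

#define HZ 1000
#define NSEC_PER_SEC 1000000000ULL
/* simplified stand-in for the kernel macro: nanoseconds -> jiffies */
#define NS_TO_JIFFIES(ns) ((ns) / (NSEC_PER_SEC / HZ))

/* mirrors the validation at the top of tg_set_cfs_bandwidth() */
static int cfs_bandwidth_valid(unsigned long long period_ns,
                               unsigned long long quota_ns)
{
    if (!period_ns)
        return 0;

    /* with HZ=1000 this rejects any quota below 1ms */
    if (NS_TO_JIFFIES(quota_ns) < 1)
        return 0;

    return 1;
}
```

So with a 100ms period, a 1ms quota is the smallest accepted value; anything smaller would risk the "large arrears" (debt) state described in the comment.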
> > > +
> > > + mutex_lock(&mutex);
> > > + raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> > > + tg->cfs_bandwidth.period = ns_to_ktime(period);
> > > + tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> > > + raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> > > +
> > > + for_each_possible_cpu(i) {
> >
> > Why not for_each_online_cpu()?
>
> Probably could be cured with a hotplug handler, but then you need to
> track more state iirc.
>
What more state? If a CPU is offline, we never get to it, do we? I
think we need to do just an init and destroy - no?
> > > + struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> > > + struct rq *rq = rq_of(cfs_rq);
> > > +
> > > + raw_spin_lock_irq(&rq->lock);
> > > + init_cfs_rq_quota(cfs_rq);
> > > + raw_spin_unlock_irq(&rq->lock);
> > > + }
> > > + mutex_unlock(&mutex);
> > > +
> > > + return 0;
> > > +}
>
--
Three Cheers,
Balbir
* Re: [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
2010-10-14 12:38 ` Balbir Singh
@ 2010-10-14 13:24 ` Peter Zijlstra
2010-12-06 9:02 ` Bharata B Rao
1 sibling, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-10-14 13:24 UTC (permalink / raw)
To: balbir
Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao, Thomas Gleixner
On Thu, 2010-10-14 at 18:08 +0530, Balbir Singh wrote:
> > >
> > > hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED) if we
> > > don't care about wakeup_softirq, is there a reason we prefer to keep
> > > wakeup as 0?
> >
> > You cannot do wakeups while holding the rq->lock, can you? :-)
> >
>
> :-) Yes, given that we wakeup softirq only for the current CPU. There
> is scope for some code reuse anyway, I'll see if I can send a patch.
Well, both Thomas and I would like hrtimer_start() and friends to return
-ETIME in this case. But that means changing all callers to cope with
that, and sweeping the tree to make that happen is a lot of work.
* Re: [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota
2010-10-14 10:41 ` Peter Zijlstra
@ 2010-10-14 23:30 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-14 23:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul Turner, bharata, linux-kernel, Dhaval Giani, Balbir Singh,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Nikhil Rao
On Thu, 14 Oct 2010 12:41:44 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-10-14 at 03:25 -0700, Paul Turner wrote:
> > > "when a vcpu holds a spinlock in the kernel, please don't sleep.." would be nice.
> > > Is there a patch already ?
>
> Oh you're talking about the guest spinlock crap?
yes ;)
> I don't particularly care about that.. that's a virt problem you see ;-)
Hmm. AFAIK, all of qemu's threads will be under a cgroup. So, I'll be happy if
a vcpu that is looping to wait for a spinlock yields immediately... But I'm not
familiar with this area, either. Thanks.
-Kame
* Re: [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
2010-10-12 7:52 ` [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Bharata B Rao
@ 2010-10-15 4:45 ` Balbir Singh
2010-12-07 13:13 ` Bharata B Rao
0 siblings, 1 reply; 51+ messages in thread
From: Balbir Singh @ 2010-10-15 4:45 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
* Bharata B Rao <bharata@linux.vnet.ibm.com> [2010-10-12 13:22:47]:
> sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
>
> From: Paul Turner <pjt@google.com>
>
> At the start of a new period there are several actions we must take:
> - Refresh global bandwidth pool
> - Unthrottle entities who ran out of quota as refreshed bandwidth permits
>
> Unthrottled entities have the cfs_rq->throttled flag set and are re-enqueued
> into the cfs entity hierarchy.
>
Am I reading this right?
> sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
> since it is now shared by both cfs and rt bandwidth period timers.
>
> The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
> rd->span instead of cpu_online_mask since I think that was incorrect before
> (don't want to hit cpu's outside of your root_domain for RT bandwidth).
>
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
> kernel/sched.c | 16 ++++++++++++
> kernel/sched_fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched_rt.c | 19 --------------
> 3 files changed, 84 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1565,6 +1565,8 @@ static int tg_nop(struct task_group *tg,
> }
> #endif
>
> +static inline const struct cpumask *sched_bw_period_mask(void);
> +
> #ifdef CONFIG_SMP
> /* Used instead of source_load when we know the type == 0 */
> static unsigned long weighted_cpuload(const int cpu)
> @@ -1933,6 +1935,18 @@ static inline void __set_task_cpu(struct
>
> static const struct sched_class rt_sched_class;
>
> +#ifdef CONFIG_SMP
> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> + return cpu_rq(smp_processor_id())->rd->span;
> +}
> +#else
> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> + return cpu_online_mask;
> +}
> +#endif
> +
> #ifdef CONFIG_CFS_BANDWIDTH
> /*
> * default period for cfs group bandwidth.
> @@ -8937,6 +8951,8 @@ static int tg_set_cfs_bandwidth(struct t
>
> raw_spin_lock_irq(&rq->lock);
> init_cfs_rq_quota(cfs_rq);
> + if (cfs_rq_throttled(cfs_rq))
> + unthrottle_cfs_rq(cfs_rq);
> raw_spin_unlock_irq(&rq->lock);
> }
> mutex_unlock(&mutex);
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -268,6 +268,13 @@ find_matching_se(struct sched_entity **s
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> #ifdef CONFIG_CFS_BANDWIDTH
> +static inline
> +struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
> +{
Nit pick, but I'd call this function cfs_bandwidth_cfs_cpu_rq
> + return container_of(cfs_b, struct task_group,
> + cfs_bandwidth)->cfs_rq[cpu];
> +}
> +
> static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> {
> return &tg->cfs_bandwidth;
> @@ -1219,6 +1226,29 @@ out_throttled:
> cfs_rq->throttled = 1;
> }
>
> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> + struct sched_entity *se;
> + struct rq *rq = rq_of(cfs_rq);
> +
> + se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> + cfs_rq->throttled = 0;
> + for_each_sched_entity(se) {
> + if (se->on_rq)
> + break;
> +
> + cfs_rq = cfs_rq_of(se);
> + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
Should we really enqueue with ENQUEUE_WAKEUP - the task was throttled, not
asleep.
> + if (cfs_rq_throttled(cfs_rq))
> + break;
> + }
> +
> + /* determine whether we need to wake up a potentially idle cpu */
> + if (rq->curr == rq->idle && rq->cfs.nr_running)
> + resched_task(rq->curr);
> +}
> +
> static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> unsigned long delta_exec)
> {
> @@ -1241,8 +1271,44 @@ static void account_cfs_rq_quota(struct
>
> static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> {
> - return 1;
> + int i, idle = 1;
> + u64 delta;
> + const struct cpumask *span;
> +
> + if (cfs_b->quota == RUNTIME_INF)
> + return 1;
I am afraid I don't understand how return codes are being used here.
idle is set to 1 if there are no running tasks across all CPUs. Why do
we return a 1 from here?
> +
> + /* reset group quota */
> + raw_spin_lock(&cfs_b->lock);
> + cfs_b->runtime = cfs_b->quota;
> + raw_spin_unlock(&cfs_b->lock);
> +
> + span = sched_bw_period_mask();
> + for_each_cpu(i, span) {
> + struct rq *rq = cpu_rq(i);
> + struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
> +
> + if (cfs_rq->nr_running)
> + idle = 0;
> +
> + if (!cfs_rq_throttled(cfs_rq))
> + continue;
> +
> + delta = tg_request_cfs_quota(cfs_rq->tg);
> +
> + if (delta) {
> + raw_spin_lock(&rq->lock);
> + cfs_rq->quota_assigned += delta;
> +
> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> + unthrottle_cfs_rq(cfs_rq);
> + raw_spin_unlock(&rq->lock);
> + }
> + }
> +
> + return idle;
> }
> +
> #endif
>
> #ifdef CONFIG_SMP
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -241,18 +241,6 @@ static int rt_se_boosted(struct sched_rt
> return p->prio != p->normal_prio;
> }
>
> -#ifdef CONFIG_SMP
> -static inline const struct cpumask *sched_rt_period_mask(void)
> -{
> - return cpu_rq(smp_processor_id())->rd->span;
> -}
> -#else
> -static inline const struct cpumask *sched_rt_period_mask(void)
> -{
> - return cpu_online_mask;
> -}
> -#endif
> -
> static inline
> struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
> {
> @@ -302,11 +290,6 @@ static inline int rt_rq_throttled(struct
> return rt_rq->rt_throttled;
> }
>
> -static inline const struct cpumask *sched_rt_period_mask(void)
> -{
> - return cpu_online_mask;
> -}
> -
> static inline
> struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
> {
> @@ -524,7 +507,7 @@ static int do_sched_rt_period_timer(stru
> if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
> return 1;
>
> - span = sched_rt_period_mask();
> + span = sched_bw_period_mask();
> for_each_cpu(i, span) {
> int enqueue = 0;
> struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
--
Three Cheers,
Balbir
* Re: [PATCH v3 0/7] CFS Bandwidth Control
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
` (8 preceding siblings ...)
2010-10-13 5:44 ` Herbert Poetzl
@ 2010-11-17 8:32 ` Lai Jiangshan
2010-11-19 3:24 ` Bharata B Rao
9 siblings, 1 reply; 51+ messages in thread
From: Lai Jiangshan @ 2010-11-17 8:32 UTC (permalink / raw)
To: bharata
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Pete
On 10/12/2010 03:49 PM, Bharata B Rao wrote:
> Hi,
>
> Its been a while since we posted CPS hard limits (aka CFS bandwidth control
> now) patches, hence a quick recap first:
>
> - I have been working on CFS hard limits since last year and have posted
> a few versions of the same (last post: http://lkml.org/lkml/2010/1/5/44)
> - Paul Turner and Nikhil Rao meanwhile started working on CFS bandwidth
> control and have posted a couple of versions.
> (last post v2: http://lwn.net/Articles/385055/)
>
Hi, All
I tried it, and it works very well! It adds a very useful feature.
It is very helpful for virtualization and cloud computing, I think.
I hope the next version of the patchset will be sent and merged soon.
When I tested it, my box locked up several times, infrequently, and
I don't know how to reproduce it, but the following steps may help to reproduce it:
# mount -t cgroup -o cpu xxx /mnt
# cd /mnt
# mkdir test
# cd test
# echo 5000000 > cpu.cfs_period_us
# echo 1000000 > cpu.cfs_quota_us
# cat /dev/zero > /dev/null &
# echo $! > tasks
# #make the system busy, browse websites and open many tabs in your browser etc.
# #but do NOT do anything important.
# #after some time, the system may die (it cannot respond even when you type Alt-SysRq-C)
Thank you all
Lai
* Re: [PATCH v3 0/7] CFS Bandwidth Control
2010-11-17 8:32 ` Lai Jiangshan
@ 2010-11-19 3:24 ` Bharata B Rao
0 siblings, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-11-19 3:24 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, Dhaval Giani, Balbir Singh, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Pete
On Wed, Nov 17, 2010 at 04:32:29PM +0800, Lai Jiangshan wrote:
> On 10/12/2010 03:49 PM, Bharata B Rao wrote:
> > Hi,
> >
> > Its been a while since we posted CPS hard limits (aka CFS bandwidth control
> > now) patches, hence a quick recap first:
> >
> > - I have been working on CFS hard limits since last year and have posted
> > a few versions of the same (last post: http://lkml.org/lkml/2010/1/5/44)
> > - Paul Turner and Nikhil Rao meanwhile started working on CFS bandwidth
> > control and have posted a couple of versions.
> > (last post v2: http://lwn.net/Articles/385055/)
> >
>
> Hi, All
>
> I tried it, and it works very well! It adds a very useful feature.
> I think it will be very helpful for virtualization and cloud computing.
> I hope the next version of the patchset will be posted and merged soon.
>
> When I tested it, my box hung several times, though infrequently.
> I don't know how to reproduce it reliably, but the following steps may help:
>
> # mount -t cgroup -o cpu xxx /mnt
> # cd /mnt
> # mkdir test
> # cd test
> # echo 5000000 > cpu.cfs_period_us
> # echo 1000000 > cpu.cfs_quota_us
> # cat /dev/zero > /dev/null &
> # echo $! > tasks
> # # make the system busy: browse websites, open many tabs in your browser, etc.,
> # # but do NOT do anything important.
> # # after some time the system may hang (it cannot respond even when you type Alt-SysRq-C)
>
Hi Lai, thanks for testing the patches. I haven't seen this behaviour here yet,
but I will try the steps you mention and see if I can reproduce it.
Meanwhile, if you find more information about the problem, do let us know.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking
2010-10-14 12:38 ` Balbir Singh
2010-10-14 13:24 ` Peter Zijlstra
@ 2010-12-06 9:02 ` Bharata B Rao
1 sibling, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-12-06 9:02 UTC (permalink / raw)
To: Balbir Singh
Cc: Peter Zijlstra, linux-kernel, Dhaval Giani,
Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
Chris Friesen, Paul Menage, Mike Waychison, Paul Turner,
Nikhil Rao
On Thu, Oct 14, 2010 at 06:08:34PM +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2010-10-14 09:52:17]:
>
> > On Wed, 2010-10-13 at 18:30 +0530, Balbir Singh wrote:
> > > > +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> > > > +{
> > > > + int i;
> > > > + static DEFINE_MUTEX(mutex);
> > > > +
> > > > + if (tg == &init_task_group)
> > > > + return -EINVAL;
> > > > +
> > > > + if (!period)
> > > > + return -EINVAL;
> > > > +
> > > > + /*
> > > > + * Ensure we have at least one tick of bandwidth every period. This is
> > > > + * to prevent reaching a state of large arrears when throttled via
> > > > + * entity_tick() resulting in prolonged exit starvation.
> > > > + */
> > > > + if (NS_TO_JIFFIES(quota) < 1)
> > > > + return -EINVAL;
> > >
> > > I hope we document this in the Documentation :)
> >
> > /me went and looked up arrears in a dictionary and wonders why 'debt'
> > wasn't good enough.
> > > > +
> > > > + mutex_lock(&mutex);
> > > > + raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> > > > + tg->cfs_bandwidth.period = ns_to_ktime(period);
> > > > + tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> > > > + raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> > > > +
> > > > + for_each_possible_cpu(i) {
> > >
> > > Why not for_each_online_cpu()?
> >
> > Probably could be cured with a hotplug handler, but then you need to
> > track more state iirc.
> >
>
> What more state? If a CPU is offline, we never get to it, do we? I
> think we need to do just an init and destroy - no?
Here we essentially initialize tg->cfs_rq[cpu]->quota_used and quota_assigned
for all CPUs. Given that we don't destroy tg->cfs_rq[cpu] and tg->se[cpu]
when a CPU goes offline, is it really worth having a notifier just to
initialize quota_used and quota_assigned when a CPU comes online?
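To make the for_each_possible_cpu() vs. hotplug-notifier trade-off concrete,
here is a minimal userspace sketch (the struct layout is invented; only the
field names quota_used/quota_assigned are taken from this thread, and the toy
NR_CPUS stands in for the kernel's possible-CPU mask). Initializing over all
possible CPUs means an offline CPU's per-cpu state is already valid by the
time it comes online:

```c
#include <assert.h>

#define NR_CPUS 4 /* toy value; the kernel iterates the possible-CPU mask */

/* Toy stand-ins; field names mirror the thread, layout is invented. */
struct toy_cfs_rq {
	unsigned long long quota_used;
	unsigned long long quota_assigned;
};

struct toy_tg {
	unsigned long long quota;
	struct toy_cfs_rq cfs_rq[NR_CPUS];
};

/* Models init_cfs_rq_quota(): reset local usage, no quota handed out yet. */
static void init_cfs_rq_quota(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->quota_used = 0;
	cfs_rq->quota_assigned = 0;
}

/*
 * Models the for_each_possible_cpu() loop in tg_set_cfs_bandwidth():
 * offline CPUs get initialized too, so their per-cpu state is already
 * correct when they come online -- no hotplug notifier required.
 */
static void toy_tg_set_bandwidth(struct toy_tg *tg, unsigned long long quota)
{
	int i;

	tg->quota = quota;
	for (i = 0; i < NR_CPUS; i++)
		init_cfs_rq_quota(&tg->cfs_rq[i]);
}
```

A notifier would only buy re-running init_cfs_rq_quota() at online time, which
the loop over possible CPUs has already done.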
Regards,
Bharata.
>
> > > > + struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> > > > + struct rq *rq = rq_of(cfs_rq);
> > > > +
> > > > + raw_spin_lock_irq(&rq->lock);
> > > > + init_cfs_rq_quota(cfs_rq);
> > > > + raw_spin_unlock_irq(&rq->lock);
> > > > + }
> > > > + mutex_unlock(&mutex);
> > > > +
> > > > + return 0;
> > > > +}
> >
>
> --
> Three Cheers,
> Balbir
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
2010-10-15 4:45 ` Balbir Singh
@ 2010-12-07 13:13 ` Bharata B Rao
0 siblings, 0 replies; 51+ messages in thread
From: Bharata B Rao @ 2010-12-07 13:13 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
Paul Menage, Mike Waychison, Paul Turner, Nikhil Rao
Sorry Balbir, I didn't realize that we hadn't replied to these comments of yours.
On Fri, Oct 15, 2010 at 10:15:52AM +0530, Balbir Singh wrote:
> * Bharata B Rao <bharata@linux.vnet.ibm.com> [2010-10-12 13:22:47]:
>
> > sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
> >
> > From: Paul Turner <pjt@google.com>
> >
> > At the start of a new period there are several actions we must take:
> > - Refresh global bandwidth pool
> > - Unthrottle entities who ran out of quota as refreshed bandwidth permits
> >
> > Unthrottled entities have the cfs_rq->throttled flag set and are re-enqueued
> > into the cfs entity hierarchy.
> >
>
> Am I reading this right?
Yes, it needs to be corrected. Thanks.
>
> > sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
> > since it is now shared by both cfs and rt bandwidth period timers.
> >
> > The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
> > rd->span instead of cpu_online_mask since I think that was incorrect before
> > (don't want to hit cpu's outside of your root_domain for RT bandwidth).
> >
> > Signed-off-by: Paul Turner <pjt@google.com>
> > Signed-off-by: Nikhil Rao <ncrao@google.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > ---
> > kernel/sched.c | 16 ++++++++++++
> > kernel/sched_fair.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched_rt.c | 19 --------------
> > 3 files changed, 84 insertions(+), 19 deletions(-)
> >
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -1565,6 +1565,8 @@ static int tg_nop(struct task_group *tg,
> > }
> > #endif
> >
> > +static inline const struct cpumask *sched_bw_period_mask(void);
> > +
> > #ifdef CONFIG_SMP
> > /* Used instead of source_load when we know the type == 0 */
> > static unsigned long weighted_cpuload(const int cpu)
> > @@ -1933,6 +1935,18 @@ static inline void __set_task_cpu(struct
> >
> > static const struct sched_class rt_sched_class;
> >
> > +#ifdef CONFIG_SMP
> > +static inline const struct cpumask *sched_bw_period_mask(void)
> > +{
> > + return cpu_rq(smp_processor_id())->rd->span;
> > +}
> > +#else
> > +static inline const struct cpumask *sched_bw_period_mask(void)
> > +{
> > + return cpu_online_mask;
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_CFS_BANDWIDTH
> > /*
> > * default period for cfs group bandwidth.
> > @@ -8937,6 +8951,8 @@ static int tg_set_cfs_bandwidth(struct t
> >
> > raw_spin_lock_irq(&rq->lock);
> > init_cfs_rq_quota(cfs_rq);
> > + if (cfs_rq_throttled(cfs_rq))
> > + unthrottle_cfs_rq(cfs_rq);
> > raw_spin_unlock_irq(&rq->lock);
> > }
> > mutex_unlock(&mutex);
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -268,6 +268,13 @@ find_matching_se(struct sched_entity **s
> > #endif /* CONFIG_FAIR_GROUP_SCHED */
> >
> > #ifdef CONFIG_CFS_BANDWIDTH
> > +static inline
> > +struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
> > +{
>
> Nit pick, but I'd call this function cfs_bandwidth_cfs_cpu_rq
>
> > + return container_of(cfs_b, struct task_group,
> > + cfs_bandwidth)->cfs_rq[cpu];
> > +}
> > +
> > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > {
> > return &tg->cfs_bandwidth;
> > @@ -1219,6 +1226,29 @@ out_throttled:
> > cfs_rq->throttled = 1;
> > }
> >
> > +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > +{
> > + struct sched_entity *se;
> > + struct rq *rq = rq_of(cfs_rq);
> > +
> > + se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +
> > + cfs_rq->throttled = 0;
> > + for_each_sched_entity(se) {
> > + if (se->on_rq)
> > + break;
> > +
> > + cfs_rq = cfs_rq_of(se);
> > + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>
> Should we really enqueue with ENQUEUE_WAKEUP - the task throttled, not
> slept.
Yes, but the actions are quite similar. In both cases, the entities go off
the runqueue and get enqueued back. I see two (side)effects of using
DEQUEUE_SLEEP during dequeue and ENQUEUE_WAKEUP during enqueue:
- vruntime normalization isn't done for throttled entities. This should be
fine since they don't get pulled around when throttled.
- vruntime of throttled entities is recalculated during unthrottling (enqueue).
This ensures that throttled entities don't get an undue vruntime advantage
when they are enqueued back.
This is my understanding. I would request Paul to comment here.
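The second point can be modeled with a toy version of the wakeup placement
(values and the threshold are invented toy numbers; mainline place_entity()
derives its threshold from sysctl_sched_latency): on ENQUEUE_WAKEUP an entity
is placed no earlier than min_vruntime minus a small threshold, so a stale,
small vruntime from before throttling cannot become a large scheduling
advantage.

```c
#include <assert.h>

/*
 * Toy version of the wakeup path of place_entity(): the entity is placed
 * no earlier than (min_vruntime - thresh).  Toy numbers, not the kernel's
 * nanosecond-scale vruntimes.
 */
static unsigned long long place_entity_wakeup(unsigned long long min_vruntime,
					      unsigned long long se_vruntime,
					      unsigned long long thresh)
{
	unsigned long long vruntime;

	vruntime = (min_vruntime > thresh) ? (min_vruntime - thresh) : 0;
	/* an entity that is already "behind" keeps its own vruntime */
	return (se_vruntime > vruntime) ? se_vruntime : vruntime;
}
```

For example, an entity throttled at vruntime 100 while the queue advanced to
min_vruntime 1000 is re-placed near 950 (with thresh 50) instead of keeping its
stale 100 -- which is the "no undue vruntime advantage" effect described above.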
>
> > + if (cfs_rq_throttled(cfs_rq))
> > + break;
> > + }
> > +
> > + /* determine whether we need to wake up potentally idle cpu */
> > + if (rq->curr == rq->idle && rq->cfs.nr_running)
> > + resched_task(rq->curr);
> > +}
> > +
> > static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> > unsigned long delta_exec)
> > {
> > @@ -1241,8 +1271,44 @@ static void account_cfs_rq_quota(struct
> >
> > static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> > {
> > - return 1;
> > + int i, idle = 1;
> > + u64 delta;
> > + const struct cpumask *span;
> > +
> > + if (cfs_b->quota == RUNTIME_INF)
> > + return 1;
>
> I am afraid I don't understand how return codes are being used here.
> idle is set to 1 if there are no running tasks across all CPUs. Why do
> we return a 1 from here?
Remember that we are in the hrtimer handler here; returning 1 ensures that
the hrtimer isn't restarted. So we don't restart the timer for a group that
isn't bandwidth constrained.
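The return-code convention can be sketched as follows (a toy model of the
structure of do_sched_cfs_period_timer(), not the real implementation): the
caller restarts the period timer only when the handler reports non-idle, so an
unconstrained (RUNTIME_INF) group returns 1 immediately and its timer simply
stops.

```c
#include <assert.h>

#define RUNTIME_INF (~0ULL)

/* Toy global bandwidth pool; mirrors cfs_b->quota/runtime from the patch. */
struct toy_cfs_b {
	unsigned long long quota;
	unsigned long long runtime;
};

/*
 * Returns 1 ("idle") => the caller does not restart the period timer.
 * Returns 0 => throttled runqueues were waiting on the refresh, so the
 * timer must keep running.
 */
static int toy_period_timer(struct toy_cfs_b *cfs_b, int nr_throttled)
{
	if (cfs_b->quota == RUNTIME_INF)
		return 1;		/* not bandwidth constrained: stop timer */

	cfs_b->runtime = cfs_b->quota;	/* refresh the global pool */
	/* ... unthrottle cfs_rqs that ran out of quota here ... */

	return nr_throttled == 0;	/* idle iff nobody needed the refresh */
}
```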
Regards,
Bharata.
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2010-12-07 13:14 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-12 7:49 [PATCH v3 0/7] CFS Bandwidth Control Bharata B Rao
2010-10-12 7:50 ` [PATCH v3 1/7] sched: introduce primitives to account for CFS bandwidth tracking Bharata B Rao
2010-10-13 13:00 ` Balbir Singh
2010-10-14 5:14 ` Bharata B Rao
2010-10-14 7:52 ` Peter Zijlstra
2010-10-14 12:38 ` Balbir Singh
2010-10-14 13:24 ` Peter Zijlstra
2010-12-06 9:02 ` Bharata B Rao
2010-10-12 7:51 ` [PATCH v3 2/7] sched: accumulate per-cfs_rq cpu usage Bharata B Rao
2010-10-13 13:30 ` Balbir Singh
2010-10-13 13:46 ` Nikhil Rao
2010-10-13 13:59 ` Balbir Singh
2010-10-13 14:41 ` Nikhil Rao
2010-10-14 5:39 ` Balbir Singh
2010-10-14 8:57 ` Peter Zijlstra
2010-10-14 9:07 ` Paul Turner
2010-10-14 9:13 ` Peter Zijlstra
2010-10-14 9:01 ` Peter Zijlstra
2010-10-14 9:14 ` Paul Turner
2010-10-14 9:27 ` Peter Zijlstra
2010-10-14 9:53 ` Paul Turner
2010-10-14 9:19 ` Peter Zijlstra
2010-10-14 9:27 ` Paul Turner
2010-10-14 9:40 ` Bharata B Rao
2010-10-12 7:52 ` [PATCH v3 3/7] sched: throttle cfs_rq entities which exceed their local quota Bharata B Rao
2010-10-13 6:34 ` KAMEZAWA Hiroyuki
2010-10-13 6:44 ` Paul Turner
2010-10-13 6:47 ` Bharata B Rao
2010-10-13 6:52 ` Paul Turner
2010-10-13 7:00 ` KAMEZAWA Hiroyuki
2010-10-13 7:13 ` Paul Turner
2010-10-14 9:12 ` Peter Zijlstra
2010-10-14 9:50 ` KAMEZAWA Hiroyuki
2010-10-14 9:59 ` Peter Zijlstra
2010-10-14 10:08 ` KAMEZAWA Hiroyuki
2010-10-14 10:25 ` Paul Turner
2010-10-14 10:41 ` Peter Zijlstra
2010-10-14 23:30 ` KAMEZAWA Hiroyuki
2010-10-14 10:37 ` Peter Zijlstra
2010-10-14 9:58 ` Paul Turner
2010-10-12 7:52 ` [PATCH v3 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Bharata B Rao
2010-10-15 4:45 ` Balbir Singh
2010-12-07 13:13 ` Bharata B Rao
2010-10-12 7:53 ` [PATCH v3 5/7] sched: add exports tracking cfs bandwidth control statistics Bharata B Rao
2010-10-12 7:54 ` [PATCH v3 6/7] sched: hierarchical task accounting for FAIR_GROUP_SCHED Bharata B Rao
2010-10-12 7:55 ` [PATCH v3 7/7] sched: Return/expire slack quota using generation counters Bharata B Rao
2010-10-13 5:14 ` [PATCH v3 0/7] CFS Bandwidth Control KAMEZAWA Hiroyuki
2010-10-13 5:44 ` Herbert Poetzl
2010-10-13 6:26 ` Paul Turner
2010-11-17 8:32 ` Lai Jiangshan
2010-11-19 3:24 ` Bharata B Rao