From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
To: Paul Turner <pjt@google.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>,
Vladimir Davydov <vdavydov@parallels.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Bharata B Rao <bharata@linux.vnet.ibm.com>,
Dhaval Giani <dhaval.giani@gmail.com>,
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
Ingo Molnar <mingo@elte.hu>,
Pavel Emelianov <xemul@parallels.com>
Subject: Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
Date: Wed, 7 Sep 2011 20:24:21 +0530
Message-ID: <20110907145421.GA32577@linux.vnet.ibm.com>
In-Reply-To: <BANLkTi=7S2qdVjbJkDja+GAoD=pNo2gPsTFZMFkB8NWWkO1JVQ@mail.gmail.com>
* Paul Turner <pjt@google.com> [2011-06-21 12:48:17]:
> Hi Kamalesh,
>
> Can you see what things look like under v7?
>
> There's been a few improvements to quota re-distribution that should
> hopefully help your test case.
>
> The remaining idle% I see on my machines appear to be a product of
> load-balancer inefficiency.
which is quite a complex problem to solve! I am still surprised that
we can't handle 32 cpuhogs on a 16-cpu system very easily. The tasks seem to
hop around madly rather than settle down at 2 tasks/cpu. Kamalesh, can you post
the exact count of migrations we saw on the latest tip over a 20-sec window?

Anyway, here's a "hack" to minimize the idle time induced by load-balance
issues. It brings idle time down from 7+% to ~0%. I am not too happy about
this, but I don't see any simpler way to eliminate the idle time completely
(other than making the load-balancer completely fair!).
--
Fix excessive idle time reported when cgroups are capped. The patch
introduces the notion of "steal" (or "grace") time which is the surplus
time/bandwidth each cgroup is allowed to consume, subject to a maximum
steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
or "grace" time when the lone task running on a cpu is about to be throttled.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Index: linux-3.1-rc4/include/linux/sched.h
===================================================================
--- linux-3.1-rc4.orig/include/linux/sched.h 2011-09-07 14:57:49.529602231 +0800
+++ linux-3.1-rc4/include/linux/sched.h 2011-09-07 14:58:49.952418107 +0800
@@ -2042,6 +2042,7 @@ static inline void sched_autogroup_exit(
#ifdef CONFIG_CFS_BANDWIDTH
extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+extern unsigned int sysctl_sched_cfs_max_steal_time;
#endif
#ifdef CONFIG_RT_MUTEXES
Index: linux-3.1-rc4/kernel/sched.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched.c 2011-09-07 14:57:49.532854588 +0800
+++ linux-3.1-rc4/kernel/sched.c 2011-09-07 14:58:49.955453578 +0800
@@ -254,7 +254,7 @@ struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
raw_spinlock_t lock;
ktime_t period;
- u64 quota, runtime;
+ u64 quota, runtime, steal_time;
s64 hierarchal_quota;
u64 runtime_expires;
Index: linux-3.1-rc4/kernel/sched_fair.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sched_fair.c 2011-09-07 14:57:49.533644483 +0800
+++ linux-3.1-rc4/kernel/sched_fair.c 2011-09-07 15:16:09.338824132 +0800
@@ -101,6 +101,18 @@ unsigned int __read_mostly sysctl_sched_
* default: 5 msec, units: microseconds
*/
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
+
+/*
+ * "Surplus" quota given to a cgroup to prevent a CPU from becoming idle.
+ *
+ * This would have been unnecessary had the load-balancer been "ideal" in
+ * loading tasks uniformly across all CPUs, which would have allowed
+ * all cgroups to claim their "quota" completely. In the absence of an
+ * "ideal" load-balancer, cgroups are unable to utilize their quota, leading
+ * to unexpected idle time. This knob allows a CPU to keep running a
+ * task beyond its throttled point before becoming idle.
+ */
+unsigned int sysctl_sched_cfs_max_steal_time = 100000UL;
#endif
static const struct sched_class fair_sched_class;
@@ -1288,6 +1300,11 @@ static inline u64 sched_cfs_bandwidth_sl
return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}
+static inline u64 sched_cfs_max_steal_time(void)
+{
+ return (u64)sysctl_sched_cfs_max_steal_time * NSEC_PER_USEC;
+}
+
/*
* Replenish runtime according to assigned quota and update expiration time.
* We use sched_clock_cpu directly instead of rq->clock to avoid adding
@@ -1303,6 +1320,7 @@ static void __refill_cfs_bandwidth_runti
return;
now = sched_clock_cpu(smp_processor_id());
+ cfs_b->steal_time = 0;
cfs_b->runtime = cfs_b->quota;
cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
}
@@ -1337,6 +1355,12 @@ static int assign_cfs_rq_runtime(struct
cfs_b->runtime -= amount;
cfs_b->idle = 0;
}
+
+ if (!amount && rq_of(cfs_rq)->nr_running == 1 &&
+ cfs_b->steal_time < sched_cfs_max_steal_time()) {
+ amount = min_amount;
+ cfs_b->steal_time += amount;
+ }
}
expires = cfs_b->runtime_expires;
raw_spin_unlock(&cfs_b->lock);
@@ -1378,7 +1402,8 @@ static void expire_cfs_rq_runtime(struct
* whether the global deadline has advanced.
*/
- if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0) {
+ if ((s64)(cfs_rq->runtime_expires - cfs_b->runtime_expires) >= 0 ||
+ (rq_of(cfs_rq)->nr_running == 1 && cfs_b->steal_time < sched_cfs_max_steal_time())) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;
} else {
Index: linux-3.1-rc4/kernel/sysctl.c
===================================================================
--- linux-3.1-rc4.orig/kernel/sysctl.c 2011-09-07 14:57:49.534454409 +0800
+++ linux-3.1-rc4/kernel/sysctl.c 2011-09-07 14:58:49.958452846 +0800
@@ -388,6 +388,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = &one,
},
+ {
+ .procname = "sched_cfs_max_steal_time_us",
+ .data = &sysctl_sched_cfs_max_steal_time,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &one,
+ },
#endif
#ifdef CONFIG_PROVE_LOCKING
{