[RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency

linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
@ 2014-05-11 18:16 Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 01/12 v2] CONFIG for " Yuyang Du
                   ` (13 more replies)
  0 siblings, 14 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Hi Ingo, PeterZ, Rafael, and others,

The current scheduler’s load balancing is completely work-conserving. In some
workload, generally low CPU utilization but immersed with CPU bursts of
transient tasks, migrating task to engage all available CPUs for
work-conserving can lead to significant overhead: cache locality loss,
idle/active HW state transitional latency and power, shallower idle state,
etc, which are both power and performance inefficient especially for today’s
low power processors in mobile. 

This RFC introduces a sense of idleness-conserving into work-conserving (by
all means, we really don’t want to be overwhelming in only one way). But to
what extent the idleness-conserving should be, bearing in mind that we don’t
want to sacrifice performance? We first need a load/idleness indicator to that
end.

Thanks to CFS’s “model an ideal, precise multi-tasking CPU”, tasks can be seen
as concurrently running (the tasks in the runqueue). So it is natural to use
task concurrency as load indicator. Having said that, we do two things:

1) Divide continuous time into periods of time, and average task concurrency
in period, for tolerating the transient bursts:
a = sum(concurrency * time) / period
2) Exponentially decay past periods, and synthesize them all, for hysteresis
to load drops or resilience to load rises (let f be decaying factor, and a_x
the xth period average since period 0):
s = a_n + f^1 * a_n-1 + f^2 * a_n-2 +, ..., + f^(n-1) * a_1 + f^n * a_0

We name this load indicator as CPU ConCurrency (CC): task concurrency
determines how many CPUs are needed to be running concurrently.

Another two ways of how to interpret CC:

1) the current work-conserving load balance also uses CC, but instantaneous
CC.

2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization. If
we change: "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
period". Then a' is just about the CPU utilization. And the way we weight
runqueue-length is the simplest one (excluding the exponential decays, and you
may have other ways).

To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
scheduler tick, and 4) enter/exit idle.

After CC, in the consolidation part, we do 1) attach the CPU topology to be
adaptive beyond our experimental platforms, and 2) intercept the current load
balance for load and load balancing containment.

Currently, CC is per CPU. To consolidate, the formula is based on a heuristic.
Suppose we have 2 CPUs, their task concurrency over time is ('-' means no
task, 'x' having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be: CC' = CC[0] +
CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between case 1 and 2 in terms of how xxx overlaps, the CC should be between
CC' and CC''. So, we uniformly use this condition for consolidation (suppose
we consolidate m CPUs to n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n *
consolidating_coefficient

The consolidating_coefficient could be like 100% or more or less.

By CC, we implemented a Workload Consolidation patch on two Intel mobile
platforms (a quad-core composed of two dual-core modules): contain load and
load balancing in the first dual-core when aggregated CC low, and if not in
the full quad-core. Results show that we got power savings and no substantial
performance regression (even gains for some). The workloads we used to
evaluate the Workload Consolidation include 1) 50+ perf/ux benchmarks (almost
all of the magazine ones), and 2) ~10 power workloads, of course, they are the
easiest ones, such as browsing, audio, video, recording, imaging, etc. The
current half-life is 1 period, and the period was 32ms, and now 64ms for more
aggressive consolidation.

v2:
- Data type defined in formation

Yuyang Du (12):
  CONFIG for CPU ConCurrency
  Init CPU ConCurrency
  CPU ConCurrency calculation
  CPU ConCurrency tracking
  CONFIG for Workload Consolidation
  Attach CPU topology to specify each sched_domain's workload
    consolidation
  CPU ConCurrency API for Workload Consolidation
  Intercept wakeup/fork/exec load balancing
  Intercept idle balancing
  Intercept periodic nohz idle balancing
  Intercept periodic load balancing
  Intercept RT scheduler

 arch/x86/Kconfig             |   21 +
 include/linux/sched.h        |   13 +
 include/linux/sched/sysctl.h |    8 +
 include/linux/topology.h     |   16 +
 kernel/sched/Makefile        |    1 +
 kernel/sched/concurrency.c   |  928 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |   46 +++
 kernel/sched/fair.c          |  131 +++++-
 kernel/sched/rt.c            |   25 ++
 kernel/sched/sched.h         |   36 ++
 kernel/sysctl.c              |   16 +
 11 files changed, 1232 insertions(+), 9 deletions(-)
 create mode 100644 kernel/sched/concurrency.c

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH 01/12 v2] CONFIG for CPU ConCurrency
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 02/12 v2] Init " Yuyang Du
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Add CONFIG_CPU_CONCURRENCY in arch/x86/Kconfig. This CONFIG enables/disables
CPU ConCurrency load metric tracking.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 arch/x86/Kconfig |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..9bfac8d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -797,6 +797,17 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config CPU_CONCURRENCY
+	bool "CPU ConCurency (CC)"
+	default n
+	depends on SMP
+	---help---
+	  CPU ConCurrency (CC) is a new CPU load metric that measures the CPU
+	  load by averaging the number of running tasks. Using CC, the scheduler
+	  can evaluate the load of CPUs and may consolidate workloads on CPUs in
+	  load balancing for power efficiency without sacrificing performance.
+	  If unsure say N here.
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 02/12 v2] Init CPU ConCurrency
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 01/12 v2] CONFIG for " Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 03/12 v2] CPU ConCurrency calculation Yuyang Du
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

CPU ConCurrency (CC) is inserted as a member in each CPU's rq, and
initiated the same time as rq. Updating CC is protected by rq's lock.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/Makefile      |    1 +
 kernel/sched/concurrency.c |   22 ++++++++++++++++++++++
 kernel/sched/core.c        |    2 ++
 kernel/sched/sched.h       |   21 +++++++++++++++++++++
 4 files changed, 46 insertions(+)
 create mode 100644 kernel/sched/concurrency.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..e67f7e3 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_CONCURRENCY) += concurrency.o
diff --git a/kernel/sched/concurrency.c b/kernel/sched/concurrency.c
new file mode 100644
index 0000000..50e08a2
--- /dev/null
+++ b/kernel/sched/concurrency.c
@@ -0,0 +1,22 @@
+/*
+ * CPU ConCurrency (CC) measures the CPU load by averaging
+ * the number of running tasks. Using CC, the scheduler can
+ * evaluate the load of CPUs to improve load balance for power
+ * efficiency without sacrificing performance.
+ *
+ */
+
+#ifdef CONFIG_CPU_CONCURRENCY
+
+#include "sched.h"
+
+void init_cpu_concurrency(struct rq *rq)
+{
+	rq->concurrency.sum = 0;
+	rq->concurrency.sum_now = 0;
+	rq->concurrency.contrib = 0;
+	rq->concurrency.nr_running = 0;
+	rq->concurrency.sum_timestamp = ULLONG_MAX;
+	rq->concurrency.contrib_timestamp = ULLONG_MAX;
+}
+#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..7958a47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6884,6 +6884,8 @@ void __init sched_init(void)
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+		init_cpu_concurrency(rq);
 	}
 
 	set_load_weight(&init_task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..f1c9235 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -508,6 +508,17 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_CPU_CONCURRENCY
+struct cpu_concurrency_t {
+	u64 sum;
+	u64 sum_now;
+	u64 contrib;
+	u64 sum_timestamp;
+	u64 contrib_timestamp;
+	unsigned int nr_running;
+};
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -643,6 +654,10 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+#ifdef CONFIG_CPU_CONCURRENCY
+	struct cpu_concurrency_t concurrency;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
@@ -1203,6 +1218,12 @@ extern void init_sched_dl_class(void);
 extern void resched_task(struct task_struct *p);
 extern void resched_cpu(int cpu);
 
+#ifdef CONFIG_CPU_CONCURRENCY
+extern void init_cpu_concurrency(struct rq *rq);
+#else
+static inline void init_cpu_concurrency(struct rq *rq) {}
+#endif
+
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 03/12 v2] CPU ConCurrency calculation
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 01/12 v2] CONFIG for " Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 02/12 v2] Init " Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 04/12 v2] CPU ConCurrency tracking Yuyang Du
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

It is natural to use task concurrency (running tasks in the rq) as load
indicator. We calculate CC from task concurrency by two steps:

1) Divide continuous time into periods of time, and average task concurrency
in period, for tolerating the transient bursts:

a = sum(concurrency * time) / period

2) Exponentially decay past periods, and synthesize them all, for hysteresis
to load drops or resilience to load rises (let f be decaying factor, and a_x
the xth period average since period 0):

s = a_n + f^1 * a_n-1 + f^2 * a_n-2 +, ..., + f^(n-1) * a_1 + f^n * a_0

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched/sysctl.h |    8 +
 kernel/sched/concurrency.c   |  344 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |    2 +
 kernel/sysctl.c              |   16 ++
 4 files changed, 370 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 8045a55..ec52b3f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -36,6 +36,14 @@ extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
 
+#ifdef CONFIG_CPU_CONCURRENCY
+extern unsigned int sysctl_concurrency_sum_period;
+extern unsigned int sysctl_concurrency_decay_rate;
+extern int concurrency_decay_rate_handler(struct ctl_table *table, int write,
+					void __user *buffer,
+					size_t *lenp, loff_t *ppos);
+#endif
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/sched/concurrency.c b/kernel/sched/concurrency.c
index 50e08a2..da26dd7 100644
--- a/kernel/sched/concurrency.c
+++ b/kernel/sched/concurrency.c
@@ -10,6 +10,331 @@
 
 #include "sched.h"
 
+/*
+ * the sum period of time is 2^26 ns (~64) by default
+ */
+unsigned int sysctl_concurrency_sum_period = 26UL;
+
+/*
+ * the number of sum periods, after which the original
+ * will be reduced/decayed to half
+ */
+unsigned int sysctl_concurrency_decay_rate = 1UL;
+
+/*
+ * the contrib period of time is 2^10 (~1us) by default,
+ * us has better precision than ms, and
+ * 1024 makes use of faster shift than div
+ */
+static unsigned int cc_contrib_period = 10UL;
+
+/*
+ * the concurrency is scaled up for decaying,
+ * thus, concurrency 1 is effectively 2^cc_resolution (1024),
+ * which can be halved by 10 half-life periods
+ */
+static unsigned int cc_resolution = 10UL;
+
+/*
+ * after this number of half-life periods, even
+ * (1>>32)-1 (which is sufficiently large) is less than 1
+ */
+static unsigned int cc_decay_max_pds = 32UL;
+
+static inline u32 cc_scale_up(unsigned int c)
+{
+	return c << cc_resolution;
+}
+
+static inline u32 cc_scale_down(unsigned int c)
+{
+	return c >> cc_resolution;
+}
+
+/* from nanoseconds to sum periods */
+static inline u64 cc_sum_pds(u64 n)
+{
+	return n >> sysctl_concurrency_sum_period;
+}
+
+/* from sum period to timestamp in ns */
+static inline u64 cc_timestamp(u64 p)
+{
+	return p << sysctl_concurrency_sum_period;
+}
+
+/*
+ * from nanoseconds to contrib periods, because
+ * ns so risky that can overflow cc->contrib
+ */
+static inline u64 cc_contrib_pds(u64 n)
+{
+	return n >> cc_contrib_period;
+}
+
+/*
+ * cc_decay_factor only works for 32bit integer,
+ * cc_decay_factor_x, x indicates the number of periods
+ * as half-life (sysctl_concurrency_decay_rate)
+ */
+static const u32 cc_decay_factor_1[] = {
+	0xFFFFFFFF,
+};
+
+static const u32 cc_decay_factor_2[] = {
+	0xFFFFFFFF, 0xB504F333,
+};
+
+static const u32 cc_decay_factor_4[] = {
+	0xFFFFFFFF, 0xD744FCCA, 0xB504F333, 0x9837F051,
+};
+
+static const u32 cc_decay_factor_8[] = {
+	0xFFFFFFFF, 0xEAC0C6E7, 0xD744FCCA, 0xC5672A11,
+	0xB504F333, 0xA5FED6A9, 0x9837F051, 0x8B95C1E3,
+};
+
+/* by default sysctl_concurrency_decay_rate */
+static const u32 *cc_decay_factor =
+	cc_decay_factor_1;
+
+/*
+ * cc_decayed_sum depends on cc_resolution (fixed 10),
+ * cc_decayed_sum_x, x indicates the number of periods
+ * as half-life (sysctl_concurrency_decay_rate)
+ */
+static const u32 cc_decayed_sum_1[] = {
+	0, 512, 768, 896, 960, 992,
+	1008, 1016, 1020, 1022, 1023,
+};
+
+static const u32 cc_decayed_sum_2[] = {
+	0, 724, 1235, 1597, 1853, 2034, 2162, 2252,
+	2316, 2361, 2393, 2416, 2432, 2443, 2451,
+	2457, 2461, 2464, 2466, 2467, 2468, 2469,
+};
+
+static const u32 cc_decayed_sum_4[] = {
+	0, 861, 1585, 2193, 2705, 3135, 3497, 3801, 4057,
+	4272, 4453, 4605, 4733, 4840, 4930, 5006, 5070,
+	5124, 5169, 5207, 5239, 5266, 5289, 5308, 5324,
+	5337, 5348, 5358, 5366, 5373, 5379, 5384, 5388,
+	5391, 5394, 5396, 5398, 5400, 5401, 5402, 5403,
+	5404, 5405, 5406,
+};
+
+static const u32 cc_decayed_sum_8[] = {
+	0, 939, 1800, 2589, 3313, 3977, 4585, 5143,
+	5655, 6124, 6554, 6949, 7311, 7643, 7947, 8226,
+	8482, 8717, 8932, 9129, 9310, 9476, 9628, 9767,
+	9895, 10012, 10120, 10219, 10309, 10392, 10468, 10538,
+	10602, 10661, 10715, 10764, 10809, 10850, 10888, 10923,
+	10955, 10984, 11011, 11036, 11059, 11080, 11099, 11116,
+	11132, 11147, 11160, 11172, 11183, 11193, 11203, 11212,
+	11220, 11227, 11234, 11240, 11246, 11251, 11256, 11260,
+	11264, 11268, 11271, 11274, 11277, 11280, 11282, 11284,
+	11286, 11288, 11290, 11291, 11292, 11293, 11294, 11295,
+	11296, 11297, 11298, 11299, 11300, 11301, 11302,
+};
+
+/* by default sysctl_concurrency_decay_rate */
+static const u32 *cc_decayed_sum = cc_decayed_sum_1;
+
+/*
+ * the last index of cc_decayed_sum array
+ */
+static u32 cc_decayed_sum_len =
+	sizeof(cc_decayed_sum_1) / sizeof(cc_decayed_sum_1[0]) - 1;
+
+/*
+ * sysctl handler to update decay rate
+ */
+int concurrency_decay_rate_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (ret || !write)
+		return ret;
+
+	switch (sysctl_concurrency_decay_rate) {
+	case 1:
+		cc_decay_factor = cc_decay_factor_1;
+		cc_decayed_sum = cc_decayed_sum_1;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_1) /
+			sizeof(cc_decayed_sum_1[0]) - 1;
+		break;
+	case 2:
+		cc_decay_factor = cc_decay_factor_2;
+		cc_decayed_sum = cc_decayed_sum_2;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_2) /
+			sizeof(cc_decayed_sum_2[0]) - 1;
+		break;
+	case 4:
+		cc_decay_factor = cc_decay_factor_4;
+		cc_decayed_sum = cc_decayed_sum_4;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_4) /
+			sizeof(cc_decayed_sum_4[0]) - 1;
+		break;
+	case 8:
+		cc_decay_factor = cc_decay_factor_8;
+		cc_decayed_sum = cc_decayed_sum_8;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_8) /
+			sizeof(cc_decayed_sum_8[0]) - 1;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	cc_decay_max_pds *= sysctl_concurrency_decay_rate;
+
+	return 0;
+}
+
+/*
+ * decay concurrency at some decay rate
+ */
+static inline u64 decay_cc(u64 cc, u64 periods)
+{
+	u32 periods_l;
+
+	if (periods <= 0)
+		return cc;
+
+	if (unlikely(periods >= cc_decay_max_pds))
+		return 0;
+
+	/* now period is not too large */
+	periods_l = (u32)periods;
+	if (periods_l >= sysctl_concurrency_decay_rate) {
+		cc >>= periods_l / sysctl_concurrency_decay_rate;
+		periods_l %= sysctl_concurrency_decay_rate;
+	}
+
+	if (!periods_l)
+		return cc;
+
+	cc *= cc_decay_factor[periods_l];
+
+	return cc >> 32;
+}
+
+/*
+ * add missed periods by predefined constants
+ */
+static inline u64 cc_missed_pds(u64 periods)
+{
+	if (periods <= 0)
+		return 0;
+
+	if (periods > cc_decayed_sum_len)
+		periods = cc_decayed_sum_len;
+
+	return cc_decayed_sum[periods];
+}
+
+/*
+ * scale up nr_running, because we decay
+ */
+static inline u32 cc_weight(unsigned int nr_running)
+{
+	/*
+	 * scaling factor, this should be tunable
+	 */
+	return cc_scale_up(nr_running);
+}
+
+static inline void
+__update_concurrency(struct rq *rq, u64 now, struct cpu_concurrency_t *cc)
+{
+	u64 sum_pds, sum_pds_s, sum_pds_e;
+	u64 contrib_pds, ts_contrib, contrib_pds_one;
+	u64 sum_now = 0;
+	u32 weight;
+	int updated = 0;
+
+	/*
+	 * guarantee contrib_timestamp always >= sum_timestamp,
+	 * and sum_timestamp is at period boundary
+	 */
+	if (now <= cc->sum_timestamp) {
+		cc->sum_timestamp = cc_timestamp(cc_sum_pds(now));
+		cc->contrib_timestamp = now;
+		return;
+	}
+
+	weight = cc_weight(cc->nr_running);
+
+	/* start and end of sum periods */
+	sum_pds_s = cc_sum_pds(cc->sum_timestamp);
+	sum_pds_e = cc_sum_pds(now);
+	sum_pds = sum_pds_e - sum_pds_s;
+	/* number of contrib periods in one sum period */
+	contrib_pds_one = cc_contrib_pds(cc_timestamp(1));
+
+	/*
+	 * if we have passed at least one period,
+	 * we need to do four things:
+	 */
+	if (sum_pds) {
+		/* 1) complete the last period */
+		ts_contrib = cc_timestamp(sum_pds_s + 1);
+		contrib_pds = cc_contrib_pds(ts_contrib);
+		contrib_pds -= cc_contrib_pds(cc->contrib_timestamp);
+
+		if (likely(contrib_pds))
+			cc->contrib += weight * contrib_pds;
+
+		cc->contrib = div64_u64(cc->contrib, contrib_pds_one);
+
+		cc->sum += cc->contrib;
+		cc->contrib = 0;
+
+		/* 2) update/decay them */
+		cc->sum = decay_cc(cc->sum, sum_pds);
+		sum_now = decay_cc(cc->sum, sum_pds - 1);
+
+		/* 3) compensate missed periods if any */
+		sum_pds -= 1;
+		cc->sum += cc->nr_running * cc_missed_pds(sum_pds);
+		sum_now += cc->nr_running * cc_missed_pds(sum_pds - 1);
+		updated = 1;
+
+		/* 4) update contrib timestamp to period boundary */
+		ts_contrib = cc_timestamp(sum_pds_e);
+
+		cc->sum_timestamp = ts_contrib;
+		cc->contrib_timestamp = ts_contrib;
+	}
+
+	/* current period */
+	contrib_pds = cc_contrib_pds(now);
+	contrib_pds -= cc_contrib_pds(cc->contrib_timestamp);
+
+	if (likely(contrib_pds))
+		cc->contrib += weight * contrib_pds;
+
+	/* new nr_running for next update */
+	cc->nr_running = rq->nr_running;
+
+	/*
+	 * we need to account for the current sum period,
+	 * if now has passed 1/2 of sum period, we contribute,
+	 * otherwise, we use the last complete sum period
+	 */
+	contrib_pds = cc_contrib_pds(now - cc->sum_timestamp);
+
+	if (contrib_pds > contrib_pds_one / 2) {
+		sum_now = div64_u64(cc->contrib, contrib_pds);
+		sum_now += cc->sum;
+		updated = 1;
+	}
+
+	if (updated == 1)
+		cc->sum_now = sum_now;
+	cc->contrib_timestamp = now;
+}
+
 void init_cpu_concurrency(struct rq *rq)
 {
 	rq->concurrency.sum = 0;
@@ -19,4 +344,23 @@ void init_cpu_concurrency(struct rq *rq)
 	rq->concurrency.sum_timestamp = ULLONG_MAX;
 	rq->concurrency.contrib_timestamp = ULLONG_MAX;
 }
+
+/*
+ * we update cpu concurrency at:
+ * 1) enqueue task, which increases concurrency
+ * 2) dequeue task, which decreases concurrency
+ * 3) periodic scheduler tick, in case no en/dequeue for long
+ * 4) enter and exit idle
+ */
+void update_cpu_concurrency(struct rq *rq)
+{
+	/*
+	 * protected under rq->lock
+	 */
+	struct cpu_concurrency_t *cc = &rq->concurrency;
+	u64 now = rq->clock;
+
+	__update_concurrency(rq, now, cc);
+}
+
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f1c9235..a4043ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1220,8 +1220,10 @@ extern void resched_cpu(int cpu);
 
 #ifdef CONFIG_CPU_CONCURRENCY
 extern void init_cpu_concurrency(struct rq *rq);
+extern void update_cpu_concurrency(struct rq *rq);
 #else
 static inline void init_cpu_concurrency(struct rq *rq) {}
+static inline void update_cpu_concurrency(struct rq *rq) {}
 #endif
 
 extern struct rt_bandwidth def_rt_bandwidth;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 74f5b58..91ba467 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1090,6 +1090,22 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_CPU_CONCURRENCY
+	{
+		.procname	= "concurrency_sum_period",
+		.data		= &sysctl_concurrency_sum_period,
+		.maxlen		= sizeof(sysctl_concurrency_sum_period),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "concurrency_decay_rate",
+		.data		= &sysctl_concurrency_decay_rate,
+		.maxlen		= sizeof(sysctl_concurrency_decay_rate),
+		.mode		= 0644,
+		.proc_handler	= concurrency_decay_rate_handler,
+	},
+#endif
 	{ }
 };
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 04/12 v2] CPU ConCurrency tracking
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (2 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 03/12 v2] CPU ConCurrency calculation Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 05/12 v2] CONFIG for Workload Consolidation Yuyang Du
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

CC can only be modified when enqueue and dequeue the CPU rq. And we also
track it in scheduler tick and idle enter/exit in case we may not have
enqueue and dequeue for a long time.

Therefore, we track CC in and only in these four points:

1. dequeue
2. enqueue
3. scheduler tick
4. idle enter and exit

TODO: use existing load tracking framework

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c |    3 +++
 kernel/sched/fair.c |    2 ++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7958a47..0236455 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -767,6 +767,7 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	sched_info_queued(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
+	update_cpu_concurrency(rq);
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -774,6 +775,7 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	sched_info_dequeued(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
+	update_cpu_concurrency(rq);
 }
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2428,6 +2430,7 @@ void scheduler_tick(void)
 	update_rq_clock(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
 	update_cpu_load_active(rq);
+	update_cpu_concurrency(rq);
 	raw_spin_unlock(&rq->lock);
 
 	perf_event_task_tick();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..e7153ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2563,6 +2563,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 void idle_enter_fair(struct rq *this_rq)
 {
 	update_rq_runnable_avg(this_rq, 1);
+	update_cpu_concurrency(this_rq);
 }
 
 /*
@@ -2573,6 +2574,7 @@ void idle_enter_fair(struct rq *this_rq)
 void idle_exit_fair(struct rq *this_rq)
 {
 	update_rq_runnable_avg(this_rq, 0);
+	update_cpu_concurrency(this_rq);
 }
 
 static int idle_balance(struct rq *this_rq);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 05/12 v2] CONFIG for Workload Consolidation
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (3 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 04/12 v2] CPU ConCurrency tracking Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 06/12 v2] Attach CPU topology to specify each sched_domain's workload consolidation Yuyang Du
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Add CONFIG_WORKLOAD_CONSOLIDATION in arch/x86/Kconfig. This CONFIG enables
and disables CPU workload consolidation in scheduler's load balancing.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 arch/x86/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9bfac8d..0999c16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -808,6 +808,16 @@ config CPU_CONCURRENCY
 	  load balancing for power efficiency without sacrificing performance.
 	  If unsure say N here.
 
+config WORKLOAD_CONSOLIDATION
+	bool "CPU Workload Consolidation"
+	default n
+	depends on CPU_CONCURRENCY
+	---help---
+	  CPU Workload Consolidation is a new CPU PM module, which uses the CPU
+	  concurrency of the CPU, and allows asymmetric concurrency across CPUs to
+	  reduce the SW and HW overhead to increase load balance efficiency and
+	  conserve energy. If unsure say N here.
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 06/12 v2] Attach CPU topology to specify each sched_domain's workload consolidation
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (4 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 05/12 v2] CONFIG for Workload Consolidation Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 07/12 v2] CPU ConCurrency API for Workload Consolidation Yuyang Du
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Defined SD_WORKLOAD_CONSOLIDATION flag in sched_domain. When this flag is
set, the workload consolidation applies to that domain. In addition, a
consolidating_coeff is defined in sched_domain to specify the degree of
consolidation in that domain.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h    |   13 +++++++++++++
 include/linux/topology.h |   16 ++++++++++++++++
 kernel/sched/core.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c7..f3f7d4a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -877,6 +877,12 @@ enum cpu_idle_type {
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+#define SD_WORKLOAD_CONSOLIDATION	0x8000	/* Higher concurrency in front */
+#else
+#define SD_WORKLOAD_CONSOLIDATION	0
+#endif
+
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
 struct sched_domain_attr {
@@ -960,6 +966,13 @@ struct sched_domain {
 		struct rcu_head rcu;	/* used during destruction */
 	};
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	unsigned int total_groups;			/* total groups number */
+	unsigned int group_number;			/* this CPU's group sequence */
+	unsigned int consolidating_coeff;	/* consolidating coefficient */
+	struct sched_group *first_group;	/* ordered by CPU number */
+#endif
+
 	unsigned int span_weight;
 	/*
 	 * Span of all CPUs in this domain.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7062330..334f83e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,6 +66,16 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+#ifndef WORKLOAD_CONSOLIDATION_INIT
+#define WORKLOAD_CONSOLIDATION_INIT(n) .consolidating_coeff = (n),
+#endif
+#else
+#ifndef WORKLOAD_CONSOLIDATION_INIT
+#define WORKLOAD_CONSOLIDATION_INIT(n)
+#endif
+#endif
+
 /*
  * Below are the 3 major initializers used in building sched_domains:
  * SD_SIBLING_INIT, for SMT domains
@@ -102,12 +112,14 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
+				| 0*SD_WORKLOAD_CONSOLIDATION	\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.smt_gain		= 1178,	/* 15% */			\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	WORKLOAD_CONSOLIDATION_INIT(0)				\
 }
 #endif
 #endif /* CONFIG_SCHED_SMT */
@@ -134,11 +146,13 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_WORKLOAD_CONSOLIDATION	\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	WORKLOAD_CONSOLIDATION_INIT(180)			\
 }
 #endif
 #endif /* CONFIG_SCHED_MC */
@@ -167,11 +181,13 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 1*SD_PREFER_SIBLING			\
+				| 1*SD_WORKLOAD_CONSOLIDATION	\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	WORKLOAD_CONSOLIDATION_INIT(180)			\
 }
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0236455..cd92f2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4843,7 +4843,11 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct ctl_table *table = sd_alloc_ctl_entry(15);
+#else
 	struct ctl_table *table = sd_alloc_ctl_entry(14);
+#endif
 
 	if (table == NULL)
 		return NULL;
@@ -4876,7 +4880,13 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
 		sizeof(long), 0644, proc_doulongvec_minmax, false);
 	set_table_entry(&table[12], "name", sd->name,
 		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	set_table_entry(&table[13], "consolidating_coeff", &sd->consolidating_coeff,
+		sizeof(int), 0644, proc_dointvec, false);
+	/* &table[14] is terminator */
+#else
 	/* &table[13] is terminator */
+#endif
 
 	return table;
 }
@@ -5497,6 +5507,33 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+static void update_domain_extra_info(struct sched_domain *sd)
+{
+	while (sd) {
+		int i = 0, j = 0, first, min = INT_MAX;
+		struct sched_group *group;
+
+		group = sd->groups;
+		first = group_first_cpu(group);
+		do {
+			int k = group_first_cpu(group);
+			i += 1;
+			if (k < first)
+				j += 1;
+			if (k < min) {
+				sd->first_group = group;
+				min = k;
+			}
+		} while (group = group->next, group != sd->groups);
+
+		sd->total_groups = i;
+		sd->group_number = j;
+		sd = sd->parent;
+	}
+}
+#endif
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5545,6 +5582,10 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	destroy_sched_domains(tmp, cpu);
 
 	update_top_cache_domain(cpu);
+
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	update_domain_extra_info(sd);
+#endif
 }
 
 /* cpus with isolated domains */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 07/12 v2] CPU ConCurrency API for Workload Consolidation
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (5 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 06/12 v2] Attach CPU topology to specify each sched_domain's workload consolidation Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 08/12 v2] Intercept wakeup/fork/exec load balancing Yuyang Du
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

Currently, CC is per CPU. To consolidate, the formula is based on a heuristic.
Suppose we have 2 CPUs, their task concurrency over time is ('-' means no
task, 'x' having tasks):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be: CC' = CC[0] +
CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between case 1 and 2 in terms of how xxx overlaps, the CC should be between
CC' and CC''. So, we uniformly use this condition for consolidation (suppose
we consolidate m CPUs to n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n *
consolidating_coefficient

The consolidating_coefficient could be like 100% or more or less.

TODO: need sched statistics and maybe more tunables

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/concurrency.c |  562 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h       |   13 +
 2 files changed, 575 insertions(+)

diff --git a/kernel/sched/concurrency.c b/kernel/sched/concurrency.c
index da26dd7..21e5631 100644
--- a/kernel/sched/concurrency.c
+++ b/kernel/sched/concurrency.c
@@ -28,6 +28,25 @@ unsigned int sysctl_concurrency_decay_rate = 1UL;
  */
 static unsigned int cc_contrib_period = 10UL;
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+/*
+ * whether we use concurrency to select cpu to run
+ * the woken up task
+ */
+static unsigned int wc_wakeup = 1UL;
+
+/*
+ * concurrency lower than percentage of this number
+ * is capable of running wakee
+ */
+static unsigned int wc_wakeup_threshold = 80UL;
+
+/*
+ * aggressively push the task even it is hot
+ */
+static unsigned int wc_push_hot_task = 1UL;
+#endif
+
 /*
  * the concurrency is scaled up for decaying,
  * thus, concurrency 1 is effectively 2^cc_resolution (1024),
@@ -343,6 +362,9 @@ void init_cpu_concurrency(struct rq *rq)
 	rq->concurrency.nr_running = 0;
 	rq->concurrency.sum_timestamp = ULLONG_MAX;
 	rq->concurrency.contrib_timestamp = ULLONG_MAX;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	rq->concurrency.unload = 0;
+#endif
 }
 
 /*
@@ -364,3 +386,543 @@ void update_cpu_concurrency(struct rq *rq)
 }
 
 #endif
+
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+/*
+ * whether cpu is capable of having more concurrency
+ */
+static int cpu_cc_capable(int cpu)
+{
+	u64 sum = cpu_rq(cpu)->concurrency.sum_now;
+	u64 threshold = cc_weight(1);
+
+	sum *= 100;
+	sum *= cpu_rq(cpu)->cpu_power;
+
+	threshold *= wc_wakeup_threshold;
+	threshold <<= SCHED_POWER_SHIFT;
+
+	if (sum <= threshold)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * we do not select idle, if the cc of the
+ * wakee and waker (in this order) is capable
+ * of handling the wakee task
+ */
+int workload_consolidation_wakeup(int prev, int target)
+{
+	if (!wc_wakeup) {
+		if (idle_cpu(target))
+			return target;
+
+		return nr_cpu_ids;
+	}
+
+	if (idle_cpu(prev) || cpu_cc_capable(prev))
+		return prev;
+
+	if (prev != target && (idle_cpu(target) || cpu_cc_capable(target)))
+		return target;
+
+	return nr_cpu_ids;
+}
+
+static inline u64 sched_group_cc(struct sched_group *sg)
+{
+	u64 sg_cc = 0;
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(sg))
+		sg_cc += cpu_rq(i)->concurrency.sum_now *
+			cpu_rq(i)->cpu_power;
+
+	return sg_cc;
+}
+
+static inline u64 sched_domain_cc(struct sched_domain *sd)
+{
+	struct sched_group *sg = sd->groups;
+	u64 sd_cc = 0;
+
+	do {
+		sd_cc += sched_group_cc(sg);
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	return sd_cc;
+}
+
+static inline struct sched_group *
+find_lowest_cc_group(struct sched_group *sg, int span)
+{
+	u64 grp_cc, min = ULLONG_MAX;
+	struct sched_group *lowest = NULL;
+	int i;
+
+	for (i = 0; i < span; ++i) {
+		grp_cc = sched_group_cc(sg);
+
+		if (grp_cc < min) {
+			min = grp_cc;
+			lowest = sg;
+		}
+
+		sg = sg->next;
+	}
+
+	return lowest;
+}
+
+static inline u64 __calc_cc_thr(int cpus, unsigned int asym_cc)
+{
+	u64 thr = cpus;
+
+	thr *= cc_weight(1);
+	thr *= asym_cc;
+	thr <<= SCHED_POWER_SHIFT;
+
+	return thr;
+}
+
+/*
+ * can @src_cc of @src_nr cpus be consolidated
+ * to @dst_cc of @dst_nr cpus
+ */
+static inline int
+__can_consolidate_cc(u64 src_cc, int src_nr, u64 dst_cc, int dst_nr)
+{
+	dst_cc *= dst_nr;
+	src_nr -= dst_nr;
+
+	if (unlikely(src_nr <= 0))
+		return 0;
+
+	src_nr = ilog2(src_nr);
+	src_nr += dst_nr;
+	src_cc *= src_nr;
+
+	if (src_cc > dst_cc)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * find the group for asymmetric concurrency
+ * problem to address: traverse sd from top to down
+ */
+struct sched_group *
+workload_consolidation_find_group(struct sched_domain *sd,
+	struct task_struct *p, int this_cpu)
+{
+	int half, sg_weight, ns_half = 0;
+	struct sched_group *sg;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int allowed = 0, i;
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->consolidating_coeff);
+
+		/*
+		 * we did not consider the added cc by this
+		 * wakeup (mostly from fork/exec)
+		 */
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			break;
+
+		sg = sd->first_group;
+		for (i = 0; i < half; ++i) {
+			/* if it has no cpus allowed */
+			if (!cpumask_intersects(sched_group_cpus(sg),
+					tsk_cpus_allowed(p)))
+				continue;
+
+			allowed = 1;
+			sg = sg->next;
+		}
+
+		if (!allowed)
+			break;
+
+		ns_half = half;
+		half /= 2;
+	}
+
+	if (!ns_half)
+		return NULL;
+
+	if (ns_half == 1)
+		return sd->first_group;
+
+	return find_lowest_cc_group(sd->first_group, ns_half);
+}
+
+/*
+ * top_flag_domain - return top sched_domain containing flag.
+ * @cpu:	the cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	the flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * returns the highest sched_domain of a cpu which contains the given flag.
+ * different from highest_flag_domain in that along the domain upward chain
+ * domain may or may not contain the flag.
+ */
+static inline struct sched_domain *top_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			continue;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
+/*
+ * workload_consolidation_cpu_shielded - return whether @cpu is shielded or not
+ *
+ * traverse downward the sched_domain tree when the sched_domain contains
+ * flag SD_WORKLOAD_CONSOLIDATION, each sd may have more than two groups, but
+ * we assume 1) every sched_group has the same weight, 2) every CPU has
+ * the same computing power
+ */
+int workload_consolidation_cpu_shielded(int cpu)
+{
+	struct sched_domain *sd;
+
+	sd = top_flag_domain(cpu, SD_WORKLOAD_CONSOLIDATION);
+
+	while (sd) {
+		int half, sg_weight, this_sg_nr;
+		u64 sd_cc;
+
+		if (!(sd->flags & SD_WORKLOAD_CONSOLIDATION)) {
+			sd = sd->child;
+			continue;
+		}
+
+		half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+		sg_weight = sd->groups->group_weight;
+		this_sg_nr = sd->group_number;
+
+		sd_cc = sched_domain_cc(sd);
+		sd_cc *= 100;
+
+		while (half) {
+			int cpus = sg_weight * half;
+			u64 threshold = __calc_cc_thr(cpus,
+				sd->consolidating_coeff);
+
+			if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+				threshold, cpus))
+				return 0;
+
+			if (this_sg_nr >= half)
+				return 1;
+
+			half /= 2;
+		}
+
+		sd = sd->child;
+	}
+
+	return 0;
+}
+
+/*
+ * as of now, we have the following assumption
+ * 1) every sched_group has the same weight
+ * 2) every CPU has the same computing power
+ */
+static inline int __nonshielded_groups(struct sched_domain *sd)
+{
+	int half, sg_weight, ret = 0;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->consolidating_coeff);
+
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			return ret;
+
+		ret = half;
+		half /= 2;
+	}
+
+	return ret;
+}
+
+static DEFINE_PER_CPU(struct cpumask, nonshielded_cpumask);
+
+/*
+ * workload_consolidation_nonshielded_mask - return the nonshielded cpus in the @mask,
+ * which is unmasked by the shielded cpus
+ *
+ * traverse downward the sched_domain tree when the sched_domain contains
+ * flag SD_WORKLOAD_CONSOLIDATION, each sd may have more than two groups
+ */
+void workload_consolidation_nonshielded_mask(int cpu, struct cpumask *mask)
+{
+	struct sched_domain *sd;
+	struct cpumask *pcpu_mask = &per_cpu(nonshielded_cpumask, cpu);
+	int i;
+
+	sd = top_flag_domain(cpu, SD_WORKLOAD_CONSOLIDATION);
+
+	if (!sd)
+		return;
+
+	while (sd) {
+		struct sched_group *sg;
+		int this_sg_nr, ns_half;
+
+		if (!(sd->flags & SD_WORKLOAD_CONSOLIDATION)) {
+			sd = sd->child;
+			continue;
+		}
+
+		ns_half = __nonshielded_groups(sd);
+
+		if (!ns_half)
+			break;
+
+		cpumask_clear(pcpu_mask);
+		sg = sd->first_group;
+
+		for (i = 0; i < ns_half; ++i) {
+			cpumask_or(pcpu_mask, pcpu_mask,
+				sched_group_cpus(sg));
+			sg = sg->next;
+		}
+
+		cpumask_and(mask, mask, pcpu_mask);
+
+		this_sg_nr = sd->group_number;
+		if (this_sg_nr)
+			break;
+
+		sd = sd->child;
+	}
+}
+
+static int cpu_task_hot(struct task_struct *p, u64 now)
+{
+	s64 delta;
+
+	if (p->sched_class != &fair_sched_class)
+		return 0;
+
+	if (unlikely(p->policy == SCHED_IDLE))
+		return 0;
+
+	if (sysctl_sched_migration_cost == -1)
+		return 1;
+
+	if (sysctl_sched_migration_cost == 0)
+		return 0;
+
+	if (wc_push_hot_task)
+		return 0;
+
+	/*
+	 * buddy candidates are cache hot:
+	 */
+	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
+			(&p->se == p->se.cfs_rq->next ||
+			 &p->se == p->se.cfs_rq->last)) {
+		return 1;
+	}
+
+	delta = now - p->se.exec_start;
+
+	if (delta < (s64)sysctl_sched_migration_cost)
+		return 1;
+
+	return 0;
+}
+
+static int
+cpu_move_task(struct task_struct *p, struct rq *src_rq, struct rq *dst_rq)
+{
+	/*
+	 * we do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 3) are cache-hot on their current CPU.
+	 */
+	if (!cpumask_test_cpu(dst_rq->cpu, tsk_cpus_allowed(p)))
+		return 0;
+
+	if (task_running(src_rq, p))
+		return 0;
+
+	/*
+	 * aggressive migration if task is cache cold
+	 */
+	if (!cpu_task_hot(p, src_rq->clock_task)) {
+		/*
+		 * move a task
+		 */
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_rq->cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ * __unload_cpu_work is run by src cpu stopper, which pushes running
+ * tasks off src cpu onto dst cpu
+ */
+static int __unload_cpu_work(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	struct cpu_concurrency_t *cc = &src_rq->concurrency;
+	struct rq *dst_rq = cpu_rq(cc->dst_cpu);
+
+	struct list_head *tasks = &src_rq->cfs_tasks;
+	struct task_struct *p, *n;
+	int pushed = 0;
+	int nr_migrate_break = 1;
+
+	raw_spin_lock_irq(&src_rq->lock);
+
+	/* make sure the requested cpu hasn't gone down in the meantime */
+	if (unlikely(src_cpu != smp_processor_id() || !cc->unload))
+		goto out_unlock;
+
+	/* Is there any task to move? */
+	if (src_rq->nr_running <= 1)
+		goto out_unlock;
+
+	double_lock_balance(src_rq, dst_rq);
+
+	list_for_each_entry_safe(p, n, tasks, se.group_node) {
+
+		if (!cpu_move_task(p, src_rq, dst_rq))
+			continue;
+
+		pushed++;
+
+		if (pushed >= nr_migrate_break)
+			break;
+	}
+
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	cc->unload = 0;
+	raw_spin_unlock_irq(&src_rq->lock);
+
+	return 0;
+}
+
+/*
+ * unload src_cpu to dst_cpu
+ */
+static void unload_cpu(int src_cpu, int dst_cpu)
+{
+	unsigned long flags;
+	struct rq *src_rq = cpu_rq(src_cpu);
+	struct cpu_concurrency_t *cc = &src_rq->concurrency;
+	int unload = 0;
+
+	raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+	if (!cc->unload) {
+		cc->unload = 1;
+		cc->dst_cpu = dst_cpu;
+		unload = 1;
+	}
+
+	raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+
+	if (unload)
+		stop_one_cpu_nowait(src_cpu, __unload_cpu_work, src_rq,
+			&cc->unload_work);
+}
+
+static inline int find_lowest_cc_cpu(struct cpumask *mask)
+{
+	u64 cpu_cc, min = ULLONG_MAX;
+	int i, lowest = nr_cpu_ids;
+	struct rq *rq;
+
+	for_each_cpu(i, mask) {
+		rq = cpu_rq(i);
+		cpu_cc = rq->concurrency.sum_now * rq->cpu_power;
+
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			lowest = i;
+		}
+	}
+
+	return lowest;
+}
+
+/*
+ * find the lowest cc cpu in shielded and nonshielded cpus,
+ * aggressively unload the shielded to the nonshielded
+ */
+void workload_consolidation_unload(struct cpumask *nonshielded)
+{
+	int src_cpu = nr_cpu_ids, dst_cpu, i;
+	u64 cpu_cc, min = ULLONG_MAX;
+	struct rq *rq;
+
+	for_each_cpu_not(i, nonshielded) {
+		if (i >= nr_cpu_ids)
+			break;
+
+		rq = cpu_rq(i);
+		if (rq->nr_running <= 0)
+			continue;
+
+		cpu_cc = rq->concurrency.sum_now * rq->cpu_power;
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			src_cpu = i;
+		}
+	}
+
+	if (src_cpu >= nr_cpu_ids)
+		return;
+
+	dst_cpu = find_lowest_cc_cpu(nonshielded);
+	if (dst_cpu >= nr_cpu_ids)
+		return;
+
+	if (src_cpu != dst_cpu)
+		unload_cpu(src_cpu, dst_cpu);
+}
+
+#endif /* CONFIG_WORKLOAD_CONSOLIDATION */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a4043ed..27787ad 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -516,6 +516,11 @@ struct cpu_concurrency_t {
 	u64 sum_timestamp;
 	u64 contrib_timestamp;
 	unsigned int nr_running;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	int unload;
+	int dst_cpu;
+	struct cpu_stop_work unload_work;
+#endif
 };
 #endif
 
@@ -1221,6 +1226,14 @@ extern void resched_cpu(int cpu);
 #ifdef CONFIG_CPU_CONCURRENCY
 extern void init_cpu_concurrency(struct rq *rq);
 extern void update_cpu_concurrency(struct rq *rq);
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+extern int workload_consolidation_wakeup(int prev, int target);
+extern struct sched_group *workload_consolidation_find_group(
+			struct sched_domain *sd, struct task_struct *p, int this_cpu);
+extern void workload_consolidation_unload(struct cpumask *nonshielded);
+extern int workload_consolidation_cpu_shielded(int cpu);
+extern void workload_consolidation_nonshielded_mask(int cpu, struct cpumask *mask);
+#endif
 #else
 static inline void init_cpu_concurrency(struct rq *rq) {}
 static inline void update_cpu_concurrency(struct rq *rq) {}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 08/12 v2] Intercept wakeup/fork/exec load balancing
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (6 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 07/12 v2] CPU ConCurrency API for Workload Consolidation Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 09/12 v2] Intercept idle balancing Yuyang Du
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In wakeup load balaning, we do not select idle, if the CC of the wakee
and waker (in this order if SD_WAKE_AFFINE) is capable of handling the
wakee task. And in fork/exec load balancing when finding the sched_group,
we find the consolidated group.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7153ff..c7a6347 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4365,9 +4365,16 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	int ret;
 
+	ret = workload_consolidation_wakeup(i, target);
+	if (ret < nr_cpu_ids)
+		return ret;
+#else
 	if (idle_cpu(target))
 		return target;
+#endif
 
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
@@ -4460,7 +4467,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	while (sd) {
-		struct sched_group *group;
+		struct sched_group *group = NULL;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -4468,6 +4475,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+		if (sd->flags & SD_WORKLOAD_CONSOLIDATION)
+			group = workload_consolidation_find_group(sd, p, cpu);
+
+		if (!group)
+#endif
 		group = find_idlest_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 09/12 v2] Intercept idle balancing
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (7 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 08/12 v2] Intercept wakeup/fork/exec load balancing Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:16 ` [RFC PATCH 10/12 v2] Intercept periodic nohz " Yuyang Du
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In idle balancing, we do two things:

1) Skip pulling task to the idle non-consolidated CPUs.

2) In addition, for consolidated Idle CPU, we aggressively pull tasks from
non-consolidated CPUs.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7a6347..94c7a6a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6655,6 +6655,10 @@ out:
 	return ld_moved;
 }
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
+#endif
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -6666,6 +6670,9 @@ static int idle_balance(struct rq *this_rq)
 	unsigned long next_balance = jiffies + HZ;
 	u64 curr_cost = 0;
 	int this_cpu = this_rq->cpu;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct cpumask *nonshielded = __get_cpu_var(local_cpu_mask);
+#endif
 
 	idle_enter_fair(this_rq);
 	/*
@@ -6684,6 +6691,19 @@ static int idle_balance(struct rq *this_rq)
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
+
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	cpumask_copy(nonshielded, cpu_active_mask);
+
+	/*
+	 * if we encounter shielded cpus here, don't do balance on them
+	 */
+	workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
+	if (!cpumask_test_cpu(this_cpu, nonshielded))
+		goto unlock;
+	workload_consolidation_unload(nonshielded);
+#endif
+
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int continue_balancing = 1;
@@ -6716,6 +6736,9 @@ static int idle_balance(struct rq *this_rq)
 		if (pulled_task)
 			break;
 	}
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+unlock:
+#endif
 	rcu_read_unlock();
 
 	raw_spin_lock(&this_rq->lock);
@@ -7709,6 +7732,14 @@ void print_cfs_stats(struct seq_file *m, int cpu)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	unsigned int i;
+	for_each_possible_cpu(i) {
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
+					GFP_KERNEL, cpu_to_node(i));
+	}
+#endif
+
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
 #ifdef CONFIG_NO_HZ_COMMON
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 10/12 v2] Intercept periodic nohz idle balancing
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (8 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 09/12 v2] Intercept idle balancing Yuyang Du
@ 2014-05-11 18:16 ` Yuyang Du
  2014-05-11 18:17 ` [RFC PATCH 11/12 v2] Intercept periodic load balancing Yuyang Du
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:16 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In periodic nohz idle balance, we skip the idle but non-consolidated
CPUs from load balancing.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94c7a6a..9bb1304 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6867,10 +6867,46 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct cpumask *nonshielded = __get_cpu_var(local_cpu_mask);
+	int ilb, weight;
+	int this_cpu = smp_processor_id();
+
+	/*
+	 * Optimize for the case when we have no idle CPUs or only one
+	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
+	 */
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
+		return nr_cpu_ids;
+
+	ilb = cpumask_first(nohz.idle_cpus_mask);
+
+	if (ilb < nr_cpu_ids && idle_cpu(ilb)) {
+
+		cpumask_copy(nonshielded, nohz.idle_cpus_mask);
+
+		rcu_read_lock();
+		workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
+		rcu_read_unlock();
+
+		weight = cpumask_weight(nonshielded);
+
+		if (weight < 2)
+			return nr_cpu_ids;
+
+		/*
+		 * get idle load balancer again
+		 */
+		ilb = cpumask_first(nonshielded);
+	    if (ilb < nr_cpu_ids && idle_cpu(ilb))
+			return ilb;
+	}
+#else
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
+#endif
 
 	return nr_cpu_ids;
 }
@@ -7107,7 +7143,7 @@ out:
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
@@ -7117,7 +7153,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;
 
-	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+	for_each_cpu(balance_cpu, mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
 
@@ -7165,10 +7201,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	if (unlikely(rq->idle_balance))
 		return 0;
 
-       /*
-	* We may be recently in ticked or tickless idle mode. At the first
-	* busy tick after returning from idle, we will update the busy stats.
-	*/
+	/*
+	 * We may be recently in ticked or tickless idle mode. At the first
+	 * busy tick after returning from idle, we will update the busy stats.
+	 */
 	set_cpu_sd_state_busy();
 	nohz_balance_exit_idle(cpu);
 
@@ -7211,7 +7247,7 @@ need_kick:
 	return 1;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask) { }
 #endif
 
 /*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 11/12 v2] Intercept periodic load balancing
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (9 preceding siblings ...)
  2014-05-11 18:16 ` [RFC PATCH 10/12 v2] Intercept periodic nohz " Yuyang Du
@ 2014-05-11 18:17 ` Yuyang Du
  2014-05-11 18:17 ` [RFC PATCH 12/12 v2] Intercept RT scheduler Yuyang Du
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:17 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In periodic load balancing, we do two things:

1) Skip pulling task to the non-consolidated CPUs.

2) In addition, for consolidated Idle CPU, we aggressively pull tasks from
non-consolidated CPUs.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9bb1304..1c7a3d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7260,6 +7260,36 @@ static void run_rebalance_domains(struct softirq_action *h)
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct cpumask *nonshielded = __get_cpu_var(local_cpu_mask);
+	int this_cpu = cpu_of(this_rq);
+
+	/*
+	 * if we encounter shielded cpus here, don't do balance on them
+	 */
+	cpumask_copy(nonshielded, cpu_active_mask);
+
+	rcu_read_lock();
+	workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
+	rcu_read_unlock();
+
+	/*
+	 * aggressively unload the shielded cpus to unshielded cpus
+	 */
+	workload_consolidation_unload(nonshielded);
+
+	if (cpumask_test_cpu(this_cpu, nonshielded)) {
+		rebalance_domains(this_rq, idle);
+
+		/*
+		 * If this cpu has a pending nohz_balance_kick, then do the
+		 * balancing on behalf of the other idle cpus whose ticks are
+		 * stopped.
+		 */
+		cpumask_and(nonshielded, nonshielded, nohz.idle_cpus_mask);
+		nohz_idle_balance(this_rq, idle, nonshielded);
+	}
+#else
 	rebalance_domains(this_rq, idle);
 
 	/*
@@ -7267,7 +7297,8 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	nohz_idle_balance(this_rq, idle);
+	nohz_idle_balance(this_rq, idle, nohz.idle_cpus_mask);
+#endif
 }
 
 /*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH 12/12 v2] Intercept RT scheduler
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (10 preceding siblings ...)
  2014-05-11 18:17 ` [RFC PATCH 11/12 v2] Intercept periodic load balancing Yuyang Du
@ 2014-05-11 18:17 ` Yuyang Du
  2014-05-12  6:45 ` [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
  2014-05-13 13:23 ` Krzysztof Kozlowski
  13 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-11 18:17 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu, yuyang.du

We intercept load balancing to contain the load and load balancing in
the consolidated CPUs according to our consolidating mechanism.

In RT scheduler, we also skip pulling/selecting task to the idle
non-consolidated CPUs. This is pretty provocative.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/rt.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd2267a..f8141fb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1217,6 +1217,9 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	int do_find = 0;
+#endif
 
 	if (p->nr_cpus_allowed == 1)
 		goto out;
@@ -1230,6 +1233,11 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 	rcu_read_lock();
 	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	if (workload_consolidation_cpu_shielded(cpu))
+		do_find = 1;
+#endif
+
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
 	 * try to see if we can wake this RT task up on another
@@ -1252,9 +1260,15 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 	 * This test is optimistic, if we get it wrong the load-balancer
 	 * will have to sort it out.
 	 */
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	if (do_find || (curr && unlikely(rt_task(curr)) &&
+	    (curr->nr_cpus_allowed < 2 ||
+	     curr->prio <= p->prio))) {
+#else
 	if (curr && unlikely(rt_task(curr)) &&
 	    (curr->nr_cpus_allowed < 2 ||
 	     curr->prio <= p->prio)) {
+#endif
 		int target = find_lowest_rq(p);
 
 		if (target != -1)
@@ -1460,6 +1474,12 @@ static int find_lowest_rq(struct task_struct *task)
 	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
 		return -1; /* No targets found */
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	workload_consolidation_nonshielded_mask(this_cpu, lowest_mask);
+	if (!cpumask_weight(lowest_mask))
+		return -1;
+#endif
+
 	/*
 	 * At this point we have built a mask of cpus representing the
 	 * lowest priority tasks in the system.  Now we want to elect
@@ -1687,6 +1707,11 @@ static int pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overloaded(this_rq)))
 		return 0;
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	if (workload_consolidation_cpu_shielded(this_cpu))
+		return 0;
+#endif
+
 	/*
 	 * Match the barrier from rt_set_overloaded; this guarantees that if we
 	 * see overloaded we must also see the rto_mask bit.
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (11 preceding siblings ...)
  2014-05-11 18:17 ` [RFC PATCH 12/12 v2] Intercept RT scheduler Yuyang Du
@ 2014-05-12  6:45 ` Peter Zijlstra
  2014-05-12  1:28   ` Yuyang Du
  2014-05-13 13:23 ` Krzysztof Kozlowski
  13 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-12  6:45 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri,
	jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 158 bytes --]

On Mon, May 12, 2014 at 02:16:49AM +0800, Yuyang Du wrote:

Yes, just what we need, more patches while we haven't had the time to
look at the old set yet :-(

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-12  6:45 ` [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
@ 2014-05-12  1:28   ` Yuyang Du
  2014-05-12 15:15     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Yuyang Du @ 2014-05-12  1:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri,
	jacob.jun.pan, fengguang.wu

On Mon, May 12, 2014 at 08:45:33AM +0200, Peter Zijlstra wrote:
> On Mon, May 12, 2014 at 02:16:49AM +0800, Yuyang Du wrote:
> 
> Yes, just what we need, more patches while we haven't had the time to
> look at the old set yet :-(

No essential change, added commit messages (after smashing our heads to
figure out what general code structure is). Can we get over immediate no go to
maybe slow no go?

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-12  1:28   ` Yuyang Du
@ 2014-05-12 15:15     ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-12 15:15 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri,
	jacob.jun.pan, fengguang.wu

[-- Attachment #1: Type: text/plain, Size: 787 bytes --]

On Mon, May 12, 2014 at 09:28:57AM +0800, Yuyang Du wrote:
> On Mon, May 12, 2014 at 08:45:33AM +0200, Peter Zijlstra wrote:
> > On Mon, May 12, 2014 at 02:16:49AM +0800, Yuyang Du wrote:
> > 
> > Yes, just what we need, more patches while we haven't had the time to
> > look at the old set yet :-(
> 
> No essential change, added commit messages (after smashing our heads to
> figure out what general code structure is). Can we get over immediate no go to
> maybe slow no go?

Well, sane feedback means I first have to read and digest wth you're
doing, this means at least a days time reading through your patches.
Finding a spare day is hard when people keep sending me bugs to chase
down.

It'll happen, eventually, but you'll have to wait, just like everyone
else.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (12 preceding siblings ...)
  2014-05-12  6:45 ` [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
@ 2014-05-13 13:23 ` Krzysztof Kozlowski
  13 siblings, 0 replies; 17+ messages in thread
From: Krzysztof Kozlowski @ 2014-05-13 13:23 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm,
	arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, jacob.jun.pan, fengguang.wu

On pon, 2014-05-12 at 02:16 +0800, Yuyang Du wrote:
> Hi Ingo, PeterZ, Rafael, and others,
> 
> The current schedulerâ€™s load balancing is completely work-conserving. In some
> workload, generally low CPU utilization but immersed with CPU bursts of
> transient tasks, migrating task to engage all available CPUs for
> work-conserving can lead to significant overhead: cache locality loss,
> idle/active HW state transitional latency and power, shallower idle state,
> etc, which are both power and performance inefficient especially for todayâ€™s
> low power processors in mobile. 
> 
> This RFC introduces a sense of idleness-conserving into work-conserving (by
> all means, we really donâ€™t want to be overwhelming in only one way). But to
> what extent the idleness-conserving should be, bearing in mind that we donâ€™t
> want to sacrifice performance? We first need a load/idleness indicator to that
> end.
> 
> Thanks to CFSâ€™s â€œmodel an ideal, precise multi-tasking CPUâ€, tasks can be seen
> as concurrently running (the tasks in the runqueue). So it is natural to use
> task concurrency as load indicator. Having said that, we do two things:
> 
> 1) Divide continuous time into periods of time, and average task concurrency
> in period, for tolerating the transient bursts:
> a = sum(concurrency * time) / period
> 2) Exponentially decay past periods, and synthesize them all, for hysteresis
> to load drops or resilience to load rises (let f be decaying factor, and a_x
> the xth period average since period 0):
> s = a_n + f^1 * a_n-1 + f^2 * a_n-2 +, ..., + f^(n-1) * a_1 + f^n * a_0
> 
> We name this load indicator as CPU ConCurrency (CC): task concurrency
> determines how many CPUs are needed to be running concurrently.
> 
> Another two ways of how to interpret CC:
> 
> 1) the current work-conserving load balance also uses CC, but instantaneous
> CC.
> 
> 2) CC vs. CPU utilization. CC is runqueue-length-weighted CPU utilization. If
> we change: "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
> period". Then a' is just about the CPU utilization. And the way we weight
> runqueue-length is the simplest one (excluding the exponential decays, and you
> may have other ways).
> 
> To track CC, we intercept the scheduler in 1) enqueue, 2) dequeue, 3)
> scheduler tick, and 4) enter/exit idle.
> 
> After CC, in the consolidation part, we do 1) attach the CPU topology to be
> adaptive beyond our experimental platforms, and 2) intercept the current load
> balance for load and load balancing containment.
> 
> Currently, CC is per CPU. To consolidate, the formula is based on a heuristic.
> Suppose we have 2 CPUs, their task concurrency over time is ('-' means no
> task, 'x' having tasks):
> 
> 1)
> CPU0: ---xxxx---------- (CC[0])
> CPU1: ---------xxxx---- (CC[1])
> 
> 2)
> CPU0: ---xxxx---------- (CC[0])
> CPU1: ---xxxx---------- (CC[1])
> 
> If we consolidate CPU0 and CPU1, the consolidated CC will be: CC' = CC[0] +
> CC[1] for case 1 and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
> between case 1 and 2 in terms of how xxx overlaps, the CC should be between
> CC' and CC''. So, we uniformly use this condition for consolidation (suppose
> we consolidate m CPUs to n CPUs, m > n):
> 
> (CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) >=<? (1 * n) * n *
> consolidating_coefficient
> 
> The consolidating_coefficient could be like 100% or more or less.
> 
> By CC, we implemented a Workload Consolidation patch on two Intel mobile
> platforms (a quad-core composed of two dual-core modules): contain load and
> load balancing in the first dual-core when aggregated CC low, and if not in
> the full quad-core. Results show that we got power savings and no substantial
> performance regression (even gains for some). The workloads we used to
> evaluate the Workload Consolidation include 1) 50+ perf/ux benchmarks (almost
> all of the magazine ones), and 2) ~10 power workloads, of course, they are the
> easiest ones, such as browsing, audio, video, recording, imaging, etc. The
> current half-life is 1 period, and the period was 32ms, and now 64ms for more
> aggressive consolidation.

Hi,

Could you share some more numbers for energy savings and impact on
performance? I am also interested in these 10 power workloads - what
they are exactly?

Best regards,
Krzysztof




> v2:
> - Data type defined in formation
> 
> Yuyang Du (12):
>   CONFIG for CPU ConCurrency
>   Init CPU ConCurrency
>   CPU ConCurrency calculation
>   CPU ConCurrency tracking
>   CONFIG for Workload Consolidation
>   Attach CPU topology to specify each sched_domain's workload
>     consolidation
>   CPU ConCurrency API for Workload Consolidation
>   Intercept wakeup/fork/exec load balancing
>   Intercept idle balancing
>   Intercept periodic nohz idle balancing
>   Intercept periodic load balancing
>   Intercept RT scheduler
> 
>  arch/x86/Kconfig             |   21 +
>  include/linux/sched.h        |   13 +
>  include/linux/sched/sysctl.h |    8 +
>  include/linux/topology.h     |   16 +
>  kernel/sched/Makefile        |    1 +
>  kernel/sched/concurrency.c   |  928 ++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c          |   46 +++
>  kernel/sched/fair.c          |  131 +++++-
>  kernel/sched/rt.c            |   25 ++
>  kernel/sched/sched.h         |   36 ++
>  kernel/sysctl.c              |   16 +
>  11 files changed, 1232 insertions(+), 9 deletions(-)
>  create mode 100644 kernel/sched/concurrency.c
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-05-13 13:23 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-11 18:16 [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 01/12 v2] CONFIG for " Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 02/12 v2] Init " Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 03/12 v2] CPU ConCurrency calculation Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 04/12 v2] CPU ConCurrency tracking Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 05/12 v2] CONFIG for Workload Consolidation Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 06/12 v2] Attach CPU topology to specify each sched_domain's workload consolidation Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 07/12 v2] CPU ConCurrency API for Workload Consolidation Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 08/12 v2] Intercept wakeup/fork/exec load balancing Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 09/12 v2] Intercept idle balancing Yuyang Du
2014-05-11 18:16 ` [RFC PATCH 10/12 v2] Intercept periodic nohz " Yuyang Du
2014-05-11 18:17 ` [RFC PATCH 11/12 v2] Intercept periodic load balancing Yuyang Du
2014-05-11 18:17 ` [RFC PATCH 12/12 v2] Intercept RT scheduler Yuyang Du
2014-05-12  6:45 ` [RFC PATCH 00/12 v2] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
2014-05-12  1:28   ` Yuyang Du
2014-05-12 15:15     ` Peter Zijlstra
2014-05-13 13:23 ` Krzysztof Kozlowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).