linux-pm.vger.kernel.org archive mirror
* [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
@ 2014-05-05  0:02 Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 01/12 v1] CONFIG for " Yuyang Du
                   ` (12 more replies)
  0 siblings, 13 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Hi Ingo, PeterZ, Rafael, and others,

The current scheduler’s load balancing is completely work-conserving. For some
workloads, with generally low CPU utilization but interspersed with bursts of
transient tasks, migrating tasks to engage all available CPUs for
work-conserving balance can incur significant overhead: loss of cache locality,
idle/active HW state transition latency and power, shallower idle states, etc.
This is inefficient in both power and performance, especially for today’s
low-power processors in mobile.

This RFC introduces a degree of idleness-conserving into the work-conserving
balance (by all means, we do not want to go to either extreme). But to what
extent should idleness be conserved, bearing in mind that we don’t want to
sacrifice performance? We first need a load/idleness indicator to that end.

Thanks to CFS’s goal to “model an ideal, precise multi-tasking CPU”, the tasks
in the runqueue can be seen as running concurrently. So it is natural to use
task concurrency as the load indicator. To that end, we do two things:

1) Divide continuous time into periods, and average the task concurrency
within each period, to tolerate transient bursts:
a = sum(concurrency * time) / period
2) Exponentially decay past periods and sum them all, for hysteresis against
load drops and resilience against load rises (let f be the decay factor, and
a_x the average of the xth period since period 0):
s = a_n + f^1 * a_(n-1) + f^2 * a_(n-2) + ... + f^(n-1) * a_1 + f^n * a_0
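
(Illustration only, not part of the patches: the decayed sum can be maintained
incrementally as s_n = a_n + f * s_(n-1). A minimal userspace C sketch with a
32-bit fixed-point decay factor, in the spirit of the cc_decay_factor tables in
patch 03; the factor and the per-period averages below are made-up numbers.)

#include <stdint.h>
#include <stdio.h>

/* s_n = a_n + f * s_(n-1); f is a fraction scaled by 2^32 */
static uint64_t decay_add(uint64_t s, uint64_t a, uint32_t f)
{
	return a + ((s * f) >> 32);
}

int main(void)
{
	uint32_t f = 0xB504F333;			/* ~1/sqrt(2): half-life of 2 periods */
	uint64_t a[] = { 1024, 2048, 1024, 0, 0 };	/* per-period averages, scaled by 1024 */
	uint64_t s = 0;
	unsigned int i;

	for (i = 0; i < 5; i++) {
		s = decay_add(s, a[i], f);
		printf("period %u: s = %llu\n", i, (unsigned long long)s);
	}
	return 0;
}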

We name this load indicator CPU ConCurrency (CC): task concurrency determines
how many CPUs need to run concurrently.

Two more ways to interpret CC:

1) The current work-conserving load balance also uses CC, but instantaneous
CC.

2) CC vs. CPU utilization: CC is runqueue-length-weighted CPU utilization. If
we change "a = sum(concurrency * time) / period" to "a' = sum(1 * time) /
period", then a' is just the CPU utilization. And the way we weight by
runqueue length is the simplest one (excluding the exponential decay; other
weightings are possible).
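
As a worked example (with made-up numbers): if during one period a single task
runs for half of the period and three tasks run for the other half, then CPU
utilization is a' = 0.5 + 0.5 = 1 (100% busy), while CC is
a = 1 * 0.5 + 3 * 0.5 = 2, reflecting that two CPUs’ worth of concurrency was
requested on average.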

To track CC, we hook into the scheduler at 1) enqueue, 2) dequeue, 3)
scheduler tick, and 4) idle enter/exit.

On top of CC, in the consolidation part, we 1) attach to the CPU topology so
as to be adaptive beyond our experimental platforms, and 2) intercept the
current load balance to contain load and load balancing.

Currently, CC is per CPU. The consolidation condition is based on a heuristic.
Suppose we have 2 CPUs, whose task concurrency over time looks like this ('-'
means no task, 'x' means tasks running):

1)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---------xxxx---- (CC[1])

2)
CPU0: ---xxxx---------- (CC[0])
CPU1: ---xxxx---------- (CC[1])

If we consolidate CPU0 and CPU1, the consolidated CC will be CC' = CC[0] +
CC[1] for case 1, and CC'' = (CC[0] + CC[1]) * 2 for case 2. For the cases in
between, depending on how the busy intervals overlap, the consolidated CC
falls between CC' and CC''. So we uniformly use this condition for
consolidation (suppose we consolidate m CPUs onto n CPUs, m > n):

(CC[0] + CC[1] + ... + CC[m-2] + CC[m-1]) * (n + log(m-n)) <=? (1 * n) * n *
consolidate_coefficient

The consolidate_coefficient could be around 100%, or more or less.
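
As a concrete (made-up) instance of the check: with m = 4 CPUs, n = 2 and a
consolidate_coefficient of 100%, the left side is
(aggregate CC) * (2 + log2(2)) = 3 * aggregate CC and the right side is
1 * 2 * 2 = 4, so the workload is consolidated onto the first two CPUs whenever
the aggregate CC stays below roughly 4/3 of one fully busy CPU. (The actual
check, with CPU power and the coefficient factored in, is
__can_consolidate_cc() in patch 07.)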

Using CC, we implemented a Workload Consolidation patch on two Intel mobile
platforms (a quad-core composed of two dual-core modules): load and load
balancing are contained in the first dual-core when the aggregated CC is low,
and otherwise spread over the full quad-core. Results show power savings and no
substantial performance regression (even gains for some workloads). The
workloads we used to evaluate Workload Consolidation include 1) 50+ perf/ux
benchmarks (almost all of the magazine ones), and 2) ~10 power workloads; these
are admittedly the easiest ones, such as browsing, audio, video, recording,
imaging, etc. The current half-life is 1 period; the period was 32ms, and is
now 64ms for more aggressive consolidation.
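
(In the patches, these correspond to the defaults sysctl_concurrency_decay_rate
= 1 and sysctl_concurrency_sum_period = 26, i.e. a period of 2^26 ns.)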

Yuyang Du (12):
  CONFIG for CPU ConCurrency
  Init for CPU ConCurrency
  CPU ConCurrency calculation
  CPU ConCurrency collecting in:
  CONFIG for Workload Consolidation
  Attach CPU topology
  CPU ConCurrency API for Workload Consolidation
  Intercept wakeup/fork/exec balance
  Intercept idle balance
  Intercept periodic nohz idle balance
  Intercept periodic load balance
  Intercept RT provocatively

 arch/x86/Kconfig             |   21 +
 include/linux/sched.h        |   13 +
 include/linux/sched/sysctl.h |    8 +
 include/linux/topology.h     |   16 +
 kernel/sched/Makefile        |    1 +
 kernel/sched/concurrency.c   |  928 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |   49 +++
 kernel/sched/fair.c          |  131 +++++-
 kernel/sched/rt.c            |   25 ++
 kernel/sched/sched.h         |   36 ++
 kernel/sysctl.c              |   16 +
 11 files changed, 1235 insertions(+), 9 deletions(-)
 create mode 100644 kernel/sched/concurrency.c

-- 
1.7.9.5


* [RFC PATCH 01/12 v1] CONFIG for CPU ConCurrency
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 02/12 v1] Init " Yuyang Du
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 arch/x86/Kconfig |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..9bfac8d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -797,6 +797,17 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config CPU_CONCURRENCY
+	bool "CPU ConCurrency (CC)"
+	default n
+	depends on SMP
+	---help---
+	  CPU ConCurrency (CC) is a new CPU load metric that measures the CPU
+	  load by averaging the number of running tasks. Using CC, the scheduler
+	  can evaluate the load of CPUs and may consolidate workloads on CPUs in
+	  load balancing for power efficiency without sacrificing performance.
+	  If unsure say N here.
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
-- 
1.7.9.5


* [RFC PATCH 02/12 v1] Init for CPU ConCurrency
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 01/12 v1] CONFIG for " Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 03/12 v1] CPU ConCurrency calculation Yuyang Du
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/Makefile      |    1 +
 kernel/sched/concurrency.c |   22 ++++++++++++++++++++++
 kernel/sched/core.c        |    5 +++++
 kernel/sched/sched.h       |   21 +++++++++++++++++++++
 4 files changed, 49 insertions(+)
 create mode 100644 kernel/sched/concurrency.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..e67f7e3 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_CONCURRENCY) += concurrency.o
diff --git a/kernel/sched/concurrency.c b/kernel/sched/concurrency.c
new file mode 100644
index 0000000..195bcf5
--- /dev/null
+++ b/kernel/sched/concurrency.c
@@ -0,0 +1,22 @@
+/*
+ * CPU ConCurrency (CC) measures the CPU load by averaging
+ * the number of running tasks. Using CC, the scheduler can
+ * evaluate the load of CPUs to improve load balance for power
+ * efficiency without sacrificing performance.
+ *
+ */
+
+#ifdef CONFIG_CPU_CONCURRENCY
+
+#include "sched.h"
+
+void init_cpu_concurrency(struct rq *rq)
+{
+	rq->concurrency.sum = 0;
+	rq->concurrency.sum_now = 0;
+	rq->concurrency.contrib = 0;
+	rq->concurrency.nr_running = 0;
+	rq->concurrency.sum_timestamp = ULLONG_MAX;
+	rq->concurrency.contrib_timestamp = ULLONG_MAX;
+}
+#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 268a45e..ff990f9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6884,6 +6884,11 @@ void __init sched_init(void)
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
+
+		/*
+		 * cpu concurrency init
+		 */
+		init_cpu_concurrency(rq);
 	}
 
 	set_load_weight(&init_task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..0cf9bf0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -508,6 +508,17 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_CPU_CONCURRENCY
+struct cpu_concurrency_t {
+	u64 sum;
+	u64 sum_now;
+	u64 contrib;
+	u64 sum_timestamp;
+	u64 contrib_timestamp;
+	unsigned long nr_running;
+};
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -643,6 +654,10 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+#ifdef CONFIG_CPU_CONCURRENCY
+	struct cpu_concurrency_t concurrency;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
@@ -1203,6 +1218,12 @@ extern void init_sched_dl_class(void);
 extern void resched_task(struct task_struct *p);
 extern void resched_cpu(int cpu);
 
+#ifdef CONFIG_CPU_CONCURRENCY
+extern void init_cpu_concurrency(struct rq *rq);
+#else
+static inline void init_cpu_concurrency(struct rq *rq) {}
+#endif
+
 extern struct rt_bandwidth def_rt_bandwidth;
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
 
-- 
1.7.9.5



* [RFC PATCH 03/12 v1] CPU ConCurrency calculation
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 01/12 v1] CONFIG for " Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 02/12 v1] Init " Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 04/12 v1] CPU ConCurrency collecting in: Yuyang Du
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched/sysctl.h |    8 +
 kernel/sched/concurrency.c   |  344 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |    2 +
 kernel/sysctl.c              |   16 ++
 4 files changed, 370 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 8045a55..15075c2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -36,6 +36,14 @@ extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
 
+#ifdef CONFIG_CPU_CONCURRENCY
+extern unsigned long sysctl_concurrency_sum_period;
+extern unsigned long sysctl_concurrency_decay_rate;
+extern int concurrency_decay_rate_handler(struct ctl_table *table, int write,
+					void __user *buffer,
+					size_t *lenp, loff_t *ppos);
+#endif
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/sched/concurrency.c b/kernel/sched/concurrency.c
index 195bcf5..554c4d5 100644
--- a/kernel/sched/concurrency.c
+++ b/kernel/sched/concurrency.c
@@ -10,6 +10,331 @@
 
 #include "sched.h"
 
+/*
+ * the sum period of time is 2^26 ns (~64 ms) by default
+ */
+unsigned long sysctl_concurrency_sum_period = 26UL;
+
+/*
+ * the number of sum periods, after which the original
+ * will be reduced/decayed to half
+ */
+unsigned long sysctl_concurrency_decay_rate = 1UL;
+
+/*
+ * the contrib period of time is 2^10 ns (~1us) by default;
+ * us has better precision than ms, and
+ * 1024 allows a faster shift instead of a div
+ */
+static unsigned long cc_contrib_period = 10UL;
+
+/*
+ * the concurrency is scaled up for decaying,
+ * thus, concurrency 1 is effectively 2^cc_resolution (1024),
+ * which can be halved by 10 half-life periods
+ */
+static unsigned long cc_resolution = 10UL;
+
+/*
+ * after this number of half-life periods, even
+ * (1<<32)-1 (which is sufficiently large) decays to less than 1
+ */
+static unsigned long cc_decay_max_pds = 32UL;
+
+static inline unsigned long cc_scale_up(unsigned long c)
+{
+	return c << cc_resolution;
+}
+
+static inline unsigned long cc_scale_down(unsigned long c)
+{
+	return c >> cc_resolution;
+}
+
+/* from nanoseconds to sum periods */
+static inline u64 cc_sum_pds(u64 n)
+{
+	return n >> sysctl_concurrency_sum_period;
+}
+
+/* from sum period to timestamp in ns */
+static inline u64 cc_timestamp(u64 p)
+{
+	return p << sysctl_concurrency_sum_period;
+}
+
+/*
+ * from nanoseconds to contrib periods, because
+ * raw ns would risk overflowing cc->contrib
+ */
+static inline u64 cc_contrib_pds(u64 n)
+{
+	return n >> cc_contrib_period;
+}
+
+/*
+ * cc_decay_factor only works for 32-bit integers;
+ * in cc_decay_factor_x, x indicates the number of periods
+ * per half-life (sysctl_concurrency_decay_rate)
+ */
+static const unsigned long cc_decay_factor_1[] = {
+	0xFFFFFFFF,
+};
+
+static const unsigned long cc_decay_factor_2[] = {
+	0xFFFFFFFF, 0xB504F333,
+};
+
+static const unsigned long cc_decay_factor_4[] = {
+	0xFFFFFFFF, 0xD744FCCA, 0xB504F333, 0x9837F051,
+};
+
+static const unsigned long cc_decay_factor_8[] = {
+	0xFFFFFFFF, 0xEAC0C6E7, 0xD744FCCA, 0xC5672A11,
+	0xB504F333, 0xA5FED6A9, 0x9837F051, 0x8B95C1E3,
+};
+
+/* by default sysctl_concurrency_decay_rate */
+static const unsigned long *cc_decay_factor =
+	cc_decay_factor_1;
+
+/*
+ * cc_decayed_sum depends on cc_resolution (fixed 10);
+ * in cc_decayed_sum_x, x indicates the number of periods
+ * per half-life (sysctl_concurrency_decay_rate)
+ */
+static const unsigned long cc_decayed_sum_1[] = {
+	0, 512, 768, 896, 960, 992,
+	1008, 1016, 1020, 1022, 1023,
+};
+
+static const unsigned long cc_decayed_sum_2[] = {
+	0, 724, 1235, 1597, 1853, 2034, 2162, 2252,
+	2316, 2361, 2393, 2416, 2432, 2443, 2451,
+	2457, 2461, 2464, 2466, 2467, 2468, 2469,
+};
+
+static const unsigned long cc_decayed_sum_4[] = {
+	0, 861, 1585, 2193, 2705, 3135, 3497, 3801, 4057,
+	4272, 4453, 4605, 4733, 4840, 4930, 5006, 5070,
+	5124, 5169, 5207, 5239, 5266, 5289, 5308, 5324,
+	5337, 5348, 5358, 5366, 5373, 5379, 5384, 5388,
+	5391, 5394, 5396, 5398, 5400, 5401, 5402, 5403,
+	5404, 5405, 5406,
+};
+
+static const unsigned long cc_decayed_sum_8[] = {
+	0, 939, 1800, 2589, 3313, 3977, 4585, 5143,
+	5655, 6124, 6554, 6949, 7311, 7643, 7947, 8226,
+	8482, 8717, 8932, 9129, 9310, 9476, 9628, 9767,
+	9895, 10012, 10120, 10219, 10309, 10392, 10468, 10538,
+	10602, 10661, 10715, 10764, 10809, 10850, 10888, 10923,
+	10955, 10984, 11011, 11036, 11059, 11080, 11099, 11116,
+	11132, 11147, 11160, 11172, 11183, 11193, 11203, 11212,
+	11220, 11227, 11234, 11240, 11246, 11251, 11256, 11260,
+	11264, 11268, 11271, 11274, 11277, 11280, 11282, 11284,
+	11286, 11288, 11290, 11291, 11292, 11293, 11294, 11295,
+	11296, 11297, 11298, 11299, 11300, 11301, 11302,
+};
+
+/* by default sysctl_concurrency_decay_rate */
+static const unsigned long *cc_decayed_sum = cc_decayed_sum_1;
+
+/*
+ * the last index of cc_decayed_sum array
+ */
+static unsigned long cc_decayed_sum_len =
+	sizeof(cc_decayed_sum_1) / sizeof(cc_decayed_sum_1[0]) - 1;
+
+/*
+ * sysctl handler to update decay rate
+ */
+int concurrency_decay_rate_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (ret || !write)
+		return ret;
+
+	switch (sysctl_concurrency_decay_rate) {
+	case 1:
+		cc_decay_factor = cc_decay_factor_1;
+		cc_decayed_sum = cc_decayed_sum_1;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_1) /
+			sizeof(cc_decayed_sum_1[0]) - 1;
+		break;
+	case 2:
+		cc_decay_factor = cc_decay_factor_2;
+		cc_decayed_sum = cc_decayed_sum_2;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_2) /
+			sizeof(cc_decayed_sum_2[0]) - 1;
+		break;
+	case 4:
+		cc_decay_factor = cc_decay_factor_4;
+		cc_decayed_sum = cc_decayed_sum_4;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_4) /
+			sizeof(cc_decayed_sum_4[0]) - 1;
+		break;
+	case 8:
+		cc_decay_factor = cc_decay_factor_8;
+		cc_decayed_sum = cc_decayed_sum_8;
+		cc_decayed_sum_len = sizeof(cc_decayed_sum_8) /
+			sizeof(cc_decayed_sum_8[0]) - 1;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	cc_decay_max_pds *= sysctl_concurrency_decay_rate;
+
+	return 0;
+}
+
+/*
+ * decay concurrency at some decay rate
+ */
+static inline u64 decay_cc(u64 cc, u64 periods)
+{
+	u32 periods_l;
+
+	if (periods <= 0)
+		return cc;
+
+	if (unlikely(periods >= cc_decay_max_pds))
+		return 0;
+
+	/* now period is not too large */
+	periods_l = (u32)periods;
+	if (periods_l >= sysctl_concurrency_decay_rate) {
+		cc >>= periods_l / sysctl_concurrency_decay_rate;
+		periods_l %= sysctl_concurrency_decay_rate;
+	}
+
+	if (!periods_l)
+		return cc;
+
+	cc *= cc_decay_factor[periods_l];
+
+	return cc >> 32;
+}
+
+/*
+ * add missed periods by predefined constants
+ */
+static inline u64 cc_missed_pds(u64 periods)
+{
+	if (periods <= 0)
+		return 0;
+
+	if (periods > cc_decayed_sum_len)
+		periods = cc_decayed_sum_len;
+
+	return cc_decayed_sum[periods];
+}
+
+/*
+ * scale up nr_running, because we decay
+ */
+static inline unsigned long cc_weight(unsigned long nr_running)
+{
+	/*
+	 * scaling factor, this should be tunable
+	 */
+	return cc_scale_up(nr_running);
+}
+
+static inline void
+__update_concurrency(struct rq *rq, u64 now, struct cpu_concurrency_t *cc)
+{
+	u64 sum_pds, sum_pds_s, sum_pds_e;
+	u64 contrib_pds, ts_contrib, contrib_pds_one;
+	u64 sum_now = 0;
+	unsigned long weight;
+	int updated = 0;
+
+	/*
+	 * guarantee contrib_timestamp always >= sum_timestamp,
+	 * and sum_timestamp is at period boundary
+	 */
+	if (now <= cc->sum_timestamp) {
+		cc->sum_timestamp = cc_timestamp(cc_sum_pds(now));
+		cc->contrib_timestamp = now;
+		return;
+	}
+
+	weight = cc_weight(cc->nr_running);
+
+	/* start and end of sum periods */
+	sum_pds_s = cc_sum_pds(cc->sum_timestamp);
+	sum_pds_e = cc_sum_pds(now);
+	sum_pds = sum_pds_e - sum_pds_s;
+	/* number of contrib periods in one sum period */
+	contrib_pds_one = cc_contrib_pds(cc_timestamp(1));
+
+	/*
+	 * if we have passed at least one period,
+	 * we need to do four things:
+	 */
+	if (sum_pds) {
+		/* 1) complete the last period */
+		ts_contrib = cc_timestamp(sum_pds_s + 1);
+		contrib_pds = cc_contrib_pds(ts_contrib);
+		contrib_pds -= cc_contrib_pds(cc->contrib_timestamp);
+
+		if (likely(contrib_pds))
+			cc->contrib += weight * contrib_pds;
+
+		cc->contrib = div64_u64(cc->contrib, contrib_pds_one);
+
+		cc->sum += cc->contrib;
+		cc->contrib = 0;
+
+		/* 2) update/decay them */
+		cc->sum = decay_cc(cc->sum, sum_pds);
+		sum_now = decay_cc(cc->sum, sum_pds - 1);
+
+		/* 3) compensate missed periods if any */
+		sum_pds -= 1;
+		cc->sum += cc->nr_running * cc_missed_pds(sum_pds);
+		sum_now += cc->nr_running * cc_missed_pds(sum_pds - 1);
+		updated = 1;
+
+		/* 4) update contrib timestamp to period boundary */
+		ts_contrib = cc_timestamp(sum_pds_e);
+
+		cc->sum_timestamp = ts_contrib;
+		cc->contrib_timestamp = ts_contrib;
+	}
+
+	/* current period */
+	contrib_pds = cc_contrib_pds(now);
+	contrib_pds -= cc_contrib_pds(cc->contrib_timestamp);
+
+	if (likely(contrib_pds))
+		cc->contrib += weight * contrib_pds;
+
+	/* new nr_running for next update */
+	cc->nr_running = rq->nr_running;
+
+	/*
+	 * we need to account for the current sum period:
+	 * if now has passed 1/2 of the sum period, we contribute it;
+	 * otherwise, we use the last complete sum period
+	 */
+	contrib_pds = cc_contrib_pds(now - cc->sum_timestamp);
+
+	if (contrib_pds > contrib_pds_one / 2) {
+		sum_now = div64_u64(cc->contrib, contrib_pds);
+		sum_now += cc->sum;
+		updated = 1;
+	}
+
+	if (updated == 1)
+		cc->sum_now = sum_now;
+	cc->contrib_timestamp = now;
+}
+
 void init_cpu_concurrency(struct rq *rq)
 {
 	rq->concurrency.sum = 0;
@@ -19,4 +344,23 @@ void init_cpu_concurrency(struct rq *rq)
 	rq->concurrency.sum_timestamp = ULLONG_MAX;
 	rq->concurrency.contrib_timestamp = ULLONG_MAX;
 }
+
+/*
+ * we update cpu concurrency at:
+ * 1) enqueue task, which increases concurrency
+ * 2) dequeue task, which decreases concurrency
+ * 3) periodic scheduler tick, in case no en/dequeue for long
+ * 4) enter and exit idle
+ */
+void update_cpu_concurrency(struct rq *rq)
+{
+	/*
+	 * protected under rq->lock
+	 */
+	struct cpu_concurrency_t *cc = &rq->concurrency;
+	u64 now = rq->clock;
+
+	__update_concurrency(rq, now, cc);
+}
+
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0cf9bf0..11b0c81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1220,8 +1220,10 @@ extern void resched_cpu(int cpu);
 
 #ifdef CONFIG_CPU_CONCURRENCY
 extern void init_cpu_concurrency(struct rq *rq);
+extern void update_cpu_concurrency(struct rq *rq);
 #else
 static inline void init_cpu_concurrency(struct rq *rq) {}
+static inline void update_cpu_concurrency(struct rq *rq) {}
 #endif
 
 extern struct rt_bandwidth def_rt_bandwidth;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 74f5b58..91ba467 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1090,6 +1090,22 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_CPU_CONCURRENCY
+	{
+		.procname	= "concurrency_sum_period",
+		.data		= &sysctl_concurrency_sum_period,
+		.maxlen		= sizeof(sysctl_concurrency_sum_period),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "concurrency_decay_rate",
+		.data		= &sysctl_concurrency_decay_rate,
+		.maxlen		= sizeof(sysctl_concurrency_decay_rate),
+		.mode		= 0644,
+		.proc_handler	= concurrency_decay_rate_handler,
+	},
+#endif
 	{ }
 };
 
-- 
1.7.9.5



* [RFC PATCH 04/12 v1] CPU ConCurrency collecting in:
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (2 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 03/12 v1] CPU ConCurrency calculation Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 05/12 v1] CONFIG for Workload Consolidation Yuyang Du
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

1. dequeue
2. enqueue
3. scheduler tick
4. idle enter and exit

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/core.c |    3 +++
 kernel/sched/fair.c |    2 ++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ff990f9..aee8660 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -767,6 +767,7 @@ static void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	sched_info_queued(rq, p);
 	p->sched_class->enqueue_task(rq, p, flags);
+	update_cpu_concurrency(rq);
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
@@ -774,6 +775,7 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	update_rq_clock(rq);
 	sched_info_dequeued(rq, p);
 	p->sched_class->dequeue_task(rq, p, flags);
+	update_cpu_concurrency(rq);
 }
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
@@ -2428,6 +2430,7 @@ void scheduler_tick(void)
 	update_rq_clock(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
 	update_cpu_load_active(rq);
+	update_cpu_concurrency(rq);
 	raw_spin_unlock(&rq->lock);
 
 	perf_event_task_tick();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..e7153ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2563,6 +2563,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 void idle_enter_fair(struct rq *this_rq)
 {
 	update_rq_runnable_avg(this_rq, 1);
+	update_cpu_concurrency(this_rq);
 }
 
 /*
@@ -2573,6 +2574,7 @@ void idle_enter_fair(struct rq *this_rq)
 void idle_exit_fair(struct rq *this_rq)
 {
 	update_rq_runnable_avg(this_rq, 0);
+	update_cpu_concurrency(this_rq);
 }
 
 static int idle_balance(struct rq *this_rq);
-- 
1.7.9.5



* [RFC PATCH 05/12 v1] CONFIG for Workload Consolidation
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (3 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 04/12 v1] CPU ConCurrency collecting in: Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 06/12 v1] Attach CPU topology Yuyang Du
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 arch/x86/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9bfac8d..0999c16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -808,6 +808,16 @@ config CPU_CONCURRENCY
 	  load balancing for power efficiency without sacrificing performance.
 	  If unsure say N here.
 
+config WORKLOAD_CONSOLIDATION
+	bool "CPU Workload Consolidation"
+	default n
+	depends on CPU_CONCURRENCY
+	---help---
+	  CPU Workload Consolidation is a new CPU PM module. It uses CPU
+	  ConCurrency and allows asymmetric concurrency across CPUs to reduce
+	  SW and HW overhead, increase load balance efficiency, and conserve
+	  energy. If unsure say N here.
+
 source "kernel/Kconfig.preempt"
 
 config X86_UP_APIC
-- 
1.7.9.5



* [RFC PATCH 06/12 v1] Attach CPU topology
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (4 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 05/12 v1] CONFIG for Workload Consolidation Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 07/12 v1] CPU ConCurrency API for Workload Consolidation Yuyang Du
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h    |   13 +++++++++++++
 include/linux/topology.h |   16 ++++++++++++++++
 kernel/sched/core.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 25f54c7..29827ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -877,6 +877,12 @@ enum cpu_idle_type {
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+#define SD_ASYM_CONCURRENCY	0x8000	/* Higher concurrency in front to save power */
+#else
+#define SD_ASYM_CONCURRENCY	0
+#endif
+
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
 struct sched_domain_attr {
@@ -960,6 +966,13 @@ struct sched_domain {
 		struct rcu_head rcu;	/* used during destruction */
 	};
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	unsigned int total_groups;
+	unsigned int group_number;
+	unsigned int asym_concurrency;
+	struct sched_group *first_group;	/* ordered by CPU number */
+#endif
+
 	unsigned int span_weight;
 	/*
 	 * Span of all CPUs in this domain.
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7062330..e57f4d6 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,6 +66,16 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+#ifndef ASYM_CONCURRENCY_INIT
+#define ASYM_CONCURRENCY_INIT(n) .asym_concurrency = (n),
+#endif
+#else
+#ifndef ASYM_CONCURRENCY_INIT
+#define ASYM_CONCURRENCY_INIT(n)
+#endif
+#endif
+
 /*
  * Below are the 3 major initializers used in building sched_domains:
  * SD_SIBLING_INIT, for SMT domains
@@ -102,12 +112,14 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				| arch_sd_sibling_asym_packing()	\
+				| 0*SD_ASYM_CONCURRENCY			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.smt_gain		= 1178,	/* 15% */			\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	ASYM_CONCURRENCY_INIT(0)					\
 }
 #endif
 #endif /* CONFIG_SCHED_SMT */
@@ -134,11 +146,13 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_ASYM_CONCURRENCY			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	ASYM_CONCURRENCY_INIT(180)					\
 }
 #endif
 #endif /* CONFIG_SCHED_MC */
@@ -167,11 +181,13 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 1*SD_PREFER_SIBLING			\
+				| 1*SD_ASYM_CONCURRENCY			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
 	.max_newidle_lb_cost	= 0,					\
 	.next_decay_max_lb_cost	= jiffies,				\
+	ASYM_CONCURRENCY_INIT(180)					\
 }
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aee8660..671f953 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4843,7 +4843,11 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct ctl_table *table = sd_alloc_ctl_entry(15);
+#else
 	struct ctl_table *table = sd_alloc_ctl_entry(14);
+#endif
 
 	if (table == NULL)
 		return NULL;
@@ -4876,7 +4880,13 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
 		sizeof(long), 0644, proc_doulongvec_minmax, false);
 	set_table_entry(&table[12], "name", sd->name,
 		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	set_table_entry(&table[13], "asym_concurrency", &sd->asym_concurrency,
+		sizeof(int), 0644, proc_dointvec, false);
+	/* &table[14] is terminator */
+#else
 	/* &table[13] is terminator */
+#endif
 
 	return table;
 }
@@ -5497,6 +5507,33 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+static void update_domain_extra_info(struct sched_domain *sd)
+{
+	while (sd) {
+		int i = 0, j = 0, first, min = INT_MAX;
+		struct sched_group *group;
+
+		group = sd->groups;
+		first = group_first_cpu(group);
+		do {
+			int k = group_first_cpu(group);
+			i += 1;
+			if (k < first)
+				j += 1;
+			if (k < min) {
+				sd->first_group = group;
+				min = k;
+			}
+		} while (group = group->next, group != sd->groups);
+
+		sd->total_groups = i;
+		sd->group_number = j;
+		sd = sd->parent;
+	}
+}
+#endif
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -5545,6 +5582,10 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	destroy_sched_domains(tmp, cpu);
 
 	update_top_cache_domain(cpu);
+
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	update_domain_extra_info(sd);
+#endif
 }
 
 /* cpus with isolated domains */
-- 
1.7.9.5


* [RFC PATCH 07/12 v1] CPU ConCurrency API for Workload Consolidation
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (5 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 06/12 v1] Attach CPU topology Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 08/12 v1] Intercept wakeup/fork/exec balance Yuyang Du
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/concurrency.c |  562 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h       |   13 +
 2 files changed, 575 insertions(+)

diff --git a/kernel/sched/concurrency.c b/kernel/sched/concurrency.c
index 554c4d5..15563de 100644
--- a/kernel/sched/concurrency.c
+++ b/kernel/sched/concurrency.c
@@ -28,6 +28,25 @@ unsigned long sysctl_concurrency_decay_rate = 1UL;
  */
 static unsigned long cc_contrib_period = 10UL;
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+/*
+ * whether we use concurrency to select cpu to run
+ * the woken up task
+ */
+static unsigned long wc_wakeup = 1UL;
+
+/*
+ * concurrency lower than percentage of this number
+ * is capable of running wakee
+ */
+static unsigned long wc_wakeup_threshold = 80UL;
+
+/*
+ * aggressively push the task even if it is cache hot
+ */
+static unsigned long wc_push_hot_task = 1UL;
+#endif
+
 /*
  * the concurrency is scaled up for decaying,
  * thus, concurrency 1 is effectively 2^cc_resolution (1024),
@@ -343,6 +362,9 @@ void init_cpu_concurrency(struct rq *rq)
 	rq->concurrency.nr_running = 0;
 	rq->concurrency.sum_timestamp = ULLONG_MAX;
 	rq->concurrency.contrib_timestamp = ULLONG_MAX;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	rq->concurrency.unload = 0;
+#endif
 }
 
 /*
@@ -364,3 +386,543 @@ void update_cpu_concurrency(struct rq *rq)
 }
 
 #endif
+
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+/*
+ * whether cpu is capable of having more concurrency
+ */
+static int cpu_cc_capable(int cpu)
+{
+	u64 sum = cpu_rq(cpu)->concurrency.sum_now;
+	u64 threshold = cc_weight(1);
+
+	sum *= 100;
+	sum *= cpu_rq(cpu)->cpu_power;
+
+	threshold *= wc_wakeup_threshold;
+	threshold <<= SCHED_POWER_SHIFT;
+
+	if (sum <= threshold)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * we do not select idle, if the cc of the
+ * wakee and waker (in this order) is capable
+ * of handling the wakee task
+ */
+int workload_consolidation_wakeup(int prev, int target)
+{
+	if (!wc_wakeup) {
+		if (idle_cpu(target))
+			return target;
+
+		return nr_cpu_ids;
+	}
+
+	if (idle_cpu(prev) || cpu_cc_capable(prev))
+		return prev;
+
+	if (prev != target && (idle_cpu(target) || cpu_cc_capable(target)))
+		return target;
+
+	return nr_cpu_ids;
+}
+
+static inline u64 sched_group_cc(struct sched_group *sg)
+{
+	u64 sg_cc = 0;
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(sg))
+		sg_cc += cpu_rq(i)->concurrency.sum_now *
+			cpu_rq(i)->cpu_power;
+
+	return sg_cc;
+}
+
+static inline u64 sched_domain_cc(struct sched_domain *sd)
+{
+	struct sched_group *sg = sd->groups;
+	u64 sd_cc = 0;
+
+	do {
+		sd_cc += sched_group_cc(sg);
+		sg = sg->next;
+	} while (sg != sd->groups);
+
+	return sd_cc;
+}
+
+static inline struct sched_group *
+find_lowest_cc_group(struct sched_group *sg, int span)
+{
+	u64 grp_cc, min = ULLONG_MAX;
+	struct sched_group *lowest = NULL;
+	int i;
+
+	for (i = 0; i < span; ++i) {
+		grp_cc = sched_group_cc(sg);
+
+		if (grp_cc < min) {
+			min = grp_cc;
+			lowest = sg;
+		}
+
+		sg = sg->next;
+	}
+
+	return lowest;
+}
+
+static inline u64 __calc_cc_thr(int cpus, unsigned int asym_cc)
+{
+	u64 thr = cpus;
+
+	thr *= cc_weight(1);
+	thr *= asym_cc;
+	thr <<= SCHED_POWER_SHIFT;
+
+	return thr;
+}
+
+/*
+ * can @src_cc of @src_nr cpus be consolidated
+ * to @dst_cc of @dst_nr cpus
+ */
+static inline int
+__can_consolidate_cc(u64 src_cc, int src_nr, u64 dst_cc, int dst_nr)
+{
+	dst_cc *= dst_nr;
+	src_nr -= dst_nr;
+
+	if (unlikely(src_nr <= 0))
+		return 0;
+
+	src_nr = ilog2(src_nr);
+	src_nr += dst_nr;
+	src_cc *= src_nr;
+
+	if (src_cc > dst_cc)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * find the group for asymmetric concurrency
+ * problem to address: traverse sd from top down
+ */
+struct sched_group *
+workload_consolidation_find_group(struct sched_domain *sd,
+	struct task_struct *p, int this_cpu)
+{
+	int half, sg_weight, ns_half = 0;
+	struct sched_group *sg;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int allowed = 0, i;
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->asym_concurrency);
+
+		/*
+		 * we did not consider the added cc by this
+		 * wakeup (mostly from fork/exec)
+		 */
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			break;
+
+		sg = sd->first_group;
+		for (i = 0; i < half; ++i) {
+			/* if it has no cpus allowed */
+			if (!cpumask_intersects(sched_group_cpus(sg),
+					tsk_cpus_allowed(p)))
+				continue;
+
+			allowed = 1;
+			sg = sg->next;
+		}
+
+		if (!allowed)
+			break;
+
+		ns_half = half;
+		half /= 2;
+	}
+
+	if (!ns_half)
+		return NULL;
+
+	if (ns_half == 1)
+		return sd->first_group;
+
+	return find_lowest_cc_group(sd->first_group, ns_half);
+}
+
+/*
+ * top_flag_domain - return top sched_domain containing flag.
+ * @cpu:	the cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	the flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * returns the highest sched_domain of a cpu which contains the given flag.
+ * different from highest_flag_domain in that along the domain upward chain
+ * domain may or may not contain the flag.
+ */
+static inline struct sched_domain *top_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			continue;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
+/*
+ * workload_consolidation_cpu_shielded - return whether @cpu is shielded or not
+ *
+ * traverse downward the sched_domain tree when the sched_domain contains
+ * flag SD_ASYM_CONCURRENCY, each sd may have more than two groups, but
+ * we assume 1) every sched_group has the same weight, 2) every CPU has
+ * the same computing power
+ */
+int workload_consolidation_cpu_shielded(int cpu)
+{
+	struct sched_domain *sd;
+
+	sd = top_flag_domain(cpu, SD_ASYM_CONCURRENCY);
+
+	while (sd) {
+		int half, sg_weight, this_sg_nr;
+		u64 sd_cc;
+
+		if (!(sd->flags & SD_ASYM_CONCURRENCY)) {
+			sd = sd->child;
+			continue;
+		}
+
+		half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+		sg_weight = sd->groups->group_weight;
+		this_sg_nr = sd->group_number;
+
+		sd_cc = sched_domain_cc(sd);
+		sd_cc *= 100;
+
+		while (half) {
+			int cpus = sg_weight * half;
+			u64 threshold = __calc_cc_thr(cpus,
+				sd->asym_concurrency);
+
+			if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+				threshold, cpus))
+				return 0;
+
+			if (this_sg_nr >= half)
+				return 1;
+
+			half /= 2;
+		}
+
+		sd = sd->child;
+	}
+
+	return 0;
+}
+
+/*
+ * as of now, we have the following assumption
+ * 1) every sched_group has the same weight
+ * 2) every CPU has the same computing power
+ */
+static inline int __nonshielded_groups(struct sched_domain *sd)
+{
+	int half, sg_weight, ret = 0;
+	u64 sd_cc;
+
+	half = DIV_ROUND_CLOSEST(sd->total_groups, 2);
+	sg_weight = sd->groups->group_weight;
+
+	sd_cc = sched_domain_cc(sd);
+	sd_cc *= 100;
+
+	while (half) {
+		int cpus = sg_weight * half;
+		u64 threshold = __calc_cc_thr(cpus,
+			sd->asym_concurrency);
+
+		if (!__can_consolidate_cc(sd_cc, sd->span_weight,
+			threshold, cpus))
+			return ret;
+
+		ret = half;
+		half /= 2;
+	}
+
+	return ret;
+}
+
+static DEFINE_PER_CPU(struct cpumask, nonshielded_cpumask);
+
+/*
+ * workload_consolidation_nonshielded_mask - return the nonshielded cpus in @mask,
+ * i.e. @mask with the shielded cpus cleared
+ *
+ * traverse downward the sched_domain tree when the sched_domain contains
+ * flag SD_ASYM_CONCURRENCY, each sd may have more than two groups
+ */
+void workload_consolidation_nonshielded_mask(int cpu, struct cpumask *mask)
+{
+	struct sched_domain *sd;
+	struct cpumask *pcpu_mask = &per_cpu(nonshielded_cpumask, cpu);
+	int i;
+
+	sd = top_flag_domain(cpu, SD_ASYM_CONCURRENCY);
+
+	if (!sd)
+		return;
+
+	while (sd) {
+		struct sched_group *sg;
+		int this_sg_nr, ns_half;
+
+		if (!(sd->flags & SD_ASYM_CONCURRENCY)) {
+			sd = sd->child;
+			continue;
+		}
+
+		ns_half = __nonshielded_groups(sd);
+
+		if (!ns_half)
+			break;
+
+		cpumask_clear(pcpu_mask);
+		sg = sd->first_group;
+
+		for (i = 0; i < ns_half; ++i) {
+			cpumask_or(pcpu_mask, pcpu_mask,
+				sched_group_cpus(sg));
+			sg = sg->next;
+		}
+
+		cpumask_and(mask, mask, pcpu_mask);
+
+		this_sg_nr = sd->group_number;
+		if (this_sg_nr)
+			break;
+
+		sd = sd->child;
+	}
+}
+
+static int cpu_task_hot(struct task_struct *p, u64 now)
+{
+	s64 delta;
+
+	if (p->sched_class != &fair_sched_class)
+		return 0;
+
+	if (unlikely(p->policy == SCHED_IDLE))
+		return 0;
+
+	if (sysctl_sched_migration_cost == -1)
+		return 1;
+
+	if (sysctl_sched_migration_cost == 0)
+		return 0;
+
+	if (wc_push_hot_task)
+		return 0;
+
+	/*
+	 * buddy candidates are cache hot:
+	 */
+	if (sched_feat(CACHE_HOT_BUDDY) && this_rq()->nr_running &&
+			(&p->se == p->se.cfs_rq->next ||
+			 &p->se == p->se.cfs_rq->last)) {
+		return 1;
+	}
+
+	delta = now - p->se.exec_start;
+
+	if (delta < (s64)sysctl_sched_migration_cost)
+		return 1;
+
+	return 0;
+}
+
+static int
+cpu_move_task(struct task_struct *p, struct rq *src_rq, struct rq *dst_rq)
+{
+	/*
+	 * we do not migrate tasks that are:
+	 * 1) running (obviously), or
+	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 3) are cache-hot on their current CPU.
+	 */
+	if (!cpumask_test_cpu(dst_rq->cpu, tsk_cpus_allowed(p)))
+		return 0;
+
+	if (task_running(src_rq, p))
+		return 0;
+
+	/*
+	 * aggressive migration if task is cache cold
+	 */
+	if (!cpu_task_hot(p, src_rq->clock_task)) {
+		/*
+		 * move a task
+		 */
+		deactivate_task(src_rq, p, 0);
+		set_task_cpu(p, dst_rq->cpu);
+		activate_task(dst_rq, p, 0);
+		check_preempt_curr(dst_rq, p, 0);
+		return 1;
+	}
+
+	return 0;
+}
+
+/*
+ * __unload_cpu_work is run by src cpu stopper, which pushes running
+ * tasks off src cpu onto dst cpu
+ */
+static int __unload_cpu_work(void *data)
+{
+	struct rq *src_rq = data;
+	int src_cpu = cpu_of(src_rq);
+	struct cpu_concurrency_t *cc = &src_rq->concurrency;
+	struct rq *dst_rq = cpu_rq(cc->dst_cpu);
+
+	struct list_head *tasks = &src_rq->cfs_tasks;
+	struct task_struct *p, *n;
+	int pushed = 0;
+	int nr_migrate_break = 1;
+
+	raw_spin_lock_irq(&src_rq->lock);
+
+	/* make sure the requested cpu hasn't gone down in the meantime */
+	if (unlikely(src_cpu != smp_processor_id() || !cc->unload))
+		goto out_unlock;
+
+	/* Is there any task to move? */
+	if (src_rq->nr_running <= 1)
+		goto out_unlock;
+
+	double_lock_balance(src_rq, dst_rq);
+
+	list_for_each_entry_safe(p, n, tasks, se.group_node) {
+
+		if (!cpu_move_task(p, src_rq, dst_rq))
+			continue;
+
+		pushed++;
+
+		if (pushed >= nr_migrate_break)
+			break;
+	}
+
+	double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+	cc->unload = 0;
+	raw_spin_unlock_irq(&src_rq->lock);
+
+	return 0;
+}
+
+/*
+ * unload src_cpu to dst_cpu
+ */
+static void unload_cpu(int src_cpu, int dst_cpu)
+{
+	unsigned long flags;
+	struct rq *src_rq = cpu_rq(src_cpu);
+	struct cpu_concurrency_t *cc = &src_rq->concurrency;
+	int unload = 0;
+
+	raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+	if (!cc->unload) {
+		cc->unload = 1;
+		cc->dst_cpu = dst_cpu;
+		unload = 1;
+	}
+
+	raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+
+	if (unload)
+		stop_one_cpu_nowait(src_cpu, __unload_cpu_work, src_rq,
+			&cc->unload_work);
+}
+
+static inline int find_lowest_cc_cpu(struct cpumask *mask)
+{
+	u64 cpu_cc, min = ULLONG_MAX;
+	int i, lowest = nr_cpu_ids;
+	struct rq *rq;
+
+	for_each_cpu(i, mask) {
+		rq = cpu_rq(i);
+		cpu_cc = rq->concurrency.sum_now * rq->cpu_power;
+
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			lowest = i;
+		}
+	}
+
+	return lowest;
+}
+
+/*
+ * find the lowest cc cpu in shielded and nonshielded cpus,
+ * aggressively unload the shielded to the nonshielded
+ */
+void workload_consolidation_unload(struct cpumask *nonshielded)
+{
+	int src_cpu = nr_cpu_ids, dst_cpu, i;
+	u64 cpu_cc, min = ULLONG_MAX;
+	struct rq *rq;
+
+	for_each_cpu_not(i, nonshielded) {
+		if (i >= nr_cpu_ids)
+			break;
+
+		rq = cpu_rq(i);
+		if (rq->nr_running <= 0)
+			continue;
+
+		cpu_cc = rq->concurrency.sum_now * rq->cpu_power;
+		if (cpu_cc < min) {
+			min = cpu_cc;
+			src_cpu = i;
+		}
+	}
+
+	if (src_cpu >= nr_cpu_ids)
+		return;
+
+	dst_cpu = find_lowest_cc_cpu(nonshielded);
+	if (dst_cpu >= nr_cpu_ids)
+		return;
+
+	if (src_cpu != dst_cpu)
+		unload_cpu(src_cpu, dst_cpu);
+}
+
+#endif /* CONFIG_WORKLOAD_CONSOLIDATION */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 11b0c81..a9d46f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -516,6 +516,11 @@ struct cpu_concurrency_t {
 	u64 sum_timestamp;
 	u64 contrib_timestamp;
 	unsigned long nr_running;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	int unload;
+	int dst_cpu;
+	struct cpu_stop_work unload_work;
+#endif
 };
 #endif
 
@@ -1221,6 +1226,14 @@ extern void resched_cpu(int cpu);
 #ifdef CONFIG_CPU_CONCURRENCY
 extern void init_cpu_concurrency(struct rq *rq);
 extern void update_cpu_concurrency(struct rq *rq);
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+extern int workload_consolidation_wakeup(int prev, int target);
+extern struct sched_group *workload_consolidation_find_group(
+			struct sched_domain *sd, struct task_struct *p, int this_cpu);
+extern void workload_consolidation_unload(struct cpumask *nonshielded);
+extern int workload_consolidation_cpu_shielded(int cpu);
+extern void workload_consolidation_nonshielded_mask(int cpu, struct cpumask *mask);
+#endif
 #else
 static inline void init_cpu_concurrency(struct rq *rq) {}
 static inline void update_cpu_concurrency(struct rq *rq) {}
-- 
1.7.9.5



* [RFC PATCH 08/12 v1] Intercept wakeup/fork/exec balance
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (6 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 07/12 v1] CPU ConCurrency API for Workload Consolidation Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 09/12 v1] Intercept idle balance Yuyang Du
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7153ff..65f1651 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4365,9 +4365,16 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	int ret;
 
+	ret = workload_consolidation_wakeup(i, target);
+	if (ret < nr_cpu_ids)
+		return ret;
+#else
 	if (idle_cpu(target))
 		return target;
+#endif
 
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
@@ -4460,7 +4467,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	while (sd) {
-		struct sched_group *group;
+		struct sched_group *group = NULL;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -4468,6 +4475,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+		if (sd->flags & SD_ASYM_CONCURRENCY)
+			group = workload_consolidation_find_group(sd, p, cpu);
+
+		if (!group)
+#endif
 		group = find_idlest_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
-- 
1.7.9.5


* [RFC PATCH 09/12 v1] Intercept idle balance
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (7 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 08/12 v1] Intercept wakeup/fork/exec balance Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 10/12 v1] Intercept periodic nohz " Yuyang Du
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65f1651..d921c20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6655,6 +6655,10 @@ out:
 	return ld_moved;
 }
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);
+#endif
+
 /*
  * idle_balance is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -6666,6 +6670,9 @@ static int idle_balance(struct rq *this_rq)
 	unsigned long next_balance = jiffies + HZ;
 	u64 curr_cost = 0;
 	int this_cpu = this_rq->cpu;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct cpumask *nonshielded = __get_cpu_var(local_cpu_mask);
+#endif
 
 	idle_enter_fair(this_rq);
 	/*
@@ -6684,6 +6691,19 @@ static int idle_balance(struct rq *this_rq)
 
 	update_blocked_averages(this_cpu);
 	rcu_read_lock();
+
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	cpumask_copy(nonshielded, cpu_active_mask);
+
+	/*
+	 * if we encounter shielded cpus here, don't do balance on them
+	 */
+	workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
+	if (!cpumask_test_cpu(this_cpu, nonshielded))
+		goto unlock;
+	workload_consolidation_unload(nonshielded);
+#endif
+
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int continue_balancing = 1;
@@ -6716,6 +6736,9 @@ static int idle_balance(struct rq *this_rq)
 		if (pulled_task)
 			break;
 	}
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+unlock:
+#endif
 	rcu_read_unlock();
 
 	raw_spin_lock(&this_rq->lock);
@@ -7709,6 +7732,14 @@ void print_cfs_stats(struct seq_file *m, int cpu)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	unsigned int i;
+	for_each_possible_cpu(i) {
+		zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
+					GFP_KERNEL, cpu_to_node(i));
+	}
+#endif
+
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
 #ifdef CONFIG_NO_HZ_COMMON
-- 
1.7.9.5



* [RFC PATCH 10/12 v1] Intercept periodic nohz idle balance
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (8 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 09/12 v1] Intercept idle balance Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 11/12 v1] Intercept periodic load balance Yuyang Du
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d921c20..374c86b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6867,10 +6867,46 @@ static struct {
 
 static inline int find_new_ilb(void)
 {
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct cpumask *nonshielded = __get_cpu_var(local_cpu_mask);
+	int ilb, weight;
+	int this_cpu = smp_processor_id();
+
+	/*
+	 * Optimize for the case when we have no idle CPUs or only one
+	 * idle CPU. Don't walk the sched_domain hierarchy in such cases
+	 */
+	if (cpumask_weight(nohz.idle_cpus_mask) < 2)
+		return nr_cpu_ids;
+
+	ilb = cpumask_first(nohz.idle_cpus_mask);
+
+	if (ilb < nr_cpu_ids && idle_cpu(ilb)) {
+
+		cpumask_copy(nonshielded, nohz.idle_cpus_mask);
+
+		rcu_read_lock();
+		workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
+		rcu_read_unlock();
+
+		weight = cpumask_weight(nonshielded);
+
+		if (weight < 2)
+			return nr_cpu_ids;
+
+		/*
+		 * get idle load balancer again
+		 */
+		ilb = cpumask_first(nonshielded);
+		if (ilb < nr_cpu_ids && idle_cpu(ilb))
+			return ilb;
+	}
+#else
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
+#endif
 
 	return nr_cpu_ids;
 }
@@ -7107,7 +7143,7 @@ out:
  * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
  * rebalancing for all the cpus for whom scheduler ticks are stopped.
  */
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
@@ -7117,7 +7153,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;
 
-	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+	for_each_cpu(balance_cpu, mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
 
@@ -7165,10 +7201,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	if (unlikely(rq->idle_balance))
 		return 0;
 
-       /*
-	* We may be recently in ticked or tickless idle mode. At the first
-	* busy tick after returning from idle, we will update the busy stats.
-	*/
+	/*
+	 * We may be recently in ticked or tickless idle mode. At the first
+	 * busy tick after returning from idle, we will update the busy stats.
+	 */
 	set_cpu_sd_state_busy();
 	nohz_balance_exit_idle(cpu);
 
@@ -7211,7 +7247,7 @@ need_kick:
 	return 1;
 }
 #else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle, struct cpumask *mask) { }
 #endif
 
 /*
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread
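
Conceptually, the change to find_new_ilb() above narrows the usual "first idle CPU in
nohz.idle_cpus_mask" choice to idle CPUs that are not shielded by workload consolidation.
A condensed sketch of that selection, assuming (as the series' API suggests) that
workload_consolidation_nonshielded_mask() clears the shielded CPUs out of the mask it is
handed; find_new_ilb_nonshielded() is a hypothetical name:

	static int find_new_ilb_nonshielded(struct cpumask *scratch)
	{
		int this_cpu = smp_processor_id();
		int ilb;

		/* Nothing to kick unless at least two CPUs are in nohz idle. */
		if (cpumask_weight(nohz.idle_cpus_mask) < 2)
			return nr_cpu_ids;

		/* Start from all nohz-idle CPUs, then drop the shielded ones. */
		cpumask_copy(scratch, nohz.idle_cpus_mask);
		rcu_read_lock();
		workload_consolidation_nonshielded_mask(this_cpu, scratch);
		rcu_read_unlock();

		if (cpumask_weight(scratch) < 2)
			return nr_cpu_ids;

		/* Pick the idle load balancer from the nonshielded set. */
		ilb = cpumask_first(scratch);
		if (ilb < nr_cpu_ids && idle_cpu(ilb))
			return ilb;

		return nr_cpu_ids;
	}

Returning nr_cpu_ids is how the unmodified code already signals "no idle load balancer
available", so the nohz kick path needs no further changes.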

* [RFC PATCH 11/12 v1] Intercept periodic load balance
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (9 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 10/12 v1] Intercept periodic nohz " Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  0:02 ` [RFC PATCH 12/12 v1] Intercept RT provocatively Yuyang Du
  2014-05-05  9:37 ` [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/fair.c |   33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 374c86b..6cbf6c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7260,6 +7260,36 @@ static void run_rebalance_domains(struct softirq_action *h)
 	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	struct cpumask *nonshielded = __get_cpu_var(local_cpu_mask);
+	int this_cpu = cpu_of(this_rq);
+
+	/*
+	 * If we encounter shielded cpus here, don't balance on them
+	 */
+	cpumask_copy(nonshielded, cpu_active_mask);
+
+	rcu_read_lock();
+	workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
+	rcu_read_unlock();
+
+	/*
+	 * Aggressively unload the shielded cpus onto the nonshielded cpus
+	 */
+	workload_consolidation_unload(nonshielded);
+
+	if (cpumask_test_cpu(this_cpu, nonshielded)) {
+		rebalance_domains(this_rq, idle);
+
+		/*
+		 * If this cpu has a pending nohz_balance_kick, then do the
+		 * balancing on behalf of the other idle cpus whose ticks are
+		 * stopped.
+		 */
+		cpumask_and(nonshielded, nonshielded, nohz.idle_cpus_mask);
+		nohz_idle_balance(this_rq, idle, nonshielded);
+	}
+#else
 	rebalance_domains(this_rq, idle);
 
 	/*
@@ -7267,7 +7297,8 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * balancing on behalf of the other idle cpus whose ticks are
 	 * stopped.
 	 */
-	nohz_idle_balance(this_rq, idle);
+	nohz_idle_balance(this_rq, idle, nohz.idle_cpus_mask);
+#endif
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread
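
The run_rebalance_domains() change above restricts both regular and nohz idle balancing
to the nonshielded CPU set, after first pushing load off the shielded CPUs. A condensed
sketch of that flow, with rebalance_nonshielded() as a hypothetical name that simply
restates the #ifdef branch of the patch:

	static void rebalance_nonshielded(struct rq *this_rq, enum cpu_idle_type idle,
					  struct cpumask *nonshielded)
	{
		int this_cpu = cpu_of(this_rq);

		/* Start from every active CPU, then drop the shielded ones. */
		cpumask_copy(nonshielded, cpu_active_mask);
		rcu_read_lock();
		workload_consolidation_nonshielded_mask(this_cpu, nonshielded);
		rcu_read_unlock();

		/* Push work off the shielded CPUs onto the nonshielded ones. */
		workload_consolidation_unload(nonshielded);

		/* A shielded CPU neither balances itself nor on behalf of others. */
		if (!cpumask_test_cpu(this_cpu, nonshielded))
			return;

		rebalance_domains(this_rq, idle);

		/* nohz balancing is likewise limited to nonshielded idle CPUs. */
		cpumask_and(nonshielded, nonshielded, nohz.idle_cpus_mask);
		nohz_idle_balance(this_rq, idle, nonshielded);
	}

Together with the previous patch, this keeps the softirq balance path from waking or
balancing the CPUs that consolidation wants to keep idle.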

* [RFC PATCH 12/12 v1] Intercept RT provocatively
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (10 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 11/12 v1] Intercept periodic load balance Yuyang Du
@ 2014-05-05  0:02 ` Yuyang Du
  2014-05-05  9:37 ` [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
  12 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-05  0:02 UTC (permalink / raw)
  To: mingo, peterz, rafael.j.wysocki, linux-kernel, linux-pm
  Cc: arjan.van.de.ven, len.brown, alan.cox, mark.gross,
	morten.rasmussen, vincent.guittot, rajeev.d.muralidhar,
	vishwesh.m.rudramuni, nicole.chalhoub, ajaya.durg,
	harinarayanan.seshadri, yuyang.du

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 kernel/sched/rt.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd2267a..f8141fb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1217,6 +1217,9 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 {
 	struct task_struct *curr;
 	struct rq *rq;
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	int do_find = 0;
+#endif
 
 	if (p->nr_cpus_allowed == 1)
 		goto out;
@@ -1230,6 +1233,11 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 	rcu_read_lock();
 	curr = ACCESS_ONCE(rq->curr); /* unlocked access */
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	if (workload_consolidation_cpu_shielded(cpu))
+		do_find = 1;
+#endif
+
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
 	 * try to see if we can wake this RT task up on another
@@ -1252,9 +1260,15 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
 	 * This test is optimistic, if we get it wrong the load-balancer
 	 * will have to sort it out.
 	 */
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	if (do_find || (curr && unlikely(rt_task(curr)) &&
+	    (curr->nr_cpus_allowed < 2 ||
+	     curr->prio <= p->prio))) {
+#else
 	if (curr && unlikely(rt_task(curr)) &&
 	    (curr->nr_cpus_allowed < 2 ||
 	     curr->prio <= p->prio)) {
+#endif
 		int target = find_lowest_rq(p);
 
 		if (target != -1)
@@ -1460,6 +1474,12 @@ static int find_lowest_rq(struct task_struct *task)
 	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
 		return -1; /* No targets found */
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	workload_consolidation_nonshielded_mask(this_cpu, lowest_mask);
+	if (!cpumask_weight(lowest_mask))
+		return -1;
+#endif
+
 	/*
 	 * At this point we have built a mask of cpus representing the
 	 * lowest priority tasks in the system.  Now we want to elect
@@ -1687,6 +1707,11 @@ static int pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overloaded(this_rq)))
 		return 0;
 
+#ifdef CONFIG_WORKLOAD_CONSOLIDATION
+	if (workload_consolidation_cpu_shielded(this_cpu))
+		return 0;
+#endif
+
 	/*
 	 * Match the barrier from rt_set_overloaded; this guarantees that if we
 	 * see overloaded we must also see the rto_mask bit.
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 17+ messages in thread
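
On the RT side, the patch above makes a shielded CPU redirect RT wakeups toward the
nonshielded set (via find_lowest_rq()) and refuse to pull RT tasks. A condensed sketch
of the wakeup-side condition, with rt_wakeup_prefers_other_cpu() as a hypothetical name
for the combined test in select_task_rq_rt():

	static int rt_wakeup_prefers_other_cpu(struct task_struct *p, int cpu,
					       struct task_struct *curr)
	{
		/* A shielded cpu always tries to place the RT wakeup elsewhere. */
		if (workload_consolidation_cpu_shielded(cpu))
			return 1;

		/* Original heuristic: don't stack RT tasks of similar priority. */
		return curr && rt_task(curr) &&
		       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
	}

find_lowest_rq() then intersects its lowest_mask with the nonshielded mask; if nothing
nonshielded remains it returns -1 and the wakeup falls back to the default target.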

* Re: [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
                   ` (11 preceding siblings ...)
  2014-05-05  0:02 ` [RFC PATCH 12/12 v1] Intercept RT provocatively Yuyang Du
@ 2014-05-05  9:37 ` Peter Zijlstra
  2014-05-06 18:46   ` Yuyang Du
  12 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-05  9:37 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri

On Mon, May 05, 2014 at 08:02:40AM +0800, Yuyang Du wrote:
> Hi Ingo, PeterZ, Rafael, and others,

The general code structure is an immediate no go. We're not going to
bolt on anything like this.

I've yet to look at the content.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-05  9:37 ` [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
@ 2014-05-06 18:46   ` Yuyang Du
  2014-05-15 14:50     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Yuyang Du @ 2014-05-06 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri

> The general code structure is an immediate no go. We're not going to
> bolt on anything like this.

Could you please detail a little bit about general code structure?

Thank you all the same,
Yuyang

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-06 18:46   ` Yuyang Du
@ 2014-05-15 14:50     ` Peter Zijlstra
  2014-05-18 19:16       ` Yuyang Du
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-15 14:50 UTC (permalink / raw)
  To: Yuyang Du
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri

On Wed, May 07, 2014 at 02:46:37AM +0800, Yuyang Du wrote:
> > The general code structure is an immediate no go. We're not going to
> > bolt on anything like this.
> 
> Could you please detail a little bit about general code structure?

So I should have just deleted all patches, for none of them has a
changelog.

So all this cc crap only hooks into and modifies fair.c behaviour. There
is absolutely no reason it should live anywhere else except fair.c

Secondly, the very last thing we need is more CONFIG_ goo, and you
sprinkle #ifdef around like it was gold dust.

Thirdly, wth is wrong with the current per-task runtime accounting and
why can't you extend/adapt that instead of duplicating the lot.

Fourthly, I'm _never_ going to merge anything that hijacks the load
balancer and does some random other thing. There's going to be a single
load-balancer full stop.

Many people have expressed interest in a packing balancer (vs the
spreading we currently default to). Some have even done patches.
At the same time it seems very difficult to agree on _when_ packing
makes sense. That said, when we do packing we should do it driven by the
topology and policy, not by some compile time option.

Lastly, if you'd done your homework and actually read some of the
threads on the subject from say the past two years, you'd know pretty
much all that already.

I'm not here to endlessly repeat myself and waste time staring at
unchangelogged patches.

Anyway, there might or might not be useful ideas in there.. but its very
hard to tell one way or another.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency
  2014-05-15 14:50     ` Peter Zijlstra
@ 2014-05-18 19:16       ` Yuyang Du
  0 siblings, 0 replies; 17+ messages in thread
From: Yuyang Du @ 2014-05-18 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rafael.j.wysocki, linux-kernel, linux-pm, arjan.van.de.ven,
	len.brown, alan.cox, mark.gross, morten.rasmussen,
	vincent.guittot, rajeev.d.muralidhar, vishwesh.m.rudramuni,
	nicole.chalhoub, ajaya.durg, harinarayanan.seshadri

> So I should have just deleted all patches, for none of them has a
> changelog.
> 

It is my bad not to include changelogs in the patches. The v2 has them, but I
should have written them from the start.

> So all this cc crap only hooks into and modifies fair.c behaviour. There
> is absolutely no reason it should live anywhere else except fair.c
> 
> Secondly, the very last thing we need is more CONFIG_ goo, and you
> sprinkle #ifdef around like it was gold dust.
> 

Agreed. I will change these.

> Thirdly, wth is wrong with the current per-task runtime accounting and
> why can't you extend/adapt that instead of duplicating the lot.
> 

Sure. As you and Vincent said, CC will ride on the existing per-task tracking
code instead of duplicating it.

> Fourthly, I'm _never_ going to merge anything that hijacks the load
> balancer and does some random other thing. There's going to be a single
> load-balancer full stop.
> 
> Many people have expressed interest in a packing balancer (vs the
> spreading we currently default to). Some have even done patches.
> At the same time it seems very difficult to agree on _when_ packing
> makes sense. That said, when we do packing we should do it driven by the
> topology and policy, not by some compile time option.
>

I will make "Workload Consolidation" driven by topology and policy. Essentially
it already is, but the code is not completely clean in that regard yet.

> Lastly, if you'd done your homework and actually read some of the
> threads on the subject from say the past two years, you'd know pretty
> much all that already.
> 
> I'm not here to endlessly repeat myself and waste time staring at
> unchangelogged patches.
> 

This will not happen again.

> Anyway, there might or might not be useful ideas in there.. but its very
> hard to tell one way or another.

I think the above is mostly about the "amenability" of the code to the
scheduler. Apparently, I am not doing it right, and will send another version
that is easier to review. Thanks for your time.

Yuyang

^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages:
2014-05-05  0:02 [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 01/12 v1] CONFIG for " Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 02/12 v1] Init " Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 03/12 v1] CPU ConCurrency calculation Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 04/12 v1] CPU ConCurrency collecting in: Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 05/12 v1] CONFIG for Workload Consolidation Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 06/12 v1] Attach CPU topology Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 07/12 v1] CPU ConCurrency API for Workload Consolidation Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 08/12 v1] Intercept wakeup/fork/exec balance Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 09/12 v1] Intercept idle balance Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 10/12 v1] Intercept periodic nohz " Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 11/12 v1] Intercept periodic load balance Yuyang Du
2014-05-05  0:02 ` [RFC PATCH 12/12 v1] Intercept RT provocatively Yuyang Du
2014-05-05  9:37 ` [RFC PATCH 00/12 v1] A new CPU load metric for power-efficient scheduler: CPU ConCurrency Peter Zijlstra
2014-05-06 18:46   ` Yuyang Du
2014-05-15 14:50     ` Peter Zijlstra
2014-05-18 19:16       ` Yuyang Du
