public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/3] power aware scheduling
@ 2012-11-06 13:09 Alex Shi
  2012-11-06 13:09 ` [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface Alex Shi
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-06 13:09 UTC (permalink / raw)
  To: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx
  Cc: gregkh, andre.przywara, alex.shi, rjw, paul.gortmaker, akpm,
	paulmck, linux-kernel, cl, pjt

This patchset implements my previous power aware scheduling proposal:
https://lkml.org/lkml/2012/8/13/139

It is based on the tip/master tree.
The 2nd patch also reuses much of Suresh's old code, and I am not sure
how best to give him credit for it.
Suresh, would you like to add a 'Signed-off-by' or anything else you
think is needed?

Any comments are appreciated!

Regards!
Alex


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-06 13:09 [RFC PATCH 0/3] power aware scheduling Alex Shi
@ 2012-11-06 13:09 ` Alex Shi
  2012-11-06 13:48   ` Greg KH
  2012-11-06 15:20   ` Luming Yu
  2012-11-06 13:09 ` [RFC PATCH 2/3] sched: power aware load balance, Alex Shi
  2012-11-06 13:09 ` [RFC PATCH 3/3] sched: add power aware scheduling in fork/exec/wake Alex Shi
  2 siblings, 2 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-06 13:09 UTC (permalink / raw)
  To: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx
  Cc: gregkh, andre.przywara, alex.shi, rjw, paul.gortmaker, akpm,
	paulmck, linux-kernel, cl, pjt

This patch adds the power aware scheduler knob to sysfs:

$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving

$cat /sys/devices/system/cpu/sched_policy/current_sched_policy
powersaving

The sched policy in use is 'powersaving'. The user can change the
policy with the 'echo' command:
 echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy

Power aware scheduling behaves differently according to the
chosen policy:

performance: the current scheduling behaviour; try to spread tasks
		across more CPU sockets or cores.
powersaving: pack tasks into a sched group until the group's
		nr_running reaches its group_weight.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 21 +++++++
 drivers/base/cpu.c                                 |  2 +
 include/linux/cpu.h                                |  2 +
 kernel/sched/fair.c                                | 68 +++++++++++++++++++++-
 kernel/sched/sched.h                               |  5 ++
 5 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..1909d3e 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,27 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
+		/sys/devices/system/cpu/sched_policy/available_sched_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	Interface for showing and setting the CFS scheduler policy.
+
+		available_sched_policy shows the 2 policies that are
+		currently available: performance and powersaving.
+		current_sched_policy shows the current scheduler policy;
+		the user can change the policy by writing to this file.
+
+		The policy decides how the CFS scheduler distributes tasks
+		among CPU units when there are fewer tasks than logical CPUs.
+
+		performance: try to spread tasks across more CPU sockets
+		and more CPU cores.
+
+		powersaving: try to pack tasks onto the same core or the same
+		CPU until the number of running tasks exceeds the number of
+		logical CPUs in the core or socket.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 6345294..5f6a573 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
 		panic("Failed to register CPU subsystem");
 
 	cpu_dev_register_generic();
+
+	create_sysfs_sched_policy_group(cpu_subsys.dev_root);
 }
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..b2e9265 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -36,6 +36,8 @@ extern void cpu_remove_dev_attr(struct device_attribute *attr);
 extern int cpu_add_dev_attr_group(struct attribute_group *attrs);
 extern void cpu_remove_dev_attr_group(struct attribute_group *attrs);
 
+extern int create_sysfs_sched_policy_group(struct device *dev);
+
 #ifdef CONFIG_HOTPLUG_CPU
 extern void unregister_cpu(struct cpu *cpu);
 extern ssize_t arch_cpu_probe(const char *, size_t);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2cebc81..dedc576 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6383,7 +6383,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
@@ -6399,6 +6398,73 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, "performance powersaving\n");
+}
+
+static ssize_t show_current_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	return 0;
+}
+
+static ssize_t set_sched_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	int ret;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_policy = SCHED_POLICY_POWERSAVING;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
+						set_sched_policy);
+
+static DEVICE_ATTR(available_sched_policy, 0444,
+		show_available_sched_policy, NULL);
+
+static struct attribute *sched_policy_default_attrs[] = {
+	&dev_attr_current_sched_policy.attr,
+	&dev_attr_available_sched_policy.attr,
+	NULL
+};
+static struct attribute_group sched_policy_attr_group = {
+	.attrs = sched_policy_default_attrs,
+	.name = "sched_policy",
+};
+
+int __init create_sysfs_sched_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
+}
+#endif /* CONFIG_SYSFS */
+
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 508e77e..9a6e06c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -9,6 +9,11 @@
 
 extern __read_mostly int scheduler_running;
 
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+
+extern int __read_mostly sched_policy;
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-06 13:09 [RFC PATCH 0/3] power aware scheduling Alex Shi
  2012-11-06 13:09 ` [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface Alex Shi
@ 2012-11-06 13:09 ` Alex Shi
  2012-11-06 19:51   ` Andrew Morton
  2012-11-07  4:37   ` Preeti Murthy
  2012-11-06 13:09 ` [RFC PATCH 3/3] sched: add power aware scheduling in fork/exec/wake Alex Shi
  2 siblings, 2 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-06 13:09 UTC (permalink / raw)
  To: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx
  Cc: gregkh, andre.przywara, alex.shi, rjw, paul.gortmaker, akpm,
	paulmck, linux-kernel, cl, pjt

This patch enables power aware considerations in load balancing.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, packing tasks onto fewer sched_groups will reduce power consumption

The first assumption makes the performance policy take over scheduling
when the system is busy.
The second assumption makes power aware scheduling try to pack
dispersed tasks into fewer groups until those groups are full of tasks.

This patch reuses a lot of Suresh's power saving load balance code.
The general enabling logic is now:
1, Collect power aware scheduler statistics during performance load
balance statistics collection.
2, If the domain is eligible for power load balancing, do it and skip
performance load balancing; otherwise do performance load balancing.

Tested on my 2 sockets * 4 cores * HT NHM EP machine
and a 2 sockets * 8 cores * HT SNB EP machine.
In the following check, when I is 2/4/8/16, all tasks are
packed to run on a single core or a single socket.

$for ((i=0; i < I; i++)) ; do while true; do : ; done & done

Power consumption was checked with a power meter on the NHM EP.
	powersaving     performance
I = 2   148w            160w
I = 4   175w            181w
I = 8   207w            224w
I = 16  324w            324w

On a SNB laptop(4 cores *HT)
	powersaving     performance
I = 2   28w             35w
I = 4   38w             52w
I = 6   44w             54w
I = 8   56w             56w

On the SNB EP machine, when I = 16, the powersaving policy saved more
than 100 Watts.

Also tested specjbb2005 with jrockit, and kbuild; their peak performance
shows no clear change with the powersaving policy on any of the machines.
Only specjbb2005 with openjdk drops about 2% on the NHM EP machine with
the powersaving policy.

This patch is a bit long, but it seems hard to split it into smaller
pieces.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 153 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dedc576..acc8b41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3930,6 +3930,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	int			power_lb;  /* if powersaving lb needed */
+	int			perf_lb;   /* if performance lb needed */
 
 	struct rq *		(*find_busiest_queue)(struct lb_env *,
 						      struct sched_group *);
@@ -4356,6 +4358,16 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned long	sd_capacity;	/* capacity of this domain */
+	unsigned long	sd_nr_running;	/* Nr running of this domain */
+	struct sched_group *group_min; /* Least loaded group in sd */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned long leader_nr_running; /* Nr running of group_leader */
+	unsigned long min_nr_running; /* Nr running of group_min */
+
 #ifdef CONFIG_SCHED_NUMA
 	struct sched_group *numa_group; /* group which has offnode_tasks */
 	unsigned long numa_group_weight;
@@ -4387,6 +4399,123 @@ struct sg_lb_stats {
 };
 
 /**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_nr_running = ULONG_MAX;
+	sds->leader_nr_running = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold;
+
+	if (!env->power_lb)
+		return;
+
+	threshold = sgs->group_weight;
+
+	/*
+	 * If the local group is idle or fully loaded, there is
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (sds->this_nr_running == threshold ||
+				!sds->this_nr_running))
+		env->power_lb = 0;
+
+	/* Do performance load balance if any group overload */
+	if (sgs->sum_nr_running > threshold) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	sds->sd_nr_running += sgs->sum_nr_running;
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->sum_nr_running < sds->min_nr_running) ||
+	    (sgs->sum_nr_running == sds->min_nr_running &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_nr_running = sgs->sum_nr_running;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->sum_nr_running + 1 > threshold)
+		return;
+
+	if (sgs->sum_nr_running > sds->leader_nr_running ||
+	    (sgs->sum_nr_running == sds->leader_nr_running &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_nr_running = sgs->sum_nr_running;
+	}
+}
+
+/**
+ * check_sd_power_lb_needed - Check if power aware load balancing is needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+
+static inline void check_sd_power_lb_needed(struct lb_env *env,
+					struct sd_lb_stats *sds)
+{
+	unsigned long threshold = env->sd->span_weight;
+	if (!env->power_lb)
+		return;
+
+	if (sds->sd_nr_running > threshold) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+	}
+}
+
+/**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
  * @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4850,6 +4979,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4899,8 +5029,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 
 		update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
+
+	check_sd_power_lb_needed(env, sds);
 }
 
 /**
@@ -5116,6 +5249,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -5222,7 +5368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (rq->nr_running == 0 ||
+			(!env->power_lb && capacity &&
+				rq->nr_running == 1 && wl > env->imbalance))
 			continue;
 
 		/*
@@ -5298,6 +5446,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.loop_break	    = sched_nr_migrate_break,
 		.cpus		    = cpus,
 		.find_busiest_queue = find_busiest_queue,
+		.power_lb	    = 1,
+		.perf_lb	    = 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -5330,7 +5480,8 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 ||
+		(busiest->nr_running == 1 && env.power_lb)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 3/3] sched: add power aware scheduling in fork/exec/wake
  2012-11-06 13:09 [RFC PATCH 0/3] power aware scheduling Alex Shi
  2012-11-06 13:09 ` [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface Alex Shi
  2012-11-06 13:09 ` [RFC PATCH 2/3] sched: power aware load balance, Alex Shi
@ 2012-11-06 13:09 ` Alex Shi
  2 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-06 13:09 UTC (permalink / raw)
  To: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx
  Cc: gregkh, andre.przywara, alex.shi, rjw, paul.gortmaker, akpm,
	paulmck, linux-kernel, cl, pjt

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a CPU from the busiest group that still has capacity. The
trade-off is the added power aware statistics collection for the group
seeking, but since the collection only happens when power scheduling is
eligible, there is not much performance impact.

hackbench test results show no clear drop even with the powersaving
policy.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 233 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 159 insertions(+), 74 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acc8b41..902ef5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3370,12 +3370,149 @@ static int numa_select_node_cpu(struct task_struct *p, int node)
 #endif /* CONFIG_SCHED_NUMA */
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest; /* Busiest group in this sd */
+	struct sched_group *this;  /* Local group in this sd */
+	unsigned long total_load;  /* Total load of all groups in sd */
+	unsigned long total_pwr;   /*	Total power of all groups in sd */
+	unsigned long avg_load;	   /* Average load across all groups in sd */
+
+	/** Statistics of this group */
+	unsigned long this_load;
+	unsigned long this_load_per_task;
+	unsigned long this_nr_running;
+	unsigned long this_has_capacity;
+	unsigned int  this_idle_cpus;
+
+	/* Statistics of the busiest group */
+	unsigned int  busiest_idle_cpus;
+	unsigned long max_load;
+	unsigned long busiest_load_per_task;
+	unsigned long busiest_nr_running;
+	unsigned long busiest_group_capacity;
+	unsigned long busiest_has_capacity;
+	unsigned int  busiest_group_weight;
+
+	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned long	sd_capacity;	/* capacity of this domain */
+	unsigned long	sd_nr_running;	/* Nr running of this domain */
+	struct sched_group *group_min; /* Least loaded group in sd */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned long leader_nr_running; /* Nr running of group_leader */
+	unsigned long min_nr_running; /* Nr running of group_min */
+#ifdef CONFIG_SCHED_NUMA
+	struct sched_group *numa_group; /* group which has offnode_tasks */
+	unsigned long numa_group_weight;
+	unsigned long numa_group_running;
+
+	unsigned long this_offnode_running;
+	unsigned long this_onnode_running;
+#endif
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ * and task rq selection
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_nr_running; /* Nr tasks running in the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long group_capacity;
+	unsigned long idle_cpus;
+	unsigned long group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_SCHED_NUMA
+	unsigned long numa_offnode_weight;
+	unsigned long numa_offnode_running;
+	unsigned long numa_onnode_running;
+#endif
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+static void get_sg_power_stats(struct sched_group *group,
+			struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+		sgs->sum_nr_running += rq->nr_running;
+	}
+
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+						SCHED_POWER_SCALE);
+	if (!sgs->group_capacity)
+		sgs->group_capacity = fix_small_capacity(sd, group);
+	sgs->group_weight = group->group_weight;
+}
+
+static void get_sd_power_stats(struct sched_domain *sd,
+					struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	long sd_min_delta = LONG_MAX;
+
+	group = sd->groups;
+	do {
+		long g_delta;
+		unsigned long threshold;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		if (sched_policy == SCHED_POLICY_POWERSAVING)
+			threshold = sgs.group_weight;
+		else
+			threshold = sgs.group_capacity;
+		g_delta = threshold - sgs.sum_nr_running;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_nr_running += sgs.sum_nr_running;
+		sds->total_pwr += group->sgp->power;
+	} while  (group = group->next, group != sd->groups);
+
+	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+						SCHED_POWER_SCALE);
+}
+
+static inline int get_sd_sched_policy(struct sched_domain *sd,
+					struct sd_lb_stats *sds)
+{
+	int policy = SCHED_POLICY_PERFORMANCE;
+
+	if (sched_policy != SCHED_POLICY_PERFORMANCE) {
+		memset(sds, 0, sizeof(*sds));
+		get_sd_power_stats(sd, sds);
+
+		if (sd->span_weight > sds->sd_nr_running)
+			policy = SCHED_POLICY_POWERSAVING;
+	}
+	return policy;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
  *
- * Balance, ie. select the least loaded group.
- *
  * Returns the target CPU number, or the same CPU if no balancing is needed.
  *
  * preempt must be disabled.
@@ -3384,12 +3521,14 @@ static int
 select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sd_lb_stats sds;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 	int node = tsk_home_node(p);
+	int policy = sched_policy;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -3412,6 +3551,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 
 			new_cpu = cpu = node_cpu;
 			sd = per_cpu(sd_node, cpu);
+			policy = get_sd_sched_policy(sd, &sds);
 			goto pick_idlest;
 		}
 
@@ -3445,8 +3585,12 @@ find_sd:
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+			policy = get_sd_sched_policy(sd, &sds);
+			if (policy != SCHED_POLICY_PERFORMANCE)
+				break;
+		}
 	}
 
 	if (affine_sd) {
@@ -3460,7 +3604,7 @@ find_sd:
 pick_idlest:
 	while (sd) {
 		int load_idx = sd->forkexec_idx;
-		struct sched_group *group;
+		struct sched_group *group = NULL;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -3471,7 +3615,12 @@ pick_idlest:
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
 
-		group = find_idlest_group(sd, p, cpu, load_idx);
+		if (policy != SCHED_POLICY_PERFORMANCE)
+			group = sds.group_leader;
+
+		if (!group)
+			group = find_idlest_group(sd, p, cpu, load_idx);
+
 		if (!group) {
 			sd = sd->child;
 			continue;
@@ -3491,8 +3640,11 @@ pick_idlest:
 		for_each_domain(cpu, tmp) {
 			if (weight <= tmp->span_weight)
 				break;
-			if (tmp->flags & sd_flag)
+			if (tmp->flags & sd_flag) {
 				sd = tmp;
+				if (policy != SCHED_POLICY_PERFORMANCE)
+					policy = get_sd_sched_policy(sd, &sds);
+			}
 		}
 		/* while loop will break here if sd == NULL */
 	}
@@ -4330,73 +4482,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-
-	/* Variables of power aware scheduling */
-	unsigned long	sd_capacity;	/* capacity of this domain */
-	unsigned long	sd_nr_running;	/* Nr running of this domain */
-	struct sched_group *group_min; /* Least loaded group in sd */
-	struct sched_group *group_leader; /* Group which relieves group_min */
-	unsigned long min_load_per_task; /* load_per_task in group_min */
-	unsigned long leader_nr_running; /* Nr running of group_leader */
-	unsigned long min_nr_running; /* Nr running of group_min */
-
-#ifdef CONFIG_SCHED_NUMA
-	struct sched_group *numa_group; /* group which has offnode_tasks */
-	unsigned long numa_group_weight;
-	unsigned long numa_group_running;
-
-	unsigned long this_offnode_running;
-	unsigned long this_onnode_running;
-#endif
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
-#ifdef CONFIG_SCHED_NUMA
-	unsigned long numa_offnode_weight;
-	unsigned long numa_offnode_running;
-	unsigned long numa_onnode_running;
-#endif
-};
 
 /**
  * init_sd_lb_power_stats - Initialize power savings statistics for
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-06 13:09 ` [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface Alex Shi
@ 2012-11-06 13:48   ` Greg KH
  2012-11-07 12:27     ` Alex Shi
  2012-11-06 15:20   ` Luming Yu
  1 sibling, 1 reply; 16+ messages in thread
From: Greg KH @ 2012-11-06 13:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	cl, pjt

On Tue, Nov 06, 2012 at 09:09:57PM +0800, Alex Shi wrote:
> This patch adds the power aware scheduler knob to sysfs:
> 
> $cat /sys/devices/system/cpu/sched_policy/available_sched_policy
> performance powersaving
> 
> $cat /sys/devices/system/cpu/sched_policy/current_sched_policy
> powersaving
> 
> The sched policy in use is 'powersaving'. The user can change the
> policy with the 'echo' command:
>  echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> 
> Power aware scheduling behaves differently according to the
> chosen policy:
> 
> performance: the current scheduling behaviour; try to spread tasks
> 		across more CPU sockets or cores.
> powersaving: pack tasks into a sched group until the group's
> 		nr_running reaches its group_weight.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu | 21 +++++++
>  drivers/base/cpu.c                                 |  2 +
>  include/linux/cpu.h                                |  2 +
>  kernel/sched/fair.c                                | 68 +++++++++++++++++++++-
>  kernel/sched/sched.h                               |  5 ++
>  5 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 6943133..1909d3e 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -53,6 +53,27 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
>  		the system.  Information writtento the file to remove CPU's
>  		is architecture specific.
>  
> +What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
> +		/sys/devices/system/cpu/sched_policy/available_sched_policy
> +Date:		Oct 2012
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:	Interface for showing and setting the CFS scheduler policy.
> +
> +		available_sched_policy shows the 2 policies that are
> +		currently available: performance and powersaving.
> +		current_sched_policy shows the current scheduler policy;
> +		the user can change the policy by writing to this file.
> +
> +		The policy decides how the CFS scheduler distributes tasks
> +		among CPU units when there are fewer tasks than logical CPUs.
> +
> +		performance: try to spread tasks across more CPU sockets
> +		and more CPU cores.
> +
> +		powersaving: try to pack tasks onto the same core or the same
> +		CPU until the number of running tasks exceeds the number of
> +		logical CPUs in the core or socket.
> +
>  What:		/sys/devices/system/cpu/cpu#/node
>  Date:		October 2009
>  Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index 6345294..5f6a573 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
>  		panic("Failed to register CPU subsystem");
>  
>  	cpu_dev_register_generic();
> +
> +	create_sysfs_sched_policy_group(cpu_subsys.dev_root);

Are you sure you didn't just race with userspace, creating the sysfs
files after the device was created and announced to userspace?

If so, you need to fix this :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-06 13:09 ` [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface Alex Shi
  2012-11-06 13:48   ` Greg KH
@ 2012-11-06 15:20   ` Luming Yu
  2012-11-07 13:03     ` Alex Shi
  1 sibling, 1 reply; 16+ messages in thread
From: Luming Yu @ 2012-11-06 15:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, cl, pjt

On Tue, Nov 6, 2012 at 8:09 AM, Alex Shi <alex.shi@intel.com> wrote:
> This patch add the power aware scheduler knob into sysfs:

The problem is that users don't know how to use this knob.

Based on what data could people select one policy that is clearly
better than another?

The "packing small tasks" approach could be better and more intelligent:
http://thread.gmane.org/gmane.linux.kernel/1348522

Just some random thoughts, as I haven't had a chance to look into the
details of that patch set yet. But to me, we need to exploit the fact
that we could automatically bind a group of tasks to the minimal set of
CPUs that provides CPU cycles comparable to the "cpu-run-average" the
task group would get in a pure CFS situation over a given period, until
we see that more CPUs are needed. Then we can keep the required CPU
power available to the corresponding workload while leaving all other
CPUs in power-saving mode. The problem is that a pattern suggested by
historical data can become invalid in the future, at which point we need
more CPUs. I think this is what we need to know before making the
spread-or-not-spread decision: if spreading would not help the
CPU-run-average, we don't need to waste CPU power. I don't know how hard
that would be, but I'm pretty sure a sysfs knob is harder.
:-) /l

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-06 13:09 ` [RFC PATCH 2/3] sched: power aware load balance, Alex Shi
@ 2012-11-06 19:51   ` Andrew Morton
  2012-11-07 12:42     ` Alex Shi
  2012-11-07  4:37   ` Preeti Murthy
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2012-11-06 19:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, paulmck,
	linux-kernel, cl, pjt

On Tue,  6 Nov 2012 21:09:58 +0800
Alex Shi <alex.shi@intel.com> wrote:

> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
> 
> Checking the power consuming with a powermeter on the NHM EP.
> 	powersaving     performance
> I = 2   148w            160w
> I = 4   175w            181w
> I = 8   207w            224w
> I = 16  324w            324w
> 
> On a SNB laptop(4 cores *HT)
> 	powersaving     performance
> I = 2   28w             35w
> I = 4   38w             52w
> I = 6   44w             54w
> I = 8   56w             56w
> 
> On the SNB EP machine, when I = 16, power saved more than 100 Watts.

Confused.  According to the above table, at I=16 the EP machine saved 0
watts.  Typo in the data?


Also, that's a pretty narrow test - it's doing fork and exec at very
high frequency and things such as task placement decisions at process
startup might be affecting the results.  Also, the load will be quite
kernel-intensive, as opposed to the more typical userspace-intensive
loads.

So, please run a broader set of tests so we can see the effects?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-06 13:09 ` [RFC PATCH 2/3] sched: power aware load balance, Alex Shi
  2012-11-06 19:51   ` Andrew Morton
@ 2012-11-07  4:37   ` Preeti Murthy
  2012-11-07 13:27     ` Alex Shi
  1 sibling, 1 reply; 16+ messages in thread
From: Preeti Murthy @ 2012-11-07  4:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, cl, pjt, Viresh Kumar, Vaidyanathan Srinivasan

Hi Alex,

What I am concerned about in this patchset as Peter also
mentioned in the previous discussion of your approach
(https://lkml.org/lkml/2012/8/13/139)
is that:

1. Using the nr_running of two different sched groups to decide which one
becomes group_leader or group_min might not be the right approach, as it
can mislead us into thinking that a group running one task is less
loaded than a group running three tasks, even though the former task is
a CPU hogger.

2. Comparing the number of CPUs with the number of tasks running in a
sched group to decide whether the group is underloaded or overloaded
faces the same issue: the tasks might be short-running, not utilizing
the CPUs much.

I also feel that before we introduce another side to the scheduler
called 'power aware', why not try and see whether the current scheduler
itself can perform better? We have an opportunity in PJT's patches,
which can help the scheduler make more realistic load-balance decisions.
Also, since PJT's metric is a statistical one, I believe we could vary
it to let the scheduler do more or less rigorous power saving.

It is true, however, that this approach will not try to evacuate nearly
idle CPUs over to nearly full CPUs. That is definitely one of the
benefits of your patch in terms of power savings, but I believe your
patch is not using the right metric to decide that.

IMHO, the approach towards a power aware scheduler should take the following steps:

1. Make use of PJT's per-entity load tracking metric to let the
scheduler make more intelligent load-balancing decisions. Test the
performance and power-saving numbers.

2. If the above shows some characteristic change in behaviour over the
earlier scheduler, it should lean either towards power saving or towards
performance. If it leans towards one of them, try varying the
per-entity load calculation to see whether it can lean towards the other
behaviour. If it can, there you go: you have a knob to change between
policies right there!

3. If you don't get enough power savings with the above approach, then
add your patchset to evacuate nearly idle groups towards nearly busy
groups, but use PJT's metric to make the decision.

What do you think?

Regards
Preeti U Murthy
On Tue, Nov 6, 2012 at 6:39 PM, Alex Shi <alex.shi@intel.com> wrote:
> This patch enabled the power aware consideration in load balance.
>
> As mentioned in the power aware scheduler proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, shrink tasks on less sched_groups will reduce power consumption
>
> The first assumption make performance policy take over scheduling when
> system busy.
> The second assumption make power aware scheduling try to move
> disperse tasks into fewer groups until that groups are full of tasks.
>
> This patch reuse lots of Suresh's power saving load balance code.
> Now the general enabling logical is:
> 1, Collect power aware scheduler statistics with performance load
> balance statistics collection.
> 2, if domain is eligible for power load balance do it and forget
> performance load balance, else do performance load balance.
>
> Has tried on my 2 sockets * 4 cores * HT NHM EP machine.
> and 2 sockets * 8 cores * HT SNB EP machine.
> In the following checking, when I is 2/4/8/16, all tasks are
> shrank to run on single core or single socket.
>
> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>
> Checking the power consuming with a powermeter on the NHM EP.
>         powersaving     performance
> I = 2   148w            160w
> I = 4   175w            181w
> I = 8   207w            224w
> I = 16  324w            324w
>
> On a SNB laptop(4 cores *HT)
>         powersaving     performance
> I = 2   28w             35w
> I = 4   38w             52w
> I = 6   44w             54w
> I = 8   56w             56w
>
> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>
> Also tested the specjbb2005 with jrockit, kbuild, their peak performance
> has no clear change with powersaving policy on all machines. Just
> specjbb2005 with openjdk has about 2% drop on NHM EP machine with
> powersaving policy.
>
> This patch seems a bit long, but seems hard to split smaller.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-06 13:48   ` Greg KH
@ 2012-11-07 12:27     ` Alex Shi
  2012-11-07 14:41       ` Greg KH
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Shi @ 2012-11-07 12:27 UTC (permalink / raw)
  To: Greg KH
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	cl, pjt

>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>> index 6345294..5f6a573 100644
>> --- a/drivers/base/cpu.c
>> +++ b/drivers/base/cpu.c
>> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
>>  		panic("Failed to register CPU subsystem");
>>  
>>  	cpu_dev_register_generic();
>> +
>> +	create_sysfs_sched_policy_group(cpu_subsys.dev_root);
> 
> Are you sure you didn't just race with userspace, creating the sysfs
> files after the device was created and announced to userspace?

Sorry, I don't fully get you. Is sysfs only announced to userspace
at 'mount -t sysfs sysfs /sys'?

The old power-saving interface, sched_smt_power_savings, was also
created here, and cpu_dev_init is called early, before the do_initcalls
stage where the cpuidle/cpufreq sysfs entries are initialized.

Do you mean this line needs to be initialized as a core_initcall?

Thanks for comments! :)
> 
> If so, you need to fix this :)
> 
> thanks,
> 
> greg k-h
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-06 19:51   ` Andrew Morton
@ 2012-11-07 12:42     ` Alex Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-07 12:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, paulmck,
	linux-kernel, cl, pjt

On 11/07/2012 03:51 AM, Andrew Morton wrote:
> On Tue,  6 Nov 2012 21:09:58 +0800
> Alex Shi <alex.shi@intel.com> wrote:
> 
>> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>>
>> Checking the power consuming with a powermeter on the NHM EP.
>> 	powersaving     performance
>> I = 2   148w            160w
>> I = 4   175w            181w
>> I = 8   207w            224w
>> I = 16  324w            324w
>>
>> On a SNB laptop(4 cores *HT)
>> 	powersaving     performance
>> I = 2   28w             35w
>> I = 4   38w             52w
>> I = 6   44w             54w
>> I = 8   56w             56w
>>
>> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
> 
> Confused.  According to the above table, at I=16 the EP machine saved 0
> watts.  Typo in the data?

Not a typo: the LCPU number of the NHM EP machine is 16, so with I = 16
the powersaving policy effectively has nothing to do. That is how the
patch is designed for the race-to-idle assumption.

The results look the same with the third patch (for fork/exec/wakeup)
applied; they are shown here because they come from this patch.

> 
> 
> Also, that's a pretty narrow test - it's doing fork and exec at very
> high frequency and things such as task placement decisions at process
> startup might be affecting the results.  Also, the load will be quite
> kernel-intensive, as opposed to the more typical userspace-intensive
> loads.

Sorry, why do you think it keeps doing fork/exec? It just generates
several 'bash' tasks to burn CPU, without ongoing fork/exec.

With I = 8, on my 32-LCPU SNB EP machine, there are no do_fork calls
in 5 seconds:

$ sudo perf stat -e probe:* -a sleep 5
 Performance counter stats for 'sleep 5':
           3 probe:do_execve           [100.00%]
           0 probe:do_fork             [100.00%]

And it is not kernel-intensive; it runs almost entirely at user level.

'top' output: 25.0%us vs 0.0%sy

Tasks: 319 total,   9 running, 310 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.0%us,  0.0%sy,  0.0%ni, 74.5%id,  0.4%wa,  0.1%hi,  0.0%si,
0.0%st
...

> So, please run a broader set of tests so we can see the effects?
> 

Really, I have no more ideas for suitable benchmarks.

I just tried kbuild -j 16 on the 32-LCPU SNB EP: power dropped by only
about 10%, but compile time increased by about 15%. It seems that when
the task number fluctuates around the powersaving criteria number, that
just causes trouble.




-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-06 15:20   ` Luming Yu
@ 2012-11-07 13:03     ` Alex Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-07 13:03 UTC (permalink / raw)
  To: Luming Yu
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, cl, pjt

On 11/06/2012 11:20 PM, Luming Yu wrote:
> On Tue, Nov 6, 2012 at 8:09 AM, Alex Shi <alex.shi@intel.com> wrote:
>> This patch add the power aware scheduler knob into sysfs:
> 
> The problem is user doesn't know how to use this knob.
> 
> Based on what data, people could select one policy which could be surely
> better than another?
> 
> "Packing small tasks" approach could be better and more intelligent.
> http://thread.gmane.org/gmane.linux.kernel/1348522

It does not conflict with this patchset. :)
> 
> Just some random thoughts, as I didn't have chance to look into the
> details of that patch set yet. But to me, we need to exploit the fact
> that we could automatically bind a group of tasks on minimal set of
> CPUs that can provide sufficient CPU cycles that are comparable to
> a"cpu- run-average" that the task group can get in pure CFS situation
> in a given period, until we see more CPU is needed.Then we probably
> can maintain required CPU power available to the corresponding
> workload, while leaving all other CPUs into power saving mode. The
> problem is historical data suggested pattern could become invalid in
> future, then we need more CPUs in future..I think this is the point we
> need to know before spread or not-spread decision ...if spread would
> not help CPU-run-average ,we don't need waste CPU power..but I don't
> know how hard it could be. But I'm pretty sure sysfs knob is harder.
> :-) /l
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-07  4:37   ` Preeti Murthy
@ 2012-11-07 13:27     ` Alex Shi
  2012-11-11 18:49       ` Preeti Murthy
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Shi @ 2012-11-07 13:27 UTC (permalink / raw)
  To: Preeti Murthy
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, cl, pjt, Viresh Kumar, Vaidyanathan Srinivasan

On 11/07/2012 12:37 PM, Preeti Murthy wrote:
> Hi Alex,
> 
> What I am concerned about in this patchset as Peter also
> mentioned in the previous discussion of your approach
> (https://lkml.org/lkml/2012/8/13/139)
> is that:
> 
> 1.Using nr_running of two different sched groups to decide which one
> can be group_leader or group_min might not be be the right approach,
> as this might mislead us to think that a group running one task is less
> loaded than the group running three tasks although the former task is
> a cpu hogger.
> 
> 2.Comparing the number of cpus with the number of tasks running in a sched
> group to decide if the group is underloaded or overloaded again faces
> the same issue.The tasks might be short running,not utilizing cpu much.

Yes, maybe the task count is not the best indicator. But as a first
step, it can show that the proposal is on a correct path and worth
pursuing further. Considering that the old power-saving implementation
also judged by the number of tasks, and given my testing results here,
it may still be an option.
> 
> I also feel before we introduce another side to the scheduler called
> 'power aware',why not try and see if the current scheduler itself can
> perform better? We have an opportunity in terms of PJT's patches which
> can help scheduler make more realistic decisions in load balance.Also
> since PJT's metric is a statistical one,I believe we could vary it to
> allow scheduler to do more rigorous or less rigorous power savings.

I will study PJT's approach.
Actually, the current patch set is also a kind of load-balance
modification, right? :)
> 
> It is true however that this approach will not try and evacuate nearly idle
> cpus over to nearly full cpus.That is definitely one of the benefits of your
> patch,in terms of power savings,but I believe your patch is not making use
> of the right metric to decide that.

If one sched group has just one task, and another group has just one
idle LCPU, my patch will definitely pull the task over to the nearly
full sched group. So I didn't understand what you mean by 'will not try
and evacuate nearly idle cpus over to nearly full cpus'.


> 
> IMHO,the appraoch towards power aware scheduler should take the following steps:
> 
> 1.Make use of PJT's per-entity-load tracking metric to allow scheduler to make
> more intelligent decisions in load balancing.Test the performance and power save
> numbers.
> 
> 2.If the above shows some characteristic change in behaviour over the earlier
> scheduler,it should be either towards power save or towards performance.If found
> positive towards one of them, try varying the calculation of
> per-entity-load to see
> if it can lean towards the other behaviour.If it can,then there you
> go,you have a
> knob to change between policies right there!
> 
> 3.If you don't get enough power savings with the above approach then
> add your patchset
> to evacuate nearly idle towards nearly busy groups,but by using PJT's metric to
> make the decision.
> 
> What do you think?

I will consider this, thanks!
> 
> Regards
> Preeti U Murthy
> On Tue, Nov 6, 2012 at 6:39 PM, Alex Shi <alex.shi@intel.com> wrote:
>> This patch enabled the power aware consideration in load balance.
>>
>> As mentioned in the power aware scheduler proposal, Power aware
>> scheduling has 2 assumptions:
>> 1, race to idle is helpful for power saving
>> 2, shrink tasks on less sched_groups will reduce power consumption
>>
>> The first assumption make performance policy take over scheduling when
>> system busy.
>> The second assumption make power aware scheduling try to move
>> disperse tasks into fewer groups until that groups are full of tasks.
>>
>> This patch reuse lots of Suresh's power saving load balance code.
>> Now the general enabling logical is:
>> 1, Collect power aware scheduler statistics with performance load
>> balance statistics collection.
>> 2, if domain is eligible for power load balance do it and forget
>> performance load balance, else do performance load balance.
>>
>> Has tried on my 2 sockets * 4 cores * HT NHM EP machine.
>> and 2 sockets * 8 cores * HT SNB EP machine.
>> In the following checking, when I is 2/4/8/16, all tasks are
>> shrank to run on single core or single socket.
>>
>> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>>
>> Checking the power consuming with a powermeter on the NHM EP.
>>         powersaving     performance
>> I = 2   148w            160w
>> I = 4   175w            181w
>> I = 8   207w            224w
>> I = 16  324w            324w
>>
>> On a SNB laptop(4 cores *HT)
>>         powersaving     performance
>> I = 2   28w             35w
>> I = 4   38w             52w
>> I = 6   44w             54w
>> I = 8   56w             56w
>>
>> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>>
>> Also tested the specjbb2005 with jrockit, kbuild, their peak performance
>> has no clear change with powersaving policy on all machines. Just
>> specjbb2005 with openjdk has about 2% drop on NHM EP machine with
>> powersaving policy.
>>
>> This patch seems a bit long, but seems hard to split smaller.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-07 12:27     ` Alex Shi
@ 2012-11-07 14:41       ` Greg KH
  2012-11-08 14:40         ` Alex Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Greg KH @ 2012-11-07 14:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	cl, pjt

On Wed, Nov 07, 2012 at 08:27:17PM +0800, Alex Shi wrote:
> >> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> >> index 6345294..5f6a573 100644
> >> --- a/drivers/base/cpu.c
> >> +++ b/drivers/base/cpu.c
> >> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
> >>  		panic("Failed to register CPU subsystem");
> >>  
> >>  	cpu_dev_register_generic();
> >> +
> >> +	create_sysfs_sched_policy_group(cpu_subsys.dev_root);
> > 
> > Are you sure you didn't just race with userspace, creating the sysfs
> > files after the device was created and announced to userspace?
> 
> Sorry for don't fully get you. Is the sysfs announced to userspace
> just in 'mount -t sysfs sysfs /sys'?

No, when the struct device is registered with the driver core.

> The old powersaving interface: sched_smt_power_savings also
> created here. and cpu_dev_init was called early before do_initcalls
> which cpuidle/cpufreq sysfs were initialized.
> 
> Do you mean this line need to init as core_initcall?

No, you need to make this as an attribute group for the device, so the
driver core will create it automatically before it tells userspace that
the device is now present.

Use the default attribute groups and you should be fine.
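[As a sketch of what is being suggested here (kernel-internal code, not compilable on its own; the attribute names follow the patch, the wiring is illustrative): package the files as a named attribute group and hang it off the device's default groups, so the driver core creates the files before the device is announced to userspace.]

```c
/* Illustrative only: the dev_attr_* objects would come from DEVICE_ATTR()
 * definitions elsewhere in the patch. */
static struct attribute *sched_policy_attrs[] = {
	&dev_attr_current_sched_policy.attr,
	&dev_attr_available_sched_policy.attr,
	NULL,
};

static const struct attribute_group sched_policy_group = {
	.name  = "sched_policy",	/* creates the sched_policy/ subdir */
	.attrs = sched_policy_attrs,
};

/* Listed in the device's default groups (e.g. via its .groups pointer)
 * before device_add() runs, the files already exist by the time udev
 * sees the uevent -- no race with userspace. */
```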

Hope this helps,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface
  2012-11-07 14:41       ` Greg KH
@ 2012-11-08 14:40         ` Alex Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-08 14:40 UTC (permalink / raw)
  To: Greg KH
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	cl, pjt

On 11/07/2012 10:41 PM, Greg KH wrote:
> On Wed, Nov 07, 2012 at 08:27:17PM +0800, Alex Shi wrote:
>>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>>> index 6345294..5f6a573 100644
>>>> --- a/drivers/base/cpu.c
>>>> +++ b/drivers/base/cpu.c
>>>> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
>>>>  		panic("Failed to register CPU subsystem");
>>>>  
>>>>  	cpu_dev_register_generic();
>>>> +
>>>> +	create_sysfs_sched_policy_group(cpu_subsys.dev_root);
>>>
>>> Are you sure you didn't just race with userspace, creating the sysfs
>>> files after the device was created and announced to userspace?
>>
>> Sorry for don't fully get you. Is the sysfs announced to userspace
>> just in 'mount -t sysfs sysfs /sys'?
> 
> No, when the struct device is registered with the driver core.
> 
>> The old powersaving interface: sched_smt_power_savings also
>> created here. and cpu_dev_init was called early before do_initcalls
>> which cpuidle/cpufreq sysfs were initialized.
>>
>> Do you mean this line need to init as core_initcall?
> 
> No, you need to make this as an attribute group for the device, so the
> driver core will create it automatically before it tells userspace that
> the device is now present.
> 
> Use the default attribute groups and you should be fine.

Thanks a lot for explanation! :)

It seems there is a misunderstanding here. I just create a sysfs group;
no device is registered.
The code follows cpuidle's implementation:

$ ls /sys/devices/system/cpu/cpuidle/
current_driver  current_governor_ro

It still seems better to move the group creation into sched/fair.c
rather than here.
> 
> Hope this helps,
> 
> greg k-h
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-07 13:27     ` Alex Shi
@ 2012-11-11 18:49       ` Preeti Murthy
  2012-11-12  3:05         ` Alex Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Preeti Murthy @ 2012-11-11 18:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, cl, pjt, Viresh Kumar, Vaidyanathan Srinivasan

Hi Alex
I apologise for the delay in replying.

On Wed, Nov 7, 2012 at 6:57 PM, Alex Shi <alex.shi@intel.com> wrote:
> On 11/07/2012 12:37 PM, Preeti Murthy wrote:
>> Hi Alex,
>>
>> What I am concerned about in this patchset as Peter also
>> mentioned in the previous discussion of your approach
>> (https://lkml.org/lkml/2012/8/13/139)
>> is that:
>>
>> 1.Using nr_running of two different sched groups to decide which one
>> can be group_leader or group_min might not be be the right approach,
>> as this might mislead us to think that a group running one task is less
>> loaded than the group running three tasks although the former task is
>> a cpu hogger.
>>
>> 2.Comparing the number of cpus with the number of tasks running in a sched
>> group to decide if the group is underloaded or overloaded again faces
>> the same issue.The tasks might be short running,not utilizing cpu much.
>
> Yes, maybe nr task is not the best indicator. But as first step, it can
> approve the proposal is a correct path and worth to try more.
> Considering the old powersaving implement is also judge on nr tasks, and
> my testing result of this. It may be still a option.
Hmm, I will think about this and get back.
>>
>> I also feel before we introduce another side to the scheduler called
>> 'power aware',why not try and see if the current scheduler itself can
>> perform better? We have an opportunity in terms of PJT's patches which
>> can help scheduler make more realistic decisions in load balance.Also
>> since PJT's metric is a statistical one,I believe we could vary it to
>> allow scheduler to do more rigorous or less rigorous power savings.
>
> will study the PJT's approach.
> Actually, current patch set is also a kind of load balance modification,
> right? :)
It is true that this is a different approach; in fact we will require
this approach to do power savings, because PJT's patches introduce a new
'metric', not a new 'approach', in my opinion: one for smarter load
balancing, not power aware load balancing per se. So your patch is
surely a step towards power aware load balancing; I am just worried
about the metric used in it.
>>
>> It is true however that this approach will not try and evacuate nearly idle
>> cpus over to nearly full cpus.That is definitely one of the benefits of your
>> patch,in terms of power savings,but I believe your patch is not making use
>> of the right metric to decide that.
>
> If one sched group just has one task, and another group just has one
> LCPU idle, my patch definitely will pull the task to the nearly full
> sched group. So I didn't understand what you mean 'will not try and
> evacuate nearly idle cpus over to nearly full cpus'
No, by 'this approach' I meant the current load balancer integrated with
PJT's metric. Your approach does 'evacuate' the nearly idle CPUs over
to the nearly full CPUs.

Regards
Preeti U Murthy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 2/3] sched: power aware load balance,
  2012-11-11 18:49       ` Preeti Murthy
@ 2012-11-12  3:05         ` Alex Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-11-12  3:05 UTC (permalink / raw)
  To: Preeti Murthy
  Cc: rob, mingo, peterz, suresh.b.siddha, arjan, vincent.guittot, tglx,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, cl, pjt, Viresh Kumar, Vaidyanathan Srinivasan

On 11/12/2012 02:49 AM, Preeti Murthy wrote:
> Hi Alex
> I apologise for the delay in replying .

That's all right. I am often busy with other Intel tasks as well and
have no time to look at LKML. :)
> 
> On Wed, Nov 7, 2012 at 6:57 PM, Alex Shi <alex.shi@intel.com> wrote:
>> On 11/07/2012 12:37 PM, Preeti Murthy wrote:
>>> Hi Alex,
>>>
>>> What I am concerned about in this patchset as Peter also
>>> mentioned in the previous discussion of your approach
>>> (https://lkml.org/lkml/2012/8/13/139)
>>> is that:
>>>
>>> 1.Using nr_running of two different sched groups to decide which one
>>> can be group_leader or group_min might not be be the right approach,
>>> as this might mislead us to think that a group running one task is less
>>> loaded than the group running three tasks although the former task is
>>> a cpu hogger.
>>>
>>> 2.Comparing the number of cpus with the number of tasks running in a sched
>>> group to decide if the group is underloaded or overloaded again faces
>>> the same issue.The tasks might be short running,not utilizing cpu much.
>>
>> Yes, maybe nr task is not the best indicator. But as first step, it can
>> approve the proposal is a correct path and worth to try more.
>> Considering the old powersaving implement is also judge on nr tasks, and
>> my testing result of this. It may be still a option.
> Hmm.. will think about this and get back.
>>>
>>> I also feel before we introduce another side to the scheduler called
>>> 'power aware',why not try and see if the current scheduler itself can
>>> perform better? We have an opportunity in terms of PJT's patches which
>>> can help scheduler make more realistic decisions in load balance.Also
>>> since PJT's metric is a statistical one,I believe we could vary it to
>>> allow scheduler to do more rigorous or less rigorous power savings.
>>
>> will study the PJT's approach.
>> Actually, current patch set is also a kind of load balance modification,
>> right? :)
> It is true that this is a different approach,in fact we will require
> this approach
> to do power savings because PJT's patches introduce a new 'metric' and not a new
> 'approach' in my opinion, to do smarter load balancing,not power aware
> load balancing per say.So your patch is surely a step towards power
> aware lb.I am just worried about the metric used in it.
>>>
>>> It is true however that this approach will not try and evacuate nearly idle
>>> cpus over to nearly full cpus.That is definitely one of the benefits of your
>>> patch,in terms of power savings,but I believe your patch is not making use
>>> of the right metric to decide that.
>>
>> If one sched group just has one task, and another group just has one
>> LCPU idle, my patch definitely will pull the task to the nearly full
>> sched group. So I didn't understand what you mean 'will not try and
>> evacuate nearly idle cpus over to nearly full cpus'
> No, by 'this approach' I meant the current load balancer integrated with
> the PJT's metric.Your approach does 'evacuate' the nearly idle cpus
> over to the nearly full cpus..

Oh, a misunderstanding about 'this approach'. :) Anyway, we are all
clear about this now.


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-11-12  3:07 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-06 13:09 [RFC PATCH 0/3] power aware scheduling Alex Shi
2012-11-06 13:09 ` [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface Alex Shi
2012-11-06 13:48   ` Greg KH
2012-11-07 12:27     ` Alex Shi
2012-11-07 14:41       ` Greg KH
2012-11-08 14:40         ` Alex Shi
2012-11-06 15:20   ` Luming Yu
2012-11-07 13:03     ` Alex Shi
2012-11-06 13:09 ` [RFC PATCH 2/3] sched: power aware load balance, Alex Shi
2012-11-06 19:51   ` Andrew Morton
2012-11-07 12:42     ` Alex Shi
2012-11-07  4:37   ` Preeti Murthy
2012-11-07 13:27     ` Alex Shi
2012-11-11 18:49       ` Preeti Murthy
2012-11-12  3:05         ` Alex Shi
2012-11-06 13:09 ` [RFC PATCH 3/3] sched: add power aware scheduling in fork/exec/wake Alex Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox