* [RFC] Consider CPU idle state while choosing a new CPU
@ 2013-01-30 21:19 Sebastian Andrzej Siewior
2013-01-30 21:19 ` [RFC 1/2] cpuidle: trace state of the CPU Sebastian Andrzej Siewior
2013-01-30 21:19 ` [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state Sebastian Andrzej Siewior
0 siblings, 2 replies; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-01-30 21:19 UTC (permalink / raw)
To: linux-kernel; +Cc: Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra, tglx
The scheduler does not consider the power states of CPUs while selecting
an idle one. So it might happen that it puts load on a CPU which is
in a deeper power state than another idle one, which has higher
latency and power costs than pushing the load to a core which is not
in a deep power-saving state.
This patch series implements the infrastructure which allows the power
state to be considered in the selection of an idle CPU. The decision
function is off by default and can be enabled via a sysctl (the per-domain
prefer_lp knob added in patch 2).
This is just the basic infrastructure. Further modifications are
necessary to hand in proper state information from the various idle
implementations. It has been tested with a static test case plugged
into the cpuidle core.
Sebastian
* [RFC 1/2] cpuidle: trace state of the CPU
2013-01-30 21:19 [RFC] Consider CPU idle state while choosing a new CPU Sebastian Andrzej Siewior
@ 2013-01-30 21:19 ` Sebastian Andrzej Siewior
2013-01-31 5:21 ` Namhyung Kim
2013-01-30 21:19 ` [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state Sebastian Andrzej Siewior
1 sibling, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-01-30 21:19 UTC (permalink / raw)
To: linux-kernel
Cc: Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra, tglx,
Sebastian Andrzej Siewior
This patch adds an interface to cpuidle in order to retrieve the current
idle state of a given CPU. Zero means the CPU is not in an idle state, either
because a cpuidle driver is not available or because the CPU is busy
executing code. Values greater than zero indicate that the CPU is in some
kind of idle state; the larger the value, the deeper the idle state.
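As an illustration only (not part of this patch), a minimal sketch of a
consumer of the new interface; pick_shallower_cpu() is a hypothetical
helper and assumes both CPUs are already known to be idle:

/*
 * Hypothetical: of two idle CPUs, prefer the one in the shallower
 * idle state, i.e. the smaller cpuidle_get_state() value.
 */
static int pick_shallower_cpu(int a, int b)
{
        return cpuidle_get_state(a) <= cpuidle_get_state(b) ? a : b;
}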
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
drivers/cpuidle/cpuidle.c | 13 ++++++++++++-
include/linux/cpuidle.h | 2 ++
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index e1f6860..3594e0c 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -23,6 +23,7 @@
#include "cpuidle.h"
DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
+static DEFINE_PER_CPU(unsigned int, cpu_state);
DEFINE_MUTEX(cpuidle_lock);
LIST_HEAD(cpuidle_detected_devices);
@@ -40,13 +41,23 @@ void disable_cpuidle(void)
off = 1;
}
+int cpuidle_get_state(int cpu)
+{
+ return per_cpu(cpu_state, cpu);
+}
+
static int __cpuidle_register_device(struct cpuidle_device *dev);
static inline int cpuidle_enter(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
struct cpuidle_state *target_state = &drv->states[index];
- return target_state->enter(dev, drv, index);
+ int ret;
+
+ per_cpu(cpu_state, smp_processor_id()) = index + 1;
+ ret = target_state->enter(dev, drv, index);
+ per_cpu(cpu_state, smp_processor_id()) = 0;
+ return ret;
}
static inline int cpuidle_enter_tk(struct cpuidle_device *dev,
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 24cd1037..256baeb 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -155,6 +155,7 @@ extern int cpuidle_wrap_enter(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index,
int (*enter)(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index));
+extern int cpuidle_get_state(int cpu);
extern int cpuidle_play_dead(void);
extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
@@ -186,6 +187,7 @@ static inline int cpuidle_wrap_enter(struct cpuidle_device *dev,
int (*enter)(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index))
{ return -ENODEV; }
+static inline int cpuidle_get_state(int cpu) {return 0; }
static inline int cpuidle_play_dead(void) {return -ENODEV; }
#endif
--
1.7.10.4
* [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-30 21:19 [RFC] Consider CPU idle state while choosing a new CPU Sebastian Andrzej Siewior
2013-01-30 21:19 ` [RFC 1/2] cpuidle: trace state of the CPU Sebastian Andrzej Siewior
@ 2013-01-30 21:19 ` Sebastian Andrzej Siewior
2013-01-31 2:12 ` Michael Wang
1 sibling, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-01-30 21:19 UTC (permalink / raw)
To: linux-kernel
Cc: Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra, tglx,
Sebastian Andrzej Siewior
If a new CPU has to be chosen for a task, then the scheduler first selects
the group with the least load. This group is returned if its load is lower
than that of the group to which the task is currently assigned.
If there are several groups with completely idle CPU(s) (the CPU is in
an idle state like C1) then the first group is returned.
This patch extends this decision by considering the idle state of the CPU(s)
in the group: the first group with a CPU in the lowest idle state
wins (C1 is preferred over C2). If there is a CPU which is not in an idle
state (C0) but has no tasks assigned, then it is considered a valid target.
Should no CPU in an idle state be available, the load average is used as
a fallback.
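For illustration, a minimal sketch of the selection rule described above;
shallowest_idle_cpu() is a hypothetical helper, not part of the patch, and
assumes the encoding from patch 1 (0 means not idle, larger values mean
deeper states):

static int shallowest_idle_cpu(const struct cpumask *mask)
{
        int cpu, best = -1, best_state = INT_MAX;

        for_each_cpu(cpu, mask) {
                int state = cpuidle_get_state(cpu);

                /* A busy CPU (C0) with an empty run queue wins outright. */
                if (!state && idle_cpu(cpu))
                        return cpu;
                if (state && state < best_state) {
                        best_state = state;
                        best = cpu;
                }
        }
        return best;
}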
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 6 ++++--
kernel/sched/fair.c | 24 ++++++++++++++++++++++++
3 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d211247..c2f6a44 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -934,6 +934,7 @@ struct sched_domain {
unsigned int wake_idx;
unsigned int forkexec_idx;
unsigned int smt_gain;
+ unsigned int prefer_lp;
int flags; /* See SD_* */
int level;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..fad16e6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4971,7 +4971,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(13);
+ struct ctl_table *table = sd_alloc_ctl_entry(14);
if (table == NULL)
return NULL;
@@ -5001,7 +5001,9 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
sizeof(int), 0644, proc_dointvec_minmax, false);
set_table_entry(&table[11], "name", sd->name,
CORENAME_MAX_SIZE, 0444, proc_dostring, false);
- /* &table[12] is terminator */
+ set_table_entry(&table[12], "prefer_lp", &sd->prefer_lp,
+ sizeof(int), 0644, proc_dointvec_minmax, false);
+ /* &table[13] is terminator */
return table;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..bff9800 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -23,6 +23,7 @@
#include <linux/latencytop.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
+#include <linux/cpuidle.h>
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
@@ -3181,8 +3182,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int load_idx)
{
struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *idle_group = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ int least_idle_cpu = INT_MAX;
do {
unsigned long load, avg_load;
@@ -3208,6 +3211,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);
avg_load += load;
+ if (!local_group && sd->prefer_lp && least_idle_cpu) {
+ int idle_level;
+
+ idle_level = cpuidle_get_state(i);
+ /*
+ * Select the CPU which is in the lowest
+ * possible power state. Take the active
+ * CPU only if its run queue is empty.
+ */
+ if (!idle_level) {
+ if (idle_cpu(i)) {
+ least_idle_cpu = idle_level;
+ idle_group = group;
+ }
+ } else if (least_idle_cpu > idle_level) {
+ least_idle_cpu = idle_level;
+ idle_group = group;
+ }
+ }
}
/* Adjust by relative CPU power of the group */
@@ -3221,6 +3243,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}
} while (group = group->next, group != sd->groups);
+ if (idle_group)
+ return idle_group;
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
--
1.7.10.4
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-30 21:19 ` [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state Sebastian Andrzej Siewior
@ 2013-01-31 2:12 ` Michael Wang
2013-01-31 5:16 ` Namhyung Kim
2013-02-02 17:50 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 17+ messages in thread
From: Michael Wang @ 2013-01-31 2:12 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
tglx
On 01/31/2013 05:19 AM, Sebastian Andrzej Siewior wrote:
> If a new CPU has to be chosen for a task, then the scheduler first selects
> the group with the least load. This group is returned if its load is lower
> than that of the group to which the task is currently assigned.
> If there are several groups with completely idle CPU(s) (the CPU is in
> an idle state like C1) then the first group is returned.
> This patch extends this decision by considering the idle state of the CPU(s)
> in the group: the first group with a CPU in the lowest idle state
> wins (C1 is preferred over C2). If there is a CPU which is not in an idle
> state (C0) but has no tasks assigned, then it is considered a valid target.
> Should no CPU in an idle state be available, the load average is used as
> a fallback.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 6 ++++--
> kernel/sched/fair.c | 24 ++++++++++++++++++++++++
> 3 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d211247..c2f6a44 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -934,6 +934,7 @@ struct sched_domain {
> unsigned int wake_idx;
> unsigned int forkexec_idx;
> unsigned int smt_gain;
> + unsigned int prefer_lp;
> int flags; /* See SD_* */
> int level;
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..fad16e6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4971,7 +4971,7 @@ set_table_entry(struct ctl_table *entry,
> static struct ctl_table *
> sd_alloc_ctl_domain_table(struct sched_domain *sd)
> {
> - struct ctl_table *table = sd_alloc_ctl_entry(13);
> + struct ctl_table *table = sd_alloc_ctl_entry(14);
>
> if (table == NULL)
> return NULL;
> @@ -5001,7 +5001,9 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
> sizeof(int), 0644, proc_dointvec_minmax, false);
> set_table_entry(&table[11], "name", sd->name,
> CORENAME_MAX_SIZE, 0444, proc_dostring, false);
> - /* &table[12] is terminator */
> + set_table_entry(&table[12], "prefer_lp", &sd->prefer_lp,
> + sizeof(int), 0644, proc_dointvec_minmax, false);
> + /* &table[13] is terminator */
>
> return table;
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5eea870..bff9800 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -23,6 +23,7 @@
> #include <linux/latencytop.h>
> #include <linux/sched.h>
> #include <linux/cpumask.h>
> +#include <linux/cpuidle.h>
> #include <linux/slab.h>
> #include <linux/profile.h>
> #include <linux/interrupt.h>
> @@ -3181,8 +3182,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> int this_cpu, int load_idx)
> {
> struct sched_group *idlest = NULL, *group = sd->groups;
> + struct sched_group *idle_group = NULL;
> unsigned long min_load = ULONG_MAX, this_load = 0;
> int imbalance = 100 + (sd->imbalance_pct-100)/2;
> + int least_idle_cpu = INT_MAX;
>
> do {
> unsigned long load, avg_load;
> @@ -3208,6 +3211,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> load = target_load(i, load_idx);
>
> avg_load += load;
> + if (!local_group && sd->prefer_lp && least_idle_cpu) {
> + int idle_level;
> +
> + idle_level = cpuidle_get_state(i);
> + /*
> + * Select the CPU which is in the lowest
> + * possible power state. Take the active
> + * CPU only if its run queue is empty.
> + */
> + if (!idle_level) {
> + if (idle_cpu(i)) {
> + least_idle_cpu = idle_level;
> + idle_group = group;
> + }
> + } else if (least_idle_cpu > idle_level) {
> + least_idle_cpu = idle_level;
> + idle_group = group;
> + }
> + }
> }
>
> /* Adjust by relative CPU power of the group */
> @@ -3221,6 +3243,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> }
> } while (group = group->next, group != sd->groups);
>
> + if (idle_group)
> + return idle_group;
Hi, Sebastian
I'm not sure, but I'm just concerned about this case:

group 0    cpu 0         cpu 1
           least idle    4 tasks

group 1    cpu 2         cpu 3
           1 task        1 task

The previous logic would pick group 1 but now it will take group 0, and
that causes more imbalance, doesn't it?

Maybe checking that state in find_idlest_cpu() would be better?
Regards,
Michael Wang
> if (!idlest || 100*this_load < imbalance*min_load)
> return NULL;
> return idlest;
>
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 2:12 ` Michael Wang
@ 2013-01-31 5:16 ` Namhyung Kim
2013-01-31 6:39 ` Michael Wang
2013-02-02 17:50 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2013-01-31 5:16 UTC (permalink / raw)
To: Michael Wang
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
Hi Sebastian and Michael,
On Thu, 31 Jan 2013 10:12:35 +0800, Michael Wang wrote:
> On 01/31/2013 05:19 AM, Sebastian Andrzej Siewior wrote:
>> If a new CPU has to be chosen for a task, then the scheduler first selects
>> the group with the least load. This group is returned if its load is lower
>> than that of the group to which the task is currently assigned.
>> If there are several groups with completely idle CPU(s) (the CPU is in
>> an idle state like C1) then the first group is returned.
>> This patch extends this decision by considering the idle state of the CPU(s)
>> in the group: the first group with a CPU in the lowest idle state
>> wins (C1 is preferred over C2). If there is a CPU which is not in an idle
>> state (C0) but has no tasks assigned, then it is considered a valid target.
>> Should no CPU in an idle state be available, the load average is used as
>> a fallback.
[snip]
>> @@ -3181,8 +3182,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> int this_cpu, int load_idx)
>> {
>> struct sched_group *idlest = NULL, *group = sd->groups;
>> + struct sched_group *idle_group = NULL;
>> unsigned long min_load = ULONG_MAX, this_load = 0;
>> int imbalance = 100 + (sd->imbalance_pct-100)/2;
>> + int least_idle_cpu = INT_MAX;
>>
>> do {
>> unsigned long load, avg_load;
>> @@ -3208,6 +3211,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> load = target_load(i, load_idx);
>>
>> avg_load += load;
>> + if (!local_group && sd->prefer_lp && least_idle_cpu) {
>> + int idle_level;
>> +
>> + idle_level = cpuidle_get_state(i);
>> + /*
>> + * Select the CPU which is in the lowest
>> + * possible power state. Take the active
>> + * CPU only if its run queue is empty.
>> + */
>> + if (!idle_level) {
>> + if (idle_cpu(i)) {
>> + least_idle_cpu = idle_level;
>> + idle_group = group;
>> + }
>> + } else if (least_idle_cpu > idle_level) {
>> + least_idle_cpu = idle_level;
>> + idle_group = group;
>> + }
>> + }
>> }
>>
>> /* Adjust by relative CPU power of the group */
>> @@ -3221,6 +3243,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> }
>> } while (group = group->next, group != sd->groups);
>>
>> + if (idle_group)
>> + return idle_group;
>
> I'm not sure, but I'm just concerned about this case:
>
> group 0    cpu 0         cpu 1
>            least idle    4 tasks
>
> group 1    cpu 2         cpu 3
>            1 task        1 task
>
> The previous logic would pick group 1 but now it will take group 0, and
> that causes more imbalance, doesn't it?
>
> Maybe checking that state in find_idlest_cpu() would be better?
Right, at least find_idlest_cpu() should also check the idle_level IMHO.
Anyway, I have an idea with this in mind. It's like adding a new "idle
load" to each idle cpu rather than special-casing the idle cpus like
above. IOW an idle cpu will get a very small load weight depending on how
deeply it sleeps, so that it can be compared to other cpus in the same way
while we can still find the preferred (lowest load) cpu among the idle cpus.

The simple way I can think of is adding the idle level to the rq load in
weighted_cpuload():

static unsigned long weighted_cpuload(const int cpu)
{
        return cpu_rq(cpu)->load.weight + cpuidle_get_state(cpu);
}
What do you think?
Thanks,
Namhyung
* Re: [RFC 1/2] cpuidle: trace state of the CPU
2013-01-30 21:19 ` [RFC 1/2] cpuidle: trace state of the CPU Sebastian Andrzej Siewior
@ 2013-01-31 5:21 ` Namhyung Kim
2013-02-02 17:35 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2013-01-31 5:21 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
tglx
On Wed, 30 Jan 2013 22:19:16 +0100, Sebastian Andrzej Siewior wrote:
> This patch adds an interface to cpuidle in order to retrieve the current
> idle state of a given CPU. Zero means the CPU is not in an idle state, either
> because a cpuidle driver is not available or because the CPU is busy
> executing code. Values greater than zero indicate that the CPU is in some
> kind of idle state; the larger the value, the deeper the idle state.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> drivers/cpuidle/cpuidle.c | 13 ++++++++++++-
> include/linux/cpuidle.h | 2 ++
> 2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index e1f6860..3594e0c 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -23,6 +23,7 @@
> #include "cpuidle.h"
>
> DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
> +static DEFINE_PER_CPU(unsigned int, cpu_state);
What about making it 'int', as cpuidle_get_state() returns an int?
>
> DEFINE_MUTEX(cpuidle_lock);
> LIST_HEAD(cpuidle_detected_devices);
> @@ -40,13 +41,23 @@ void disable_cpuidle(void)
> off = 1;
> }
>
> +int cpuidle_get_state(int cpu)
> +{
> + return per_cpu(cpu_state, cpu);
> +}
> +
> static int __cpuidle_register_device(struct cpuidle_device *dev);
>
> static inline int cpuidle_enter(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> struct cpuidle_state *target_state = &drv->states[index];
> - return target_state->enter(dev, drv, index);
> + int ret;
> +
> + per_cpu(cpu_state, smp_processor_id()) = index + 1;
> + ret = target_state->enter(dev, drv, index);
> + per_cpu(cpu_state, smp_processor_id()) = 0;
Maybe we can use a local variable 'cpu' for this duplicated
smp_processor_id() call, as it's never executed on another cpu in
between?
Thanks,
Namhyung
> + return ret;
> }
>
> static inline int cpuidle_enter_tk(struct cpuidle_device *dev,
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index 24cd1037..256baeb 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -155,6 +155,7 @@ extern int cpuidle_wrap_enter(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index,
> int (*enter)(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index));
> +extern int cpuidle_get_state(int cpu);
> extern int cpuidle_play_dead(void);
>
> extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
> @@ -186,6 +187,7 @@ static inline int cpuidle_wrap_enter(struct cpuidle_device *dev,
> int (*enter)(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index))
> { return -ENODEV; }
> +static inline int cpuidle_get_state(int cpu) {return 0; }
> static inline int cpuidle_play_dead(void) {return -ENODEV; }
> #endif
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 5:16 ` Namhyung Kim
@ 2013-01-31 6:39 ` Michael Wang
2013-01-31 6:58 ` Namhyung Kim
0 siblings, 1 reply; 17+ messages in thread
From: Michael Wang @ 2013-01-31 6:39 UTC (permalink / raw)
To: Namhyung Kim
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On 01/31/2013 01:16 PM, Namhyung Kim wrote:
> Hi Sebastian and Michael,
>
> On Thu, 31 Jan 2013 10:12:35 +0800, Michael Wang wrote:
>> On 01/31/2013 05:19 AM, Sebastian Andrzej Siewior wrote:
>>> If a new CPU has to be chosen for a task, then the scheduler first selects
>>> the group with the least load. This group is returned if its load is lower
>>> than that of the group to which the task is currently assigned.
>>> If there are several groups with completely idle CPU(s) (the CPU is in
>>> an idle state like C1) then the first group is returned.
>>> This patch extends this decision by considering the idle state of the CPU(s)
>>> in the group: the first group with a CPU in the lowest idle state
>>> wins (C1 is preferred over C2). If there is a CPU which is not in an idle
>>> state (C0) but has no tasks assigned, then it is considered a valid target.
>>> Should no CPU in an idle state be available, the load average is used as
>>> a fallback.
> [snip]
>>> @@ -3181,8 +3182,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> int this_cpu, int load_idx)
>>> {
>>> struct sched_group *idlest = NULL, *group = sd->groups;
>>> + struct sched_group *idle_group = NULL;
>>> unsigned long min_load = ULONG_MAX, this_load = 0;
>>> int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>> + int least_idle_cpu = INT_MAX;
>>>
>>> do {
>>> unsigned long load, avg_load;
>>> @@ -3208,6 +3211,25 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> load = target_load(i, load_idx);
>>>
>>> avg_load += load;
>>> + if (!local_group && sd->prefer_lp && least_idle_cpu) {
>>> + int idle_level;
>>> +
>>> + idle_level = cpuidle_get_state(i);
>>> + /*
>>> + * Select the CPU which is in the lowest
>>> + * possible power state. Take the active
>>> + * CPU only if its run queue is empty.
>>> + */
>>> + if (!idle_level) {
>>> + if (idle_cpu(i)) {
>>> + least_idle_cpu = idle_level;
>>> + idle_group = group;
>>> + }
>>> + } else if (least_idle_cpu > idle_level) {
>>> + least_idle_cpu = idle_level;
>>> + idle_group = group;
>>> + }
>>> + }
>>> }
>>>
>>> /* Adjust by relative CPU power of the group */
>>> @@ -3221,6 +3243,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> }
>>> } while (group = group->next, group != sd->groups);
>>>
>>> + if (idle_group)
>>> + return idle_group;
>>
>> I'm not sure, but I'm just concerned about this case:
>>
>> group 0    cpu 0         cpu 1
>>            least idle    4 tasks
>>
>> group 1    cpu 2         cpu 3
>>            1 task        1 task
>>
>> The previous logic would pick group 1 but now it will take group 0, and
>> that causes more imbalance, doesn't it?
>>
>> Maybe checking that state in find_idlest_cpu() would be better?
>
> Right, at least find_idlest_cpu() should also check the idle_level IMHO.
>
> Anyway, I have an idea with this in mind. It's like adding a new "idle
> load" to each idle cpu rather than special-casing the idle cpus like
> above. IOW an idle cpu will get a very small load weight depending on how
> deeply it sleeps, so that it can be compared to other cpus in the same way
> while we can still find the preferred (lowest load) cpu among the idle cpus.
>
> The simple way I can think of is adding the idle level to the rq load in
> weighted_cpuload():
>
> static unsigned long weighted_cpuload(const int cpu)
> {
>         return cpu_rq(cpu)->load.weight + cpuidle_get_state(cpu);
> }
Hmm... then we don't need changes in find_idlest_cpu(), just compare the
load as before, but it works only when the added state value is smaller
than the lowest load of one task, which is currently 15; I'm not sure
whether we have that guarantee...
Regards,
Michael Wang
>
> What do you think?
>
> Thanks,
> Namhyung
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 6:39 ` Michael Wang
@ 2013-01-31 6:58 ` Namhyung Kim
2013-01-31 7:30 ` Michael Wang
0 siblings, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2013-01-31 6:58 UTC (permalink / raw)
To: Michael Wang
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On Thu, 31 Jan 2013 14:39:20 +0800, Michael Wang wrote:
> On 01/31/2013 01:16 PM, Namhyung Kim wrote:
>> Anyway, I have an idea with this in mind. It's like adding a new "idle
>> load" to each idle cpu rather than special-casing the idle cpus like
>> above. IOW an idle cpu will get a very small load weight depending on how
>> deeply it sleeps, so that it can be compared to other cpus in the same way
>> while we can still find the preferred (lowest load) cpu among the idle cpus.
>>
>> The simple way I can think of is adding the idle level to the rq load in
>> weighted_cpuload():
>>
>> static unsigned long weighted_cpuload(const int cpu)
>> {
>>         return cpu_rq(cpu)->load.weight + cpuidle_get_state(cpu);
>> }
>
> Hmm... then we don't need changes in find_idlest_cpu(), just compare the
> load as before, but it works only when the added state value is smaller
> than the lowest load of one task, which is currently 15; I'm not sure
> whether we have that guarantee...
You were talking about a nice 19 process, right? But I found that a
SCHED_IDLE task will have a weight of 3. :(

#define WEIGHT_IDLEPRIO 3
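For instance, assuming patch 1's index + 1 encoding, a cpu sleeping in its
fourth idle state would compare as

        weighted_cpuload(idle cpu in state 4)        = 0 + 4 = 4
        weighted_cpuload(cpu running one SCHED_IDLE) = 3 + 0 = 3

so the busy cpu would look lighter than the deeply idle one.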
But AFAIK the number of states in cpuidle is usually less than 10, so maybe
we can change the weight then, but there's no guarantee...
Thanks,
Namhyung
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 6:58 ` Namhyung Kim
@ 2013-01-31 7:30 ` Michael Wang
2013-01-31 7:40 ` Namhyung Kim
0 siblings, 1 reply; 17+ messages in thread
From: Michael Wang @ 2013-01-31 7:30 UTC (permalink / raw)
To: Namhyung Kim
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On 01/31/2013 02:58 PM, Namhyung Kim wrote:
> On Thu, 31 Jan 2013 14:39:20 +0800, Michael Wang wrote:
>> On 01/31/2013 01:16 PM, Namhyung Kim wrote:
>>> Anyway, I have an idea with this in mind. It's like adding a new "idle
>>> load" to each idle cpu rather than special-casing the idle cpus like
>>> above. IOW an idle cpu will get a very small load weight depending on how
>>> deeply it sleeps, so that it can be compared to other cpus in the same way
>>> while we can still find the preferred (lowest load) cpu among the idle cpus.
>>>
>>> The simple way I can think of is adding the idle level to the rq load in
>>> weighted_cpuload():
>>>
>>> static unsigned long weighted_cpuload(const int cpu)
>>> {
>>>         return cpu_rq(cpu)->load.weight + cpuidle_get_state(cpu);
>>> }
>>
>> Hmm... then we don't need changes in find_idlest_cpu(), just compare the
>> load as before, but it works only when the added state value is smaller
>> than the lowest load of one task, which is currently 15; I'm not sure
>> whether we have that guarantee...
>
> You were talking about a nice 19 process, right? But I found that a
> SCHED_IDLE task will have a weight of 3. :(
>
> #define WEIGHT_IDLEPRIO 3
I missed that policy :)
>
>
> But AFAIK the number of states in cpuidle is usually less than 10, so maybe
> we can change the weight then, but there's no guarantee...
And I just got another case we should take care of:

group 0    cpu 0            cpu 1
           power index 8    power index 8

group 1    cpu 2            cpu 3
           power index 0    load 15

so the load of group 0 is 16 and group 1 is 15, but group 0 is better...
Regards,
Michael Wang
>
> Thanks,
> Namhyung
>
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 7:30 ` Michael Wang
@ 2013-01-31 7:40 ` Namhyung Kim
2013-01-31 8:24 ` Michael Wang
0 siblings, 1 reply; 17+ messages in thread
From: Namhyung Kim @ 2013-01-31 7:40 UTC (permalink / raw)
To: Michael Wang
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On Thu, 31 Jan 2013 15:30:02 +0800, Michael Wang wrote:
> On 01/31/2013 02:58 PM, Namhyung Kim wrote:
>> But AFAIK the number of states in cpuidle is usually less than 10, so maybe
>> we can change the weight then, but there's no guarantee...
>
> And I just got another case we should take care of:
>
> group 0    cpu 0            cpu 1
>            power index 8    power index 8
>
> group 1    cpu 2            cpu 3
>            power index 0    load 15
>
> so the load of group 0 is 16 and group 1 is 15, but group 0 is better...
Maybe it's not. The cpus in group 0 are in a lower power state, so there
will be a benefit in selecting cpu 2 from the power PoV IMHO. Also, such
a low power state has a longer exit latency, so we should choose cpu 2 to
get better performance, which is the basic idea of this patchset I believe.
Thanks,
Namhyung
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 7:40 ` Namhyung Kim
@ 2013-01-31 8:24 ` Michael Wang
2013-01-31 8:45 ` Michael Wang
0 siblings, 1 reply; 17+ messages in thread
From: Michael Wang @ 2013-01-31 8:24 UTC (permalink / raw)
To: Namhyung Kim
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On 01/31/2013 03:40 PM, Namhyung Kim wrote:
> On Thu, 31 Jan 2013 15:30:02 +0800, Michael Wang wrote:
>> On 01/31/2013 02:58 PM, Namhyung Kim wrote:
>>> But AFAIK the number of states in cpuidle is usually less than 10, so maybe
>>> we can change the weight then, but there's no guarantee...
>>
>> And I just got another case we should take care of:
>>
>> group 0    cpu 0            cpu 1
>>            power index 8    power index 8
>>
>> group 1    cpu 2            cpu 3
>>            power index 0    load 15
>>
>> so the load of group 0 is 16 and group 1 is 15, but group 0 is better...
>
> Maybe it's not. The cpus in group 0 are in a lower power state, so there
> will be a benefit in selecting cpu 2 from the power PoV IMHO. Also, such
> a low power state has a longer exit latency, so we should choose cpu 2 to
> get better performance, which is the basic idea of this patchset I believe.
Well, this case is just to point out that we may face a comparison
between load and index, not between index and index. I doubt there is
a rule which could take care of both; besides, a comparison between
load and index is strange...
Regards,
Michael Wang
>
> Thanks,
> Namhyung
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 8:24 ` Michael Wang
@ 2013-01-31 8:45 ` Michael Wang
2013-01-31 8:57 ` Michael Wang
0 siblings, 1 reply; 17+ messages in thread
From: Michael Wang @ 2013-01-31 8:45 UTC (permalink / raw)
To: Namhyung Kim
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On 01/31/2013 04:24 PM, Michael Wang wrote:
> On 01/31/2013 03:40 PM, Namhyung Kim wrote:
>> On Thu, 31 Jan 2013 15:30:02 +0800, Michael Wang wrote:
>>> On 01/31/2013 02:58 PM, Namhyung Kim wrote:
>>>> But AFAIK the number of states in cpuidle is usually less than 10, so maybe
>>>> we can change the weight then, but there's no guarantee...
>>>
>>> And I just got another case we should take care of:
>>>
>>> group 0    cpu 0            cpu 1
>>>            power index 8    power index 8
>>>
>>> group 1    cpu 2            cpu 3
>>>            power index 0    load 15
>>>
>>> so the load of group 0 is 16 and group 1 is 15, but group 0 is better...
>>
>> Maybe it's not. The cpus in group 0 are in a lower power state, so there
>> will be a benefit in selecting cpu 2 from the power PoV IMHO. Also, such
>> a low power state has a longer exit latency, so we should choose cpu 2 to
>> get better performance, which is the basic idea of this patchset I believe.
>
> Well, this case is just to point out that we may face a comparison
> between load and index, not between index and index. I doubt there is
> a rule which could take care of both; besides, a comparison between
> load and index is strange...
Oh, I missed the point that you call it 'idle load'. Hmm... maybe it could
work: if we could scale the current load numbers, then we would have more
'space' for the 'idle load'.
Regards,
Michael Wang
>
> Regards,
> Michael Wang
>
>>
>> Thanks,
>> Namhyung
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 8:45 ` Michael Wang
@ 2013-01-31 8:57 ` Michael Wang
2013-02-01 8:53 ` Namhyung Kim
0 siblings, 1 reply; 17+ messages in thread
From: Michael Wang @ 2013-01-31 8:57 UTC (permalink / raw)
To: Namhyung Kim
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
On 01/31/2013 04:45 PM, Michael Wang wrote:
> On 01/31/2013 04:24 PM, Michael Wang wrote:
>> On 01/31/2013 03:40 PM, Namhyung Kim wrote:
>>> On Thu, 31 Jan 2013 15:30:02 +0800, Michael Wang wrote:
>>>> On 01/31/2013 02:58 PM, Namhyung Kim wrote:
>>>>> But AFAIK the number of states in cpuidle is usually less than 10, so maybe
>>>>> we can change the weight then, but there's no guarantee...
>>>>
>>>> And I just got another case we should take care of:
>>>>
>>>> group 0    cpu 0            cpu 1
>>>>            power index 8    power index 8
>>>>
>>>> group 1    cpu 2            cpu 3
>>>>            power index 0    load 15
>>>>
>>>> so the load of group 0 is 16 and group 1 is 15, but group 0 is better...
>>>
>>> Maybe it's not. The cpus in group 0 are in a lower power state, so there
>>> will be a benefit in selecting cpu 2 from the power PoV IMHO. Also, such
>>> a low power state has a longer exit latency, so we should choose cpu 2 to
>>> get better performance, which is the basic idea of this patchset I believe.
>>
>> Well, this case is just to point out that we may face a comparison
>> between load and index, not between index and index. I doubt there is
>> a rule which could take care of both; besides, a comparison between
>> load and index is strange...
>
> Oh, I missed the point that you call it 'idle load'. Hmm... maybe it could
> work: if we could scale the current load numbers, then we would have more
> 'space' for the 'idle load'.
And some tips here:
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
* low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
* hierarchies, especially on larger systems. This is not a user-visible change
* and does not change the user-interface for setting shares/weights.
*
* We increase resolution only if we have enough bits to allow this increased
* resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
* when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
* increased costs.
*/
#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
# define SCHED_LOAD_RESOLUTION 10
# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
# define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)
It mentions a regression; that's the history, but it sounds like a lot
of testing would be needed.
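(For scale: with SCHED_LOAD_RESOLUTION enabled, even the smallest task
weight, WEIGHT_IDLEPRIO, would become 3 << 10 = 3072, dwarfing any
idle-state value.)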
Regards,
Michael Wang
>
> Regards,
> Michael Wang
>
>>
>> Regards,
>> Michael Wang
>>
>>>
>>> Thanks,
>>> Namhyung
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 8:57 ` Michael Wang
@ 2013-02-01 8:53 ` Namhyung Kim
0 siblings, 0 replies; 17+ messages in thread
From: Namhyung Kim @ 2013-02-01 8:53 UTC (permalink / raw)
To: Michael Wang
Cc: Sebastian Andrzej Siewior, linux-kernel, Rafael J. Wysocki,
Ingo Molnar, Peter Zijlstra, tglx
Hi Michael,
On Thu, 31 Jan 2013 16:57:49 +0800, Michael Wang wrote:
> On 01/31/2013 04:45 PM, Michael Wang wrote:
>> On 01/31/2013 04:24 PM, Michael Wang wrote:
>>> On 01/31/2013 03:40 PM, Namhyung Kim wrote:
>>>> On Thu, 31 Jan 2013 15:30:02 +0800, Michael Wang wrote:
>>>>> On 01/31/2013 02:58 PM, Namhyung Kim wrote:
>>>>>> But AFAIK the number of states in cpuidle is usually less than 10, so maybe
>>>>>> we can change the weight then, but there's no guarantee...
>>>>>
>>>>> And I just got another case we should take care of:
>>>>>
>>>>> group 0    cpu 0            cpu 1
>>>>>            power index 8    power index 8
>>>>>
>>>>> group 1    cpu 2            cpu 3
>>>>>            power index 0    load 15
>>>>>
>>>>> so the load of group 0 is 16 and group 1 is 15, but group 0 is better...
>>>>
>>>> Maybe it's not. The cpus in group 0 are in a lower power state, so there
>>>> will be a benefit in selecting cpu 2 from the power PoV IMHO. Also, such
>>>> a low power state has a longer exit latency, so we should choose cpu 2 to
>>>> get better performance, which is the basic idea of this patchset I believe.
>>>
>>> Well, this case is just to point out that we may face a comparison
>>> between load and index, not between index and index. I doubt there is
>>> a rule which could take care of both; besides, a comparison between
>>> load and index is strange...
>>
>> Oh, I missed the point that you call it 'idle load'. Hmm... maybe it could
>> work: if we could scale the current load numbers, then we would have more
>> 'space' for the 'idle load'.
>
> And some tips here:
>
> /*
> * Increase resolution of nice-level calculations for 64-bit architectures.
> * The extra resolution improves shares distribution and load balancing of
> * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
> * hierarchies, especially on larger systems. This is not a user-visible change
> * and does not change the user-interface for setting shares/weights.
> *
> * We increase resolution only if we have enough bits to allow this increased
> * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
> * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
> * increased costs.
> */
> #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
> # define SCHED_LOAD_RESOLUTION 10
> # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
> # define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)
>
> It mentions a regression; that's the history, but it sounds like a lot
> of testing would be needed.
Thanks for the tip. Yes, I was aware of this and a bit worried about the
mentioned regression. So I just wanted to increase the load weight of a
SCHED_IDLE task.

Unfortunately I don't have a good test environment, so I cannot carry on
the work, at least for a while.
Thanks for your help anyway,
Namhyung
* Re: [RFC 1/2] cpuidle: trace state of the CPU
2013-01-31 5:21 ` Namhyung Kim
@ 2013-02-02 17:35 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-02-02 17:35 UTC (permalink / raw)
To: Namhyung Kim
Cc: linux-kernel, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
tglx
On 01/31/2013 06:21 AM, Namhyung Kim wrote:
>> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
>> index e1f6860..3594e0c 100644
>> --- a/drivers/cpuidle/cpuidle.c
>> +++ b/drivers/cpuidle/cpuidle.c
>> @@ -23,6 +23,7 @@
>> #include "cpuidle.h"
>>
>> DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
>> +static DEFINE_PER_CPU(unsigned int, cpu_state);
>
> What about making it 'int', as cpuidle_get_state() returns an int?
The index should never become negative because it is an index. In case it
does (which should not happen) we end up with a large value, which is
considered a really deep state, and the CPU will be ignored as long as
possible.
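A sketch of that wrap-around, assuming cpu_state stays unsigned int as in
the patch:

        /* a bogus index of -2 would wrap on the unsigned store: */
        unsigned int state = -2 + 1;    /* == UINT_MAX, a very deep state */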
>> @@ -40,13 +41,23 @@ void disable_cpuidle(void)
…
>> static inline int cpuidle_enter(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index)
>> {
>> struct cpuidle_state *target_state = &drv->states[index];
>> - return target_state->enter(dev, drv, index);
>> + int ret;
>> +
>> + per_cpu(cpu_state, smp_processor_id()) = index + 1;
>> + ret = target_state->enter(dev, drv, index);
>> + per_cpu(cpu_state, smp_processor_id()) = 0;
>
> Maybe we can use a local variable 'cpu' for this duplicated
> smp_processor_id() call, as it's never executed on another cpu in
> between?
This should be really cheap unless CONFIG_DEBUG_PREEMPT is on, but okay,
I will change it.
> Thanks,
> Namhyung
Sebastian
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-01-31 2:12 ` Michael Wang
2013-01-31 5:16 ` Namhyung Kim
@ 2013-02-02 17:50 ` Sebastian Andrzej Siewior
2013-02-04 3:01 ` Michael Wang
1 sibling, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-02-02 17:50 UTC (permalink / raw)
To: Michael Wang
Cc: linux-kernel, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
tglx
On 01/31/2013 03:12 AM, Michael Wang wrote:
> I'm not sure, but I'm just concerned about this case:
>
> group 0    cpu 0         cpu 1
>            least idle    4 tasks
>
> group 1    cpu 2         cpu 3
>            1 task        1 task
>
> The previous logic would pick group 1 but now it will take group 0, and
> that causes more imbalance, doesn't it?
That depends on the load of CPUs 0 + 1 vs CPUs 2 + 3. If the four tasks on
CPU1 are idle then the previous code should return group 0.
If the four tasks are each running at 100% then two of them should be
migrated to CPU0, and at that point the idle state does not matter :)
> Maybe checking that state in find_idlest_cpu() would be better?
You mean moving this from find_idlest_group() to find_idlest_cpu()?
> Regards,
> Michael Wang
Sebastian
* Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
2013-02-02 17:50 ` Sebastian Andrzej Siewior
@ 2013-02-04 3:01 ` Michael Wang
0 siblings, 0 replies; 17+ messages in thread
From: Michael Wang @ 2013-02-04 3:01 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-kernel, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
tglx
On 02/03/2013 01:50 AM, Sebastian Andrzej Siewior wrote:
> On 01/31/2013 03:12 AM, Michael Wang wrote:
>> I'm not sure, but I'm just concerned about this case:
>>
>> group 0    cpu 0         cpu 1
>>            least idle    4 tasks
>>
>> group 1    cpu 2         cpu 3
>>            1 task        1 task
>>
>> The previous logic would pick group 1 but now it will take group 0, and
>> that causes more imbalance, doesn't it?
>
> That depends on the load of CPUs 0 + 1 vs CPUs 2 + 3. If the four tasks on
> CPU1 are idle then the previous code should return group 0.
> If the four tasks are each running at 100% then two of them should be
> migrated to CPU0, and at that point the idle state does not matter :)
Hmm... maybe I should make it clearer like this:

Prev find_idlest_group():

        cpu 0 is the least idle
        cpu 1 has 4 tasks on its run queue
        cpu 2 has 1 task (the current task) on its run queue
        cpu 3 has 1 task on its run queue

and suppose no changes happen during the search, and this sd only
contains 2 groups:

        group 0 has cpu 0 and 1
        group 1 has cpu 2 and 3

So in the old world, group 0 has load 4096 (if all the tasks are nice 0;
let's leave out the load adjustments), group 1 has load 2048, so
find_idlest_group() will return group 1 since it's the idlest.

But now, since we directly use the idle group, that will be group 0, and
after the task is placed, group 0 will have a load of 5120 while group 1
only has 2048, and that causes more imbalance (than 4096 : 3072).

That's just what flashed through my mind when I saw the patch; maybe it
is not a good case or I missed something. But since find_idlest_group()
is trying to balance the load, if we want to override that rule, we need
to prove it by logic or benchmarks.
>
>> Maybe checking that state in find_idlest_cpu() would be better?
>
> You mean moving this from find_idlest_group() to find_idlest_cpu()?
Yes, since we already ensure balance in find_idlest_group(), we only need
to add a check like the one below in find_idlest_cpu() (a sketch using
cpuidle_get_state(); the idlest != -1 guard covers the iterations before
a first pick is made):

if (load < min_load || (load == min_load && i == this_cpu)) {
        /* skip 'i' if the current pick sleeps shallower than 'i' */
        if (idlest != -1 &&
            cpuidle_get_state(idlest) < cpuidle_get_state(i))
                continue;
        min_load = load;
        idlest = i;
}

That will bring a very limited benefit (only in the case where a group
contains multiple idle cpus in different power states), but it is very
easy to justify logically, isn't it?
And Namhyung mentioned an interesting implementation which may need no
changes to the selection code; please take a look :)
Regards,
Michael Wang
>
>> Regards,
>> Michael Wang
>
> Sebastian
Thread overview: 17+ messages
2013-01-30 21:19 [RFC] Consider CPU idle state while choosing a new CPU Sebastian Andrzej Siewior
2013-01-30 21:19 ` [RFC 1/2] cpuidle: trace state of the CPU Sebastian Andrzej Siewior
2013-01-31 5:21 ` Namhyung Kim
2013-02-02 17:35 ` Sebastian Andrzej Siewior
2013-01-30 21:19 ` [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state Sebastian Andrzej Siewior
2013-01-31 2:12 ` Michael Wang
2013-01-31 5:16 ` Namhyung Kim
2013-01-31 6:39 ` Michael Wang
2013-01-31 6:58 ` Namhyung Kim
2013-01-31 7:30 ` Michael Wang
2013-01-31 7:40 ` Namhyung Kim
2013-01-31 8:24 ` Michael Wang
2013-01-31 8:45 ` Michael Wang
2013-01-31 8:57 ` Michael Wang
2013-02-01 8:53 ` Namhyung Kim
2013-02-02 17:50 ` Sebastian Andrzej Siewior
2013-02-04 3:01 ` Michael Wang