* [PATCH v3 0/2] sched: smart wake-affine
From: Michael Wang @ 2013-07-04  4:55 UTC
To: LKML, Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Alex Shi, Namhyung Kim, Paul Turner, Andrew Morton,
    Nikunj A. Dadhania, Ram Pai

Since v2:
	Add patch [PATCH 2/2] sched: reduce the overhead of obtain factor
	as an optimization. (Thanks to PeterZ)

This patch set implements a smarter wake-affine, in order to win back the
performance lost by workloads like pgbench while preserving the benefit
gained by workloads like hackbench.

Michael Wang (1):
	[PATCH v3 1/2] sched: smart wake-affine foundation

Peter Zijlstra (1):
	[PATCH v3 2/2] sched: reduce the overhead of obtain factor

---
 b/include/linux/sched.h |    3 +++
 b/kernel/sched/core.c   |    7 ++++++-
 b/kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 b/kernel/sched/sched.h  |    1 +
 kernel/sched/fair.c     |    2 +-
 5 files changed, 58 insertions(+), 2 deletions(-)
* [PATCH v3 1/2] sched: smart wake-affine foundation
From: Michael Wang @ 2013-07-04  4:55 UTC
To: LKML, Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Alex Shi, Namhyung Kim, Paul Turner, Andrew Morton,
    Nikunj A. Dadhania, Ram Pai

The wake-affine logic always tries to pull the wakee close to the waker.
In theory this brings a benefit when the waker's cpu has cached hot data
for the wakee, and in the extreme ping-pong case.

Testing shows it can benefit hackbench by up to 15%.

However, the logic is somewhat blind and time-consuming, so some
workloads suffer from it.

Testing shows it can damage pgbench by up to 50%.

Thus, wake-affine should be smarter and realise when to stop its
thankless effort.

This patch introduces 'nr_wakee_switch', which is increased each time a
task switches its wakee.

So a high 'nr_wakee_switch' means the task has more than one wakee, and
the bigger the number, the higher the wakeup frequency.

Now when making the decision on whether to pull or not, pay attention to
a wakee with a high 'nr_wakee_switch': pulling such a task may benefit
the wakee, but it also implies that the waker will face cruel competition
later.  How cruel or how brief that competition is depends on the story
behind 'nr_wakee_switch'; either way, the waker suffers.

Furthermore, if the waker also has a high 'nr_wakee_switch', multiple
tasks rely on it, and the waker's higher latency will damage all of them,
so pulling the wakee looks like a bad deal.

Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes higher
and higher, the deal gets worse and worse.

The patch therefore makes wake-affine stop its work when:

	wakee->nr_wakee_switch > factor &&
	waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)

The factor here is the node size of the current cpu, so a bigger node
leads to more pulling, since the threshold becomes harder to cross.

After applying the patch, pgbench shows up to 40% improvement.

Test:
	Tested with 12 cpu X86 server and tip 3.10.0-rc7.
pgbench             base                    smart

| db_size | clients |  tps  |    |  tps  |
+---------+---------+-------+    +-------+
| 22 MB   |       1 | 10598 |    | 10796 |
| 22 MB   |       2 | 21257 |    | 21336 |
| 22 MB   |       4 | 41386 |    | 41622 |
| 22 MB   |       8 | 51253 |    | 57932 |
| 22 MB   |      12 | 48570 |    | 54000 |
| 22 MB   |      16 | 46748 |    | 55982 | +19.75%
| 22 MB   |      24 | 44346 |    | 55847 | +25.93%
| 22 MB   |      32 | 43460 |    | 54614 | +25.66%
| 7484 MB |       1 |  8951 |    |  9193 |
| 7484 MB |       2 | 19233 |    | 19240 |
| 7484 MB |       4 | 37239 |    | 37302 |
| 7484 MB |       8 | 46087 |    | 50018 |
| 7484 MB |      12 | 42054 |    | 48763 |
| 7484 MB |      16 | 40765 |    | 51633 | +26.66%
| 7484 MB |      24 | 37651 |    | 52377 | +39.11%
| 7484 MB |      32 | 37056 |    | 51108 | +37.92%
| 15 GB   |       1 |  8845 |    |  9104 |
| 15 GB   |       2 | 19094 |    | 19162 |
| 15 GB   |       4 | 36979 |    | 36983 |
| 15 GB   |       8 | 46087 |    | 49977 |
| 15 GB   |      12 | 41901 |    | 48591 |
| 15 GB   |      16 | 40147 |    | 50651 | +26.16%
| 15 GB   |      24 | 37250 |    | 52365 | +40.58%
| 15 GB   |      32 | 36470 |    | 50015 | +37.14%

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Mike Galbraith <efault@gmx.de>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 include/linux/sched.h |    3 +++
 kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..1c996c7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1041,6 +1041,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+	struct task_struct *last_wakee;
+	unsigned long nr_wakee_switch;
+	unsigned long last_switch_decay;
 #endif
 	int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c61a614..a4ddbf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	return 0;
 }

+static void record_wakee(struct task_struct *p)
+{
+	/*
+	 * Rough decay(wiping) for cost saving, don't worry
+	 * about the boundary, really active task won't care
+	 * the loose.
+	 */
+	if (jiffies > current->last_switch_decay + HZ) {
+		current->nr_wakee_switch = 0;
+		current->last_switch_decay = jiffies;
+	}
+
+	if (current->last_wakee != p) {
+		current->last_wakee = p;
+		current->nr_wakee_switch++;
+	}
+}

 static void task_waking_fair(struct task_struct *p)
 {
@@ -2991,6 +3008,7 @@ static void task_waking_fair(struct task_struct *p)
 #endif

 	se->vruntime -= min_vruntime;
+	record_wakee(p);
 }

 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3109,6 +3127,28 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

 #endif

+static int wake_wide(struct task_struct *p)
+{
+	int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
+
+	/*
+	 * Yeah, it's the switching-frequency, could means many wakee or
+	 * rapidly switch, use factor here will just help to automatically
+	 * adjust the loose-degree, so bigger node will lead to more pull.
+	 */
+	if (p->nr_wakee_switch > factor) {
+		/*
+		 * wakee is somewhat hot, it needs certain amount of cpu
+		 * resource, so if waker is far more hot, prefer to leave
+		 * it alone.
+		 */
+		if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
+			return 1;
+	}
+
+	return 0;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
@@ -3118,6 +3158,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	unsigned long weight;
 	int balanced;

+	/*
+	 * If we wake multiple tasks be careful to not bounce
+	 * ourselves around too much.
+	 */
+	if (wake_wide(p))
+		return 0;
+
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
--
1.7.4.1
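A minimal userspace sketch may help make the wake_wide() decision above
concrete; the flip counts below are hypothetical illustration values, and
factor = 12 simply mirrors the 12-cpu test box:

/*
 * Sketch of the wake_wide() heuristic with made-up numbers.
 * Returning 1 ("wide") means: skip the affine pull.
 */
#include <stdio.h>

static int wake_wide(unsigned long waker_switch, unsigned long wakee_switch,
                     unsigned long factor)
{
        /* same two conditions as the patch above */
        if (wakee_switch > factor && waker_switch > factor * wakee_switch)
                return 1;       /* stop pulling, leave the wakee alone */
        return 0;               /* let wake_affine() consider the pull */
}

int main(void)
{
        unsigned long factor = 12;      /* node size of the test server */

        /* 1:1 pipe-style pair: almost never switches wakee -> keep pulling */
        printf("pipe pair  -> wide=%d\n", wake_wide(3, 1, factor));

        /* 1:N server waking many clients: both counts high -> stop pulling */
        printf("1:N server -> wide=%d\n", wake_wide(500, 20, factor));
        return 0;
}

With these example numbers the 1:N case trips both conditions
(20 > 12 and 500 > 12 * 20), so the affine pull is skipped.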
* Re: [PATCH v3 1/2] sched: smart wake-affine foundation
From: Sam Ben @ 2013-07-07  1:31 UTC
To: Michael Wang
Cc: LKML, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Alex Shi,
    Namhyung Kim, Paul Turner, Andrew Morton, Nikunj A. Dadhania, Ram Pai

On 07/04/2013 12:55 PM, Michael Wang wrote:
> The wake-affine logic always tries to pull the wakee close to the waker.
> In theory this brings a benefit when the waker's cpu has cached hot data
> for the wakee, and in the extreme ping-pong case.

What's the meaning of ping-pong case?

> [... rest of the changelog, benchmark table and patch quoted in full ...]
* Re: [PATCH v3 1/2] sched: smart wake-affine foundation
From: Michael Wang @ 2013-07-08  2:36 UTC
To: Sam Ben
Cc: LKML, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Alex Shi,
    Namhyung Kim, Paul Turner, Andrew Morton, Nikunj A. Dadhania, Ram Pai

Hi, Sam

On 07/07/2013 09:31 AM, Sam Ben wrote:
> On 07/04/2013 12:55 PM, Michael Wang wrote:
>> The wake-affine logic always tries to pull the wakee close to the waker.
>> In theory this brings a benefit when the waker's cpu has cached hot data
>> for the wakee, and in the extreme ping-pong case.
>
> What's the meaning of ping-pong case?

PeterZ explained it well in here:

	https://lkml.org/lkml/2013/3/7/332

And you could try to compare:
	taskset 1 perf bench sched pipe
with
	perf bench sched pipe

to confirm it ;-)

Regards,
Michael Wang

> [... rest of the quoted changelog and patch trimmed ...]
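For reference, the comparison suggested above can be run as below;
'taskset 1' uses cpu mask 0x1, i.e. both benchmark tasks are confined to
cpu 0 and keep waking each other on one hot cache:

# pinned: the pure ping-pong case, both tasks share cpu 0's cache
taskset 1 perf bench sched pipe

# unpinned: the scheduler decides where the pair runs
perf bench sched pipe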
* Re: [PATCH v3 1/2] sched: smart wake-affine foundation
From: Sam Ben @ 2013-07-10  1:52 UTC
To: Michael Wang
Cc: LKML, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Alex Shi,
    Namhyung Kim, Paul Turner, Andrew Morton, Nikunj A. Dadhania, Ram Pai

On 07/08/2013 10:36 AM, Michael Wang wrote:
> Hi, Sam
>
> On 07/07/2013 09:31 AM, Sam Ben wrote:
>> What's the meaning of ping-pong case?
>
> PeterZ explained it well in here:
>
>	https://lkml.org/lkml/2013/3/7/332
>
> And you could try to compare:
>	taskset 1 perf bench sched pipe
> with
>	perf bench sched pipe

Why is sched pipe special?

> to confirm it ;-)
>
> [...]
* Re: [PATCH v3 1/2] sched: smart wake-affine foundation
From: Michael Wang @ 2013-07-10  2:12 UTC
To: Sam Ben
Cc: LKML, Ingo Molnar, Peter Zijlstra, Mike Galbraith, Alex Shi,
    Namhyung Kim, Paul Turner, Andrew Morton, Nikunj A. Dadhania, Ram Pai

On 07/10/2013 09:52 AM, Sam Ben wrote:
> On 07/08/2013 10:36 AM, Michael Wang wrote:
>> PeterZ explained it well in here:
>>
>>	https://lkml.org/lkml/2013/3/7/332
>>
>> And you could try to compare:
>>	taskset 1 perf bench sched pipe
>> with
>>	perf bench sched pipe
>
> Why is sched pipe special?

I think the link already explained the reason well, or you can read the
code of that pipe implementation, and you will find out there is a high
chance to match the ping-pong case :)

Regards,
Michael Wang

> [...]
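For readers wondering what that ping-pong pattern looks like in code,
here is a stripped-down sketch of what perf bench sched pipe exercises
(illustrative only, not the real benchmark source): two tasks do nothing
but wake each other through a pair of pipes, so each task always wakes
the same partner, 'nr_wakee_switch' stays near zero, and wake_wide()
never blocks the affine pull.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
        int ab[2], ba[2];       /* parent->child and child->parent pipes */
        char c = 'x';
        int i, loops = 100000;

        if (pipe(ab) || pipe(ba)) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {              /* child: the "wakee" */
                for (i = 0; i < loops; i++) {
                        read(ab[0], &c, 1);     /* sleep until the parent writes */
                        write(ba[1], &c, 1);    /* wake the parent back up */
                }
                _exit(0);
        }

        for (i = 0; i < loops; i++) {   /* parent: the "waker" */
                write(ab[1], &c, 1);
                read(ba[0], &c, 1);
        }
        wait(NULL);
        return 0;
}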
* [tip:perf/core] sched: Implement smarter wake-affine logic
From: tip-bot for Michael Wang @ 2013-07-24  3:56 UTC
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, peterz, efault, wangyun, tglx

Commit-ID:  62470419e993f8d9d93db0effd3af4296ecb79a5
Gitweb:     http://git.kernel.org/tip/62470419e993f8d9d93db0effd3af4296ecb79a5
Author:     Michael Wang <wangyun@linux.vnet.ibm.com>
AuthorDate: Thu, 4 Jul 2013 12:55:51 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 23 Jul 2013 12:18:41 +0200

sched: Implement smarter wake-affine logic

The wake-affine scheduler feature is currently always trying to pull
the wakee close to the waker.  In theory this should be beneficial if
the waker's CPU caches hot data for the wakee, and it's also beneficial
in the extreme ping-pong high context switch rate case.

Testing shows it can benefit hackbench up to 15%.

However, the feature is somewhat blind, from which some workloads
such as pgbench suffer.  It's also time-consuming algorithmically.

Testing shows it can damage pgbench up to 50% - far more than the
benefit it brings in the best case.

So wake-affine should be smarter and it should realize when to stop
its thankless effort at trying to find a suitable CPU to wake on.

This patch introduces 'wakee_flips', which will be increased each
time the task flips (switches) its wakee target.

So a high 'wakee_flips' value means the task has more than one wakee,
and the bigger the number, the higher the wakeup frequency.

Now when making the decision on whether to pull or not, pay attention to
the wakee with a high 'wakee_flips', pulling such a task may benefit
the wakee.  Also imply that the waker will face cruel competition later,
it could be very cruel or very fast depends on the story behind
'wakee_flips', waker therefore suffers.

Furthermore, if waker also has a high 'wakee_flips', that implies that
multiple tasks rely on it, then waker's higher latency will damage all
of them, so pulling wakee seems to be a bad deal.

Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes higher and
higher, the cost of pulling seems to be worse and worse.

The patch therefore helps the wake-affine feature to stop its pulling
work when:

	wakee->wakee_flips > factor &&
	waker->wakee_flips > (factor * wakee->wakee_flips)

The 'factor' here is the number of CPUs in the current CPU's NUMA node,
so a bigger node will lead to more pulling since the trial becomes more
severe.

After applying the patch, pgbench shows up to 40% improvements and no
regressions.

Tested with 12 cpu x86 server and tip 3.10.0-rc7.
The percentages in the final column highlight the areas with the biggest
wins, all other areas improved as well:

pgbench             base                    smart

| db_size | clients |  tps  |    |  tps  |
+---------+---------+-------+    +-------+
| 22 MB   |       1 | 10598 |    | 10796 |
| 22 MB   |       2 | 21257 |    | 21336 |
| 22 MB   |       4 | 41386 |    | 41622 |
| 22 MB   |       8 | 51253 |    | 57932 |
| 22 MB   |      12 | 48570 |    | 54000 |
| 22 MB   |      16 | 46748 |    | 55982 | +19.75%
| 22 MB   |      24 | 44346 |    | 55847 | +25.93%
| 22 MB   |      32 | 43460 |    | 54614 | +25.66%
| 7484 MB |       1 |  8951 |    |  9193 |
| 7484 MB |       2 | 19233 |    | 19240 |
| 7484 MB |       4 | 37239 |    | 37302 |
| 7484 MB |       8 | 46087 |    | 50018 |
| 7484 MB |      12 | 42054 |    | 48763 |
| 7484 MB |      16 | 40765 |    | 51633 | +26.66%
| 7484 MB |      24 | 37651 |    | 52377 | +39.11%
| 7484 MB |      32 | 37056 |    | 51108 | +37.92%
| 15 GB   |       1 |  8845 |    |  9104 |
| 15 GB   |       2 | 19094 |    | 19162 |
| 15 GB   |       4 | 36979 |    | 36983 |
| 15 GB   |       8 | 46087 |    | 49977 |
| 15 GB   |      12 | 41901 |    | 48591 |
| 15 GB   |      16 | 40147 |    | 50651 | +26.16%
| 15 GB   |      24 | 37250 |    | 52365 | +40.58%
| 15 GB   |      32 | 36470 |    | 50015 | +37.14%

Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |    3 +++
 kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50d04b9..4f163a8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1034,6 +1034,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
+	struct task_struct *last_wakee;
+	unsigned long wakee_flips;
+	unsigned long wakee_flip_decay_ts;
 #endif
 	int on_rq;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 765d87a..860063a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3017,6 +3017,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	return 0;
 }

+static void record_wakee(struct task_struct *p)
+{
+	/*
+	 * Rough decay (wiping) for cost saving, don't worry
+	 * about the boundary, really active task won't care
+	 * about the loss.
+	 */
+	if (jiffies > current->wakee_flip_decay_ts + HZ) {
+		current->wakee_flips = 0;
+		current->wakee_flip_decay_ts = jiffies;
+	}
+
+	if (current->last_wakee != p) {
+		current->last_wakee = p;
+		current->wakee_flips++;
+	}
+}

 static void task_waking_fair(struct task_struct *p)
 {
@@ -3037,6 +3054,7 @@ static void task_waking_fair(struct task_struct *p)
 #endif

 	se->vruntime -= min_vruntime;
+	record_wakee(p);
 }

 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3155,6 +3173,28 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

 #endif

+static int wake_wide(struct task_struct *p)
+{
+	int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
+
+	/*
+	 * Yeah, it's the switching-frequency, could means many wakee or
+	 * rapidly switch, use factor here will just help to automatically
+	 * adjust the loose-degree, so bigger node will lead to more pull.
+	 */
+	if (p->wakee_flips > factor) {
+		/*
+		 * wakee is somewhat hot, it needs certain amount of cpu
+		 * resource, so if waker is far more hot, prefer to leave
+		 * it alone.
+		 */
+		if (current->wakee_flips > (factor * p->wakee_flips))
+			return 1;
+	}
+
+	return 0;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
@@ -3164,6 +3204,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	unsigned long weight;
 	int balanced;

+	/*
+	 * If we wake multiple tasks be careful to not bounce
+	 * ourselves around too much.
+	 */
+	if (wake_wide(p))
+		return 0;
+
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
* [PATCH v3 2/2] sched: reduce the overhead of obtain factor
From: Michael Wang @ 2013-07-04  4:56 UTC
To: LKML, Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Alex Shi, Namhyung Kim, Paul Turner, Andrew Morton,
    Nikunj A. Dadhania, Ram Pai

From: Peter Zijlstra <peterz@infradead.org>

Smart wake-affine uses the node size as the factor, but the overhead of
the cpumask operation is high.

Thus, this patch introduces 'sd_llc_size', which records the size of the
highest cache-sharing domain, and uses it as the new factor, in order to
reduce the overhead and make the heuristic more reasonable.

We expect this to help even more on very large platforms.

Test:
	Tested with 12 cpu X86 server and tip 3.10.0-rc7.

pgbench             base                    smart + optimization

| db_size | clients |  tps  |    |  tps  |
+---------+---------+-------+    +-------+
| 22 MB   |       1 | 10598 |    | 10781 |
| 22 MB   |       2 | 21257 |    | 21328 |
| 22 MB   |       4 | 41386 |    | 41622 |
| 22 MB   |       8 | 51253 |    | 60351 |
| 22 MB   |      12 | 48570 |    | 54255 |
| 22 MB   |      16 | 46748 |    | 55534 | +18.79%
| 22 MB   |      24 | 44346 |    | 55976 | +26.23%
| 22 MB   |      32 | 43460 |    | 55279 | +27.20%
| 7484 MB |       1 |  8951 |    |  9054 |
| 7484 MB |       2 | 19233 |    | 19252 |
| 7484 MB |       4 | 37239 |    | 37354 |
| 7484 MB |       8 | 46087 |    | 51218 |
| 7484 MB |      12 | 42054 |    | 49510 |
| 7484 MB |      16 | 40765 |    | 52151 | +27.93%
| 7484 MB |      24 | 37651 |    | 52720 | +40.02%
| 7484 MB |      32 | 37056 |    | 51094 | +37.88%
| 15 GB   |       1 |  8845 |    |  9139 |
| 15 GB   |       2 | 19094 |    | 19379 |
| 15 GB   |       4 | 36979 |    | 37077 |
| 15 GB   |       8 | 46087 |    | 50490 |
| 15 GB   |      12 | 41901 |    | 48235 |
| 15 GB   |      16 | 40147 |    | 51878 | +29.22%
| 15 GB   |      24 | 37250 |    | 52676 | +41.41%
| 15 GB   |      32 | 36470 |    | 50198 | +37.64%

CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Mike Galbraith <efault@gmx.de>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
---
 kernel/sched/core.c  |    7 ++++++-
 kernel/sched/fair.c  |    2 +-
 kernel/sched/sched.h |    1 +
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8b3350..8fcca57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5648,18 +5648,23 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
  * two cpus are in the same cache domain, see cpus_share_cache().
  */
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
+DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);

 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain *sd;
 	int id = cpu;
+	int size = 1;

 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
-	if (sd)
+	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
+		size = cpumask_weight(sched_domain_span(sd));
+	}

 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
+	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4ddbf5..86c4b86 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3129,7 +3129,7 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

 static int wake_wide(struct task_struct *p)
 {
-	int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
+	int factor = this_cpu_read(sd_llc_size);

 	/*
 	 * Yeah, it's the switching-frequency, could means many wakee or
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce39224..3227948 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -582,6 +582,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 }

 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
+DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);

 struct sched_group_power {
--
1.7.4.1
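The shape of this optimization can be illustrated with a small userspace
analogy (the names mirror the kernel ones, but this is a sketch, not
kernel code, and the sizes are hypothetical): pay the O(nr_cpus) cpumask
weight once when the topology is (re)built, and let the wakeup fast path
read one cached integer.

#include <stdio.h>

#define NR_CPUS       4096
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))

static unsigned long llc_mask[NR_CPUS / (8 * sizeof(unsigned long))];
static int sd_llc_size = 1;     /* cached; only updated on topology changes */

static void update_top_cache_domain(void)
{
        int i, weight = 0;

        /* the expensive part: a popcount over the whole cpumask */
        for (i = 0; i < NR_CPUS / BITS_PER_LONG; i++)
                weight += __builtin_popcountl(llc_mask[i]);

        sd_llc_size = weight ? weight : 1;
}

static int wake_wide_factor(void)
{
        return sd_llc_size;     /* fast path: one load instead of a mask scan */
}

int main(void)
{
        llc_mask[0] = 0xfffUL;  /* pretend 12 cpus share the last-level cache */
        update_top_cache_domain();
        printf("factor = %d\n", wake_wide_factor());
        return 0;
}

Since domain rebuilds are rare and wakeups are extremely frequent, moving
the weight computation out of the wakeup path is where the saving comes
from.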
* [tip:perf/core] sched: Micro-optimize the smart wake-affine logic
From: tip-bot for Peter Zijlstra @ 2013-07-24  3:56 UTC
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, davidlohr.bueso, efault, peterz, wangyun, tglx

Commit-ID:  7d9ffa8961482232d964173cccba6e14d2d543b2
Gitweb:     http://git.kernel.org/tip/7d9ffa8961482232d964173cccba6e14d2d543b2
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 4 Jul 2013 12:56:46 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 23 Jul 2013 12:22:06 +0200

sched: Micro-optimize the smart wake-affine logic

Smart wake-affine is using node-size as the factor currently, but the
overhead of the mask operation is high.

Thus, this patch introduce the 'sd_llc_size' percpu variable, which will
record the highest cache-share domain size, and make it to be the new
factor, in order to reduce the overhead and make it more reasonable.

Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  |    7 ++++++-
 kernel/sched/fair.c  |    2 +-
 kernel/sched/sched.h |    1 +
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7c32cb..6df0fbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5083,18 +5083,23 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
  * two cpus are in the same cache domain, see cpus_share_cache().
  */
 DEFINE_PER_CPU(struct sched_domain *, sd_llc);
+DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);

 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain *sd;
 	int id = cpu;
+	int size = 1;

 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
-	if (sd)
+	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
+		size = cpumask_weight(sched_domain_span(sd));
+	}

 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
+	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 860063a..f237437 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3175,7 +3175,7 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,

 static int wake_wide(struct task_struct *p)
 {
-	int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
+	int factor = this_cpu_read(sd_llc_size);

 	/*
 	 * Yeah, it's the switching-frequency, could means many wakee or
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5e129ef..4c1cb80 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
 }

 DECLARE_PER_CPU(struct sched_domain *, sd_llc);
+DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);

 struct sched_group_power {