* [RFC PATCH 0/2] sched/eevdf: Rate limit task migration

From: Mathieu Desnoyers @ 2023-09-05 17:11 UTC
To: Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

Implement task migration rate limiting to speed up workload patterns
such as hackbench which trigger frequent migrations.

The first patch implements simple rate limiting of 1 migration per
2 ms per task. The second patch implements adaptive rate limiting of
task migrations.

I would be interested to hear feedback on this approach, especially
about how it behaves on various workloads.

Thanks,

Mathieu

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: x86@kernel.org

Mathieu Desnoyers (2):
  sched: Rate limit migrations to 1 per 2ms per task
  sched: Implement adaptive rate limiting of task migrations

 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  4 ++++
 4 files changed, 51 insertions(+)

-- 
2.39.2
* [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-05 17:11 UTC
To: Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

Rate limit migrations to 1 migration per 2 milliseconds per task. On a
kernel with the EEVDF scheduler (commit
b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench
from 62s to 45s on a 192-core AMD EPYC (across 2 sockets).

This results in the following benchmark improvements:

  hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 62s to 45s (27% speedup). And similarly with perf bench:

  perf bench sched messaging -g 32 -p -t -l 100000

from 13.0s to 9.5s (26% speedup).

I have noticed that in order to observe the speedup, the workload needs
to keep the CPUs sufficiently busy to cause runqueue lock contention,
but not so busy that they never go idle. This can be explained by the
fact that idle CPUs are a preferred target for task wakeup runqueue
selection, and therefore having idle CPUs causes more migrations, which
triggers more remote wakeups. For both the hackbench and the perf bench
sched messaging benchmarks, the scale of the workload can be tweaked by
changing the number of groups.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000x cpu_relax within do_idle() had similar
benefits.

This patch results from the observation that the common effect of the
prior approaches that succeeded in speeding up this workload was to
reduce the number of migrations from 7.5k migrations/s to
1.5k migrations/s.

This patch shows a similar speedup on a 6.4.4 kernel with the CFS
scheduler.

With this patch applied, the "skip queued wakeups only when L2 is
shared" patch [1] brings the hackbench benchmark down to 41s (34%
speedup from baseline), but the "ratelimit update to tg->load_avg"
patch from Aaron Lu [2] does not seem to offer any additional speedup.

The value "1 migration" and the 2 ms window size were determined
empirically with the hackbench benchmark on the targeted hardware.

I would be interested to hear feedback about the performance impact of
this patch (improvement or regression) on other workloads and hardware,
especially for Intel CPUs.
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ [1]
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/ [2]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: x86@kernel.org
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 14 ++++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..1111d04255cc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -564,6 +564,8 @@ struct sched_entity {
 
 	u64			nr_migrations;
 
+	u64			next_migration_time;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int			depth;
 	struct sched_entity	*parent;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 479db611f46e..0d294fce261d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vruntime = 0;
 	p->se.vlag = 0;
 	p->se.slice = sysctl_sched_base_slice;
+	p->se.next_migration_time = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d92da2d78774..24ac69913005 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -960,6 +960,14 @@ int sched_update_scaling(void)
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
+static bool should_migrate_task(struct task_struct *p, int prev_cpu)
+{
+	/* Rate limit task migration. */
+	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
+		return false;
+	return true;
+}
+
 /*
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
@@ -7897,6 +7905,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
+	if (want_affine && !should_migrate_task(p, prev_cpu))
+		return prev_cpu;
+
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
@@ -7944,6 +7955,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	struct sched_entity *se = &p->se;
 
+	/* Rate limit task migration. */
+	se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
+
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf54fe338e23..c9b1a5976761 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -104,6 +104,8 @@ struct cpuidle_state;
 #define TASK_ON_RQ_QUEUED	1
 #define TASK_ON_RQ_MIGRATING	2
 
+#define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000	/* 2 ms */
+
 extern __read_mostly int scheduler_running;
 
 extern unsigned long calc_load_update;

-- 
2.39.2
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Tim Chen @ 2023-09-05 20:28 UTC
To: Mathieu Desnoyers, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On Tue, 2023-09-05 at 13:11 -0400, Mathieu Desnoyers wrote:
> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
> kernel with the EEVDF scheduler (commit
> b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench
> from 62s to 45s on a 192-core AMD EPYC (across 2 sockets).
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 479db611f46e..0d294fce261d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	p->se.vruntime = 0;
>  	p->se.vlag = 0;
>  	p->se.slice = sysctl_sched_base_slice;
> +	p->se.next_migration_time = 0;

It seems like next_migration_time should be initialized to the current
time, in case the system runs for a long time and clock wrap-around
causes problems.

>  	INIT_LIST_HEAD(&p->se.group_node);
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d92da2d78774..24ac69913005 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -960,6 +960,14 @@ int sched_update_scaling(void)
>
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>
> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
> +{
> +	/* Rate limit task migration. */
> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)

Should we use time_before(sched_clock_cpu(prev_cpu),
p->se.next_migration_time) instead?

> +		return false;
> +	return true;
> +}
> +

Thanks.

Tim
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-05 21:16 UTC
To: Tim Chen, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/5/23 16:28, Tim Chen wrote:
> On Tue, 2023-09-05 at 13:11 -0400, Mathieu Desnoyers wrote:
>> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
>> kernel with the EEVDF scheduler (commit
>> b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench
>> from 62s to 45s on a 192-core AMD EPYC (across 2 sockets).
>>
>> @@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>  	p->se.vruntime = 0;
>>  	p->se.vlag = 0;
>>  	p->se.slice = sysctl_sched_base_slice;
>> +	p->se.next_migration_time = 0;
>
> It seems like next_migration_time should be initialized to the current
> time, in case the system runs for a long time and clock wrap-around
> causes problems.

next_migration_time is a u64, which should "never" overflow. Other
scheduler code comparing against sched_clock() doesn't appear to care
about u64 overflow. Sampling next_migration_time on fork could delay
migrations for a 2 ms window after process creation, which I don't
think is something we want. Or if we do want this behavior, it should
be validated with benchmarks beforehand.

>> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
>> +{
>> +	/* Rate limit task migration. */
>> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
>
> Should we use time_before(sched_clock_cpu(prev_cpu),
> p->se.next_migration_time) instead?

No, because time_before() expects unsigned long parameters, and
sched_clock_cpu() and next_migration_time are u64.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
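For reference, the kind of wrap-tolerant comparison being discussed
already has a u64 form in the kernel: time_before64()/time_after64() in
<linux/jiffies.h>, built on a signed-difference test. A minimal sketch
of how the rate-limit check could adopt it, not part of the posted
patch:

	/*
	 * Sketch only, not from the series: an overflow-tolerant
	 * variant of the rate-limit check. Like time_before64(), it
	 * relies on signed wrap-around arithmetic and assumes the two
	 * timestamps are always within 2^63 ns of each other.
	 */
	static inline bool migration_window_elapsed(u64 now, u64 next_time)
	{
		return (s64)(now - next_time) >= 0;
	}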
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Tim Chen @ 2023-09-05 22:44 UTC
To: Mathieu Desnoyers, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On Tue, 2023-09-05 at 17:16 -0400, Mathieu Desnoyers wrote:
> On 9/5/23 16:28, Tim Chen wrote:
> > It seems like next_migration_time should be initialized to the
> > current time, in case the system runs for a long time and clock
> > wrap-around causes problems.
>
> next_migration_time is a u64, which should "never" overflow. Other
> scheduler code comparing against sched_clock() doesn't appear to care
> about u64 overflow.

Reading up on the sched_clock() documentation, it seems like it should
indeed be monotonic. For a TSC-based clock, which starts from 0 at
boot, the TSC doesn't wrap around for on the order of ~190 years.

I wonder about the corner case when a system suspends and resumes. The
documentation on sched_clock says "The clock driving sched_clock() may
stop or reset to zero during system suspend/sleep". If sched_clock() is
reset to 0, the next_migration_time of all suspended tasks should also
be reset to 0 before they resume, so that the next migration time is
not in the far future.

Thanks.

Tim

> Sampling next_migration_time on fork could delay migrations for a
> 2 ms window after process creation, which I don't think is something
> we want. Or if we do want this behavior, it should be validated with
> benchmarks beforehand.
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Peter Zijlstra @ 2023-09-06 9:47 UTC
To: Tim Chen
Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On Tue, Sep 05, 2023 at 03:44:57PM -0700, Tim Chen wrote:

> Reading up on the sched_clock() documentation, it seems like it
> should indeed be monotonic.

It tries very hard to be monotonic but cannot guarantee it. The moment
the TSC is found unstable it's too late to fix up everything.
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Tim Chen @ 2023-09-06 20:51 UTC
To: Peter Zijlstra
Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On Wed, 2023-09-06 at 11:47 +0200, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 03:44:57PM -0700, Tim Chen wrote:
>
> > Reading up on the sched_clock() documentation, it seems like it
> > should indeed be monotonic.
>
> It tries very hard to be monotonic but cannot guarantee it. The
> moment the TSC is found unstable it's too late to fix up everything.

Yes, if the TSC becomes unstable, it could cause sched_clock to reset
and go way backward. Perhaps we can add the following check to
Mathieu's original patch to fix things up (unquoted lines are the
proposed addition):

> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
> +{

	/* sched_clock reset causing next migration time to be too far ahead */
	if (p->se.next_migration_time > sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW)
		p->se.next_migration_time = sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;

> +	/* Rate limit task migration. */
> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
> +		return false;
> +	return true;
> +}

Tim
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 21:55 UTC
To: Tim Chen, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/6/23 16:51, Tim Chen wrote:
> Yes, if the TSC becomes unstable, it could cause sched_clock to reset
> and go way backward. Perhaps we can add the following check to
> Mathieu's original patch to fix things up:
>
> [...]
>
> 	/* sched_clock reset causing next migration time to be too far ahead */
> 	if (p->se.next_migration_time > sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW)
> 		p->se.next_migration_time = sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;

Along those lines, I think something like this should work:

static bool should_migrate_task(struct task_struct *p, int prev_cpu)
{
	u64 now = sched_clock_cpu(prev_cpu);

	/* sched_clock reset causing next migration time to be too far ahead. */
	if (now + SCHED_MIGRATION_RATELIMIT_WINDOW < p->se.next_migration_time)
		return true;
	/* Rate limit task migration. */
	if (now >= p->se.next_migration_time)
		return true;
	return false;
}

It will let migrate_task_rq_fair() update se->next_migration_time.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Peter Zijlstra @ 2023-09-06 8:44 UTC
To: Mathieu Desnoyers
Cc: Tim Chen, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On Tue, Sep 05, 2023 at 05:16:25PM -0400, Mathieu Desnoyers wrote:
> On 9/5/23 16:28, Tim Chen wrote:
> > It seems like next_migration_time should be initialized to the
> > current time, in case the system runs for a long time and clock
> > wrap-around causes problems.
>
> next_migration_time is a u64, which should "never" overflow. Other
> scheduler code comparing against sched_clock() doesn't appear to care
> about u64 overflow.

Much code actually considers overflow. We also have monotonicity
filters where it really matters.
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 13:58 UTC
To: Peter Zijlstra
Cc: Tim Chen, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On 9/6/23 04:44, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 05:16:25PM -0400, Mathieu Desnoyers wrote:
> > next_migration_time is a u64, which should "never" overflow. Other
> > scheduler code comparing against sched_clock() doesn't appear to
> > care about u64 overflow.
>
> Much code actually considers overflow. We also have monotonicity
> filters where it really matters.

OK, I'll update the patch to consider overflow if we end up going that
route, but for now I'll try an approach based on idle timestamps
instead.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Peter Zijlstra @ 2023-09-06 8:41 UTC
To: Mathieu Desnoyers
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On Tue, Sep 05, 2023 at 01:11:04PM -0400, Mathieu Desnoyers wrote:
> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
> kernel with the EEVDF scheduler (commit
> b97d64c722598ffed42ece814a2cb791336c6679),

This is not in any way related to the actual EEVDF part; perhaps just
call it fair.

>  include/linux/sched.h |  2 ++
>  kernel/sched/core.c   |  1 +
>  kernel/sched/fair.c   | 14 ++++++++++++++
>  kernel/sched/sched.h  |  2 ++
>  4 files changed, 19 insertions(+)
>
> [full diff quoted above]
>
> +#define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000	/* 2 ms */
> +
>  extern __read_mostly int scheduler_running;
>
>  extern unsigned long calc_load_update;

Urgh... we already have much of this around task_hot() /
can_migrate_task(), and I would much rather see us extend those things
to cover this wakeup migration path than build a whole new parallel
thing.

Also:

> I have noticed that in order to observe the speedup, the workload
> needs to keep the CPUs sufficiently busy to cause runqueue lock
> contention, but not so busy that they never go idle.

This would suggest inhibiting the pulling of tasks based on rq
statistics instead of task stats. It doesn't matter when the task last
migrated; what matters is that this rq doesn't want new tasks at this
point. They are not the same thing.
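For context on the reuse Peter is pointing at: task_hot() in
kernel/sched/fair.c gates load-balance migrations by how recently a
task ran. A minimal sketch of a wakeup-side variant, with a
hypothetical helper name and glossing over rq clock locking; this is
not a posted patch:

	/*
	 * Sketch only: mirror task_hot()'s "ran too recently to be
	 * worth moving" test at wakeup, reusing se.exec_start and
	 * sysctl_sched_migration_cost instead of adding a dedicated
	 * next_migration_time field. Assumes the prev rq clock can be
	 * read safely here, which the real wakeup path would need to
	 * arrange.
	 */
	static bool wakeup_task_hot(struct task_struct *p, struct rq *prev_rq)
	{
		s64 delta = rq_clock_task(prev_rq) - p->se.exec_start;

		return delta < (s64)sysctl_sched_migration_cost;
	}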
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 13:57 UTC
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/6/23 04:41, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 01:11:04PM -0400, Mathieu Desnoyers wrote:
>> Rate limit migrations to 1 migration per 2 milliseconds per task.
>> On a kernel with the EEVDF scheduler (commit
>> b97d64c722598ffed42ece814a2cb791336c6679),
>
> This is not in any way related to the actual EEVDF part; perhaps just
> call it fair.

Good point.

> [...]
>
> Urgh... we already have much of this around task_hot() /
> can_migrate_task(), and I would much rather see us extend those
> things to cover this wakeup migration path than build a whole new
> parallel thing.

Yes, good point.

> Also:
>
>> I have noticed that in order to observe the speedup, the workload
>> needs to keep the CPUs sufficiently busy to cause runqueue lock
>> contention, but not so busy that they never go idle.
>
> This would suggest inhibiting the pulling of tasks based on rq
> statistics instead of task stats. It doesn't matter when the task
> last migrated; what matters is that this rq doesn't want new tasks
> at this point. They are not the same thing.

I suspect we could try something like this then:

When a CPU enters idle state, it could grab a sched_clock() timestamp
and store it into this_rq()->enter_idle_time. Then, when it exits idle
and re-enters idle again, it could save rq->enter_idle_time to
rq->prev_enter_idle_time and sample enter_idle_time again.

When considering the CPU as a target for task migration, if it is idle
but the delta between sched_clock_cpu(cpu_of(rq)) and that
prev_enter_idle_time is below a threshold (e.g. a few ms), this means
the CPU got out of idle and went back to idle pretty quickly, which
means it's not a good target for pulling tasks for a short while.

I'll try something along these lines and see how it goes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
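A minimal sketch of the idle-timestamp bookkeeping described above. The
rq fields and helper names are hypothetical, not from the posted
patches, and the follow-up message below reports that this approach did
not end up producing a speedup:

	/*
	 * Sketch only: assumes two new u64 fields on struct rq,
	 * enter_idle_time and prev_enter_idle_time, stamped each time
	 * the CPU enters idle.
	 */
	static void rq_note_enter_idle(struct rq *rq)
	{
		rq->prev_enter_idle_time = rq->enter_idle_time;
		rq->enter_idle_time = sched_clock_cpu(cpu_of(rq));
	}

	/*
	 * An idle CPU that re-entered idle very recently was busy a
	 * moment ago; treat it as a poor target for pulling tasks for
	 * a short while.
	 */
	static bool rq_idle_recently_busy(struct rq *rq)
	{
		return sched_clock_cpu(cpu_of(rq)) - rq->prev_enter_idle_time <
		       SCHED_MIGRATION_RATELIMIT_WINDOW;
	}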
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 15:38 UTC
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/6/23 09:57, Mathieu Desnoyers wrote:
> On 9/6/23 04:41, Peter Zijlstra wrote:
> [...]
>> This would suggest inhibiting the pulling of tasks based on rq
>> statistics instead of task stats. It doesn't matter when the task
>> last migrated; what matters is that this rq doesn't want new tasks
>> at this point. They are not the same thing.
>
> I suspect we could try something like this then:
>
> When a CPU enters idle state, it could grab a sched_clock() timestamp
> and store it into this_rq()->enter_idle_time. [...]
>
> I'll try something along these lines and see how it goes.

I've tried this approach and failed to observe any kind of speedup.

The effect I'm looking for is to favor keeping a task on its previous
runqueue (preventing migration), even if there are rq siblings which
have a lower load (or are actually idle), as long as the previous
runqueue does not have too high a load. I'll try this approach and let
you know how it goes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Chen Yu @ 2023-09-10 7:03 UTC
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On 2023-09-06 at 09:57:04 -0400, Mathieu Desnoyers wrote:
> On 9/6/23 04:41, Peter Zijlstra wrote:
> [...]
> > This would suggest inhibiting the pulling of tasks based on rq
> > statistics instead of task stats. It doesn't matter when the task
> > last migrated; what matters is that this rq doesn't want new tasks
> > at this point. They are not the same thing.
>
> I suspect we could try something like this then:
>
> When a CPU enters idle state, it could grab a sched_clock() timestamp
> and store it into this_rq()->enter_idle_time. Then, when it exits
> idle and re-enters idle again, it could save rq->enter_idle_time to
> rq->prev_enter_idle_time and sample enter_idle_time again.
>
> When considering the CPU as a target for task migration, if it is
> idle but the delta between sched_clock_cpu(cpu_of(rq)) and that
> prev_enter_idle_time is below a threshold (e.g. a few ms), this means
> the CPU got out of idle and went back to idle pretty quickly, which
> means it's not a good target for pulling tasks for a short while.

Do you mean inhibiting the newidle balance? Currently the newidle
balance checks whether the average idle duration of that rq is below
the total cost of doing a load balance:

	this_rq->avg_idle < sd->max_newidle_lb_cost

> I'll try something along these lines and see how it goes.

Considering the sleep time looks like a good idea! What you suggest
inspires me: maybe we could consider the task's sleep duration, and
decide whether or not to migrate it at the next wakeup.

Say a task p sleeps and is woken up quickly: can we reserve its
previous CPU as idle for a short while, so that other tasks cannot
choose p's previous CPU during their wakeups? A short while later,
when p is woken up, it finds that its previous CPU is still idle and
chooses it.

I created a draft patch based on that, and it shows some improvement
on a 224-CPU system. I'll post the draft patch and Cc you.

thanks,
Chenyu
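A rough sketch of the reservation idea described above, with
hypothetical rq field names; Chen Yu's actual draft patch was posted
separately and may differ:

	/*
	 * Sketch only: when a task with a history of short sleeps
	 * blocks, its runqueue is stamped with an expiry time
	 * (cache_hot_timeout) and the owning task (cache_hot_task),
	 * neither of which exists in the posted series. Idle-CPU scans
	 * during wakeup would skip a CPU whose reservation is still
	 * active, unless the waking task is the one it is held for.
	 */
	static bool cpu_is_reserved(struct rq *rq, struct task_struct *p)
	{
		if (sched_clock_cpu(cpu_of(rq)) >= rq->cache_hot_timeout)
			return false;			/* reservation expired */
		return rq->cache_hot_task != p;		/* held for another task */
	}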
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-13 15:46 UTC
To: Chen Yu
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On 9/10/23 03:03, Chen Yu wrote:
> On 2023-09-06 at 09:57:04 -0400, Mathieu Desnoyers wrote:
> [...]
>
> Do you mean inhibiting the newidle balance? Currently the newidle
> balance checks whether the average idle duration of that rq is below
> the total cost of doing a load balance:
>
> 	this_rq->avg_idle < sd->max_newidle_lb_cost

Not quite, but...

>> I'll try something along these lines and see how it goes.

... anyway, this approach did not work based on my testing.

> Considering the sleep time looks like a good idea! What you suggest
> inspires me: maybe we could consider the task's sleep duration, and
> decide whether or not to migrate it at the next wakeup.
>
> Say a task p sleeps and is woken up quickly: can we reserve its
> previous CPU as idle for a short while, so that other tasks cannot
> choose p's previous CPU during their wakeups? A short while later,
> when p is woken up, it finds that its previous CPU is still idle and
> chooses it.
>
> I created a draft patch based on that, and it shows some improvement
> on a 224-CPU system. I'll post the draft patch and Cc you.

I think your approach is very promising; let's keep digging in that
direction.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* [RFC PATCH 2/2] sched: Implement adaptive rate limiting of task migrations

From: Mathieu Desnoyers @ 2023-09-05 17:11 UTC
To: Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

Define a task migration quota (initially 10) within a 2 ms window for
each task. Whenever a task reaches its quota, subsequent affine wakeups
keep the task on its previous CPU for the rest of the window.

When the migration quota has been entirely used within a window, the
available quota is divided by 2 for the next window, down to a minimum
of 1 per window. When the available quota is not entirely used within a
window, it is replenished by adding 1 per window.

One advantage of this approach is to leave extra room for sporadic
migration bursts, limiting the migration rate more strictly only as
tasks repeatedly exhaust their quota. Another advantage is to reduce
the number of sched_clock calls on the wakeup path when tasks are below
their quota.

The resulting benchmarks are similar to those achieved with the
non-adaptive approach for the hackbench and perf bench sched messaging
workloads.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: x86@kernel.org
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 36 +++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |  2 ++
 4 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1111d04255cc..2190e59ebc99 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -565,6 +565,8 @@ struct sched_entity {
 	u64			nr_migrations;
 
 	u64			next_migration_time;
+	int			nr_migrations_per_window;
+	int			quota_migrations_per_window;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int			depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d294fce261d..834e37231c36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4511,6 +4511,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vlag = 0;
 	p->se.slice = sysctl_sched_base_slice;
 	p->se.next_migration_time = 0;
+	p->se.nr_migrations_per_window = 0;
+	p->se.quota_migrations_per_window = SCHED_MIGRATION_QUOTA;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24ac69913005..e2f4c30483a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -963,9 +963,11 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 static bool should_migrate_task(struct task_struct *p, int prev_cpu)
 {
 	/* Rate limit task migration. */
-	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
-		return false;
-	return true;
+	if (p->se.nr_migrations_per_window < p->se.quota_migrations_per_window)
+		return true;
+	if (sched_clock_cpu(prev_cpu) > p->se.next_migration_time)
+		return true;
+	return false;
 }
 
 /*
@@ -7946,6 +7948,31 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	return new_cpu;
 }
 
+static void migrate_task_ratelimit_fair(struct task_struct *p, int new_cpu)
+{
+	struct sched_entity *se = &p->se;
+	u64 now;
+
+	/* Rate limit task migration. */
+	now = sched_clock_cpu(task_cpu(p));
+	if (now < se->next_migration_time) {
+		se->nr_migrations_per_window++;
+		if (!sched_clock_stable())
+			se->next_migration_time += sched_clock_cpu(new_cpu) - now;
+	} else {
+		if (!sched_clock_stable())
+			now = sched_clock_cpu(new_cpu);
+		se->next_migration_time = now + SCHED_MIGRATION_RATELIMIT_WINDOW;
+		if (se->nr_migrations_per_window >= se->quota_migrations_per_window)
+			se->quota_migrations_per_window = max(1, se->quota_migrations_per_window >> 1);
+		else
+			se->quota_migrations_per_window =
+				min(SCHED_MIGRATION_QUOTA,
+				    se->quota_migrations_per_window + SCHED_MIGRATION_QUOTA_REPLENISH);
+		se->nr_migrations_per_window = 1;
+	}
+}
+
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -7955,8 +7982,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	struct sched_entity *se = &p->se;
 
-	/* Rate limit task migration. */
-	se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
+	migrate_task_ratelimit_fair(p, new_cpu);
 
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9b1a5976761..04529661976c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -105,6 +105,8 @@ struct cpuidle_state;
 #define TASK_ON_RQ_MIGRATING	2
 
 #define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000	/* 2 ms */
+#define SCHED_MIGRATION_QUOTA			10
+#define SCHED_MIGRATION_QUOTA_REPLENISH		1
 
 extern __read_mostly int scheduler_running;

-- 
2.39.2
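To make the quota adaptation concrete, here is a small standalone
userspace simulation (plain C, not kernel code) of the window-update
rule in migrate_task_ratelimit_fair(): the quota halves after a window
in which it was exhausted, and is replenished by 1 per window
otherwise.

#include <stdio.h>

#define SCHED_MIGRATION_QUOTA           10
#define SCHED_MIGRATION_QUOTA_REPLENISH  1

/* Mirror of the patch's per-window quota update. */
static int next_quota(int quota, int used)
{
	if (used >= quota) {
		/* Exhausted: halve the quota, with a floor of 1. */
		quota >>= 1;
		return quota > 0 ? quota : 1;
	}
	/* Not exhausted: replenish, capped at the initial quota. */
	quota += SCHED_MIGRATION_QUOTA_REPLENISH;
	return quota < SCHED_MIGRATION_QUOTA ? quota : SCHED_MIGRATION_QUOTA;
}

int main(void)
{
	int quota = SCHED_MIGRATION_QUOTA;
	int window;

	/* Four windows of sustained migration bursts, then four quiet ones. */
	for (window = 0; window < 8; window++) {
		int used = window < 4 ? quota : 0;

		quota = next_quota(quota, used);
		printf("window %d: used %d, next quota %d\n",
		       window, used, quota);
	}
	return 0;
}

Running this shows the asymmetry of the design: sustained bursts drive
the quota from 10 down to 1 within a few windows, while recovery back
to the full quota proceeds linearly at 1 per quiet window.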
Thread overview: 16+ messages

2023-09-05 17:11 [RFC PATCH 0/2] sched/eevdf: Rate limit task migration Mathieu Desnoyers
2023-09-05 17:11 ` [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task Mathieu Desnoyers
2023-09-05 20:28   ` Tim Chen
2023-09-05 21:16     ` Mathieu Desnoyers
2023-09-05 22:44       ` Tim Chen
2023-09-06  9:47         ` Peter Zijlstra
2023-09-06 20:51           ` Tim Chen
2023-09-06 21:55             ` Mathieu Desnoyers
2023-09-06  8:44       ` Peter Zijlstra
2023-09-06 13:58         ` Mathieu Desnoyers
2023-09-06  8:41   ` Peter Zijlstra
2023-09-06 13:57     ` Mathieu Desnoyers
2023-09-06 15:38       ` Mathieu Desnoyers
2023-09-10  7:03       ` Chen Yu
2023-09-13 15:46         ` Mathieu Desnoyers
2023-09-05 17:11 ` [RFC PATCH 2/2] sched: Implement adaptive rate limiting of task migrations Mathieu Desnoyers