* [RFC PATCH 0/2] sched/eevdf: Rate limit task migration

From: Mathieu Desnoyers @ 2023-09-05 17:11 UTC
To: Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

Implement task migration rate limiting to speed up workload patterns
such as hackbench which trigger frequent migrations.

The first patch implements simple rate limiting of 1 migration per
2 ms per task. The second patch implements adaptive rate limiting of
task migrations.

I would be interested to hear feedback on this approach, especially
about how it behaves on various workloads.

Thanks,

Mathieu

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: x86@kernel.org

Mathieu Desnoyers (2):
  sched: Rate limit migrations to 1 per 2ms per task
  sched: Implement adaptive rate limiting of task migrations

 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 40 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  4 ++++
 4 files changed, 51 insertions(+)

-- 
2.39.2
* [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-05 17:11 UTC
To: Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

Rate limit migrations to 1 migration per 2 milliseconds per task. On a
kernel with the EEVDF scheduler (commit
b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench
from 62s to 45s on a 192-core AMD EPYC (across 2 sockets).

This results in the following benchmark improvements:

  hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

from 62s to 45s (27% speedup). And similarly with perf bench:

  perf bench sched messaging -g 32 -p -t -l 100000

from 13.0s to 9.5s (26% speedup).

I have noticed that in order to observe the speedup, the workload needs
to keep the CPUs sufficiently busy to cause runqueue lock contention,
but not so busy that they never go idle. This can be explained by the
fact that idle CPUs are a preferred target for task wakeup runqueue
selection, and therefore having idle CPUs causes more migrations, which
triggers more remote wakeups. For both the hackbench and the perf bench
sched messaging benchmarks, the scale of the workload can be tweaked by
changing the number of groups.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000x cpu_relax within do_idle() had similar
benefits.

This patch results from the observation that the common effect of the
prior approaches that succeeded in speeding up this workload was to
reduce the number of migrations from 7.5k migrations/s to
1.5k migrations/s.

This patch shows a similar speedup on a 6.4.4 kernel with the CFS
scheduler.

With this patch applied, the "skip queued wakeups only when L2 is
shared" patch [1] brings the hackbench benchmark down to 41s (34%
speedup from baseline), but the "ratelimit update to tg->load_avg"
patch from Aaron Lu [2] does not seem to offer any additional speedup.

The value "1 migration" and the 2 ms window size were determined
empirically with the hackbench benchmark on the targeted hardware.

I would be interested to hear feedback about the performance impact of
this patch (improvement or regression) on other workloads and hardware,
especially for Intel CPUs.
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
Link: https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ [1]
Link: https://lore.kernel.org/lkml/20230823060832.454842-1-aaron.lu@intel.com/ [2]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: x86@kernel.org
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 14 ++++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..1111d04255cc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -564,6 +564,8 @@ struct sched_entity {
 
 	u64			nr_migrations;
 
+	u64			next_migration_time;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int			depth;
 	struct sched_entity	*parent;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 479db611f46e..0d294fce261d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vruntime = 0;
 	p->se.vlag = 0;
 	p->se.slice = sysctl_sched_base_slice;
+	p->se.next_migration_time = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d92da2d78774..24ac69913005 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -960,6 +960,14 @@ int sched_update_scaling(void)
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
+static bool should_migrate_task(struct task_struct *p, int prev_cpu)
+{
+	/* Rate limit task migration. */
+	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
+		return false;
+	return true;
+}
+
 /*
  * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
  * this is probably good enough.
@@ -7897,6 +7905,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
+	if (want_affine && !should_migrate_task(p, prev_cpu))
+		return prev_cpu;
+
 	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
@@ -7944,6 +7955,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	struct sched_entity *se = &p->se;
 
+	/* Rate limit task migration. */
+	se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
+
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cf54fe338e23..c9b1a5976761 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -104,6 +104,8 @@ struct cpuidle_state;
 #define TASK_ON_RQ_QUEUED	1
 #define TASK_ON_RQ_MIGRATING	2
 
+#define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000	/* 2 ms */
+
 extern __read_mostly int scheduler_running;
 
 extern unsigned long calc_load_update;

-- 
2.39.2
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Tim Chen @ 2023-09-05 20:28 UTC
To: Mathieu Desnoyers, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On Tue, 2023-09-05 at 13:11 -0400, Mathieu Desnoyers wrote:
> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
> kernel with the EEVDF scheduler (commit
> b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench
> from 62s to 45s on a 192-core AMD EPYC (across 2 sockets).
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 479db611f46e..0d294fce261d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>  	p->se.vruntime = 0;
>  	p->se.vlag = 0;
>  	p->se.slice = sysctl_sched_base_slice;
> +	p->se.next_migration_time = 0;

It seems like next_migration_time should be initialized to the current
time, in case the system runs for a long time and clock wrap-around
causes problems.

>  	INIT_LIST_HEAD(&p->se.group_node);
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d92da2d78774..24ac69913005 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -960,6 +960,14 @@ int sched_update_scaling(void)
>
>  static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>
> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
> +{
> +	/* Rate limit task migration. */
> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)

Should we use time_before(sched_clock_cpu(prev_cpu),
p->se.next_migration_time) instead?

> +		return false;
> +	return true;
> +}
> +

Thanks.

Tim
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-05 21:16 UTC
To: Tim Chen, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/5/23 16:28, Tim Chen wrote:
> On Tue, 2023-09-05 at 13:11 -0400, Mathieu Desnoyers wrote:
>> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
>> kernel with the EEVDF scheduler (commit
>> b97d64c722598ffed42ece814a2cb791336c6679), this speeds up hackbench
>> from 62s to 45s on a 192-core AMD EPYC (across 2 sockets).
>>
>> @@ -4510,6 +4510,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>>  	p->se.vruntime = 0;
>>  	p->se.vlag = 0;
>>  	p->se.slice = sysctl_sched_base_slice;
>> +	p->se.next_migration_time = 0;
>
> It seems like next_migration_time should be initialized to the current
> time, in case the system runs for a long time and clock wrap-around
> causes problems.

next_migration_time is a u64, which should "never" overflow. Other
scheduler code comparing against sched_clock() doesn't appear to care
about u64 overflow. Sampling next_migration_time on fork could delay
migrations for a 2 ms window after process creation, which I don't
think is something we want. Or if we do want this behavior, it should
be validated with benchmarks beforehand.

>> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
>> +{
>> +	/* Rate limit task migration. */
>> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
>
> Should we use time_before(sched_clock_cpu(prev_cpu),
> p->se.next_migration_time) instead?

No, because time_before() expects unsigned long parameters, and
sched_clock_cpu() and next_migration_time are u64.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
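For reference, the kind of wrap-tolerant comparison being discussed
already has a u64 form in the kernel: time_before64()/time_after64() in
<linux/jiffies.h>, built on a signed-difference test. A minimal sketch
of how the rate-limit check could adopt it, not part of the posted
patch:

	/*
	 * Sketch only, not from the series: an overflow-tolerant
	 * variant of the rate-limit check. Like time_before64(), it
	 * relies on signed wrap-around arithmetic and assumes the two
	 * timestamps are always within 2^63 ns of each other.
	 */
	static inline bool migration_window_elapsed(u64 now, u64 next_time)
	{
		return (s64)(now - next_time) >= 0;
	}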
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Tim Chen @ 2023-09-05 22:44 UTC
To: Mathieu Desnoyers, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On Tue, 2023-09-05 at 17:16 -0400, Mathieu Desnoyers wrote:
> On 9/5/23 16:28, Tim Chen wrote:
> > It seems like next_migration_time should be initialized to the
> > current time, in case the system runs for a long time and clock
> > wrap-around causes problems.
>
> next_migration_time is a u64, which should "never" overflow. Other
> scheduler code comparing against sched_clock() doesn't appear to care
> about u64 overflow.

Reading up on the sched_clock() documentation, it seems like it should
indeed be monotonic. For a TSC-based clock, which starts from 0 at
boot, the TSC doesn't wrap around for on the order of ~190 years.

I wonder about the corner case when a system suspends and resumes. The
documentation on sched_clock says "The clock driving sched_clock() may
stop or reset to zero during system suspend/sleep". If sched_clock() is
reset to 0, the next_migration_time of all suspended tasks should also
be reset to 0 before they resume, so that the next migration time is
not in the far future.

Thanks.

Tim

> Sampling next_migration_time on fork could delay migrations for a
> 2 ms window after process creation, which I don't think is something
> we want. Or if we do want this behavior, it should be validated with
> benchmarks beforehand.
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Peter Zijlstra @ 2023-09-06 9:47 UTC
To: Tim Chen
Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On Tue, Sep 05, 2023 at 03:44:57PM -0700, Tim Chen wrote:

> Reading up on the sched_clock() documentation, it seems like it
> should indeed be monotonic.

It tries very hard to be monotonic but cannot guarantee it. The moment
the TSC is found unstable it's too late to fix up everything.
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Tim Chen @ 2023-09-06 20:51 UTC
To: Peter Zijlstra
Cc: Mathieu Desnoyers, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On Wed, 2023-09-06 at 11:47 +0200, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 03:44:57PM -0700, Tim Chen wrote:
>
> > Reading up on the sched_clock() documentation, it seems like it
> > should indeed be monotonic.
>
> It tries very hard to be monotonic but cannot guarantee it. The
> moment the TSC is found unstable it's too late to fix up everything.

Yes, if the TSC becomes unstable, it could cause sched_clock to reset
and go way backward. Perhaps we can add the following check to
Mathieu's original patch to fix things up (unquoted lines are the
proposed addition):

> +static bool should_migrate_task(struct task_struct *p, int prev_cpu)
> +{

	/* sched_clock reset causing next migration time to be too far ahead */
	if (p->se.next_migration_time > sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW)
		p->se.next_migration_time = sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;

> +	/* Rate limit task migration. */
> +	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
> +		return false;
> +	return true;
> +}

Tim
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 21:55 UTC
To: Tim Chen, Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/6/23 16:51, Tim Chen wrote:
> Yes, if the TSC becomes unstable, it could cause sched_clock to reset
> and go way backward. Perhaps we can add the following check to
> Mathieu's original patch to fix things up:
>
> [...]
>
> 	/* sched_clock reset causing next migration time to be too far ahead */
> 	if (p->se.next_migration_time > sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW)
> 		p->se.next_migration_time = sched_clock_cpu(prev_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;

Along those lines, I think something like this should work:

static bool should_migrate_task(struct task_struct *p, int prev_cpu)
{
	u64 now = sched_clock_cpu(prev_cpu);

	/* sched_clock reset causing next migration time to be too far ahead. */
	if (now + SCHED_MIGRATION_RATELIMIT_WINDOW < p->se.next_migration_time)
		return true;
	/* Rate limit task migration. */
	if (now >= p->se.next_migration_time)
		return true;
	return false;
}

It will let migrate_task_rq_fair() update se->next_migration_time.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Peter Zijlstra @ 2023-09-06 8:44 UTC
To: Mathieu Desnoyers
Cc: Tim Chen, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On Tue, Sep 05, 2023 at 05:16:25PM -0400, Mathieu Desnoyers wrote:
> On 9/5/23 16:28, Tim Chen wrote:
> > It seems like next_migration_time should be initialized to the
> > current time, in case the system runs for a long time and clock
> > wrap-around causes problems.
>
> next_migration_time is a u64, which should "never" overflow. Other
> scheduler code comparing against sched_clock() doesn't appear to care
> about u64 overflow.

Much code actually considers overflow. We also have monotonicity
filters where it really matters.
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 13:58 UTC
To: Peter Zijlstra
Cc: Tim Chen, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On 9/6/23 04:44, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 05:16:25PM -0400, Mathieu Desnoyers wrote:
> > next_migration_time is a u64, which should "never" overflow. Other
> > scheduler code comparing against sched_clock() doesn't appear to
> > care about u64 overflow.
>
> Much code actually considers overflow. We also have monotonicity
> filters where it really matters.

OK, I'll update the patch to consider overflow if we end up going that
route, but for now I'll try an approach based on idle timestamps
instead.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Peter Zijlstra @ 2023-09-06 8:41 UTC
To: Mathieu Desnoyers
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On Tue, Sep 05, 2023 at 01:11:04PM -0400, Mathieu Desnoyers wrote:
> Rate limit migrations to 1 migration per 2 milliseconds per task. On a
> kernel with the EEVDF scheduler (commit
> b97d64c722598ffed42ece814a2cb791336c6679),

This is not in any way related to the actual EEVDF part; perhaps just
call it fair.

>  include/linux/sched.h |  2 ++
>  kernel/sched/core.c   |  1 +
>  kernel/sched/fair.c   | 14 ++++++++++++++
>  kernel/sched/sched.h  |  2 ++
>  4 files changed, 19 insertions(+)
>
> [full diff quoted above]
>
> +#define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000	/* 2 ms */
> +
>  extern __read_mostly int scheduler_running;
>
>  extern unsigned long calc_load_update;

Urgh... we already have much of this around task_hot() /
can_migrate_task(), and I would much rather see us extend those things
to cover this wakeup migration path than build a whole new parallel
thing.

Also:

> I have noticed that in order to observe the speedup, the workload
> needs to keep the CPUs sufficiently busy to cause runqueue lock
> contention, but not so busy that they never go idle.

This would suggest inhibiting the pulling of tasks based on rq
statistics instead of task stats. It doesn't matter when the task last
migrated; what matters is that this rq doesn't want new tasks at this
point. They are not the same thing.
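For context on the reuse Peter is pointing at: task_hot() in
kernel/sched/fair.c gates load-balance migrations by how recently a
task ran. A minimal sketch of a wakeup-side variant, with a
hypothetical helper name and glossing over rq clock locking; this is
not a posted patch:

	/*
	 * Sketch only: mirror task_hot()'s "ran too recently to be
	 * worth moving" test at wakeup, reusing se.exec_start and
	 * sysctl_sched_migration_cost instead of adding a dedicated
	 * next_migration_time field. Assumes the prev rq clock can be
	 * read safely here, which the real wakeup path would need to
	 * arrange.
	 */
	static bool wakeup_task_hot(struct task_struct *p, struct rq *prev_rq)
	{
		s64 delta = rq_clock_task(prev_rq) - p->se.exec_start;

		return delta < (s64)sysctl_sched_migration_cost;
	}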
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 13:57 UTC
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/6/23 04:41, Peter Zijlstra wrote:
> On Tue, Sep 05, 2023 at 01:11:04PM -0400, Mathieu Desnoyers wrote:
>> Rate limit migrations to 1 migration per 2 milliseconds per task.
>> On a kernel with the EEVDF scheduler (commit
>> b97d64c722598ffed42ece814a2cb791336c6679),
>
> This is not in any way related to the actual EEVDF part; perhaps just
> call it fair.

Good point.

> [...]
>
> Urgh... we already have much of this around task_hot() /
> can_migrate_task(), and I would much rather see us extend those
> things to cover this wakeup migration path than build a whole new
> parallel thing.

Yes, good point.

> Also:
>
>> I have noticed that in order to observe the speedup, the workload
>> needs to keep the CPUs sufficiently busy to cause runqueue lock
>> contention, but not so busy that they never go idle.
>
> This would suggest inhibiting the pulling of tasks based on rq
> statistics instead of task stats. It doesn't matter when the task
> last migrated; what matters is that this rq doesn't want new tasks
> at this point. They are not the same thing.

I suspect we could try something like this then:

When a CPU enters idle state, it could grab a sched_clock() timestamp
and store it into this_rq()->enter_idle_time. Then, when it exits idle
and re-enters idle again, it could save rq->enter_idle_time to
rq->prev_enter_idle_time and sample enter_idle_time again.

When considering the CPU as a target for task migration, if it is idle
but the delta between sched_clock_cpu(cpu_of(rq)) and that
prev_enter_idle_time is below a threshold (e.g. a few ms), this means
the CPU got out of idle and went back to idle pretty quickly, which
means it's not a good target for pulling tasks for a short while.

I'll try something along these lines and see how it goes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
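A minimal sketch of the idle-timestamp bookkeeping described above. The
rq fields and helper names are hypothetical, not from the posted
patches, and the follow-up message below reports that this approach did
not end up producing a speedup:

	/*
	 * Sketch only: assumes two new u64 fields on struct rq,
	 * enter_idle_time and prev_enter_idle_time, stamped each time
	 * the CPU enters idle.
	 */
	static void rq_note_enter_idle(struct rq *rq)
	{
		rq->prev_enter_idle_time = rq->enter_idle_time;
		rq->enter_idle_time = sched_clock_cpu(cpu_of(rq));
	}

	/*
	 * An idle CPU that re-entered idle very recently was busy a
	 * moment ago; treat it as a poor target for pulling tasks for
	 * a short while.
	 */
	static bool rq_idle_recently_busy(struct rq *rq)
	{
		return sched_clock_cpu(cpu_of(rq)) - rq->prev_enter_idle_time <
		       SCHED_MIGRATION_RATELIMIT_WINDOW;
	}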
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-06 15:38 UTC
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Valentin Schneider, Steven Rostedt,
    Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Vincent Guittot,
    Juri Lelli, Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86

On 9/6/23 09:57, Mathieu Desnoyers wrote:
> On 9/6/23 04:41, Peter Zijlstra wrote:
> [...]
>> This would suggest inhibiting the pulling of tasks based on rq
>> statistics instead of task stats. It doesn't matter when the task
>> last migrated; what matters is that this rq doesn't want new tasks
>> at this point. They are not the same thing.
>
> I suspect we could try something like this then:
>
> When a CPU enters idle state, it could grab a sched_clock() timestamp
> and store it into this_rq()->enter_idle_time. [...]
>
> I'll try something along these lines and see how it goes.

I've tried this approach and failed to observe any kind of speedup.

The effect I'm looking for is to favor keeping a task on its previous
runqueue (preventing migration), even if there are rq siblings which
have a lower load (or are actually idle), as long as the previous
runqueue does not have too high a load. I'll try this approach and let
you know how it goes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Chen Yu @ 2023-09-10 7:03 UTC
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On 2023-09-06 at 09:57:04 -0400, Mathieu Desnoyers wrote:
> On 9/6/23 04:41, Peter Zijlstra wrote:
> [...]
> > This would suggest inhibiting the pulling of tasks based on rq
> > statistics instead of task stats. It doesn't matter when the task
> > last migrated; what matters is that this rq doesn't want new tasks
> > at this point. They are not the same thing.
>
> I suspect we could try something like this then:
>
> When a CPU enters idle state, it could grab a sched_clock() timestamp
> and store it into this_rq()->enter_idle_time. Then, when it exits
> idle and re-enters idle again, it could save rq->enter_idle_time to
> rq->prev_enter_idle_time and sample enter_idle_time again.
>
> When considering the CPU as a target for task migration, if it is
> idle but the delta between sched_clock_cpu(cpu_of(rq)) and that
> prev_enter_idle_time is below a threshold (e.g. a few ms), this means
> the CPU got out of idle and went back to idle pretty quickly, which
> means it's not a good target for pulling tasks for a short while.

Do you mean inhibiting the newidle balance? Currently the newidle
balance checks whether the average idle duration of that rq is below
the total cost of doing a load balance:

	this_rq->avg_idle < sd->max_newidle_lb_cost

> I'll try something along these lines and see how it goes.

Considering the sleep time looks like a good idea! What you suggest
inspires me: maybe we could consider the task's sleep duration, and
decide whether or not to migrate it at the next wakeup.

Say a task p sleeps and is woken up quickly: can we reserve its
previous CPU as idle for a short while, so that other tasks cannot
choose p's previous CPU during their wakeups? A short while later,
when p is woken up, it finds that its previous CPU is still idle and
chooses it.

I created a draft patch based on that, and it shows some improvement
on a 224-CPU system. I'll post the draft patch and Cc you.

thanks,
Chenyu
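A rough sketch of the reservation idea described above, with
hypothetical rq field names; Chen Yu's actual draft patch was posted
separately and may differ:

	/*
	 * Sketch only: when a task with a history of short sleeps
	 * blocks, its runqueue is stamped with an expiry time
	 * (cache_hot_timeout) and the owning task (cache_hot_task),
	 * neither of which exists in the posted series. Idle-CPU scans
	 * during wakeup would skip a CPU whose reservation is still
	 * active, unless the waking task is the one it is held for.
	 */
	static bool cpu_is_reserved(struct rq *rq, struct task_struct *p)
	{
		if (sched_clock_cpu(cpu_of(rq)) >= rq->cache_hot_timeout)
			return false;			/* reservation expired */
		return rq->cache_hot_task != p;		/* held for another task */
	}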
* Re: [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task

From: Mathieu Desnoyers @ 2023-09-13 15:46 UTC
To: Chen Yu
Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

On 9/10/23 03:03, Chen Yu wrote:
> On 2023-09-06 at 09:57:04 -0400, Mathieu Desnoyers wrote:
> [...]
>
> Do you mean inhibiting the newidle balance? Currently the newidle
> balance checks whether the average idle duration of that rq is below
> the total cost of doing a load balance:
>
> 	this_rq->avg_idle < sd->max_newidle_lb_cost

Not quite, but...

>> I'll try something along these lines and see how it goes.

... anyway, this approach did not work based on my testing.

> Considering the sleep time looks like a good idea! What you suggest
> inspires me: maybe we could consider the task's sleep duration, and
> decide whether or not to migrate it at the next wakeup.
>
> Say a task p sleeps and is woken up quickly: can we reserve its
> previous CPU as idle for a short while, so that other tasks cannot
> choose p's previous CPU during their wakeups? A short while later,
> when p is woken up, it finds that its previous CPU is still idle and
> chooses it.
>
> I created a draft patch based on that, and it shows some improvement
> on a 224-CPU system. I'll post the draft patch and Cc you.

I think your approach is very promising; let's keep digging in that
direction.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* [RFC PATCH 2/2] sched: Implement adaptive rate limiting of task migrations

From: Mathieu Desnoyers @ 2023-09-05 17:11 UTC
To: Peter Zijlstra
Cc: linux-kernel, Mathieu Desnoyers, Ingo Molnar, Valentin Schneider,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Vincent Guittot, Juri Lelli, Swapnil Sapkal, Aaron Lu,
    Julien Desfossez, x86

Define a task migration quota (initially 10) within a 2 ms window for
each task. Whenever a task reaches its quota, subsequent affine wakeups
keep the task on its previous CPU for the rest of the window.

When the migration quota has been entirely used within a window, the
available quota is divided by 2 for the next window, down to a minimum
of 1 per window. When the available quota is not entirely used within a
window, it is replenished by adding 1 per window.

One advantage of this approach is to leave extra room for sporadic
migration bursts, limiting the migration rate more strictly only as
tasks repeatedly exhaust their quota. Another advantage is to reduce
the number of sched_clock calls on the wakeup path when tasks are below
their quota.

The resulting benchmarks are similar to those achieved with the
non-adaptive approach for the hackbench and perf bench sched messaging
workloads.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: x86@kernel.org
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 36 +++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |  2 ++
 4 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1111d04255cc..2190e59ebc99 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -565,6 +565,8 @@ struct sched_entity {
 	u64			nr_migrations;
 
 	u64			next_migration_time;
+	int			nr_migrations_per_window;
+	int			quota_migrations_per_window;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int			depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d294fce261d..834e37231c36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4511,6 +4511,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.vlag = 0;
 	p->se.slice = sysctl_sched_base_slice;
 	p->se.next_migration_time = 0;
+	p->se.nr_migrations_per_window = 0;
+	p->se.quota_migrations_per_window = SCHED_MIGRATION_QUOTA;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24ac69913005..e2f4c30483a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -963,9 +963,11 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
 static bool should_migrate_task(struct task_struct *p, int prev_cpu)
 {
 	/* Rate limit task migration. */
-	if (sched_clock_cpu(prev_cpu) < p->se.next_migration_time)
-		return false;
-	return true;
+	if (p->se.nr_migrations_per_window < p->se.quota_migrations_per_window)
+		return true;
+	if (sched_clock_cpu(prev_cpu) > p->se.next_migration_time)
+		return true;
+	return false;
 }
 
 /*
@@ -7946,6 +7948,31 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	return new_cpu;
 }
 
+static void migrate_task_ratelimit_fair(struct task_struct *p, int new_cpu)
+{
+	struct sched_entity *se = &p->se;
+	u64 now;
+
+	/* Rate limit task migration. */
+	now = sched_clock_cpu(task_cpu(p));
+	if (now < se->next_migration_time) {
+		se->nr_migrations_per_window++;
+		if (!sched_clock_stable())
+			se->next_migration_time += sched_clock_cpu(new_cpu) - now;
+	} else {
+		if (!sched_clock_stable())
+			now = sched_clock_cpu(new_cpu);
+		se->next_migration_time = now + SCHED_MIGRATION_RATELIMIT_WINDOW;
+		if (se->nr_migrations_per_window >= se->quota_migrations_per_window)
+			se->quota_migrations_per_window = max(1, se->quota_migrations_per_window >> 1);
+		else
+			se->quota_migrations_per_window =
+				min(SCHED_MIGRATION_QUOTA,
+				    se->quota_migrations_per_window + SCHED_MIGRATION_QUOTA_REPLENISH);
+		se->nr_migrations_per_window = 1;
+	}
+}
+
 /*
  * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -7955,8 +7982,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 {
 	struct sched_entity *se = &p->se;
 
-	/* Rate limit task migration. */
-	se->next_migration_time = sched_clock_cpu(new_cpu) + SCHED_MIGRATION_RATELIMIT_WINDOW;
+	migrate_task_ratelimit_fair(p, new_cpu);
 
 	if (!task_on_rq_migrating(p)) {
 		remove_entity_load_avg(se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9b1a5976761..04529661976c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -105,6 +105,8 @@ struct cpuidle_state;
 #define TASK_ON_RQ_MIGRATING	2
 
 #define SCHED_MIGRATION_RATELIMIT_WINDOW	2000000	/* 2 ms */
+#define SCHED_MIGRATION_QUOTA			10
+#define SCHED_MIGRATION_QUOTA_REPLENISH		1
 
 extern __read_mostly int scheduler_running;

-- 
2.39.2
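To make the quota adaptation concrete, here is a small standalone
userspace simulation (plain C, not kernel code) of the window-update
rule in migrate_task_ratelimit_fair(): the quota halves after a window
in which it was exhausted, and is replenished by 1 per window
otherwise.

#include <stdio.h>

#define SCHED_MIGRATION_QUOTA           10
#define SCHED_MIGRATION_QUOTA_REPLENISH  1

/* Mirror of the patch's per-window quota update. */
static int next_quota(int quota, int used)
{
	if (used >= quota) {
		/* Exhausted: halve the quota, with a floor of 1. */
		quota >>= 1;
		return quota > 0 ? quota : 1;
	}
	/* Not exhausted: replenish, capped at the initial quota. */
	quota += SCHED_MIGRATION_QUOTA_REPLENISH;
	return quota < SCHED_MIGRATION_QUOTA ? quota : SCHED_MIGRATION_QUOTA;
}

int main(void)
{
	int quota = SCHED_MIGRATION_QUOTA;
	int window;

	/* Four windows of sustained migration bursts, then four quiet ones. */
	for (window = 0; window < 8; window++) {
		int used = window < 4 ? quota : 0;

		quota = next_quota(quota, used);
		printf("window %d: used %d, next quota %d\n",
		       window, used, quota);
	}
	return 0;
}

Running this shows the asymmetry of the design: sustained bursts drive
the quota from 10 down to 1 within a few windows, while recovery back
to the full quota proceeds linearly at 1 per quiet window.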
Thread overview: 16+ messages

2023-09-05 17:11 [RFC PATCH 0/2] sched/eevdf: Rate limit task migration Mathieu Desnoyers
2023-09-05 17:11 ` [RFC PATCH 1/2] sched: Rate limit migrations to 1 per 2ms per task Mathieu Desnoyers
2023-09-05 20:28   ` Tim Chen
2023-09-05 21:16     ` Mathieu Desnoyers
2023-09-05 22:44       ` Tim Chen
2023-09-06  9:47         ` Peter Zijlstra
2023-09-06 20:51           ` Tim Chen
2023-09-06 21:55             ` Mathieu Desnoyers
2023-09-06  8:44       ` Peter Zijlstra
2023-09-06 13:58         ` Mathieu Desnoyers
2023-09-06  8:41   ` Peter Zijlstra
2023-09-06 13:57     ` Mathieu Desnoyers
2023-09-06 15:38       ` Mathieu Desnoyers
2023-09-10  7:03       ` Chen Yu
2023-09-13 15:46         ` Mathieu Desnoyers
2023-09-05 17:11 ` [RFC PATCH 2/2] sched: Implement adaptive rate limiting of task migrations Mathieu Desnoyers