* [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU
@ 2023-09-11 2:49 Chen Yu
2023-09-11 2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
2023-09-11 2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
0 siblings, 2 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11 2:49 UTC (permalink / raw)
To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
Juri Lelli
Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.
Inhibiting task migration while keeping the work conservation of the
scheduler could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this patch takes
the task's average sleep duration into consideration. If the
task is a short sleeper, tag its previous CPU as cache
hot for a short while. During this tag period, other wakees are
not allowed to pick this idle CPU until the tag times out. Later, when the
task is woken up again, it can find its previous CPU still
idle and choose it in select_idle_sibling().
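The tagging scheme described above can be sketched as a small user-space toy model (the struct and function names here are illustrative, not the kernel's API; the real implementation is in the patches below, and timestamps are assumed to come from a sched_clock()-style nanosecond counter):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the SIS_CACHE idea: when a short-sleeping task is
 * dequeued and leaves its runqueue idle, tag the CPU cache hot until
 * now + the task's average sleep time; other wakees treat the CPU as
 * unavailable until the tag expires.
 */
struct toy_rq {
	uint64_t cache_hot_timeout;	/* ns; 0 means not reserved */
	unsigned int nr_running;
};

static void toy_dequeue_sleeper(struct toy_rq *rq, uint64_t now,
				uint64_t sleep_avg, uint64_t migration_cost)
{
	/* reserve only for short sleepers on a CPU that becomes idle */
	if (!rq->nr_running && sleep_avg && sleep_avg < migration_cost)
		rq->cache_hot_timeout = now + sleep_avg;
}

/* what other wakees would check during the idle-CPU scan */
static bool toy_cpu_available(const struct toy_rq *rq, uint64_t now)
{
	return now >= rq->cache_hot_timeout;
}
```

In this model a wakee scanning for an idle CPU simply skips any CPU whose tag has not yet expired, which is what the select_idle_cpu() change in patch 2/2 does.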
The netperf benchmark shows some improvement, as described in
patch 2/2.
This series is based on the tip/sched/core on top of
Commit 3f4feb58037a ("sched: Misc cleanups").
Any feedback/comments would be appreciated.
Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Chen Yu (2):
sched/fair: Record the average sleep time of a task
sched/fair: skip the cache hot CPU in select_idle_cpu()
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
4 files changed, 49 insertions(+), 3 deletions(-)
--
2.25.1
^ permalink raw reply	[flat|nested] 34+ messages in thread

* [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task
  2023-09-11 2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
@ 2023-09-11 2:49 ` Chen Yu
  2023-09-11 2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
  1 sibling, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11 2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

There is a requirement to have the information on how long a task
sleeps. This information can be used for better task placement: a
short sleeping task can be woken up on its previous idle CPU more
easily. The optimization will be introduced in the next patch.

Define the sleep time of a task as the time delta between when the
task was dequeued and when it was enqueued again.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/fair.c   | 17 +++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..c11879468027 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -561,6 +561,9 @@ struct sched_entity {
 	u64				vruntime;
 	s64				vlag;
 	u64				slice;
+	u64				prev_sleep_time;
+	/* average sleep duration of a task */
+	u64				sleep_avg;
 
 	u64				nr_migrations;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..e20f50726ab8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6457,6 +6457,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	u64 last_sleep = p->se.prev_sleep_time;
+
+	/*
+	 * Calculate the average sleep time of the wakee.
+	 * Use wakee's previous CPU's clock to get the
+	 * delta. The previous CPU should be online, otherwise
+	 * the sleep time could be larger than expected(the clock
+	 * of the offline previous CPU is paused). Check if there
+	 * is a sched_clock() reset on previous CPU due to suspend/resume.
+	 */
+	if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
+	    sched_clock_cpu(task_cpu(p)) > last_sleep)
+		update_avg(&p->se.sleep_avg,
+			   sched_clock_cpu(task_cpu(p)) - last_sleep);
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6551,6 +6565,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int task_sleep = flags & DEQUEUE_SLEEP;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
+	u64 now;
 
 	util_est_dequeue(&rq->cfs, p);
 
@@ -6612,6 +6627,8 @@
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+	now = sched_clock_cpu(cpu_of(rq));
+	p->se.prev_sleep_time = task_sleep ? now : 0;
 }
 
 #ifdef CONFIG_SMP
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread
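The enqueue path above feeds the measured sleep delta into update_avg(). In current kernels that helper is, in substance, an exponentially weighted moving average with a 1/8 weight for each new sample; a user-space sketch follows (the 1/8 weighting is my reading of the kernel helper and should be treated as an assumption, not a guarantee):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/*
 * Rough shape of the scheduler's update_avg(): move the running
 * average 1/8 of the way toward the new sample. Integer division
 * truncates toward zero, so the average converges but never
 * overshoots the sample.
 */
static void update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;

	*avg += diff / 8;
}
```

With repeated 800 ns sleep samples starting from 0, the average climbs 100, 187, 263, ..., approaching 800, which is why a consistently short-sleeping task ends up with a small sleep_avg.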
* [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11 2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
  2023-09-11 2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
@ 2023-09-11 2:50 ` Chen Yu
  2023-09-11 7:26   ` Aaron Lu
    ` (3 more replies)
  1 sibling, 4 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11 2:50 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However, in many cases the
previous CPU has already been taken by other wakees, so p has to
find another idle CPU.

Inspired by Mathieu's idea[1], consider the sleep time of the task.
If that task is a short sleeper, keep p's previous CPU idle
for a short while. Later, when p is woken up, it can choose its
previous CPU in select_idle_sibling(). While p's previous CPU is
reserved, other wakees are not allowed to choose this CPU in
select_idle_cpu(). The reservation period is set to the task's
average sleep time. That is to say, if p is a short sleeping task,
there is no need to migrate p to another idle CPU.

This does not break the work conservation of the scheduler,
because the wakee will still try its best to find an idle CPU.
The difference is that different idle CPUs might have different
priorities. On the other hand, in theory this extra check could
increase the failure ratio of select_idle_cpu(), but per the
initial test results, no regression is detected.
Baseline: tip/sched/core, on top of:
Commit 3f4feb58037a ("sched: Misc cleanups")

Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
The cpufreq governor is performance, turbo boost is disabled, C-states
deeper than C1 are disabled, and NUMA balancing is disabled.

netperf
=======
case             load         baseline(std%)  compare%( std%)
UDP_RR           56-threads    1.00 (  1.34)    +1.05 (  1.04)
UDP_RR           112-threads   1.00 (  7.94)    -0.68 ( 14.42)
UDP_RR           168-threads   1.00 ( 33.17)   +49.63 (  5.96)
UDP_RR           224-threads   1.00 ( 13.52)  +122.53 ( 18.50)

Noticeable improvement of netperf is observed in the 168- and
224-thread cases.

hackbench
=========
case             load         baseline(std%)  compare%( std%)
process-pipe     1-groups      1.00 (  5.61)    -4.69 (  1.48)
process-pipe     2-groups      1.00 (  8.74)    -0.24 (  3.10)
process-pipe     4-groups      1.00 (  3.52)    +1.61 (  4.41)
process-sockets  1-groups      1.00 (  4.73)    +2.32 (  0.95)
process-sockets  2-groups      1.00 (  1.27)    -3.29 (  0.97)
process-sockets  4-groups      1.00 (  0.09)    +0.24 (  0.09)
threads-pipe     1-groups      1.00 ( 10.44)    -5.88 (  1.49)
threads-pipe     2-groups      1.00 ( 19.15)    +5.31 ( 12.90)
threads-pipe     4-groups      1.00 (  1.74)    -5.01 (  6.44)
threads-sockets  1-groups      1.00 (  1.58)    -1.79 (  0.43)
threads-sockets  2-groups      1.00 (  1.19)    -8.43 (  6.91)
threads-sockets  4-groups      1.00 (  0.10)    -0.09 (  0.07)

schbench(old)
========
case             load         baseline(std%)  compare%( std%)
normal           1-mthreads    1.00 (  0.63)    +1.28 (  0.37)
normal           2-mthreads    1.00 (  8.33)    +1.58 (  2.83)
normal           4-mthreads    1.00 (  2.48)    -2.98 (  3.28)
normal           8-mthreads    1.00 (  3.97)    +5.01 (  1.28)

Not much difference is observed in hackbench/schbench, due to
run-to-run variance.
Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 	now = sched_clock_cpu(cpu_of(rq));
 	p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+	/*
+	 * If this rq will become idle, and dequeued task is
+	 * a short sleeping one, check if we can reserve
+	 * this idle CPU for that task for a short while.
+	 * During this reservation period, other wakees will
+	 * skip this 'idle' CPU in select_idle_cpu(), and this
+	 * short sleeping task can pick its previous CPU in
+	 * select_idle_sibling(), which brings better cache
+	 * locality.
+	 */
+	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
+		rq->cache_hot_timeout = now + p->se.sleep_avg;
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
-	    sched_cpu_cookie_match(cpu_rq(cpu), p))
+	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
+		if (sched_feat(SIS_CACHE) &&
+		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
+			return -1;
+
 		return cpu;
+	}
 
 	return -1;
 }
@@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	int cpu;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (!available_idle_cpu(cpu)) {
+		bool cache_hot = sched_feat(SIS_CACHE) ?
+			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
+
+		if (!available_idle_cpu(cpu) || cache_hot) {
 			idle = false;
 			if (*idle_cpu == -1) {
-				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
+				    !cache_hot) {
 					*idle_cpu = cpu;
 					break;
 				}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62013c49c451..7a2c12c3b6d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,6 +1078,7 @@ struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			cache_hot_timeout;
 
 	unsigned long		wake_stamp;
 	u64			wake_avg_idle;
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread
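The modified __select_idle_cpu() flow in the patch above can be modeled in user space as follows (a sketch only: the two booleans stand in for the kernel's idle and core-scheduling cookie checks, and the timestamps stand in for sched_clock_cpu() values in nanoseconds):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the SIS_CACHE check in __select_idle_cpu(): an otherwise
 * idle, cookie-matching CPU is still rejected (-1) while its
 * cache-hot reservation has not expired.
 */
static int toy_select_idle_cpu(int cpu, bool cpu_is_idle, bool cookie_match,
			       uint64_t now, uint64_t cache_hot_timeout)
{
	if (cpu_is_idle && cookie_match) {
		/* reserved for a recently dequeued short sleeper */
		if (now < cache_hot_timeout)
			return -1;

		return cpu;
	}

	return -1;
}
```

This makes the trade-off visible: while a reservation is pending, the scan reports the CPU as unavailable, so in theory the failure ratio of the scan can rise, as the commit message notes.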
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu @ 2023-09-11 7:26 ` Aaron Lu 2023-09-11 8:40 ` Chen Yu 2023-09-11 8:29 ` K Prateek Nayak ` (2 subsequent siblings) 3 siblings, 1 reply; 34+ messages in thread From: Aaron Lu @ 2023-09-11 7:26 UTC (permalink / raw) To: Chen Yu Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote: > When task p is woken up, the scheduler leverages select_idle_sibling() > to find an idle CPU for it. p's previous CPU is usually a preference > because it can improve cache locality. However in many cases the > previous CPU has already been taken by other wakees, thus p has to > find another idle CPU. > > Inspired by Mathieu's idea[1], consider the sleep time of the task. > If that task is a short sleeping one, keep p's previous CPU idle > for a short while. Later when p is woken up, it can choose its > previous CPU in select_idle_sibling(). When p's previous CPU is reserved, > other wakee is not allowed to choose this CPU in select_idle_idle(). > The reservation period is set to the task's average sleep time. That > is to say, if p is a short sleeping task, there is no need to migrate > p to another idle CPU. > > This does not break the work conservation of the scheduler, > because wakee will still try its best to find an idle CPU. > The difference is that, different idle CPUs might have different > priorities. On the other hand, in theory this extra check could > increase the failure ratio of select_idle_cpu(), but per the > initial test result, no regression is detected. 
> > Baseline: tip/sched/core, on top of: > Commit 3f4feb58037a ("sched: Misc cleanups") > > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets. > cpufreq governor is performance, turbo boost is disabled, C-states deeper > than C1 are disabled, Numa balancing is disabled. > > netperf > ======= > case load baseline(std%) compare%( std%) > UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04) > UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42) > UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96) > UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50) > > Noticeable improvements of netperf is observed in 168 and 224 threads > cases. > > hackbench > ========= > case load baseline(std%) compare%( std%) > process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48) > process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10) > process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41) > process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95) > process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97) > process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09) > threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49) > threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90) > threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44) > threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43) > threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91) > threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07) > > schbench(old) > ======== > case load baseline(std%) compare%( std%) > normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37) > normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83) > normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28) > normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28) > > No much difference is observed in hackbench/schbench, due to the > run-to-run variance. 
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1 > Suggested-by: Tim Chen <tim.c.chen@intel.com> > Signed-off-by: Chen Yu <yu.c.chen@intel.com> > --- > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > kernel/sched/features.h | 1 + > kernel/sched/sched.h | 1 + > 3 files changed, 29 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e20f50726ab8..fe3b760c9654 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > now = sched_clock_cpu(cpu_of(rq)); > p->se.prev_sleep_time = task_sleep ? now : 0; > +#ifdef CONFIG_SMP > + /* > + * If this rq will become idle, and dequeued task is > + * a short sleeping one, check if we can reserve > + * this idle CPU for that task for a short while. > + * During this reservation period, other wakees will > + * skip this 'idle' CPU in select_idle_cpu(), and this > + * short sleeping task can pick its previous CPU in > + * select_idle_sibling(), which brings better cache > + * locality. > + */ > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > + rq->cache_hot_timeout = now + p->se.sleep_avg; Should this be written as: rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg); ? A even earlier task may have a bigger sleep_avg and overwriting rq->cache_hot_timeout here with an earlier time may cause that task migrate. Not sure how much impact this can have, just someting hit my brain while reading this code. 
> +#endif > } > > #ifdef CONFIG_SMP > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > { > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > + if (sched_feat(SIS_CACHE) && > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > + return -1; > + Maybe introduce a new function that also considers rq->cache_hot_timeout, like available_idle_cpu_migrate() so that above and below logic can be simplified a bit? I was thinking to simply add that rq->cache_hot_timeout check to available_idle_cpu() but then a long sleeping task could be forced to migrate if its prev_cpu happens to just deschedule a task that sets rq's cache_hot_timeout. I guess that's why you chose to only change the idle semantic in select_idle_cpu() but not in select_idle_sibling()? Thanks, Aaron > return cpu; > + } > > return -1; > } > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu > int cpu; > > for_each_cpu(cpu, cpu_smt_mask(core)) { > - if (!available_idle_cpu(cpu)) { > + bool cache_hot = sched_feat(SIS_CACHE) ? 
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false; > + > + if (!available_idle_cpu(cpu) || cache_hot) { > idle = false; > if (*idle_cpu == -1) { > - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) { > + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) && > + !cache_hot) { > *idle_cpu = cpu; > break; > } > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index f770168230ae..04ed9fcf67f8 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true) > */ > SCHED_FEAT(SIS_PROP, false) > SCHED_FEAT(SIS_UTIL, true) > +SCHED_FEAT(SIS_CACHE, true) > > /* > * Issue a WARN when we do multiple update_rq_clock() calls > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 62013c49c451..7a2c12c3b6d0 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1078,6 +1078,7 @@ struct rq { > #endif > u64 idle_stamp; > u64 avg_idle; > + u64 cache_hot_timeout; > > unsigned long wake_stamp; > u64 wake_avg_idle; > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 34+ messages in thread
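Aaron's max()-based suggestion above can be sketched as follows (an illustrative helper, not kernel code): only ever extend the reservation, never shorten it, so a later short-sleeping task cannot overwrite the longer reservation left by an earlier one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Keep the later of the existing and the newly computed reservation
 * deadline, i.e. rq->cache_hot_timeout =
 *	max(rq->cache_hot_timeout, now + sleep_avg).
 */
static void reserve_keep_longer(uint64_t *cache_hot_timeout,
				uint64_t now, uint64_t sleep_avg)
{
	uint64_t candidate = now + sleep_avg;

	if (candidate > *cache_hot_timeout)
		*cache_hot_timeout = candidate;
}
```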
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 7:26 ` Aaron Lu @ 2023-09-11 8:40 ` Chen Yu 2023-09-13 6:22 ` Gautham R. Shenoy 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-11 8:40 UTC (permalink / raw) To: Aaron Lu Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel Hi Aaron, thanks for the review, On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote: > On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote: > > When task p is woken up, the scheduler leverages select_idle_sibling() > > to find an idle CPU for it. p's previous CPU is usually a preference > > because it can improve cache locality. However in many cases the > > previous CPU has already been taken by other wakees, thus p has to > > find another idle CPU. > > > > Inspired by Mathieu's idea[1], consider the sleep time of the task. > > If that task is a short sleeping one, keep p's previous CPU idle > > for a short while. Later when p is woken up, it can choose its > > previous CPU in select_idle_sibling(). When p's previous CPU is reserved, > > other wakee is not allowed to choose this CPU in select_idle_idle(). > > The reservation period is set to the task's average sleep time. That > > is to say, if p is a short sleeping task, there is no need to migrate > > p to another idle CPU. > > > > This does not break the work conservation of the scheduler, > > because wakee will still try its best to find an idle CPU. > > The difference is that, different idle CPUs might have different > > priorities. On the other hand, in theory this extra check could > > increase the failure ratio of select_idle_cpu(), but per the > > initial test result, no regression is detected. 
> > > > Baseline: tip/sched/core, on top of: > > Commit 3f4feb58037a ("sched: Misc cleanups") > > > > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets. > > cpufreq governor is performance, turbo boost is disabled, C-states deeper > > than C1 are disabled, Numa balancing is disabled. > > > > netperf > > ======= > > case load baseline(std%) compare%( std%) > > UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04) > > UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42) > > UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96) > > UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50) > > > > Noticeable improvements of netperf is observed in 168 and 224 threads > > cases. > > > > hackbench > > ========= > > case load baseline(std%) compare%( std%) > > process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48) > > process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10) > > process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41) > > process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95) > > process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97) > > process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09) > > threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49) > > threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90) > > threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44) > > threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43) > > threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91) > > threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07) > > > > schbench(old) > > ======== > > case load baseline(std%) compare%( std%) > > normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37) > > normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83) > > normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28) > > normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28) > > > > No much difference is observed in hackbench/schbench, due to the > > run-to-run variance. 
> > > > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1 > > Suggested-by: Tim Chen <tim.c.chen@intel.com> > > Signed-off-by: Chen Yu <yu.c.chen@intel.com> > > --- > > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > > kernel/sched/features.h | 1 + > > kernel/sched/sched.h | 1 + > > 3 files changed, 29 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index e20f50726ab8..fe3b760c9654 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > hrtick_update(rq); > > now = sched_clock_cpu(cpu_of(rq)); > > p->se.prev_sleep_time = task_sleep ? now : 0; > > +#ifdef CONFIG_SMP > > + /* > > + * If this rq will become idle, and dequeued task is > > + * a short sleeping one, check if we can reserve > > + * this idle CPU for that task for a short while. > > + * During this reservation period, other wakees will > > + * skip this 'idle' CPU in select_idle_cpu(), and this > > + * short sleeping task can pick its previous CPU in > > + * select_idle_sibling(), which brings better cache > > + * locality. > > + */ > > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > > + rq->cache_hot_timeout = now + p->se.sleep_avg; > > Should this be written as: > rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg); > ? > > A even earlier task may have a bigger sleep_avg and overwriting > rq->cache_hot_timeout here with an earlier time may cause that task > migrate. Not sure how much impact this can have, just someting hit my > brain while reading this code. > My previous thought was that, SIS_CACHE only honors the average sleep time of the latest dequeued task. Say, task p1 goes to sleep, then task p2 is woken up and godes to sleeps, we only honor the sleep time of p2. 
Because we don't know how long p2 has been executed, and how much p2 has polluted the L1/L2 cache, and does it still make sense to let p1 be woken up on its previous CPU(is this CPU still cache hot for p1?) > > +#endif > > } > > > > #ifdef CONFIG_SMP > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > > { > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > > + if (sched_feat(SIS_CACHE) && > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > > + return -1; > > + > > Maybe introduce a new function that also considers rq->cache_hot_timeout, > like available_idle_cpu_migrate() so that above and below logic can be > simplified a bit? > Yes, that would be simpler, I'll do in next version. > I was thinking to simply add that rq->cache_hot_timeout check to > available_idle_cpu() but then a long sleeping task could be forced to > migrate if its prev_cpu happens to just deschedule a task that sets rq's > cache_hot_timeout. I guess that's why you chose to only change the idle > semantic in select_idle_cpu() but not in select_idle_sibling()? > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu() or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling(). If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped, thus no one could use this CPU within the cache hot period, including the cache-hot task itself. thanks, Chenyu ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 8:40 ` Chen Yu @ 2023-09-13 6:22 ` Gautham R. Shenoy 2023-09-13 7:25 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: Gautham R. Shenoy @ 2023-09-13 6:22 UTC (permalink / raw) To: Chen Yu Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, linux-kernel On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote: > Hi Aaron, > > thanks for the review, > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote: [..snip..] > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > > > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > > > { > > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > > > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > > > + if (sched_feat(SIS_CACHE) && > > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > > > + return -1; > > > + > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout, > > like available_idle_cpu_migrate() so that above and below logic can be > > simplified a bit? > > > > Yes, that would be simpler, I'll do in next version. > > > I was thinking to simply add that rq->cache_hot_timeout check to > > available_idle_cpu() but then a long sleeping task could be forced to > > migrate if its prev_cpu happens to just deschedule a task that sets rq's > > cache_hot_timeout. I guess that's why you chose to only change the idle > > semantic in select_idle_cpu() but not in select_idle_sibling()? > > > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu() > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling(). 
> If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped, > thus no one could use this CPU within the cache hot period, including the cache-hot task > itself. This happens even with this patch right? It is possible for a task p1 whose avg sleep time is "t" to go to sleep which causes its CPU to go idle. When it wakes up after a time t' < t, the logic above skips the idle CPU because it is still cache-hot, despite the fact that it is cache hot for p1! Have you considered recording p1's identity in the rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply return the previous CPU if it is cache hot and the wakee is rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling scan. > > thanks, > Chenyu -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-13 6:22 ` Gautham R. Shenoy @ 2023-09-13 7:25 ` Chen Yu 2023-09-14 7:06 ` Gautham R. Shenoy 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-13 7:25 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, linux-kernel Hi Gautham, thanks for the review, On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote: > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote: > > Hi Aaron, > > > > thanks for the review, > > > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote: > > [..snip..] > > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > > > > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > > > > { > > > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > > > > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > > > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > > > > + if (sched_feat(SIS_CACHE) && > > > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > > > > + return -1; > > > > + > > > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout, > > > like available_idle_cpu_migrate() so that above and below logic can be > > > simplified a bit? > > > > > > > Yes, that would be simpler, I'll do in next version. > > > > > I was thinking to simply add that rq->cache_hot_timeout check to > > > available_idle_cpu() but then a long sleeping task could be forced to > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's > > > cache_hot_timeout. I guess that's why you chose to only change the idle > > > semantic in select_idle_cpu() but not in select_idle_sibling()? > > > > > > > Yes, sort of. 
And the reason I did not put this cache hot check in available_idle_cpu() > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling(). > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped, > > thus no one could use this CPU within the cache hot period, including the cache-hot task > > itself. > > This happens even with this patch right? It is possible for a task p1 > whose avg sleep time is "t" to go to sleep which causes its CPU to go > idle. When it wakes up after a time t' < t, the logic above skips the > idle CPU because it is still cache-hot, despite the fact that it is > cache hot for p1! > Not sure if I understand correctly, in select_idle_sibling() p1's previous CPU will be checked first, and that check does not involve cache-hot. So if p1's previous CPU is idle, it will be picked, no? if (prev != target && cpus_share_cache(prev, target) && (available_idle_cpu(prev) || sched_idle_cpu(prev)) && asym_fits_cpu(task_util, util_min, util_max, prev)) return prev; Or do you mean, in select_idle_cpu(), we will re-check p1's previous CPU but it is skipped due to cache-hot? > Have you considered recording p1's identity in the > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply > return the previous CPU if it is cache hot and the wakee is > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling > scan. > Yes this seems to be donable, and one problem would be, if there are more than 2 dequeued tasks prefer the same (previous) CPU, which task should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we want to deal with multiple dequeued tasks. If we record all of them, it might be costly. [1] https://lore.kernel.org/lkml/2a47ae82-b8cd-95db-9f48-82b3df0730f3@efficios.com/ thanks, Chenyu ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-13 7:25 ` Chen Yu @ 2023-09-14 7:06 ` Gautham R. Shenoy 2023-09-14 12:09 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: Gautham R. Shenoy @ 2023-09-14 7:06 UTC (permalink / raw) To: Chen Yu Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, linux-kernel Hello Chen Yu, On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote: > Hi Gautham, > > thanks for the review, > > On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote: > > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote: > > > Hi Aaron, > > > > > > thanks for the review, > > > > > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote: > > > > [..snip..] > > > > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > > > > > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > > > > > { > > > > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > > > > > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > > > > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > > > > > + if (sched_feat(SIS_CACHE) && > > > > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > > > > > + return -1; > > > > > + > > > > > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout, > > > > like available_idle_cpu_migrate() so that above and below logic can be > > > > simplified a bit? > > > > > > > > > > Yes, that would be simpler, I'll do in next version. > > > > > > > I was thinking to simply add that rq->cache_hot_timeout check to > > > > available_idle_cpu() but then a long sleeping task could be forced to > > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's > > > > cache_hot_timeout. 
I guess that's why you chose to only change the idle > > > > semantic in select_idle_cpu() but not in select_idle_sibling()? > > > > > > > > > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu() > > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling(). > > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped, > > > thus no one could use this CPU within the cache hot period, including the cache-hot task > > > itself. > > > > This happens even with this patch right? It is possible for a task p1 > > whose avg sleep time is "t" to go to sleep which causes its CPU to go > > idle. When it wakes up after a time t' < t, the logic above skips the > > idle CPU because it is still cache-hot, despite the fact that it is > > cache hot for p1! > > > Not sure if I understand correctly, in select_idle_sibling() p1's previous > CPU will be checked first, and that check does not involve cache-hot. So if > p1's previous CPU is idle, it will be picked, no? > > if (prev != target && cpus_share_cache(prev, target) && > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > asym_fits_cpu(task_util, util_min, util_max, prev)) > return prev; You are right, but the "if" condition is the one prior to "if (prev != target ...)" which causes p1 can pick the previous CPU here. However, note that it is not just p1 which can pick the previous CPU as intended by your patch. Consider the following case: * p1' ran on CPUX, and went to sleep. We now update the cache_hot_timeout for CPUX based on the sleep-avg of p1' * When the CPU goes idle, during CPU_NEWLY_IDLE load balancing, (or due to shared_runq search), the CPU could pull p1 from another CPU Y. * p1 now runs on CPUX, and goes to sleep. * We update the cache_hot_timeout for CPUX based on the sleep-avg of p1. * p1' wakesup and picks prev as the target after doing wake_affine() check in select_task_rq_fair(). 
* In select_idle_sibling() it encounters the following, which checks out and thus returns target. if ((available_idle_cpu(target) || sched_idle_cpu(target)) && asym_fits_cpu(task_util, util_min, util_max, target)) return target; * p1 now wakes up, picks the previous cpu as the target. However, available_idle_cpu() is no longer true. So despite "reserving" the CPU for p1, which is likely to have its data still hot in the cache, we would have scheduled p1', thus defeating the whole purpose of reservation. To be honest, this isn't so bad, because we have been able to avoid a migration in this case. > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous > CPU but it is skipped due to cache-hot? I had originally thought about this, but then as you pointed out we have an opportunity to pick the previous cpu in the early checks inside select_idle_sibling(). > > > Have you considered recording p1's identity in the > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply > > return the previous CPU if it is cache hot and the wakee is > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling > > scan. > > > > Yes this seems to be doable, and one problem would be, if there are > > two or more dequeued tasks that prefer the same (previous) CPU, which task > > should be the rq->cache_hot_sleeper. If there are multiple dequeued tasks, then it doesn't make sense to record the identity of the tasks. However, we need the bail-out to be much earlier, in select_task_rq_fair(), perhaps even before the want_affine() checks. After all, if the previous CPU is idle, and its cache_hot_timeout hasn't expired, and if the wakee's sleep duration is less than the cache_hot_timeout, why don't we just pick it here and be done with it?
> > [1] https://lore.kernel.org/lkml/2a47ae82-b8cd-95db-9f48-82b3df0730f3@efficios.com/ > > thanks, > Chenyu -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 34+ messages in thread
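[Editor's note: the early bail-out Gautham suggests above is not part of the posted patch; one possible shape of it, with a hypothetical cache_hot_sleeper field and toy_* names standing in for the real rq/task structures, might look like the following sketch.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_task { int prev_cpu; };

struct toy_rq {
	bool idle;
	uint64_t cache_hot_timeout;
	const struct toy_task *cache_hot_sleeper;	/* hypothetical field */
};

/* Early bail-out before any select_idle_sibling()-style scan: if the wakee
 * is the recorded sleeper and its previous CPU is idle and still inside
 * the cache-hot window, return prev_cpu directly; -1 means fall through
 * to the normal path. */
static int toy_early_pick(const struct toy_task *p,
			  const struct toy_rq *prev_rq, uint64_t now)
{
	if (prev_rq->idle && prev_rq->cache_hot_sleeper == p &&
	    now < prev_rq->cache_hot_timeout)
		return p->prev_cpu;
	return -1;
}
```

Only the recorded sleeper gets the shortcut; any other wakee, or a wakeup after the timeout, falls through to the full scan.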
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-14 7:06 ` Gautham R. Shenoy @ 2023-09-14 12:09 ` Chen Yu 2023-09-15 15:18 ` Gautham R. Shenoy 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-14 12:09 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, linux-kernel Hi Gautham, On 2023-09-14 at 12:36:33 +0530, Gautham R. Shenoy wrote: > Hello Chen Yu, > > On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote: > > Hi Gautham, > > > > thanks for the review, > > > > On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote: > > > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote: > > > > Hi Aaron, > > > > > > > > thanks for the review, > > > > > > > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote: > > > > > > [..snip..] > > > > > > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > > > > > > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > > > > > > { > > > > > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > > > > > > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > > > > > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > > > > > > + if (sched_feat(SIS_CACHE) && > > > > > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > > > > > > + return -1; > > > > > > + > > > > > > > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout, > > > > > like available_idle_cpu_migrate() so that above and below logic can be > > > > > simplified a bit? > > > > > > > > > > > > > Yes, that would be simpler, I'll do in next version. 
> > > > > > > > > I was thinking to simply add that rq->cache_hot_timeout check to > > > > > available_idle_cpu() but then a long sleeping task could be forced to > > > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's > > > > > cache_hot_timeout. I guess that's why you chose to only change the idle > > > > > semantic in select_idle_cpu() but not in select_idle_sibling()? > > > > > > > > > > > > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu() > > > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling(). > > > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped, > > > > thus no one could use this CPU within the cache hot period, including the cache-hot task > > > > itself. > > > > > > This happens even with this patch right? It is possible for a task p1 > > > whose avg sleep time is "t" to go to sleep which causes its CPU to go > > > idle. When it wakes up after a time t' < t, the logic above skips the > > > idle CPU because it is still cache-hot, despite the fact that it is > > > cache hot for p1! > > > > > Not sure if I understand correctly, in select_idle_sibling() p1's previous > > CPU will be checked first, and that check does not involve cache-hot. So if > > p1's previous CPU is idle, it will be picked, no? > > > > if (prev != target && cpus_share_cache(prev, target) && > > (available_idle_cpu(prev) || sched_idle_cpu(prev)) && > > asym_fits_cpu(task_util, util_min, util_max, prev)) > > return prev; > > > You are right, but the "if" condition is the one prior to "if (prev != > target ...)" which causes p1 can pick the previous CPU here. However, > note that it is not just p1 which can pick the previous CPU as > intended by your patch. > > Consider the following case: > > * p1' ran on CPUX, and went to sleep. 
We now update the > cache_hot_timeout for CPUX based on the sleep-avg of p1' > > * When the CPU goes idle, during CPU_NEWLY_IDLE load balancing, (or > due to shared_runq search), the CPU could pull p1 from another CPU > Y. Agree, newidle balance could clobber the cache-hot CPU. > > * p1 now runs on CPUX, and goes to sleep. > > * We update the cache_hot_timeout for CPUX based on the sleep-avg of p1. > > * p1' wakes up and picks prev as the target after doing the wake_affine() > check in select_task_rq_fair(). > > * In select_idle_sibling() it encounters the following, which checks > out and thus returns target. > > if ((available_idle_cpu(target) || sched_idle_cpu(target)) && > asym_fits_cpu(task_util, util_min, util_max, target)) > return target; > > * p1 now wakes up, picks the previous cpu as the target. However, > available_idle_cpu() is no longer true. > > So despite "reserving" the CPU for p1, which is likely to have its > data still hot in the cache, we would have scheduled p1', thus > defeating the whole purpose of reservation. > I see. So you mean, although we reserve the CPU for the wakee, the wakee might not choose its previous CPU, which is against our goal. The reasons that the wakee might not choose its previous CPU could be: 1. wake_affine() chooses the waker's CPU rather than the wakee's previous CPU, or 2. the wakee's previous CPU has already been taken by someone else, via newidle_balance(). For 1, I think Prateek has expressed the concern. One mitigation method could be that we penalize that wakee if it decides not to choose its previous CPU: " new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); if (new_cpu != prev_cpu) p->burst_sleep_avg >>= 1; So the duration of reservation could be shrunk.
" For 2, maybe inhit the newidle balance, something in my mind: --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) u64 t0, t1, curr_cost = 0; struct sched_domain *sd; int pulled_task = 0; + bool cache_hot = false; update_misfit_status(NULL, this_rq); @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) rcu_read_lock(); sd = rcu_dereference_check_sched_domain(this_rq->sd); + if (sched_feat(SIS_CACHE)) { + s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu); + + /* + * If a short time later, a short sleeping task will be woken up + * on this idle CPU, do not launch the newidle balance. + */ + if (delta > 0 && delta < this_rq->max_idle_balance_cost) + cache_hot = true; + } + if (!READ_ONCE(this_rq->rd->overload) || - (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) { if (sd) update_next_balance(sd, &next_balance); > To be honest, this isn't so bad, because we have been able to avoid a > migration in this case. > > > > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous > > CPU but it is skipped due to cache-hot? > > I had originally thought about this, but then as you pointed out we > have an opportunity to pick the previous cpu in the early checks > inside select_idle_sibling(). > > > > > > Have you considered recording p1's identity in the > > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply > > > return the previous CPU if it is cache hot and the wakee is > > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling > > > scan. > > > > > > > Yes this seems to be donable, and one problem would be, if there are > > more than 2 dequeued tasks prefer the same (previous) CPU, which task > > should be the rq->cache_hot_sleeper. 
And per Mathieu's feedback[1], we > > want to deal with multiple dequeued tasks. If we record all of them, > > it might be costly. > > If there are multiple dequeued tasks, then it doesn't make sense to > record the identity of the tasks. However, we need the bail-out to be > much earlier, in select_task_rq_fair(), perhaps even before the > want_affine() checks. > > After all, if the previous CPU is idle, and its cache_hot_timeout > hasn't expired, and if the wakee's sleep duration is less than the > cache_hot_timeout, why don't we just pick it here and be done with it? > Yes, we can return the previous CPU earlier; one concern is whether we should honor the WF_SYNC flag, because in wake_affine_idle(), WF_SYNC seems to have a higher priority than available_idle_cpu(prev_cpu). Say, if the current CPU has 1 running task, and the previous CPU is idle, wake_affine_idle() still prefers the current CPU. thanks, Chenyu ^ permalink raw reply [flat|nested] 34+ messages in thread
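[Editor's note: the newidle_balance() tweak Chen Yu sketches above hinges on one comparison. Reduced to a standalone function with the rq fields passed as plain arguments (purely illustrative, not thread code):]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Skip CPU_NEWLY_IDLE balancing when a short sleeper is due back soon.
 * delta is how far into the future the reservation still reaches; if it
 * is positive but shorter than the cost of a balance pass, pulling work
 * now would likely just evict the returning task. */
static bool toy_skip_newidle(uint64_t cache_hot_timeout, uint64_t now,
			     uint64_t max_idle_balance_cost)
{
	int64_t delta = (int64_t)(cache_hot_timeout - now);

	return delta > 0 && (uint64_t)delta < max_idle_balance_cost;
}
```

A reservation expiring 100 ns from now with a 500 ns balance cost inhibits the balance; an expired or far-future timeout does not.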
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-14 12:09 ` Chen Yu @ 2023-09-15 15:18 ` Gautham R. Shenoy 2023-09-19 9:01 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: Gautham R. Shenoy @ 2023-09-15 15:18 UTC (permalink / raw) To: Chen Yu Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, linux-kernel Hello Chen Yu, On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote: [..snip..] > > > > So despite "reserving" the CPU for p1, which is likely to have its > > data still hot in the case, we would have scheduled p1', thus > > defeating the whole purpose of reservation. > > > > I see. So you mean, although we reserve the CPU for the wakee, > the wakee might not choose its previous CPU, which is against our > goal. Yes, but only because some other task could have run on the previous CPU. That other task could be something that was woken up on that CPU due to: 1) wake-affine choosing that CPU 2) newidle-balance pulling the other task on that CPU 3) !wake-affine && that CPU was also the other task's previous CPU It could also be due to this wakee task being woken up on the waker CPU due to wake-affine. > > The reason to prevent the wakee choosing its previous CPU could be: > 1. wake_affine() choose the waker's CPU rather the wakee's previous CPU, or > 2. the wakee's CPU has already been taken by someone else, via newidle_balance(). > > For 1, I think Prateek has expressed the concern. 
One mitigation method could be > > that, we give penalty to that wakee, if it decides not to choose its previous CPU: > > We would be penalizing the task for something that the scheduler > decides :-) > > As you point out below, in the presence of the WF_SYNC flag, > wake_affine_idle() prefers the waker CPU over the previous CPU when > they are on different LLCs and when the waker is the only task. > > This strategy makes sense for two reasons: > > 1) The wakee may be consuming the data produced by the waker. > 2) Since the wakeup will happen on the local CPU, there is no risk of > task-stacking, exactly what your SIS_CURRENT patchset was > attempting. > > But this strategy would also result in increased task-migration. Which > both Mathieu and you have found is not so beneficial for workloads > such as hackbench. Is it only because the task's data is still hot in the > previous CPU's cache? Or is there more to it? > > It would be good to confirm if this is why lower migration is better > for these kinds of workloads. > > " > > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); > > if (new_cpu != prev_cpu) > > p->burst_sleep_avg >>= 1; > > So the duration of reservation could be shrunk. > > " > > > > For 2, maybe inhibit the newidle balance, something in my mind: > > > > > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > u64 t0, t1, curr_cost = 0; > > struct sched_domain *sd; > > int pulled_task = 0; > > + bool cache_hot = false; > > > > update_misfit_status(NULL, this_rq); > > > > @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > rcu_read_lock(); > > sd = rcu_dereference_check_sched_domain(this_rq->sd); > > > > + if (sched_feat(SIS_CACHE)) { > > + s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu); > > + > > + /* > > + * If a short time later, a short sleeping task will be woken up > > + * on this idle CPU, do not launch the newidle balance.
> + */ > + if (delta > 0 && delta < this_rq->max_idle_balance_cost) > + cache_hot = true; > + } > + > if (!READ_ONCE(this_rq->rd->overload) || > - (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { > + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) { > > if (sd) > update_next_balance(sd, &next_balance); If the benefit that the workload obtains is really due to the data being hot near its previous CPU, then this seems a sensible thing to do. It would be good to confirm this. Let me get some IBS data for hackbench which is the workload which likes a sticky wakeup. -- Thanks and Regards gautham. > > > > To be honest, this isn't so bad, because we have been able to avoid a > > migration in this case. > > > > > > > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous > > > CPU but it is skipped due to cache-hot? > > > > I had originally thought about this, but then as you pointed out we > > have an opportunity to pick the previous cpu in the early checks > > inside select_idle_sibling(). > > > > > > > > > Have you considered recording p1's identity in the > > > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply > > > > return the previous CPU if it is cache hot and the wakee is > > > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling > > > > scan. > > > > > > > > > > Yes this seems to be donable, and one problem would be, if there are > > > more than 2 dequeued tasks prefer the same (previous) CPU, which task > > > should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we > > > want to deal with multiple dequeued tasks. If we record all of them, > > > it might be costly. > > > > If there are multiple dequeued tasks, then it doesn't make sense to > > record the identity of the tasks. However, we need the bail out to be > > much earlier, in select_task_rq_fair(), perhaps even before the > > want_affine() checks. 
> > > > After all, if the previous CPU is idle, and its cache_hot_timeout > > hasn't expired, and if the wakee's sleep duration is less than the > > cache_hot_timeout, why don't we just pick it here and be done with it? > > > > Yes we can return the previous CPU earlier, one concern is that, should > we honor WF_SYNC flag or not, because in wake_affine_idle(), WF_SYNC > seems to have a higher priority than available_idle_cpu(prev_cpu). Say, > if the current CPU has 1 running task, and the previous CPU is idle, > wake_affine_idle() still prefers the current CPU. > > thanks, > Chenyu ^ permalink raw reply [flat|nested] 34+ messages in thread
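[Editor's note: the WF_SYNC precedence discussed above can be seen in a rough paraphrase of mainline wake_affine_idle(), with the rq/cpumask state reduced to booleans. This is not the thread's patch, just an illustration of why a sync wakeup with one runner on the waker CPU wins before the idle previous CPU is even considered.]

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_PICK (-1)

/* Simplified decision order of wake_affine_idle():
 * 1) an idle, cache-sharing waker CPU settles it immediately;
 * 2) a WF_SYNC wakeup with exactly one runner picks the waker CPU,
 *    beating an idle prev_cpu;
 * 3) only then does an idle prev_cpu get picked. */
static int toy_wake_affine_idle(int this_cpu, int prev_cpu, bool sync,
				bool this_idle, bool share_cache,
				int this_nr_running, bool prev_idle)
{
	if (this_idle && share_cache)
		return prev_idle ? prev_cpu : this_cpu;

	if (sync && this_nr_running == 1)
		return this_cpu;	/* WF_SYNC beats an idle prev_cpu */

	if (prev_idle)
		return prev_cpu;

	return TOY_NO_PICK;
}
```

With sync set and one runner on CPU 0, CPU 0 is chosen even though CPU 4 (prev) is idle; without sync, the idle previous CPU wins.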
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-15 15:18 ` Gautham R. Shenoy @ 2023-09-19 9:01 ` Chen Yu 0 siblings, 0 replies; 34+ messages in thread From: Chen Yu @ 2023-09-19 9:01 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, linux-kernel Hi Gautham, Sorry for late reply, On 2023-09-15 at 20:48:14 +0530, Gautham R. Shenoy wrote: > Hello Chen Yu, > > On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote: > [..snip..] > > > > > > > So despite "reserving" the CPU for p1, which is likely to have its > > > data still hot in the case, we would have scheduled p1', thus > > > defeating the whole purpose of reservation. > > > > > > > I see. So you mean, although we reserve the CPU for the wakee, > > the wakee might not choose its previous CPU, which is against our > > goal. > > > Yes, but only because some other task could have run on the previous > CPU. That other task could be something that was woken up on that CPU > due to: > > 1) wake-affine choosing that CPU > 2) newidle-balance pulling the other task on that CPU > 3) !wake-affine && that CPU was also the other task's previous CPU > > It could also be due to this wakee task being woken up on the waker > CPU due to wake-affine. > > > > > The reason to prevent the wakee choosing its previous CPU could be: > > 1. wake_affine() choose the waker's CPU rather the wakee's previous CPU, or > > 2. the wakee's CPU has already been taken by someone else, via newidle_balance(). > > > > > > For 1, I think Prateek has expressed the concern. 
One mitigation method could be > > that, we give penalty to that wakee, if it decides not to choose its previous CPU: > > We would be penalizing the task for something that the scheduler > decides :-) > > As you point out below, in the presence of the WF_SYNC flag, > wake_affine_idle() prefer the waker CPU over the previous CPU when > they are on different LLCs and when the waker is the only task. > > This strategy makes sense for two reasons: > > 1) The wakee may be consuming the data produced by the waker. > 2) Since the wakeup will happen on the local CPU, there is no risk of > task-stacking, exactly what your SIS_CURRENT patchset was > attempting. > > But this strategy would also result in increased task-migration. Which > both Mattieu and you have found is not so beneficial for workloads > such as hackbench. Is it only because task's data is still hot in the > previous CPU's cache ? Or is there more to it ? > > > It would be good to confirm if this is why lower migration is better > for these kinds of workloads. > Right. According to the previous hackbench test for shared runqueue, higher migration brings higher DSB miss rate [1]. I'll collect some statistics with this patch applied to confirm. https://lore.kernel.org/lkml/ZO7e5YaS71cXVxQN@chenyu5-mobl2/ > > > > " > > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); > > if (new_cpu != prev_cpu) > > p->burst_sleep_avg >>= 1; > > So the duration of reservation could be shrinked. 
> > " > > > > For 2, maybe inhit the newidle balance, something in my mind: > > > > > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > u64 t0, t1, curr_cost = 0; > > struct sched_domain *sd; > > int pulled_task = 0; > > + bool cache_hot = false; > > > > update_misfit_status(NULL, this_rq); > > > > @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf) > > rcu_read_lock(); > > sd = rcu_dereference_check_sched_domain(this_rq->sd); > > > > + if (sched_feat(SIS_CACHE)) { > > + s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu); > > + > > + /* > > + * If a short time later, a short sleeping task will be woken up > > + * on this idle CPU, do not launch the newidle balance. > > + */ > > + if (delta > 0 && delta < this_rq->max_idle_balance_cost) > > + cache_hot = true; > > + } > > + > > if (!READ_ONCE(this_rq->rd->overload) || > > - (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) { > > + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) { > > > > > if (sd) > > update_next_balance(sd, &next_balance); > > If the benefit that the workload obtains is really due to the data > being hot near its previous CPU, then this seems a sensible thing to > do. > > It would be good to confirm this. Let me get some IBS data for > hackbench which is the workload which likes a sticky wakeup. > I'll collect the statistics too. Thanks for your time. thanks, Chenyu ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu 2023-09-11 7:26 ` Aaron Lu @ 2023-09-11 8:29 ` K Prateek Nayak 2023-09-11 10:19 ` Chen Yu 2023-09-12 6:32 ` Mike Galbraith 2023-09-11 15:26 ` Mathieu Desnoyers 2023-09-14 5:30 ` K Prateek Nayak 3 siblings, 2 replies; 34+ messages in thread From: K Prateek Nayak @ 2023-09-11 8:29 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hello Chenyu, On 9/11/2023 8:20 AM, Chen Yu wrote: > [..snip..] > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > kernel/sched/features.h | 1 + > kernel/sched/sched.h | 1 + > 3 files changed, 29 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e20f50726ab8..fe3b760c9654 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > now = sched_clock_cpu(cpu_of(rq)); > p->se.prev_sleep_time = task_sleep ? now : 0; > +#ifdef CONFIG_SMP > + /* > + * If this rq will become idle, and dequeued task is > + * a short sleeping one, check if we can reserve > + * this idle CPU for that task for a short while. > + * During this reservation period, other wakees will > + * skip this 'idle' CPU in select_idle_cpu(), and this > + * short sleeping task can pick its previous CPU in > + * select_idle_sibling(), which brings better cache > + * locality. 
> + */ > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > + rq->cache_hot_timeout = now + p->se.sleep_avg; > +#endif > } > > #ifdef CONFIG_SMP > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > { > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > + if (sched_feat(SIS_CACHE) && > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > + return -1; Just wondering, Similar to how select_idle_core() caches the "idle_cpu" if it ends up finding one in its search for an idle core, would returning a "cache-hot idle CPU" be better than returning previous CPU / current CPU if all idle CPUs found during the search in select_idle_cpu() are marked cache-hot? Speaking of cache-hot idle CPU, is netperf actually more happy with piling on current CPU? I ask this because the logic seems to be reserving the previous CPU for a task that dislikes migration but I do not see anything in the wake_affine_idle() path that would make the short sleeper proactively choose the previous CPU when the wakeup is marked with the WF_SYNC flag. Let me know if I'm missing something? To confirm this can you look at the trend in migration count with and without the series? Also the ratio of cache-hot idle CPUs to number of CPUs searched can help estimate overheads of additional search - I presume SIS_UTIL is efficient at curbing the additional search in a busy system. In the meantime, I'll queue this series for testing on my end too. > + > return cpu; > + } > > return -1; > } > [..snip..] -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 34+ messages in thread
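[Editor's note: Prateek's suggestion of remembering a cache-hot idle CPU as a fallback, analogous to how select_idle_core() caches an "idle_cpu" during its scan, could be sketched like this. The array-based scan is a toy model, not the kernel's cpumask iteration.]

```c
#include <assert.h>
#include <stdbool.h>

/* Scan candidates like select_idle_cpu(): skip cache-hot idle CPUs, but
 * remember the first one as a fallback so that a reserved CPU can still
 * be returned when no plain idle CPU exists. */
static int toy_scan(const bool *idle, const bool *cache_hot, int n)
{
	int fallback = -1;

	for (int cpu = 0; cpu < n; cpu++) {
		if (!idle[cpu])
			continue;
		if (cache_hot[cpu]) {
			if (fallback < 0)
				fallback = cpu;
			continue;
		}
		return cpu;		/* plain idle CPU: take it */
	}
	return fallback;		/* may be -1: nothing idle at all */
}
```

When CPUs 1 and 2 are idle but 1 is reserved, CPU 2 wins; when only reserved CPU 1 is idle, it is returned as the fallback rather than forcing a pile-up elsewhere.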
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 8:29 ` K Prateek Nayak @ 2023-09-11 10:19 ` Chen Yu 2023-09-12 3:05 ` K Prateek Nayak 2023-09-12 9:39 ` Mike Galbraith 2023-09-12 6:32 ` Mike Galbraith 1 sibling, 2 replies; 34+ messages in thread From: Chen Yu @ 2023-09-11 10:19 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hi Prateek, thanks for your review, On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > On 9/11/2023 8:20 AM, Chen Yu wrote: > > [..snip..] > > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > > kernel/sched/features.h | 1 + > > kernel/sched/sched.h | 1 + > > 3 files changed, 29 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index e20f50726ab8..fe3b760c9654 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > hrtick_update(rq); > > now = sched_clock_cpu(cpu_of(rq)); > > p->se.prev_sleep_time = task_sleep ? now : 0; > > +#ifdef CONFIG_SMP > > + /* > > + * If this rq will become idle, and dequeued task is > > + * a short sleeping one, check if we can reserve > > + * this idle CPU for that task for a short while. > > + * During this reservation period, other wakees will > > + * skip this 'idle' CPU in select_idle_cpu(), and this > > + * short sleeping task can pick its previous CPU in > > + * select_idle_sibling(), which brings better cache > > + * locality. 
> > + */ > > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > > + rq->cache_hot_timeout = now + p->se.sleep_avg; > > +#endif > > } > > > > #ifdef CONFIG_SMP > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > > { > > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > > + if (sched_feat(SIS_CACHE) && > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > > + return -1; > > Just wondering, > > Similar to how select_idle_core() caches the "idle_cpu" if it ends up > finding one in its search for an idle core, would returning a "cache-hot > idle CPU" be better than returning previous CPU / current CPU if all > idle CPUs found during the search in select_idle_cpu() are marked > cache-hot? > This is a good point, we can optimize this further. Currently I have only sent a simple version to demonstrate how we can leverage the task's sleep time. > Speaking of cache-hot idle CPU, is netperf actually more happy with > piling on current CPU? Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to put the waker and wakee together. > I ask this because the logic seems to be > reserving the previous CPU for a task that dislikes migration but I > do not see anything in the wake_affine_idle() path that would make the > short sleeper proactively choose the previous CPU when the wakeup is > marked with the WF_SYNC flag. Let me know if I'm missing something? > If I understand correctly, WF_SYNC is to let the wakee be woken up on the waker's CPU, rather than the wakee's previous CPU, because the waker goes to sleep after wakeup. SIS_CACHE mainly cares about the wakee's previous CPU.
We can only restrict that other wakees do not occupy the previous
CPU, but we do not increase the probability that wake_affine_idle()
chooses the previous CPU.

Say there are two tasks t1, t2, and t1's previous CPU is p1. We do
not influence whether wake_affine_idle() chooses p1 when t1 is woken
up, but we make sure t2 will not choose p1.

> To confirm this can you look at the trend in migration count with and
> without the series? Also the ratio of cache-hot idle CPUs to number
> of CPUs searched can help estimate overheads of additional search - I
> presume SIS_UTIL is efficient at curbing the additional search in
> a busy system.

OK, I'll collect these statistics.

> 
> In the meantime, I'll queue this series for testing on my end too.

Thanks again for your time.

thanks,
Chenyu

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 10:19 ` Chen Yu @ 2023-09-12 3:05 ` K Prateek Nayak 2023-09-12 12:32 ` Chen Yu 2023-09-12 9:39 ` Mike Galbraith 1 sibling, 1 reply; 34+ messages in thread From: K Prateek Nayak @ 2023-09-12 3:05 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hello Chenyu, On 9/11/2023 3:49 PM, Chen Yu wrote: > Hi Prateek, > > thanks for your review, > > On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote: >> Hello Chenyu, >> >> On 9/11/2023 8:20 AM, Chen Yu wrote: >>> [..snip..] >>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- >>> kernel/sched/features.h | 1 + >>> kernel/sched/sched.h | 1 + >>> 3 files changed, 29 insertions(+), 3 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index e20f50726ab8..fe3b760c9654 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) >>> hrtick_update(rq); >>> now = sched_clock_cpu(cpu_of(rq)); >>> p->se.prev_sleep_time = task_sleep ? now : 0; >>> +#ifdef CONFIG_SMP >>> + /* >>> + * If this rq will become idle, and dequeued task is >>> + * a short sleeping one, check if we can reserve >>> + * this idle CPU for that task for a short while. >>> + * During this reservation period, other wakees will >>> + * skip this 'idle' CPU in select_idle_cpu(), and this >>> + * short sleeping task can pick its previous CPU in >>> + * select_idle_sibling(), which brings better cache >>> + * locality. 
>>> + */ >>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && >>> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) >>> + rq->cache_hot_timeout = now + p->se.sleep_avg; >>> +#endif >>> } >>> >>> #ifdef CONFIG_SMP >>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p >>> static inline int __select_idle_cpu(int cpu, struct task_struct *p) >>> { >>> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && >>> - sched_cpu_cookie_match(cpu_rq(cpu), p)) >>> + sched_cpu_cookie_match(cpu_rq(cpu), p)) { >>> + if (sched_feat(SIS_CACHE) && >>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) >>> + return -1; >> >> Just wondering, >> >> Similar to how select_idle_core() caches the "idle_cpu" if it ends up >> finding one in its search for an idle core, would returning a "cache-hot >> idle CPU" be better than returning previous CPU / current CPU if all >> idle CPUs found during the search in select_idle_cpu() are marked >> cache-hot? >> > > This is a good point, we can optimize this further. Currently I only > send a simple version to desmonstrate how we can leverage the task's > sleep time. > >> Speaking of cache-hot idle CPU, is netperf actually more happy with >> piling on current CPU? > > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to > put the waker and wakee together. > >> I ask this because the logic seems to be >> reserving the previous CPU for a task that dislikes migration but I >> do not see anything in the wake_affine_idle() path that would make the >> short sleeper proactively choose the previous CPU when the wakeup is >> marked with the WF_SYNC flag. Let me know if I'm missing something? >> > > If I understand correctly, WF_SYNC is to let the wakee to woken up > on the waker's CPU, rather than the wakee's previous CPU, because > the waker goes to sleep after wakeup. SIS_CACHE mainly cares about > wakee's previous CPU. 
We can only restrict that other wakee does not
> occupy the previous CPU, but do not enhance the possibility that
> wake_affine_idle() chooses the previous CPU.

Correct me if I'm wrong here,

Say a short sleeper is always woken up using the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and restrict any wakeup happening there until "cache_hot_timeout" is
crossed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the
WF_SYNC flag, the whole reservation is ignored and the waker's LLC is
explored.

Should the timeout be cleared if the wakeup decides to not target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could possibly show here, but
if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
a larger value, the wakeup path might be affected where a lot of idle
targets are overlooked since the CPUs are marked cache-hot for a longer
duration.)

Let me know what you think.

> 
> Say, there are two tasks t1, t2. t1's previous CPU is p1.
> We don't enhance that when t1 is woken up, wake_affine_idle() will
> choose p1 or not, but we makes sure t2 will not choose p1.
> 
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
> 
> OK, I'll collect these statistics.

Thank you :)

> 
> [..snip..]
> 

-- 
Thanks and Regards,
Prateek

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 3:05 ` K Prateek Nayak @ 2023-09-12 12:32 ` Chen Yu 2023-09-12 14:26 ` K Prateek Nayak 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-12 12:32 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hi Prateek, On 2023-09-12 at 08:35:12 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > On 9/11/2023 3:49 PM, Chen Yu wrote: > > Hi Prateek, > > > > thanks for your review, > > > > On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote: > >> Hello Chenyu, > >> > >> On 9/11/2023 8:20 AM, Chen Yu wrote: > >>> [..snip..] > >>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > >>> kernel/sched/features.h | 1 + > >>> kernel/sched/sched.h | 1 + > >>> 3 files changed, 29 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >>> index e20f50726ab8..fe3b760c9654 100644 > >>> --- a/kernel/sched/fair.c > >>> +++ b/kernel/sched/fair.c > >>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > >>> hrtick_update(rq); > >>> now = sched_clock_cpu(cpu_of(rq)); > >>> p->se.prev_sleep_time = task_sleep ? now : 0; > >>> +#ifdef CONFIG_SMP > >>> + /* > >>> + * If this rq will become idle, and dequeued task is > >>> + * a short sleeping one, check if we can reserve > >>> + * this idle CPU for that task for a short while. > >>> + * During this reservation period, other wakees will > >>> + * skip this 'idle' CPU in select_idle_cpu(), and this > >>> + * short sleeping task can pick its previous CPU in > >>> + * select_idle_sibling(), which brings better cache > >>> + * locality. 
> >>> + */ > >>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > >>> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > >>> + rq->cache_hot_timeout = now + p->se.sleep_avg; > >>> +#endif > >>> } > >>> > >>> #ifdef CONFIG_SMP > >>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > >>> static inline int __select_idle_cpu(int cpu, struct task_struct *p) > >>> { > >>> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > >>> - sched_cpu_cookie_match(cpu_rq(cpu), p)) > >>> + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > >>> + if (sched_feat(SIS_CACHE) && > >>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > >>> + return -1; > >> > >> Just wondering, > >> > >> Similar to how select_idle_core() caches the "idle_cpu" if it ends up > >> finding one in its search for an idle core, would returning a "cache-hot > >> idle CPU" be better than returning previous CPU / current CPU if all > >> idle CPUs found during the search in select_idle_cpu() are marked > >> cache-hot? > >> > > > > This is a good point, we can optimize this further. Currently I only > > send a simple version to desmonstrate how we can leverage the task's > > sleep time. > > > >> Speaking of cache-hot idle CPU, is netperf actually more happy with > >> piling on current CPU? > > > > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to > > put the waker and wakee together. > > > >> I ask this because the logic seems to be > >> reserving the previous CPU for a task that dislikes migration but I > >> do not see anything in the wake_affine_idle() path that would make the > >> short sleeper proactively choose the previous CPU when the wakeup is > >> marked with the WF_SYNC flag. Let me know if I'm missing something? 
> >> > > > > If I understand correctly, WF_SYNC is to let the wakee to woken up > > on the waker's CPU, rather than the wakee's previous CPU, because > > the waker goes to sleep after wakeup. SIS_CACHE mainly cares about > > wakee's previous CPU. We can only restrict that other wakee does not > > occupy the previous CPU, but do not enhance the possibility that > > wake_affine_idle() chooses the previous CPU. > > Correct me if I'm wrong here, > > Say a short sleeper, is always woken up using WF_SYNC flag. When the > task is dequeued, we mark the previous CPU where it ran as "cache-hot" > and restrict any wakeup happening until the "cache_hot_timeout" is > crossed. Let us assume a perfect world where the task wakes up before > the "cache_hot_timeout" expires. Logically this CPU was reserved all > this while for the short sleeper but since the wakeup bears WF_SYNC > flag, the whole reservation is ignored and waker's LLC is explored. > Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle() forces the short sleeping wakee to be woken up on waker's CPU rather the wakee's previous CPU, but wakee's previous has been marked as cache hot for nothing? > Should the timeout be cleared if the wakeup decides to not target the > previous CPU? (The default "sysctl_sched_migration_cost" is probably > small enough to curb any side effect that could possibly show here but > if a genuine use-case warrants setting "sysctl_sched_migration_cost" to > a larger value, the wakeup path might be affected where lot of idle > targets are overlooked since the CPUs are marked cache-hot forr longer > duration) > > Let me know what you think. > This makes sense. In theory the above logic can be added in select_idle_sibling(), if target CPU is chosen rather than the previous CPU, the previous CPU's cache hot flag should be cleared. But this might bring overhead. Because we need to grab the rq lock and write to other CPU's rq, which could be costly. 
It seems to be a trade-off of the current implementation. On the other
hand, if the user sets sysctl_sched_migration_cost to a quite large
value:
1. Without SIS_CACHE, there is no task migration.
2. With SIS_CACHE enabled, all idle CPUs are cache hot and are skipped
   in select_idle_cpu(), so the wakee will be woken up locally.
The two seem to have the same effect, so there is not much impact on
wakeup behavior I suppose.

thanks,
Chenyu

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 12:32 ` Chen Yu @ 2023-09-12 14:26 ` K Prateek Nayak 2023-09-13 2:57 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: K Prateek Nayak @ 2023-09-12 14:26 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hello Chenyu, On 9/12/2023 6:02 PM, Chen Yu wrote: > [..snip..] > >>> If I understand correctly, WF_SYNC is to let the wakee to woken up >>> on the waker's CPU, rather than the wakee's previous CPU, because >>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about >>> wakee's previous CPU. We can only restrict that other wakee does not >>> occupy the previous CPU, but do not enhance the possibility that >>> wake_affine_idle() chooses the previous CPU. >> >> Correct me if I'm wrong here, >> >> Say a short sleeper, is always woken up using WF_SYNC flag. When the >> task is dequeued, we mark the previous CPU where it ran as "cache-hot" >> and restrict any wakeup happening until the "cache_hot_timeout" is >> crossed. Let us assume a perfect world where the task wakes up before >> the "cache_hot_timeout" expires. Logically this CPU was reserved all >> this while for the short sleeper but since the wakeup bears WF_SYNC >> flag, the whole reservation is ignored and waker's LLC is explored. >> > > Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle() > forces the short sleeping wakee to be woken up on waker's CPU rather the > wakee's previous CPU, but wakee's previous has been marked as cache hot > for nothing? Precisely :) > >> Should the timeout be cleared if the wakeup decides to not target the >> previous CPU? 
(The default "sysctl_sched_migration_cost" is probably >> small enough to curb any side effect that could possibly show here but >> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to >> a larger value, the wakeup path might be affected where lot of idle >> targets are overlooked since the CPUs are marked cache-hot forr longer >> duration) >> >> Let me know what you think. >> > > This makes sense. In theory the above logic can be added in > select_idle_sibling(), if target CPU is chosen rather than > the previous CPU, the previous CPU's cache hot flag should be > cleared. > > But this might bring overhead. Because we need to grab the rq > lock and write to other CPU's rq, which could be costly. It > seems to be a trade-off of current implementation. I agree, it will not be pretty. Maybe the other way is to have a history of the type of wakeup the task experiences (similar to wakee_flips but for sync and non-syn wakeups) and only reserve the CPU if the task wakes up more via non-sync wakeups? Thinking out loud here. > On the other > hand, if the user sets the sysctl_sched_migration_cost to a quite > large value: > 1. Without SIS_CACHE, there is no task migration. But that is in the load balancing path. I think the wakeup path will still migrate the task. But I believe there might be very few cases where all CPUs are marked cache-hot and the SIS_UTIL will not bail out straight away as a result of high utilization. Probably a rare scenario. > 2. With SIS_CACHE enabled, all idle CPUs are cache hot and be skipped > in select_idle_cpu(), the wakee will be woken up locally. > It seems to be of the same effect, so there is no much impact > to wakeup behavior I suppose. > > [..snip..] > -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 14:26 ` K Prateek Nayak @ 2023-09-13 2:57 ` Chen Yu 2023-09-14 4:13 ` K Prateek Nayak 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-13 2:57 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > On 9/12/2023 6:02 PM, Chen Yu wrote: > > [..snip..] > > > >>> If I understand correctly, WF_SYNC is to let the wakee to woken up > >>> on the waker's CPU, rather than the wakee's previous CPU, because > >>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about > >>> wakee's previous CPU. We can only restrict that other wakee does not > >>> occupy the previous CPU, but do not enhance the possibility that > >>> wake_affine_idle() chooses the previous CPU. > >> > >> Correct me if I'm wrong here, > >> > >> Say a short sleeper, is always woken up using WF_SYNC flag. When the > >> task is dequeued, we mark the previous CPU where it ran as "cache-hot" > >> and restrict any wakeup happening until the "cache_hot_timeout" is > >> crossed. Let us assume a perfect world where the task wakes up before > >> the "cache_hot_timeout" expires. Logically this CPU was reserved all > >> this while for the short sleeper but since the wakeup bears WF_SYNC > >> flag, the whole reservation is ignored and waker's LLC is explored. > >> > > > > Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle() > > forces the short sleeping wakee to be woken up on waker's CPU rather the > > wakee's previous CPU, but wakee's previous has been marked as cache hot > > for nothing? 
> 
> Precisely :)
> 
> > 
> >> Should the timeout be cleared if the wakeup decides to not target the
> >> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >> small enough to curb any side effect that could possibly show here but
> >> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >> a larger value, the wakeup path might be affected where lot of idle
> >> targets are overlooked since the CPUs are marked cache-hot forr longer
> >> duration)
> >> 
> >> Let me know what you think.
> >> 
> > 
> > This makes sense. In theory the above logic can be added in
> > select_idle_sibling(), if target CPU is chosen rather than
> > the previous CPU, the previous CPU's cache hot flag should be
> > cleared.
> > 
> > But this might bring overhead. Because we need to grab the rq
> > lock and write to other CPU's rq, which could be costly. It
> > seems to be a trade-off of current implementation.
> 
> I agree, it will not be pretty. Maybe the other way is to have a
> history of the type of wakeup the task experiences (similar to
> wakee_flips but for sync and non-syn wakeups) and only reserve
> the CPU if the task wakes up more via non-sync wakeups? Thinking
> out loud here.
> 

This looks good, as it takes the task's attribute into account. Or
maybe something like this:

	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
	if (new_cpu != prev_cpu)
		p->burst_sleep_avg >>= 1;

So the duration of the reservation could be shrunk.

> > On the other
> > hand, if the user sets the sysctl_sched_migration_cost to a quite
> > large value:
> > 1. Without SIS_CACHE, there is no task migration.
> 
> But that is in the load balancing path. I think the wakeup path will
> still migrate the task.

OK, I see.

> But I believe there might be very few cases
> where all CPUs are marked cache-hot and the SIS_UTIL will not bail
> out straight away as a result of high utilization. Probably a rare
> scenario.
> 

Agree.

thanks,
Chenyu

^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-13 2:57 ` Chen Yu @ 2023-09-14 4:13 ` K Prateek Nayak 2023-09-14 11:01 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: K Prateek Nayak @ 2023-09-14 4:13 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hello Chenyu, On 9/13/2023 8:27 AM, Chen Yu wrote: > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote: >> Hello Chenyu, >> >> On 9/12/2023 6:02 PM, Chen Yu wrote: >>> [..snip..] >>> >>>>> If I understand correctly, WF_SYNC is to let the wakee to woken up >>>>> on the waker's CPU, rather than the wakee's previous CPU, because >>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about >>>>> wakee's previous CPU. We can only restrict that other wakee does not >>>>> occupy the previous CPU, but do not enhance the possibility that >>>>> wake_affine_idle() chooses the previous CPU. >>>> >>>> Correct me if I'm wrong here, >>>> >>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the >>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot" >>>> and restrict any wakeup happening until the "cache_hot_timeout" is >>>> crossed. Let us assume a perfect world where the task wakes up before >>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all >>>> this while for the short sleeper but since the wakeup bears WF_SYNC >>>> flag, the whole reservation is ignored and waker's LLC is explored. >>>> >>> >>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle() >>> forces the short sleeping wakee to be woken up on waker's CPU rather the >>> wakee's previous CPU, but wakee's previous has been marked as cache hot >>> for nothing? 
>> Precisely :)
>> 
>>> 
>>>> Should the timeout be cleared if the wakeup decides to not target the
>>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>>>> small enough to curb any side effect that could possibly show here but
>>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>>>> a larger value, the wakeup path might be affected where lot of idle
>>>> targets are overlooked since the CPUs are marked cache-hot forr longer
>>>> duration)
>>>> 
>>>> Let me know what you think.
>>>> 
>>> 
>>> This makes sense. In theory the above logic can be added in
>>> select_idle_sibling(), if target CPU is chosen rather than
>>> the previous CPU, the previous CPU's cache hot flag should be
>>> cleared.
>>> 
>>> But this might bring overhead. Because we need to grab the rq
>>> lock and write to other CPU's rq, which could be costly. It
>>> seems to be a trade-off of current implementation.
>> 
>> I agree, it will not be pretty. Maybe the other way is to have a
>> history of the type of wakeup the task experiences (similar to
>> wakee_flips but for sync and non-syn wakeups) and only reserve
>> the CPU if the task wakes up more via non-sync wakeups? Thinking
>> out loud here.
>> 
> 
> This looks good to consider the task's attribute, or maybe something
> like this:
> 
> 	new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> 	if (new_cpu != prev_cpu)
> 		p->burst_sleep_avg >>= 1;
> 
> So the duration of reservation could be shrinked.

That seems like a good approach.

Meanwhile, here are the results for the current series without any
modifications:

tl;dr

- There seems to be a noticeable increase in hackbench runtime with a
  single group but big gains beyond that. The regression could possibly
  be because of added searching, but let me do some digging to confirm
  that.

- Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
  and DeathStarBench (High Utilization).

- Other benchmarks are more or less perf neutral with the changes.
More information below:

o System information

- Dual socket 3rd Generation EPYC System (2 x 64C/128T)
- NPS1 mode (each socket is a NUMA node)
- Boost Enabled
- C2 disabled (MWAIT based C1 is still enabled)

o Kernel information

base      : tip:sched/core at commit b41bbb33cf75 ("Merge branch
            'sched/eevdf' into sched/core")
            + cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
            wakeup-preemption")

SIS_CACHE : base
            + this series as is

o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          base[pct imp](CV)      SIS_CACHE[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.89)     1.10 [-10.28]( 2.03)
 2-groups     1.00 [ -0.00]( 2.04)     0.98 [  1.57]( 2.04)
 4-groups     1.00 [ -0.00]( 2.38)     0.95 [  4.70]( 0.88)
 8-groups     1.00 [ -0.00]( 1.52)     0.93 [  7.18]( 0.76)
16-groups     1.00 [ -0.00]( 3.44)     0.90 [  9.76]( 1.04)

==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:       base[pct imp](CV)      SIS_CACHE[pct imp](CV)
    1     1.00 [  0.00]( 0.18)     0.98 [ -1.61]( 0.27)
    2     1.00 [  0.00]( 0.63)     0.98 [ -1.58]( 0.09)
    4     1.00 [  0.00]( 0.86)     0.99 [ -0.52]( 0.42)
    8     1.00 [  0.00]( 0.22)     0.98 [ -1.77]( 0.65)
   16     1.00 [  0.00]( 1.99)     1.00 [ -0.10]( 1.55)
   32     1.00 [  0.00]( 4.29)     0.98 [ -1.73]( 1.55)
   64     1.00 [  0.00]( 1.71)     0.97 [ -2.77]( 3.74)
  128     1.00 [  0.00]( 0.65)     1.00 [ -0.14]( 0.88)
  256     1.00 [  0.00]( 0.19)     0.97 [ -2.65]( 0.49)
  512     1.00 [  0.00]( 0.20)     0.99 [ -1.10]( 0.33)
 1024     1.00 [  0.00]( 0.29)     0.99 [ -0.70]( 0.16)

==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          base[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 4.32)     0.90 [ -9.82](10.72)
Scale     1.00 [  0.00]( 5.21)     1.01 [  0.59]( 1.83)
  Add     1.00 [  0.00]( 6.25)     0.99 [ -0.91]( 4.49)
Triad     1.00 [  0.00](10.74)     1.02 [  2.28]( 6.07)

==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          base[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 0.70)     0.98 [ -1.79]( 2.26)
Scale     1.00 [  0.00]( 6.55)     1.03 [  2.80]( 0.74)
  Add     1.00 [  0.00]( 6.53)     1.02 [  2.05]( 1.82)
Triad     1.00 [  0.00]( 6.66)     1.04 [  3.54]( 1.04)

==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:           base[pct imp](CV)      SIS_CACHE[pct imp](CV)
  1-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.55]( 0.49)
  2-clients     1.00 [  0.00]( 0.38)     0.99 [ -1.23]( 1.19)
  4-clients     1.00 [  0.00]( 0.72)     0.98 [ -1.91]( 1.21)
  8-clients     1.00 [  0.00]( 0.98)     0.98 [ -1.61]( 1.08)
 16-clients     1.00 [  0.00]( 0.70)     0.98 [ -1.80]( 1.04)
 32-clients     1.00 [  0.00]( 0.74)     0.98 [ -1.55]( 1.20)
 64-clients     1.00 [  0.00]( 2.24)     1.00 [ -0.04]( 2.77)
128-clients     1.00 [  0.00]( 1.72)     1.03 [  3.22]( 1.99)
256-clients     1.00 [  0.00]( 4.44)     0.99 [ -1.33]( 4.71)
512-clients     1.00 [  0.00](52.42)     0.98 [ -1.61](52.72)

==================================================================
Test          : schbench (old)
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:      base[pct imp](CV)      SIS_CACHE[pct imp](CV)
   1     1.00 [ -0.00]( 2.28)     0.96 [  4.00](15.68)
   2     1.00 [ -0.00]( 6.42)     1.00 [ -0.00](10.96)
   4     1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.61)
   8     1.00 [ -0.00](13.83)     1.08 [ -7.89]( 2.86)
  16     1.00 [ -0.00]( 4.37)     1.00 [ -0.00]( 2.13)
  32     1.00 [ -0.00]( 8.69)     0.95 [  4.94]( 2.73)
  64     1.00 [ -0.00]( 2.30)     1.05 [ -5.13]( 1.26)
 128     1.00 [ -0.00](12.12)     1.03 [ -3.41]( 5.08)
 256     1.00 [ -0.00](26.04)     0.91 [  8.88]( 2.59)
 512     1.00 [ -0.00]( 5.62)     0.97 [  3.32]( 0.37)

==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
Metric   variant                          base                SIS_CACHE
Hmean    unixbench-dhry2reg-1      41248390.97 (   0.00%)    41485503.82 (   0.57%)
Hmean    unixbench-dhry2reg-512  6239969914.15 (   0.00%)  6233919689.40 (  -0.10%)
Amean    unixbench-syscall-1        2968518.27 (   0.00%)     2841236.43 *   4.29%*
Amean    unixbench-syscall-512      7790656.20 (   0.00%)     7631558.00 *   2.04%*
Hmean    unixbench-pipe-1           2535689.01 (   0.00%)     2598208.16 *   2.47%*
Hmean    unixbench-pipe-512       361385055.25 (   0.00%)   368566373.76 *   1.99%*
Hmean    unixbench-spawn-1             4506.26 (   0.00%)        4551.67 (   1.01%)
Hmean    unixbench-spawn-512          69380.09 (   0.00%)       69264.30 (  -0.17%)
Hmean    unixbench-execl-1             3824.57 (   0.00%)        3822.67 (  -0.05%)
Hmean    unixbench-execl-512          12288.64 (   0.00%)       11728.12 (  -4.56%)

==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
base      : 309589.33 (var: 1.41%)
SIS_CACHE : 304931.33 (var: 1.29%) [diff: -1.50%]

==================================================================
Test          : DeathStarBench
Units         : Normalized Throughput, relative to base
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Pinning     base     SIS_CACHE
1 CCD       100%      99.18% [%diff: -0.82%]
2 CCD       100%      97.46% [%diff: -2.54%]
4 CCD       100%      97.22% [%diff: -2.78%]
8 CCD       100%      99.01% [%diff: -0.99%]

--

Regression observed could either be because of the larger search time
to find a non cache-hot idle CPU, or
perhaps just the larger search time in general adding to utilization and curbing the SIS_UTIL limits further. I'll go gather some stats to back my suspicion (particularly for hackbench). > > [..snip..] -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-14 4:13 ` K Prateek Nayak @ 2023-09-14 11:01 ` Chen Yu 2023-09-15 3:21 ` K Prateek Nayak 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-14 11:01 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hi Prateek, thanks for the test, On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > On 9/13/2023 8:27 AM, Chen Yu wrote: > > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote: > >> Hello Chenyu, > >> > >> On 9/12/2023 6:02 PM, Chen Yu wrote: > >>> [..snip..] > >>> > >>>>> If I understand correctly, WF_SYNC is to let the wakee to woken up > >>>>> on the waker's CPU, rather than the wakee's previous CPU, because > >>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about > >>>>> wakee's previous CPU. We can only restrict that other wakee does not > >>>>> occupy the previous CPU, but do not enhance the possibility that > >>>>> wake_affine_idle() chooses the previous CPU. > >>>> > >>>> Correct me if I'm wrong here, > >>>> > >>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the > >>>> task is dequeued, we mark the previous CPU where it ran as "cache-hot" > >>>> and restrict any wakeup happening until the "cache_hot_timeout" is > >>>> crossed. Let us assume a perfect world where the task wakes up before > >>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all > >>>> this while for the short sleeper but since the wakeup bears WF_SYNC > >>>> flag, the whole reservation is ignored and waker's LLC is explored. > >>>> > >>> > >>> Ah, I see your point. 
Do you mean that, because the waker has WF_SYNC set, wake_affine_idle()
> >>> forces the short-sleeping wakee to be woken up on the waker's CPU rather than
> >>> the wakee's previous CPU, but the wakee's previous CPU has been marked as
> >>> cache hot for nothing?
> >>
> >> Precisely :)
> >>
> >>>
> >>>> Should the timeout be cleared if the wakeup decides to not target the
> >>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >>>> small enough to curb any side effect that could possibly show here but
> >>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >>>> a larger value, the wakeup path might be affected where a lot of idle
> >>>> targets are overlooked since the CPUs are marked cache-hot for a longer
> >>>> duration)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>
> >>> This makes sense. In theory the above logic can be added in
> >>> select_idle_sibling(): if the target CPU is chosen rather than
> >>> the previous CPU, the previous CPU's cache-hot flag should be
> >>> cleared.
> >>>
> >>> But this might bring overhead, because we need to grab the rq
> >>> lock and write to another CPU's rq, which could be costly. It
> >>> seems to be a trade-off of the current implementation.
> >>
> >> I agree, it will not be pretty. Maybe the other way is to have a
> >> history of the type of wakeup the task experiences (similar to
> >> wakee_flips but for sync and non-sync wakeups) and only reserve
> >> the CPU if the task wakes up more via non-sync wakeups? Thinking
> >> out loud here.
> >>
> >
> > Considering the task's attribute looks good, or maybe something
> > like this:
> >
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > 	p->burst_sleep_avg >>= 1;
> > So the duration of the reservation could be shrunk.
>
> That seems like a good approach.
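To make the shrink-on-migration idea above concrete, here is a minimal user-space sketch. The struct and function names are illustrative assumptions, not the kernel implementation: halving `burst_sleep_avg` whenever the wakeup lands away from the previous CPU shrinks the future reservation window for tasks that migrate anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the proposal above: if the wakee ends up on a CPU other
 * than its previous one, halve its burst-sleep average so the next
 * reservation of the previous CPU is shorter. Illustrative only. */
struct task_model {
	uint64_t burst_sleep_avg;	/* ns; used as the reservation window */
	int prev_cpu;
};

static uint64_t reservation_after_wakeup(struct task_model *p, int new_cpu)
{
	if (new_cpu != p->prev_cpu)
		p->burst_sleep_avg >>= 1;	/* migrated: shrink the window */
	p->prev_cpu = new_cpu;
	return p->burst_sleep_avg;		/* window used at next dequeue */
}
```

A task that repeatedly fails to get its previous CPU quickly decays its average toward zero, so it stops reserving a CPU it never returns to.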
> > Meanwhile, here is result for the current series without any > modifications: > > tl;dr > > - There seems to be a noticeable increase in hackbench runtime with a > single group but big gains beyond that. The regression could possibly > be because of added searching but let me do some digging to confirm > that. Ah OK. May I have the command to run 1 group hackbench? > > - Small regressions (~2%) noticed in ycsb-mongodb (medium utilization) > and DeathStarBench (High Utilization) > > - Other benchmarks are more of less perf neutral with the changes. > > More information below: > > o System information > > - Dual socket 3rd Generation EPYC System (2 x 64C/128T) > - NPS1 mode (each socket is a NUMA node) > - Boost Enabled > - C2 disabled (MWAIT based C1 is still enabled) > > > o Kernel information > > base : tip:sched/core at commit b41bbb33cf75 ("Merge branch > 'sched/eevdf' into sched/core") > + cheery-pick commit 63304558ba5d ("sched/eevdf: Curb > wakeup-preemption") > > SIS_CACHE : base > + this series as is > > > o Benchmark results > > ================================================================== > Test : hackbench > Units : Normalized time in seconds > Interpretation: Lower is better > Statistic : AMean > ================================================================== > Case: base[pct imp](CV) SIS_CACHE[pct imp](CV) > 1-groups 1.00 [ -0.00]( 1.89) 1.10 [-10.28]( 2.03) > 2-groups 1.00 [ -0.00]( 2.04) 0.98 [ 1.57]( 2.04) > 4-groups 1.00 [ -0.00]( 2.38) 0.95 [ 4.70]( 0.88) > 8-groups 1.00 [ -0.00]( 1.52) 0.93 [ 7.18]( 0.76) > 16-groups 1.00 [ -0.00]( 3.44) 0.90 [ 9.76]( 1.04) > > > ================================================================== > Test : tbench > Units : Normalized throughput > Interpretation: Higher is better > Statistic : AMean > ================================================================== > Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV) > 1 1.00 [ 0.00]( 0.18) 0.98 [ -1.61]( 0.27) > 2 1.00 [ 0.00]( 0.63) 0.98 [ -1.58]( 
0.09) > 4 1.00 [ 0.00]( 0.86) 0.99 [ -0.52]( 0.42) > 8 1.00 [ 0.00]( 0.22) 0.98 [ -1.77]( 0.65) > 16 1.00 [ 0.00]( 1.99) 1.00 [ -0.10]( 1.55) > 32 1.00 [ 0.00]( 4.29) 0.98 [ -1.73]( 1.55) > 64 1.00 [ 0.00]( 1.71) 0.97 [ -2.77]( 3.74) > 128 1.00 [ 0.00]( 0.65) 1.00 [ -0.14]( 0.88) > 256 1.00 [ 0.00]( 0.19) 0.97 [ -2.65]( 0.49) > 512 1.00 [ 0.00]( 0.20) 0.99 [ -1.10]( 0.33) > 1024 1.00 [ 0.00]( 0.29) 0.99 [ -0.70]( 0.16) > > > ================================================================== > Test : stream-10 > Units : Normalized Bandwidth, MB/s > Interpretation: Higher is better > Statistic : HMean > ================================================================== > Test: base[pct imp](CV) SIS_CACHE[pct imp](CV) > Copy 1.00 [ 0.00]( 4.32) 0.90 [ -9.82](10.72) > Scale 1.00 [ 0.00]( 5.21) 1.01 [ 0.59]( 1.83) > Add 1.00 [ 0.00]( 6.25) 0.99 [ -0.91]( 4.49) > Triad 1.00 [ 0.00](10.74) 1.02 [ 2.28]( 6.07) > > > ================================================================== > Test : stream-100 > Units : Normalized Bandwidth, MB/s > Interpretation: Higher is better > Statistic : HMean > ================================================================== > Test: base[pct imp](CV) SIS_CACHE[pct imp](CV) > Copy 1.00 [ 0.00]( 0.70) 0.98 [ -1.79]( 2.26) > Scale 1.00 [ 0.00]( 6.55) 1.03 [ 2.80]( 0.74) > Add 1.00 [ 0.00]( 6.53) 1.02 [ 2.05]( 1.82) > Triad 1.00 [ 0.00]( 6.66) 1.04 [ 3.54]( 1.04) > > > ================================================================== > Test : netperf > Units : Normalized Througput > Interpretation: Higher is better > Statistic : AMean > ================================================================== > Clients: base[pct imp](CV) SIS_CACHE[pct imp](CV) > 1-clients 1.00 [ 0.00]( 0.46) 0.99 [ -0.55]( 0.49) > 2-clients 1.00 [ 0.00]( 0.38) 0.99 [ -1.23]( 1.19) > 4-clients 1.00 [ 0.00]( 0.72) 0.98 [ -1.91]( 1.21) > 8-clients 1.00 [ 0.00]( 0.98) 0.98 [ -1.61]( 1.08) > 16-clients 1.00 [ 0.00]( 0.70) 0.98 [ -1.80]( 1.04) > 32-clients 1.00 [ 0.00]( 
0.74) 0.98 [ -1.55]( 1.20) > 64-clients 1.00 [ 0.00]( 2.24) 1.00 [ -0.04]( 2.77) > 128-clients 1.00 [ 0.00]( 1.72) 1.03 [ 3.22]( 1.99) > 256-clients 1.00 [ 0.00]( 4.44) 0.99 [ -1.33]( 4.71) > 512-clients 1.00 [ 0.00](52.42) 0.98 [ -1.61](52.72) > > > ================================================================== > Test : schbench (old) > Units : Normalized 99th percentile latency in us > Interpretation: Lower is better > Statistic : Median > ================================================================== > #workers: base[pct imp](CV) SIS_CACHE[pct imp](CV) > 1 1.00 [ -0.00]( 2.28) 0.96 [ 4.00](15.68) > 2 1.00 [ -0.00]( 6.42) 1.00 [ -0.00](10.96) > 4 1.00 [ -0.00]( 3.77) 0.97 [ 3.33]( 7.61) > 8 1.00 [ -0.00](13.83) 1.08 [ -7.89]( 2.86) > 16 1.00 [ -0.00]( 4.37) 1.00 [ -0.00]( 2.13) > 32 1.00 [ -0.00]( 8.69) 0.95 [ 4.94]( 2.73) > 64 1.00 [ -0.00]( 2.30) 1.05 [ -5.13]( 1.26) > 128 1.00 [ -0.00](12.12) 1.03 [ -3.41]( 5.08) > 256 1.00 [ -0.00](26.04) 0.91 [ 8.88]( 2.59) > 512 1.00 [ -0.00]( 5.62) 0.97 [ 3.32]( 0.37) > > > ================================================================== > Test : Unixbench > Units : Various, Throughput > Interpretation: Higher is better > Statistic : AMean, Hmean (Specified) > ================================================================== > Metric variant base SIS_CACHE > Hmean unixbench-dhry2reg-1 41248390.97 ( 0.00%) 41485503.82 ( 0.57%) > Hmean unixbench-dhry2reg-512 6239969914.15 ( 0.00%) 6233919689.40 ( -0.10%) > Amean unixbench-syscall-1 2968518.27 ( 0.00%) 2841236.43 * 4.29%* > Amean unixbench-syscall-512 7790656.20 ( 0.00%) 7631558.00 * 2.04%* > Hmean unixbench-pipe-1 2535689.01 ( 0.00%) 2598208.16 * 2.47%* > Hmean unixbench-pipe-512 361385055.25 ( 0.00%) 368566373.76 * 1.99%* > Hmean unixbench-spawn-1 4506.26 ( 0.00%) 4551.67 ( 1.01%) > Hmean unixbench-spawn-512 69380.09 ( 0.00%) 69264.30 ( -0.17%) > Hmean unixbench-execl-1 3824.57 ( 0.00%) 3822.67 ( -0.05%) > Hmean unixbench-execl-512 12288.64 ( 0.00%) 11728.12 ( 
-4.56%)
>
>
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> base      : 309589.33 (var: 1.41%)
> SIS_CACHE : 304931.33 (var: 1.29%) [diff: -1.50%]
>
>
> ==================================================================
> Test          : DeathStarBench
> Units         : Normalized Throughput, relative to base
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Pinning   base   SIS_CACHE
> 1 CCD     100%   99.18% [%diff: -0.82%]
> 2 CCD     100%   97.46% [%diff: -2.54%]
> 4 CCD     100%   97.22% [%diff: -2.78%]
> 8 CCD     100%   99.01% [%diff: -0.99%]
>
> --
>
> Regression observed could either be because of the larger search time to
> find a non cache-hot idle CPU, or perhaps just the larger search time in
> general adding to utilization and curbing the SIS_UTIL limits further.

Yeah, that is possible. And you also mentioned that we should consider
the cache-hot idle CPU if we cannot find any cache-cold idle CPUs;
that might be a better choice than forcibly putting the wakee on the
current CPU, which causes task stacking.

> I'll go gather some stats to back my suspicion (particularly for
> hackbench).
>

Thanks!
Chenyu
^ permalink raw reply [flat|nested] 34+ messages in thread
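The fallback mentioned above — prefer a cache-cold idle CPU, but fall back to a reserved ("cache-hot") idle CPU instead of stacking the wakee on the current CPU — could be sketched like this user-space model (the struct and function names are assumptions for illustration, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model: scan for an idle CPU that is outside any
 * reservation window first, but remember the first reserved idle CPU
 * as a fallback so the wakee never has to stack on a busy CPU. */
struct cpu_model {
	bool idle;
	uint64_t cache_hot_timeout;	/* ns, absolute reservation deadline */
};

static int pick_idle_cpu(const struct cpu_model *cpus, int nr, uint64_t now)
{
	int backup = -1;

	for (int i = 0; i < nr; i++) {
		if (!cpus[i].idle)
			continue;
		if (now >= cpus[i].cache_hot_timeout)
			return i;	/* cache-cold idle CPU: best choice */
		if (backup < 0)
			backup = i;	/* reserved idle CPU: fallback only */
	}
	return backup;			/* -1 only if no idle CPU exists */
}
```

This preserves work conservation: a reserved CPU is deprioritized, never hidden outright.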
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-14 11:01 ` Chen Yu @ 2023-09-15 3:21 ` K Prateek Nayak 0 siblings, 0 replies; 34+ messages in thread From: K Prateek Nayak @ 2023-09-15 3:21 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli Hello Chenyu, Thank you for taking a look at the report. On 9/14/2023 4:31 PM, Chen Yu wrote: > Hi Prateek, > > thanks for the test, > > On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote: >> [..snip..] >> >> Meanwhile, here is result for the current series without any >> modifications: >> >> tl;dr >> >> - There seems to be a noticeable increase in hackbench runtime with a >> single group but big gains beyond that. The regression could possibly >> be because of added searching but let me do some digging to confirm >> that. > > Ah OK. May I have the command to run 1 group hackbench? This is actually perf bench sched messaging. The cmdline from the runner is: $ perf bench sched messaging -t -p -l 100000 -g <# of groups> > [..snip..] -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 10:19 ` Chen Yu 2023-09-12 3:05 ` K Prateek Nayak @ 2023-09-12 9:39 ` Mike Galbraith 2023-09-12 14:51 ` Chen Yu 1 sibling, 1 reply; 34+ messages in thread From: Mike Galbraith @ 2023-09-12 9:39 UTC (permalink / raw) To: Chen Yu, K Prateek Nayak Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote: > > > Speaking of cache-hot idle CPU, is netperf actually more happy with > > piling on current CPU? > > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to > put the waker and wakee together. Hm, seems there's at least one shared L2 case where that's untrue by more than a tiny margin, which surprised me rather a lot. For grins, I tested netperf on my dinky rpi4b, and while its RR numbers seem kinda odd, they're also seemingly repeatable (ergo showing them). I measured a very modest cross-core win on a shared L2 Intel CPU some years ago (when Q6600 was shiny/new) but nothing close to these deltas. Makes me wonder what (a tad beefier) Bulldog RR numbers look like. 
root@rpi4:~# ONLY=TCP_RR netperf.sh TCP_RR-1 unbound Avg: 29611 Sum: 29611 TCP_RR-1 stacked Avg: 22540 Sum: 22540 TCP_RR-1 cross-core Avg: 30181 Sum: 30181 root@rpi4:~# netperf.sh TCP_SENDFILE-1 unbound Avg: 15572 Sum: 15572 TCP_SENDFILE-1 stacked Avg: 11533 Sum: 11533 TCP_SENDFILE-1 cross-core Avg: 15751 Sum: 15751 TCP_STREAM-1 unbound Avg: 6331 Sum: 6331 TCP_STREAM-1 stacked Avg: 6031 Sum: 6031 TCP_STREAM-1 cross-core Avg: 6211 Sum: 6211 TCP_MAERTS-1 unbound Avg: 6306 Sum: 6306 TCP_MAERTS-1 stacked Avg: 6094 Sum: 6094 TCP_MAERTS-1 cross-core Avg: 9393 Sum: 9393 UDP_STREAM-1 unbound Avg: 22277 Sum: 22277 UDP_STREAM-1 stacked Avg: 18844 Sum: 18844 UDP_STREAM-1 cross-core Avg: 24749 Sum: 24749 TCP_RR-1 unbound Avg: 29674 Sum: 29674 TCP_RR-1 stacked Avg: 22267 Sum: 22267 TCP_RR-1 cross-core Avg: 30237 Sum: 30237 UDP_RR-1 unbound Avg: 36189 Sum: 36189 UDP_RR-1 stacked Avg: 27129 Sum: 27129 UDP_RR-1 cross-core Avg: 37033 Sum: 37033 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12  9:39 ` Mike Galbraith
@ 2023-09-12 14:51 ` Chen Yu
  0 siblings, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-12 14:51 UTC (permalink / raw)
To: Mike Galbraith
Cc: K Prateek Nayak, Tim Chen, Aaron Lu, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Valentin Schneider, Gautham R . Shenoy, linux-kernel,
    Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
    Juri Lelli

Hi Mike,

thanks for taking a look,

On 2023-09-12 at 11:39:55 +0200, Mike Galbraith wrote:
> On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
> >
> > > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > > piling on current CPU?
> >
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
>
> Hm, seems there's at least one shared L2 case where that's untrue by
> more than a tiny margin, which surprised me rather a lot.
>

Yes, task stacking in theory goes against the work conservation of the
scheduler, and how much it pays off depends on how strong the resource
(L1/L2 cache, DSB) locality is, which is workload and hardware specific.

> For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
> seem kinda odd, they're also seemingly repeatable (ergo showing them).
> I measured a very modest cross-core win on a shared L2 Intel CPU some
> years ago (when Q6600 was shiny/new) but nothing close to these deltas.
>

This is interesting. I have a Jacobsville which also has shared L2;
I'll run some tests to check the difference between stacking and
spreading tasks on that platform. But I guess that is another topic,
because the current patch avoids stacking tasks.

thanks,
Chenyu

> Makes me wonder what (a tad beefier) Bulldog RR numbers look like.
> > root@rpi4:~# ONLY=TCP_RR netperf.sh > TCP_RR-1 unbound Avg: 29611 Sum: 29611 > TCP_RR-1 stacked Avg: 22540 Sum: 22540 > TCP_RR-1 cross-core Avg: 30181 Sum: 30181 > > root@rpi4:~# netperf.sh > TCP_SENDFILE-1 unbound Avg: 15572 Sum: 15572 > TCP_SENDFILE-1 stacked Avg: 11533 Sum: 11533 > TCP_SENDFILE-1 cross-core Avg: 15751 Sum: 15751 > > TCP_STREAM-1 unbound Avg: 6331 Sum: 6331 > TCP_STREAM-1 stacked Avg: 6031 Sum: 6031 > TCP_STREAM-1 cross-core Avg: 6211 Sum: 6211 > > TCP_MAERTS-1 unbound Avg: 6306 Sum: 6306 > TCP_MAERTS-1 stacked Avg: 6094 Sum: 6094 > TCP_MAERTS-1 cross-core Avg: 9393 Sum: 9393 > > UDP_STREAM-1 unbound Avg: 22277 Sum: 22277 > UDP_STREAM-1 stacked Avg: 18844 Sum: 18844 > UDP_STREAM-1 cross-core Avg: 24749 Sum: 24749 > > TCP_RR-1 unbound Avg: 29674 Sum: 29674 > TCP_RR-1 stacked Avg: 22267 Sum: 22267 > TCP_RR-1 cross-core Avg: 30237 Sum: 30237 > > UDP_RR-1 unbound Avg: 36189 Sum: 36189 > UDP_RR-1 stacked Avg: 27129 Sum: 27129 > UDP_RR-1 cross-core Avg: 37033 Sum: 37033 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 8:29 ` K Prateek Nayak 2023-09-11 10:19 ` Chen Yu @ 2023-09-12 6:32 ` Mike Galbraith 1 sibling, 0 replies; 34+ messages in thread From: Mike Galbraith @ 2023-09-12 6:32 UTC (permalink / raw) To: K Prateek Nayak, Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli On Mon, 2023-09-11 at 13:59 +0530, K Prateek Nayak wrote: > > Speaking of cache-hot idle CPU, is netperf actually more happy with > piling on current CPU? Some tests would be happier, others not at all, some numbers below. I doubt much in the real world can perform better stacked, to be a win, stacked task overlap induced service latency and utilization loss has to be less than cache population cost of an idle CPU, something that modern CPUs have become darn good at, making for a high bar. > I ask this because the logic seems to be > reserving the previous CPU for a task that dislikes migration but I > do not see anything in the wake_affine_idle() path that would make the > short sleeper proactively choose the previous CPU when the wakeup is > marked with the WF_SYNC flag. Let me know if I'm missing something? If select_idle_sibling() didn't intervene, the wake affine logic would indeed routinely step all over working sets, and at one time briefly did so due to a silly bug. (see kernel/sched/fair.c.today:7292) The sync hint stems from the bad old days of SMP when cross-cpu latency was horrid, and has lost much of its value, but its bias toward the waker CPU still helps reduce man-in-the-middle latency in a busy box, which can do even more damage than that done by stacking of not really synchronous tasks that can be seen below. 
The TCP/UDP_RR tests are very close to synchronous, and the numbers reflect that, stacking is unbeatable for them [1], but for the other tests, hopefully doing something a bit more realistic than tiny ball ping-pong, stacking is a demonstrable loser. Not super carefully run script output: homer:/root # netperf.sh TCP_SENDFILE-1 unbound Avg: 87889 Sum: 87889 TCP_SENDFILE-1 stacked Avg: 62885 Sum: 62885 TCP_SENDFILE-1 cross-smt Avg: 58887 Sum: 58887 TCP_SENDFILE-1 cross-core Avg: 90673 Sum: 90673 TCP_STREAM-1 unbound Avg: 71858 Sum: 71858 TCP_STREAM-1 stacked Avg: 58883 Sum: 58883 TCP_STREAM-1 cross-smt Avg: 49345 Sum: 49345 TCP_STREAM-1 cross-core Avg: 72346 Sum: 72346 TCP_MAERTS-1 unbound Avg: 73890 Sum: 73890 TCP_MAERTS-1 stacked Avg: 60682 Sum: 60682 TCP_MAERTS-1 cross-smt Avg: 49868 Sum: 49868 TCP_MAERTS-1 cross-core Avg: 73343 Sum: 73343 UDP_STREAM-1 unbound Avg: 99442 Sum: 99442 UDP_STREAM-1 stacked Avg: 85319 Sum: 85319 UDP_STREAM-1 cross-smt Avg: 63239 Sum: 63239 UDP_STREAM-1 cross-core Avg: 99102 Sum: 99102 TCP_RR-1 unbound Avg: 200833 Sum: 200833 TCP_RR-1 stacked Avg: 243733 Sum: 243733 TCP_RR-1 cross-smt Avg: 138507 Sum: 138507 TCP_RR-1 cross-core Avg: 210404 Sum: 210404 UDP_RR-1 unbound Avg: 252575 Sum: 252575 UDP_RR-1 stacked Avg: 273081 Sum: 273081 UDP_RR-1 cross-smt Avg: 168448 Sum: 168448 UDP_RR-1 cross-core Avg: 264124 Sum: 264124 1. nearly unbeatable - shared L2 CPUS can by a wee bit. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
  2023-09-11  7:26 ` Aaron Lu
  2023-09-11  8:29 ` K Prateek Nayak
@ 2023-09-11 15:26 ` Mathieu Desnoyers
  2023-09-11 15:43   ` Mathieu Desnoyers
  2023-09-20 12:34   ` Chen Yu
  2023-09-14  5:30 ` K Prateek Nayak
  3 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2023-09-11 15:26 UTC (permalink / raw)
To: Chen Yu, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli
Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
    K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/10/23 22:50, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
>
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakees are not allowed to choose this CPU in select_idle_cpu().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
>
> This does not break the work conservation of the scheduler,
> because the wakee will still try its best to find an idle CPU.
> The difference is that different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
> > Baseline: tip/sched/core, on top of: > Commit 3f4feb58037a ("sched: Misc cleanups") > > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets. > cpufreq governor is performance, turbo boost is disabled, C-states deeper > than C1 are disabled, Numa balancing is disabled. > > netperf > ======= > case load baseline(std%) compare%( std%) > UDP_RR 56-threads 1.00 ( 1.34) +1.05 ( 1.04) > UDP_RR 112-threads 1.00 ( 7.94) -0.68 ( 14.42) > UDP_RR 168-threads 1.00 ( 33.17) +49.63 ( 5.96) > UDP_RR 224-threads 1.00 ( 13.52) +122.53 ( 18.50) > > Noticeable improvements of netperf is observed in 168 and 224 threads > cases. > > hackbench > ========= > case load baseline(std%) compare%( std%) > process-pipe 1-groups 1.00 ( 5.61) -4.69 ( 1.48) > process-pipe 2-groups 1.00 ( 8.74) -0.24 ( 3.10) > process-pipe 4-groups 1.00 ( 3.52) +1.61 ( 4.41) > process-sockets 1-groups 1.00 ( 4.73) +2.32 ( 0.95) > process-sockets 2-groups 1.00 ( 1.27) -3.29 ( 0.97) > process-sockets 4-groups 1.00 ( 0.09) +0.24 ( 0.09) > threads-pipe 1-groups 1.00 ( 10.44) -5.88 ( 1.49) > threads-pipe 2-groups 1.00 ( 19.15) +5.31 ( 12.90) > threads-pipe 4-groups 1.00 ( 1.74) -5.01 ( 6.44) > threads-sockets 1-groups 1.00 ( 1.58) -1.79 ( 0.43) > threads-sockets 2-groups 1.00 ( 1.19) -8.43 ( 6.91) > threads-sockets 4-groups 1.00 ( 0.10) -0.09 ( 0.07) > > schbench(old) > ======== > case load baseline(std%) compare%( std%) > normal 1-mthreads 1.00 ( 0.63) +1.28 ( 0.37) > normal 2-mthreads 1.00 ( 8.33) +1.58 ( 2.83) > normal 4-mthreads 1.00 ( 2.48) -2.98 ( 3.28) > normal 8-mthreads 1.00 ( 3.97) +5.01 ( 1.28) > > No much difference is observed in hackbench/schbench, due to the > run-to-run variance. 
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1 > Suggested-by: Tim Chen <tim.c.chen@intel.com> > Signed-off-by: Chen Yu <yu.c.chen@intel.com> > --- > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > kernel/sched/features.h | 1 + > kernel/sched/sched.h | 1 + > 3 files changed, 29 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e20f50726ab8..fe3b760c9654 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > hrtick_update(rq); > now = sched_clock_cpu(cpu_of(rq)); > p->se.prev_sleep_time = task_sleep ? now : 0; > +#ifdef CONFIG_SMP > + /* > + * If this rq will become idle, and dequeued task is > + * a short sleeping one, check if we can reserve > + * this idle CPU for that task for a short while. > + * During this reservation period, other wakees will > + * skip this 'idle' CPU in select_idle_cpu(), and this > + * short sleeping task can pick its previous CPU in > + * select_idle_sibling(), which brings better cache > + * locality. > + */ > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > + rq->cache_hot_timeout = now + p->se.sleep_avg; This is really cool! There is one scenario that worries me with this approach: workloads that sleep for a long time and then have short blocked periods. Those bursts will likely bring the average to values too high to stay below sysctl_sched_migration_cost. I wonder if changing the code above for the following would help ? 
if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg) rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg); For tasks that have a large sleep_avg, it would activate this rq "appear as not idle for rq selection" scheme for a window of sysctl_sched_migration_cost. If the sleep ends up being a long one, preventing other tasks from being migrated to this rq for a tiny window should not matter performance-wise. I would expect that it could help workloads that come in bursts. Thanks, Mathieu > +#endif > } > > #ifdef CONFIG_SMP > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p > static inline int __select_idle_cpu(int cpu, struct task_struct *p) > { > if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && > - sched_cpu_cookie_match(cpu_rq(cpu), p)) > + sched_cpu_cookie_match(cpu_rq(cpu), p)) { > + if (sched_feat(SIS_CACHE) && > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout) > + return -1; > + > return cpu; > + } > > return -1; > } > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu > int cpu; > > for_each_cpu(cpu, cpu_smt_mask(core)) { > - if (!available_idle_cpu(cpu)) { > + bool cache_hot = sched_feat(SIS_CACHE) ? 
> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false; > + > + if (!available_idle_cpu(cpu) || cache_hot) { > idle = false; > if (*idle_cpu == -1) { > - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) { > + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) && > + !cache_hot) { > *idle_cpu = cpu; > break; > } > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index f770168230ae..04ed9fcf67f8 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true) > */ > SCHED_FEAT(SIS_PROP, false) > SCHED_FEAT(SIS_UTIL, true) > +SCHED_FEAT(SIS_CACHE, true) > > /* > * Issue a WARN when we do multiple update_rq_clock() calls > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 62013c49c451..7a2c12c3b6d0 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1078,6 +1078,7 @@ struct rq { > #endif > u64 idle_stamp; > u64 avg_idle; > + u64 cache_hot_timeout; > > unsigned long wake_stamp; > u64 wake_avg_idle; -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 34+ messages in thread
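Viewed in isolation, the reservation scheme in the patch above amounts to the following user-space model. The 0.5 ms constant matches the default `sysctl_sched_migration_cost`; all names here are illustrative, not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space model of the SIS_CACHE reservation (illustrative only):
 * dequeue stamps the rq with now + sleep_avg when a short sleeper leaves
 * the CPU idle, and the idle-CPU scan treats the CPU as unusable until
 * the stamp expires. */
struct rq_model {
	int nr_running;
	uint64_t cache_hot_timeout;	/* ns, absolute */
};

#define SCHED_MIGRATION_COST	500000ULL	/* default 0.5 ms, in ns */

/* dequeue path: reserve the soon-to-be-idle CPU for a short sleeper */
static void dequeue(struct rq_model *rq, uint64_t now, uint64_t sleep_avg)
{
	rq->nr_running = 0;
	if (sleep_avg && sleep_avg < SCHED_MIGRATION_COST)
		rq->cache_hot_timeout = now + sleep_avg;
}

/* select_idle_cpu() path: an idle CPU still inside its reservation
 * window is skipped (returns false = not usable by other wakees) */
static bool idle_cpu_usable(const struct rq_model *rq, uint64_t now)
{
	return rq->nr_running == 0 && now >= rq->cache_hot_timeout;
}
```

Note that only short sleepers (average below the migration cost) ever arm the reservation, so a long sleeper never hides its CPU from the scan.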
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 15:26 ` Mathieu Desnoyers @ 2023-09-11 15:43 ` Mathieu Desnoyers 2023-09-12 11:53 ` Chen Yu 2023-09-20 12:34 ` Chen Yu 1 sibling, 1 reply; 34+ messages in thread From: Mathieu Desnoyers @ 2023-09-11 15:43 UTC (permalink / raw) To: Chen Yu, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel On 9/11/23 11:26, Mathieu Desnoyers wrote: > On 9/10/23 22:50, Chen Yu wrote: [...] >> --- >> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- >> kernel/sched/features.h | 1 + >> kernel/sched/sched.h | 1 + >> 3 files changed, 29 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index e20f50726ab8..fe3b760c9654 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, >> struct task_struct *p, int flags) >> hrtick_update(rq); >> now = sched_clock_cpu(cpu_of(rq)); >> p->se.prev_sleep_time = task_sleep ? now : 0; >> +#ifdef CONFIG_SMP >> + /* >> + * If this rq will become idle, and dequeued task is >> + * a short sleeping one, check if we can reserve >> + * this idle CPU for that task for a short while. >> + * During this reservation period, other wakees will >> + * skip this 'idle' CPU in select_idle_cpu(), and this >> + * short sleeping task can pick its previous CPU in >> + * select_idle_sibling(), which brings better cache >> + * locality. >> + */ >> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && >> + p->se.sleep_avg && p->se.sleep_avg < >> sysctl_sched_migration_cost) >> + rq->cache_hot_timeout = now + p->se.sleep_avg; > > This is really cool! 
>
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> Those bursts will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
>
> I wonder if changing the code above to the following would help?
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>     p->se.sleep_avg)
> 	rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> 					  p->se.sleep_avg);
>
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.

There is perhaps a better way to handle bursts:

When calculating the sleep_avg, we actually only really care about
the sleep time for short bursts, so we could use the
sysctl_sched_migration_cost to select which of the sleeps should be
taken into account in the avg.

We could rename the field "sleep_avg" to "burst_sleep_avg", and have:

u64 now = sched_clock_cpu(task_cpu(p));

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
    now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
	update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Then we can have this code in dequeue_task_fair:

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
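A user-space sketch of the burst-filtered average proposed here. The `/ 8` step mirrors the kernel's `update_avg()` EWMA; the constant and helper names are illustrative assumptions, not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Only sleeps shorter than the migration cost feed the average, so a
 * few very long sleeps cannot inflate it and defeat the reservation. */
#define SCHED_MIGRATION_COST	500000ULL	/* ns */

/* Mirrors the kernel's update_avg(): new = old + (sample - old) / 8 */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

/* enqueue-wakeup path: record only short ("burst") sleeps */
static void record_sleep(uint64_t *burst_sleep_avg, uint64_t last_sleep,
			 uint64_t now)
{
	if (last_sleep && now > last_sleep &&
	    now - last_sleep < SCHED_MIGRATION_COST)
		update_avg(burst_sleep_avg, now - last_sleep);
}
```

With this filter, a task that mostly sleeps for milliseconds but occasionally blocks for tens of microseconds still keeps a small `burst_sleep_avg`, and thus still earns a short reservation of its previous CPU.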
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 15:43 ` Mathieu Desnoyers @ 2023-09-12 11:53 ` Chen Yu 2023-09-12 14:06 ` Mathieu Desnoyers 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-12 11:53 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel Hi Mathieu, thanks for the review, On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote: > On 9/11/23 11:26, Mathieu Desnoyers wrote: > > On 9/10/23 22:50, Chen Yu wrote: > [...] > > > --- > > > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > > > kernel/sched/features.h | 1 + > > > kernel/sched/sched.h | 1 + > > > 3 files changed, 29 insertions(+), 3 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index e20f50726ab8..fe3b760c9654 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, > > > struct task_struct *p, int flags) > > > hrtick_update(rq); > > > now = sched_clock_cpu(cpu_of(rq)); > > > p->se.prev_sleep_time = task_sleep ? now : 0; > > > +#ifdef CONFIG_SMP > > > + /* > > > + * If this rq will become idle, and dequeued task is > > > + * a short sleeping one, check if we can reserve > > > + * this idle CPU for that task for a short while. > > > + * During this reservation period, other wakees will > > > + * skip this 'idle' CPU in select_idle_cpu(), and this > > > + * short sleeping task can pick its previous CPU in > > > + * select_idle_sibling(), which brings better cache > > > + * locality. 
> > > + */ > > > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > > + p->se.sleep_avg && p->se.sleep_avg < > > > sysctl_sched_migration_cost) > > > + rq->cache_hot_timeout = now + p->se.sleep_avg; > > > > This is really cool! > > > > There is one scenario that worries me with this approach: workloads > > that sleep for a long time and then have short blocked periods. > > Those bursts will likely bring the average to values too high > > to stay below sysctl_sched_migration_cost. > > > > I wonder if changing the code above for the following would help ? > > > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > p->se.sleep_avg) > > rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, > > p->se.sleep_avg); > > > > For tasks that have a large sleep_avg, it would activate this rq > > "appear as not idle for rq selection" scheme for a window of > > sysctl_sched_migration_cost. If the sleep ends up being a long one, > > preventing other tasks from being migrated to this rq for a tiny > > window should not matter performance-wise. I would expect that it > > could help workloads that come in bursts. > > There is perhaps a better way to handle bursts: > > When calculating the sleep_avg, we actually only really care about > the sleep time for short bursts, so we could use the sysctl_sched_migration_cost > to select which of the sleeps should be taken into account in the avg. > > We could rename the field "sleep_avg" to "burst_sleep_avg", and have: > > u64 now = sched_clock_cpu(task_cpu(p)); > > if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) && > now > last_sleep && now - last_sleep < sysctl_sched_migration_cost) > update_avg(&p->se.burst_sleep_avg, now - last_sleep); > > Then we can have this code is dequeue_task_fair: > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg) > rq->cache_hot_timeout = now + p->se.burst_sleep_avg; > > Thoughts ? 
>

This looks reasonable; it would be more fine-grained to monitor only the
short sleep durations of that task. Let me try this approach to see if
there is any difference.

thanks,
Chenyu

> Thanks,
>
> Mathieu
>
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 11:53 ` Chen Yu @ 2023-09-12 14:06 ` Mathieu Desnoyers 2023-09-12 14:14 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: Mathieu Desnoyers @ 2023-09-12 14:06 UTC (permalink / raw) To: Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel On 9/12/23 07:53, Chen Yu wrote: > Hi Mathieu, > > thanks for the review, > > On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote: >> On 9/11/23 11:26, Mathieu Desnoyers wrote: >>> On 9/10/23 22:50, Chen Yu wrote: >> [...] >>>> --- >>>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- >>>> kernel/sched/features.h | 1 + >>>> kernel/sched/sched.h | 1 + >>>> 3 files changed, 29 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index e20f50726ab8..fe3b760c9654 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, >>>> struct task_struct *p, int flags) >>>> hrtick_update(rq); >>>> now = sched_clock_cpu(cpu_of(rq)); >>>> p->se.prev_sleep_time = task_sleep ? now : 0; >>>> +#ifdef CONFIG_SMP >>>> + /* >>>> + * If this rq will become idle, and dequeued task is >>>> + * a short sleeping one, check if we can reserve >>>> + * this idle CPU for that task for a short while. >>>> + * During this reservation period, other wakees will >>>> + * skip this 'idle' CPU in select_idle_cpu(), and this >>>> + * short sleeping task can pick its previous CPU in >>>> + * select_idle_sibling(), which brings better cache >>>> + * locality. 
>>>> + */ >>>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && >>>> + p->se.sleep_avg && p->se.sleep_avg < >>>> sysctl_sched_migration_cost) >>>> + rq->cache_hot_timeout = now + p->se.sleep_avg; >>> >>> This is really cool! >>> >>> There is one scenario that worries me with this approach: workloads >>> that sleep for a long time and then have short blocked periods. >>> Those bursts will likely bring the average to values too high >>> to stay below sysctl_sched_migration_cost. >>> >>> I wonder if changing the code above for the following would help ? >>> >>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && >>> p->se.sleep_avg) >>> rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, >>> p->se.sleep_avg); >>> >>> For tasks that have a large sleep_avg, it would activate this rq >>> "appear as not idle for rq selection" scheme for a window of >>> sysctl_sched_migration_cost. If the sleep ends up being a long one, >>> preventing other tasks from being migrated to this rq for a tiny >>> window should not matter performance-wise. I would expect that it >>> could help workloads that come in bursts. >> >> There is perhaps a better way to handle bursts: >> >> When calculating the sleep_avg, we actually only really care about >> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost >> to select which of the sleeps should be taken into account in the avg. >> >> We could rename the field "sleep_avg" to "burst_sleep_avg", and have: >> >> u64 now = sched_clock_cpu(task_cpu(p)); >> >> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) && >> now > last_sleep && now - last_sleep < sysctl_sched_migration_cost) >> update_avg(&p->se.burst_sleep_avg, now - last_sleep); >> >> Then we can have this code is dequeue_task_fair: >> >> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.busrt_sleep_avg) >> rq->cache_hot_timeout = now + p->se.burst_sleep_avg; >> >> Thoughts ? 
>> > > This looks reasonable, it would be fine grained to only monitor the short sleep duration > of that task. Let me try this approach to see if there is any difference. > One more tweak: given that more than one task can update the cache_hot_timeout forward one after another, and given that some tasks have larger burst_sleep_avg values than others, I suspect we want to keep the forward movement monotonic with something like: if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg && rq->cache_hot_timeout < now + p->se.burst_sleep_avg) rq->cache_hot_timeout = now + p->se.burst_sleep_avg; Thanks, Mathieu > thanks, > Chenyu > >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 34+ messages in thread
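The monotonicity tweak proposed above can be modeled the same way — a hypothetical userspace sketch (not the kernel code) of a per-rq reservation that may only move forward, never backward:

```c
#include <stdint.h>

/* Toy model of the reservation update: a dequeue may only push the rq's
 * cache_hot_timeout forward, so a later dequeue of a task with a small
 * burst_sleep_avg cannot shorten a reservation already made for a task
 * with a larger one. */
void reserve_prev_cpu(uint64_t *cache_hot_timeout, uint64_t now,
		      uint64_t burst_sleep_avg)
{
	if (burst_sleep_avg && *cache_hot_timeout < now + burst_sleep_avg)
		*cache_hot_timeout = now + burst_sleep_avg;
}
```

The design choice is exactly the one debated below: without the `<` comparison, a second short-sleeper dequeued on the same rq would move the timeout backwards and drop the first task's reservation early.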
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 14:06 ` Mathieu Desnoyers @ 2023-09-12 14:14 ` Chen Yu 2023-09-12 15:18 ` Mathieu Desnoyers 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-12 14:14 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote: > On 9/12/23 07:53, Chen Yu wrote: > > Hi Mathieu, > > > > thanks for the review, > > > > On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote: > > > On 9/11/23 11:26, Mathieu Desnoyers wrote: > > > > On 9/10/23 22:50, Chen Yu wrote: > > > [...] > > > > > --- > > > > > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > > > > > kernel/sched/features.h | 1 + > > > > > kernel/sched/sched.h | 1 + > > > > > 3 files changed, 29 insertions(+), 3 deletions(-) > > > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > > index e20f50726ab8..fe3b760c9654 100644 > > > > > --- a/kernel/sched/fair.c > > > > > +++ b/kernel/sched/fair.c > > > > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, > > > > > struct task_struct *p, int flags) > > > > > hrtick_update(rq); > > > > > now = sched_clock_cpu(cpu_of(rq)); > > > > > p->se.prev_sleep_time = task_sleep ? now : 0; > > > > > +#ifdef CONFIG_SMP > > > > > + /* > > > > > + * If this rq will become idle, and dequeued task is > > > > > + * a short sleeping one, check if we can reserve > > > > > + * this idle CPU for that task for a short while. 
> > > > > + * During this reservation period, other wakees will > > > > > + * skip this 'idle' CPU in select_idle_cpu(), and this > > > > > + * short sleeping task can pick its previous CPU in > > > > > + * select_idle_sibling(), which brings better cache > > > > > + * locality. > > > > > + */ > > > > > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > > > > + p->se.sleep_avg && p->se.sleep_avg < > > > > > sysctl_sched_migration_cost) > > > > > + rq->cache_hot_timeout = now + p->se.sleep_avg; > > > > > > > > This is really cool! > > > > > > > > There is one scenario that worries me with this approach: workloads > > > > that sleep for a long time and then have short blocked periods. > > > > Those bursts will likely bring the average to values too high > > > > to stay below sysctl_sched_migration_cost. > > > > > > > > I wonder if changing the code above for the following would help ? > > > > > > > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > > > p->se.sleep_avg) > > > > rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, > > > > p->se.sleep_avg); > > > > > > > > For tasks that have a large sleep_avg, it would activate this rq > > > > "appear as not idle for rq selection" scheme for a window of > > > > sysctl_sched_migration_cost. If the sleep ends up being a long one, > > > > preventing other tasks from being migrated to this rq for a tiny > > > > window should not matter performance-wise. I would expect that it > > > > could help workloads that come in bursts. > > > > > > There is perhaps a better way to handle bursts: > > > > > > When calculating the sleep_avg, we actually only really care about > > > the sleep time for short bursts, so we could use the sysctl_sched_migration_cost > > > to select which of the sleeps should be taken into account in the avg. 
> > > We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
> > >
> > > u64 now = sched_clock_cpu(task_cpu(p));
> > >
> > > if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
> > >     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
> > > 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
> > >
> > > Then we can have this code in dequeue_task_fair:
> > >
> > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
> > > 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
> > >
> > > Thoughts ?
> > >
> >
> > This looks reasonable, it would be fine grained to only monitor the short sleep duration
> > of that task. Let me try this approach to see if there is any difference.
> >
>
> One more tweak: given that more than one task can update the cache_hot_timeout forward
> one after another, and given that some tasks have larger burst_sleep_avg values than
> others, I suspect we want to keep the forward movement monotonic with something like:
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
>     rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>

Yeah, Aaron has mentioned this too:
https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
May I know the benefit of keeping the forward movement monotonic?
I thought we should only honor the latest dequeued task's burst_sleep_avg.
Because we don't know whether the old dequeued task's cache has been
scribbled by the latest dequeued task or not, does it still make sense to
wake up the old dequeued task on its previous CPU?

thanks,
Chenyu
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 14:14 ` Chen Yu @ 2023-09-12 15:18 ` Mathieu Desnoyers 2023-09-13 3:02 ` Chen Yu 0 siblings, 1 reply; 34+ messages in thread From: Mathieu Desnoyers @ 2023-09-12 15:18 UTC (permalink / raw) To: Chen Yu Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel On 9/12/23 10:14, Chen Yu wrote: > On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote: [...] >> >> One more tweak: given that more than one task can update the cache_hot_timeout forward >> one after another, and given that some tasks have larger burst_sleep_avg values than >> others, I suspect we want to keep the forward movement monotonic with something like: >> >> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg && >> rq->cache_hot_timeout < now + p->se.burst_sleep_avg) >> rq->cache_hot_timeout = now + p->se.burst_sleep_avg; >> > > Yeah, Aaron has mentioned this too: > https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/ > May I know the benefit of keeping forward movement monotonic? > I thought that, should we only honor the latest dequeued task's burst_sleep_avg? > Because we don't know whether the old deuqued task's cache has been scribbled by the latest > dequeued task or not, does it still make sense to wake up the old dequeued task on its > previous CPU? Here is my reasoning: If a second task is scheduled after the first dequeued task (a task with large burst_sleep_avg) is dequeued, that second task (with small burst_sleep_avg) would need to entirely scribble the other task's cache lines within the time given by sysctl_sched_migration_cost, which I suspect is typically not very large. 
So I doubt that the second task can entirely kick out the first task's
cache lines within that time frame, and therefore that second task should
not move the cache_hot_timeout value backwards.

But perhaps I'm missing something ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-12 15:18 ` Mathieu Desnoyers @ 2023-09-13 3:02 ` Chen Yu 0 siblings, 0 replies; 34+ messages in thread From: Chen Yu @ 2023-09-13 3:02 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel On 2023-09-12 at 11:18:33 -0400, Mathieu Desnoyers wrote: > On 9/12/23 10:14, Chen Yu wrote: > > On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote: > [...] > > > > > > One more tweak: given that more than one task can update the cache_hot_timeout forward > > > one after another, and given that some tasks have larger burst_sleep_avg values than > > > others, I suspect we want to keep the forward movement monotonic with something like: > > > > > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg && > > > rq->cache_hot_timeout < now + p->se.burst_sleep_avg) > > > rq->cache_hot_timeout = now + p->se.burst_sleep_avg; > > > > > > > Yeah, Aaron has mentioned this too: > > https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/ > > May I know the benefit of keeping forward movement monotonic? > > I thought that, should we only honor the latest dequeued task's burst_sleep_avg? > > Because we don't know whether the old deuqued task's cache has been scribbled by the latest > > dequeued task or not, does it still make sense to wake up the old dequeued task on its > > previous CPU? > > Here is my reasoning: > > If a second task is scheduled after the first dequeued task (a > task with large burst_sleep_avg) is dequeued, that second task (with > small burst_sleep_avg) would need to entirely scribble the other task's > cache lines within the time given by sysctl_sched_migration_cost, which > I suspect is typically not very large. 
> So I doubt that the second task
> can entirely kick out the first task cache lines within that time frame,
> and therefore that second task should not move the cache_hot_timeout
> value backwards.
>
> But perhaps I'm missing something ?
>

You are right, the reservation time itself is not very long, unless
someone enlarges sysctl_sched_migration_cost in user space. Keeping the
reservation time moving forward could help the first dequeued task be
put on its previous CPU more easily (if the second dequeued task has not
been woken up yet). I'll modify it according to your suggestion and see
the results.

thanks,
Chenyu

> Thanks,
>
> Mathieu
>
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 15:26 ` Mathieu Desnoyers 2023-09-11 15:43 ` Mathieu Desnoyers @ 2023-09-20 12:34 ` Chen Yu 1 sibling, 0 replies; 34+ messages in thread From: Chen Yu @ 2023-09-20 12:34 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy, linux-kernel Hi Mathieu, On 2023-09-11 at 11:26:50 -0400, Mathieu Desnoyers wrote: > On 9/10/23 22:50, Chen Yu wrote: > > When task p is woken up, the scheduler leverages select_idle_sibling() > > to find an idle CPU for it. p's previous CPU is usually a preference > > because it can improve cache locality. However in many cases the > > previous CPU has already been taken by other wakees, thus p has to > > find another idle CPU. > > > > Inspired by Mathieu's idea[1], consider the sleep time of the task. > > If that task is a short sleeping one, keep p's previous CPU idle > > for a short while. Later when p is woken up, it can choose its > > previous CPU in select_idle_sibling(). When p's previous CPU is reserved, > > other wakee is not allowed to choose this CPU in select_idle_idle(). > > The reservation period is set to the task's average sleep time. That > > is to say, if p is a short sleeping task, there is no need to migrate > > p to another idle CPU. > > > > This does not break the work conservation of the scheduler, > > because wakee will still try its best to find an idle CPU. > > The difference is that, different idle CPUs might have different > > priorities. On the other hand, in theory this extra check could > > increase the failure ratio of select_idle_cpu(), but per the > > initial test result, no regression is detected. 
> >
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> >
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> >
> > netperf
> > =======
> > case             load        baseline(std%)  compare%( std%)
> > UDP_RR           56-threads   1.00 (  1.34)    +1.05 (  1.04)
> > UDP_RR           112-threads  1.00 (  7.94)    -0.68 ( 14.42)
> > UDP_RR           168-threads  1.00 ( 33.17)   +49.63 (  5.96)
> > UDP_RR           224-threads  1.00 ( 13.52)  +122.53 ( 18.50)
> >
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> >
> > hackbench
> > =========
> > case             load        baseline(std%)  compare%( std%)
> > process-pipe     1-groups     1.00 (  5.61)    -4.69 (  1.48)
> > process-pipe     2-groups     1.00 (  8.74)    -0.24 (  3.10)
> > process-pipe     4-groups     1.00 (  3.52)    +1.61 (  4.41)
> > process-sockets  1-groups     1.00 (  4.73)    +2.32 (  0.95)
> > process-sockets  2-groups     1.00 (  1.27)    -3.29 (  0.97)
> > process-sockets  4-groups     1.00 (  0.09)    +0.24 (  0.09)
> > threads-pipe     1-groups     1.00 ( 10.44)    -5.88 (  1.49)
> > threads-pipe     2-groups     1.00 ( 19.15)    +5.31 ( 12.90)
> > threads-pipe     4-groups     1.00 (  1.74)    -5.01 (  6.44)
> > threads-sockets  1-groups     1.00 (  1.58)    -1.79 (  0.43)
> > threads-sockets  2-groups     1.00 (  1.19)    -8.43 (  6.91)
> > threads-sockets  4-groups     1.00 (  0.10)    -0.09 (  0.07)
> >
> > schbench(old)
> > ========
> > case             load        baseline(std%)  compare%( std%)
> > normal           1-mthreads   1.00 (  0.63)    +1.28 (  0.37)
> > normal           2-mthreads   1.00 (  8.33)    +1.58 (  2.83)
> > normal           4-mthreads   1.00 (  2.48)    -2.98 (  3.28)
> > normal           8-mthreads   1.00 (  3.97)    +5.01 (  1.28)
> >
> > Not much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> > > > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1 > > Suggested-by: Tim Chen <tim.c.chen@intel.com> > > Signed-off-by: Chen Yu <yu.c.chen@intel.com> > > --- > > kernel/sched/fair.c | 30 +++++++++++++++++++++++++++--- > > kernel/sched/features.h | 1 + > > kernel/sched/sched.h | 1 + > > 3 files changed, 29 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index e20f50726ab8..fe3b760c9654 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) > > hrtick_update(rq); > > now = sched_clock_cpu(cpu_of(rq)); > > p->se.prev_sleep_time = task_sleep ? now : 0; > > +#ifdef CONFIG_SMP > > + /* > > + * If this rq will become idle, and dequeued task is > > + * a short sleeping one, check if we can reserve > > + * this idle CPU for that task for a short while. > > + * During this reservation period, other wakees will > > + * skip this 'idle' CPU in select_idle_cpu(), and this > > + * short sleeping task can pick its previous CPU in > > + * select_idle_sibling(), which brings better cache > > + * locality. > > + */ > > + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && > > + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost) > > + rq->cache_hot_timeout = now + p->se.sleep_avg; > > This is really cool! > > There is one scenario that worries me with this approach: workloads > that sleep for a long time and then have short blocked periods. > Those bursts will likely bring the average to values too high > to stay below sysctl_sched_migration_cost. > > I wonder if changing the code above for the following would help ? 
>
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
> 	rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);
>
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.
>

After a revisit, I think the above approach may be better, because it
considers all the historic data, rather than only the short sleep
durations as the following does:

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
    now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
	update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Say, if a task sleeps for 1 ms once, and in all the subsequent context
switches it sleeps much longer than 1 ms, then burst_sleep_avg will remain
1 ms, which does not reflect the behavior of this task.

update_avg() is actually an Exponentially Weighted Moving Average:

new_avg = 0.875 * old_avg + 0.125 * latest_sleep_duration

which approximately reflects the latest 1/0.125 = 8 samples. I guess 8 is
small enough to capture the burst sleep duration, unless we want to capture
the scenario that, during the latest 8 sleep events, there is 1 short sleep
and 7 long sleeps, and we still want to honor that 1 short sleep and
reserve the idle CPU. Then we can either use the min() method you proposed,
or use the logic you proposed in another reply, enhanced by clearing the
burst average sleep duration after 8 consecutive long sleep events, which
looks a little like overkill. Anyway, I can get some data based on this.

thanks,
Chenyu
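The "latest ~8 samples" claim follows from the EWMA weight: each update keeps 7/8 of the previous value, so only (7/8)^n of an old value's influence survives n further samples, and (7/8)^8 is roughly 0.34. A small userspace sketch (illustrative, not kernel code) of the 1 ms example discussed above:

```c
#include <stdint.h>

/* Run the 1/8-weight EWMA (the kernel's update_avg()) for n identical
 * samples and return the resulting average, to show how quickly the
 * old history is forgotten. */
uint64_t ewma_after(uint64_t start, uint64_t sample, int n)
{
	uint64_t avg = start;
	int i;

	for (i = 0; i < n; i++) {
		int64_t diff = (int64_t)sample - (int64_t)avg;

		avg += diff / 8;
	}
	return avg;
}
```

Starting from a 1 ms average, a single 10 ms sleep only moves it to about 2.1 ms, but after eight 10 ms sleeps the average is already close to 7 ms — the 1 ms burst is mostly forgotten.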
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-11 2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu ` (2 preceding siblings ...) 2023-09-11 15:26 ` Mathieu Desnoyers @ 2023-09-14 5:30 ` K Prateek Nayak 2023-09-14 10:43 ` Chen Yu 3 siblings, 1 reply; 34+ messages in thread From: K Prateek Nayak @ 2023-09-14 5:30 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Vincent Guittot, Juri Lelli, Ingo Molnar Hello Chenyu, One question ... On 9/11/2023 8:20 AM, Chen Yu wrote: > [..snip..] > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e20f50726ab8..fe3b760c9654 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > [..more snip..] > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu > int cpu; > > for_each_cpu(cpu, cpu_smt_mask(core)) { > - if (!available_idle_cpu(cpu)) { > + bool cache_hot = sched_feat(SIS_CACHE) ? > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false; > + > + if (!available_idle_cpu(cpu) || cache_hot) { > idle = false; > if (*idle_cpu == -1) { > - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) { > + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) && > + !cache_hot) { Here, the CPU is running a SCHED_IDLE task ... > *idle_cpu = cpu; > break; > } ... but just below this, there are following lines to cache the idle_cpu: } if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) *idle_cpu = cpu; Would it make sense to also add the same "cache_hot" check here when we come across an idle CPU during the search for an idle core? 
Something like: - if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) + if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr)) *idle_cpu = cpu; Implications with the above change: If the entire core is idle, "select_idle_core()" will return the core and the search will bail out in "select_idle_cpu()". Otherwise, the cache-hot idle CPUs encountered during the search for idle core will be ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is not cache-hot. Thoughts? > [..snip..] -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-14 5:30 ` K Prateek Nayak @ 2023-09-14 10:43 ` Chen Yu 2023-09-15 3:37 ` K Prateek Nayak 0 siblings, 1 reply; 34+ messages in thread From: Chen Yu @ 2023-09-14 10:43 UTC (permalink / raw) To: K Prateek Nayak Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Vincent Guittot, Juri Lelli, Ingo Molnar Hi Prateek, On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote: > Hello Chenyu, > > One question ... > > On 9/11/2023 8:20 AM, Chen Yu wrote: > > [..snip..] > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index e20f50726ab8..fe3b760c9654 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > [..more snip..] > > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu > > int cpu; > > > > for_each_cpu(cpu, cpu_smt_mask(core)) { > > - if (!available_idle_cpu(cpu)) { > > + bool cache_hot = sched_feat(SIS_CACHE) ? > > + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false; > > + > > + if (!available_idle_cpu(cpu) || cache_hot) { > > idle = false; > > if (*idle_cpu == -1) { > > - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) { > > + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) && > > + !cache_hot) { > > Here, the CPU is running a SCHED_IDLE task ... > > > *idle_cpu = cpu; > > break; > > } > > ... but just below this, there are following lines to cache the idle_cpu: > > } > if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) > *idle_cpu = cpu; > > Would it make sense to also add the same "cache_hot" check here when we > come across an idle CPU during the search for an idle core? 
> Something like:
>
> - if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))

When we reach the above code, the following condition should be true:

(available_idle_cpu(cpu) && !cache_hot)

because the previous 'if' statement is false. So I guess we already
have !cache_hot ?

> + if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
> 	*idle_cpu = cpu;
>
> Implications with the above change:
>
> If the entire core is idle, "select_idle_core()" will return the core
> and the search will bail out in "select_idle_cpu()". Otherwise, the
> cache-hot idle CPUs encountered during the search for idle core will be
> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
> not cache-hot.
>
> Thoughts?
>

Yes, agree, we want to skip the cache-hot idle CPU if that entire core
is not idle in your case.

thanks,
Chenyu
* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() 2023-09-14 10:43 ` Chen Yu @ 2023-09-15 3:37 ` K Prateek Nayak 0 siblings, 0 replies; 34+ messages in thread From: K Prateek Nayak @ 2023-09-15 3:37 UTC (permalink / raw) To: Chen Yu Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Gautham R . Shenoy, linux-kernel, Peter Zijlstra, Mathieu Desnoyers, Vincent Guittot, Juri Lelli, Ingo Molnar Hello Chenyu, On 9/14/2023 4:13 PM, Chen Yu wrote: > Hi Prateek, > > On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote: >> Hello Chenyu, >> >> One question ... >> >> On 9/11/2023 8:20 AM, Chen Yu wrote: >>> [..snip..] >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index e20f50726ab8..fe3b760c9654 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> [..more snip..] >>> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu >>> int cpu; >>> >>> for_each_cpu(cpu, cpu_smt_mask(core)) { >>> - if (!available_idle_cpu(cpu)) { >>> + bool cache_hot = sched_feat(SIS_CACHE) ? >>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false; >>> + >>> + if (!available_idle_cpu(cpu) || cache_hot) { >>> idle = false; >>> if (*idle_cpu == -1) { >>> - if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) { >>> + if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) && >>> + !cache_hot) { >> >> Here, the CPU is running a SCHED_IDLE task ... >> >>> *idle_cpu = cpu; >>> break; >>> } >> >> ... but just below this, there are following lines to cache the idle_cpu: >> >> } >> if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) >> *idle_cpu = cpu; >> >> Would it make sense to also add the same "cache_hot" check here when we >> come across an idle CPU during the search for an idle core? 
>> Something like:
>>
>> - if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
>
> When we reached above code, the following condition should be true:
> (available_idle_cpu(cpu) && !cache_hot)
> because the previous 'if' statement is false. So I guess we already
> has !cache_hot ?

Ah! You are right. I missed the break at the end of the if block.
Thank you for pointing it out to me :)

>
>> + if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
>> 	*idle_cpu = cpu;
>>
>> Implications with the above change:
>>
>> If the entire core is idle, "select_idle_core()" will return the core
>> and the search will bail out in "select_idle_cpu()". Otherwise, the
>> cache-hot idle CPUs encountered during the search for idle core will be
>> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
>> not cache-hot.
>>
>> Thoughts?
>>
>
> Yes, agree, we want to skip the cache-hot idle CPU if that entire core
> is not idle in your case.
>
> thanks,
> Chenyu

-- 
Thanks and Regards,
Prateek
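The behavior agreed on in this subthread — skip idle-but-reserved CPUs during the scan, while relying on the caller's fallback for work conservation — can be sketched as a toy model. The names are illustrative; the real `select_idle_cpu()`/`select_idle_core()` scan is considerably more involved.

```c
#include <stdint.h>

struct toy_cpu {
	int idle;
	uint64_t cache_hot_timeout;
};

/* Toy model of the SIS_CACHE scan: return the first idle CPU whose
 * reservation has expired, or -1 so the caller can fall back to any
 * idle CPU (preserving work conservation). */
int pick_idle_cpu(const struct toy_cpu *cpus, int n, uint64_t now)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (now < cpus[i].cache_hot_timeout)
			continue;	/* reserved for a short sleeper */
		return i;
	}
	return -1;
}
```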
end of thread, other threads:[~2023-09-20 12:35 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-11  2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
2023-09-11  2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
2023-09-11  7:26   ` Aaron Lu
2023-09-11  8:40     ` Chen Yu
2023-09-13  6:22       ` Gautham R. Shenoy
2023-09-13  7:25         ` Chen Yu
2023-09-14  7:06           ` Gautham R. Shenoy
2023-09-14 12:09             ` Chen Yu
2023-09-15 15:18               ` Gautham R. Shenoy
2023-09-19  9:01                 ` Chen Yu
2023-09-11  8:29   ` K Prateek Nayak
2023-09-11 10:19     ` Chen Yu
2023-09-12  3:05       ` K Prateek Nayak
2023-09-12 12:32         ` Chen Yu
2023-09-12 14:26           ` K Prateek Nayak
2023-09-13  2:57             ` Chen Yu
2023-09-14  4:13               ` K Prateek Nayak
2023-09-14 11:01                 ` Chen Yu
2023-09-15  3:21                   ` K Prateek Nayak
2023-09-12  9:39         ` Mike Galbraith
2023-09-12 14:51           ` Chen Yu
2023-09-12  6:32   ` Mike Galbraith
2023-09-11 15:26   ` Mathieu Desnoyers
2023-09-11 15:43     ` Mathieu Desnoyers
2023-09-12 11:53       ` Chen Yu
2023-09-12 14:06         ` Mathieu Desnoyers
2023-09-12 14:14           ` Chen Yu
2023-09-12 15:18             ` Mathieu Desnoyers
2023-09-13  3:02               ` Chen Yu
2023-09-20 12:34     ` Chen Yu
2023-09-14  5:30   ` K Prateek Nayak
2023-09-14 10:43     ` Chen Yu
2023-09-15  3:37       ` K Prateek Nayak