* [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC @ 2025-03-17 8:28 Joel Fernandes 2025-03-17 17:08 ` Tejun Heo 2025-03-17 17:20 ` Andrea Righi 0 siblings, 2 replies; 17+ messages in thread From: Joel Fernandes @ 2025-03-17 8:28 UTC (permalink / raw) To: linux-kernel, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider Cc: Joel Fernandes Consider that the previous CPU is cache affine to the waker's CPU and is idle. Currently, scx's default select function only selects the previous CPU in this case if a WF_SYNC request is also made to wake up on the waker's CPU. This means that, without WF_SYNC, a previous CPU that is cache affine to the waker and idle is not considered. This seems extreme. WF_SYNC is not normally passed to the wakeup path outside of some IPC drivers, but it is very possible that the task is cache hot on the previous CPU and shares cache with the waker CPU. Let's avoid too many migrations and select the previous CPU in such cases. This change is consistent with the fair scheduler's behavior as well. In select_idle_sibling(), the previous CPU is selected if it is cache affine with the target. This is done regardless of WF_SYNC and before any scanning of fully idle cores is done. One difference still exists between SCX and CFS in this regard: in CFS, we first check whether the target CPU is idle before checking whether the previous CPU is idle. However, that could be a matter of choice and, in the future, that behavior could also be unified.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> --- kernel/sched/ext.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 5a81d9a1e31f..3b1a489e2aaf 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3479,7 +3479,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, { const struct cpumask *llc_cpus = NULL; const struct cpumask *numa_cpus = NULL; - s32 cpu; + s32 cpu = smp_processor_id(); *found = false; @@ -3507,22 +3507,20 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, llc_cpus = llc_span(prev_cpu); } + /* + * If the waker's CPU is cache affine and prev_cpu is idle, then avoid + * a migration. + */ + if (cpus_share_cache(cpu, prev_cpu) && + test_and_clear_cpu_idle(prev_cpu)) { + cpu = prev_cpu; + goto cpu_found; + } + /* * If WAKE_SYNC, try to migrate the wakee to the waker's CPU. */ if (wake_flags & SCX_WAKE_SYNC) { - cpu = smp_processor_id(); - - /* - * If the waker's CPU is cache affine and prev_cpu is idle, - * then avoid a migration. - */ - if (cpus_share_cache(cpu, prev_cpu) && - test_and_clear_cpu_idle(prev_cpu)) { - cpu = prev_cpu; - goto cpu_found; - } - /* * If the waker's local DSQ is empty, and the system is under * utilized, try to wake up @p to the local DSQ of the waker. -- 2.43.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 8:28 [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC Joel Fernandes @ 2025-03-17 17:08 ` Tejun Heo 2025-03-17 17:30 ` Andrea Righi ` (2 more replies) 2025-03-17 17:20 ` Andrea Righi 1 sibling, 3 replies; 17+ messages in thread From: Tejun Heo @ 2025-03-17 17:08 UTC (permalink / raw) To: Joel Fernandes Cc: linux-kernel, David Vernet, Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider Hello, Joel. On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote: > Consider that the previous CPU is cache affined to the waker's CPU and > is idle. Currently, scx's default select function only selects the > previous CPU in this case if WF_SYNC request is also made to wakeup on the > waker's CPU. > > This means, without WF_SYNC, the previous CPU being cache affined to the > waker and is idle is not considered. This seems extreme. WF_SYNC is not > normally passed to the wakeup path outside of some IPC drivers but it is > very possible that the task is cache hot on previous CPU and shares > cache with the waker CPU. Lets avoid too many migrations and select the > previous CPU in such cases. Hmm.. if !WF_SYNC: 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle core in widening scopes. 2. If no idle core is found, pick prev_cpu if idle. If not, search for an idle CPU in widening scopes. So, it is considering prev_cpu, right? I think it's preferring idle cores a bit too much - it probably doesn't make sense to cross the NUMA boundary if there is an idle CPU in this node, at least. Isn't the cpus_share_cache() code block mostly about not doing waker-affining if prev_cpu of the wakee is close enough and idle, so waker-affining is likely to be worse? Thanks.
-- tejun ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 17:08 ` Tejun Heo @ 2025-03-17 17:30 ` Andrea Righi 2025-03-17 17:44 ` Peter Zijlstra 2025-03-17 22:11 ` Joel Fernandes 2025-03-17 22:07 ` Joel Fernandes 2025-03-18 0:12 ` Libo Chen 2 siblings, 2 replies; 17+ messages in thread From: Andrea Righi @ 2025-03-17 17:30 UTC (permalink / raw) To: Tejun Heo Cc: Joel Fernandes, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote: > Hello, Joel. > > On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote: > > Consider that the previous CPU is cache affined to the waker's CPU and > > is idle. Currently, scx's default select function only selects the > > previous CPU in this case if WF_SYNC request is also made to wakeup on the > > waker's CPU. > > > > This means, without WF_SYNC, the previous CPU being cache affined to the > > waker and is idle is not considered. This seems extreme. WF_SYNC is not > > normally passed to the wakeup path outside of some IPC drivers but it is > > very possible that the task is cache hot on previous CPU and shares > > cache with the waker CPU. Lets avoid too many migrations and select the > > previous CPU in such cases. > > Hmm.. if !WF_SYNC: > > 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle > core in widening scopes. > > 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an > idle CPU in widening scopes. > > So, it is considering prev_cpu, right? I think it's preferring idle core a > bit too much - it probably doesn't make sense to cross the NUMA boundary if > there is an idle CPU in this node, at least. Yeah, we should probably be a bit more conservative by default and avoid jumping across nodes if there are still idle CPUs within the node. 
With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce that while still using the built-in idle policy (since we can specify idle flags), but that doesn't preclude adjusting the default policy anyway, if it makes more sense. I guess the question is: what is more expensive in general on task wakeup? 1) a cross-node migration or 2) running on a partially busy SMT core? -Andrea [1] https://lore.kernel.org/all/20250314094827.167563-1-arighi@nvidia.com/ > > Isn't the cpus_share_cache() code block mostly about not doing > waker-affining if prev_cpu of the wakee is close enough and idle, so > waker-affining is likely to be worse? > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 17:30 ` Andrea Righi @ 2025-03-17 17:44 ` Peter Zijlstra 2025-03-18 5:17 ` Andrea Righi 2025-03-17 22:11 ` Joel Fernandes 1 sibling, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2025-03-17 17:44 UTC (permalink / raw) To: Andrea Righi Cc: Tejun Heo, Joel Fernandes, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On Mon, Mar 17, 2025 at 06:30:57PM +0100, Andrea Righi wrote: > I guess the question is: what is more expensive in general on task wakeup? > 1) a cross-node migration or 2) running on a partially busy SMT core? That totally depends on both the workload and the actual machine :/ If you have 'fast' numa and not a very big footprint, the numa migrations aren't too bad. OTOH if you have sucky numa or your memory footprint is significant, then running on the wrong node is super painful. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 17:44 ` Peter Zijlstra @ 2025-03-18 5:17 ` Andrea Righi 0 siblings, 0 replies; 17+ messages in thread From: Andrea Righi @ 2025-03-18 5:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Tejun Heo, Joel Fernandes, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On Mon, Mar 17, 2025 at 06:44:31PM +0100, Peter Zijlstra wrote: > On Mon, Mar 17, 2025 at 06:30:57PM +0100, Andrea Righi wrote: > > > I guess the question is: what is more expensive in general on task wakeup? > > 1) a cross-node migration or 2) running on a partially busy SMT core? > > That totally depends on both the workload and the actual machine :/ > > If you have 'fast' numa and not a very big footprint, the numa > migrations aren't too bad. OTOH if you have sucky numa or your memory > footprint is significant, then running on the wrong node is super > painful. Right, there isn't a single "best solution" in general. I guess we just need to pick what we think works best in most cases, set that as the default, and then leave it to the BPF schedulers to adjust the behavior as needed (in the end, this is just a default policy). Thanks, -Andrea ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 17:30 ` Andrea Righi 2025-03-17 17:44 ` Peter Zijlstra @ 2025-03-17 22:11 ` Joel Fernandes 2025-03-18 5:09 ` Andrea Righi 1 sibling, 1 reply; 17+ messages in thread From: Joel Fernandes @ 2025-03-17 22:11 UTC (permalink / raw) To: Andrea Righi, Tejun Heo Cc: linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On 3/17/2025 6:30 PM, Andrea Righi wrote: > On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote: >> Hello, Joel. >> >> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote: >>> Consider that the previous CPU is cache affined to the waker's CPU and >>> is idle. Currently, scx's default select function only selects the >>> previous CPU in this case if WF_SYNC request is also made to wakeup on the >>> waker's CPU. >>> >>> This means, without WF_SYNC, the previous CPU being cache affined to the >>> waker and is idle is not considered. This seems extreme. WF_SYNC is not >>> normally passed to the wakeup path outside of some IPC drivers but it is >>> very possible that the task is cache hot on previous CPU and shares >>> cache with the waker CPU. Lets avoid too many migrations and select the >>> previous CPU in such cases. >> Hmm.. if !WF_SYNC: >> >> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle >> core in widening scopes. >> >> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an >> idle CPU in widening scopes. >> >> So, it is considering prev_cpu, right? I think it's preferring idle core a >> bit too much - it probably doesn't make sense to cross the NUMA boundary if >> there is an idle CPU in this node, at least. > > Yeah, we should probably be a bit more conservative by default and avoid > jumping across nodes if there are still idle CPUs within the node. > Agreed. 
So maybe we check for fully idle cores *within the node* first, before preferring idle SMTs *within the node*? And then, as a next step, go looking at other nodes. Would that be a reasonable middle ground? > With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce > that while still using the built-in idle policy (since we can specify idle > flags), but that doesn't preclude adjusting the default policy anyway, if > it makes more sense. Aren't you deprecating the usage of the default select function? If we are going to be adjusting its behavior like my patch is doing, then we should probably not also deprecate it. thanks, - Joel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 22:11 ` Joel Fernandes @ 2025-03-18 5:09 ` Andrea Righi 2025-03-18 17:00 ` Joel Fernandes 0 siblings, 1 reply; 17+ messages in thread From: Andrea Righi @ 2025-03-18 5:09 UTC (permalink / raw) To: Joel Fernandes Cc: Tejun Heo, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On Mon, Mar 17, 2025 at 11:11:08PM +0100, Joel Fernandes wrote: > > > On 3/17/2025 6:30 PM, Andrea Righi wrote: > > On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote: > >> Hello, Joel. > >> > >> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote: > >>> Consider that the previous CPU is cache affined to the waker's CPU and > >>> is idle. Currently, scx's default select function only selects the > >>> previous CPU in this case if WF_SYNC request is also made to wakeup on the > >>> waker's CPU. > >>> > >>> This means, without WF_SYNC, the previous CPU being cache affined to the > >>> waker and is idle is not considered. This seems extreme. WF_SYNC is not > >>> normally passed to the wakeup path outside of some IPC drivers but it is > >>> very possible that the task is cache hot on previous CPU and shares > >>> cache with the waker CPU. Lets avoid too many migrations and select the > >>> previous CPU in such cases. > >> Hmm.. if !WF_SYNC: > >> > >> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle > >> core in widening scopes. > >> > >> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an > >> idle CPU in widening scopes. > >> > >> So, it is considering prev_cpu, right? I think it's preferring idle core a > >> bit too much - it probably doesn't make sense to cross the NUMA boundary if > >> there is an idle CPU in this node, at least. 
> > > > Yeah, we should probably be a bit more conservative by default and avoid > > jumping across nodes if there are still idle CPUs within the node. > > > > Agreed. So maybe we check for fully idle cores *within the node* first, before > preferring idle SMTs *within the node* ? And then, as next step go looking at > other nodes. Would that be a reasonable middle ground? > > > With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce > > that while still using the built-in idle policy (since we can specify idle > > flags), but that doesn't preclude adjusting the default policy anyway, if > > it makes more sense. > > Aren't you deprecating the usage of the default select function? If we are going > to be adjusting its behavior like my patch is doing, then we should probably not > also deprecate it. I'm just extending the default select function to accept a cpumask and idle SCX_PICK_IDLE_* flags, so that it's easier for BPF schedulers to change the select behavior without reimplementing the whole thing. The old scx_bpf_select_cpu_dfl() will be remapped to the new API for a while for backward compatibility and the underlying selection logic remains the same. So, in this case for example, you could implement the "check full-idle then partial-idle SMT CPUs within the node" logic as follows: /* Search for full-idle SMT first, then idle CPUs within prev_cpu's node */ cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, SCX_PICK_IDLE_IN_NODE); if (cpu < 0) { /* Search for full-idle SMT first, then idle CPUs across all nodes */ cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0); } -Andrea ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-18 5:09 ` Andrea Righi @ 2025-03-18 17:00 ` Joel Fernandes 2025-03-18 17:46 ` Andrea Righi 0 siblings, 1 reply; 17+ messages in thread From: Joel Fernandes @ 2025-03-18 17:00 UTC (permalink / raw) To: Andrea Righi Cc: Tejun Heo, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On Tue, Mar 18, 2025 at 06:09:42AM +0100, Andrea Righi wrote: > On Mon, Mar 17, 2025 at 11:11:08PM +0100, Joel Fernandes wrote: > > > > > > On 3/17/2025 6:30 PM, Andrea Righi wrote: > > > On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote: > > >> Hello, Joel. > > >> > > >> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote: > > >>> Consider that the previous CPU is cache affined to the waker's CPU and > > >>> is idle. Currently, scx's default select function only selects the > > >>> previous CPU in this case if WF_SYNC request is also made to wakeup on the > > >>> waker's CPU. > > >>> > > >>> This means, without WF_SYNC, the previous CPU being cache affined to the > > >>> waker and is idle is not considered. This seems extreme. WF_SYNC is not > > >>> normally passed to the wakeup path outside of some IPC drivers but it is > > >>> very possible that the task is cache hot on previous CPU and shares > > >>> cache with the waker CPU. Lets avoid too many migrations and select the > > >>> previous CPU in such cases. > > >> Hmm.. if !WF_SYNC: > > >> > > >> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle > > >> core in widening scopes. > > >> > > >> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an > > >> idle CPU in widening scopes. > > >> > > >> So, it is considering prev_cpu, right? 
I think it's preferring idle core a > > >> bit too much - it probably doesn't make sense to cross the NUMA boundary if > > >> there is an idle CPU in this node, at least. > > > > > > Yeah, we should probably be a bit more conservative by default and avoid > > > jumping across nodes if there are still idle CPUs within the node. > > > > > > > Agreed. So maybe we check for fully idle cores *within the node* first, before > > preferring idle SMTs *within the node* ? And then, as next step go looking at > > other nodes. Would that be a reasonable middle ground? > > > > > With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce > > > that while still using the built-in idle policy (since we can specify idle > > > flags), but that doesn't preclude adjusting the default policy anyway, if > > > it makes more sense. > > > > Aren't you deprecating the usage of the default select function? If we are going > > to be adjusting its behavior like my patch is doing, then we should probably not > > also deprecate it. > > I'm just extending the default select function to accept a cpumask and idle > SCX_PICK_IDLE_* flags, so that it's easier for BPF schedulers to change the > select behavior without reimplementing the whole thing. > > The old scx_bpf_select_cpu_dfl() will be remapped to the new API for a > while for backward compatibility and the underlying selection logic remains > the same. > > So, in this case for example, you could implement the "check full-idle then > partial-idle SMT CPUs within the node" logic as following: > > /* Search for full-idle SMT first, then idle CPUs within prev_cpu's node */ > cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, > p->cpus_ptr, SCX_PICK_IDLE_IN_NODE) > if (cpu < 0) { > /* Search for full-idle SMT first, then idle CPUs across all nodes */ > cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0) > } Thanks, Andrea! 
I adjusted the default selection as below, hope it looks good now, will test it more as well. Let me know any comments. ----------------8<----- From: Joel Fernandes <joelagnelf@nvidia.com> Subject: [PATCH] sched/ext: Make default idle CPU selection better Currently, sched_ext's default CPU selection is roughly something like this: 1. Look for FULLY IDLE CORES: 1.1. Select prev CPU (wakee) if its CORE is fully idle. 1.2. Or, pick any CPU from fully idle CORE in the L3, then NUMA. 1.3. Or, any idle CPU from fully idle CORE usable by task. 2. Or, use PREV CPU if it is idle. 3. Or any idle CPU in the LLC, NUMA. 4. Or finally any CPU usable by the task. This can end up selecting any idle core in the system even if that means jumping across NUMA nodes (basically 1.3 happens before 3.). Improve this by moving 1.3 to after 3 (so that skipping over NUMA happens only later) and also add selection of fully idle target (waker) core before looking for fully-idle cores in the LLC/NUMA. This is similar to what the FAIR scheduler does. The new sequence is as follows: 1. Look for FULLY IDLE CORES: 1.1. Select prev CPU (wakee) if its CORE is fully idle. 1.2. Select target CPU (waker) if its CORE is fully idle and shares cache with prev. <- Added this. 1.3. Or, pick any CPU from fully idle CORE in the L3, then NUMA. 2. Or, use PREV CPU if it is idle. 3. Or any idle CPU in the LLC, NUMA. 4. Or, any idle CPU from fully idle CORE usable by task. <- Moved down. 5. Or finally any CPU usable by the task.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> --- kernel/sched/ext.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 5a81d9a1e31f..324e442319c7 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, goto cpu_found; } + /* + * If the waker's CPU shares cache with @prev_cpu and is part + * of a fully idle core, select it. + */ + if (cpus_share_cache(cpu, prev_cpu) && + cpumask_test_cpu(cpu, idle_masks.smt) && + test_and_clear_cpu_idle(cpu)) { + goto cpu_found; + } + /* * Search for any fully idle core in the same LLC domain. */ @@ -3575,13 +3585,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, if (cpu >= 0) goto cpu_found; } - - /* - * Search for any full idle core usable by the task. - */ - cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); - if (cpu >= 0) - goto cpu_found; } /* @@ -3610,6 +3613,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, goto cpu_found; } + /* + * Search for any full idle core usable by the task. + */ + if (sched_smt_active()) { + cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE); + if (cpu >= 0) + goto cpu_found; + } + /* * Search for any idle CPU usable by the task. */ -- 2.43.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-18 17:00 ` Joel Fernandes @ 2025-03-18 17:46 ` Andrea Righi 2025-03-18 22:37 ` Joel Fernandes 0 siblings, 1 reply; 17+ messages in thread From: Andrea Righi @ 2025-03-18 17:46 UTC (permalink / raw) To: Joel Fernandes Cc: Tejun Heo, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider Hi Joel, On Tue, Mar 18, 2025 at 01:00:47PM -0400, Joel Fernandes wrote: ... > From: Joel Fernandes <joelagnelf@nvidia.com> > Subject: [PATCH] sched/ext: Make default idle CPU selection better > > Currently, sched_ext's default CPU selection is roughly something like > this: > > 1. Look for FULLY IDLE CORES: > 1.1. Select prev CPU (wakee) if its CORE is fully idle. > 1.2. Or, pick any CPU from fully idle CORE in the L3, then NUMA. > 1.3. Or, any idle CPU from fully idle CORE usable by task. > 2. Or, use PREV CPU if it is idle. > 3. Or any idle CPU in the LLC, NUMA. > 4. Or finally any CPU usable by the task. > > This can end up select any idle core in the system even if that means > jumping across NUMA nodes (basically 1.3 happens before 3.). > > Improve this by moving 1.3 to after 3 (so that skipping over NUMA > happens only later) and also add selection of fully idle target (waker) > core before looking for fully-idle cores in the LLC/NUMA. This is similar to > what FAIR scheduler does. > > The new sequence is as follows: > > 1. Look for FULLY IDLE CORES: > 1.1. Select prev CPU (wakee) if its CORE is fully idle. > 1.2. Select target CPU (waker) if its CORE is fully idle and shares cache > with prev. <- Added this. > 1.3. Or, pick any CPU from fully idle CORE in the L3, then NUMA. > 2. Or, use PREV CPU if it is idle. > 3. Or any idle CPU in the LLC, NUMA. > 4. Or, any idle CPU from fully idle CORE usable by task. <- Moved down. > 5. Or finally any CPU usable by the task. 
> > Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> > --- > kernel/sched/ext.c | 26 +++++++++++++++++++------- > 1 file changed, 19 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 5a81d9a1e31f..324e442319c7 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > goto cpu_found; > } > > + /* > + * If the waker's CPU shares cache with @prev_cpu and is part > + * of a fully idle core, select it. > + */ > + if (cpus_share_cache(cpu, prev_cpu) && > + cpumask_test_cpu(cpu, idle_masks.smt) && > + test_and_clear_cpu_idle(cpu)) { I think this is always false, because cpu is still in use by the waker and its state hasn't been updated to idle yet. > + goto cpu_found; > + } > + -Andrea ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-18 17:46 ` Andrea Righi @ 2025-03-18 22:37 ` Joel Fernandes 0 siblings, 0 replies; 17+ messages in thread From: Joel Fernandes @ 2025-03-18 22:37 UTC (permalink / raw) To: Andrea Righi Cc: Tejun Heo, linux-kernel, David Vernet, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On 3/18/2025 6:46 PM, Andrea Righi wrote: > Hi Joel, > > On Tue, Mar 18, 2025 at 01:00:47PM -0400, Joel Fernandes wrote: > ... >> From: Joel Fernandes <joelagnelf@nvidia.com> >> Subject: [PATCH] sched/ext: Make default idle CPU selection better >> >> Currently, sched_ext's default CPU selection is roughly something like >> this: >> >> 1. Look for FULLY IDLE CORES: >> 1.1. Select prev CPU (wakee) if its CORE is fully idle. >> 1.2. Or, pick any CPU from fully idle CORE in the L3, then NUMA. >> 1.3. Or, any idle CPU from fully idle CORE usable by task. >> 2. Or, use PREV CPU if it is idle. >> 3. Or any idle CPU in the LLC, NUMA. >> 4. Or finally any CPU usable by the task. >> >> This can end up select any idle core in the system even if that means >> jumping across NUMA nodes (basically 1.3 happens before 3.). >> >> Improve this by moving 1.3 to after 3 (so that skipping over NUMA >> happens only later) and also add selection of fully idle target (waker) >> core before looking for fully-idle cores in the LLC/NUMA. This is similar to >> what FAIR scheduler does. >> >> The new sequence is as follows: >> >> 1. Look for FULLY IDLE CORES: >> 1.1. Select prev CPU (wakee) if its CORE is fully idle. >> 1.2. Select target CPU (waker) if its CORE is fully idle and shares cache >> with prev. <- Added this. >> 1.3. Or, pick any CPU from fully idle CORE in the L3, then NUMA. >> 2. Or, use PREV CPU if it is idle. >> 3. Or any idle CPU in the LLC, NUMA. >> 4. Or, any idle CPU from fully idle CORE usable by task. 
<- Moved down. >> 5. Or finally any CPU usable by the task. >> >> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> >> --- >> kernel/sched/ext.c | 26 +++++++++++++++++++------- >> 1 file changed, 19 insertions(+), 7 deletions(-) >> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c >> index 5a81d9a1e31f..324e442319c7 100644 >> --- a/kernel/sched/ext.c >> +++ b/kernel/sched/ext.c >> @@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, >> goto cpu_found; >> } >> >> + /* >> + * If the waker's CPU shares cache with @prev_cpu and is part >> + * of a fully idle core, select it. >> + */ >> + if (cpus_share_cache(cpu, prev_cpu) && >> + cpumask_test_cpu(cpu, idle_masks.smt) && >> + test_and_clear_cpu_idle(cpu)) { > > I think this is always false, because cpu is still in use by the waker and > its state hasn't been updated to idle yet. > Summarizing our in-person meeting at the conference: we verified that in the case of wakeups happening from IRQ context, the waker could be in the idle state at the time of select. So I'll keep the change and will rebase and send it out again soon. thanks, - Joel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC 2025-03-17 17:08 ` Tejun Heo 2025-03-17 17:30 ` Andrea Righi @ 2025-03-17 22:07 ` Joel Fernandes 2025-03-17 22:25 ` Tejun Heo 2025-03-18 0:12 ` Libo Chen 2 siblings, 1 reply; 17+ messages in thread From: Joel Fernandes @ 2025-03-17 22:07 UTC (permalink / raw) To: Tejun Heo Cc: linux-kernel, David Vernet, Andrea Righi, Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider On 3/17/2025 6:08 PM, Tejun Heo wrote: > Hello, Joel. > > On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote: >> Consider that the previous CPU is cache affined to the waker's CPU and >> is idle. Currently, scx's default select function only selects the >> previous CPU in this case if WF_SYNC request is also made to wakeup on the >> waker's CPU. >> >> This means, without WF_SYNC, the previous CPU being cache affined to the >> waker and is idle is not considered. This seems extreme. WF_SYNC is not >> normally passed to the wakeup path outside of some IPC drivers but it is >> very possible that the task is cache hot on previous CPU and shares >> cache with the waker CPU. Lets avoid too many migrations and select the >> previous CPU in such cases. > > Hmm.. if !WF_SYNC: > > 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle > core in widening scopes. > > 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an > idle CPU in widening scopes. > > So, it is considering prev_cpu, right? Yes, but unlike the FAIR scheduler, it is not selecting the previous idle SMT for a busy core, and instead looking for a fully idle core first. I saw that as a difference between what FAIR and EXT do. > I think it's preferring idle core a > bit too much - it probably doesn't make sense to cross the NUMA boundary if > there is an idle CPU in this node, at least. > Yes, that is a bit extreme. 
I believe that is what my patch is fixing. If the previous CPU and the current CPU share cache, we prefer busy cores with a free previous idle SMT; otherwise we go looking for fully idle cores. But from Peter's reply, it sounds like that is not necessarily a good thing to do in the 'fast NUMA' case. So I guess there is no universally good answer (?) or way of doing it. I guess one of the things I would like is for FAIR's and EXT's selection algorithms (the default one, in EXT's case) to be close in behavior and consistent. And maybe we can unify their behaviors in this regard, and 'do the right thing' for the majority of workloads and machines. > Isn't the cpus_share_cache() code block mostly about not doing > waker-affining if prev_cpu of the wakee is close enough and idle, so > waker-affining is likely to be worse? It seems with FAIR, we do waker-affining first: In select_idle_sibling(...): if ((available_idle_cpu(target) || sched_idle_cpu(target)) && asym_fits_cpu(task_util, util_min, util_max, target)) return target; before checking if the previous CPU shares cache: In select_idle_sibling(...): /* * If the previous CPU is cache affine and idle, don't be stupid: */ if (prev != target && cpus_share_cache(prev, target) && (available_idle_cpu(prev) || sched_idle_cpu(prev)) && asym_fits_cpu(task_util, util_min, util_max, prev)) { if (!static_branch_unlikely(&sched_cluster_active) || cpus_share_resources(prev, target)) return prev; prev_aff = prev; } So I think that block is not about "not doing waker affining" but about "doing wakee affining". thanks, - Joel ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Tejun Heo @ 2025-03-17 22:25 UTC
To: Joel Fernandes
Cc: linux-kernel, David Vernet, Andrea Righi, Changwoo Min, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Hello,

On Mon, Mar 17, 2025 at 11:07:36PM +0100, Joel Fernandes wrote:
...
>> I think it's preferring idle cores a bit too much - it probably doesn't
>> make sense to cross the NUMA boundary if there is an idle CPU in this
>> node, at least.
>
> Yes, that is a bit extreme. I believe that is what my patch is fixing. If
> the previous CPU and the current CPU share cache, we prefer a busy core
> whose previous SMT sibling is idle over going looking for fully idle
> cores. But it sounds like, from Peter's reply, that is not necessarily a
> good thing to do in the 'fast NUMA' case. So I guess there is no good
> answer (?) or way of doing it.

Yeah, recent AMD CPUs can be configured into a NUMA mode where each chiplet
is reported as a node, and they do behave like one, as each has its own
memory connection; but the distances among them are significantly closer
than in traditional multi-socket systems. That said, I think the
implementation is still a bit too happy to jump the boundary. What Andrea is
suggesting seems reasonable?

Thanks.

--
tejun
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Joel Fernandes @ 2025-03-17 22:45 UTC
To: Tejun Heo
Cc: linux-kernel, David Vernet, Andrea Righi, Changwoo Min, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On 3/17/2025 11:25 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 17, 2025 at 11:07:36PM +0100, Joel Fernandes wrote:
> ...
>>> I think it's preferring idle cores a bit too much - it probably doesn't
>>> make sense to cross the NUMA boundary if there is an idle CPU in this
>>> node, at least.
>>
>> Yes, that is a bit extreme. I believe that is what my patch is fixing.
>> If the previous CPU and the current CPU share cache, we prefer a busy
>> core whose previous SMT sibling is idle over going looking for fully
>> idle cores. But it sounds like, from Peter's reply, that is not
>> necessarily a good thing to do in the 'fast NUMA' case. So I guess there
>> is no good answer (?) or way of doing it.
>
> Yeah, recent AMD CPUs can be configured into a NUMA mode where each
> chiplet is reported as a node, and they do behave like one, as each has
> its own memory connection; but the distances among them are significantly
> closer than in traditional multi-socket systems. That said, I think the
> implementation is still a bit too happy to jump the boundary. What Andrea
> is suggesting seems reasonable?

Thanks for sharing the AMD configurations. Yeah, agreed on Andrea's
suggestion.

 - Joel
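[Editor's note: the point being agreed on here — don't jump the NUMA
boundary while the node still has idle CPUs — can be sketched as a
widening-scope search. This is a toy user-space model; the LLC/node arrays
and helper names are invented for illustration and are not the kernel's
sched-domain machinery.]

```c
/* Toy widening-scope idle search: exhaust the LLC, then the NUMA node,
 * before ever crossing the node boundary. */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool idle[NR_CPUS];
/* 4 two-CPU LLCs, split across 2 NUMA nodes (chiplet-style layout). */
static int llc_of[NR_CPUS]  = { 0, 0, 1, 1, 2, 2, 3, 3 };
static int node_of[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* First idle CPU whose scope value matches @want, or -1. */
static int scan_idle(const int *scope_of, int want)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (scope_of[cpu] == want && idle[cpu])
			return cpu;
	return -1;
}

static int pick_cpu(int prev_cpu)
{
	int cpu;

	/* Same LLC as prev_cpu first... */
	cpu = scan_idle(llc_of, llc_of[prev_cpu]);
	if (cpu >= 0)
		return cpu;

	/* ...then anywhere in prev_cpu's node... */
	cpu = scan_idle(node_of, node_of[prev_cpu]);
	if (cpu >= 0)
		return cpu;

	/* ...and only then cross the NUMA boundary. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return -1;
}
```

As the thread notes, even this ordering is workload dependent — on
fast-interconnect NUMA parts, crossing early can win.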
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Libo Chen @ 2025-03-18 0:12 UTC
To: Tejun Heo, Joel Fernandes
Cc: linux-kernel, David Vernet, Andrea Righi, Changwoo Min, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On 3/17/25 10:08, Tejun Heo wrote:
> Hello, Joel.
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
>> Consider that the previous CPU is cache affine to the waker's CPU and
>> is idle. Currently, scx's default select function only selects the
>> previous CPU in this case if a WF_SYNC request is also made to wake up
>> on the waker's CPU.
>>
>> This means that, without WF_SYNC, the previous CPU being cache affine
>> to the waker and idle is not considered. This seems extreme. WF_SYNC is
>> not normally passed to the wakeup path outside of some IPC drivers, but
>> it is very possible that the task is cache hot on the previous CPU and
>> shares cache with the waker CPU. Let's avoid too many migrations and
>> select the previous CPU in such cases.
>
> Hmm.. if !WF_SYNC:
>
> 1. If SMT, and prev_cpu's core is idle, pick it. If not, try to pick an
>    idle core in widening scopes.
>
> 2. If no idle core is found, pick prev_cpu if idle. If not, search for
>    an idle CPU in widening scopes.
>
> So, it is considering prev_cpu, right? I think it's preferring idle cores
> a bit too much - it probably doesn't make sense to cross the NUMA boundary
> if there is an idle CPU in this node, at least.

Hi Tejun,

Just as Peter said, this whole wake-affinity thing is highly workload
dependent. We have seen cases where, even if there are idle cores in the
previous node, it's still beneficial to cross the NUMA boundary.
Would it make more sense to have the BPF scheduler implement/alter some of
the logic here so it can fit different workloads?

> Isn't the cpus_share_cache() code block mostly about not doing
> waker-affining if prev_cpu of the wakee is close enough and idle, so
> waker-affining is likely to be worse?
>
> Thanks.
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Tejun Heo @ 2025-03-18 0:14 UTC
To: Libo Chen
Cc: Joel Fernandes, linux-kernel, David Vernet, Andrea Righi, Changwoo Min,
    Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Valentin Schneider

On Mon, Mar 17, 2025 at 05:12:25PM -0700, Libo Chen wrote:
> Just as Peter said, this whole wake-affinity thing is highly workload
> dependent. We have seen cases where, even if there are idle cores in the
> previous node, it's still beneficial to cross the NUMA boundary. Would it
> make more sense to have the BPF scheduler implement/alter some of the
> logic here so it can fit different workloads?

Oh yeah, multiple SCX schedulers already implement their own select_cpu().
This discussion is just around the default implementation, which can be used
if a BPF scheduler doesn't want to implement its own.

Thanks.

--
tejun
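[Editor's note: the fallback arrangement Tejun describes — a
scheduler-provided select_cpu() with the default used only in its absence —
looks roughly like this in miniature. This is a plain-C stand-in; the real
mechanism is a BPF struct_ops table, and every name below is invented for
illustration.]

```c
/* Toy illustration: fall back to a default pick only when the loaded
 * scheduler does not provide its own select_cpu callback. */
#include <assert.h>
#include <stddef.h>

typedef int (*select_cpu_fn)(int prev_cpu);

/* Default pick: trivially keep prev_cpu (stand-in for the kernel's
 * default selection logic, which is far richer than this). */
static int default_select_cpu(int prev_cpu) { return prev_cpu; }

struct toy_sched_ops {
	select_cpu_fn select_cpu;	/* NULL -> fall back to the default */
};

static int pick(const struct toy_sched_ops *ops, int prev_cpu)
{
	if (ops->select_cpu)
		return ops->select_cpu(prev_cpu);
	return default_select_cpu(prev_cpu);
}

/* A scheduler that always routes wakeups to CPU 0, overriding the default. */
static int always_cpu0(int prev_cpu) { (void)prev_cpu; return 0; }
```

This is why the tuning debate in the thread only concerns schedulers that
leave the callback unimplemented.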
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Andrea Righi @ 2025-03-17 17:20 UTC
To: Joel Fernandes
Cc: linux-kernel, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Hi Joel,

On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> Consider that the previous CPU is cache affine to the waker's CPU and
> is idle. Currently, scx's default select function only selects the
> previous CPU in this case if a WF_SYNC request is also made to wake up
> on the waker's CPU.
>
> This means that, without WF_SYNC, the previous CPU being cache affine to
> the waker and idle is not considered. This seems extreme. WF_SYNC is not
> normally passed to the wakeup path outside of some IPC drivers, but it
> is very possible that the task is cache hot on the previous CPU and
> shares cache with the waker CPU. Let's avoid too many migrations and
> select the previous CPU in such cases.
>
> This change is consistent with the fair scheduler's behavior as well. In
> select_idle_sibling(), the previous CPU is selected if it is cache
> affine with the target. This is done regardless of WF_SYNC and before
> any scanning of fully idle cores is done.
>
> One difference still exists between SCX and CFS in this regard: in CFS
> we first check if the target CPU is idle before checking if the previous
> CPU is idle. However, that could be a matter of choice and, in the
> future, that behavior could also be unified.
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/ext.c | 24 +++++++++++-------------
>  1 file changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5a81d9a1e31f..3b1a489e2aaf 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3479,7 +3479,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  {
>  	const struct cpumask *llc_cpus = NULL;
>  	const struct cpumask *numa_cpus = NULL;
> -	s32 cpu;
> +	s32 cpu = smp_processor_id();
>
>  	*found = false;
>
> @@ -3507,22 +3507,20 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		llc_cpus = llc_span(prev_cpu);
>  	}
>
> +	/*
> +	 * If the waker's CPU is cache affine and prev_cpu is idle, then avoid
> +	 * a migration.
> +	 */
> +	if (cpus_share_cache(cpu, prev_cpu) &&
> +	    test_and_clear_cpu_idle(prev_cpu)) {
> +		cpu = prev_cpu;
> +		goto cpu_found;
> +	}
> +

One potential issue that I see is that, when SMT is enabled, you may end up
using prev_cpu even when its sibling is busy. Maybe we should check whether
prev_cpu is set in the SMT idle cpumask.

Also, last time I tried a similar change it regressed a lot of benchmarks.
Maybe we should repeat the tests and get some numbers.

>  	/*
>  	 * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
>  	 */
>  	if (wake_flags & SCX_WAKE_SYNC) {
> -		cpu = smp_processor_id();
> -
> -		/*
> -		 * If the waker's CPU is cache affine and prev_cpu is idle,
> -		 * then avoid a migration.
> -		 */
> -		if (cpus_share_cache(cpu, prev_cpu) &&
> -		    test_and_clear_cpu_idle(prev_cpu)) {
> -			cpu = prev_cpu;
> -			goto cpu_found;
> -		}
> -
>  		/*
>  		 * If the waker's local DSQ is empty, and the system is under
>  		 * utilized, try to wake up @p to the local DSQ of the waker.
> --
> 2.43.0

Thanks,
-Andrea
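[Editor's note: Andrea's refinement — treat prev_cpu as a short-circuit
candidate only when its whole SMT core is idle — can be sketched like this.
Toy user-space model with invented topology helpers; the real code would
consult the kernel's SMT idle cpumask rather than these arrays.]

```c
/* Only short-circuit to prev_cpu when it is cache affine to the waker
 * AND its whole SMT core is idle, so we never land next to a busy
 * sibling. */
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static bool idle[NR_CPUS];
static int llc_of[NR_CPUS];		/* which LLC each CPU belongs to */

static int sibling_of(int cpu) { return cpu ^ 1; }	/* 2 threads/core */

static bool smt_idle(int cpu)
{
	return idle[cpu] && idle[sibling_of(cpu)];
}

static bool shares_cache(int a, int b) { return llc_of[a] == llc_of[b]; }

/* Return prev_cpu if it qualifies; -1 means "keep scanning". */
static int try_prev(int waker_cpu, int prev_cpu)
{
	if (shares_cache(waker_cpu, prev_cpu) && smt_idle(prev_cpu))
		return prev_cpu;
	return -1;
}
```

Note this sketch only models the extra gating condition, not the benchmark
concerns Andrea raises, which need actual measurement.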
* Re: [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC

From: Joel Fernandes @ 2025-03-17 22:44 UTC
To: Andrea Righi
Cc: linux-kernel, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On 3/17/2025 6:20 PM, Andrea Righi wrote:
> Hi Joel,
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
>> Consider that the previous CPU is cache affine to the waker's CPU and
>> is idle. Currently, scx's default select function only selects the
>> previous CPU in this case if a WF_SYNC request is also made to wake up
>> on the waker's CPU.
>>
>> This means that, without WF_SYNC, the previous CPU being cache affine
>> to the waker and idle is not considered. This seems extreme. WF_SYNC is
>> not normally passed to the wakeup path outside of some IPC drivers, but
>> it is very possible that the task is cache hot on the previous CPU and
>> shares cache with the waker CPU. Let's avoid too many migrations and
>> select the previous CPU in such cases.
>>
>> This change is consistent with the fair scheduler's behavior as well.
>> In select_idle_sibling(), the previous CPU is selected if it is cache
>> affine with the target. This is done regardless of WF_SYNC and before
>> any scanning of fully idle cores is done.
>>
>> One difference still exists between SCX and CFS in this regard: in CFS
>> we first check if the target CPU is idle before checking if the
>> previous CPU is idle. However, that could be a matter of choice and, in
>> the future, that behavior could also be unified.
>>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>>  kernel/sched/ext.c | 24 +++++++++++-------------
>>  1 file changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 5a81d9a1e31f..3b1a489e2aaf 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -3479,7 +3479,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>>  {
>>  	const struct cpumask *llc_cpus = NULL;
>>  	const struct cpumask *numa_cpus = NULL;
>> -	s32 cpu;
>> +	s32 cpu = smp_processor_id();
>>
>>  	*found = false;
>>
>> @@ -3507,22 +3507,20 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>>  		llc_cpus = llc_span(prev_cpu);
>>  	}
>>
>> +	/*
>> +	 * If the waker's CPU is cache affine and prev_cpu is idle, then avoid
>> +	 * a migration.
>> +	 */
>> +	if (cpus_share_cache(cpu, prev_cpu) &&
>> +	    test_and_clear_cpu_idle(prev_cpu)) {
>> +		cpu = prev_cpu;
>> +		goto cpu_found;
>> +	}
>> +
>
> One potential issue that I see is that, when SMT is enabled, you may end
> up using prev_cpu even when its sibling is busy. Maybe we should check
> whether prev_cpu is set in the SMT idle cpumask.

Hmm, so you're suggesting selecting the previous CPU only if it is part of a
fully idle core *and* shares cache with the waker. That does sound
reasonable to me; I will look more into it, thanks.

 - Joel
end of thread, other threads: [~2025-03-18 22:37 UTC | newest]

Thread overview: 17+ messages
2025-03-17  8:28 [PATCH RFC] sched_ext: Choose prev_cpu if idle and cache affine without WF_SYNC Joel Fernandes
2025-03-17 17:08 ` Tejun Heo
2025-03-17 17:30 ` Andrea Righi
2025-03-17 17:44 ` Peter Zijlstra
2025-03-18  5:17 ` Andrea Righi
2025-03-17 22:11 ` Joel Fernandes
2025-03-18  5:09 ` Andrea Righi
2025-03-18 17:00 ` Joel Fernandes
2025-03-18 17:46 ` Andrea Righi
2025-03-18 22:37 ` Joel Fernandes
2025-03-17 22:07 ` Joel Fernandes
2025-03-17 22:25 ` Tejun Heo
2025-03-17 22:45 ` Joel Fernandes
2025-03-18  0:12 ` Libo Chen
2025-03-18  0:14 ` Tejun Heo
2025-03-17 17:20 ` Andrea Righi
2025-03-17 22:44 ` Joel Fernandes