* [PATCHSET v2 sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks
@ 2024-11-29 17:54 Andrea Righi
2024-11-29 17:54 ` [PATCH 1/2] nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap() Andrea Righi
2024-11-29 17:54 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
0 siblings, 2 replies; 12+ messages in thread
From: Andrea Righi @ 2024-11-29 17:54 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel
= Overview =
As discussed during the sched_ext office hours, using a global cpumask
to track the idle CPUs can be inefficient and may not scale well on
large NUMA systems.
Therefore, split the idle cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on such large systems.
Scalability issues seem to be more noticeable on Intel Sapphire Rapids
dual-socket architectures. However, benefits can also be observed on AMD
systems with a sufficient number of NUMA nodes, see for example [1].
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [2] (so that we can focus on the built-in idle selection
policy rather than on the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
          avg time | stdev
         ----------+-------
 before:  52.431s  | 2.895
 after:   50.342s  | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks achieved a
speedup of approximately 4% with this particular architecture and test
case.
I've repeated the same test on a DGX-1 (40 physical cores, Intel Xeon
E5-2698 v4 @ 2.20GHz, 2 NUMA nodes) and I didn't observe any measurable
difference.
In general, on smaller systems, I haven't noticed any measurable
regressions or improvements with the same test (parallel kernel build)
and scheduler (scx_simple).
NOTE: splitting the global cpumask into multiple cpumasks may increase
the overhead of scx_bpf_pick_idle_cpu() or ops.select_cpu() (for
schedulers relying on the built-in CPU idle selection policy) in the
presence of multiple NUMA nodes, particularly under high system load,
since we may have to access multiple cpumasks to find an idle CPU.
However, this increased overhead appears to be largely offset by the
lower overhead when updating the idle state (__scx_update_idle()) and by
the fact that CPUs are more likely to operate within their local idle
cpumask, reducing the stress on the cache coherency protocol.
= References =
[1] https://github.com/sched-ext/scx/issues/998
[2] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
ChangeLog v1 -> v2:
- renamed for_each_node_mask|state_from() -> for_each_node_mask|state_wrap()
- misc cpumask optimizations (thanks to Yury)
Andrea Righi (2):
nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
sched_ext: Introduce per-NUMA idle cpumasks
include/linux/nodemask.h | 13 ++++++
kernel/sched/ext.c | 115 ++++++++++++++++++++++++++++++++---------------
2 files changed, 92 insertions(+), 36 deletions(-)
^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
  2024-11-29 17:54 [PATCHSET v2 sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
@ 2024-11-29 17:54 ` Andrea Righi
  2024-11-29 19:27   ` Yury Norov
  2024-11-29 17:54 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
  1 sibling, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2024-11-29 17:54 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel

Introduce NUMA node iterators to support circular iteration, starting
from a specified node.

Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 include/linux/nodemask.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index b61438313a73..c99cea40dfac 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -392,6 +392,16 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
 	for ((node) = 0; (node) < 1 && !nodes_empty(mask); (node)++)
 #endif /* MAX_NUMNODES */
 
+#if MAX_NUMNODES > 1
+#define for_each_node_mask_wrap(node, nodemask, start)			\
+	for_each_set_bit_wrap((node), (nodemask)->bits, MAX_NUMNODES, (start))
+#else /* MAX_NUMNODES == 1 */
+#define for_each_node_mask_wrap(node, mask, start)			\
+	for ((node) = 0;						\
+	     (node) < 1 && !nodes_empty(mask);				\
+	     (node)++, (void)(start), (void)(cnt))
+#endif /* MAX_NUMNODES */
+
 /*
  * Bitmasks that are kept for all the nodes.
  */
@@ -441,6 +451,9 @@ static inline int num_node_state(enum node_states state)
 #define for_each_node_state(__node, __state) \
 	for_each_node_mask((__node), node_states[__state])
 
+#define for_each_node_state_wrap(__node, __state, __start) \
+	for_each_node_mask_wrap((__node), &node_states[__state], __start)
+
 #define first_online_node	first_node(node_states[N_ONLINE])
 #define first_memory_node	first_node(node_states[N_MEMORY])
 static inline unsigned int next_online_node(int nid)
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
  2024-11-29 17:54 ` [PATCH 1/2] nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap() Andrea Righi
@ 2024-11-29 19:27   ` Yury Norov
  2024-11-30 15:13     ` Andrea Righi
  0 siblings, 1 reply; 12+ messages in thread
From: Yury Norov @ 2024-11-29 19:27 UTC (permalink / raw)
To: Andrea Righi; +Cc: Tejun Heo, David Vernet, linux-kernel

On Fri, Nov 29, 2024 at 06:54:31PM +0100, Andrea Righi wrote:
> Introduce NUMA node iterators to support circular iteration, starting
> from a specified node.
> 
> Cc: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  include/linux/nodemask.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index b61438313a73..c99cea40dfac 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -392,6 +392,16 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
>  	for ((node) = 0; (node) < 1 && !nodes_empty(mask); (node)++)
>  #endif /* MAX_NUMNODES */
> 
> +#if MAX_NUMNODES > 1
> +#define for_each_node_mask_wrap(node, nodemask, start)			\
> +	for_each_set_bit_wrap((node), (nodemask)->bits, MAX_NUMNODES, (start))
> +#else /* MAX_NUMNODES == 1 */
> +#define for_each_node_mask_wrap(node, mask, start)			\

There's a very well made historical mess of how nodemasks are
implemented. Contrary to bitmaps and cpumasks, we pass nodemasks by
value, not by pointer. For example, try_to_free_low() in mm/hugetlb.c
takes a pointer, but has to 'dereference' it before passing to
for_each_node_mask():

  static void try_to_free_low(struct hstate *h, unsigned long count,
  						nodemask_t *nodes_allowed)
  {
  	for_each_node_mask(i, *nodes_allowed) {
  		...
  	}
  }

That's because all nodemask functions take an address from a variable
provided. For example the below nodes_empty() is implemented like:

  #define nodes_empty(src) __nodes_empty(&(src), MAX_NUMNODES)
  static __always_inline bool __nodes_empty(const nodemask_t *srcp, unsigned int nbits)
  {
  	return bitmap_empty(srcp->bits, nbits);
  }

It means that your 'MAX_NUMNODES > 1' version doesn't match the
existing for_each_node_mask(), i.e. doesn't pass a nodemask by value.
The opencoded 'MAX_NUMNODES == 1' version does, although.

> +	for ((node) = 0;						\
> +	     (node) < 1 && !nodes_empty(mask);				\
> +	     (node)++, (void)(start), (void)(cnt))

This cnt is a leftover from v1, I guess.

> +#endif /* MAX_NUMNODES */
> +
>  /*
>   * Bitmasks that are kept for all the nodes.
>   */
> @@ -441,6 +451,9 @@ static inline int num_node_state(enum node_states state)
>  #define for_each_node_state(__node, __state) \
>  	for_each_node_mask((__node), node_states[__state])
> 
> +#define for_each_node_state_wrap(__node, __state, __start) \
> +	for_each_node_mask_wrap((__node), &node_states[__state], __start)

Can you also add for_each_online_node_wrap() to align with the
existing for_each_online_node()?

> +
>  #define first_online_node	first_node(node_states[N_ONLINE])
>  #define first_memory_node	first_node(node_states[N_MEMORY])
>  static inline unsigned int next_online_node(int nid)
> -- 
> 2.47.1

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap()
  2024-11-29 19:27   ` Yury Norov
@ 2024-11-30 15:13     ` Andrea Righi
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2024-11-30 15:13 UTC (permalink / raw)
To: Yury Norov; +Cc: Tejun Heo, David Vernet, linux-kernel

On Fri, Nov 29, 2024 at 11:27:37AM -0800, Yury Norov wrote:
> On Fri, Nov 29, 2024 at 06:54:31PM +0100, Andrea Righi wrote:
> > Introduce NUMA node iterators to support circular iteration, starting
> > from a specified node.
> > 
> > Cc: Yury Norov <yury.norov@gmail.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  include/linux/nodemask.h | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > index b61438313a73..c99cea40dfac 100644
> > --- a/include/linux/nodemask.h
> > +++ b/include/linux/nodemask.h
> > @@ -392,6 +392,16 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
> >  	for ((node) = 0; (node) < 1 && !nodes_empty(mask); (node)++)
> >  #endif /* MAX_NUMNODES */
> > 
> > +#if MAX_NUMNODES > 1
> > +#define for_each_node_mask_wrap(node, nodemask, start)			\
> > +	for_each_set_bit_wrap((node), (nodemask)->bits, MAX_NUMNODES, (start))
> > +#else /* MAX_NUMNODES == 1 */
> > +#define for_each_node_mask_wrap(node, mask, start)			\
> 
> There's a very well made historical mess of how nodemasks are
> implemented. Contrary to bitmaps and cpumasks, we pass nodemasks by
> value, not by pointer. For example, try_to_free_low() in mm/hugetlb.c
> takes a pointer, but has to 'dereference' it before passing to
> for_each_node_mask():
> 
>   static void try_to_free_low(struct hstate *h, unsigned long count,
>   						nodemask_t *nodes_allowed)
>   {
>   	for_each_node_mask(i, *nodes_allowed) {
>   		...
>   	}
>   }
> 
> That's because all nodemask functions take an address from a variable
> provided. For example the below nodes_empty() is implemented like:
> 
>   #define nodes_empty(src) __nodes_empty(&(src), MAX_NUMNODES)
>   static __always_inline bool __nodes_empty(const nodemask_t *srcp, unsigned int nbits)
>   {
>   	return bitmap_empty(srcp->bits, nbits);
>   }
> 
> It means that your 'MAX_NUMNODES > 1' version doesn't match the
> existing for_each_node_mask(), i.e. doesn't pass a nodemask by value.
> The opencoded 'MAX_NUMNODES == 1' version does, although.

Thanks for the detailed clarification! I'll change
for_each_node_mask_wrap() to pass the nodemask by value.

> 
> > +	for ((node) = 0;						\
> > +	     (node) < 1 && !nodes_empty(mask);				\
> > +	     (node)++, (void)(start), (void)(cnt))
> 
> This cnt is a leftover from v1, I guess.

Indeed! Thanks for noticing it (my bad for not testing the build with
CONFIG_NUMA off), will fix this.

> 
> > +#endif /* MAX_NUMNODES */
> > +
> >  /*
> >   * Bitmasks that are kept for all the nodes.
> >   */
> > @@ -441,6 +451,9 @@ static inline int num_node_state(enum node_states state)
> >  #define for_each_node_state(__node, __state) \
> >  	for_each_node_mask((__node), node_states[__state])
> > 
> > +#define for_each_node_state_wrap(__node, __state, __start) \
> > +	for_each_node_mask_wrap((__node), &node_states[__state], __start)
> 
> Can you also add for_each_online_node_wrap() to align with the
> existing for_each_online_node()?

Ok.

> 
> > +
> >  #define first_online_node	first_node(node_states[N_ONLINE])
> >  #define first_memory_node	first_node(node_states[N_MEMORY])
> >  static inline unsigned int next_online_node(int nid)
> > -- 
> > 2.47.1

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 12+ messages in thread
* [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
  2024-11-29 17:54 [PATCHSET v2 sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
  2024-11-29 17:54 ` [PATCH 1/2] nodemask: Introduce for_each_node_mask_wrap/for_each_node_state_wrap() Andrea Righi
@ 2024-11-29 17:54 ` Andrea Righi
  2024-11-29 19:38   ` Yury Norov
  1 sibling, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2024-11-29 17:54 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel

Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.

Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.

The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.

[Open issue]

The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return
a single cpumask for all the CPUs, have been changed to report only the
cpumask of the current NUMA node (using the current CPU); this breaks
the old behavior, so it can potentially introduce regressions in some
scx schedulers.

An alternative approach could be to construct a global cpumask
on-the-fly, but this could add significant overhead to ops.select_cpu()
for schedulers relying on these kfunc's. Additionally, it would be less
reliable than accessing the actual cpumasks, as the copy could quickly
become out of sync and not represent the actual idle state very well.

Probably a better way to solve this issue is to introduce new kfunc's to
explicitly select specific per-NUMA cpumask and modify the scx
schedulers to transition to this new API, for example:

  const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
  const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 115 +++++++++++++++++++++++++++++++--------------
 1 file changed, 79 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 508845f0c25a..c10131171dfb 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -933,7 +933,37 @@ static struct delayed_work scx_watchdog_work;
 static struct {
 	cpumask_var_t cpu;
 	cpumask_var_t smt;
-} idle_masks CL_ALIGNED_IF_ONSTACK;
+} **idle_masks CL_ALIGNED_IF_ONSTACK;
+
+static struct cpumask *get_idle_cpumask(int cpu)
+{
+	int node = cpu_to_node(cpu);
+
+	return idle_masks[node]->cpu;
+}
+
+static struct cpumask *get_idle_smtmask(int cpu)
+{
+	int node = cpu_to_node(cpu);
+
+	return idle_masks[node]->smt;
+}
+
+static void idle_masks_init(void)
+{
+	int node;
+
+	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
+	BUG_ON(!idle_masks);
+
+	for_each_node_state(node, N_POSSIBLE) {
+		idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
+		BUG_ON(!idle_masks[node]);
+
+		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
+		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
+	}
+}
 
 #endif	/* CONFIG_SMP */
 
@@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
 	 */
 	if (sched_smt_active()) {
 		const struct cpumask *smt = cpu_smt_mask(cpu);
+		struct cpumask *idle_smt = get_idle_smtmask(cpu);
 
 		/*
 		 * If offline, @cpu is not its own sibling and
 		 * scx_pick_idle_cpu() can get caught in an infinite loop as
-		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
-		 * is eventually cleared.
+		 * @cpu is never cleared from the idle SMT mask. Ensure that
+		 * @cpu is eventually cleared.
		 */
-		if (cpumask_intersects(smt, idle_masks.smt))
-			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
-		else if (cpumask_test_cpu(cpu, idle_masks.smt))
-			__cpumask_clear_cpu(cpu, idle_masks.smt);
+		cpumask_andnot(idle_smt, idle_smt, smt);
+		__cpumask_clear_cpu(cpu, idle_smt);
 	}
 #endif
-	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+	return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
 }
 
 static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
 {
-	int cpu;
+	int start = cpu_to_node(smp_processor_id());
+	int node, cpu;
 
 retry:
 	if (sched_smt_active()) {
-		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
-		if (cpu < nr_cpu_ids)
-			goto found;
+		for_each_node_state_wrap(node, N_ONLINE, start) {
+			if (!cpumask_intersects(idle_masks[node]->smt, cpus_allowed))
+				continue;
+			cpu = cpumask_any_and_distribute(idle_masks[node]->smt, cpus_allowed);
+			if (cpu < nr_cpu_ids)
+				goto found;
+		}
 
 		if (flags & SCX_PICK_IDLE_CORE)
 			return -EBUSY;
 	}
 
-	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
-	if (cpu >= nr_cpu_ids)
-		return -EBUSY;
+	for_each_node_state_wrap(node, N_ONLINE, start) {
+		if (!cpumask_intersects(idle_masks[node]->cpu, cpus_allowed))
+			continue;
+		cpu = cpumask_any_and_distribute(idle_masks[node]->cpu, cpus_allowed);
+		if (cpu < nr_cpu_ids)
+			goto found;
+	}
+	return -EBUSY;
 
 found:
 	if (test_and_clear_cpu_idle(cpu))
@@ -3459,9 +3498,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 		 * piled up on it even if there is an idle core elsewhere on
 		 * the system.
 		 */
-		if (!cpumask_empty(idle_masks.cpu) &&
-		    !(current->flags & PF_EXITING) &&
-		    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+		if (!(current->flags & PF_EXITING) &&
+		    cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
+		    !cpumask_empty(get_idle_cpumask(cpu))) {
 			if (cpumask_test_cpu(cpu, p->cpus_ptr))
 				goto cpu_found;
 		}
@@ -3475,7 +3514,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	/*
 	 * Keep using @prev_cpu if it's part of a fully idle core.
 	 */
-	if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+	if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
 	    test_and_clear_cpu_idle(prev_cpu)) {
 		cpu = prev_cpu;
 		goto cpu_found;
@@ -3618,12 +3657,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 
 static void reset_idle_masks(void)
 {
+	int node;
+
 	/*
 	 * Consider all online cpus idle. Should converge to the actual state
 	 * quickly.
 	 */
-	cpumask_copy(idle_masks.cpu, cpu_online_mask);
-	cpumask_copy(idle_masks.smt, cpu_online_mask);
+	for_each_node_state(node, N_POSSIBLE) {
+		const struct cpumask *node_mask = cpumask_of_node(node);
+
+		cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
+		cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu);
+	}
 }
 
 void __scx_update_idle(struct rq *rq, bool idle)
@@ -3636,14 +3681,13 @@ void __scx_update_idle(struct rq *rq, bool idle)
 		return;
 	}
 
-	if (idle)
-		cpumask_set_cpu(cpu, idle_masks.cpu);
-	else
-		cpumask_clear_cpu(cpu, idle_masks.cpu);
+	assign_cpu(cpu, get_idle_cpumask(cpu), idle);
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active()) {
 		const struct cpumask *smt = cpu_smt_mask(cpu);
+		struct cpumask *idle_cpu = get_idle_cpumask(cpu);
+		struct cpumask *idle_smt = get_idle_smtmask(cpu);
 
 		if (idle) {
 			/*
@@ -3651,12 +3695,12 @@ void __scx_update_idle(struct rq *rq, bool idle)
 			 * it's only for optimization and self-correcting.
 			 */
 			for_each_cpu(cpu, smt) {
-				if (!cpumask_test_cpu(cpu, idle_masks.cpu))
+				if (!cpumask_test_cpu(cpu, idle_cpu))
 					return;
 			}
-			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+			cpumask_or(idle_smt, idle_smt, smt);
 		} else {
-			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+			cpumask_andnot(idle_smt, idle_smt, smt);
 		}
 	}
 #endif
@@ -6232,8 +6276,7 @@ void __init init_sched_ext_class(void)
 
 	BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
 #ifdef CONFIG_SMP
-	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
-	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+	idle_masks_init();
 #endif
 	scx_kick_cpus_pnt_seqs =
 		__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
@@ -7379,7 +7422,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
 
 /**
  * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask.
+ * per-CPU cpumask of the current NUMA node.
  *
 * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
 */
@@ -7391,7 +7434,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
 	}
 
 #ifdef CONFIG_SMP
-	return idle_masks.cpu;
+	return get_idle_cpumask(smp_processor_id());
 #else
 	return cpu_none_mask;
 #endif
@@ -7399,8 +7442,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
 
 /**
  * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask. Can be used to determine if an entire physical
- * core is free.
+ * per-physical-core cpumask of the current NUMA node. Can be used to determine
+ * if an entire physical core is free.
 *
 * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
 */
@@ -7413,9 +7456,9 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
 
 #ifdef CONFIG_SMP
 	if (sched_smt_active())
-		return idle_masks.smt;
+		return get_idle_smtmask(smp_processor_id());
 	else
-		return idle_masks.cpu;
+		return get_idle_cpumask(smp_processor_id());
 #else
 	return cpu_none_mask;
 #endif
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
  2024-11-29 17:54 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
@ 2024-11-29 19:38   ` Yury Norov
  2024-11-30 15:24     ` Andrea Righi
  2024-12-03 14:16     ` Andrea Righi
  0 siblings, 2 replies; 12+ messages in thread
From: Yury Norov @ 2024-11-29 19:38 UTC (permalink / raw)
To: Andrea Righi; +Cc: Tejun Heo, David Vernet, linux-kernel

On Fri, Nov 29, 2024 at 06:54:32PM +0100, Andrea Righi wrote:
> Using a single global idle mask can lead to inefficiencies and a lot of
> stress on the cache coherency protocol on large systems with multiple
> NUMA nodes, since all the CPUs can create a really intense read/write
> activity on the single global cpumask.
> 
> Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> to improve scalability and performance on large systems.
> 
> The concept is that each cpumask will track only the idle CPUs within
> its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> In this way concurrent access to the idle cpumask will be restricted
> within each NUMA node.
> 
> [Open issue]
> 
> The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return
> a single cpumask for all the CPUs, have been changed to report only the
> cpumask of the current NUMA node (using the current CPU); this breaks
> the old behavior, so it can potentially introduce regressions in some
> scx schedulers.
> 
> An alternative approach could be to construct a global cpumask
> on-the-fly, but this could add significant overhead to ops.select_cpu()
> for schedulers relying on these kfunc's. Additionally, it would be less
> reliable than accessing the actual cpumasks, as the copy could quickly
> become out of sync and not represent the actual idle state very well.
> 
> Probably a better way to solve this issue is to introduce new kfunc's to
> explicitly select specific per-NUMA cpumask and modify the scx
> schedulers to transition to this new API, for example:
> 
>   const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
>   const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/ext.c | 115 +++++++++++++++++++++++++++++++--------------
>  1 file changed, 79 insertions(+), 36 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 508845f0c25a..c10131171dfb 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -933,7 +933,37 @@ static struct delayed_work scx_watchdog_work;
>  static struct {
>  	cpumask_var_t cpu;
>  	cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +} **idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +static struct cpumask *get_idle_cpumask(int cpu)
> +{
> +	int node = cpu_to_node(cpu);
> +
> +	return idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_smtmask(int cpu)
> +{
> +	int node = cpu_to_node(cpu);
> +
> +	return idle_masks[node]->smt;
> +}
> +
> +static void idle_masks_init(void)
> +{
> +	int node;
> +
> +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> +	BUG_ON(!idle_masks);
> +
> +	for_each_node_state(node, N_POSSIBLE) {
> +		idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> +		BUG_ON(!idle_masks[node]);
> +
> +		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> +		BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> +	}
> +}
> 
>  #endif	/* CONFIG_SMP */
> 
> @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
>  	 */
>  	if (sched_smt_active()) {
>  		const struct cpumask *smt = cpu_smt_mask(cpu);
> +		struct cpumask *idle_smt = get_idle_smtmask(cpu);
> 
>  		/*
>  		 * If offline, @cpu is not its own sibling and
>  		 * scx_pick_idle_cpu() can get caught in an infinite loop as
> -		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> -		 * is eventually cleared.
> +		 * @cpu is never cleared from the idle SMT mask. Ensure that
> +		 * @cpu is eventually cleared.
>  		 */
> -		if (cpumask_intersects(smt, idle_masks.smt))
> -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> -		else if (cpumask_test_cpu(cpu, idle_masks.smt))
> -			__cpumask_clear_cpu(cpu, idle_masks.smt);
> +		cpumask_andnot(idle_smt, idle_smt, smt);
> +		__cpumask_clear_cpu(cpu, idle_smt);
>  	}
>  #endif
> -	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> +	return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
>  }
> 
>  static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
>  {
> -	int cpu;
> +	int start = cpu_to_node(smp_processor_id());
> +	int node, cpu;
> 
>  retry:
>  	if (sched_smt_active()) {
> -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> -		if (cpu < nr_cpu_ids)
> -			goto found;
> +		for_each_node_state_wrap(node, N_ONLINE, start) {
> +			if (!cpumask_intersects(idle_masks[node]->smt, cpus_allowed))
> +				continue;
> +			cpu = cpumask_any_and_distribute(idle_masks[node]->smt, cpus_allowed);
> +			if (cpu < nr_cpu_ids)
> +				goto found;
> +		}

Here the same consideration is applicable as for v1:
if idle_masks[node]->smt and cpus_allowed are disjoint, the
cpumask_any_and_distribute() will return >= nr_cpu_ids, and we'll go to
the next iteration. No need to call cpumask_intersects().

> 
>  		if (flags & SCX_PICK_IDLE_CORE)
>  			return -EBUSY;
>  	}
> 
> -	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> -	if (cpu >= nr_cpu_ids)
> -		return -EBUSY;
> +	for_each_node_state_wrap(node, N_ONLINE, start) {
> +		if (!cpumask_intersects(idle_masks[node]->cpu, cpus_allowed))
> +			continue;
> +		cpu = cpumask_any_and_distribute(idle_masks[node]->cpu, cpus_allowed);
> +		if (cpu < nr_cpu_ids)
> +			goto found;
> +	}
> +	return -EBUSY;
> 
>  found:
>  	if (test_and_clear_cpu_idle(cpu))
> @@ -3459,9 +3498,9 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  		 * piled up on it even if there is an idle core elsewhere on
>  		 * the system.
>  		 */
> -		if (!cpumask_empty(idle_masks.cpu) &&
> -		    !(current->flags & PF_EXITING) &&
> -		    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> +		if (!(current->flags & PF_EXITING) &&
> +		    cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
> +		    !cpumask_empty(get_idle_cpumask(cpu))) {
>  			if (cpumask_test_cpu(cpu, p->cpus_ptr))
>  				goto cpu_found;
>  		}
> @@ -3475,7 +3514,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  	/*
>  	 * Keep using @prev_cpu if it's part of a fully idle core.
>  	 */
> -	if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> +	if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
>  	    test_and_clear_cpu_idle(prev_cpu)) {
>  		cpu = prev_cpu;
>  		goto cpu_found;
> @@ -3618,12 +3657,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> 
>  static void reset_idle_masks(void)
>  {
> +	int node;
> +
>  	/*
>  	 * Consider all online cpus idle. Should converge to the actual state
>  	 * quickly.
>  	 */
> -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> +	for_each_node_state(node, N_POSSIBLE) {
> +		const struct cpumask *node_mask = cpumask_of_node(node);
> +
> +		cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
> +		cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu);
> +	}
>  }
> 
>  void __scx_update_idle(struct rq *rq, bool idle)
> @@ -3636,14 +3681,13 @@ void __scx_update_idle(struct rq *rq, bool idle)
>  		return;
>  	}
> 
> -	if (idle)
> -		cpumask_set_cpu(cpu, idle_masks.cpu);
> -	else
> -		cpumask_clear_cpu(cpu, idle_masks.cpu);
> +	assign_cpu(cpu, get_idle_cpumask(cpu), idle);
> 
>  #ifdef CONFIG_SCHED_SMT
>  	if (sched_smt_active()) {
>  		const struct cpumask *smt = cpu_smt_mask(cpu);
> +		struct cpumask *idle_cpu = get_idle_cpumask(cpu);
> +		struct cpumask *idle_smt = get_idle_smtmask(cpu);
> 
>  		if (idle) {
>  			/*
> @@ -3651,12 +3695,12 @@ void __scx_update_idle(struct rq *rq, bool idle)
>  			 * it's only for optimization and self-correcting.
>  			 */
>  			for_each_cpu(cpu, smt) {
> -				if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> +				if (!cpumask_test_cpu(cpu, idle_cpu))
>  					return;
>  			}
> -			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> +			cpumask_or(idle_smt, idle_smt, smt);
>  		} else {
> -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> +			cpumask_andnot(idle_smt, idle_smt, smt);
>  		}
>  	}
>  #endif
> @@ -6232,8 +6276,7 @@ void __init init_sched_ext_class(void)
> 
>  	BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
>  #ifdef CONFIG_SMP
> -	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> -	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> +	idle_masks_init();
>  #endif
>  	scx_kick_cpus_pnt_seqs =
>  		__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> @@ -7379,7 +7422,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
> 
>  /**
>   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.
>   *
>   * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
>   */
> @@ -7391,7 +7434,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>  	}
> 
>  #ifdef CONFIG_SMP
> -	return idle_masks.cpu;
> +	return get_idle_cpumask(smp_processor_id());
>  #else
>  	return cpu_none_mask;
>  #endif
> @@ -7399,8 +7442,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> 
>  /**
>   * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask. Can be used to determine if an entire physical
> - * core is free.
> + * per-physical-core cpumask of the current NUMA node. Can be used to determine
> + * if an entire physical core is free.
>   *
>   * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
>   */
> @@ -7413,9 +7456,9 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
> 
>  #ifdef CONFIG_SMP
>  	if (sched_smt_active())
> -		return idle_masks.smt;
> +		return get_idle_smtmask(smp_processor_id());
>  	else
> -		return idle_masks.cpu;
> +		return get_idle_cpumask(smp_processor_id());
>  #else
>  	return cpu_none_mask;
>  #endif
> -- 
> 2.47.1

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
  2024-11-29 19:38   ` Yury Norov
@ 2024-11-30 15:24     ` Andrea Righi
  2024-12-03  7:38       ` Andrea Righi
  0 siblings, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2024-11-30 15:24 UTC (permalink / raw)
To: Yury Norov; +Cc: Tejun Heo, David Vernet, linux-kernel

On Fri, Nov 29, 2024 at 11:38:53AM -0800, Yury Norov wrote:
...
> >  static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> >  {
> > -	int cpu;
> > +	int start = cpu_to_node(smp_processor_id());
> > +	int node, cpu;
> > 
> >  retry:
> >  	if (sched_smt_active()) {
> > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > -		if (cpu < nr_cpu_ids)
> > -			goto found;
> > +		for_each_node_state_wrap(node, N_ONLINE, start) {
> > +			if (!cpumask_intersects(idle_masks[node]->smt, cpus_allowed))
> > +				continue;
> > +			cpu = cpumask_any_and_distribute(idle_masks[node]->smt, cpus_allowed);
> > +			if (cpu < nr_cpu_ids)
> > +				goto found;
> > +		}
> 
> Here the same consideration is applicable as for v1:
> if idle_masks[node]->smt and cpus_allowed are disjoint, the
> cpumask_any_and_distribute() will return >= nr_cpu_ids, and we'll go to
> the next iteration. No need to call cpumask_intersects().

For some reason, removing cpumask_intersects() here seems to introduce a
slight performance drop.

My initial assumption was that the performance drop occurs because
cpus_allowed often doesn't intersect with idle_masks[node]->smt (since
cpus_allowed usually doesn't span multiple NUMA nodes), so running
cpumask_any_and_distribute() on N cpumasks (worst case) is slower than
first checking for an intersection.

However, I will rerun the test to ensure that the regression is
consistent and not just a measurement error.

-Andrea

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
  2024-11-30 15:24 ` Andrea Righi
@ 2024-12-03  7:38   ` Andrea Righi
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2024-12-03 7:38 UTC (permalink / raw)
To: Yury Norov; +Cc: Tejun Heo, David Vernet, linux-kernel

On Sat, Nov 30, 2024 at 04:24:36PM +0100, Andrea Righi wrote:
> On Fri, Nov 29, 2024 at 11:38:53AM -0800, Yury Norov wrote:
> ...
> > >  static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > >  {
> > > -	int cpu;
> > > +	int start = cpu_to_node(smp_processor_id());
> > > +	int node, cpu;
> > >
> > >  retry:
> > >  	if (sched_smt_active()) {
> > > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > > -		if (cpu < nr_cpu_ids)
> > > -			goto found;
> > > +		for_each_node_state_wrap(node, N_ONLINE, start) {
> > > +			if (!cpumask_intersects(idle_masks[node]->smt, cpus_allowed))
> > > +				continue;
> > > +			cpu = cpumask_any_and_distribute(idle_masks[node]->smt, cpus_allowed);
> > > +			if (cpu < nr_cpu_ids)
> > > +				goto found;
> > > +		}
> >
> > Here the same consideration is applicable as for v1:
> > if idle_masks[node]->smt and cpus_allowed are disjoint, the
> > cpumask_any_and_distribute() will return >= nr_cpu_ids, and we'll go to
> > the next iteration. No need to call cpumask_intersects().
>
> For some reason, removing cpumask_intersects() here seems to introduce a
> slight performance drop.
>
> My initial assumption was that the performance drop occurs because
> cpus_allowed often doesn't intersect with idle_masks[node]->smt (since
> cpus_allowed usually doesn't span multiple NUMA nodes), so running
> cpumask_any_and_distribute() on N cpumasks (worst case) is slower than
> first checking for an intersection.
>
> However, I will rerun the test to ensure that the regression is
> consistent and not just a measurement error.
I did more testing and the slight performance drop is not consistent;
therefore, I believe we can attribute it to measurement error. I'll send
a v3 that removes cpumask_intersects() and includes some minor code
refactoring.

-Andrea
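With the pre-check gone in v3, scx_pick_idle_cpu() just probes each node's mask in wrap order, local node first. The traversal order that for_each_node_state_wrap() provides can be sketched as follows (toy userspace model of the iteration order, not the kernel macro):

```c
#include <assert.h>

#define NR_NODES 4  /* made-up node count */

/* i-th node visited when iterating over all nodes starting from
 * 'start' and wrapping around -- so the caller's own node is always
 * tried first, and every node is visited exactly once. */
static int nth_visited_node(int start, int i)
{
	return (start + i) % NR_NODES;
}
```

Starting the walk at the local node is what keeps the common case (an idle CPU nearby) cheap even though the search can now span all online nodes.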
* Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
  2024-11-29 19:38 ` Yury Norov
  2024-11-30 15:24 ` Andrea Righi
@ 2024-12-03 14:16   ` Andrea Righi
  1 sibling, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2024-12-03 14:16 UTC (permalink / raw)
To: Yury Norov; +Cc: Tejun Heo, David Vernet, linux-kernel

Hi Yury,

On Fri, Nov 29, 2024 at 11:38:53AM -0800, Yury Norov wrote:
...
> > @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
> >  	 */
> >  	if (sched_smt_active()) {
> >  		const struct cpumask *smt = cpu_smt_mask(cpu);
> > +		struct cpumask *idle_smt = get_idle_smtmask(cpu);
> >
> >  		/*
> >  		 * If offline, @cpu is not its own sibling and
> >  		 * scx_pick_idle_cpu() can get caught in an infinite loop as
> > -		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> > -		 * is eventually cleared.
> > +		 * @cpu is never cleared from the idle SMT mask. Ensure that
> > +		 * @cpu is eventually cleared.
> >  		 */
> > -		if (cpumask_intersects(smt, idle_masks.smt))
> > -			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> > -		else if (cpumask_test_cpu(cpu, idle_masks.smt))
> > -			__cpumask_clear_cpu(cpu, idle_masks.smt);
> > +		cpumask_andnot(idle_smt, idle_smt, smt);
> > +		__cpumask_clear_cpu(cpu, idle_smt);
> > 	}

these cpumask_intersects() and cpumask_test_cpu() seem to help instead.

I think the reason is that they can help reduce some memory writes and
therefore alleviate cache coherence pressure. So, even though we could
logically remove them, they seem to offer practical benefits.

I'll re-introduce them and add a comment to clarify this in the next
patch set version.

-Andrea
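The effect described here — a read-only test leaves the cache line shareable, while an unconditional read-modify-write dirties it even when no bit changes — can be illustrated with C11 atomics (a single 64-bit word stands in for one cpumask cache line; this is a semantic sketch of the two code shapes, not a performance proof):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared idle mask: CPUs 0-3 idle initially. */
static _Atomic uint64_t idle_smt = 0x0f;

/* Test-first variant, mirroring the cpumask_intersects()/
 * cpumask_test_cpu() guards kept in the patch: a plain load on the
 * disjoint path, an RMW only when something actually changes.
 * Returns 1 if a write was issued, 0 if the read-only path sufficed. */
static int clear_if_set(uint64_t bits)
{
	if (!(atomic_load(&idle_smt) & bits))
		return 0;                       /* read-only fast path */
	atomic_fetch_and(&idle_smt, ~bits);     /* RMW only when needed */
	return 1;
}

/* Unconditional variant: always an atomic RMW, so the line is dirtied
 * (and invalidated on other CPUs) even when the bits were already
 * clear.  Returns the new mask value. */
static uint64_t clear_always(uint64_t bits)
{
	return atomic_fetch_and(&idle_smt, ~bits) & ~bits;
}
```

Both variants leave the mask in the same state; the difference is only in how often a write — and therefore a cache-line invalidation — is issued, which is the practical benefit being described.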
* [PATCHSET sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks
@ 2024-11-26 9:56 Andrea Righi
2024-11-26 9:56 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
0 siblings, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2024-11-26 9:56 UTC (permalink / raw)
To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel
As discussed during the sched_ext office hours, using a global cpumask
to track idle CPUs can be inefficient and may not scale well on
large NUMA systems.
Therefore, split the idle cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on such large systems.
----------------------------------------------------------------
Andrea Righi (2):
nodemask: Introduce for_each_node_mask_from/for_each_node_state_from()
sched_ext: Introduce per-NUMA idle cpumasks
 include/linux/nodemask.h |  18 ++++++++
 kernel/sched/ext.c       | 110 +++++++++++++++++++++++++++++++++--------------
 2 files changed, 96 insertions(+), 32 deletions(-)
^ permalink raw reply [flat|nested] 12+ messages in thread* [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks 2024-11-26 9:56 [PATCHSET sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi @ 2024-11-26 9:56 ` Andrea Righi 2024-11-27 2:35 ` Yury Norov 0 siblings, 1 reply; 12+ messages in thread From: Andrea Righi @ 2024-11-26 9:56 UTC (permalink / raw) To: Tejun Heo, David Vernet; +Cc: Yury Norov, linux-kernel Using a single global idle mask can lead to inefficiencies and a lot of stress on the cache coherency protocol on large systems with multiple NUMA nodes, since all the CPUs can create a really intense read/write activity on the single global cpumask. Therefore, split the global cpumask into multiple per-NUMA node cpumasks to improve scalability and performance on large systems. The concept is that each cpumask will track only the idle CPUs within its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. In this way concurrent access to the idle cpumask will be restricted within each NUMA node. [Open issue] The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return a single cpumask for all the CPUs, have been changed to report only the cpumask of the current NUMA node (using the current CPU); this breaks the old behavior, so it can potentially introduce regressions in some scx schedulers. An alternative approach could be to construct a global cpumask on-the-fly, but this could add significant overhead to ops.select_cpu() for schedulers relying on these kfunc's. Additionally, it would be less reliable than accessing the actual cpumasks, as the copy could quickly become out of sync and not represent the actual idle state very well. 
Probably a better way to solve this issue is to introduce new kfunc's to explicitly select specific per-NUMA cpumask and modify the scx schedulers to transition to this new API, for example: const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node) const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node) Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 110 ++++++++++++++++++++++++++++++++------------- 1 file changed, 78 insertions(+), 32 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 3c4a94e4258f..ea2fae2db10e 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -933,7 +933,37 @@ static struct delayed_work scx_watchdog_work; static struct { cpumask_var_t cpu; cpumask_var_t smt; -} idle_masks CL_ALIGNED_IF_ONSTACK; +} **idle_masks CL_ALIGNED_IF_ONSTACK; + +static struct cpumask *get_idle_cpumask(int cpu) +{ + int node = cpu_to_node(cpu); + + return idle_masks[node]->cpu; +} + +static struct cpumask *get_idle_smtmask(int cpu) +{ + int node = cpu_to_node(cpu); + + return idle_masks[node]->smt; +} + +static void idle_masks_init(void) +{ + int node; + + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL); + BUG_ON(!idle_masks); + + for_each_node_state(node, N_POSSIBLE) { + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node); + BUG_ON(!idle_masks[node]); + + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node)); + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node)); + } +} #endif /* CONFIG_SMP */ @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu) */ if (sched_smt_active()) { const struct cpumask *smt = cpu_smt_mask(cpu); + struct cpumask *idle_smt = get_idle_smtmask(cpu); /* * If offline, @cpu is not its own sibling and * scx_pick_idle_cpu() can get caught in an infinite loop as - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu - * is eventually cleared. 
+ * @cpu is never cleared from the idle SMT mask. Ensure that + * @cpu is eventually cleared. */ - if (cpumask_intersects(smt, idle_masks.smt)) - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); - else if (cpumask_test_cpu(cpu, idle_masks.smt)) - __cpumask_clear_cpu(cpu, idle_masks.smt); + if (cpumask_intersects(smt, idle_smt)) + cpumask_andnot(idle_smt, idle_smt, smt); + else if (cpumask_test_cpu(cpu, idle_smt)) + __cpumask_clear_cpu(cpu, idle_smt); } #endif - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu)); } static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { - int cpu; + int start = cpu_to_node(smp_processor_id()); + int node, cnt, cpu; retry: if (sched_smt_active()) { - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); - if (cpu < nr_cpu_ids) - goto found; + for_each_node_state_from(node, N_ONLINE, start, cnt) { + cpu = cpumask_any_and_distribute(idle_masks[node]->smt, + cpus_allowed); + if (cpu < nr_cpu_ids) + goto found; + } if (flags & SCX_PICK_IDLE_CORE) return -EBUSY; } - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); - if (cpu >= nr_cpu_ids) - return -EBUSY; + for_each_node_state_from(node, N_ONLINE, start, cnt) { + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu, + cpus_allowed); + if (cpu < nr_cpu_ids) + goto found; + } + return -EBUSY; found: if (test_and_clear_cpu_idle(cpu)) @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, * piled up on it even if there is an idle core elsewhere on * the system. */ - if (!cpumask_empty(idle_masks.cpu) && + if (!cpumask_empty(get_idle_cpumask(cpu)) && !(current->flags & PF_EXITING) && cpu_rq(cpu)->scx.local_dsq.nr == 0) { if (cpumask_test_cpu(cpu, p->cpus_ptr)) @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, /* * Keep using @prev_cpu if it's part of a fully idle core. 
*/ - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) && test_and_clear_cpu_idle(prev_cpu)) { cpu = prev_cpu; goto cpu_found; @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p, static void reset_idle_masks(void) { + int node; + /* * Consider all online cpus idle. Should converge to the actual state * quickly. */ - cpumask_copy(idle_masks.cpu, cpu_online_mask); - cpumask_copy(idle_masks.smt, cpu_online_mask); + for_each_node_state(node, N_POSSIBLE) { + const struct cpumask *node_mask = cpumask_of_node(node); + + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask); + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask); + } } void __scx_update_idle(struct rq *rq, bool idle) @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle) } if (idle) - cpumask_set_cpu(cpu, idle_masks.cpu); + cpumask_set_cpu(cpu, get_idle_cpumask(cpu)); else - cpumask_clear_cpu(cpu, idle_masks.cpu); + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu)); #ifdef CONFIG_SCHED_SMT if (sched_smt_active()) { const struct cpumask *smt = cpu_smt_mask(cpu); + struct cpumask *idle_cpu = get_idle_cpumask(cpu); + struct cpumask *idle_smt = get_idle_smtmask(cpu); if (idle) { /* @@ -3593,12 +3640,12 @@ void __scx_update_idle(struct rq *rq, bool idle) * it's only for optimization and self-correcting. 
*/ for_each_cpu(cpu, smt) { - if (!cpumask_test_cpu(cpu, idle_masks.cpu)) + if (!cpumask_test_cpu(cpu, idle_cpu)) return; } - cpumask_or(idle_masks.smt, idle_masks.smt, smt); + cpumask_or(idle_smt, idle_smt, smt); } else { - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); + cpumask_andnot(idle_smt, idle_smt, smt); } } #endif @@ -6174,8 +6221,7 @@ void __init init_sched_ext_class(void) BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params)); #ifdef CONFIG_SMP - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); + idle_masks_init(); #endif scx_kick_cpus_pnt_seqs = __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids, @@ -7321,7 +7367,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) /** * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking - * per-CPU cpumask. + * per-CPU cpumask of the current NUMA node. * * Returns NULL if idle tracking is not enabled, or running on a UP kernel. */ @@ -7333,7 +7379,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) } #ifdef CONFIG_SMP - return idle_masks.cpu; + return get_idle_cpumask(smp_processor_id()); #else return cpu_none_mask; #endif @@ -7341,8 +7387,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) /** * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, - * per-physical-core cpumask. Can be used to determine if an entire physical - * core is free. + * per-physical-core cpumask of the current NUMA node. Can be used to determine + * if an entire physical core is free. * * Returns NULL if idle tracking is not enabled, or running on a UP kernel. 
*/ @@ -7355,9 +7401,9 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) #ifdef CONFIG_SMP if (sched_smt_active()) - return idle_masks.smt; + return get_idle_smtmask(smp_processor_id()); else - return idle_masks.cpu; + return get_idle_cpumask(smp_processor_id()); #else return cpu_none_mask; #endif -- 2.47.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
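The per-node allocation and reset logic in the patch above can be modelled in userspace C (calloc() replaces kzalloc_node(), a u64 replaces cpumask_var_t, and the two-node topology is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_NODES 2
#define CPUS_PER_NODE 4  /* toy topology: CPUs 0-3 on node 0, 4-7 on node 1 */

struct toy_idle_masks {
	uint64_t cpu;
	uint64_t smt;
};

static struct toy_idle_masks *idle_masks[NR_NODES];

/* CPUs belonging to a node, analogous to cpumask_of_node(). */
static uint64_t node_cpumask(int node)
{
	return 0xfULL << (node * CPUS_PER_NODE);
}

/* Mirrors idle_masks_init(): one mask pair per node.  The kernel also
 * places each allocation on its own node (kzalloc_node/
 * alloc_cpumask_var_node), which has no userspace equivalent here. */
static void idle_masks_init(void)
{
	for (int node = 0; node < NR_NODES; node++) {
		idle_masks[node] = calloc(1, sizeof(*idle_masks[node]));
		assert(idle_masks[node]);  /* BUG_ON() in the kernel */
	}
}

/* Mirrors reset_idle_masks(): each node starts with only its own
 * online CPUs marked idle, so no mask ever refers to remote CPUs. */
static void reset_idle_masks(uint64_t online)
{
	for (int node = 0; node < NR_NODES; node++) {
		idle_masks[node]->cpu = online & node_cpumask(node);
		idle_masks[node]->smt = idle_masks[node]->cpu;
	}
}
```

Intersecting with the node's own cpumask at reset time is what guarantees the "treat remote CPUs as busy" invariant that the rest of the patch relies on.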
* Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks 2024-11-26 9:56 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi @ 2024-11-27 2:35 ` Yury Norov 2024-11-27 7:41 ` Andrea Righi 0 siblings, 1 reply; 12+ messages in thread From: Yury Norov @ 2024-11-27 2:35 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, linux-kernel On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote: > Using a single global idle mask can lead to inefficiencies and a lot of > stress on the cache coherency protocol on large systems with multiple > NUMA nodes, since all the CPUs can create a really intense read/write > activity on the single global cpumask. > > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > to improve scalability and performance on large systems. > > The concept is that each cpumask will track only the idle CPUs within > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > In this way concurrent access to the idle cpumask will be restricted > within each NUMA node. This all is definitely a complication of the code. Have you any numbers to justify it? > [Open issue] > > The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return > a single cpumask for all the CPUs, have been changed to report only the > cpumask of the current NUMA node (using the current CPU); this breaks > the old behavior, so it can potentially introduce regressions in some > scx schedulers. > > An alternative approach could be to construct a global cpumask > on-the-fly, but this could add significant overhead to ops.select_cpu() > for schedulers relying on these kfunc's. Additionally, it would be less > reliable than accessing the actual cpumasks, as the copy could quickly > become out of sync and not represent the actual idle state very well. 
> > Probably a better way to solve this issue is to introduce new kfunc's to > explicitly select specific per-NUMA cpumask and modify the scx > schedulers to transition to this new API, for example: > > const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node) > const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node) > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext.c | 110 ++++++++++++++++++++++++++++++++------------- > 1 file changed, 78 insertions(+), 32 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 3c4a94e4258f..ea2fae2db10e 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -933,7 +933,37 @@ static struct delayed_work scx_watchdog_work; > static struct { > cpumask_var_t cpu; > cpumask_var_t smt; > -} idle_masks CL_ALIGNED_IF_ONSTACK; > +} **idle_masks CL_ALIGNED_IF_ONSTACK; > + > +static struct cpumask *get_idle_cpumask(int cpu) > +{ > + int node = cpu_to_node(cpu); > + > + return idle_masks[node]->cpu; > +} > + > +static struct cpumask *get_idle_smtmask(int cpu) > +{ > + int node = cpu_to_node(cpu); > + > + return idle_masks[node]->smt; > +} > + > +static void idle_masks_init(void) > +{ > + int node; > + > + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL); > + BUG_ON(!idle_masks); > + > + for_each_node_state(node, N_POSSIBLE) { > + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node); > + BUG_ON(!idle_masks[node]); > + > + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node)); > + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node)); > + } > +} > > #endif /* CONFIG_SMP */ > > @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu) > */ > if (sched_smt_active()) { > const struct cpumask *smt = cpu_smt_mask(cpu); > + struct cpumask *idle_smt = get_idle_smtmask(cpu); > > /* > * If offline, @cpu is not its own sibling and > * scx_pick_idle_cpu() can get caught in an infinite loop 
as > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > - * is eventually cleared. > + * @cpu is never cleared from the idle SMT mask. Ensure that > + * @cpu is eventually cleared. > */ > - if (cpumask_intersects(smt, idle_masks.smt)) > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > - __cpumask_clear_cpu(cpu, idle_masks.smt); > + if (cpumask_intersects(smt, idle_smt)) > + cpumask_andnot(idle_smt, idle_smt, smt); cpumask_andnot() is a no-op if masks don't intersect. cpumask_intersects() is O(N), and you can safely drop it and save some cycles. > + else if (cpumask_test_cpu(cpu, idle_smt)) > + __cpumask_clear_cpu(cpu, idle_smt); Same here. If the CPU is clear, __cpumask_clear_cpu() is a no-op. I feel like you can just clear all that CPUs unconditionally. In the worst case, you'll clear them twice, which is harmless. Or I misunderstand something? > } > #endif > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu)); > } > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > { > - int cpu; > + int start = cpu_to_node(smp_processor_id()); > + int node, cnt, cpu; > > retry: > if (sched_smt_active()) { > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > - if (cpu < nr_cpu_ids) > - goto found; > + for_each_node_state_from(node, N_ONLINE, start, cnt) { > + cpu = cpumask_any_and_distribute(idle_masks[node]->smt, > + cpus_allowed); Nit: no need to break the line here. 
> + if (cpu < nr_cpu_ids) > + goto found; > + } > > if (flags & SCX_PICK_IDLE_CORE) > return -EBUSY; > } > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > - if (cpu >= nr_cpu_ids) > - return -EBUSY; > + for_each_node_state_from(node, N_ONLINE, start, cnt) { > + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu, > + cpus_allowed); > + if (cpu < nr_cpu_ids) > + goto found; > + } > + return -EBUSY; > > found: > if (test_and_clear_cpu_idle(cpu)) > @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > * piled up on it even if there is an idle core elsewhere on > * the system. > */ > - if (!cpumask_empty(idle_masks.cpu) && > + if (!cpumask_empty(get_idle_cpumask(cpu)) && > !(current->flags & PF_EXITING) && > cpu_rq(cpu)->scx.local_dsq.nr == 0) { > if (cpumask_test_cpu(cpu, p->cpus_ptr)) cpumask_empty() is O(N), and the other checks are O(1). If you reorder them such that cpumask_empty() would be the last check, you can increase probability that you will not check it at all. > @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > /* > * Keep using @prev_cpu if it's part of a fully idle core. > */ > - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && > + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) && > test_and_clear_cpu_idle(prev_cpu)) { > cpu = prev_cpu; > goto cpu_found; > @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p, > > static void reset_idle_masks(void) > { > + int node; > + > /* > * Consider all online cpus idle. Should converge to the actual state > * quickly. 
> */ > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > - cpumask_copy(idle_masks.smt, cpu_online_mask); > + for_each_node_state(node, N_POSSIBLE) { > + const struct cpumask *node_mask = cpumask_of_node(node); > + > + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask); > + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask); cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu) > + } > } > > void __scx_update_idle(struct rq *rq, bool idle) > @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle) > } > > if (idle) > - cpumask_set_cpu(cpu, idle_masks.cpu); > + cpumask_set_cpu(cpu, get_idle_cpumask(cpu)); > else > - cpumask_clear_cpu(cpu, idle_masks.cpu); > + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu)); assign_cpu(cpu, get_idle_cpumask(cpu), idle) Thanks, Yury > > #ifdef CONFIG_SCHED_SMT > if (sched_smt_active()) { > const struct cpumask *smt = cpu_smt_mask(cpu); > + struct cpumask *idle_cpu = get_idle_cpumask(cpu); > + struct cpumask *idle_smt = get_idle_smtmask(cpu); > > if (idle) { > /* > @@ -3593,12 +3640,12 @@ void __scx_update_idle(struct rq *rq, bool idle) > * it's only for optimization and self-correcting. 
> */ > for_each_cpu(cpu, smt) { > - if (!cpumask_test_cpu(cpu, idle_masks.cpu)) > + if (!cpumask_test_cpu(cpu, idle_cpu)) > return; > } > - cpumask_or(idle_masks.smt, idle_masks.smt, smt); > + cpumask_or(idle_smt, idle_smt, smt); > } else { > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > + cpumask_andnot(idle_smt, idle_smt, smt); > } > } > #endif > @@ -6174,8 +6221,7 @@ void __init init_sched_ext_class(void) > > BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params)); > #ifdef CONFIG_SMP > - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL)); > - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL)); > + idle_masks_init(); > #endif > scx_kick_cpus_pnt_seqs = > __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids, > @@ -7321,7 +7367,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask) > > /** > * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking > - * per-CPU cpumask. > + * per-CPU cpumask of the current NUMA node. > * > * Returns NULL if idle tracking is not enabled, or running on a UP kernel. > */ > @@ -7333,7 +7379,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > } > > #ifdef CONFIG_SMP > - return idle_masks.cpu; > + return get_idle_cpumask(smp_processor_id()); > #else > return cpu_none_mask; > #endif > @@ -7341,8 +7387,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void) > > /** > * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking, > - * per-physical-core cpumask. Can be used to determine if an entire physical > - * core is free. > + * per-physical-core cpumask of the current NUMA node. Can be used to determine > + * if an entire physical core is free. > * > * Returns NULL if idle tracking is not enabled, or running on a UP kernel. 
> */ > @@ -7355,9 +7401,9 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void) > > #ifdef CONFIG_SMP > if (sched_smt_active()) > - return idle_masks.smt; > + return get_idle_smtmask(smp_processor_id()); > else > - return idle_masks.cpu; > + return get_idle_cpumask(smp_processor_id()); > #else > return cpu_none_mask; > #endif > -- > 2.47.0 ^ permalink raw reply [flat|nested] 12+ messages in thread
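Two of the review points above — that an assign_cpu()-style helper collapses the open-coded set/clear branches in __scx_update_idle(), and that andnot is already a no-op on disjoint masks — can be checked on toy 64-bit masks (assign_bit_u64() and andnot_u64() are illustrative stand-ins operating on a u64, not the kernel's cpumask helpers):

```c
#include <assert.h>
#include <stdint.h>

/* One helper that sets or clears a bit depending on a boolean,
 * replacing "if (idle) set else clear"; returns the new mask. */
static uint64_t assign_bit_u64(uint64_t mask, int nr, int value)
{
	return value ? mask | (1ULL << nr) : mask & ~(1ULL << nr);
}

/* a & ~b: naturally unchanged when a and b are disjoint -- the
 * property that makes a guarding intersects() check logically
 * redundant, as noted in the review. */
static uint64_t andnot_u64(uint64_t a, uint64_t b)
{
	return a & ~b;
}
```

The disjoint case returning the input unchanged is exactly why the extra O(N) pre-checks can be dropped for correctness — the later discussion in the thread is only about whether they still pay off for cache behavior.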
* Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks 2024-11-27 2:35 ` Yury Norov @ 2024-11-27 7:41 ` Andrea Righi 0 siblings, 0 replies; 12+ messages in thread From: Andrea Righi @ 2024-11-27 7:41 UTC (permalink / raw) To: Yury Norov; +Cc: Tejun Heo, David Vernet, linux-kernel On Tue, Nov 26, 2024 at 06:35:16PM -0800, Yury Norov wrote: > On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote: > > Using a single global idle mask can lead to inefficiencies and a lot of > > stress on the cache coherency protocol on large systems with multiple > > NUMA nodes, since all the CPUs can create a really intense read/write > > activity on the single global cpumask. > > > > Therefore, split the global cpumask into multiple per-NUMA node cpumasks > > to improve scalability and performance on large systems. > > > > The concept is that each cpumask will track only the idle CPUs within > > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy. > > In this way concurrent access to the idle cpumask will be restricted > > within each NUMA node. > > This all is definitely a complication of the code. Have you any numbers to > justify it? I'll include some numbers in v2. For now I just wanted to send the patch out to start some discussion. :) > > > [Open issue] > > > > The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return > > a single cpumask for all the CPUs, have been changed to report only the > > cpumask of the current NUMA node (using the current CPU); this breaks > > the old behavior, so it can potentially introduce regressions in some > > scx schedulers. > > > > An alternative approach could be to construct a global cpumask > > on-the-fly, but this could add significant overhead to ops.select_cpu() > > for schedulers relying on these kfunc's. Additionally, it would be less > > reliable than accessing the actual cpumasks, as the copy could quickly > > become out of sync and not represent the actual idle state very well. 
> > > > Probably a better way to solve this issue is to introduce new kfunc's to > > explicitly select specific per-NUMA cpumask and modify the scx > > schedulers to transition to this new API, for example: > > > > const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node) > > const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node) > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > --- ... > > @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu) > > */ > > if (sched_smt_active()) { > > const struct cpumask *smt = cpu_smt_mask(cpu); > > + struct cpumask *idle_smt = get_idle_smtmask(cpu); > > > > /* > > * If offline, @cpu is not its own sibling and > > * scx_pick_idle_cpu() can get caught in an infinite loop as > > - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu > > - * is eventually cleared. > > + * @cpu is never cleared from the idle SMT mask. Ensure that > > + * @cpu is eventually cleared. > > */ > > - if (cpumask_intersects(smt, idle_masks.smt)) > > - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt); > > - else if (cpumask_test_cpu(cpu, idle_masks.smt)) > > - __cpumask_clear_cpu(cpu, idle_masks.smt); > > + if (cpumask_intersects(smt, idle_smt)) > > + cpumask_andnot(idle_smt, idle_smt, smt); > > cpumask_andnot() is a no-op if masks don't intersect. > cpumask_intersects() is O(N), and you can safely drop it and > save some cycles. > > > + else if (cpumask_test_cpu(cpu, idle_smt)) > > + __cpumask_clear_cpu(cpu, idle_smt); > > Same here. If the CPU is clear, __cpumask_clear_cpu() is a no-op. > I feel like you can just clear all that CPUs unconditionally. In > the worst case, you'll clear them twice, which is harmless. Or I > misunderstand something? Yeah, makes sense to me, it definitely seems more efficient to just clear the CPU. Thanks for noticing it. 
> > > } > > #endif > > - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu); > > + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu)); > > } > > > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) > > { > > - int cpu; > > + int start = cpu_to_node(smp_processor_id()); > > + int node, cnt, cpu; > > > > retry: > > if (sched_smt_active()) { > > - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed); > > - if (cpu < nr_cpu_ids) > > - goto found; > > + for_each_node_state_from(node, N_ONLINE, start, cnt) { > > + cpu = cpumask_any_and_distribute(idle_masks[node]->smt, > > + cpus_allowed); > > Nit: no need to break the line here. Ok. > > > + if (cpu < nr_cpu_ids) > > + goto found; > > + } > > > > if (flags & SCX_PICK_IDLE_CORE) > > return -EBUSY; > > } > > > > - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed); > > - if (cpu >= nr_cpu_ids) > > - return -EBUSY; > > + for_each_node_state_from(node, N_ONLINE, start, cnt) { > > + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu, > > + cpus_allowed); > > + if (cpu < nr_cpu_ids) > > + goto found; > > + } > > + return -EBUSY; > > > > found: > > if (test_and_clear_cpu_idle(cpu)) > > @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > > * piled up on it even if there is an idle core elsewhere on > > * the system. > > */ > > - if (!cpumask_empty(idle_masks.cpu) && > > + if (!cpumask_empty(get_idle_cpumask(cpu)) && > > !(current->flags & PF_EXITING) && > > cpu_rq(cpu)->scx.local_dsq.nr == 0) { > > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > > cpumask_empty() is O(N), and the other checks are O(1). If you reorder > them such that cpumask_empty() would be the last check, you can > increase probability that you will not check it at all. Ok. > > > @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, > > /* > > * Keep using @prev_cpu if it's part of a fully idle core. 
> > */ > > - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) && > > + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) && > > test_and_clear_cpu_idle(prev_cpu)) { > > cpu = prev_cpu; > > goto cpu_found; > > @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p, > > > > static void reset_idle_masks(void) > > { > > + int node; > > + > > /* > > * Consider all online cpus idle. Should converge to the actual state > > * quickly. > > */ > > - cpumask_copy(idle_masks.cpu, cpu_online_mask); > > - cpumask_copy(idle_masks.smt, cpu_online_mask); > > + for_each_node_state(node, N_POSSIBLE) { > > + const struct cpumask *node_mask = cpumask_of_node(node); > > + > > + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask); > > + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask); > > cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu) > Ok. > > + } > > } > > > > void __scx_update_idle(struct rq *rq, bool idle) > > @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle) > > } > > > > if (idle) > > - cpumask_set_cpu(cpu, idle_masks.cpu); > > + cpumask_set_cpu(cpu, get_idle_cpumask(cpu)); > > else > > - cpumask_clear_cpu(cpu, idle_masks.cpu); > > + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu)); > > assign_cpu(cpu, get_idle_cpumask(cpu), idle) Ok. > > Thanks, > Yury Thanks for the review! Will include all these changes in v2. -Andrea ^ permalink raw reply [flat|nested] 12+ messages in thread