From: Yury Norov <yury.norov@gmail.com>
To: Andrea Righi <arighi@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
Date: Tue, 26 Nov 2024 18:35:16 -0800 [thread overview]
Message-ID: <Z0aFZJZ85LUwapZE@yury-ThinkPad> (raw)
In-Reply-To: <20241126101259.52077-3-arighi@nvidia.com>
On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote:
> Using a single global idle mask can lead to inefficiencies and a lot of
> stress on the cache coherency protocol on large systems with multiple
> NUMA nodes, since all the CPUs can create a really intense read/write
> activity on the single global cpumask.
>
> Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> to improve scalability and performance on large systems.
>
> The concept is that each cpumask will track only the idle CPUs within
> its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> In this way concurrent access to the idle cpumask will be restricted
> within each NUMA node.
This all is definitely a complication of the code. Have you any numbers to
justify it?
> [Open issue]
>
> The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return
> a single cpumask for all the CPUs, have been changed to report only the
> cpumask of the current NUMA node (using the current CPU); this breaks
> the old behavior, so it can potentially introduce regressions in some
> scx schedulers.
>
> An alternative approach could be to construct a global cpumask
> on-the-fly, but this could add significant overhead to ops.select_cpu()
> for schedulers relying on these kfunc's. Additionally, it would be less
> reliable than accessing the actual cpumasks, as the copy could quickly
> become out of sync and not represent the actual idle state very well.
>
> Probably a better way to solve this issue is to introduce new kfunc's to
> explicitly select specific per-NUMA cpumask and modify the scx
> schedulers to transition to this new API, for example:
>
> const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
> const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 110 ++++++++++++++++++++++++++++++++-------------
> 1 file changed, 78 insertions(+), 32 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 3c4a94e4258f..ea2fae2db10e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -933,7 +933,37 @@ static struct delayed_work scx_watchdog_work;
> static struct {
> cpumask_var_t cpu;
> cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +} **idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +static struct cpumask *get_idle_cpumask(int cpu)
> +{
> + int node = cpu_to_node(cpu);
> +
> + return idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_smtmask(int cpu)
> +{
> + int node = cpu_to_node(cpu);
> +
> + return idle_masks[node]->smt;
> +}
> +
> +static void idle_masks_init(void)
> +{
> + int node;
> +
> + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> + BUG_ON(!idle_masks);
> +
> + for_each_node_state(node, N_POSSIBLE) {
> + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> + BUG_ON(!idle_masks[node]);
> +
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> + }
> +}
>
> #endif /* CONFIG_SMP */
>
> @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
> */
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask(cpu);
>
> /*
> * If offline, @cpu is not its own sibling and
> * scx_pick_idle_cpu() can get caught in an infinite loop as
> - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> - * is eventually cleared.
> + * @cpu is never cleared from the idle SMT mask. Ensure that
> + * @cpu is eventually cleared.
> */
> - if (cpumask_intersects(smt, idle_masks.smt))
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> - __cpumask_clear_cpu(cpu, idle_masks.smt);
> + if (cpumask_intersects(smt, idle_smt))
> + cpumask_andnot(idle_smt, idle_smt, smt);
cpumask_andnot() is a no-op if masks don't intersect.
cpumask_intersects() is O(N), and you can safely drop it and
save some cycles.
> + else if (cpumask_test_cpu(cpu, idle_smt))
> + __cpumask_clear_cpu(cpu, idle_smt);
Same here. If the CPU is clear, __cpumask_clear_cpu() is a no-op.
I feel like you can just clear all that CPUs unconditionally. In
the worst case, you'll clear them twice, which is harmless. Or I
misunderstand something?
> }
> #endif
> - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
> }
>
> static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> {
> - int cpu;
> + int start = cpu_to_node(smp_processor_id());
> + int node, cnt, cpu;
>
> retry:
> if (sched_smt_active()) {
> - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> - if (cpu < nr_cpu_ids)
> - goto found;
> + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> + cpu = cpumask_any_and_distribute(idle_masks[node]->smt,
> + cpus_allowed);
Nit: no need to break the line here.
> + if (cpu < nr_cpu_ids)
> + goto found;
> + }
>
> if (flags & SCX_PICK_IDLE_CORE)
> return -EBUSY;
> }
>
> - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> - if (cpu >= nr_cpu_ids)
> - return -EBUSY;
> + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu,
> + cpus_allowed);
> + if (cpu < nr_cpu_ids)
> + goto found;
> + }
> + return -EBUSY;
>
> found:
> if (test_and_clear_cpu_idle(cpu))
> @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * piled up on it even if there is an idle core elsewhere on
> * the system.
> */
> - if (!cpumask_empty(idle_masks.cpu) &&
> + if (!cpumask_empty(get_idle_cpumask(cpu)) &&
> !(current->flags & PF_EXITING) &&
> cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
cpumask_empty() is O(N), and the other checks are O(1). If you reorder
them such that cpumask_empty() would be the last check, you can
increase probability that you will not check it at all.
> @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Keep using @prev_cpu if it's part of a fully idle core.
> */
> - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
> test_and_clear_cpu_idle(prev_cpu)) {
> cpu = prev_cpu;
> goto cpu_found;
> @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>
> static void reset_idle_masks(void)
> {
> + int node;
> +
> /*
> * Consider all online cpus idle. Should converge to the actual state
> * quickly.
> */
> - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> - cpumask_copy(idle_masks.smt, cpu_online_mask);
> + for_each_node_state(node, N_POSSIBLE) {
> + const struct cpumask *node_mask = cpumask_of_node(node);
> +
> + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
> + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask);
cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu)
> + }
> }
>
> void __scx_update_idle(struct rq *rq, bool idle)
> @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle)
> }
>
> if (idle)
> - cpumask_set_cpu(cpu, idle_masks.cpu);
> + cpumask_set_cpu(cpu, get_idle_cpumask(cpu));
> else
> - cpumask_clear_cpu(cpu, idle_masks.cpu);
> + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu));
assign_cpu(cpu, get_idle_cpumask(cpu), idle)
Thanks,
Yury
>
> #ifdef CONFIG_SCHED_SMT
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask(cpu);
>
> if (idle) {
> /*
> @@ -3593,12 +3640,12 @@ void __scx_update_idle(struct rq *rq, bool idle)
> * it's only for optimization and self-correcting.
> */
> for_each_cpu(cpu, smt) {
> - if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> + if (!cpumask_test_cpu(cpu, idle_cpu))
> return;
> }
> - cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_or(idle_smt, idle_smt, smt);
> } else {
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_andnot(idle_smt, idle_smt, smt);
> }
> }
> #endif
> @@ -6174,8 +6221,7 @@ void __init init_sched_ext_class(void)
>
> BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
> #ifdef CONFIG_SMP
> - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> + idle_masks_init();
> #endif
> scx_kick_cpus_pnt_seqs =
> __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> @@ -7321,7 +7367,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
>
> /**
> * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.
> *
> * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> */
> @@ -7333,7 +7379,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> }
>
> #ifdef CONFIG_SMP
> - return idle_masks.cpu;
> + return get_idle_cpumask(smp_processor_id());
> #else
> return cpu_none_mask;
> #endif
> @@ -7341,8 +7387,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>
> /**
> * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask. Can be used to determine if an entire physical
> - * core is free.
> + * per-physical-core cpumask of the current NUMA node. Can be used to determine
> + * if an entire physical core is free.
> *
> * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> */
> @@ -7355,9 +7401,9 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
>
> #ifdef CONFIG_SMP
> if (sched_smt_active())
> - return idle_masks.smt;
> + return get_idle_smtmask(smp_processor_id());
> else
> - return idle_masks.cpu;
> + return get_idle_cpumask(smp_processor_id());
> #else
> return cpu_none_mask;
> #endif
> --
> 2.47.0
next prev parent reply other threads:[~2024-11-27 2:35 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-26 9:56 [PATCHSET sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-11-26 9:56 ` [PATCH 1/2] nodemask: Introduce for_each_node_mask_from/for_each_node_state_from() Andrea Righi
2024-11-27 1:23 ` Yury Norov
2024-11-27 7:29 ` Andrea Righi
2024-11-26 9:56 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-11-27 2:35 ` Yury Norov [this message]
2024-11-27 7:41 ` Andrea Righi
-- strict thread matches above, loose matches on Subject: below --
2024-11-29 17:54 [PATCHSET v2 sched_ext/for-6.13] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-11-29 17:54 ` [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-11-29 19:38 ` Yury Norov
2024-11-30 15:24 ` Andrea Righi
2024-12-03 7:38 ` Andrea Righi
2024-12-03 14:16 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z0aFZJZ85LUwapZE@yury-ThinkPad \
--to=yury.norov@gmail.com \
--cc=arighi@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.