From: Andrea Righi <arighi@nvidia.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
Date: Sat, 14 Dec 2024 07:05:23 +0100 [thread overview]
Message-ID: <Z10gI2wV1FJ5lyTZ@gpd3> (raw)
In-Reply-To: <Z1ejJSBes62otQ0k@yury-ThinkPad>
On Mon, Dec 09, 2024 at 06:10:45PM -0800, Yury Norov wrote:
> On Tue, Dec 10, 2024 at 01:14:43AM +0100, Andrea Righi wrote:
> > > And altogether this should look like:
> > >
> > > int scx_pick_idle_cpu_from_hop(struct cpumask *hop_cpus, struct cpumask *cpus_allowed)
> > > {
> > > int node, cpu, random_cpu;
> > >
> > > do {
> > >
> > > /* Pick a 'random' CPU in the hop */
> > > random_cpu = cpumask_any_and_distribute(hop_cpus, cpus_allowed);
> > > if (random_cpu >= nr_cpu_ids)
> > > continue;
> > >
> > > node = cpu_to_node(random_cpu);
> > >
> > > /* Find an idle CPU in the same node */
> > > cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> > > if (cpu >= 0)
> > > break;
> > >
> > > /* No luck? Try other nodes */
> > > } while (cpumask_andnot(hop_cpus, hop_cpus, cpumask_of_node(node)));
> > >
> > > return cpu;
> > > }
> > >
> > > static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > > {
> > > const struct cpumask *next, *prev = cpu_none_mask;
> > > int prev_node = cpu_to_node(prev_cpu);
> > > ...
> > > for_each_numa_hop_mask(next, prev_node) {
> > > cpumask_andnot(hop_cpus, next, prev);
> > > cpu = scx_pick_idle_cpu_from_hop(hop_cpus, cpus_allowed);
> > > prev = next;
> > > }
> > > ...
> > > }
> > >
> > > Not tested, but should work.
> >
> > Makes sense to me, I'll do some testing with this.
>
> One thing you can do to optimize it is introducing a function that
> pulls nodes from the hop_cpus:
>
> void sched_get_hop_nodes(nodemask_t *hop_nodes, const struct cpumask *hop_cpus)
> {
> int cpu;
>
> for_each_cpu(cpu, hop_cpus) {
> node_set(cpu_to_node(cpu);, hop_nodes);
> cpu = cpumask_next_zero(cpu, cpumask_of_node(node)
> }
> }
>
> This should be O(N), but it will let you to avoid O(N*M) in the loop
> condition inside scx_pick_idle_cpu_from_hop():
>
> int scx_pick_idle_cpu_from_hop(nodemask_t *hop_nodes, struct cpumask *cpus_allowed)
> {
> int node, idle_cpu, random_cpu;
>
> for_each_node_mask(node, &hop_nodes) {
> /* Pick a 'random' CPU in the node */
> random_cpu = cpumask_any_and_distribute(cpumask_of_node(node), cpus_allowed);
> if (random_cpu >= nr_cpu_ids)
> continue;
>
> /* Find an idle CPU in the same node */
> idle_cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> if (idle_cpu >= 0)
> break;
>
> }
>
> return cpu;
> }
>
> And at this point I'd also compare the above with non-randomized
> version:
>
> static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> {
> const struct cpumask *next, *prev = cpu_none_mask;
> int prev_node = cpu_to_node(prev_cpu);
> nodemask_t hop_nodes;
> ...
> for_each_numa_hop_mask(next, prev_node) {
> if (!cpumask_and_andnot(hop_cpus, next, cpus_allow, prev))
> goto cont;
>
> sched_get_hop_nodes(hop_nodes, hop_cpus);
> for_each_node_mask(node, hop_nodes) {
> cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> if (cpu >= 0)
> goto found;
> }
>
> cont:
> prev = next;
> }
> ...
> }
>
> Don't know how it works, but it looks really good.
I've done some tests with multiple variations of this. I think one of the
issues is that we need to allocate hop_cpus and it seems to introduce some
overhead (or it could be the operations on the cpumasks), but it seems to
pretty much nullify the benefits of the per-node idle cpumasks.
However, I came up with this one, which doesn't seem to introduce much
overhead:
int sched_numa_hop_node(nodemask_t *hop_nodes, int start, unsigned int state)
{
int dist, n, min_node, min_dist;
if (state >= NR_NODE_STATES)
return NUMA_NO_NODE;
min_node = NUMA_NO_NODE;
min_dist = INT_MAX;
for_each_node_state(n, state) {
if (n == start || node_isset(n, *hop_nodes))
continue;
dist = node_distance(start, n);
if (dist < min_dist) {
min_dist = dist;
min_node = n;
}
}
if (min_node != NUMA_NO_NODE)
node_set(min_node, *hop_nodes);
return min_node;
}
#define for_each_numa_hop_node(__node, __start, __hop_nodes, __state) \
for (int __node = __start; \
__node != NUMA_NO_NODE; \
__node = sched_numa_hop_node(&(__hop_nodes), __start, __state))
It's O(N^2) with the number of nodes, but it shouldn't be too bad, unless
we are in MAXSMP systems maybe, but even in in this case scx schedulers
have the option to use scx_bpf_pick_idle_cpu_node(), instead of the generic
scx_pick_idle_cpu(), and define their own criteria to iterate on the NUMA
nodes.
I'll do more testing, if we don't find any issue I'm think I'll go with
this one in the next patch set.
-Andrea
next prev parent reply other threads:[~2024-12-14 6:06 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-12-09 19:32 ` Yury Norov
2024-12-09 20:40 ` Andrea Righi
2024-12-10 0:14 ` Andrea Righi
2024-12-10 2:10 ` Yury Norov
2024-12-14 6:05 ` Andrea Righi [this message]
2024-12-11 17:46 ` Yury Norov
2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-11 8:05 ` Changwoo Min
2024-12-11 12:22 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-11 18:21 ` Yury Norov
2024-12-11 19:59 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-11 17:43 ` Yury Norov
2024-12-11 20:20 ` Andrea Righi
2024-12-11 20:47 ` Yury Norov
2024-12-11 20:55 ` Andrea Righi
-- strict thread matches above, loose matches on Subject: below --
2024-12-05 21:00 [PATCHSET v4 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-05 21:00 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z10gI2wV1FJ5lyTZ@gpd3 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=tj@kernel.org \
--cc=void@manifault.com \
--cc=yury.norov@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox