From: Andrea Righi <arighi@nvidia.com>
To: Yury Norov <yury.norov@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks
Date: Mon, 9 Dec 2024 21:40:46 +0100 [thread overview]
Message-ID: <Z1dVzm0WVZSayz8L@gpd3> (raw)
In-Reply-To: <Z1dF6HuEI2nyUD2V@yury-ThinkPad>
Hi Yury,
On Mon, Dec 09, 2024 at 11:32:56AM -0800, Yury Norov wrote:
> On Mon, Dec 09, 2024 at 11:40:55AM +0100, Andrea Righi wrote:
> > Using a single global idle mask can lead to inefficiencies and a lot of
> > stress on the cache coherency protocol on large systems with multiple
> > NUMA nodes, since all the CPUs can create a really intense read/write
> > activity on the single global cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way concurrent access to the idle cpumask will be restricted
> > within each NUMA node.
> >
> > NOTE: the scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to
> > return a single cpumask for all the CPUs, have been changed to report
> > only the cpumask of the current NUMA node (using the current CPU).
> >
> > This is breaking the old behavior, but it will be addressed in the next
> > commits, introducing a new flag to switch between the old single global
> > flat idle cpumask or the multiple per-node cpumasks.
>
> Why don't you change the order of commits such that you first
> introduce the flag and then add new feature? That way you'll not have
> to explain yourself.
Good point! I'll refactor the patch set.
>
> Also, the kernel/sched/ext.c is already 7k+ LOCs. Can you move the
> per-node idle masks to a separate file? You can also make this feature
> configurable, and those who don't care (pretty much everyone except
> for PLATINUM 8570 victims, right?) will not have to even compile it.
>
> I'd like to see it enabled only for those who can really benefit from it.
Ok about moving it to a separate file.
I'm not completely convinced about making it a config option. I think
it'd be nice to allow individual scx schedulers to decide whether to use
the NUMA-aware idle selection or the flat idle selection logic. This can
also pave the way for future enhancements (e.g., introducing generic
sched domains).
Moreover, in terms of overhead, there's not much difference between a
scheduler that doesn't set SCX_OPS_BUILTIN_IDLE_PER_NODE and a single
statically-built idle cpumask: in both cases we still end up using a
single global cpumask.
Thanks,
-Andrea
Thread overview:
2024-12-09 10:40 [PATCHSET v5 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-09 10:40 ` [PATCH 1/4] sched_ext: Introduce per-NUMA idle cpumasks Andrea Righi
2024-12-09 19:32 ` Yury Norov
2024-12-09 20:40 ` Andrea Righi [this message]
2024-12-10 0:14 ` Andrea Righi
2024-12-10 2:10 ` Yury Norov
2024-12-14 6:05 ` Andrea Righi
2024-12-11 17:46 ` Yury Norov
2024-12-09 10:40 ` [PATCH 2/4] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-11 8:05 ` Changwoo Min
2024-12-11 12:22 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 3/4] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-11 18:21 ` Yury Norov
2024-12-11 19:59 ` Andrea Righi
2024-12-09 10:40 ` [PATCH 4/4] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-11 17:43 ` Yury Norov
2024-12-11 20:20 ` Andrea Righi
2024-12-11 20:47 ` Yury Norov
2024-12-11 20:55 ` Andrea Righi