public inbox for linux-kernel@vger.kernel.org
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Yury Norov <yury.norov@gmail.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
Date: Wed, 18 Dec 2024 11:21:30 +0100	[thread overview]
Message-ID: <Z2KiKs-Jw9meENCi@gpd3> (raw)
In-Reply-To: <Z2IHsuzeW5e7MAr6@slm.duckdns.org>

Hi Tejun,

On Tue, Dec 17, 2024 at 01:22:26PM -1000, Tejun Heo wrote:
> On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> > +static int validate_node(int node)
> > +{
> > +	/* If no node is specified, return the current one */
> > +	if (node == NUMA_NO_NODE)
> > +		return numa_node_id();
> > +
> > +	/* Make sure node is in the range of possible nodes */
> > +	if (node < 0 || node >= num_possible_nodes())
> > +		return -EINVAL;
> 
> Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
> nr_node_ids`? Also, should probably add node_possible(node)?

Or even better, add node_online(node); an offline NUMA node shouldn't be
used in this context.
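The validation order being discussed (NUMA_NO_NODE fallback, range check,
then online check) can be sketched in userspace; the mock_* helpers below
are stand-ins for the kernel's numa_node_id()/node_online(), and the mock
topology (node 2 possible but offline) is invented for illustration:

```c
#include <assert.h>

#define NUMA_NO_NODE	(-1)
#define NR_NODE_IDS	4	/* mock: node IDs 0..3 are possible */
#define EINVAL		22

/* Mock topology: node 2 is possible but currently offline. */
static int mock_node_online(int node)
{
	return node >= 0 && node < NR_NODE_IDS && node != 2;
}

static int mock_numa_node_id(void)
{
	return 0;	/* pretend the caller currently runs on node 0 */
}

/* Default to the current node, reject out-of-range and offline nodes. */
static int validate_node(int node)
{
	if (node == NUMA_NO_NODE)
		return mock_numa_node_id();
	if (node < 0 || node >= NR_NODE_IDS)
		return -EINVAL;
	if (!mock_node_online(node))
		return -EINVAL;
	return node;
}
```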

> 
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > + * from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> 
> As the masks are allocated separately anyway, the aligned attribute can be
> dropped. There's no reason to align the index array.

Right.

> 
> > +static struct cpumask *get_idle_mask_node(int node, bool smt)
> > +{
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> > +
> > +	node = validate_node(node);
> 
> It's odd to validate input node in an internal function. If node is being
> passed from BPF side, we should validate it and trigger scx_ops_error() if
> invalid, but once the node number is inside the kernel, we should be able to
> trust it.

Makes sense, I'll move the validation into the kfuncs and trigger
scx_ops_error() if it fails.

> 
> > +static struct cpumask *get_idle_cpumask_node(int node)
> > +{
> > +	return get_idle_mask_node(node, false);
> 
> Maybe make the inner function return `struct idle_cpumasks *` so that the
> caller can pick between cpu and smt?

Ok.
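The accessor shape Tejun suggests can be sketched like this, with the
per-node array and the mask type mocked in userspace (one word per mask,
two mock nodes): the inner function returns the whole per-node struct and
thin wrappers pick the member, instead of passing a bool:

```c
#include <assert.h>

typedef unsigned long cpumask_t;	/* mock: one word per mask */

struct idle_cpumask {
	cpumask_t cpu;
	cpumask_t smt;
};

/* Mock per-node array with two nodes and arbitrary initial states. */
static struct idle_cpumask masks[2] = {
	{ .cpu = 0x3, .smt = 0x1 },
	{ .cpu = 0xc, .smt = 0x4 },
};

/* Inner accessor returns the whole per-node struct... */
static struct idle_cpumask *idle_masks_node(int node)
{
	return &masks[node];
}

/* ...and the callers pick the cpu or smt member themselves. */
static cpumask_t *get_idle_cpumask_node(int node)
{
	return &idle_masks_node(node)->cpu;
}

static cpumask_t *get_idle_smtmask_node(int node)
{
	return &idle_masks_node(node)->smt;
}
```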

> 
> > +static void idle_masks_init(void)
> > +{
> > +	int node;
> > +
> > +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> 
> We probably want to use a variable name which is more qualified for a global
> variable - scx_idle_masks?

Ok.

> 
> > @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> >  
> >  static bool test_and_clear_cpu_idle(int cpu)
> >  {
> > +	int node = cpu_to_node(cpu);
> > +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> 
> Can we use plurals for cpumask variables - idle_cpus here?

Ok.

> 
> > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> 
> Do we need "from_node"?
> 
> >  {
> >  	int cpu;
> >  
> >  retry:
> >  	if (sched_smt_active()) {
> > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > +		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
> 
> This too, would s/get_idle_smtmask_node(node)/idle_smtmask(node)/ work?
> There are no node-unaware counterparts to these functions, right?

Correct, we can just get rid of the _from_node() part.

> 
> > +static s32
> > +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > +	nodemask_t hop_nodes = NODE_MASK_NONE;
> > +	int start_node = cpu_to_node(prev_cpu);
> > +	s32 cpu = -EBUSY;
> > +
> > +	/*
> > +	 * Traverse all online nodes in order of increasing distance,
> > +	 * starting from prev_cpu's node.
> > +	 */
> > +	rcu_read_lock();
> 
> Is rcu_read_lock() necessary? Does lockdep warn if the explicit
> rcu_read_lock() is dropped?

Good point: the other iterator, for_each_numa_hop_mask(), requires it, but
only to access the cpumasks via rcu_dereference(). Since here we are only
iterating over node IDs, I think we can drop rcu_read_lock()/unlock(). I'll
double-check whether lockdep complains without it.

> 
> > @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >  
> >  static void reset_idle_masks(void)
> >  {
> > +	int node;
> > +
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > +		cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> > +		cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> > +		return;
> > +	}
> > +
> >  	/*
> >  	 * Consider all online cpus idle. Should converge to the actual state
> >  	 * quickly.
> >  	 */
> > -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> > +	for_each_node_state(node, N_POSSIBLE) {
> > +		const struct cpumask *node_mask = cpumask_of_node(node);
> > +		struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
> > +
> > +		cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > +		cpumask_copy(idle_smt, idle_cpu);
> 
> Can you do the same cpumask_and() here? I don't think it'll cause practical
> problems but idle_cpus can be updated inbetween and e.g. we can end up with
> idle_smts that have different idle states between siblings.

Makes sense. The state should still converge to the right one in any case,
but I agree it's more accurate to use cpumask_and() for idle_smt as well.
Will change that.
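The race Tejun describes can be sketched with word-sized mock masks. In the
copy variant, if a concurrent update clears a bit of idle_cpu between the
two statements, idle_smt inherits the torn state; AND-ing the same stable
inputs twice keeps both masks derived from one consistent snapshot. The
helper names and the single-word cpumask_t are invented for illustration:

```c
#include <assert.h>

typedef unsigned long cpumask_t;	/* mock: one word per mask */

/* Patch variant: idle_smt copied from the (possibly updated) idle_cpu. */
static cpumask_t smt_from_copy(cpumask_t online, cpumask_t node,
			       cpumask_t cleared)
{
	cpumask_t idle_cpu = online & node;	/* cpumask_and() */

	idle_cpu &= ~cleared;	/* concurrent update between the statements */
	return idle_cpu;	/* cpumask_copy(idle_smt, idle_cpu) */
}

/* Suggested variant: idle_smt AND-ed from the same stable inputs. */
static cpumask_t smt_from_and(cpumask_t online, cpumask_t node,
			      cpumask_t cleared)
{
	cpumask_t idle_cpu = online & node;

	idle_cpu &= ~cleared;	/* same update, but idle_smt is unaffected */
	return online & node;	/* cpumask_and(idle_smt, online, node) */
}
```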

> 
> >  /**
> >   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> > - * per-CPU cpumask.
> > + * per-CPU cpumask of the current NUMA node.
> 
> This is a bit misleading as it can be system-wide too.
> 
> It's a bit confusing for scx_bpf_get_idle_cpu/smtmask() to return per-node
> mask while scx_bpf_pick_idle_cpu() and friends are not scoped to the node.
> Also, scx_bpf_pick_idle_cpu() picking the local node as the origin probably
> doesn't make sense for most use cases as it's usually called from
> ops.select_cpu() and the waker won't necessarily run on the same node as the
> wakee.
> 
> Maybe disallow scx_bpf_get_idle_cpu/smtmask() if idle_per_node is enabled
> and add scx_bpf_get_idle_cpu/smtmask_node()? Ditto for
> scx_bpf_pick_idle_cpu() and we can add a PICK_IDLE flag to allow/inhibit
> CPUs outside the specified node.

Yeah, I also don't much like the idea of implicitly using the current node
when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

I think it's totally reasonable to disallow the system-wide
scx_bpf_get_idle_cpu/smtmask() when the flag is enabled. Ultimately, it's
the scheduler's responsibility to enable or disable this feature, and if
it's enabled, the scheduler is expected to implement NUMA-aware logic.

I'm also fine with adding SCX_PICK_IDLE_NODE (or similar) to restrict the
search for an idle CPU to the specified node.
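The flag semantics under discussion could look roughly like this userspace
sketch: the search starts from the requested node and the flag inhibits the
fallback to other nodes. SCX_PICK_IDLE_NODE is the name floated above, and
the per-node masks, node count, and pick helper are all mocked:

```c
#include <assert.h>

typedef unsigned long cpumask_t;	/* mock: one word per mask */

#define SCX_PICK_IDLE_NODE	(1UL << 0)	/* hypothetical flag */
#define NR_NODES		2

/* Mock per-node idle masks: node 0 has no idle CPU, node 1 has CPU 2. */
static cpumask_t idle_cpus[NR_NODES] = { 0x0, 0x4 };

static int pick_from_mask(cpumask_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return -1;	/* stand-in for -EBUSY */
}

/* Try @node first; the flag inhibits CPUs outside the specified node. */
static int pick_idle_cpu_node(int node, unsigned long flags)
{
	int cpu = pick_from_mask(idle_cpus[node]);

	if (cpu >= 0 || (flags & SCX_PICK_IDLE_NODE))
		return cpu;

	for (int n = 0; n < NR_NODES; n++) {
		if (n == node)
			continue;
		cpu = pick_from_mask(idle_cpus[n]);
		if (cpu >= 0)
			return cpu;
	}
	return -1;
}
```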

Thanks!
-Andrea

Thread overview: 23+ messages
2024-12-17  9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-17  9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-17 21:57   ` Tejun Heo
2024-12-18 10:23     ` Andrea Righi
2024-12-18 16:04       ` Tejun Heo
2024-12-19 18:26         ` Yury Norov
2024-12-19 19:43           ` Andrea Righi
2024-12-19 19:52           ` Peter Zijlstra
2024-12-19 21:16             ` Andrea Righi
2024-12-17  9:32 ` [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-17  9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-17 23:22   ` Tejun Heo
2024-12-18 10:21     ` Andrea Righi [this message]
2024-12-18 16:10       ` Tejun Heo
2024-12-18 16:18         ` Andrea Righi
2024-12-17 23:23   ` Tejun Heo
2024-12-20 16:48   ` Yury Norov
2024-12-20 17:52     ` Andrea Righi
2024-12-17  9:32 ` [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-17  9:32 ` [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-17  9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-20 14:53   ` Yury Norov
2024-12-20 14:58     ` Andrea Righi
