All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Yury Norov <yury.norov@gmail.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks
Date: Wed, 18 Dec 2024 11:21:30 +0100	[thread overview]
Message-ID: <Z2KiKs-Jw9meENCi@gpd3> (raw)
In-Reply-To: <Z2IHsuzeW5e7MAr6@slm.duckdns.org>

Hi Tejun,

On Tue, Dec 17, 2024 at 01:22:26PM -1000, Tejun Heo wrote:
> On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> > +static int validate_node(int node)
> > +{
> > +	/* If no node is specified, return the current one */
> > +	if (node == NUMA_NO_NODE)
> > +		return numa_node_id();
> > +
> > +	/* Make sure node is in the range of possible nodes */
> > +	if (node < 0 || node >= num_possible_nodes())
> > +		return -EINVAL;
> 
> Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
> nr_node_ids`? Also, should probably add node_possible(node)?

Or even better add node_online(node), an offline NUMA node shouldn't be
used in this context.

> 
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > + * from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> 
> As the masks are allocated separately anyway, the aligned attribute can be
> dropped. There's no reason to align the index array.

Right.

> 
> > +static struct cpumask *get_idle_mask_node(int node, bool smt)
> > +{
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > +		return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> > +
> > +	node = validate_node(node);
> 
> It's odd to validate input node in an internal function. If node is being
> passed from BPF side, we should validate it and trigger scx_ops_error() if
> invalid, but once the node number is inside the kernel, we should be able to
> trust it.

Makes sense, I'll move the validation in the kfuncs and trigger
scx_ops_error() if the validation fails.

> 
> > +static struct cpumask *get_idle_cpumask_node(int node)
> > +{
> > +	return get_idle_mask_node(node, false);
> 
> Maybe make the inner function return `struct idle_cpumasks *` so that the
> caller can pick between cpu and smt?

Ok.

> 
> > +static void idle_masks_init(void)
> > +{
> > +	int node;
> > +
> > +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> 
> We probably want to use a variable name which is more qualified for a global
> variable - scx_idle_masks?

Ok.

> 
> > @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> >  
> >  static bool test_and_clear_cpu_idle(int cpu)
> >  {
> > +	int node = cpu_to_node(cpu);
> > +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> 
> Can we use plurals for cpumask varialbles - idle_cpus here?

Ok.

> 
> > -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> > +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)
> 
> Do we need "from_node"?
> 
> >  {
> >  	int cpu;
> >  
> >  retry:
> >  	if (sched_smt_active()) {
> > -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> > +		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);
> 
> This too, would s/get_idle_smtmask_node(node)/idle_smtmask(node)/ work?
> There are no node-unaware counterparts to these functions, right?

Correct, we can just get rid of the _from_node() part.

> 
> > +static s32
> > +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > +	nodemask_t hop_nodes = NODE_MASK_NONE;
> > +	int start_node = cpu_to_node(prev_cpu);
> > +	s32 cpu = -EBUSY;
> > +
> > +	/*
> > +	 * Traverse all online nodes in order of increasing distance,
> > +	 * starting from prev_cpu's node.
> > +	 */
> > +	rcu_read_lock();
> 
> Is rcu_read_lock() necessary? Does lockdep warn if the explicit
> rcu_read_lock() is dropped?

Good point, the other for_each_numa_hop_mask() iterator requires it, but
only to access the cpumasks via rcu_dereference(). Since we are iterating
node IDs I think we can get rid of rcu_read_lock/unlock() here. I'll double
check if lockdep complains without it.

> 
> > @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >  
> >  static void reset_idle_masks(void)
> >  {
> > +	int node;
> > +
> > +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > +		cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> > +		cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> > +		return;
> > +	}
> > +
> >  	/*
> >  	 * Consider all online cpus idle. Should converge to the actual state
> >  	 * quickly.
> >  	 */
> > -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> > +	for_each_node_state(node, N_POSSIBLE) {
> > +		const struct cpumask *node_mask = cpumask_of_node(node);
> > +		struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
> > +
> > +		cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > +		cpumask_copy(idle_smt, idle_cpu);
> 
> Can you do the same cpumask_and() here? I don't think it'll cause practical
> problems but idle_cpus can be updated inbetween and e.g. we can end up with
> idle_smts that have different idle states between siblings.

Makes sense, the state should still converge to the right one in any case,
but I agree that it's more accurate to use cpumask_and() also for idle_smt.
Will change that.

> 
> >  /**
> >   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> > - * per-CPU cpumask.
> > + * per-CPU cpumask of the current NUMA node.
> 
> This is a bit misleading as it can be system-wide too.
> 
> It's a bit confusing for scx_bpf_get_idle_cpu/smtmask() to return per-node
> mask while scx_bpf_pick_idle_cpu() and friends are not scoped to the node.
> Also, scx_bpf_pick_idle_cpu() picking the local node as the origin probably
> doesn't make sense for most use cases as it's usually called from
> ops.select_cpu() and the waker won't necessarily run on the same node as the
> wakee.
> 
> Maybe disallow scx_bpf_get_idle_cpu/smtmask() if idle_per_node is enabled
> and add scx_bpF_get_idle_cpu/smtmask_node()? Ditto for
> scx_bpf_pick_idle_cpu() and we can add a PICK_IDLE flag to allow/inhibit
> CPUs outside the specified node.

Yeah, I also don't like much the idea of implicitly use the current node
when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.

I think it's totally reasonable to disallow the system-wide
scx_bpf_get_idle_cpu/smtmask() when the flag is enabled. Ultimately, it's
the scheduler's responsibility to enable or disable this feature, and if
it's enabled, the scheduler is expected to implement NUMA-aware logic.

I'm also fine with adding SCX_PICK_IDLE_NODE (or similar) to restrict the
search for an idle CPU to the specified node.

Thanks!
-Andrea

  reply	other threads:[~2024-12-18 10:21 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-17  9:32 [PATCHSET v7 sched_ext/for-6.14] sched_ext: split global idle cpumask into per-NUMA cpumasks Andrea Righi
2024-12-17  9:32 ` [PATCH 1/6] sched/topology: introduce for_each_numa_hop_node() / sched_numa_hop_node() Andrea Righi
2024-12-17 21:57   ` Tejun Heo
2024-12-18 10:23     ` Andrea Righi
2024-12-18 16:04       ` Tejun Heo
2024-12-19 18:26         ` Yury Norov
2024-12-19 19:43           ` Andrea Righi
2024-12-19 19:52           ` Peter Zijlstra
2024-12-19 21:16             ` Andrea Righi
2024-12-17  9:32 ` [PATCH 2/6] sched_ext: Introduce SCX_OPS_NODE_BUILTIN_IDLE Andrea Righi
2024-12-17  9:32 ` [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks Andrea Righi
2024-12-17 23:22   ` Tejun Heo
2024-12-18 10:21     ` Andrea Righi [this message]
2024-12-18 16:10       ` Tejun Heo
2024-12-18 16:18         ` Andrea Righi
2024-12-17 23:23   ` Tejun Heo
2024-12-20 16:48   ` Yury Norov
2024-12-20 17:52     ` Andrea Righi
2024-12-17  9:32 ` [PATCH 4/6] sched_ext: Get rid of the scx_selcpu_topo_numa logic Andrea Righi
2024-12-17  9:32 ` [PATCH 5/6] sched_ext: Introduce NUMA aware idle cpu kfunc helpers Andrea Righi
2024-12-17  9:32 ` [PATCH 6/6] sched_ext: Move built-in idle CPU selection policy to a separate file Andrea Righi
2024-12-20 14:53   ` Yury Norov
2024-12-20 14:58     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z2KiKs-Jw9meENCi@gpd3 \
    --to=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=changwoo@igalia.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    --cc=yury.norov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.