From: Yury Norov <ynorov@nvidia.com>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, tglx@linutronix.de,
yury.norov@gmail.com, gregkh@linuxfoundation.org,
pbonzini@redhat.com, seanjc@google.com, kprateek.nayak@amd.com,
vschneid@redhat.com, iii@linux.ibm.com, huschle@linux.ibm.com,
rostedt@goodmis.org, dietmar.eggemann@arm.com, mgorman@suse.de,
bsegall@google.com, maddy@linux.ibm.com, srikar@linux.ibm.com,
hdanton@sina.com, chleroy@kernel.org, vineeth@bitbyteword.org,
joelagnelf@nvidia.com
Subject: Re: [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask
Date: Wed, 8 Apr 2026 13:57:23 -0400 [thread overview]
Message-ID: <adaXA6mlZlGJS3Jo@yury> (raw)
In-Reply-To: <0d8412de-e18a-476f-9eb6-9a977f4474a3@linux.ibm.com>
On Wed, Apr 08, 2026 at 02:46:03PM +0530, Shrikanth Hegde wrote:
> Hi Yury. Thanks for going through the series.
>
> On 4/8/26 1:57 AM, Yury Norov wrote:
> > On Wed, Apr 08, 2026 at 12:49:36AM +0530, Shrikanth Hegde wrote:
> > > This patch does
> > > - Declare and Define cpu_preferred_mask.
> > > - Get/Set helpers for it.
> > >
> > > Values are set/clear by the scheduler by detecting the steal time values.
> > >
> > > A CPU is set to preferred when it comes online. Later it may be
> > > marked as non-preferred depending on steal time values with
> > > STEAL_MONITOR enabled.
> > >
> > > Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> > > ---
> > > include/linux/cpumask.h | 22 ++++++++++++++++++++++
> > > kernel/cpu.c | 6 ++++++
> > > kernel/sched/core.c | 5 +++++
> > > 3 files changed, 33 insertions(+)
> > >
> > > diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> > > index 80211900f373..80c5cc13b8ad 100644
> > > --- a/include/linux/cpumask.h
> > > +++ b/include/linux/cpumask.h
> > > @@ -1296,6 +1296,28 @@ static __always_inline bool cpu_dying(unsigned int cpu)
> > > #endif /* NR_CPUS > 1 */
> > > +/*
> > > + * All related wrappers kept together to avoid too many ifdefs
> > > + * See Documentation/scheduler/sched-arch.rst for details
> > > + */
> > > +#ifdef CONFIG_PARAVIRT
> > > +extern struct cpumask __cpu_preferred_mask;
> > > +#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
> > > +#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
> > > +
> > > +static __always_inline bool cpu_preferred(unsigned int cpu)
> > > +{
> > > + return cpumask_test_cpu(cpu, cpu_preferred_mask);
> > > +}
> > > +#else
> > > +static __always_inline bool cpu_preferred(unsigned int cpu)
> > > +{
> > > + return true;
> > > +}
> >
> > This doesn't look consistent, probably not correct. What if
> > I pass an offline CPU here? Is it still preferred?
>
> The preferred CPU state follows the online state. This was done by the
> change below in set_cpu_online(). So when a CPU goes offline, it will be
> removed from the preferred mask too.
> By design, I wanted preferred to always be a subset of online:
>
> preferred <= online <= possible.
>
> > Later you say that preferred CPU is online + STEAL-approved one.
> > So in non-paravirtualized case, I believe, you should consider
>
> There it would clearly be the same as the online CPUs.
In the PARAVIRT-off case you have no cpu_preferred_mask at all, and always
return true. So, asking again: does cpu_preferred() work correctly in
this case?
From what you said, it should be:
+#ifdef CONFIG_PARAVIRT
...
+#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+ return cpu_online(cpu);
+}
+#endif
> > that only online CPUs are preferred. What about dying CPUs? Can
> > they be preferred too?
>
> When there is no CPU hotplug, preferred will be a subset of online.
>
> Let's look at different cases with CPU hotplug,
> when STEAL_MONITOR is on and there is high steal time.
>
> Let's say a 600-CPU system with SMT.
>
> Case 1:
> CPU 500 was offline. It would have its preferred bit=0. After a while
> there was high steal time, and preferred_cpus = <0-399>, and once the
> contention was gone, since it uses cpu_smt_mask, it would set 500's
> preferred bit=1, even though it is offline.
>
> Case 2:
> all online CPUs were preferred. 500 was offline. after a while there was
> high steal and while iterating through cpu_smt_mask, after say 499 was done,
> 500 is brought online. that would set it in preferred.
> Since it was part of the mask, 500 will be marked preferred=0.
> That's ok. It was meant to be anyway.
>
> Case 3:
> All online CPUs were preferred; 500 was offline. After a while there was
> high steal, and preferred_cpus = <0-399>, and 500 is brought online. That
> would set it in preferred. In the next cycle, bringing it online causes
> more steal time, and since it is the last CPU in the mask, it will be
> marked as non-preferred. That's ok.
>
> So Case 1 is the one where the construct is broken.
> This is solvable by checking the online state in steal time handling code.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d3b2bcb6008c..bad091f1f604 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11329,7 +11329,7 @@ void sched_steal_detection_work(struct work_struct *work)
> if (cpumask_equal(cpu_smt_mask(last_cpu), cpu_smt_mask(this_cpu)))
> return;
> - for_each_cpu(tmp_cpu, cpu_smt_mask(last_cpu)) {
> + for_each_cpu_and(tmp_cpu, cpu_smt_mask(last_cpu), cpu_online_mask) {
> set_cpu_preferred(tmp_cpu, false);
> if (tick_nohz_full_cpu(tmp_cpu))
> tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
> @@ -11345,7 +11345,7 @@ void sched_steal_detection_work(struct work_struct *work)
> if (first_cpu >= nr_cpu_ids)
> return;
> - for_each_cpu(tmp_cpu, cpu_smt_mask(first_cpu))
> + for_each_cpu_and(tmp_cpu, cpu_smt_mask(first_cpu), cpu_online_mask)
> set_cpu_preferred(tmp_cpu, true);
You don't need a loop:

	cpumask_and(&__cpu_preferred_mask, cpu_online_mask, cpu_smt_mask(first_cpu));
> }
I haven't gone through patch 6 yet :)
> I had thought of this scenario. I hadn't seen it from a consistency
> point of view. It should be consistent since it is exposed to the user.
>
> Functionality-wise it was okay, since the current code has enough checks
> to schedule only on online CPUs. Even is_cpu_allowed() returns true only
> if the CPU is online. But I get the point, and the above diff should
> address it.
Yeah, your reasoning looks correct. To stay on the safe side, I'd add
assertions for that here and there, like:

#ifdef CONFIG_WHATEVER_DEBUG
	WARN_ON(!cpumask_subset(cpu_preferred_mask, cpu_online_mask));
#endif
> > At least, please run cpumask_check() on the argument.
>
> It is set either within online handling or in PATCH 15/17 by iterating
> through cpu_smt_mask. That should always yield cpu < nr_cpu_ids.
>
> I didn't get why cpumask_check is needed again.
cpumask_check() is a debugging feature. It's a no-op unless
CONFIG_DEBUG_PER_CPU_MAPS is enabled, and every cpumask function that
takes a raw CPU is expected to use it.

In your case, it's needed when CONFIG_PARAVIRT=off.
> > There's a top-comment describing all the system cpumasks. Except for
> > cpu_dying, it's nice and complete. Can you describe your new creature
> > there?
>
> Ok. I can add a comment there.
>
> >
> > Finally, I don't think that __cpu_preferred_mask should depend on
> > PARAVIRT config. Consider cpu_present_mask. It mirrors cpu_possible_mask
> > if hotplug is disabled, but it's still a real mask even in that case.
> > The way you're doing it, you spread CONFIG_PARAVIRT ifdefery pretty
> > much anywhere where people might want to use this new mask for anything
> > except for testing a bit.
> >
>
> One concern you had raised earlier was bloating of the code for systems
> with CONFIG_PARAVIRT=n.
>
> Maybe in some of the hot paths we could do an IS_ENABLED(CONFIG_PARAVIRT)
> check, and that should be ok?
My point is that there will most likely be users of PARAVIRT who do not
need this machinery, and will not be happy about bloating their kernels
with another (for them) useless feature. Moreover, it's O(N^2) in some
cases.

I suggest adding, for example, a config PREFERRED_CPUS that would select
PARAVIRT and would be disabled by default.
Regardless, whatever you decide, please keep all the cpu_preferred_mask
ifdefery at the cpumask level. For example, in patch #5:
+#ifdef CONFIG_PARAVIRT
+static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
+{
+ return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
+#else
+static inline bool task_can_run_on_preferred_cpu(struct task_struct *p)
+{
+ return true;
+}
+#endif
That looks wrong to me. Instead, either declare cpu_preferred_mask
unconditionally, and maintain it well, or
+#ifdef CONFIG_PREFERRED_CPUS
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_online_mask
+#endif
This way, your higher level code will be clean.
Thanks,
Yury
Thread overview: 27+ messages
2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-04-07 20:27 ` Yury Norov
2026-04-08 9:16 ` Shrikanth Hegde
2026-04-08 17:57 ` Yury Norov [this message]
2026-04-07 19:19 ` [PATCH v2 04/17] sysfs: Add preferred CPU file Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-04-08 1:05 ` Yury Norov
2026-04-08 12:56 ` Shrikanth Hegde
2026-04-08 18:09 ` Yury Norov
2026-04-07 19:19 ` [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 10/17] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 13/17] sched/core: Introduce a simple steal monitor Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 14/17] sched/core: Compute steal values at regular intervals Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
2026-04-07 19:50 ` [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-08 10:14 ` Hillf Danton
2026-04-08 13:49 ` Shrikanth Hegde