From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Yury Norov <ynorov@nvidia.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, tglx@linutronix.de,
yury.norov@gmail.com, gregkh@linuxfoundation.org,
pbonzini@redhat.com, seanjc@google.com, kprateek.nayak@amd.com,
vschneid@redhat.com, iii@linux.ibm.com, huschle@linux.ibm.com,
rostedt@goodmis.org, dietmar.eggemann@arm.com, mgorman@suse.de,
bsegall@google.com, maddy@linux.ibm.com, srikar@linux.ibm.com,
hdanton@sina.com, chleroy@kernel.org, vineeth@bitbyteword.org,
joelagnelf@nvidia.com
Subject: Re: [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask
Date: Wed, 8 Apr 2026 14:46:03 +0530 [thread overview]
Message-ID: <0d8412de-e18a-476f-9eb6-9a977f4474a3@linux.ibm.com> (raw)
In-Reply-To: <adVomnfu27N_OjUT@yury>
Hi Yury. Thanks for going through the series.
On 4/8/26 1:57 AM, Yury Norov wrote:
> On Wed, Apr 08, 2026 at 12:49:36AM +0530, Shrikanth Hegde wrote:
>> This patch does
>> - Declare and Define cpu_preferred_mask.
>> - Get/Set helpers for it.
>>
>> Values are set/clear by the scheduler by detecting the steal time values.
>>
>> A CPU is set to preferred when it comes online. Later it may be
>> marked as non-preferred depending on steal time values with
>> STEAL_MONITOR enabled.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> include/linux/cpumask.h | 22 ++++++++++++++++++++++
>> kernel/cpu.c | 6 ++++++
>> kernel/sched/core.c | 5 +++++
>> 3 files changed, 33 insertions(+)
>>
>> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
>> index 80211900f373..80c5cc13b8ad 100644
>> --- a/include/linux/cpumask.h
>> +++ b/include/linux/cpumask.h
>> @@ -1296,6 +1296,28 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>>
>> #endif /* NR_CPUS > 1 */
>>
>> +/*
>> + * All related wrappers kept together to avoid too many ifdefs
>> + * See Documentation/scheduler/sched-arch.rst for details
>> + */
>> +#ifdef CONFIG_PARAVIRT
>> +extern struct cpumask __cpu_preferred_mask;
>> +#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
>> +#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
>> +
>> +static __always_inline bool cpu_preferred(unsigned int cpu)
>> +{
>> + return cpumask_test_cpu(cpu, cpu_preferred_mask);
>> +}
>> +#else
>> +static __always_inline bool cpu_preferred(unsigned int cpu)
>> +{
>> + return true;
>> +}
>
> This doesn't look consistent, probably not correct. What if
> I pass an offline CPU here? Is it still preferred?
>
The preferred CPU state follows the online state; that is done by the change
below in set_cpu_online(). So when a CPU goes offline, it is removed from
the preferred mask too.
The design principle I wanted is that preferred is always a subset of online:
preferred <= online <= possible.
> Later you say that preferred CPU is online + STEAL-approved one.
> So in non-paravirtualized case, I believe, you should consider
There it would clearly be the same as the online CPUs.
> that only online CPUs are preferred. What about dying CPUs? Can
> they be preferred too?
When there is no CPU hotplug, preferred will be a subset of online.
Let's see the different cases with CPU hotplug, when STEAL_MONITOR is on
and there is high steal time. Let's say it is a 600-CPU system with SMT.
Case 1:
CPU 500 was offline, so it would have its preferred bit = 0. After a while
there was high steal time and preferred_cpus = <0-399>. Once the contention
was gone, since the code iterates through cpu_smt_mask, it would set CPU 500's
preferred bit = 1, even though it is offline.
Case 2:
All online CPUs were preferred and CPU 500 was offline. After a while there
was high steal, and while iterating through cpu_smt_mask, after say CPU 499
was done, CPU 500 was brought online, which set it in preferred.
Since it was part of the mask, CPU 500 will then be marked preferred = 0.
That's ok; it was meant to be anyway.
Case 3:
All online CPUs were preferred and CPU 500 was offline. After a while there
was high steal and preferred_cpus = <0-399>, then CPU 500 was brought online,
which set it in preferred. In the next cycle, bringing it online causes more
steal time, and since it is the last CPU in the mask, it will be marked as
non-preferred. That's ok.
So Case 1 is the one where the construct is broken. It is solvable by
checking the online state in the steal time handling code:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3b2bcb6008c..bad091f1f604 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11329,7 +11329,7 @@ void sched_steal_detection_work(struct work_struct *work)
if (cpumask_equal(cpu_smt_mask(last_cpu), cpu_smt_mask(this_cpu)))
return;
- for_each_cpu(tmp_cpu, cpu_smt_mask(last_cpu)) {
+ for_each_cpu_and(tmp_cpu, cpu_smt_mask(last_cpu), cpu_online_mask) {
set_cpu_preferred(tmp_cpu, false);
if (tick_nohz_full_cpu(tmp_cpu))
tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
@@ -11345,7 +11345,7 @@ void sched_steal_detection_work(struct work_struct *work)
if (first_cpu >= nr_cpu_ids)
return;
- for_each_cpu(tmp_cpu, cpu_smt_mask(first_cpu))
+ for_each_cpu_and(tmp_cpu, cpu_smt_mask(first_cpu), cpu_online_mask)
set_cpu_preferred(tmp_cpu, true);
}
I had thought of this scenario, but I hadn't seen it from a consistency
point of view. It should be consistent, since it is exposed to the user.
Functionality-wise it was okay, since the current code has enough checks to
schedule only on online CPUs; even is_cpu_allowed() returns true only if
the CPU is online. But I get the point, and the above diff should address it.
>
> At least, please run cpumask_check() on the argument.
It is set either when the CPU comes online or in PATCH 15/17 by iterating
through cpu_smt_mask. That should always yield cpu < nr_cpu_ids, so I
didn't get why cpumask_check() is needed again.
>
> There's a top-comment describing all the system cpumasks. Except for
> cpu_dying, it's nice and complete. Can you describe your new creature
> there?
Ok. I can add a comment there.
>
> Finally, I don't think that __cpu_preferred_mask should depend on
> PARAVIRT config. Consider cpu_present_mask. It mirrors cpu_possible_mask
> if hotplug is disabled, but it's still a real mask even in that case.
> The way you're doing it, you spread CONFIG_PARAVIRT ifdefery pretty
> much anywhere where people might want to use this new mask for anything
> except for testing a bit.
>
One concern you had raised earlier was bloating of the code for systems
with CONFIG_PARAVIRT=n.
Maybe in some of the hot paths we could do an IS_ENABLED(CONFIG_PARAVIRT)
check, and that should be ok? If so, we can get rid of a lot of this
ifdefery. cpu_preferred(cpu) is a single bit check and shouldn't be that
expensive.
> Thanks,
> Yury
>
>> +static __always_inline void set_cpu_preferred(unsigned int cpu, bool preferred) { }
>> +#endif
>> +
>> #define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
>>
>> #if NR_CPUS <= BITS_PER_LONG
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index bc4f7a9ba64e..2d4d037680d4 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -3137,6 +3137,12 @@ void set_cpu_online(unsigned int cpu, bool online)
>> if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
>> atomic_dec(&__num_online_cpus);
>> }
>> +
>> + /*
>> + * An online CPU is by default assumed to be preferred
>> + * Unitl STEAL_MONITOR changes it
>> + */
>> + set_cpu_preferred(cpu, online);
>> }
Here, preferred follows the online state.
>>
>> /*
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index f351296922ac..7ea05a7a717b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -11228,3 +11228,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
>> p->sched_class->prio_changed(rq, p, ctx->prio);
>> }
>> }
>> +
>> +#ifdef CONFIG_PARAVIRT
>> +struct cpumask __cpu_preferred_mask __read_mostly;
>> +EXPORT_SYMBOL(__cpu_preferred_mask);
>> +#endif
>> --
>> 2.47.3