From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Hillf Danton <hdanton@sina.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
Sean Christopherson <seanjc@google.com>,
vincent.guittot@linaro.org, yury.norov@gmail.com,
kprateek.nayak@amd.com
Subject: Re: [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff
Date: Thu, 9 Apr 2026 15:57:20 +0530
Message-ID: <bddfcc31-1030-4133-b4e5-522cafcd7ca9@linux.ibm.com>
In-Reply-To: <20260409051556.1637-1-hdanton@sina.com>
Hi Hillf.
On 4/9/26 10:45 AM, Hillf Danton wrote:
> On Wed, 8 Apr 2026 19:19:05 +0530 Shrikanth Hegde wrote:
>> On 4/8/26 3:44 PM, Hillf Danton wrote:
>>> On Wed, 8 Apr 2026 00:49:33 +0530 Shrikanth Hegde wrote:
>>>> Core idea is:
>>>> - Maintain set of CPUs which can be used by workload. It is denoted as
>>>> cpu_preferred_mask
>>>> - Periodically compute the steal time. If steal time is high/low based
>>>> on the thresholds, either reduce/increase the preferred CPUs.
>>>> - If a CPU is marked as non-preferred, push the task running on it if
>>>> possible.
>>>> - Use this CPU state in wakeup and load balance to ensure tasks run
>>>> within preferred CPUs.
>>>>
>>>> For the host kernel, there is no steal time, so no changes to its preferred
>>>> CPUs. So series would affect only the guest kernels.
>>>>
>>> Changes are added to guest in order to detect if pCPU is overloaded, and if
>>> that is true (I mean it is layer violation), why not ask the pCPU governor,
>>> hypervisor, to monitor the loads on pCPU and migrate vCPUs forth and back
>>> if necessary.
>>>
>>
>> AFAIK, there in no information in the host scheduler on what
>> each vCPU is running. It maybe holding a mutex, spinlock with irq disabled
>
> This is what layer means (particularly in the data center environment).
>
Host / hypervisor scheduler:
- Schedules vCPU threads as opaque entities.
- Has no visibility into:
  - whether a vCPU is holding a spinlock
  - whether IRQs are disabled
  - whether a guest mutex is contended
  - guest scheduler state
- Can only ensure fairness between vCPUs.

Guest scheduler:
- Knows exact task-level semantics:
  - lock ownership
  - preemption state
  - affinity constraints
- But does not control pCPUs directly, unless there is vCPU pinning.
Steal time is precisely the contract boundary between those layers.
So this is not a layer violation. The guest is acting on its own CPUs
based on a hint which the host already provides.
An actual layer violation would be:
- host peeking into guest scheduler data
- host deciding which guest vCPUs are “important”
- host understanding guest locks or IRQ state
Or maybe I am not understanding what you mean by layer violation.
If so, please explain.
Why is steal time reported today?
So that the guest/host can make an appropriate decision, right?
When you see high steal values, you have two choices:
either increase the underlying resource by re-partitioning the host
with more cores, or reduce the incoming demand from the guest to a level
the host can meet. If the host is already at max cores,
then only the second option remains.
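For reference, this is roughly how a "high steal value" can be observed from inside a guest today. It is only a minimal sketch: it assumes the usual /proc/stat field order (user nice system idle iowait irq softirq steal guest guest_nice), and the helper names are made up for illustration. It is not the in-kernel computation used by the series.

```python
def parse_cpu_line(line):
    """Return (steal, total) in jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    steal = values[7] if len(values) > 7 else 0
    # guest and guest_nice are already accounted in user and nice, so only
    # the first eight fields are summed to avoid double counting.
    total = sum(values[:8])
    return steal, total


def steal_percent(before, after):
    """Percentage of elapsed CPU time reported as steal between two samples."""
    steal0, total0 = parse_cpu_line(before)
    steal1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal1 - steal0) / elapsed
```

To use it, read the first line of /proc/stat twice, some interval apart, and pass both lines to steal_percent().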
One could ask: with this series, high steal values may no longer be seen, so
how will the system admin know to re-size the host? Just look at preferred vs
online CPUs. If they are not the same, then there was contention and preferred
became a subset of online. We might have to update the steal time section of
the documentation.
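The preferred vs online comparison could be scripted along these lines. This is a sketch under assumptions: it expects both masks in the usual cpulist text format (e.g. "0-3,8-11"), the helper names are hypothetical, and the exact sysfs path of the preferred CPU file added by patch 04 is not spelled out here, so the functions take the file contents as strings.

```python
def parse_cpulist(text):
    """Parse a cpulist string such as '0-3,8-11' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def backed_off_cpus(online_text, preferred_text):
    """Return the sorted list of CPUs that are online but not preferred."""
    return sorted(parse_cpulist(online_text) - parse_cpulist(preferred_text))
```

A non-empty result would mean preferred has become a strict subset of online, i.e. the guest backed off at some point because of contention.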
>> or maybe in interrupt context. Moving/migrating the vCPUs threads without
>> that knowledge will hurt the guest. And it has to ensure fairness.
>>
> We have to pay the cost for vCPU.
>
>> This has to work across different archs, some have linux as hypervisor, some
>> has non-linux hypervisor such as powerpc, s390.
>>
> Yeah, in the car cockpit product environment in Shenzhen Linux, Android and
> XYZ guests run on QNX, and your steal time approach looks half baked.
>
They likely don't have this problem. IIUC, automotive hypervisors prefer
deterministic behavior, and having steal time brings unbounded latency.
If the guests are not Linux, then yes, the same logic will have to be there in each guest.
But that problem exists in the other direction too. The guest has to inform the host somehow
which of its vCPU threads are important. That is going to be way more complex IMHO.
Even in Linux we don't have that interface today, and it would then have to be repeated in
every non-Linux guest. One could say that is even worse.
If the guests are all indeed Linux, then the solution would work just fine.
Just to reiterate:
- Host kernel: no change, as it has no steal time construct. Minimal overhead.
- Guest without steal time: no functional change. Minimal overhead.
- Guest with steal time, NO_STEAL_MONITOR: no functional change. Minimal overhead.
- Guest with steal time, STEAL_MONITOR: functional changes, i.e. steal-driven vCPU backoff.
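To illustrate the last case, here is a rough model of the threshold logic. It is not the code from the series: the function name, the thresholds (20% and 5%) and the step of one CPU are all hypothetical values picked for the example, and the direction tracking from patch 16 is left out.

```python
def next_preferred_count(preferred, online, steal_pct,
                         high=20.0, low=5.0, step=1, steal_monitor=True):
    """Return the number of preferred CPUs for the next interval.

    All thresholds and the step size are illustrative assumptions.
    """
    if not steal_monitor:
        # NO_STEAL_MONITOR: no functional change, keep the current value.
        return preferred
    if steal_pct > high:
        # High steal: back off, but always keep at least one preferred CPU.
        return max(1, preferred - step)
    if steal_pct < low:
        # Low steal: grow back, but never beyond the online CPUs.
        return min(online, preferred + step)
    # Between the thresholds: leave the preferred set alone.
    return preferred
```

The gap between the low and high thresholds is what keeps this toy version from flipping a CPU in and out of the preferred set on every interval.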
>> Steal time in guest is common construct in all archs. I don't think such
>> commonality exists in host schedulers.
>>
>> If done in guest, guest actually knows what it is running and whats more important.
>> It can make better decisions IMHO.
Thread overview: 30+ messages
2026-04-07 19:19 [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 01/17] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 02/17] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 03/17] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-04-07 20:27 ` Yury Norov
2026-04-08 9:16 ` Shrikanth Hegde
2026-04-08 17:57 ` Yury Norov
2026-04-07 19:19 ` [PATCH v2 04/17] sysfs: Add preferred CPU file Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 05/17] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-04-08 1:05 ` Yury Norov
2026-04-08 12:56 ` Shrikanth Hegde
2026-04-08 18:09 ` Yury Norov
2026-04-07 19:19 ` [PATCH v2 06/17] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 07/17] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 08/17] sched/rt: Select a preferred CPU for wakeup and pulling rt task Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 09/17] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 10/17] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 11/17] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 12/17] sched/feature: Add STEAL_MONITOR feature Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 13/17] sched/core: Introduce a simple steal monitor Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 14/17] sched/core: Compute steal values at regular intervals Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 15/17] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 16/17] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-04-07 19:19 ` [PATCH v2 17/17] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
2026-04-07 19:50 ` [PATCH v2 00/17] sched/paravirt: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-04-08 10:14 ` Hillf Danton
2026-04-08 13:49 ` Shrikanth Hegde
2026-04-09 5:15 ` Hillf Danton
2026-04-09 10:27 ` Shrikanth Hegde [this message]
2026-04-10 9:47 ` Shrikanth Hegde