Linux-HyperV List
* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-01-30 20:32 UTC
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXz8ldAeoWwGIxdu@skinsburskii.localdomain>

On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > hypervisor deposited pages.
> > > > > > > 
> > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > management is implemented.
> > > > > > 
> > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > and would work without any issue for L1VH.
> > > > > > 
> > > > > 
> > > > > No, it won't work, and the hypervisor-deposited pages won't be withdrawn.
> > > > 
> > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > right? What other deposited pages would be left?
> > > > 
> > > 
> > > The driver deposits two types of pages: one for the guests (withdrawn
> > > upon guest shutdown) and the other - for the host itself (never
> > > withdrawn).
> > > See hv_call_create_partition, for example: it deposits pages for the
> > > host partition.
> > 
> > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > Also, can't we forcefully kill all running partitions in module_exit and
> > then reclaim memory? Would this help with kernel consistency
> > irrespective of userspace behavior?
> > 
> 
> It would, but this is sloppy and cannot be a long-term solution.
> 
> It is also not reliable. We have no hook to prevent kexec. So if we fail
> to kill the guest or reclaim the memory for any reason, the new kernel
> may still crash.

Actually guests won't be running by the time we reach our module_exit
function during a kexec. Userspace processes would've been killed by
then.

Also, why is this sloppy? Isn't this what module_exit should be
doing anyway? If someone unloads our module we should be trying to
clean everything up (including killing guests) and reclaim memory.

In any case, we can BUG() out if we fail to reclaim the memory. That would
stop the kexec.

This is a better solution than disabling KEXEC outright: our driver
would make the best possible effort to make kexec work.
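
The exit-path idea above can be sketched in plain userspace C. This is
only a model under stated assumptions, not driver code: struct
deposit_state, withdraw_fn and both fake withdraw helpers are
hypothetical names, and abort() merely stands in for BUG().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: pages deposited on behalf of the host partition. */
struct deposit_state {
	size_t host_pages;
};

/*
 * Hypothetical withdraw operation: asks the hypervisor to give back up to
 * 'requested' pages and returns how many were actually returned.
 */
typedef size_t (*withdraw_fn)(size_t requested);

/* Returns 0 once every host page is reclaimed, -1 if no progress is made. */
static int reclaim_host_pages(struct deposit_state *st, withdraw_fn withdraw)
{
	while (st->host_pages > 0) {
		size_t got = withdraw(st->host_pages);

		if (got == 0)
			return -1;
		if (got > st->host_pages)
			got = st->host_pages;
		st->host_pages -= got;
	}
	return 0;
}

/* Models the proposed exit path; abort() stands in for BUG(). */
static void driver_exit(struct deposit_state *st, withdraw_fn withdraw)
{
	if (reclaim_host_pages(st, withdraw) != 0)
		abort();
}

/* Fake hypervisor that returns at most four pages per call. */
static size_t withdraw_in_batches(size_t requested)
{
	return requested < 4 ? requested : 4;
}

/* Fake hypervisor that never returns anything. */
static size_t withdraw_nothing(size_t requested)
{
	(void)requested;
	return 0;
}
```

With withdraw_in_batches the loop drains all pages and driver_exit()
returns normally; with withdraw_nothing the reclaim reports failure and
the exit path would stop instead of continuing with pages still
deposited.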

> 
> There are two long-term solutions:
>  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.

I honestly think we should focus efforts on making kexec work rather
than finding ways to prevent it.

Thanks,
Anirudh

>  2. Hand the shared kernel state over to the new kernel.
> 
> I sent a series for the first one. The second one is not ready yet.
> Anything else is neither robust nor reliable, so I don’t think it makes
> sense to pursue it.
> 
> Thanks,
> Stanislav
> 
> 
> > Thanks,
> > Anirudh.
> > 
> > > 
> > > Thanks,
> > > Stanislav
> > > 
> > > > Thanks,
> > > > Anirudh.
> > > > 
> > > > > Also, kernel consistency must not depend on userspace behavior.
> > > > > 
> > > > > > Also, I don't think it is reasonable at all that someone needs to
> > > > > > disable basic kernel functionality such as kexec in order to use our
> > > > > > driver.
> > > > > > 
> > > > > 
> > > > > It's a temporary measure until proper page lifecycle management is
> > > > > supported in the driver.
> > > > > Mutual exclusion of the driver and kexec is a given and thus should be
> > > > > explicitly stated in the Kconfig.
> > > > > 
> > > > > Thanks,
> > > > > Stanislav
> > > > > 
> > > > > > Thanks,
> > > > > > Anirudh.
> > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > ---
> > > > > > >  drivers/hv/Kconfig |    1 +
> > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > > > >  	depends on PAGE_SIZE_4KB
> > > > > > > +	depends on !KEXEC
> > > > > > >  	select EVENTFD
> > > > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > > > >  	select HMM_MIRROR
> > > > > > > 
> > > > > > > 


* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Anirudh Rayabharam @ 2026-01-30 20:22 UTC
  To: Stanislav Kinsburskii
  Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXz9nssiRC1DUFSU@skinsburskii.localdomain>

On Fri, Jan 30, 2026 at 10:51:10AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 06:43:09PM +0000, Anirudh Rayabharam wrote:
> > On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> > > On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > > > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > > > 
> > > > > > > From: Andreea Pintilie <anpintil@microsoft.com>
> > > > > > > 
> > > > > > > Query the hypervisor for integrated scheduler support and use it if
> > > > > > > configured.
> > > > > > > 
> > > > > > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > > > > > root scheduler allows the root partition to schedule guest vCPUs across
> > > > > > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > > > > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > > > > > scheduling entirely to the hypervisor.
> > > > > > > 
> > > > > > > Direct virtualization introduces a new privileged guest partition type - L1
> > > > > > > Virtual Host (L1VH) — which can create child partitions from its own
> > > > > > > resources. These child partitions are effectively siblings, scheduled by
> > > > > > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > > > > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > > > > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > > > > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > > > > > (typically round-robin across all allocated physical CPUs). As a result,
> > > > > > > the system may appear to "steal" time from the L1VH and its children.
> > > > > > > 
> > > > > > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > > > > > guests across its "physical" cores, effectively emulating root scheduler
> > > > > > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > > > > > rest of the system.
> > > > > > > 
> > > > > > > The integrated scheduler is controlled by the root partition and gated by
> > > > > > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > > > > > supports the integrated scheduler. The L1VH partition must then check if it
> > > > > > > is enabled by querying the corresponding extended partition property. If
> > > > > > > this property is true, the L1VH partition must use the root scheduler
> > > > > > > logic; otherwise, it must use the core scheduler.
> > > > > > > 
> > > > > > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > ---
> > > > > > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > > > > > >  include/hyperv/hvhdk_mini.h |    6 +++
> > > > > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > > > 
> > > 
> > >  <snip>
> > > 
> > > > > > > -root_sched_deinit:
> > > > > > > -	root_scheduler_deinit();
> > > > > > > -	return err;
> > > > > > >  }
> > > > > > > 
> > > > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > > > >  {
> > > > > > > -	/*
> > > > > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > > > -	 */
> > > > > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > -					      0, &mshv_root.vmm_caps,
> > > > > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > +						0, &mshv_root.vmm_caps,
> > > > > > > +						sizeof(mshv_root.vmm_caps));
> > > > > > > +	if (ret) {
> > > > > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > > > +		return ret;
> > > > > > > +	}
> > > > > > 
> > > > > > This is a functional change that isn't mentioned in the commit message.
> > > > > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > > > > as all disabled? Presumably there are older versions of the hypervisor that
> > > > > > don't support the requirements described in the original comment, but
> > > > > > perhaps they are no longer relevant?
> > > > > > 
> > > > > 
> > > > > To fail is now the only option for the L1VH partition. It must discover
> > > > > the scheduler type. Without this information, the partition cannot
> > > > > operate. The core scheduler logic will not work with an integrated
> > > > > scheduler, and vice versa.
> > > > 
> > > > I don't think we need to fail here. If we don't find vmm caps, that
> > > > means we are on an older hypervisor that supports l1vh but not
> > > > integrated scheduler (yes, such a version exists). In this case since
> > > > integrated scheduler is not supported by the hypervisor, the core
> > > > scheduler logic will work.
> > > > 
> > > 
> > > The older hypervisor version won't have the integrated scheduler
> > > capability bit.
> > > And we can't operate in core scheduler mode if the integrated scheduler is
> > > enabled underneath us.
> > 
> > The older hypervisor won't have the integrated scheduler capability bit.
> > This means that the older hypervisor doesn't support integrated
> > scheduler (this is how vmm caps work: if the bit doesn't exist or
> > vmm caps themselves don't exist the feature should be assumed as not
> > available). If the hypervisor doesn't support integrated scheduler in the
> > first place, it can't be enabled underneath us. So, it is safe to
> > operate in core scheduler mode.
> > 
> 
> We can’t tell whether the hypervisor is older and simply doesn’t have
> the VMM caps bit, or whether we just failed to fetch the VMM caps.

If we failed to fetch the VMM caps, i.e. the hypervisor doesn't support
the vmm caps property, we must assume that all the bits in vmm caps are
0 (i.e. no features are available). This is how vmm capabilities are
supposed to be interpreted. This is something I checked with the
hypervisor team some time back.
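
The interpretation described above can be sketched in userspace C as
follows. Everything here is an illustrative assumption: struct vmm_caps,
the CAP_INTEGRATED_SCHEDULER bit position, query_caps_fn and the two
fake query functions are hypothetical and do not reflect the real
hypercall interface.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the VMM capabilities bitmap. */
struct vmm_caps {
	uint64_t bits;
};

/* Hypothetical bit position, for illustration only. */
#define CAP_INTEGRATED_SCHEDULER (1ULL << 0)

/*
 * Hypothetical query: returns 0 and fills *caps on success, or a negative
 * value when the property cannot be fetched.
 */
typedef int (*query_caps_fn)(struct vmm_caps *caps);

/* Never fails: a failed query is treated as "no features available". */
static void init_vmm_caps(struct vmm_caps *caps, query_caps_fn query)
{
	if (query(caps) != 0)
		memset(caps, 0, sizeof(*caps));
}

static int use_integrated_scheduler(const struct vmm_caps *caps)
{
	return (caps->bits & CAP_INTEGRATED_SCHEDULER) != 0;
}

/* Fake older hypervisor: the property is unsupported, output is garbage. */
static int query_old_hv(struct vmm_caps *caps)
{
	caps->bits = 0xdeadbeef;
	return -1;
}

/* Fake newer hypervisor: the property exists and the bit is set. */
static int query_new_hv(struct vmm_caps *caps)
{
	caps->bits = CAP_INTEGRATED_SCHEDULER;
	return 0;
}
```

Note that this sketch maps every failure to "all bits zero", so it does
not separate an unsupported property from any other fetch failure, which
is the concern raised in the quoted text below.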

> 
> In other words, we can’t distinguish between “an older hypervisor
> without integrated scheduler support” and “a newer hypervisor with an
> integrated scheduler, but we failed to fetch the VMM caps”.
> 
> But for completeness: are you saying there is an older hypervisor
> version that supports L1VH, but does not support VMM caps?

I don't know how much of the Azure fleet still runs it but yes such a
hypervisor version exists.

Thanks,
Anirudh

> 
> Thanks, Stanislav
> 
> > Thanks,
> > Anirudh.


* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-30 19:47 UTC
  To: Stanislav Kinsburskii, Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXz7Y7As4XC9rNeL@skinsburskii.localdomain>

On 1/30/26 10:41, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 05:17:52PM +0000, Anirudh Rayabharam wrote:
>> On Thu, Jan 29, 2026 at 06:59:31PM -0800, Mukesh R wrote:
>>> On 1/28/26 15:08, Stanislav Kinsburskii wrote:
>>>> On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
>>>>> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
>>>>>> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>>>>>>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>>>>>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>>>>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>>>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>>>>>>> management is implemented.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>         drivers/hv/Kconfig |    1 +
>>>>>>>>>>>>>>         1 file changed, 1 insertion(+)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>>>>>>         	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>>>>>>         	# no particular order, making it impossible to reassemble larger pages
>>>>>>>>>>>>>>         	depends on PAGE_SIZE_4KB
>>>>>>>>>>>>>> +	depends on !KEXEC
>>>>>>>>>>>>>>         	select EVENTFD
>>>>>>>>>>>>>>         	select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>>>>>>         	select HMM_MIRROR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>>>>>>> and it was fine?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>>>>>>> will be affected as well.
>>>>>>>>>>>
>>>>>>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>>>>>>> further development would be without a very important and major feature,
>>>>>>>>>>> right?
>>>>>>>>>>
>>>>>>>>>> This is an option. But until it's implemented and merged, a user of the
>>>>>>>>>> mshv driver gets into a situation where kexec is broken in a non-obvious way.
>>>>>>>>>> The system may crash at any time after kexec, depending on whether the
>>>>>>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>>>>>>> bad user experience.
>>>>>>>>>
>>>>>>>>> I understand that. But with this we cannot collect core and debug any
>>>>>>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>>>>>>> for update via notifier or some other quick hack. Did you already
>>>>>>>>> explore that and didn't find anything, hence this?
>>>>>>>>>
>>>>>>>>
>>>>>>>> This quick hack you mention isn't quick in the upstream kernel as there
>>>>>>>> is no hook to interrupt kexec process except the live update one.
>>>>>>>
>>>>>>> That's the one we want to interrupt and block right? crash kexec
>>>>>>> is ok and should be allowed. We can document we don't support kexec
>>>>>>> for update for now.
>>>>>>>
>>>>>>>> I sent an RFC for that one but given today's conversation details it
>>>>>>>> won't be accepted as is.
>>>>>>>
>>>>>>> Are you talking about this?
>>>>>>>
>>>>>>>            "mshv: Add kexec safety for deposited pages"
>>>>>>>
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>> Making mshv mutually exclusive with kexec is the only viable option for
>>>>>>>> now given time constraints.
>>>>>>>> It is intended to be replaced with proper page lifecycle management in
>>>>>>>> the future.
>>>>>>>
>>>>>>> Yeah, that could take a long time and imo we cannot just disable KEXEC
>>>>>>> completely. What we want is just block kexec for updates from some
>>>>>>> mshv file for now, we can print during boot that kexec for updates is
>>>>>>> not supported on mshv. Hope that makes sense.
>>>>>>>
>>>>>>
>>>>>> The trade-off here is between disabling kexec support and having the
>>>>>> kernel crash after kexec in a non-obvious way. This affects both regular
>>>>>> kexec and crash kexec.
>>>>>
>>>>> crash kexec on baremetal is not affected, hence disabling that
>>>>> doesn't make sense as we can't debug crashes then on bm.
>>>>>
>>>>
>>>> Bare metal support is not currently relevant, as it is not available.
>>>> This is the upstream kernel, and this driver will be accessible to
>>>> third-party customers beginning with kernel 6.19 for running their
>>>> kernels in Azure L1VH, so consistency is required.
>>>
>>> Well, without crashdump support, customers will not be running anything
>>> anywhere.
>>
>> This is my concern too. I don't think customers will be particularly
>> happy that kexec doesn't work with our driver.
>>
> 
> I wasn't clear earlier, so let me restate it. Today, kexec is not
> supported in L1VH. This is a bug we have not fixed yet. Disabling kexec
> is not a long-term solution. But it is better to disable it explicitly
> than to have kernel crashes after kexec.

I don't think there is disagreement on this. The undesired part is turning
off KEXEC config completely.

Thanks,
-Mukesh


> This does not mean the bug should not be fixed. But the upstream kernel
> has its own policies and merge windows. For kernel 6.19, it is better to
> have a clear kexec error than random crashes after kexec.
> 
> Thanks,
> Stanislav
> 
>> Thanks,
>> Anirudh
>>
>>>
>>> Thanks,
>>> -Mukesh
>>>
>>>> Thanks,
>>>> Stanislav
>>>>
>>>>> Let me think and explore a bit, and if I come up with something, I'll
>>>>> send a patch here. If nothing, then we can do this as last resort.
>>>>>
>>>>> Thanks,
>>>>> -Mukesh
>>>>>
>>>>>
>>>>>> It's a pity we can't apply a quick hack to disable only regular kexec.
>>>>>> However, since crash kexec would hit the same issues, until we have a
>>>>>> proper state transition for deposited pages, the best workaround for now
>>>>>> is to reset the hypervisor state on every kexec, which needs design,
>>>>>> work, and testing.
>>>>>>
>>>>>> Disabling kexec is the only consistent way to handle this in the
>>>>>> upstream kernel at the moment.
>>>>>>
>>>>>> Thanks, Stanislav
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> -Mukesh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Stanislav
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -Mukesh
>>>>>>>>>
>>>>>>>>>> Therefore it should be explicitly forbidden as it's essentially not
>>>>>>>>>> supported yet.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Stanislav
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> -Mukesh
>>>



* Re: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Stanislav Kinsburskii @ 2026-01-30 19:00 UTC
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXzltmZVDhYIDiaw@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 05:09:10PM +0000, Anirudh Rayabharam wrote:
> On Thu, Jan 29, 2026 at 09:03:54AM -0800, Stanislav Kinsburskii wrote:
> > On Thu, Jan 29, 2026 at 04:36:51AM +0000, Anirudh Rayabharam wrote:
> > > On Wed, Jan 28, 2026 at 03:03:51PM -0800, Stanislav Kinsburskii wrote:
> > > > On Wed, Jan 28, 2026 at 04:04:37PM +0000, Anirudh Rayabharam wrote:
> > > > > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> > 
> > <snip>
> > 
> > > > 
> > > > > +static int mshv_irq = -1;
> > > > > +
> > > > 
> > > > Should this be a part of the mshv_root structure?
> > > 
> > > This doesn't need to be globally accessible. It is only used in this file.
> > > So I guess it doesn't need to be in mshv_root. What do you think?
> > > 
> > 
> > Please, see below.
> 
> The below part doesn't make a case for this variable being part of the
> mshv_root structure. Did you miss this part in your reply?
> 

No, I didn't miss it. I just don't see the point of introducing these
variables unless the goal is to weave more logic into the existing flow.

> > 
> > <snip>
> > 
> > > > >  int mshv_synic_cpu_init(unsigned int cpu)
> > > > >  {
> > > > >  	union hv_synic_simp simp;
> > > > >  	union hv_synic_siefp siefp;
> > > > >  	union hv_synic_sirbp sirbp;
> > > > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > > >  	union hv_synic_sint sint;
> > > > > -#endif
> > > > >  	union hv_synic_scontrol sctrl;
> > > > >  	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > > >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > > > > @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > > > >  
> > > > >  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> > > > >  
> > > > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > > > +	if (mshv_irq != -1)
> > > > > +		enable_percpu_irq(mshv_irq, 0);
> > > > > +
> > > > 
> > > > It's better to explicitly separate x86 and arm64 paths with #ifdefs.
> > > > For example:
> > > > 
> > > > #ifdef CONFIG_X86_64
> > > > int setup_cpu_sint() {
> > > >   	/* Enable intercepts */
> > > >   	sint.as_uint64 = 0;
> > > > 	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > 	....
> > > > }
> > > > #endif
> > > > #ifdef CONFIG_ARM64
> > > > int setup_cpu_sint() {
> > > > 	enable_percpu_irq(mshv_irq, 0);
> > > > 
> > > >   	/* Enable intercepts */
> > > >   	sint.as_uint64 = 0;
> > > > 	sint.vector = mshv_interrupt;
> > > > 	....
> > > > }
> > > > #endif
> > > 
> > > This seems unnecessary. We've made the paths that determine
> > > mshv_interrupt separate. Now we can just use that here.
> > > 
> > > There is no need to write two copies of 
> > > 
> > > 	...
> > >    	sint.as_uint64 = 0;
> > >  	sint.vector = <whatever>;
> > > 	...
> > > 
> > > I could do the enable_percpu_irq() inside an ifdef. But do we gain
> > > anything from it? Won't the compiler optimize the current code as well
> > > since mshv_irq will always be -1 whenever HYPERVISOR_CALLBACK_VECTOR is
> > > defined?
> > > 
> > 
> > AFAIU this patch, x86 doesn’t need these variables at all. So it’s better
> > to separate them completely and explicitly.
> > 
> > Also, this isn’t the only place where ARM-specific logic is added. This
> > patch adds ARM-specific logic and tries to weave it into the existing
> > x86 flow.
> > 
> > If it were only one place, that might be OK. But here it happens in
> > several places. That makes the code harder to read and maintain. It also
> > makes future extensions more risky (and they will likely follow). The
> > dependencies are also not obvious. For example, on ARM the interrupt
> > vector comes from ACPI (at least that’s what the comments say). So it’s
> > not right to mix this into the common x86 path even if
> > HYPERVISOR_CALLBACK_VECTOR is a x86-specific define.
> 
> We shouldn't think of this code in terms of X86 & ARM64. It's not about
> arch at all. It's about whether or not we have a pre-defined vector
> (a.k.a HYPERVISOR_CALLBACK_VECTOR). I feel that the current code cleanly
> separates the two cases. The main difference in the two cases is in how
> the vector is determined which is well separated in the code paths. Once
> the vector is determined, how we program it in the synic is the same for
> both cases.
> 

The major question is whether HYPERVISOR_CALLBACK_VECTOR can be
defined on ARM. If it can’t, then it’s effectively an x86-only feature.

The current code separates two cases. You are adding a third one: ARM,
with its own logic. But this is not stated explicitly in the code. As a
result, we now have three cases mixed together, and the flow becomes
spaghetti-like.

If we ever need to support DT on ARM (and we should expect that, because
ACPI on ARM looks odd), we will need to add yet another case to this
mix.

I hope you see the problem. The original code wasn't designed to be
extensible. Since you are adding a new case, this is a good opportunity
to redesign the flow and make it more extensible, instead of adding more
logic on top.

> > 
> > It would be much better to keep this ARM-specific logic in separate,
> > conditionally compiled code. I suggest changing the flow to make this
> > per-arch logic explicit. It will pay off later.
> 
> Most of the code introduced in this patch is conditionally compiled.
> Building code from this patch on x86 will conditionally compile out a
> large majority of it.
> 
> Are you by any chance suggesting we put it in a separate file?
> 

No, I’m not suggesting moving it into a separate file yet.
But making the arch-specific code clearly separated would be a good first step.
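
As a rough illustration of such a separation, here is a minimal
userspace sketch. It is not the driver code: struct sint_cfg,
arch_setup_sint() and EXAMPLE_FIXED_VECTOR are hypothetical, and the
compiler macro __x86_64__ is used in place of CONFIG_X86_64 and
CONFIG_ARM64 only so that the sketch builds outside the kernel.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the SINT register contents. */
struct sint_cfg {
	uint8_t vector;
	int masked;
	int percpu_irq_enabled;
};

/* Illustrative value only; stands in for HYPERVISOR_CALLBACK_VECTOR. */
#define EXAMPLE_FIXED_VECTOR 0xf3

#if defined(__x86_64__)
/* x86 path: the vector is a compile-time constant, no per-CPU IRQ is used. */
static void arch_setup_sint(struct sint_cfg *cfg, uint8_t discovered_vector)
{
	(void)discovered_vector;
	cfg->percpu_irq_enabled = 0;
	cfg->vector = EXAMPLE_FIXED_VECTOR;
	cfg->masked = 0;
}
#else
/* Non-x86 path (e.g. arm64): enable the per-CPU IRQ, use the discovered vector. */
static void arch_setup_sint(struct sint_cfg *cfg, uint8_t discovered_vector)
{
	cfg->percpu_irq_enabled = 1;
	cfg->vector = discovered_vector;
	cfg->masked = 0;
}
#endif
```

Each architecture gets its own helper behind one name, so the common
caller contains no per-arch conditionals, at the cost of duplicating the
few lines that program the SINT.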

Thanks,
Stanislav.

> Thanks,
> Anirudh.
> 
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh.
> > > 
> > > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > >  	/* Enable intercepts */
> > > > >  	sint.as_uint64 = 0;
> > > > > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > > +	sint.vector = mshv_interrupt;
> > > > >  	sint.masked = false;
> > > > >  	sint.auto_eoi = hv_recommend_using_aeoi();
> > > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > > > > @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > > > >  
> > > > >  	/* Doorbell SINT */
> > > > >  	sint.as_uint64 = 0;
> > > > > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > > +	sint.vector = mshv_interrupt;
> > > > >  	sint.masked = false;
> > > > >  	sint.as_intercept = 1;
> > > > >  	sint.auto_eoi = hv_recommend_using_aeoi();
> > > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > > >  			      sint.as_uint64);
> > > > > -#endif
> > > > >  
> > > > >  	/* Enable global synic bit */
> > > > >  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > > > > @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
> > > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > > >  			      sint.as_uint64);
> > > > >  
> > > > > +	if (mshv_irq != -1)
> > > > > +		disable_percpu_irq(mshv_irq);
> > > > > +
> > > > >  	/* Disable Synic's event ring page */
> > > > >  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> > > > >  	sirbp.sirbp_enabled = false;
> > > > > -- 
> > > > > 2.34.1
> > > > > 


* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-30 18:51 UTC
  To: Anirudh Rayabharam
  Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXz7vYzJOkzkj5V3@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 06:43:09PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > > 
> > > > > > From: Andreea Pintilie <anpintil@microsoft.com>
> > > > > > 
> > > > > > Query the hypervisor for integrated scheduler support and use it if
> > > > > > configured.
> > > > > > 
> > > > > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > > > > root scheduler allows the root partition to schedule guest vCPUs across
> > > > > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > > > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > > > > scheduling entirely to the hypervisor.
> > > > > > 
> > > > > > Direct virtualization introduces a new privileged guest partition type - L1
> > > > > > Virtual Host (L1VH) — which can create child partitions from its own
> > > > > > resources. These child partitions are effectively siblings, scheduled by
> > > > > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > > > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > > > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > > > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > > > > (typically round-robin across all allocated physical CPUs). As a result,
> > > > > > the system may appear to "steal" time from the L1VH and its children.
> > > > > > 
> > > > > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > > > > guests across its "physical" cores, effectively emulating root scheduler
> > > > > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > > > > rest of the system.
> > > > > > 
> > > > > > The integrated scheduler is controlled by the root partition and gated by
> > > > > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > > > > supports the integrated scheduler. The L1VH partition must then check if it
> > > > > > is enabled by querying the corresponding extended partition property. If
> > > > > > this property is true, the L1VH partition must use the root scheduler
> > > > > > logic; otherwise, it must use the core scheduler.
> > > > > > 
> > > > > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > ---
> > > > > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > > > > >  include/hyperv/hvhdk_mini.h |    6 +++
> > > > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > > 
> > 
> >  <snip>
> > 
> > > > > > -root_sched_deinit:
> > > > > > -	root_scheduler_deinit();
> > > > > > -	return err;
> > > > > >  }
> > > > > > 
> > > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > > >  {
> > > > > > -	/*
> > > > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > > -	 */
> > > > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > -					      0, &mshv_root.vmm_caps,
> > > > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > +						0, &mshv_root.vmm_caps,
> > > > > > +						sizeof(mshv_root.vmm_caps));
> > > > > > +	if (ret) {
> > > > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > > +		return ret;
> > > > > > +	}
> > > > > 
> > > > > This is a functional change that isn't mentioned in the commit message.
> > > > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > > > as all disabled? Presumably there are older versions of the hypervisor that
> > > > > don't support the requirements described in the original comment, but
> > > > > perhaps they are no longer relevant?
> > > > > 
> > > > 
> > > > To fail is now the only option for the L1VH partition. It must discover
> > > > the scheduler type. Without this information, the partition cannot
> > > > operate. The core scheduler logic will not work with an integrated
> > > > scheduler, and vice versa.
> > > 
> > > I don't think we need to fail here. If we don't find vmm caps, that
> > > means we are on an older hypervisor that supports l1vh but not
> > > integrated scheduler (yes, such a version exists). In this case since
> > > integrated scheduler is not supported by the hypervisor, the core
> > > scheduler logic will work.
> > > 
> > 
> > The older hypervisor version won't have the integrated scheduler
> > capability bit.
> > And we can't operate in core scheduler mode if the integrated scheduler
> > is enabled underneath us.
> 
> The older hypervisor won't have the integrated scheduler capability bit.
> This means that the older hypervisor doesn't support integrated
> scheduler (this is how vmm caps work: if the bit doesn't exist or
> vmm caps themselves don't exist the feature should be assumed as not
> available). If the hypervisor doesn't support integrated scheduler in the
> first place, it can't be enabled underneath us. So, it is safe to
> operate in core scheduler mode.
> 

We can’t tell whether the hypervisor is older and simply doesn’t have
the VMM caps bit, or whether we just failed to fetch the VMM caps.

In other words, we can’t distinguish between “an older hypervisor
without integrated scheduler support” and “a newer hypervisor with an
integrated scheduler, but we failed to fetch the VMM caps”.
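
To make the ambiguity concrete, here is a minimal, self-contained C
sketch. Everything in it is hypothetical (struct hv_model, fetch_caps(),
pick_mode_lenient() and the CAP_INTEGRATED bit value); it is not the mshv
driver code. It assumes, as argued above, that the caller sees the same
error whether the property is unsupported or the fetch simply failed, and
it collapses the capability bit and the "enabled" property into a single
bit for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit value only; not the real capability layout. */
#define CAP_INTEGRATED (1ULL << 0)

enum sched_mode { SCHED_CORE, SCHED_ROOT_LOGIC };

/* Hypothetical model of the hypervisor as seen from the L1VH partition. */
struct hv_model {
	bool has_vmm_caps;        /* false on an older hypervisor */
	bool integrated_enabled;  /* integrated scheduler is active */
	bool fetch_fails;         /* the query fails for some other reason */
};

/* Returns 0 and fills *caps on success, -1 on any failure. */
int fetch_caps(const struct hv_model *hv, uint64_t *caps)
{
	assert(hv && caps);
	/* Assumption: both failure causes look identical to the caller. */
	if (!hv->has_vmm_caps || hv->fetch_fails)
		return -1;
	*caps = hv->integrated_enabled ? CAP_INTEGRATED : 0;
	return 0;
}

/* The "treat a failed fetch as all caps disabled" policy. */
enum sched_mode pick_mode_lenient(const struct hv_model *hv)
{
	uint64_t caps = 0;

	if (fetch_caps(hv, &caps))
		caps = 0;
	return (caps & CAP_INTEGRATED) ? SCHED_ROOT_LOGIC : SCHED_CORE;
}
```

In this model an older hypervisor and a newer one whose query failed
both end up in SCHED_CORE, although that choice is only correct for the
first of the two.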

But for completeness: are you saying there is an older hypervisor
version that supports L1VH, but does not support VMM caps?

Thanks, Stanislav

> Thanks,
> Anirudh.


* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-30 18:46 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXzmMInsNSvFvBF1@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > hypervisor deposited pages.
> > > > > > 
> > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > management is implemented.
> > > > > 
> > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > and would work without any issue for L1VH.
> > > > > 
> > > > 
> > > > No, it won't work and hypervisor-deposited pages won't be withdrawn.
> > > 
> > > All pages that were deposited in the context of a guest partition (i.e.
> > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > right? What other deposited pages would be left?
> > > 
> > 
> > The driver deposits two types of pages: one for the guests (withdrawn
> > upon guest shutdown) and the other for the host itself (never
> > withdrawn).
> > See hv_call_create_partition, for example: it deposits pages for the
> > host partition.
> 
> Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> Also, can't we forcefully kill all running partitions in module_exit and
> then reclaim memory? Would this help with kernel consistency
> irrespective of userspace behavior?
> 

It would, but this is sloppy and cannot be a long-term solution.

It is also not reliable. We have no hook to prevent kexec. So if we fail
to kill the guest or reclaim the memory for any reason, the new kernel
may still crash.

There are two long-term solutions:
 1. Add a way to prevent kexec when there is shared state between the
    hypervisor and the kernel.
 2. Hand the shared kernel state over to the new kernel.

I sent a series for the first one. The second one is not ready yet.
Anything else is neither robust nor reliable, so I don’t think it makes
sense to pursue it.

Thanks,
Stanislav


> Thanks,
> Anirudh.
> 
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh.
> > > 
> > > > Also, kernel consistency must not depend on userspace behavior.
> > > > 
> > > > > Also, I don't think it is reasonable at all that someone needs to
> > > > > disable basic kernel functionality such as kexec in order to use our
> > > > > driver.
> > > > > 
> > > > 
> > > > It's a temporary measure until proper page lifecycle management is
> > > > supported in the driver.
> > > > Mutual exclusion of the driver and kexec is a given and thus should be
> > > > explicitly stated in the Kconfig.
> > > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > Thanks,
> > > > > Anirudh.
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > ---
> > > > > >  drivers/hv/Kconfig |    1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > --- a/drivers/hv/Kconfig
> > > > > > +++ b/drivers/hv/Kconfig
> > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > > >  	depends on PAGE_SIZE_4KB
> > > > > > +	depends on !KEXEC
> > > > > >  	select EVENTFD
> > > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > > >  	select HMM_MIRROR
> > > > > > 
> > > > > > 


* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Anirudh Rayabharam @ 2026-01-30 18:43 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXz6cu8BG1vwiCeb@skinsburskii.localdomain>

On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > 
> > > > > From: Andreea Pintilie <anpintil@microsoft.com>
> > > > > 
> > > > > Query the hypervisor for integrated scheduler support and use it if
> > > > > configured.
> > > > > 
> > > > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > > > root scheduler allows the root partition to schedule guest vCPUs across
> > > > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > > > scheduling entirely to the hypervisor.
> > > > > 
> > > > > Direct virtualization introduces a new privileged guest partition type - L1
> > > > > Virtual Host (L1VH) — which can create child partitions from its own
> > > > > resources. These child partitions are effectively siblings, scheduled by
> > > > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > > > (typically round-robin across all allocated physical CPUs). As a result,
> > > > > the system may appear to "steal" time from the L1VH and its children.
> > > > > 
> > > > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > > > guests across its "physical" cores, effectively emulating root scheduler
> > > > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > > > rest of the system.
> > > > > 
> > > > > The integrated scheduler is controlled by the root partition and gated by
> > > > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > > > supports the integrated scheduler. The L1VH partition must then check if it
> > > > > is enabled by querying the corresponding extended partition property. If
> > > > > this property is true, the L1VH partition must use the root scheduler
> > > > > logic; otherwise, it must use the core scheduler.
> > > > > 
> > > > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > ---
> > > > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > > > >  include/hyperv/hvhdk_mini.h |    6 +++
> > > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > 
> 
>  <snip>
> 
> > > > > -root_sched_deinit:
> > > > > -	root_scheduler_deinit();
> > > > > -	return err;
> > > > >  }
> > > > > 
> > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > >  {
> > > > > -	/*
> > > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > -	 */
> > > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > -					      0, &mshv_root.vmm_caps,
> > > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > +	int ret;
> > > > > +
> > > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > +						0, &mshv_root.vmm_caps,
> > > > > +						sizeof(mshv_root.vmm_caps));
> > > > > +	if (ret) {
> > > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > +		return ret;
> > > > > +	}
> > > > 
> > > > This is a functional change that isn't mentioned in the commit message.
> > > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > > as all disabled? Presumably there are older versions of the hypervisor that
> > > > don't support the requirements described in the original comment, but
> > > > perhaps they are no longer relevant?
> > > > 
> > > 
> > > To fail is now the only option for the L1VH partition. It must discover
> > > the scheduler type. Without this information, the partition cannot
> > > operate. The core scheduler logic will not work with an integrated
> > > scheduler, and vice versa.
> > 
> > I don't think we need to fail here. If we don't find vmm caps, that
> > means we are on an older hypervisor that supports l1vh but not
> > integrated scheduler (yes, such a version exists). In this case since
> > integrated scheduler is not supported by the hypervisor, the core
> > scheduler logic will work.
> > 
> 
> The older hypervisor version won't have the integrated scheduler
> capability bit.
> And we can't operate in core scheduler mode if the integrated scheduler
> is enabled underneath us.

The older hypervisor won't have the integrated scheduler capability bit.
This means that the older hypervisor doesn't support integrated
scheduler (this is how vmm caps work: if the bit doesn't exist or
vmm caps themselves don't exist the feature should be assumed as not
available). If the hypervisor doesn't support integrated scheduler in the
first place, it can't be enabled underneath us. So, it is safe to
operate in core scheduler mode.
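
A minimal sketch of that fallback rule, in standalone C. All names here
(struct caps_sketch, caps_query_fn, init_caps_lenient() and the two stub
queries) are hypothetical and only illustrate the idea; the bit position
is made up and this is not the driver's actual structure or hypercall.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the capability bank, not the real struct. */
struct caps_sketch {
	uint64_t bits;
};

/* Illustrative bit position only. */
#define CAPS_SKETCH_INTEGRATED_SCHED (1ULL << 0)

/* Hypothetical query callback: 0 on success, nonzero if caps are unavailable. */
typedef int (*caps_query_fn)(struct caps_sketch *out);

/* Stub for a hypervisor without vmm caps: the query always fails. */
int query_no_caps(struct caps_sketch *out)
{
	(void)out;
	return -1;
}

/* Stub for a hypervisor that reports the integrated scheduler bit. */
int query_with_integrated(struct caps_sketch *out)
{
	out->bits = CAPS_SKETCH_INTEGRATED_SCHED;
	return 0;
}

/* Missing caps are treated as "every capability disabled". */
void init_caps_lenient(caps_query_fn query, struct caps_sketch *caps)
{
	assert(query && caps);
	if (query(caps))
		memset(caps, 0, sizeof(*caps));
}

/* A capability counts as supported only if its bit is actually set. */
int integrated_sched_supported(const struct caps_sketch *caps)
{
	return (caps->bits & CAPS_SKETCH_INTEGRATED_SCHED) != 0;
}
```

With query_no_caps() the caps end up zeroed, so the integrated scheduler
is reported as unsupported and the core scheduler path is taken.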

Thanks,
Anirudh.



* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-30 18:41 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Mukesh R, kys, haiyangz, wei.liu, decui, longli, linux-hyperv,
	linux-kernel
In-Reply-To: <aXznwGcuP9rdffYf@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 05:17:52PM +0000, Anirudh Rayabharam wrote:
> On Thu, Jan 29, 2026 at 06:59:31PM -0800, Mukesh R wrote:
> > On 1/28/26 15:08, Stanislav Kinsburskii wrote:
> > > On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
> > > > On 1/27/26 09:47, Stanislav Kinsburskii wrote:
> > > > > On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
> > > > > > On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> > > > > > > On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> > > > > > > > On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> > > > > > > > > On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> > > > > > > > > > On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > > > > > > > > > > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > > > > > > > > > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >        drivers/hv/Kconfig |    1 +
> > > > > > > > > > > > >        1 file changed, 1 insertion(+)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > > > > > > >        	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > > > > > > >        	# no particular order, making it impossible to reassemble larger pages
> > > > > > > > > > > > >        	depends on PAGE_SIZE_4KB
> > > > > > > > > > > > > +	depends on !KEXEC
> > > > > > > > > > > > >        	select EVENTFD
> > > > > > > > > > > > >        	select VIRT_XFER_TO_GUEST_WORK
> > > > > > > > > > > > >        	select HMM_MIRROR
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > > > > > > > > > > implying that crash dump might be involved. Or did you test kdump
> > > > > > > > > > > > and it was fine?
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > > > > > > > > > > will be affected as well.
> > > > > > > > > > 
> > > > > > > > > > So not sure I understand the reason for this patch. We can just block
> > > > > > > > > > kexec if there are any VMs running, right? Doing this would mean any
> > > > > > > > > > further development would be without a very important and major feature,
> > > > > > > > > > right?
> > > > > > > > > 
> > > > > > > > > This is an option. But until it's implemented and merged, a user of the mshv
> > > > > > > > > driver gets into a situation where kexec is broken in a non-obvious way.
> > > > > > > > > The system may crash at any time after kexec, depending on whether the
> > > > > > > > > new kernel touches the pages deposited to hypervisor or not. This is a
> > > > > > > > > bad user experience.
> > > > > > > > 
> > > > > > > > I understand that. But with this we cannot collect core and debug any
> > > > > > > > crashes. I was thinking there would be a quick way to prohibit kexec
> > > > > > > > for update via notifier or some other quick hack. Did you already
> > > > > > > > explore that and didn't find anything, hence this?
> > > > > > > > 
> > > > > > > 
> > > > > > > This quick hack you mention isn't quick in the upstream kernel as there
> > > > > > > is no hook to interrupt kexec process except the live update one.
> > > > > > 
> > > > > > That's the one we want to interrupt and block right? crash kexec
> > > > > > is ok and should be allowed. We can document we don't support kexec
> > > > > > for update for now.
> > > > > > 
> > > > > > > I sent an RFC for that one but given today's conversation details it
> > > > > > > won't be accepted as is.
> > > > > > 
> > > > > > Are you talking about this?
> > > > > > 
> > > > > >           "mshv: Add kexec safety for deposited pages"
> > > > > > 
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > > Making mshv mutually exclusive with kexec is the only viable option for
> > > > > > > now given time constraints.
> > > > > > > It is intended to be replaced with proper page lifecycle management in
> > > > > > > the future.
> > > > > > 
> > > > > > Yeah, that could take a long time and imo we cannot just disable KEXEC
> > > > > > completely. What we want is just block kexec for updates from some
> > > > > > mshv file for now, we can print during boot that kexec for updates is
> > > > > > not supported on mshv. Hope that makes sense.
> > > > > > 
> > > > > 
> > > > > The trade-off here is between disabling kexec support and having the
> > > > > kernel crash after kexec in a non-obvious way. This affects both regular
> > > > > kexec and crash kexec.
> > > > 
> > > > crash kexec on baremetal is not affected, hence disabling that
> > > > doesn't make sense as we can't debug crashes then on bm.
> > > > 
> > > 
> > > Bare metal support is not currently relevant, as it is not available.
> > > This is the upstream kernel, and this driver will be accessible to
> > > third-party customers beginning with kernel 6.19 for running their
> > > kernels in Azure L1VH, so consistency is required.
> > 
> > Well, without crashdump support, customers will not be running anything
> > anywhere.
> 
> This is my concern too. I don't think customers will be particularly
> happy that kexec doesn't work with our driver.
> 

I wasn’t clear earlier, so let me restate it. Today, kexec is not
supported in L1VH. This is a bug we have not fixed yet. Disabling kexec
is not a long-term solution. But it is better to disable it explicitly
than to have kernel crashes after kexec.

This does not mean the bug should not be fixed. But the upstream kernel
has its own policies and merge windows. For kernel 6.19, it is better to
have a clear kexec error than random crashes after kexec.

Thanks,
Stanislav

> Thanks,
> Anirudh
> 
> > 
> > Thanks,
> > -Mukesh
> > 
> > > Thanks,
> > > Stanislav
> > > 
> > > > Let me think and explore a bit, and if I come up with something, I'll
> > > > send a patch here. If nothing, then we can do this as last resort.
> > > > 
> > > > Thanks,
> > > > -Mukesh
> > > > 
> > > > 
> > > > > It's a pity we can't apply a quick hack to disable only regular kexec.
> > > > > However, since crash kexec would hit the same issues, until we have a
> > > > > proper state transition for deposited pages, the best workaround for now
> > > > > is to reset the hypervisor state on every kexec, which needs design,
> > > > > work, and testing.
> > > > > 
> > > > > Disabling kexec is the only consistent way to handle this in the
> > > > > upstream kernel at the moment.
> > > > > 
> > > > > Thanks, Stanislav
> > > > > 
> > > > > 
> > > > > > Thanks,
> > > > > > -Mukesh
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Stanislav
> > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > -Mukesh
> > > > > > > > 
> > > > > > > > > Therefore it should be explicitly forbidden as it's essentially not
> > > > > > > > > supported yet.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Stanislav
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Stanislav
> > > > > > > > > > > 
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > -Mukesh
> > 


* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-30 18:37 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXzqsfT8-h-g9mex@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > > > 
> > > > From: Andreea Pintilie <anpintil@microsoft.com>
> > > > 
> > > > Query the hypervisor for integrated scheduler support and use it if
> > > > configured.
> > > > 
> > > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > > root scheduler allows the root partition to schedule guest vCPUs across
> > > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > > scheduling entirely to the hypervisor.
> > > > 
> > > > Direct virtualization introduces a new privileged guest partition type - L1
> > > > Virtual Host (L1VH) — which can create child partitions from its own
> > > > resources. These child partitions are effectively siblings, scheduled by
> > > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > > (typically round-robin across all allocated physical CPUs). As a result,
> > > > the system may appear to "steal" time from the L1VH and its children.
> > > > 
> > > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > > guests across its "physical" cores, effectively emulating root scheduler
> > > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > > rest of the system.
> > > > 
> > > > The integrated scheduler is controlled by the root partition and gated by
> > > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > > supports the integrated scheduler. The L1VH partition must then check if it
> > > > is enabled by querying the corresponding extended partition property. If
> > > > this property is true, the L1VH partition must use the root scheduler
> > > > logic; otherwise, it must use the core scheduler.
> > > > 
> > > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > > >  include/hyperv/hvhdk_mini.h |    6 +++
> > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > 

 <snip>

> > > > -root_sched_deinit:
> > > > -	root_scheduler_deinit();
> > > > -	return err;
> > > >  }
> > > > 
> > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > >  {
> > > > -	/*
> > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > -	 */
> > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > -					      0, &mshv_root.vmm_caps,
> > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > +	int ret;
> > > > +
> > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > +						0, &mshv_root.vmm_caps,
> > > > +						sizeof(mshv_root.vmm_caps));
> > > > +	if (ret) {
> > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > +		return ret;
> > > > +	}
> > > 
> > > This is a functional change that isn't mentioned in the commit message.
> > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > as all disabled? Presumably there are older versions of the hypervisor that
> > > don't support the requirements described in the original comment, but
> > > perhaps they are no longer relevant?
> > > 
> > 
> > To fail is now the only option for the L1VH partition. It must discover
> > the scheduler type. Without this information, the partition cannot
> > operate. The core scheduler logic will not work with an integrated
> > scheduler, and vice versa.
> 
> I don't think we need to fail here. If we don't find vmm caps, that
> means we are on an older hypervisor that supports l1vh but not
> integrated scheduler (yes, such a version exists). In this case since
> integrated scheduler is not supported by the hypervisor, the core
> scheduler logic will work.
> 

The older hypervisor version won't have the integrated scheduler
capability bit.
And we can't operate in core scheduler mode if the integrated scheduler
is enabled underneath us.
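
For reference, the decision rule from the commit message can be sketched
as below. This is a standalone illustration with hypothetical names
(struct l1vh_sched_info, l1vh_pick_sched_logic()), not the code from the
patch; it assumes both inputs were already read from the hypervisor
successfully.

```c
#include <assert.h>
#include <stdbool.h>

enum sched_logic { USE_CORE_SCHED_LOGIC, USE_ROOT_SCHED_LOGIC };

/* Hypothetical snapshot of what the L1VH partition learned at init time. */
struct l1vh_sched_info {
	/* State of the vmm_enable_integrated_scheduler capability bit. */
	bool cap_integrated_scheduler;
	/* Value of the "integrated scheduler enabled" partition property;
	 * only meaningful when the capability bit is set. */
	bool integrated_scheduler_enabled;
};

enum sched_logic l1vh_pick_sched_logic(const struct l1vh_sched_info *info)
{
	assert(info);
	/* Without the capability bit the property is never consulted. */
	if (!info->cap_integrated_scheduler)
		return USE_CORE_SCHED_LOGIC;
	/* Supported: root scheduler logic only if it is actually enabled. */
	return info->integrated_scheduler_enabled ? USE_ROOT_SCHED_LOGIC
						   : USE_CORE_SCHED_LOGIC;
}
```

The rule itself is simple; the disagreement in this thread is only about
what to do when the inputs cannot be obtained.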

Thanks,
Stanislav


> Thanks,
> Anirudh.
> 
> > 
> > And yes, older hypervisor versions do not support L1VH.
> > 
> > Thanks,
> > Stanislav
> > 
> > > > 
> > > >  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > > > +
> > > > +	return 0;
> > > >  }
> > > > 
> > > >  static int __init mshv_parent_partition_init(void)
> > > > @@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
> > > > 
> > > >  	mshv_cpuhp_online = ret;
> > > > 
> > > > +	ret = mshv_init_vmm_caps(dev);
> > > > +	if (ret)
> > > > +		goto remove_cpu_state;
> > > > +
> > > >  	ret = mshv_retrieve_scheduler_type(dev);
> > > >  	if (ret)
> > > >  		goto remove_cpu_state;
> > > > @@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
> > > >  	if (ret)
> > > >  		goto remove_cpu_state;
> > > > 
> > > > -	mshv_init_vmm_caps(dev);
> > > > +	ret = root_scheduler_init(dev);
> > > > +	if (ret)
> > > > +		goto exit_partition;
> > > > 
> > > >  	ret = mshv_irqfd_wq_init();
> > > >  	if (ret)
> > > > -		goto exit_partition;
> > > > +		goto deinit_root_scheduler;
> > > > 
> > > >  	spin_lock_init(&mshv_root.pt_ht_lock);
> > > >  	hash_init(mshv_root.pt_htable);
> > > > @@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
> > > > 
> > > >  	return 0;
> > > > 
> > > > +deinit_root_scheduler:
> > > > +	root_scheduler_deinit();
> > > >  exit_partition:
> > > >  	if (hv_root_partition())
> > > >  		mshv_root_partition_exit();
> > > > @@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
> > > >  	mshv_port_table_fini();
> > > >  	misc_deregister(&mshv_dev);
> > > >  	mshv_irqfd_wq_cleanup();
> > > > +	root_scheduler_deinit();
> > > >  	if (hv_root_partition())
> > > >  		mshv_root_partition_exit();
> > > >  	cpuhp_remove_state(mshv_cpuhp_online);
> > > > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > > > index aa03616f965b..0f7178fa88a8 100644
> > > > --- a/include/hyperv/hvhdk_mini.h
> > > > +++ b/include/hyperv/hvhdk_mini.h
> > > > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> > > >  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> > > >  	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
> > > > 
> > > > +	/* Integrated scheduling properties */
> > > > +	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
> > > > +
> > > >  	/* Resource properties */
> > > >  	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
> > > >  	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
> > > > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> > > >  };
> > > > 
> > > >  #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
> > > > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	58
> > > > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
> > > > 
> > > >  struct hv_partition_property_vmm_capabilities {
> > > >  	u16 bank_count;
> > > > @@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
> > > >  #endif
> > > >  			u64 assignable_synthetic_proc_features: 1;
> > > >  			u64 tag_hv_message_from_child: 1;
> > > > +			u64 vmm_enable_integrated_scheduler : 1;
> > > >  			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> > > >  		} __packed;
> > > >  	};
> > > > 
> > > > 
> > > 


* RE: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Michael Kelley @ 2026-01-30 17:49 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXzsVN3SnNXIDPMV@anirudh-surface.localdomain>

From: Anirudh Rayabharam <anirudh@anirudhrb.com> Sent: Friday, January 30, 2026 9:37 AM
> 
> On Fri, Jan 30, 2026 at 01:24:34AM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, January 29, 2026 11:10 AM
> > >
> > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > >
> > > <snip>
> > >
> > > > >  static int __init mshv_root_partition_init(struct device *dev)
> > > > >  {
> > > > >  	int err;
> > > > >
> > > > > -	err = root_scheduler_init(dev);
> > > > > -	if (err)
> > > > > -		return err;
> > > > > -
> > > > >  	err = register_reboot_notifier(&mshv_reboot_nb);
> > > > >  	if (err)
> > > > > -		goto root_sched_deinit;
> > > > > +		return err;
> > > > >
> > > > >  	return 0;
> > > >
> > > > This code is now:
> > > >
> > > > 	if (err)
> > > > 		return err;
> > > > 	return 0;
> > > >
> > > > which can be simplified to just:
> > > >
> > > > 	return err;
> > > >
> > > > Or drop the local variable 'err' and simplify the entire function to:
> > > >
> > > > 	return register_reboot_notifier(&mshv_reboot_nb);
> > > >
> > > > There's a tangential question here: Why is this reboot notifier
> > > > needed in the first place? All it does is remove the cpuhp state
> > > > that allocates/frees the per-cpu root_scheduler_input and
> > > > root_scheduler_output pages. Removing the state will free
> > > > the pages, but if Linux is rebooting, why bother?
> > > >
> > >
> > > This was originally done to support kexec.
> > > Here is the original commit message:
> > >
> > >     mshv: perform synic cleanup during kexec
> > >
> > >     Register a reboot notifier that performs synic cleanup when a kexec
> > >     is in progress.
> > >
> > >     One notable issue this commit fixes is one where after a kexec, virtio
> > >     devices are not functional. Linux root partition receives MMIO doorbell
> > >     events in the ring buffer in the SIRB synic page. The hypervisor maintains
> > >     a head pointer where it writes new events into the ring buffer. The root
> > >     partition maintains a tail pointer to read events from the buffer.
> > >
> > >     Upon kexec reboot, all root data structures are re-initialized and thus the
> > >     tail pointer gets reset to zero. The hypervisor on the other hand still
> > >     retains the pre-kexec head pointer which could be non-zero. This means that
> > >     when the hypervisor writes new events to the ring buffer, the root
> > >     partition looks at the wrong place and doesn't find any events. So, future
> > >     doorbell events never get delivered. As a result, virtqueue kicks never get
> > >     delivered to the host.
> > >
> > >     When the SIRB page is disabled the hypervisor resets the head pointer.
> >
> > FWIW, I don't see that commit message anywhere in a public source code
> > tree. The calls to register/unregister_reboot_notifier() were in the original
> > introduction of mshv_root_main.c in upstream commit 621191d709b14.
> > Evidently the code described by that commit message was not submitted
> > upstream. And of course, the kexec() topic is now being revisited ....
> >
> > So to clarify: Do you expect that in the future the reboot notifier will be
> > used for something that really is required for resetting hypervisor state
> > in the case of a kexec reboot?
> 
> That commit wasn't sent upstream individually, but all the code from it
> did land upstream, probably bundled with other commits when the mshv
> driver was introduced. So the reboot notifier is indeed currently used
> for resetting the synic correctly during kexec reboot.
> 

Indeed, you are right. I confused the "mshv_root_sched_online" and
"mshv_cpuhp_online" cpuhp states. The reboot notifier removes the latter,
not the former.  And the latter does substantive cleanup work on the SynIC
when the state is removed. Apologies for the confusion.

Michael

* Re: [PATCH v3] mshv: Add support for integrated scheduler
From: Anirudh Rayabharam @ 2026-01-30 17:47 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <176978905128.18763.15996443783319253336.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Fri, Jan 30, 2026 at 04:04:14PM +0000, Stanislav Kinsburskii wrote:
> Query the hypervisor for integrated scheduler support and use it if
> configured.
> 
> Microsoft Hypervisor originally provided two schedulers: root and core. The
> root scheduler allows the root partition to schedule guest vCPUs across
> physical cores, supporting both time slicing and CPU affinity (e.g., via
> cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> scheduling entirely to the hypervisor.
> 
> Direct virtualization introduces a new privileged guest partition type - L1
> Virtual Host (L1VH) — which can create child partitions from its own
> resources. These child partitions are effectively siblings, scheduled by
> the hypervisor's core scheduler. This prevents the L1VH parent from setting
> affinity or time slicing for its own processes or guest VPs. While cgroups,
> CFS, and cpuset controllers can still be used, their effectiveness is
> unpredictable, as the core scheduler swaps vCPUs according to its own logic
> (typically round-robin across all allocated physical CPUs). As a result,
> the system may appear to "steal" time from the L1VH and its children.
> 
> To address this, Microsoft Hypervisor introduces the integrated scheduler.
> This allows an L1VH partition to schedule its own vCPUs and those of its

How could an L1VH partition schedule its own vCPUs?

> guests across its "physical" cores, effectively emulating root scheduler
> behavior within the L1VH, while retaining core scheduler behavior for the
> rest of the system.
> 
> The integrated scheduler is controlled by the root partition and gated by
> the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> supports the integrated scheduler. The L1VH partition must then check if it
> is enabled by querying the corresponding extended partition property. If
> this property is true, the L1VH partition must use the root scheduler
> logic; otherwise, it must use the core scheduler. This requirement makes
> reading VMM capabilities in L1VH partition a requirement too.
> 
> Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |   85 +++++++++++++++++++++++++++----------------
>  include/hyperv/hvhdk_mini.h |    7 +++-
>  2 files changed, 59 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 1134a82c7881..6a6bf641b352 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
>  	};
>  }
>  
> +static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)

typo: retrieve*

> +{
> +	u64 integrated_sched_enabled;
> +	int ret;
> +
> +	*out = HV_SCHEDULER_TYPE_CORE_SMT;
> +
> +	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
> +		return 0;
> +
> +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> +						HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
> +						0, &integrated_sched_enabled,
> +						sizeof(integrated_sched_enabled));
> +	if (ret)
> +		return ret;
> +
> +	if (integrated_sched_enabled)
> +		*out = HV_SCHEDULER_TYPE_ROOT;
> +
> +	pr_debug("%s: integrated scheduler property read: ret=%d value=%llu\n",
> +		 __func__, ret, integrated_sched_enabled);

ret is always 0 here, right? We don't need to bother printing then.

> +
> +	return 0;
> +}
> +
>  /* TODO move this to hv_common.c when needed outside */
>  static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
>  {
> @@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
>  /* Retrieve and stash the supported scheduler type */
>  static int __init mshv_retrieve_scheduler_type(struct device *dev)
>  {
> -	int ret = 0;
> +	int ret;
>  
>  	if (hv_l1vh_partition())
> -		hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
> +		ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
>  	else
>  		ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
> -
>  	if (ret)
>  		return ret;
>  
> @@ -2211,42 +2236,29 @@ struct notifier_block mshv_reboot_nb = {
>  static void mshv_root_partition_exit(void)
>  {
>  	unregister_reboot_notifier(&mshv_reboot_nb);
> -	root_scheduler_deinit();
>  }
>  
>  static int __init mshv_root_partition_init(struct device *dev)
>  {
> -	int err;
> -
> -	err = root_scheduler_init(dev);
> -	if (err)
> -		return err;
> -
> -	err = register_reboot_notifier(&mshv_reboot_nb);
> -	if (err)
> -		goto root_sched_deinit;
> -
> -	return 0;
> -
> -root_sched_deinit:
> -	root_scheduler_deinit();
> -	return err;
> +	return register_reboot_notifier(&mshv_reboot_nb);
>  }
>  
> -static void mshv_init_vmm_caps(struct device *dev)
> +static int __init mshv_init_vmm_caps(struct device *dev)
>  {
> -	/*
> -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> -	 */
> -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> -					      0, &mshv_root.vmm_caps,
> -					      sizeof(mshv_root.vmm_caps)))
> -		dev_warn(dev, "Unable to get VMM capabilities\n");
> +	int ret;
> +
> +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> +						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> +						0, &mshv_root.vmm_caps,
> +						sizeof(mshv_root.vmm_caps));
> +	if (ret && hv_l1vh_partition()) {
> +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> +		return ret;

I don't think we need to fail here. If there are no VMM caps available,
that means the integrated scheduler is not supported by the hypervisor,
so we can fall back to the core scheduler.

Thanks,
Anirudh

> +	}
>  
>  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> +
> +	return 0;
>  }
>  
>  static int __init mshv_parent_partition_init(void)
> @@ -2292,6 +2304,10 @@ static int __init mshv_parent_partition_init(void)
>  
>  	mshv_cpuhp_online = ret;
>  
> +	ret = mshv_init_vmm_caps(dev);
> +	if (ret)
> +		goto remove_cpu_state;
> +
>  	ret = mshv_retrieve_scheduler_type(dev);
>  	if (ret)
>  		goto remove_cpu_state;
> @@ -2301,11 +2317,13 @@ static int __init mshv_parent_partition_init(void)
>  	if (ret)
>  		goto remove_cpu_state;
>  
> -	mshv_init_vmm_caps(dev);
> +	ret = root_scheduler_init(dev);
> +	if (ret)
> +		goto exit_partition;
>  
>  	ret = mshv_irqfd_wq_init();
>  	if (ret)
> -		goto exit_partition;
> +		goto deinit_root_scheduler;
>  
>  	spin_lock_init(&mshv_root.pt_ht_lock);
>  	hash_init(mshv_root.pt_htable);
> @@ -2314,6 +2332,8 @@ static int __init mshv_parent_partition_init(void)
>  
>  	return 0;
>  
> +deinit_root_scheduler:
> +	root_scheduler_deinit();
>  exit_partition:
>  	if (hv_root_partition())
>  		mshv_root_partition_exit();
> @@ -2332,6 +2352,7 @@ static void __exit mshv_parent_partition_exit(void)
>  	mshv_port_table_fini();
>  	misc_deregister(&mshv_dev);
>  	mshv_irqfd_wq_cleanup();
> +	root_scheduler_deinit();
>  	if (hv_root_partition())
>  		mshv_root_partition_exit();
>  	cpuhp_remove_state(mshv_cpuhp_online);
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 41a29bf8ec14..c0300910808b 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -87,6 +87,9 @@ enum hv_partition_property_code {
>  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
>  	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
>  
> +	/* Integrated scheduling properties */
> +	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
> +
>  	/* Resource properties */
>  	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
>  	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
> @@ -102,7 +105,7 @@ enum hv_partition_property_code {
>  };
>  
>  #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
> -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	59
> +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
>  
>  struct hv_partition_property_vmm_capabilities {
>  	u16 bank_count;
> @@ -119,6 +122,8 @@ struct hv_partition_property_vmm_capabilities {
>  			u64 reservedbit3: 1;
>  #endif
>  			u64 assignable_synthetic_proc_features: 1;
> +			u64 reservedbit5: 1;
> +			u64 vmm_enable_integrated_scheduler : 1;
>  			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
>  		} __packed;
>  	};
> 
> 

* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Anirudh Rayabharam @ 2026-01-30 17:37 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157EE41697ABC1002750297D49FA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, Jan 30, 2026 at 01:24:34AM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, January 29, 2026 11:10 AM
> > 
> > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > 
> > <snip>
> > 
> > > >  static int __init mshv_root_partition_init(struct device *dev)
> > > >  {
> > > >  	int err;
> > > >
> > > > -	err = root_scheduler_init(dev);
> > > > -	if (err)
> > > > -		return err;
> > > > -
> > > >  	err = register_reboot_notifier(&mshv_reboot_nb);
> > > >  	if (err)
> > > > -		goto root_sched_deinit;
> > > > +		return err;
> > > >
> > > >  	return 0;
> > >
> > > This code is now:
> > >
> > > 	if (err)
> > > 		return err;
> > > 	return 0;
> > >
> > > which can be simplified to just:
> > >
> > > 	return err;
> > >
> > > Or drop the local variable 'err' and simplify the entire function to:
> > >
> > > 	return register_reboot_notifier(&mshv_reboot_nb);
> > >
> > > There's a tangential question here: Why is this reboot notifier
> > > needed in the first place? All it does is remove the cpuhp state
> > > that allocates/frees the per-cpu root_scheduler_input and
> > > root_scheduler_output pages. Removing the state will free
> > > the pages, but if Linux is rebooting, why bother?
> > >
> > 
> > This was originally done to support kexec.
> > Here is the original commit message:
> > 
> >     mshv: perform synic cleanup during kexec
> > 
> >     Register a reboot notifier that performs synic cleanup when a kexec
> >     is in progress.
> > 
> >     One notable issue this commit fixes is one where after a kexec, virtio
> >     devices are not functional. Linux root partition receives MMIO doorbell
> >     events in the ring buffer in the SIRB synic page. The hypervisor maintains
> >     a head pointer where it writes new events into the ring buffer. The root
> >     partition maintains a tail pointer to read events from the buffer.
> > 
> >     Upon kexec reboot, all root data structures are re-initialized and thus the
> >     tail pointer gets reset to zero. The hypervisor on the other hand still
> >     retains the pre-kexec head pointer which could be non-zero. This means that
> >     when the hypervisor writes new events to the ring buffer, the root
> >     partition looks at the wrong place and doesn't find any events. So, future
> >     doorbell events never get delivered. As a result, virtqueue kicks never get
> >     delivered to the host.
> > 
> >     When the SIRB page is disabled the hypervisor resets the head pointer.
> 
> FWIW, I don't see that commit message anywhere in a public source code
> tree. The calls to register/unregister_reboot_notifier() were in the original
> introduction of mshv_root_main.c in upstream commit 621191d709b14.
> Evidently the code described by that commit message was not submitted
> upstream. And of course, the kexec() topic is now being revisited ....
> 
> So to clarify: Do you expect that in the future the reboot notifier will be
> used for something that really is required for resetting hypervisor state
> in the case of a kexec reboot?

That commit wasn't sent upstream individually, but all the code from it
did land upstream, probably bundled with other commits when the mshv
driver was introduced. So the reboot notifier is indeed currently used
for resetting the synic correctly during kexec reboot.

Thanks,
Anirudh.

> 
> > 
> > > > -root_sched_deinit:
> > > > -	root_scheduler_deinit();
> > > > -	return err;
> > > >  }
> > > >
> > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > >  {
> > > > -	/*
> > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > -	 */
> > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > -					      0, &mshv_root.vmm_caps,
> > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > +	int ret;
> > > > +
> > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > +						0, &mshv_root.vmm_caps,
> > > > +						sizeof(mshv_root.vmm_caps));
> > > > +	if (ret) {
> > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > +		return ret;
> > > > +	}
> > >
> > > This is a functional change that isn't mentioned in the commit message.
> > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > as all disabled? Presumably there are older versions of the hypervisor that
> > > don't support the requirements described in the original comment, but
> > > perhaps they are no longer relevant?
> > >
> > 
> > To fail is now the only option for the L1VH partition. It must discover
> > the scheduler type. Without this information, the partition cannot
> > operate. The core scheduler logic will not work with an integrated
> > scheduler, and vice versa.
> > 
> > And yes, older hypervisor versions do not support L1VH.
> 
> That makes sense. Your change in v2 of the patch handles this
> nicely. For the non-L1VH case, the v2 behavior is the same as before in
> that the init path won't error out on older hypervisors that don't
> support the requirements described in the original comment. That's
> the case I am concerned about.
> 
> Michael

* Re: [PATCH net-next v2] net: mana: Improve diagnostic logging for better debuggability
From: Erni Sri Satya Vennela @ 2026-01-30 17:37 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, longli,
	andrew+netdev, davem, edumazet, pabeni, kotaranov, shradhagupta,
	yury.norov, dipayanroy, shirazsaleem, ssengar, gargaditya,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260126195850.GO13967@unreal>

On Mon, Jan 26, 2026 at 09:58:50PM +0200, Leon Romanovsky wrote:
> On Thu, Jan 22, 2026 at 06:07:45PM -0800, Jakub Kicinski wrote:
> > On Thu, 22 Jan 2026 09:43:42 -0800 Erni Sri Satya Vennela wrote:
> > > On Wed, Jan 21, 2026 at 08:14:12PM -0800, Jakub Kicinski wrote:
> > > > On Tue, 20 Jan 2026 22:56:55 -0800 Erni Sri Satya Vennela wrote:  
> > 
> > You will have to build proper support tooling like every single vendor
> > before you. Presumably you can also log from the hypervisor side which
> > makes your life so much easier than supporting real HW. Yet, real
> > NIC don't spew random trash to the logs all the time. SMH. Respectfully,
> > next time y'all "discuss things internally" start with the question of
> > what makes your case special :|
> 
> +100
> 
> Interesting. Completely independent of your comment, I provided the same
> feedback on their mana_ib driver. They added debug logs to nearly every
> command, even though those commands already had existing debug logging.
> 
> https://lore.kernel.org/linux-rdma/20260122131442.GL13201@unreal/T/#m51e8a12f4bca4a6c1377c5531c8a6d94a43af1e5
> 
> "In order to simplify things for you: unless you can clearly justify why this
> print is required and why you cannot proceed without it, I must ask you to stop
> adding any new debug or error messages to the mana_ib driver. There is a wide
> range of existing tools and well‑established practices for debugging the kernel,
> and none of them require spamming dmesg."
> 
> Thanks

Hi Jakub, Leon,

We agree with the concerns raised about adding new lines of logging.
We therefore plan to get the soc logs required for debugging customer
issues by modifying the existing log lines themselves, without adding
any new lines.

Old Logs:

mana 7870:00:00.0: Microsoft Azure Network Adapter protocol version:
0.1.1
mana 7870:00:00.0 enP30832s1: Configured vPort 0 PD 18 DB 16
mana 7870:00:00.0 enP30832s1: Configured steering vPort 0 entries 64

Modified logs:

Initialization:
mana 7870:00:00.0: Microsoft Azure Network Adapter protocol version:
0.1.1 Max Resources: msix_usable=33 max_queues=32 VF version:
protocol=0x0 pf_caps=[0x1d]

Module load:
mana 7870:00:00.0 enP30832s1: Enabled vPort 0 PD 18 DB 16 MAC
60:45:bd:7b:76:30 Vport Config: max_txq=32 max_rxq=32 indir_ent=64
Device Config: max_vports=1 adapter_mtu=9216 bm_hostmode=0
mana 7870:00:00.0 enP30832s1: Configured steering vPort 0 entries 64 MAC
60:45:bd:7b:76:30 [rx:1 rss:1 update_indirection_table:1
cqe_coalescing:0]

Module unload:
mana 7870:00:00.0 enP30832s1: Configured steering vPort 0 entries 64 MAC
60:45:bd:7b:76:30 [rx:1 rss:1 update_indirection_table:1
cqe_coalescing:0]
mana 7870:00:00.0 enP30832s1: Disabled vPort 0 MAC 60:45:bd:7b:76:30

We chose this approach because we want to support the older kernels
that customers are using, and it makes these changes easier to
backport. Is this approach acceptable?

* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Anirudh Rayabharam @ 2026-01-30 17:30 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aXuwes2HNf4Og8lW@skinsburskii.localdomain>

On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > > 
> > > From: Andreea Pintilie <anpintil@microsoft.com>
> > > 
> > > Query the hypervisor for integrated scheduler support and use it if
> > > configured.
> > > 
> > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > root scheduler allows the root partition to schedule guest vCPUs across
> > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > scheduling entirely to the hypervisor.
> > > 
> > > Direct virtualization introduces a new privileged guest partition type - L1
> > > Virtual Host (L1VH) — which can create child partitions from its own
> > > resources. These child partitions are effectively siblings, scheduled by
> > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > (typically round-robin across all allocated physical CPUs). As a result,
> > > the system may appear to "steal" time from the L1VH and its children.
> > > 
> > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > guests across its "physical" cores, effectively emulating root scheduler
> > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > rest of the system.
> > > 
> > > The integrated scheduler is controlled by the root partition and gated by
> > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > supports the integrated scheduler. The L1VH partition must then check if it
> > > is enabled by querying the corresponding extended partition property. If
> > > this property is true, the L1VH partition must use the root scheduler
> > > logic; otherwise, it must use the core scheduler.
> > > 
> > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > >  include/hyperv/hvhdk_mini.h |    6 +++
> > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > 
> 
> <snip>
> 
> > >  static int __init mshv_root_partition_init(struct device *dev)
> > >  {
> > >  	int err;
> > > 
> > > -	err = root_scheduler_init(dev);
> > > -	if (err)
> > > -		return err;
> > > -
> > >  	err = register_reboot_notifier(&mshv_reboot_nb);
> > >  	if (err)
> > > -		goto root_sched_deinit;
> > > +		return err;
> > > 
> > >  	return 0;
> > 
> > This code is now:
> > 
> > 	if (err)
> > 		return err;
> > 	return 0;
> > 
> > which can be simplified to just:
> > 
> > 	return err;
> > 
> > Or drop the local variable 'err' and simplify the entire function to:
> > 
> > 	return register_reboot_notifier(&mshv_reboot_nb);
> > 
> > There's a tangential question here: Why is this reboot notifier
> > needed in the first place? All it does is remove the cpuhp state
> > that allocates/frees the per-cpu root_scheduler_input and
> > root_scheduler_output pages. Removing the state will free
> > the pages, but if Linux is rebooting, why bother?
> > 
> 
> This was originally done to support kexec.
> Here is the original commit message:
> 
>     mshv: perform synic cleanup during kexec
> 
>     Register a reboot notifier that performs synic cleanup when a kexec
>     is in progress.
> 
>     One notable issue this commit fixes is one where after a kexec, virtio
>     devices are not functional. Linux root partition receives MMIO doorbell
>     events in the ring buffer in the SIRB synic page. The hypervisor maintains
>     a head pointer where it writes new events into the ring buffer. The root
>     partition maintains a tail pointer to read events from the buffer.
> 
>     Upon kexec reboot, all root data structures are re-initialized and thus the
>     tail pointer gets reset to zero. The hypervisor on the other hand still
>     retains the pre-kexec head pointer which could be non-zero. This means that
>     when the hypervisor writes new events to the ring buffer, the root
>     partition looks at the wrong place and doesn't find any events. So, future
>     doorbell events never get delivered. As a result, virtqueue kicks never get
>     delivered to the host.
> 
>     When the SIRB page is disabled the hypervisor resets the head pointer.
> 
> > > -root_sched_deinit:
> > > -	root_scheduler_deinit();
> > > -	return err;
> > >  }
> > > 
> > > -static void mshv_init_vmm_caps(struct device *dev)
> > > +static int mshv_init_vmm_caps(struct device *dev)
> > >  {
> > > -	/*
> > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > -	 */
> > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > -					      0, &mshv_root.vmm_caps,
> > > -					      sizeof(mshv_root.vmm_caps)))
> > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > +	int ret;
> > > +
> > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > +						0, &mshv_root.vmm_caps,
> > > +						sizeof(mshv_root.vmm_caps));
> > > +	if (ret) {
> > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > +		return ret;
> > > +	}
> > 
> > This is a functional change that isn't mentioned in the commit message.
> > Why is it now appropriate to fail instead of treating the VMM capabilities
> > as all disabled? Presumably there are older versions of the hypervisor that
> > don't support the requirements described in the original comment, but
> > perhaps they are no longer relevant?
> > 
> 
> To fail is now the only option for the L1VH partition. It must discover
> the scheduler type. Without this information, the partition cannot
> operate. The core scheduler logic will not work with an integrated
> scheduler, and vice versa.

I don't think we need to fail here. If we don't find VMM caps, that
means we are on an older hypervisor that supports L1VH but not the
integrated scheduler (yes, such a version exists). In that case, since
the integrated scheduler is not supported by the hypervisor, the core
scheduler logic will work.

Thanks,
Anirudh.

> 
> And yes, older hypervisor versions do not support L1VH.
> 
> Thanks,
> Stanislav
> 
> > > 
> > >  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > > +
> > > +	return 0;
> > >  }
> > > 
> > >  static int __init mshv_parent_partition_init(void)
> > > @@ -2292,6 +2310,10 @@ static int __init mshv_parent_partition_init(void)
> > > 
> > >  	mshv_cpuhp_online = ret;
> > > 
> > > +	ret = mshv_init_vmm_caps(dev);
> > > +	if (ret)
> > > +		goto remove_cpu_state;
> > > +
> > >  	ret = mshv_retrieve_scheduler_type(dev);
> > >  	if (ret)
> > >  		goto remove_cpu_state;
> > > @@ -2301,11 +2323,13 @@ static int __init mshv_parent_partition_init(void)
> > >  	if (ret)
> > >  		goto remove_cpu_state;
> > > 
> > > -	mshv_init_vmm_caps(dev);
> > > +	ret = root_scheduler_init(dev);
> > > +	if (ret)
> > > +		goto exit_partition;
> > > 
> > >  	ret = mshv_irqfd_wq_init();
> > >  	if (ret)
> > > -		goto exit_partition;
> > > +		goto deinit_root_scheduler;
> > > 
> > >  	spin_lock_init(&mshv_root.pt_ht_lock);
> > >  	hash_init(mshv_root.pt_htable);
> > > @@ -2314,6 +2338,8 @@ static int __init mshv_parent_partition_init(void)
> > > 
> > >  	return 0;
> > > 
> > > +deinit_root_scheduler:
> > > +	root_scheduler_deinit();
> > >  exit_partition:
> > >  	if (hv_root_partition())
> > >  		mshv_root_partition_exit();
> > > @@ -2332,6 +2358,7 @@ static void __exit mshv_parent_partition_exit(void)
> > >  	mshv_port_table_fini();
> > >  	misc_deregister(&mshv_dev);
> > >  	mshv_irqfd_wq_cleanup();
> > > +	root_scheduler_deinit();
> > >  	if (hv_root_partition())
> > >  		mshv_root_partition_exit();
> > >  	cpuhp_remove_state(mshv_cpuhp_online);
> > > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > > index aa03616f965b..0f7178fa88a8 100644
> > > --- a/include/hyperv/hvhdk_mini.h
> > > +++ b/include/hyperv/hvhdk_mini.h
> > > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> > >  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> > >  	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
> > > 
> > > +	/* Integrated scheduling properties */
> > > +	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
> > > +
> > >  	/* Resource properties */
> > >  	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
> > >  	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
> > > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> > >  };
> > > 
> > >  #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
> > > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	58
> > > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
> > > 
> > >  struct hv_partition_property_vmm_capabilities {
> > >  	u16 bank_count;
> > > @@ -120,6 +123,7 @@ struct hv_partition_property_vmm_capabilities {
> > >  #endif
> > >  			u64 assignable_synthetic_proc_features: 1;
> > >  			u64 tag_hv_message_from_child: 1;
> > > +			u64 vmm_enable_integrated_scheduler : 1;
> > >  			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> > >  		} __packed;
> > >  	};
> > > 
> > > 
> > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-01-30 17:17 UTC (permalink / raw)
  To: Mukesh R
  Cc: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <919446c3-e02f-d532-3ea8-74d0cee38d33@linux.microsoft.com>

On Thu, Jan 29, 2026 at 06:59:31PM -0800, Mukesh R wrote:
> On 1/28/26 15:08, Stanislav Kinsburskii wrote:
> > On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
> > > On 1/27/26 09:47, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
> > > > > On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> > > > > > > On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> > > > > > > > On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> > > > > > > > > On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > > > > > > > > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > > > 
> > > > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > > > management is implemented.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >        drivers/hv/Kconfig |    1 +
> > > > > > > > > > > >        1 file changed, 1 insertion(+)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > > > > > >        	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > > > > > >        	# no particular order, making it impossible to reassemble larger pages
> > > > > > > > > > > >        	depends on PAGE_SIZE_4KB
> > > > > > > > > > > > +	depends on !KEXEC
> > > > > > > > > > > >        	select EVENTFD
> > > > > > > > > > > >        	select VIRT_XFER_TO_GUEST_WORK
> > > > > > > > > > > >        	select HMM_MIRROR
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > > > > > > > > > implying that crash dump might be involved. Or did you test kdump
> > > > > > > > > > > and it was fine?
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > > > > > > > > > will be affected as well.
> > > > > > > > > 
> > > > > > > > > So not sure I understand the reason for this patch. We can just block
> > > > > > > > > kexec if there are any VMs running, right? Doing this would mean any
> > > > > > > > > > further development would be without a very important and major feature,
> > > > > > > > > right?
> > > > > > > > 
> > > > > > > > > This is an option. But until it's implemented and merged, a user of the
> > > > > > > > > mshv driver gets into a situation where kexec is broken in a non-obvious way.
> > > > > > > > The system may crash at any time after kexec, depending on whether the
> > > > > > > > new kernel touches the pages deposited to hypervisor or not. This is a
> > > > > > > > bad user experience.
> > > > > > > 
> > > > > > > I understand that. But with this we cannot collect core and debug any
> > > > > > > crashes. I was thinking there would be a quick way to prohibit kexec
> > > > > > > for update via notifier or some other quick hack. Did you already
> > > > > > > explore that and didn't find anything, hence this?
> > > > > > > 
> > > > > > 
> > > > > > This quick hack you mention isn't quick in the upstream kernel as there
> > > > > > is no hook to interrupt kexec process except the live update one.
> > > > > 
> > > > > That's the one we want to interrupt and block right? crash kexec
> > > > > is ok and should be allowed. We can document we don't support kexec
> > > > > for update for now.
> > > > > 
> > > > > > > I sent an RFC for that one but given today's conversation details it
> > > > > > > won't be accepted as is.
> > > > > 
> > > > > > Are you talking about this?
> > > > > 
> > > > >           "mshv: Add kexec safety for deposited pages"
> > > > > 
> > > > 
> > > > Yes.
> > > > 
> > > > > > Making mshv mutually exclusive with kexec is the only viable option for
> > > > > > now given time constraints.
> > > > > > It is intended to be replaced with proper page lifecycle management in
> > > > > > the future.
> > > > > 
> > > > > Yeah, that could take a long time and imo we cannot just disable KEXEC
> > > > > completely. What we want is just block kexec for updates from some
> > > > > > mshv file for now, we can print during boot that kexec for updates is
> > > > > not supported on mshv. Hope that makes sense.
> > > > > 
> > > > 
> > > > The trade-off here is between disabling kexec support and having the
> > > > kernel crash after kexec in a non-obvious way. This affects both regular
> > > > kexec and crash kexec.
> > > 
> > > crash kexec on baremetal is not affected, hence disabling that
> > > doesn't make sense as we can't debug crashes then on bm.
> > > 
> > 
> > Bare metal support is not currently relevant, as it is not available.
> > This is the upstream kernel, and this driver will be accessible to
> > third-party customers beginning with kernel 6.19 for running their
> > kernels in Azure L1VH, so consistency is required.
> 
> Well, without crashdump support, customers will not be running anything
> anywhere.

This is my concern too. I don't think customers will be particularly
happy that kexec doesn't work with our driver.

Thanks,
Anirudh

> 
> Thanks,
> -Mukesh
> 
> > Thanks,
> > Stanislav
> > 
> > > Let me think and explore a bit, and if I come up with something, I'll
> > > send a patch here. If nothing, then we can do this as last resort.
> > > 
> > > Thanks,
> > > -Mukesh
> > > 
> > > 
> > > > > It's a pity we can't apply a quick hack to disable only regular kexec.
> > > > > However, since crash kexec would hit the same issues, until we have a
> > > > > proper state transition for deposited pages, the best workaround for now
> > > > is to reset the hypervisor state on every kexec, which needs design,
> > > > work, and testing.
> > > > 
> > > > Disabling kexec is the only consistent way to handle this in the
> > > > upstream kernel at the moment.
> > > > 
> > > > Thanks, Stanislav
> > > > 
> > > > 
> > > > > Thanks,
> > > > > -Mukesh
> > > > > 
> > > > > 
> > > > > 
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > > 
> > > > > > > Thanks,
> > > > > > > -Mukesh
> > > > > > > 
> > > > > > > > > Therefore it should be explicitly forbidden as it's essentially not
> > > > > > > > supported yet.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Stanislav
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Stanislav
> > > > > > > > > > 
> > > > > > > > > > > Thanks,
> > > > > > > > > > > -Mukesh
> 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-01-30 17:11 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXqXkhhl1xuvjm3P@skinsburskii.localdomain>

On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > hypervisor deposited pages.
> > > > > 
> > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > management is implemented.
> > > > 
> > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > and would work without any issue for L1VH.
> > > > 
> > > 
> > > > No, it won't work and hypervisor deposited pages won't be withdrawn.
> > 
> > All pages that were deposited in the context of a guest partition (i.e.
> > with the guest partition ID), would be withdrawn when you kill the VMs,
> > right? What other deposited pages would be left?
> > 
> 
> The driver deposits two types of pages: one for the guests (withdrawn
> upon guest shutdown) and the other - for the host itself (never
> withdrawn).
> See hv_call_create_partition, for example: it deposits pages for the
> host partition.

Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
Also, can't we forcefully kill all running partitions in module_exit and
then reclaim memory? Would this help with kernel consistency
irrespective of userspace behavior?

Thanks,
Anirudh.

> 
> Thanks,
> Stanislav
> 
> > Thanks,
> > Anirudh.
> > 
> > > > Also, kernel consistency must not depend on user space behavior.
> > > 
> > > > Also, I don't think it is reasonable at all that someone needs to
> > > > disable basic kernel functionality such as kexec in order to use our
> > > > driver.
> > > > 
> > > 
> > > It's a temporary measure until proper page lifecycle management is
> > > supported in the driver.
> > > Mutual exclusion of the driver and kexec is given and thus should be
> > > > explicitly stated in the Kconfig.
> > > 
> > > Thanks,
> > > Stanislav
> > > 
> > > > Thanks,
> > > > Anirudh.
> > > > 
> > > > > 
> > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > ---
> > > > >  drivers/hv/Kconfig |    1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > --- a/drivers/hv/Kconfig
> > > > > +++ b/drivers/hv/Kconfig
> > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > >  	depends on PAGE_SIZE_4KB
> > > > > +	depends on !KEXEC
> > > > >  	select EVENTFD
> > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > >  	select HMM_MIRROR
> > > > > 
> > > > > 

^ permalink raw reply

* Re: [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-01-30 17:09 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXuS-ogiBX2Z3Gnf@skinsburskii.localdomain>

On Thu, Jan 29, 2026 at 09:03:54AM -0800, Stanislav Kinsburskii wrote:
> On Thu, Jan 29, 2026 at 04:36:51AM +0000, Anirudh Rayabharam wrote:
> > On Wed, Jan 28, 2026 at 03:03:51PM -0800, Stanislav Kinsburskii wrote:
> > > On Wed, Jan 28, 2026 at 04:04:37PM +0000, Anirudh Rayabharam wrote:
> > > > From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 
> <snip>
> 
> > > 
> > > > +static int mshv_irq = -1;
> > > > +
> > > 
> > > > Should this be a part of the mshv_root structure?
> > 
> > This doesn't need to be globally accessible. It is only used in this file.
> > So I guess it doesn't need to be in mshv_root. What do you think?
> > 
> 
> Please, see below.

The below part doesn't make a case for this variable being part of the
mshv_root structure. Did you miss this part in your reply?

> 
> <snip>
> 
> > > >  int mshv_synic_cpu_init(unsigned int cpu)
> > > >  {
> > > >  	union hv_synic_simp simp;
> > > >  	union hv_synic_siefp siefp;
> > > >  	union hv_synic_sirbp sirbp;
> > > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > >  	union hv_synic_sint sint;
> > > > -#endif
> > > >  	union hv_synic_scontrol sctrl;
> > > >  	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> > > >  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> > > > @@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > > >  
> > > >  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> > > >  
> > > > -#ifdef HYPERVISOR_CALLBACK_VECTOR
> > > > +	if (mshv_irq != -1)
> > > > +		enable_percpu_irq(mshv_irq, 0);
> > > > +
> > > 
> > > It's better to explicitly separate x86 and arm64 paths with #ifdefs.
> > > For example:
> > > 
> > > #ifdef CONFIG_X86_64
> > > int setup_cpu_sint() {
> > >   	/* Enable intercepts */
> > >   	sint.as_uint64 = 0;
> > > 	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > 	....
> > > }
> > > #endif
> > > #ifdef CONFIG_ARM64
> > > int setup_cpu_sint() {
> > > 	enable_percpu_irq(mshv_irq, 0);
> > > 
> > >   	/* Enable intercepts */
> > >   	sint.as_uint64 = 0;
> > > 	sint.vector = mshv_interrupt;
> > > 	....
> > > }
> > > #endif
> > 
> > This seems unnecessary. We've made the paths that determine
> > mshv_interrupt separate. Now we can just use that here.
> > 
> > There is no need to write two copies of 
> > 
> > 	...
> >    	sint.as_uint64 = 0;
> >  	sint.vector = <whatever>;
> > 	...
> > 
> > I could do the enable_percpu_irq() inside an ifdef. But do we gain
> > anything from it? Won't the compiler optimize the current code as well
> > since mshv_irq will always be -1 whenever HYPERVISOR_CALLBACK_VECTOR is
> > defined?
> > 
> 
> AFAIU this patch, x86 doesn’t need these variables at all. So it’s better
> to separate them completely and explicitly.
> 
> Also, this isn’t the only place where ARM-specific logic is added. This
> patch adds ARM-specific logic and tries to weave it into the existing
> x86 flow.
> 
> If it were only one place, that might be OK. But here it happens in
> several places. That makes the code harder to read and maintain. It also
> makes future extensions more risky (and they will likely follow). The
> dependencies are also not obvious. For example, on ARM the interrupt
> vector comes from ACPI (at least that’s what the comments say). So it’s
> not right to mix this into the common x86 path even if
> HYPERVISOR_CALLBACK_VECTOR is an x86-specific define.

We shouldn't think of this code in terms of X86 & ARM64. It's not about
arch at all. It's about whether or not we have a pre-defined vector
(a.k.a HYPERVISOR_CALLBACK_VECTOR). I feel that the current code cleanly
separates the two cases. The main difference in the two cases is in how
the vector is determined, which is well separated in the code paths. Once
the vector is determined, how we program it in the synic is the same for
both cases.

> 
> It would be much better to keep this ARM-specific logic in separate,
> conditionally compiled code. I suggest changing the flow to make this
> per-arch logic explicit. It will pay off later.

Most of the code introduced in this patch is conditionally compiled.
Building code from this patch on x86 will conditionally compile out a
large majority of it.

Are you by any chance suggesting we put it in a separate file?

Thanks,
Anirudh.

> 
> Thanks,
> Stanislav
> 
> > Thanks,
> > Anirudh.
> > 
> > > 
> > > Thanks,
> > > Stanislav
> > > 
> > > >  	/* Enable intercepts */
> > > >  	sint.as_uint64 = 0;
> > > > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > +	sint.vector = mshv_interrupt;
> > > >  	sint.masked = false;
> > > >  	sint.auto_eoi = hv_recommend_using_aeoi();
> > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> > > > @@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
> > > >  
> > > >  	/* Doorbell SINT */
> > > >  	sint.as_uint64 = 0;
> > > > -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> > > > +	sint.vector = mshv_interrupt;
> > > >  	sint.masked = false;
> > > >  	sint.as_intercept = 1;
> > > >  	sint.auto_eoi = hv_recommend_using_aeoi();
> > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > >  			      sint.as_uint64);
> > > > -#endif
> > > >  
> > > >  	/* Enable global synic bit */
> > > >  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> > > > @@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
> > > >  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> > > >  			      sint.as_uint64);
> > > >  
> > > > +	if (mshv_irq != -1)
> > > > +		disable_percpu_irq(mshv_irq);
> > > > +
> > > >  	/* Disable Synic's event ring page */
> > > >  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> > > >  	sirbp.sirbp_enabled = false;
> > > > -- 
> > > > 2.34.1
> > > > 

^ permalink raw reply

* RE: [PATCH v3] mshv: Add support for integrated scheduler
From: Michael Kelley @ 2026-01-30 17:02 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <176978905128.18763.15996443783319253336.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Friday, January 30, 2026 8:04 AM
> 
> Query the hypervisor for integrated scheduler support and use it if
> configured.
> 
> Microsoft Hypervisor originally provided two schedulers: root and core. The
> root scheduler allows the root partition to schedule guest vCPUs across
> physical cores, supporting both time slicing and CPU affinity (e.g., via
> cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> scheduling entirely to the hypervisor.
> 
> Direct virtualization introduces a new privileged guest partition type - L1
> Virtual Host (L1VH) — which can create child partitions from its own
> resources. These child partitions are effectively siblings, scheduled by
> the hypervisor's core scheduler. This prevents the L1VH parent from setting
> affinity or time slicing for its own processes or guest VPs. While cgroups,
> CFS, and cpuset controllers can still be used, their effectiveness is
> unpredictable, as the core scheduler swaps vCPUs according to its own logic
> (typically round-robin across all allocated physical CPUs). As a result,
> the system may appear to "steal" time from the L1VH and its children.
> 
> To address this, Microsoft Hypervisor introduces the integrated scheduler.
> This allows an L1VH partition to schedule its own vCPUs and those of its
> guests across its "physical" cores, effectively emulating root scheduler
> behavior within the L1VH, while retaining core scheduler behavior for the
> rest of the system.
> 
> The integrated scheduler is controlled by the root partition and gated by
> the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> supports the integrated scheduler. The L1VH partition must then check if it
> is enabled by querying the corresponding extended partition property. If
> this property is true, the L1VH partition must use the root scheduler
> logic; otherwise, it must use the core scheduler. As a result, reading the
> VMM capabilities becomes mandatory in an L1VH partition.
> 
> Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_main.c |   85 +++++++++++++++++++++++++++----------------
>  include/hyperv/hvhdk_mini.h |    7 +++-
>  2 files changed, 59 insertions(+), 33 deletions(-)

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

^ permalink raw reply

* [PATCH v3] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-30 16:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Query the hypervisor for integrated scheduler support and use it if
configured.

Microsoft Hypervisor originally provided two schedulers: root and core. The
root scheduler allows the root partition to schedule guest vCPUs across
physical cores, supporting both time slicing and CPU affinity (e.g., via
cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
scheduling entirely to the hypervisor.

Direct virtualization introduces a new privileged guest partition type - L1
Virtual Host (L1VH) — which can create child partitions from its own
resources. These child partitions are effectively siblings, scheduled by
the hypervisor's core scheduler. This prevents the L1VH parent from setting
affinity or time slicing for its own processes or guest VPs. While cgroups,
CFS, and cpuset controllers can still be used, their effectiveness is
unpredictable, as the core scheduler swaps vCPUs according to its own logic
(typically round-robin across all allocated physical CPUs). As a result,
the system may appear to "steal" time from the L1VH and its children.

To address this, Microsoft Hypervisor introduces the integrated scheduler.
This allows an L1VH partition to schedule its own vCPUs and those of its
guests across its "physical" cores, effectively emulating root scheduler
behavior within the L1VH, while retaining core scheduler behavior for the
rest of the system.

The integrated scheduler is controlled by the root partition and gated by
the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
supports the integrated scheduler. The L1VH partition must then check if it
is enabled by querying the corresponding extended partition property. If
this property is true, the L1VH partition must use the root scheduler
logic; otherwise, it must use the core scheduler. As a result, reading the
VMM capabilities becomes mandatory in an L1VH partition.

Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   85 +++++++++++++++++++++++++++----------------
 include/hyperv/hvhdk_mini.h |    7 +++-
 2 files changed, 59 insertions(+), 33 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..6a6bf641b352 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
 	};
 }
 
+static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)
+{
+	u64 integrated_sched_enabled;
+	int ret;
+
+	*out = HV_SCHEDULER_TYPE_CORE_SMT;
+
+	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
+		return 0;
+
+	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+						HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
+						0, &integrated_sched_enabled,
+						sizeof(integrated_sched_enabled));
+	if (ret)
+		return ret;
+
+	if (integrated_sched_enabled)
+		*out = HV_SCHEDULER_TYPE_ROOT;
+
+	pr_debug("%s: integrated scheduler property read: ret=%d value=%llu\n",
+		 __func__, ret, integrated_sched_enabled);
+
+	return 0;
+}
+
 /* TODO move this to hv_common.c when needed outside */
 static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
 {
@@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
 /* Retrieve and stash the supported scheduler type */
 static int __init mshv_retrieve_scheduler_type(struct device *dev)
 {
-	int ret = 0;
+	int ret;
 
 	if (hv_l1vh_partition())
-		hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
+		ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
 	else
 		ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
-
 	if (ret)
 		return ret;
 
@@ -2211,42 +2236,29 @@ struct notifier_block mshv_reboot_nb = {
 static void mshv_root_partition_exit(void)
 {
 	unregister_reboot_notifier(&mshv_reboot_nb);
-	root_scheduler_deinit();
 }
 
 static int __init mshv_root_partition_init(struct device *dev)
 {
-	int err;
-
-	err = root_scheduler_init(dev);
-	if (err)
-		return err;
-
-	err = register_reboot_notifier(&mshv_reboot_nb);
-	if (err)
-		goto root_sched_deinit;
-
-	return 0;
-
-root_sched_deinit:
-	root_scheduler_deinit();
-	return err;
+	return register_reboot_notifier(&mshv_reboot_nb);
 }
 
-static void mshv_init_vmm_caps(struct device *dev)
+static int __init mshv_init_vmm_caps(struct device *dev)
 {
-	/*
-	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
-	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
-	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
-	 */
-	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
-					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
-					      0, &mshv_root.vmm_caps,
-					      sizeof(mshv_root.vmm_caps)))
-		dev_warn(dev, "Unable to get VMM capabilities\n");
+	int ret;
+
+	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
+						0, &mshv_root.vmm_caps,
+						sizeof(mshv_root.vmm_caps));
+	if (ret && hv_l1vh_partition()) {
+		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
+		return ret;
+	}
 
 	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
+
+	return 0;
 }
 
 static int __init mshv_parent_partition_init(void)
@@ -2292,6 +2304,10 @@ static int __init mshv_parent_partition_init(void)
 
 	mshv_cpuhp_online = ret;
 
+	ret = mshv_init_vmm_caps(dev);
+	if (ret)
+		goto remove_cpu_state;
+
 	ret = mshv_retrieve_scheduler_type(dev);
 	if (ret)
 		goto remove_cpu_state;
@@ -2301,11 +2317,13 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto remove_cpu_state;
 
-	mshv_init_vmm_caps(dev);
+	ret = root_scheduler_init(dev);
+	if (ret)
+		goto exit_partition;
 
 	ret = mshv_irqfd_wq_init();
 	if (ret)
-		goto exit_partition;
+		goto deinit_root_scheduler;
 
 	spin_lock_init(&mshv_root.pt_ht_lock);
 	hash_init(mshv_root.pt_htable);
@@ -2314,6 +2332,8 @@ static int __init mshv_parent_partition_init(void)
 
 	return 0;
 
+deinit_root_scheduler:
+	root_scheduler_deinit();
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
@@ -2332,6 +2352,7 @@ static void __exit mshv_parent_partition_exit(void)
 	mshv_port_table_fini();
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
+	root_scheduler_deinit();
 	if (hv_root_partition())
 		mshv_root_partition_exit();
 	cpuhp_remove_state(mshv_cpuhp_online);
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 41a29bf8ec14..c0300910808b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -87,6 +87,9 @@ enum hv_partition_property_code {
 	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
 	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
 
+	/* Integrated scheduling properties */
+	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
+
 	/* Resource properties */
 	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
 	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
@@ -102,7 +105,7 @@ enum hv_partition_property_code {
 };
 
 #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
-#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	59
+#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
 
 struct hv_partition_property_vmm_capabilities {
 	u16 bank_count;
@@ -119,6 +122,8 @@ struct hv_partition_property_vmm_capabilities {
 			u64 reservedbit3: 1;
 #endif
 			u64 assignable_synthetic_proc_features: 1;
+			u64 reservedbit5: 1;
+			u64 vmm_enable_integrated_scheduler : 1;
 			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
 		} __packed;
 	};
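
A note on the reserved bitfield count above: the two new single-bit fields
are paid for by shrinking reserved0 from 59 to 57 bits, so the capability
bank stays exactly 64 bits wide. The stand-alone sketch below checks that
accounting at compile time. union vmm_caps_sketch is a hypothetical
user-space model, not the kernel structure, and earlier_bits is an assumed
placeholder for the three bits that precede reservedbit3 (they are outside
the context shown in this hunk).

```c
#include <assert.h>
#include <stdint.h>

#define RESERVED_BITFIELD_COUNT 57

/* Hypothetical model of one 64-bit capability bank. */
union vmm_caps_sketch {
	uint64_t as_uint64;
	struct {
		uint64_t earlier_bits : 3;	/* assumption: bits 0-2 */
		uint64_t reservedbit3 : 1;
		uint64_t assignable_synthetic_proc_features : 1;
		uint64_t reservedbit5 : 1;
		uint64_t vmm_enable_integrated_scheduler : 1;
		uint64_t reserved0 : RESERVED_BITFIELD_COUNT;
	};
};

/* 3 + 1 + 1 + 1 + 1 + 57 == 64: the bitfields must fill one u64 exactly. */
static_assert(sizeof(union vmm_caps_sketch) == sizeof(uint64_t),
	      "capability bitfields must add up to 64 bits");

/* Returns 1 if the integrated scheduler capability is set in a raw bank. */
static int integrated_scheduler_cap(uint64_t raw)
{
	union vmm_caps_sketch caps = { .as_uint64 = raw };

	return caps.vmm_enable_integrated_scheduler;
}
```

With GCC or Clang, leaving the count at 59 would make the bitfields spill
into a second 64-bit unit and the static assertion would fail. The mapping
of the new field to bit 6 of the raw value holds for the usual little-endian
bitfield layout; bitfield ordering is implementation-defined in C.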



^ permalink raw reply related

* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-01-30 15:49 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157EE41697ABC1002750297D49FA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, Jan 30, 2026 at 01:24:34AM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, January 29, 2026 11:10 AM
> > 
> > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > 
> > <snip>
> > 
> > > >  static int __init mshv_root_partition_init(struct device *dev)
> > > >  {
> > > >  	int err;
> > > >
> > > > -	err = root_scheduler_init(dev);
> > > > -	if (err)
> > > > -		return err;
> > > > -
> > > >  	err = register_reboot_notifier(&mshv_reboot_nb);
> > > >  	if (err)
> > > > -		goto root_sched_deinit;
> > > > +		return err;
> > > >
> > > >  	return 0;
> > >
> > > This code is now:
> > >
> > > 	if (err)
> > > 		return err;
> > > 	return 0;
> > >
> > > which can be simplified to just:
> > >
> > > 	return err;
> > >
> > > Or drop the local variable 'err' and simplify the entire function to:
> > >
> > > 	return register_reboot_notifier(&mshv_reboot_nb);
> > >
> > > There's a tangential question here: Why is this reboot notifier
> > > needed in the first place? All it does is remove the cpuhp state
> > > that allocates/frees the per-cpu root_scheduler_input and
> > > root_scheduler_output pages. Removing the state will free
> > > the pages, but if Linux is rebooting, why bother?
> > >
> > 
> > This was originally done to support kexec.
> > Here is the original commit message:
> > 
> >     mshv: perform synic cleanup during kexec
> > 
> >     Register a reboot notifier that performs synic cleanup when a kexec
> >     is in progress.
> > 
> >     One notable issue this commit fixes is one where after a kexec, virtio
> >     devices are not functional. Linux root partition receives MMIO doorbell
> >     events in the ring buffer in the SIRB synic page. The hypervisor maintains
> >     a head pointer where it writes new events into the ring buffer. The root
> >     partition maintains a tail pointer to read events from the buffer.
> > 
> >     Upon kexec reboot, all root data structures are re-initialized and thus the
> >     tail pointer gets reset to zero. The hypervisor on the other hand still
> >     retains the pre-kexec head pointer which could be non-zero. This means that
> >     when the hypervisor writes new events to the ring buffer, the root
> >     partition looks at the wrong place and doesn't find any events. So, future
> >     doorbell events never get delivered. As a result, virtqueue kicks never get
> >     delivered to the host.
> > 
> >     When the SIRB page is disabled the hypervisor resets the head pointer.
> 
> FWIW, I don't see that commit message anywhere in a public source code
> tree. The calls to register/unregister_reboot_notifier() were in the original
> introduction of mshv_root_main.c in upstream commit 621191d709b14.
> Evidently the code described by that commit message was not submitted
> upstream. And of course, the kexec() topic is now being revisited ....
> 
> So to clarify: Do you expect that in the future the reboot notifier will be
> used for something that really is required for resetting hypervisor state
> in the case of a kexec reboot?
> 

Yes, for now it's the best we have.
This code can be dropped later if we get a better way to handle kexec.
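
For reference, the head/tail mismatch described in the quoted commit message
can be modeled in a few lines of user-space C. This is only a sketch under
stated assumptions: the ring size, the "zero means empty slot" convention and
every name below (struct ring, ring_produce(), consumer_reinit(), ...) are
hypothetical and chosen for illustration; none of it is the real SIRB layout
or an mshv interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical size, for illustration only */

/* Shared page plus the producer ("hypervisor") index. */
struct ring {
	uint32_t events[RING_SIZE];	/* assumption: 0 means "slot empty" */
	uint32_t head;			/* next slot the producer writes */
};

/* Consumer ("root partition") state, lost across a kexec. */
struct consumer {
	uint32_t tail;			/* next slot the consumer reads */
};

/* Producer writes one event at head and advances head. */
static void ring_produce(struct ring *r, uint32_t event)
{
	r->events[r->head] = event;
	r->head = (r->head + 1) % RING_SIZE;
}

/* Consumer reads the slot at tail; returns false if that slot is empty. */
static bool ring_consume(struct ring *r, struct consumer *c, uint32_t *event)
{
	if (r->events[c->tail] == 0)
		return false;
	*event = r->events[c->tail];
	r->events[c->tail] = 0;
	c->tail = (c->tail + 1) % RING_SIZE;
	return true;
}

/* Models the new kernel re-initializing its data: tail goes back to 0. */
static void consumer_reinit(struct consumer *c)
{
	c->tail = 0;
}

/* Models disabling the page before kexec: the producer resets head too. */
static void ring_reset(struct ring *r)
{
	for (uint32_t i = 0; i < RING_SIZE; i++)
		r->events[i] = 0;
	r->head = 0;
}
```

In this model, calling consumer_reinit() without ring_reset() leaves the
producer writing at its old head while the consumer polls slot 0, so new
events are never seen. Resetting both sides puts them back in step, which is
the effect the commit message attributes to disabling the SIRB page.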

> > 
> > > > -root_sched_deinit:
> > > > -	root_scheduler_deinit();
> > > > -	return err;
> > > >  }
> > > >
> > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > >  {
> > > > -	/*
> > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > -	 */
> > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > -					      0, &mshv_root.vmm_caps,
> > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > +	int ret;
> > > > +
> > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > +						0, &mshv_root.vmm_caps,
> > > > +						sizeof(mshv_root.vmm_caps));
> > > > +	if (ret) {
> > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > +		return ret;
> > > > +	}
> > >
> > > This is a functional change that isn't mentioned in the commit message.
> > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > as all disabled? Presumably there are older versions of the hypervisor that
> > > don't support the requirements described in the original comment, but
> > > perhaps they are no longer relevant?
> > >
> > 
> > To fail is now the only option for the L1VH partition. It must discover
> > the scheduler type. Without this information, the partition cannot
> > operate. The core scheduler logic will not work with an integrated
> > scheduler, and vice versa.
> > 
> > And yes, older hypervisor versions do not support L1VH.
> 
> That makes sense. Your change in v2 of the patch handles this
> nicely. For the non-L1VH case, the v2 behavior is the same as before in
> that the init path won't error out on older hypervisors that don't
> support the requirements described in the original comment. That's
> the case I am concerned about.
> 

Yes. Thank you for the review and feedback!

Stanislav
> Michael

^ permalink raw reply

* Re: [PATCH v2] mshv: Add support for integrated scheduler
From: kernel test robot @ 2026-01-30 15:09 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: oe-kbuild-all, linux-hyperv, linux-kernel
In-Reply-To: <176971725312.67225.3938191771112866951.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Hi Stanislav,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc7 next-20260129]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/mshv-Add-support-for-integrated-scheduler/20260130-041014
base:   linus/master
patch link:    https://lore.kernel.org/r/176971725312.67225.3938191771112866951.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH v2] mshv: Add support for integrated scheduler
config: x86_64-randconfig-002-20260130 (https://download.01.org/0day-ci/archive/20260130/202601302238.nUbp7p58-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260130/202601302238.nUbp7p58-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601302238.nUbp7p58-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/hv/mshv_root_main.c: In function 'mshv_init_vmm_caps':
   drivers/hv/mshv_root_main.c:2255:9: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
    2255 |         if (ret && hv_l1vh_partition())
         |         ^~
   drivers/hv/mshv_root_main.c:2257:17: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
    2257 |                 return ret;
         |                 ^~~~~~
   In file included from include/linux/device.h:15,
                    from include/linux/blk_types.h:11,
                    from include/linux/writeback.h:13,
                    from include/linux/memcontrol.h:23,
                    from include/linux/resume_user_mode.h:8,
                    from include/linux/entry-virt.h:6,
                    from drivers/hv/mshv_root_main.c:11:
   drivers/hv/mshv_root_main.c: At top level:
>> include/linux/dev_printk.h:137:10: error: expected identifier or '(' before '{' token
     137 |         ({                                                              \
         |          ^
   include/linux/dev_printk.h:171:9: note: in expansion of macro 'dev_no_printk'
     171 |         dev_no_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:2260:9: note: in expansion of macro 'dev_dbg'
    2260 |         dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
         |         ^~~~~~~
   drivers/hv/mshv_root_main.c:2262:9: error: expected identifier or '(' before 'return'
    2262 |         return 0;
         |         ^~~~~~
   drivers/hv/mshv_root_main.c:2263:1: error: expected identifier or '(' before '}' token
    2263 | }
         | ^


vim +137 include/linux/dev_printk.h

af628aae8640c26 Greg Kroah-Hartman 2019-12-09   99  
ad7d61f159db739 Chris Down         2021-06-15  100  /*
ad7d61f159db739 Chris Down         2021-06-15  101   * Need to take variadic arguments even though we don't use them, as dev_fmt()
ad7d61f159db739 Chris Down         2021-06-15  102   * may only just have been expanded and may result in multiple arguments.
ad7d61f159db739 Chris Down         2021-06-15  103   */
ad7d61f159db739 Chris Down         2021-06-15  104  #define dev_printk_index_emit(level, fmt, ...) \
ad7d61f159db739 Chris Down         2021-06-15  105  	printk_index_subsys_emit("%s %s: ", level, fmt)
ad7d61f159db739 Chris Down         2021-06-15  106  
ad7d61f159db739 Chris Down         2021-06-15  107  #define dev_printk_index_wrap(_p_func, level, dev, fmt, ...)		\
ad7d61f159db739 Chris Down         2021-06-15  108  	({								\
ad7d61f159db739 Chris Down         2021-06-15  109  		dev_printk_index_emit(level, fmt);			\
ad7d61f159db739 Chris Down         2021-06-15  110  		_p_func(dev, fmt, ##__VA_ARGS__);			\
ad7d61f159db739 Chris Down         2021-06-15  111  	})
ad7d61f159db739 Chris Down         2021-06-15  112  
ad7d61f159db739 Chris Down         2021-06-15  113  /*
ad7d61f159db739 Chris Down         2021-06-15  114   * Some callsites directly call dev_printk rather than going through the
ad7d61f159db739 Chris Down         2021-06-15  115   * dev_<level> infrastructure, so we need to emit here as well as inside those
ad7d61f159db739 Chris Down         2021-06-15  116   * level-specific macros. Only one index entry will be produced, either way,
ad7d61f159db739 Chris Down         2021-06-15  117   * since dev_printk's `fmt` isn't known at compile time if going through the
ad7d61f159db739 Chris Down         2021-06-15  118   * dev_<level> macros.
ad7d61f159db739 Chris Down         2021-06-15  119   *
ad7d61f159db739 Chris Down         2021-06-15  120   * dev_fmt() isn't called for dev_printk when used directly, as it's used by
ad7d61f159db739 Chris Down         2021-06-15  121   * the dev_<level> macros internally which already have dev_fmt() processed.
ad7d61f159db739 Chris Down         2021-06-15  122   *
ad7d61f159db739 Chris Down         2021-06-15  123   * We also can't use dev_printk_index_wrap directly, because we have a separate
ad7d61f159db739 Chris Down         2021-06-15  124   * level to process.
ad7d61f159db739 Chris Down         2021-06-15  125   */
ad7d61f159db739 Chris Down         2021-06-15  126  #define dev_printk(level, dev, fmt, ...)				\
ad7d61f159db739 Chris Down         2021-06-15  127  	({								\
ad7d61f159db739 Chris Down         2021-06-15  128  		dev_printk_index_emit(level, fmt);			\
ad7d61f159db739 Chris Down         2021-06-15  129  		_dev_printk(level, dev, fmt, ##__VA_ARGS__);		\
ad7d61f159db739 Chris Down         2021-06-15  130  	})
ad7d61f159db739 Chris Down         2021-06-15  131  
c26ec799042a388 Geert Uytterhoeven 2024-02-28  132  /*
c26ec799042a388 Geert Uytterhoeven 2024-02-28  133   * Dummy dev_printk for disabled debugging statements to use whilst maintaining
c26ec799042a388 Geert Uytterhoeven 2024-02-28  134   * gcc's format checking.
c26ec799042a388 Geert Uytterhoeven 2024-02-28  135   */
c26ec799042a388 Geert Uytterhoeven 2024-02-28  136  #define dev_no_printk(level, dev, fmt, ...)				\
c26ec799042a388 Geert Uytterhoeven 2024-02-28 @137  	({								\
c26ec799042a388 Geert Uytterhoeven 2024-02-28  138  		if (0)							\
c26ec799042a388 Geert Uytterhoeven 2024-02-28  139  			_dev_printk(level, dev, fmt, ##__VA_ARGS__);	\
c26ec799042a388 Geert Uytterhoeven 2024-02-28  140  	})
c26ec799042a388 Geert Uytterhoeven 2024-02-28  141  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v2] mshv: Add support for integrated scheduler
From: kernel test robot @ 2026-01-30  9:15 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: llvm, oe-kbuild-all, linux-hyperv, linux-kernel
In-Reply-To: <176971725312.67225.3938191771112866951.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Hi Stanislav,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc7 next-20260129]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/mshv-Add-support-for-integrated-scheduler/20260130-041014
base:   linus/master
patch link:    https://lore.kernel.org/r/176971725312.67225.3938191771112866951.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH v2] mshv: Add support for integrated scheduler
config: x86_64-buildonly-randconfig-004-20260130 (https://download.01.org/0day-ci/archive/20260130/202601301732.2k4q81GI-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260130/202601301732.2k4q81GI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601301732.2k4q81GI-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/hv/mshv_root_main.c:2260:2: error: expected identifier or '('
    2260 |         dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
         |         ^
   include/linux/dev_printk.h:171:2: note: expanded from macro 'dev_dbg'
     171 |         dev_no_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__)
         |         ^
   include/linux/dev_printk.h:137:3: note: expanded from macro 'dev_no_printk'
     137 |         ({                                                              \
         |          ^
>> drivers/hv/mshv_root_main.c:2260:2: error: expected ')'
   include/linux/dev_printk.h:171:2: note: expanded from macro 'dev_dbg'
     171 |         dev_no_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__)
         |         ^
   include/linux/dev_printk.h:137:3: note: expanded from macro 'dev_no_printk'
     137 |         ({                                                              \
         |          ^
   drivers/hv/mshv_root_main.c:2260:2: note: to match this '('
   include/linux/dev_printk.h:171:2: note: expanded from macro 'dev_dbg'
     171 |         dev_no_printk(KERN_DEBUG, dev, dev_fmt(fmt), ##__VA_ARGS__)
         |         ^
   include/linux/dev_printk.h:137:2: note: expanded from macro 'dev_no_printk'
     137 |         ({                                                              \
         |         ^
   drivers/hv/mshv_root_main.c:2262:2: error: expected identifier or '('
    2262 |         return 0;
         |         ^
>> drivers/hv/mshv_root_main.c:2263:1: error: extraneous closing brace ('}')
    2263 | }
         | ^
   4 errors generated.


vim +2260 drivers/hv/mshv_root_main.c

621191d709b148 Nuno Das Neves                  2025-03-14  2246  
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2247  static int __init mshv_init_vmm_caps(struct device *dev)
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10  2248  {
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2249  	int ret;
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2250  
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2251  	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10  2252  						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10  2253  						0, &mshv_root.vmm_caps,
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2254  						sizeof(mshv_root.vmm_caps));
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2255  	if (ret && hv_l1vh_partition())
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2256  		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2257  		return ret;
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2258  	}
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10  2259  
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10 @2260  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2261  
21480aa03ff5bc Stanislav Kinsburskii           2026-01-29  2262  	return 0;
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10 @2263  }
fd612d97a458f0 Purna Pavan Chandra Aekkaladevi 2025-10-10  2264  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v2] mshv: Add support for integrated scheduler
From: kernel test robot @ 2026-01-30  5:51 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli
  Cc: oe-kbuild-all, linux-hyperv, linux-kernel
In-Reply-To: <176971725312.67225.3938191771112866951.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Hi Stanislav,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.19-rc7 next-20260129]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Kinsburskii/mshv-Add-support-for-integrated-scheduler/20260130-041014
base:   linus/master
patch link:    https://lore.kernel.org/r/176971725312.67225.3938191771112866951.stgit%40skinsburskii-cloud-desktop.internal.cloudapp.net
patch subject: [PATCH v2] mshv: Add support for integrated scheduler
config: x86_64-randconfig-014-20260130 (https://download.01.org/0day-ci/archive/20260130/202601301357.SWdA3gzf-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260130/202601301357.SWdA3gzf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601301357.SWdA3gzf-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/hv/mshv_root_main.c: In function 'mshv_init_vmm_caps':
>> drivers/hv/mshv_root_main.c:2255:9: warning: this 'if' clause does not guard... [-Wmisleading-indentation]
    2255 |         if (ret && hv_l1vh_partition())
         |         ^~
   drivers/hv/mshv_root_main.c:2257:17: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
    2257 |                 return ret;
         |                 ^~~~~~
   In file included from include/linux/printk.h:621,
                    from include/asm-generic/bug.h:31,
                    from arch/x86/include/asm/bug.h:193,
                    from arch/x86/include/asm/alternative.h:9,
                    from arch/x86/include/asm/segment.h:6,
                    from arch/x86/include/asm/ptrace.h:5,
                    from arch/x86/include/asm/math_emu.h:5,
                    from arch/x86/include/asm/processor.h:13,
                    from include/linux/sched.h:13,
                    from include/linux/resume_user_mode.h:6,
                    from include/linux/entry-virt.h:6,
                    from drivers/hv/mshv_root_main.c:11:
   drivers/hv/mshv_root_main.c: At top level:
   include/linux/dynamic_debug.h:228:58: error: expected identifier or '(' before 'do'
     228 | #define __dynamic_func_call_cls(id, cls, fmt, func, ...) do {   \
         |                                                          ^~
   include/linux/dynamic_debug.h:259:9: note: in expansion of macro '__dynamic_func_call_cls'
     259 |         __dynamic_func_call_cls(__UNIQUE_ID(ddebug), cls, fmt, func, ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dynamic_debug.h:261:9: note: in expansion of macro '_dynamic_func_call_cls'
     261 |         _dynamic_func_call_cls(_DPRINTK_CLASS_DFLT, fmt, func, ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/dynamic_debug.h:284:9: note: in expansion of macro '_dynamic_func_call'
     284 |         _dynamic_func_call(fmt, __dynamic_dev_dbg,              \
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:165:9: note: in expansion of macro 'dynamic_dev_dbg'
     165 |         dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:2260:9: note: in expansion of macro 'dev_dbg'
    2260 |         dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
         |         ^~~~~~~
   include/linux/dynamic_debug.h:234:3: error: expected identifier or '(' before 'while'
     234 | } while (0)
         |   ^~~~~
   include/linux/dynamic_debug.h:259:9: note: in expansion of macro '__dynamic_func_call_cls'
     259 |         __dynamic_func_call_cls(__UNIQUE_ID(ddebug), cls, fmt, func, ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dynamic_debug.h:261:9: note: in expansion of macro '_dynamic_func_call_cls'
     261 |         _dynamic_func_call_cls(_DPRINTK_CLASS_DFLT, fmt, func, ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/dynamic_debug.h:284:9: note: in expansion of macro '_dynamic_func_call'
     284 |         _dynamic_func_call(fmt, __dynamic_dev_dbg,              \
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:165:9: note: in expansion of macro 'dynamic_dev_dbg'
     165 |         dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
         |         ^~~~~~~~~~~~~~~
   drivers/hv/mshv_root_main.c:2260:9: note: in expansion of macro 'dev_dbg'
    2260 |         dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
         |         ^~~~~~~
   drivers/hv/mshv_root_main.c:2262:9: error: expected identifier or '(' before 'return'
    2262 |         return 0;
         |         ^~~~~~
   drivers/hv/mshv_root_main.c:2263:1: error: expected identifier or '(' before '}' token
    2263 | }
         | ^


vim +/if +2255 drivers/hv/mshv_root_main.c

  2246	
  2247	static int __init mshv_init_vmm_caps(struct device *dev)
  2248	{
  2249		int ret;
  2250	
  2251		ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
  2252							HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
  2253							0, &mshv_root.vmm_caps,
  2254							sizeof(mshv_root.vmm_caps));
> 2255		if (ret && hv_l1vh_partition())
  2256			dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
  2257			return ret;
  2258		}
  2259	
  2260		dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
  2261	
  2262		return 0;
  2263	}
  2264	
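
The listing above shows an 'if' at line 2255 with no opening brace, while
line 2258 closes a block. That stray '}' ends the function early, which is why
dev_dbg() and 'return 0' are then parsed at file scope. A minimal user-space
sketch of the presumably intended control flow follows. The stubs
(fake_hvcall_status, fake_is_l1vh, get_vmm_caps_stub()) are hypothetical
stand-ins for the hypercall and hv_l1vh_partition(); the "fail only on L1VH"
behavior is inferred from the condition in the listing, not confirmed by a
later patch revision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical knobs that replace the real hypervisor in this sketch. */
static int fake_hvcall_status;	/* value the stubbed hypercall returns */
static bool fake_is_l1vh;	/* stands in for hv_l1vh_partition() */

/* Stub for the property query: fills *caps only on success. */
static int get_vmm_caps_stub(unsigned long long *caps)
{
	if (fake_hvcall_status)
		return fake_hvcall_status;
	*caps = 0x1;
	return 0;
}

static int init_vmm_caps_sketch(unsigned long long *caps)
{
	int ret;

	ret = get_vmm_caps_stub(caps);
	if (ret && fake_is_l1vh) {	/* brace that is missing at line 2255 */
		fprintf(stderr, "Failed to get VMM capabilities: %d\n", ret);
		return ret;
	}

	/* Otherwise a failure is tolerated and caps stay as they were. */
	return 0;
}
```

With the brace in place, a failing query is fatal only when fake_is_l1vh is
set; in the other case the function returns 0 and leaves the capabilities
untouched.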

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-30  2:59 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXqW7v-lnAT_gr0s@skinsburskii.localdomain>

On 1/28/26 15:08, Stanislav Kinsburskii wrote:
> On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
>> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>>>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>>>
>>>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>>>> management is implemented.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        drivers/hv/Kconfig |    1 +
>>>>>>>>>>>        1 file changed, 1 insertion(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>>>        	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>>>        	# no particular order, making it impossible to reassemble larger pages
>>>>>>>>>>>        	depends on PAGE_SIZE_4KB
>>>>>>>>>>> +	depends on !KEXEC
>>>>>>>>>>>        	select EVENTFD
>>>>>>>>>>>        	select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>>>        	select HMM_MIRROR
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>>>> and it was fine?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>>>> will be affected as well.
>>>>>>>>
>>>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>>>> further development would be without a very important and major feature,
>>>>>>>> right?
>>>>>>>
>>>>>>> This is an option. But until it's implemented and merged, a user mshv
>>>>>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>>>>>> The system may crash at any time after kexec, depending on whether the
>>>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>>>> bad user experience.
>>>>>>
>>>>>> I understand that. But with this we cannot collect core and debug any
>>>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>>>> for update via notifier or some other quick hack. Did you already
>>>>>> explore that and didn't find anything, hence this?
>>>>>>
>>>>>
>>>>> This quick hack you mention isn't quick in the upstream kernel as there
>>>>> is no hook to interrupt kexec process except the live update one.
>>>>
>>>> That's the one we want to interrupt and block right? crash kexec
>>>> is ok and should be allowed. We can document we don't support kexec
>>>> for update for now.
>>>>
>>>>> I sent an RFC for that one but given today's conversation details it
>>>>> won't be accepted as is.
>>>>
>>>> Are you talking about this?
>>>>
>>>>           "mshv: Add kexec safety for deposited pages"
>>>>
>>>
>>> Yes.
>>>
>>>>> Making mshv mutually exclusive with kexec is the only viable option for
>>>>> now given time constraints.
>>>>> It is intended to be replaced with proper page lifecycle management in
>>>>> the future.
>>>>
>>>> Yeah, that could take a long time and imo we cannot just disable KEXEC
>>>> completely. What we want is just block kexec for updates from some
>>>> mshv file for now, we can print during boot that kexec for updates is
>>>> not supported on mshv. Hope that makes sense.
>>>>
>>>
>>> The trade-off here is between disabling kexec support and having the
>>> kernel crash after kexec in a non-obvious way. This affects both regular
>>> kexec and crash kexec.
>>
>> crash kexec on baremetal is not affected, hence disabling that
>> doesn't make sense as we can't debug crashes then on bm.
>>
> 
> Bare metal support is not currently relevant, as it is not available.
> This is the upstream kernel, and this driver will be accessible to
> third-party customers beginning with kernel 6.19 for running their
> kernels in Azure L1VH, so consistency is required.

Well, without crashdump support, customers will not be running anything
anywhere.

Thanks,
-Mukesh

> Thanks,
> Stanislav
> 
>> Let me think and explore a bit, and if I come up with something, I'll
>> send a patch here. If nothing, then we can do this as last resort.
>>
>> Thanks,
>> -Mukesh
>>
>>
>>> It's a pity we can't apply a quick hack to disable only regular kexec.
>>> However, since crash kexec would hit the same issues, until we have a
>>> proper state transition for deposited pages, the best workaround for now
>>> is to reset the hypervisor state on every kexec, which needs design,
>>> work, and testing.
>>>
>>> Disabling kexec is the only consistent way to handle this in the
>>> upstream kernel at the moment.
>>>
>>> Thanks, Stanislav
>>>
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>> Thanks,
>>>>>> -Mukesh
>>>>>>
>>>>>>> Therefore it should be explicitly forbidden as it's essentially not
>>>>>>> supported yet.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Stanislav
>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Stanislav
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Mukesh


^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-30  2:52 UTC (permalink / raw)
  To: Michael Kelley, Stanislav Kinsburskii
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157EDC69791EF24D5DA8661D491A@SN6PR02MB4157.namprd02.prod.outlook.com>

On 1/28/26 07:53, Michael Kelley wrote:
> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, January 27, 2026 11:56 AM
>> To: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>> Cc: kys@microsoft.com; haiyangz@microsoft.com; wei.liu@kernel.org;
>> decui@microsoft.com; longli@microsoft.com; linux-hyperv@vger.kernel.org;
>> linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
>>
>> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>>>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>>>
>>>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>>>> management is implemented.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        drivers/hv/Kconfig |    1 +
>>>>>>>>>>>        1 file changed, 1 insertion(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>>>        	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>>>        	# no particular order, making it impossible to reassemble larger pages
>>>>>>>>>>>        	depends on PAGE_SIZE_4KB
>>>>>>>>>>> +	depends on !KEXEC
>>>>>>>>>>>        	select EVENTFD
>>>>>>>>>>>        	select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>>>        	select HMM_MIRROR
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>>>> and it was fine?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>>>> will be affected as well.
>>>>>>>>
>>>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>>>> further development would be without a very important and major feature,
>>>>>>>> right?
>>>>>>>
>>>>>>> This is an option. But until it's implemented and merged, a user mshv
>>>>>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>>>>>> The system may crash at any time after kexec, depending on whether the
>>>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>>>> bad user experience.
>>>>>>
>>>>>> I understand that. But with this we cannot collect core and debug any
>>>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>>>> for update via notifier or some other quick hack. Did you already
>>>>>> explore that and didn't find anything, hence this?
>>>>>>
>>>>>
>>>>> This quick hack you mention isn't quick in the upstream kernel as there
>>>>> is no hook to interrupt kexec process except the live update one.
>>>>
>>>> That's the one we want to interrupt and block right? crash kexec
>>>> is ok and should be allowed. We can document we don't support kexec
>>>> for update for now.
>>>>
>>>>> I sent an RFC for that one, but given the details of today's conversation it
>>>>> won't be accepted as is.
>>>>
>>>> Are you talking about this?
>>>>
>>>>           "mshv: Add kexec safety for deposited pages"
>>>>
>>>
>>> Yes.
>>>
>>>>> Making mshv mutually exclusive with kexec is the only viable option for
>>>>> now given time constraints.
>>>>> It is intended to be replaced with proper page lifecycle management in
>>>>> the future.
>>>>
>>>> Yeah, that could take a long time, and imo we cannot just disable KEXEC
>>>> completely. What we want is to just block kexec for updates from some
>>>> mshv file for now; we can print during boot that kexec for updates is
>>>> not supported on mshv. Hope that makes sense.
>>>>
>>>
>>> The trade-off here is between disabling kexec support and having the
>>> kernel crash after kexec in a non-obvious way. This affects both regular
>>> kexec and crash kexec.
>>
>> crash kexec on baremetal is not affected, hence disabling that
>> doesn't make sense as we can't debug crashes then on bm.
>>
>> Let me think and explore a bit, and if I come up with something, I'll
>> send a patch here. If nothing, then we can do this as last resort.
>>
>> Thanks,
>> -Mukesh
> 
> Maybe you've already looked at this, but there's a sysctl parameter
> kernel.kexec_load_limit_reboot that prevents loading a kexec
> kernel for reboot if the value is zero. Separately, there is
> kernel.kexec_load_limit_panic that controls whether a kexec
> kernel can be loaded for kdump purposes.
> 
> kernel.kexec_load_limit_reboot defaults to -1, which allows a kexec
> kernel to be loaded for reboot an unlimited number of times. But the value
> can be set to zero with this kernel boot line parameter:
> 
> sysctl.kernel.kexec_load_limit_reboot=0
> 
> Alternatively, the mshv driver initialization could add code along
> the lines of process_sysctl_arg() to open
> /proc/sys/kernel/kexec_load_limit_reboot and write a value of zero.
> Then there's no dependency on setting the kernel boot line.
> 
> The downside to either method is that after Linux in the root partition
> is up-and-running, it is possible to change the sysctl to a non-zero value,
> and then load a kexec kernel for reboot. So this approach isn't absolute
> protection against doing a kexec for reboot. But it makes it harder, and
> until there's a mechanism to reclaim the deposited pages, it might be
> a viable compromise to allow kdump to still be used.
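
For illustration, the runtime variant of the sysctl approach described above
could be exercised from userspace roughly as follows. This is only a sketch
under stated assumptions: write_sysctl_value() is a hypothetical helper name,
and it is a userspace stand-in for the in-kernel write suggested for mshv
driver initialization, not that implementation.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any kernel or libc API): write a string
 * value to a sysctl file such as /proc/sys/kernel/kexec_load_limit_reboot.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_sysctl_value(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;

	if (fputs(value, f) == EOF)
		err = errno ? -errno : -EIO;

	/* Buffered data reaches the file at close time, so check fclose() too. */
	if (fclose(f) == EOF && !err)
		err = errno ? -errno : -EIO;

	return err;
}
```

A caller running as root would invoke
write_sysctl_value("/proc/sys/kernel/kexec_load_limit_reboot", "0") early in
boot and treat a negative return value as "kexec for reboot is not blocked".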

Mmm... well... I think I see a much easier way to do this by
just hijacking __kexec_lock. I will resume my normal work tomorrow/Fri,
so let me test it out. If it works, I will send a patch Monday.

Thanks,
-Mukesh



> Just a thought ....
> 
> Michael
> 
>>
>>
>>> It's a pity we can't apply a quick hack to disable only regular kexec.
>>> However, since crash kexec would hit the same issues, until we have a
>>> proper state transition for deposited pages, the best workaround for now
>>> is to reset the hypervisor state on every kexec, which needs design,
>>> work, and testing.
>>>
>>> Disabling kexec is the only consistent way to handle this in the
>>> upstream kernel at the moment.
>>>
>>> Thanks, Stanislav

