Linux-HyperV List


* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-02 17:10 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aX0Vbfocwa4WgXUw@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > hypervisor deposited pages.
> > > > > > > > 
> > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > management is implemented.
> > > > > > > 
> > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > and would work without any issue for L1VH.
> > > > > > > 
> > > > > > 
> > > > > > No, it won't work, and hypervisor-deposited pages won't be withdrawn.
> > > > > 
> > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > right? What other deposited pages would be left?
> > > > > 
> > > > 
> > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > upon guest shutdown) and the other for the host itself (never
> > > > withdrawn).
> > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > host partition.
> > > 
> > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > Also, can't we forcefully kill all running partitions in module_exit and
> > > then reclaim memory? Would this help with kernel consistency
> > > irrespective of userspace behavior?
> > > 
> > 
> > It would, but this is sloppy and cannot be a long-term solution.
> > 
> > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > to kill the guest or reclaim the memory for any reason, the new kernel
> > may still crash.
> 
> Actually guests won't be running by the time we reach our module_exit
> function during a kexec. Userspace processes would've been killed by
> then.
> 

No, they will not: "kexec -e" doesn't kill user processes.
We must not rely on OS to do graceful shutdown before doing
kexec.

> Also, why is this sloppy? Isn't this what module_exit should be
> doing anyway? If someone unloads our module we should be trying to
> clean everything up (including killing guests) and reclaim memory.
> 

Kexec does not unload modules, but it wouldn't really matter even if it
did.
There are other means to plug into the reboot flow, but none of them
is robust or reliable.

> In any case, we can BUG() out if we fail to reclaim the memory. That would
> stop the kexec.
> 

By killing the whole system? This is not a good user experience and I
don't see how this can be justified.

> This is a better solution since instead of disabling KEXEC outright: our
> driver made the best possible efforts to make kexec work.
> 

How is an unreliable feature that can lead to system crashes better
than disabling kexec outright?

It's the complete opposite for me: the latter provides limited but
robust functionality, while the former provides unreliable and
unpredictable behavior.

> > 
> > There are two long-term solutions:
> >  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> 
> I honestly think we should focus efforts on making kexec work rather
> than finding ways to prevent it.
> 

There is no argument about it. But until we have it fixed properly, we
have two options: either disable kexec or stop claiming we have our
driver up and ready for external customers. Given the importance of
this driver for current projects, I believe the better way would be to
explicitly limit the functionality instead of postponing the
productization of the driver.

In other words, this is not about our feelings about kexec support: it's
about what we can reliably provide to our customers today.

Thanks,
Stanislav

> Thanks,
> Anirudh
> 
> >  2. Hand the shared kernel state over to the new kernel.
> > 
> > I sent a series for the first one. The second one is not ready yet.
> > Anything else is neither robust nor reliable, so I don’t think it makes
> > sense to pursue it.
> > 
> > Thanks,
> > Stanislav
> > 
> > 
> > > Thanks,
> > > Anirudh.
> > > 
> > > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > Thanks,
> > > > > Anirudh.
> > > > > 
> > > > > > Also, kernel consistency must not depend on user space behavior.
> > > > > > 
> > > > > > > Also, I don't think it is reasonable at all that someone needs to
> > > > > > > disable basic kernel functionality such as kexec in order to use our
> > > > > > > driver.
> > > > > > > 
> > > > > > 
> > > > > > It's a temporary measure until proper page lifecycle management is
> > > > > > supported in the driver.
> > > > > > Mutual exclusion of the driver and kexec is a given and thus should be
> > > > > > explicitly stated in the Kconfig.
> > > > > > 
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Anirudh.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > ---
> > > > > > > >  drivers/hv/Kconfig |    1 +
> > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > > > > >  	depends on PAGE_SIZE_4KB
> > > > > > > > +	depends on !KEXEC
> > > > > > > >  	select EVENTFD
> > > > > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > > > > >  	select HMM_MIRROR
> > > > > > > > 
> > > > > > > > 

^ permalink raw reply

* Re: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Stanislav Kinsburskii @ 2026-02-02 17:17 UTC (permalink / raw)
  To: mhkelley58
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260202165101.1750-1-mhklinux@outlook.com>

On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> Huge page mappings in the guest physical address space depend on having
> matching alignment of the userspace address in the parent partition and
> of the guest physical address. Add a comment that captures this
> information. See the link to the mailing list thread.
> 
> No code or functional change.
> 
> Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  drivers/hv/mshv_root_main.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 681b58154d5e..bc738ff4508e 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
>  	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
>  		return mshv_unmap_user_memory(partition, mem);
>  
> +	/*
> +	 * If the userspace_addr and the guest physical address (as derived
> +	 * from the guest_pfn) have the same alignment modulo PMD huge page
> +	 * size, the MSHV driver can map any PMD huge pages to the guest
> +	 * physical address space as PMD huge pages. If the alignments do
> +	 * not match, PMD huge pages must be mapped as single pages in the
> +	 * guest physical address space. The MSHV driver does not enforce
> +	 * that the alignments match, and it invokes the hypervisor to set
> +	 * up correct functional mappings either way. See mshv_chunk_stride().
> +	 * The caller of the ioctl is responsible for providing userspace_addr
> +	 * and guest_pfn values with matching alignments if it wants the guest
> +	 * to get the performance benefits of PMD huge page mappings of its
> +	 * physical address space to real system memory.
> +	 */

Thanks. However, I'd suggest shortening this comment a lot and putting the
details into the commit message instead. Also, why this place? Why not make
it part of the function description instead, for example?

Thanks,
Stanislav

>  	return mshv_map_user_memory(partition, mem);
>  }
>  
> -- 
> 2.25.1
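
The alignment condition described in the proposed comment can be sketched in plain C. This is a minimal userspace illustration, assuming 4 KiB base pages and a 2 MiB PMD huge page size (the x86-64 values); pmd_alignment_matches() is a hypothetical helper, not a function in the MSHV driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12				/* assumed 4 KiB base pages */
#define DEMO_PMD_SIZE	(UINT64_C(1) << 21)		/* assumed 2 MiB PMD huge pages */

/*
 * True when userspace_addr and the guest physical address derived from
 * guest_pfn have the same offset within a PMD-sized region, i.e. the
 * same alignment modulo the PMD huge page size.
 */
bool pmd_alignment_matches(uint64_t userspace_addr, uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << DEMO_PAGE_SHIFT;

	return ((userspace_addr ^ gpa) & (DEMO_PMD_SIZE - 1)) == 0;
}
```

For example, userspace_addr 0x7f0000200000 with guest_pfn 0x200 matches (both are 2 MiB aligned), while userspace_addr 0x7f0000201000 with the same guest_pfn does not, so in that case huge pages would have to be mapped as single pages.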

^ permalink raw reply

* Re: [PATCH 2/2] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-02-02 17:19 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aX0TCpXFxI8zVlQ1@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 08:22:34PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 10:51:10AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 06:43:09PM +0000, Anirudh Rayabharam wrote:
> > > On Fri, Jan 30, 2026 at 10:37:38AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 05:30:25PM +0000, Anirudh Rayabharam wrote:
> > > > > On Thu, Jan 29, 2026 at 11:09:46AM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Thu, Jan 29, 2026 at 05:47:02PM +0000, Michael Kelley wrote:
> > > > > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Wednesday, January 21, 2026 2:36 PM
> > > > > > > > 
> > > > > > > > From: Andreea Pintilie <anpintil@microsoft.com>
> > > > > > > > 
> > > > > > > > Query the hypervisor for integrated scheduler support and use it if
> > > > > > > > configured.
> > > > > > > > 
> > > > > > > > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > > > > > > > root scheduler allows the root partition to schedule guest vCPUs across
> > > > > > > > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > > > > > > > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > > > > > > > scheduling entirely to the hypervisor.
> > > > > > > > 
> > > > > > > > Direct virtualization introduces a new privileged guest partition type - L1
> > > > > > > > Virtual Host (L1VH) — which can create child partitions from its own
> > > > > > > > resources. These child partitions are effectively siblings, scheduled by
> > > > > > > > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > > > > > > > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > > > > > > > CFS, and cpuset controllers can still be used, their effectiveness is
> > > > > > > > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > > > > > > > (typically round-robin across all allocated physical CPUs). As a result,
> > > > > > > > the system may appear to "steal" time from the L1VH and its children.
> > > > > > > > 
> > > > > > > > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > > > > > > > This allows an L1VH partition to schedule its own vCPUs and those of its
> > > > > > > > guests across its "physical" cores, effectively emulating root scheduler
> > > > > > > > behavior within the L1VH, while retaining core scheduler behavior for the
> > > > > > > > rest of the system.
> > > > > > > > 
> > > > > > > > The integrated scheduler is controlled by the root partition and gated by
> > > > > > > > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > > > > > > > supports the integrated scheduler. The L1VH partition must then check if it
> > > > > > > > is enabled by querying the corresponding extended partition property. If
> > > > > > > > this property is true, the L1VH partition must use the root scheduler
> > > > > > > > logic; otherwise, it must use the core scheduler.
> > > > > > > > 
> > > > > > > > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > ---
> > > > > > > >  drivers/hv/mshv_root_main.c |   79 +++++++++++++++++++++++++++++--------------
> > > > > > > >  include/hyperv/hvhdk_mini.h |    6 +++
> > > > > > > >  2 files changed, 58 insertions(+), 27 deletions(-)
> > > > > > > > 
> > > > 
> > > >  <snip>
> > > > 
> > > > > > > > -root_sched_deinit:
> > > > > > > > -	root_scheduler_deinit();
> > > > > > > > -	return err;
> > > > > > > >  }
> > > > > > > > 
> > > > > > > > -static void mshv_init_vmm_caps(struct device *dev)
> > > > > > > > +static int mshv_init_vmm_caps(struct device *dev)
> > > > > > > >  {
> > > > > > > > -	/*
> > > > > > > > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > > > > > > > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > > > > > > > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > > > > > > > -	 */
> > > > > > > > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > -					      0, &mshv_root.vmm_caps,
> > > > > > > > -					      sizeof(mshv_root.vmm_caps)))
> > > > > > > > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > > > > > > > +	int ret;
> > > > > > > > +
> > > > > > > > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > > > > > > > +					 	HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > > > > > > > +						0, &mshv_root.vmm_caps,
> > > > > > > > +						sizeof(mshv_root.vmm_caps));
> > > > > > > > +	if (ret) {
> > > > > > > > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > 
> > > > > > > This is a functional change that isn't mentioned in the commit message.
> > > > > > > Why is it now appropriate to fail instead of treating the VMM capabilities
> > > > > > > as all disabled? Presumably there are older versions of the hypervisor that
> > > > > > > don't support the requirements described in the original comment, but
> > > > > > > perhaps they are no longer relevant?
> > > > > > > 
> > > > > > 
> > > > > > To fail is now the only option for the L1VH partition. It must discover
> > > > > > the scheduler type. Without this information, the partition cannot
> > > > > > operate. The core scheduler logic will not work with an integrated
> > > > > > scheduler, and vice versa.
> > > > > 
> > > > > I don't think we need to fail here. If we don't find vmm caps, that
> > > > > means we are on an older hypervisor that supports l1vh but not
> > > > > integrated scheduler (yes, such a version exists). In this case since
> > > > > integrated scheduler is not supported by the hypervisor, the core
> > > > > scheduler logic will work.
> > > > > 
> > > > 
> > > > The older hypervisor version won't have the integrated scheduler
> > > > capability bit.
> > > > And we can't operate in core scheduler mode if the integrated scheduler
> > > > is enabled underneath us.
> > > 
> > > The older hypervisor won't have the integrated scheduler capability bit.
> > > This means that the older hypervisor doesn't support integrated
> > > scheduler (this is how vmm caps work: if the bit doesn't exist or
> > > vmm caps themselves don't exist the feature should be assumed as not
> > > available). If the hypervisor doesn't support integrated scheduler in the
> > > first place, it can't be enabled underneath us. So, it is safe to
> > > operate in core scheduler mode.
> > > 
> > 
> > We can’t tell whether the hypervisor is older and simply doesn’t have
> > the VMM caps bit, or whether we just failed to fetch the VMM caps.
> 
> If we failed to fetch the VMM caps i.e. the hypervisor doesn't support
> the vmm caps property, we must assume that all the bits in vmm caps are
> 0 (i.e. no features are available). This is how vmm capabilities are
> supposed to be interpreted. This is something I checked with the
> hypervisor team some time back.
> 
> > 
> > In other words, we can’t distinguish between “an older hypervisor
> > without integrated scheduler support” and “a newer hypervisor with an
> > integrated scheduler, but we failed to fetch the VMM caps”.
> > 
> > But for completeness: are you saying there is an older hypervisor
> > version that supports L1VH, but does not support VMM caps?
> 
> I don't know how much of the Azure fleet still runs it but yes such a
> hypervisor version exists.
> 

We don't need to support interim hypervisor versions in the upstream
kernel: these versions will go away, and then this logic will become not
only a dead code path but also incorrect.

We can keep the existing logic that treats failure to fetch VMM caps as
normal internally for as long as it is required.

Thanks,
Stanislav

> Thanks,
> Anirudh
> 
> > 
> > Thanks, Stanislav
> > 
> > > Thanks,
> > > Anirudh.

^ permalink raw reply

* [PATCH v2 0/4] Improve Hyper-V memory deposit error handling
From: Stanislav Kinsburskii @ 2026-02-02 17:58 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

This series extends the MSHV driver to properly handle additional
memory-related error codes from the Microsoft Hypervisor by depositing
memory pages when needed.

Currently, when the hypervisor returns HV_STATUS_INSUFFICIENT_MEMORY
during partition creation, the driver calls hv_call_deposit_pages() to
provide the necessary memory. However, there are other memory-related
error codes that indicate the hypervisor needs additional memory
resources, but the driver does not attempt to deposit pages for these
cases.

This series introduces a dedicated helper function to identify all
memory-related error codes (HV_STATUS_INSUFFICIENT_MEMORY,
HV_STATUS_INSUFFICIENT_BUFFERS, HV_STATUS_INSUFFICIENT_DEVICE_DOMAINS, and
HV_STATUS_INSUFFICIENT_ROOT_MEMORY) and ensures the driver attempts to
deposit pages for all of them via a new hv_deposit_memory() helper.

With these changes, partition creation becomes more robust by handling
all scenarios where the hypervisor requires additional memory deposits.
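
The deposit-and-retry pattern that the series generalizes can be sketched in userspace C. This is a toy model, not the driver code: every demo_* name is hypothetical, the status values are placeholders rather than the real HV_STATUS_* numbers, and the fake hypercall only ever reports the plain insufficient-memory status.

```c
#include <stdbool.h>

/* Placeholder status codes mirroring the names listed in the cover letter. */
enum demo_status {
	DEMO_STATUS_SUCCESS,
	DEMO_STATUS_INSUFFICIENT_MEMORY,
	DEMO_STATUS_INSUFFICIENT_BUFFERS,
	DEMO_STATUS_INSUFFICIENT_DEVICE_DOMAINS,
	DEMO_STATUS_INSUFFICIENT_ROOT_MEMORY,
	DEMO_STATUS_INVALID_PARAMETER,
};

/* One place that decides whether a status asks for more deposited memory. */
bool demo_result_needs_memory(enum demo_status status)
{
	switch (status) {
	case DEMO_STATUS_INSUFFICIENT_MEMORY:
	case DEMO_STATUS_INSUFFICIENT_BUFFERS:
	case DEMO_STATUS_INSUFFICIENT_DEVICE_DOMAINS:
	case DEMO_STATUS_INSUFFICIENT_ROOT_MEMORY:
		return true;
	default:
		return false;
	}
}

/* Toy hypervisor: the call succeeds once enough pages have been deposited. */
struct demo_hv {
	unsigned int pages_deposited;
	unsigned int pages_required;
};

enum demo_status demo_hypercall(const struct demo_hv *hv)
{
	return hv->pages_deposited >= hv->pages_required ?
		DEMO_STATUS_SUCCESS : DEMO_STATUS_INSUFFICIENT_MEMORY;
}

/* Issue the call, depositing one page and retrying while memory is requested. */
int demo_call_with_deposit(struct demo_hv *hv)
{
	enum demo_status status;

	for (;;) {
		status = demo_hypercall(hv);
		if (!demo_result_needs_memory(status))
			return status == DEMO_STATUS_SUCCESS ? 0 : -1;
		hv->pages_deposited++;	/* stands in for depositing a page */
	}
}
```

Keeping the classification in a single predicate means a caller's retry loop does not change when further memory-related statuses are added to the switch.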

v2:
- Rename hv_result_oom() into hv_result_needs_memory()

---

Stanislav Kinsburskii (4):
      mshv: Introduce hv_result_needs_memory() helper function
      mshv: Introduce hv_deposit_memory helper functions
      mshv: Handle insufficient contiguous memory hypervisor status
      mshv: Handle insufficient root memory hypervisor statuses

 drivers/hv/hv_common.c         |    3 ++
 drivers/hv/hv_proc.c           |   54 +++++++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root_hv_call.c |   45 +++++++++++++-------------------
 drivers/hv/mshv_root_main.c    |    5 +---
 include/asm-generic/mshyperv.h |   13 +++++++++
 include/hyperv/hvgdk_mini.h    |   57 +++++++++++++++++++++-------------------
 include/hyperv/hvhdk_mini.h    |    2 +
 7 files changed, 119 insertions(+), 60 deletions(-)

^ permalink raw reply

* [PATCH v2 1/4] mshv: Introduce hv_result_needs_memory() helper function
From: Stanislav Kinsburskii @ 2026-02-02 17:58 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177005499596.120041.5908089206606113719.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Replace direct comparisons of hv_result(status) against
HV_STATUS_INSUFFICIENT_MEMORY with a new hv_result_needs_memory() helper
function.
This improves code readability and provides a consistent and extendable
interface for checking out-of-memory conditions in hypercall results.

No functional changes intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/hv_proc.c           |   14 ++++++++++++--
 drivers/hv/mshv_root_hv_call.c |   20 ++++++++++----------
 drivers/hv/mshv_root_main.c    |    2 +-
 include/asm-generic/mshyperv.h |    3 +++
 4 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index fbb4eb3901bb..e53204b9e05d 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -110,6 +110,16 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 }
 EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
 
+bool hv_result_needs_memory(u64 status)
+{
+	switch (hv_result(status)) {
+	case HV_STATUS_INSUFFICIENT_MEMORY:
+		return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(hv_result_needs_memory);
+
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 {
 	struct hv_input_add_logical_processor *input;
@@ -137,7 +147,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 					 input, output);
 		local_irq_restore(flags);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			if (!hv_result_success(status)) {
 				hv_status_err(status, "cpu %u apic ID: %u\n",
 					      lp_index, apic_id);
@@ -179,7 +189,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 		status = hv_do_hypercall(HVCALL_CREATE_VP, input, NULL);
 		local_irq_restore(irq_flags);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			if (!hv_result_success(status)) {
 				hv_status_err(status, "vcpu: %u, lp: %u\n",
 					      vp_index, flags);
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 598eaff4ff29..89afeeda21dd 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -115,7 +115,7 @@ int hv_call_create_partition(u64 flags,
 		status = hv_do_hypercall(HVCALL_CREATE_PARTITION,
 					 input, output);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			if (hv_result_success(status))
 				*partition_id = output->partition_id;
 			local_irq_restore(irq_flags);
@@ -147,7 +147,7 @@ int hv_call_initialize_partition(u64 partition_id)
 		status = hv_do_fast_hypercall8(HVCALL_INITIALIZE_PARTITION,
 					       *(u64 *)&input);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			ret = hv_result_to_errno(status);
 			break;
 		}
@@ -239,7 +239,7 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
 
 		completed = hv_repcomp(status);
 
-		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (hv_result_needs_memory(status)) {
 			ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
 						    HV_MAP_GPA_DEPOSIT_PAGES);
 			if (ret)
@@ -455,7 +455,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
 
 		status = hv_do_hypercall(control, input, output);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			if (hv_result_success(status) && ret_output)
 				memcpy(ret_output, output, sizeof(*output));
 
@@ -518,7 +518,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
 
 		status = hv_do_hypercall(control, input, NULL);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			local_irq_restore(flags);
 			ret = hv_result_to_errno(status);
 			break;
@@ -563,7 +563,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
 		status = hv_do_hypercall(HVCALL_MAP_VP_STATE_PAGE, input,
 					 output);
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			if (hv_result_success(status))
 				*state_page = pfn_to_page(output->map_location);
 			local_irq_restore(flags);
@@ -718,7 +718,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
 		if (hv_result_success(status))
 			break;
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			ret = hv_result_to_errno(status);
 			break;
 		}
@@ -772,7 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
 		if (hv_result_success(status))
 			break;
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			ret = hv_result_to_errno(status);
 			break;
 		}
@@ -843,7 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 		if (!ret)
 			break;
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			hv_status_debug(status, "\n");
 			break;
 		}
@@ -878,7 +878,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 		pfn = output->map_location;
 
 		local_irq_restore(flags);
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
+		if (!hv_result_needs_memory(status)) {
 			ret = hv_result_to_errno(status);
 			if (hv_result_success(status))
 				break;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 6a6bf641b352..ee30bfa6bb2e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -261,7 +261,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
 		if (hv_result_success(status))
 			break;
 
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
+		if (!hv_result_needs_memory(status))
 			ret = hv_result_to_errno(status);
 		else
 			ret = hv_call_deposit_pages(NUMA_NO_NODE,
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index ecedab554c80..452426d5b2ab 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -342,6 +342,8 @@ static inline bool hv_parent_partition(void)
 {
 	return hv_root_partition() || hv_l1vh_partition();
 }
+
+bool hv_result_needs_memory(u64 status);
 int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
@@ -350,6 +352,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
 static inline bool hv_root_partition(void) { return false; }
 static inline bool hv_l1vh_partition(void) { return false; }
 static inline bool hv_parent_partition(void) { return false; }
+static inline bool hv_result_needs_memory(u64 status) { return false; }
 static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 {
 	return -EOPNOTSUPP;



^ permalink raw reply related

* [PATCH v2 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Stanislav Kinsburskii @ 2026-02-02 17:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177005499596.120041.5908089206606113719.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
functions to handle memory deposition with proper error handling.

The new hv_deposit_memory_node() function takes the hypervisor status
as a parameter and validates it before depositing pages. It checks for
HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
unexpected status codes.

This is a precursor patch to new out-of-memory error codes support.
No functional changes intended.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/hv_proc.c           |   22 ++++++++++++++++++++--
 drivers/hv/mshv_root_hv_call.c |   25 +++++++++----------------
 drivers/hv/mshv_root_main.c    |    3 +--
 include/asm-generic/mshyperv.h |   10 ++++++++++
 4 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index e53204b9e05d..ffa25cd6e4e9 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 }
 EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
 
+int hv_deposit_memory_node(int node, u64 partition_id,
+			   u64 hv_status)
+{
+	u32 num_pages;
+
+	switch (hv_result(hv_status)) {
+	case HV_STATUS_INSUFFICIENT_MEMORY:
+		num_pages = 1;
+		break;
+	default:
+		hv_status_err(hv_status, "Unexpected!\n");
+		return -ENOMEM;
+	}
+	return hv_call_deposit_pages(node, partition_id, num_pages);
+}
+EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
+
 bool hv_result_needs_memory(u64 status)
 {
 	switch (hv_result(status)) {
@@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 			}
 			break;
 		}
-		ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
+		ret = hv_deposit_memory_node(node, hv_current_partition_id,
+					     status);
 	} while (!ret);
 
 	return ret;
@@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 			}
 			break;
 		}
-		ret = hv_call_deposit_pages(node, partition_id, 1);
+		ret = hv_deposit_memory_node(node, partition_id, status);
 
 	} while (!ret);
 
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 89afeeda21dd..174431cb5e0e 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
 			break;
 		}
 		local_irq_restore(irq_flags);
-		ret = hv_call_deposit_pages(NUMA_NO_NODE,
-					    hv_current_partition_id, 1);
+		ret = hv_deposit_memory(hv_current_partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
 			ret = hv_result_to_errno(status);
 			break;
 		}
-		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
+		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
 		}
 		local_irq_restore(flags);
 
-		ret = hv_call_deposit_pages(NUMA_NO_NODE,
-					    partition_id, 1);
+		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
 		}
 		local_irq_restore(flags);
 
-		ret = hv_call_deposit_pages(NUMA_NO_NODE,
-					    partition_id, 1);
+		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
 
 		local_irq_restore(flags);
 
-		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
+		ret = hv_deposit_memory(partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
 			ret = hv_result_to_errno(status);
 			break;
 		}
-		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
-
+		ret = hv_deposit_memory(port_partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
 			ret = hv_result_to_errno(status);
 			break;
 		}
-		ret = hv_call_deposit_pages(NUMA_NO_NODE,
-					    connection_partition_id, 1);
+		ret = hv_deposit_memory(connection_partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 			break;
 		}
 
-		ret = hv_call_deposit_pages(NUMA_NO_NODE,
-					    hv_current_partition_id, 1);
+		ret = hv_deposit_memory(hv_current_partition_id, status);
 	} while (!ret);
 
 	return ret;
@@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 			return ret;
 		}
 
-		ret = hv_call_deposit_pages(NUMA_NO_NODE,
-					    hv_current_partition_id, 1);
+		ret = hv_deposit_memory(hv_current_partition_id, status);
 		if (ret)
 			return ret;
 	} while (!ret);
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index ee30bfa6bb2e..dce255c94f9e 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
 		if (!hv_result_needs_memory(status))
 			ret = hv_result_to_errno(status);
 		else
-			ret = hv_call_deposit_pages(NUMA_NO_NODE,
-						    pt_id, 1);
+			ret = hv_deposit_memory(pt_id, status);
 	} while (!ret);
 
 	args.status = hv_result(status);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 452426d5b2ab..d37b68238c97 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
 }
 
 bool hv_result_needs_memory(u64 status);
+int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
 int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
@@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
 static inline bool hv_l1vh_partition(void) { return false; }
 static inline bool hv_parent_partition(void) { return false; }
 static inline bool hv_result_needs_memory(u64 status) { return false; }
+static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
+{
+	return -EOPNOTSUPP;
+}
 static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 {
 	return -EOPNOTSUPP;
@@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
 }
 #endif /* CONFIG_MSHV_ROOT */
 
+static inline int hv_deposit_memory(u64 partition_id, u64 status)
+{
+	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
+}
+
 #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
 u8 __init get_vtl(void);
 #else




* [PATCH v2 3/4] mshv: Handle insufficient contiguous memory hypervisor status
From: Stanislav Kinsburskii @ 2026-02-02 17:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177005499596.120041.5908089206606113719.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

The HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY status indicates that the
hypervisor lacks sufficient contiguous memory for its internal allocations.

When this status is encountered, allocate and deposit
HV_MAX_CONTIGUOUS_ALLOCATION_PAGES contiguous pages to the hypervisor.
HV_MAX_CONTIGUOUS_ALLOCATION_PAGES is defined in the hypervisor headers, so
a deposit of this size will always satisfy the hypervisor's requirements.
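
As a rough illustration of the rule above, the status-to-page-count
mapping can be sketched in plain userspace C. The helper names
(result_of, deposit_pages_for_status) are hypothetical and not part of
the patch; the constants are copied from the diff below, and the sketch
assumes the result code sits in the low 16 bits of the status word.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the diff in this patch. */
#define HV_STATUS_INSUFFICIENT_MEMORY            0xB
#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY 0x75
#define HV_MAX_CONTIGUOUS_ALLOCATION_PAGES       8

/* Assumption: the result code is the low 16 bits of the status word. */
static inline uint16_t result_of(uint64_t status)
{
	return (uint16_t)(status & 0xFFFF);
}

/*
 * Hypothetical helper: how many pages to deposit for a given status.
 * Returns 0 when the status is not a request for more memory.
 */
uint32_t deposit_pages_for_status(uint64_t status)
{
	uint32_t num_pages;

	switch (result_of(status)) {
	case HV_STATUS_INSUFFICIENT_MEMORY:
		num_pages = 1;
		break;
	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
		num_pages = HV_MAX_CONTIGUOUS_ALLOCATION_PAGES;
		break;
	default:
		num_pages = 0;
		break;
	}

	/* A single retry never deposits more than one maximal chunk. */
	assert(num_pages <= HV_MAX_CONTIGUOUS_ALLOCATION_PAGES);
	return num_pages;
}
```

A caller would retry the hypercall after each deposit, as the do/while
loops in mshv_root_hv_call.c already do.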

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/hv_common.c      |    1 +
 drivers/hv/hv_proc.c        |    4 ++++
 include/hyperv/hvgdk_mini.h |    1 +
 include/hyperv/hvhdk_mini.h |    2 ++
 4 files changed, 8 insertions(+)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 0a3ab7efed46..c7f63c9de503 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -791,6 +791,7 @@ static const struct hv_status_info hv_status_infos[] = {
 	_STATUS_INFO(HV_STATUS_UNKNOWN_PROPERTY,		-EIO),
 	_STATUS_INFO(HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE,	-EIO),
 	_STATUS_INFO(HV_STATUS_INSUFFICIENT_MEMORY,		-ENOMEM),
+	_STATUS_INFO(HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY,	-ENOMEM),
 	_STATUS_INFO(HV_STATUS_INVALID_PARTITION_ID,		-EINVAL),
 	_STATUS_INFO(HV_STATUS_INVALID_VP_INDEX,		-EINVAL),
 	_STATUS_INFO(HV_STATUS_NOT_FOUND,			-EIO),
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index ffa25cd6e4e9..dfa27be66ff7 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -119,6 +119,9 @@ int hv_deposit_memory_node(int node, u64 partition_id,
 	case HV_STATUS_INSUFFICIENT_MEMORY:
 		num_pages = 1;
 		break;
+	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
+		num_pages = HV_MAX_CONTIGUOUS_ALLOCATION_PAGES;
+		break;
 	default:
 		hv_status_err(hv_status, "Unexpected!\n");
 		return -ENOMEM;
@@ -131,6 +134,7 @@ bool hv_result_needs_memory(u64 status)
 {
 	switch (hv_result(status)) {
 	case HV_STATUS_INSUFFICIENT_MEMORY:
+	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
 		return true;
 	}
 	return false;
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 04b18d0e37af..70f22ef44948 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -38,6 +38,7 @@ struct hv_u128 {
 #define HV_STATUS_INVALID_LP_INDEX		    0x41
 #define HV_STATUS_INVALID_REGISTER_VALUE	    0x50
 #define HV_STATUS_OPERATION_FAILED		    0x71
+#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY    0x75
 #define HV_STATUS_TIME_OUT			    0x78
 #define HV_STATUS_CALL_PENDING			    0x79
 #define HV_STATUS_VTL_ALREADY_ENABLED		    0x86
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index c0300910808b..091c03e26046 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -7,6 +7,8 @@
 
 #include "hvgdk_mini.h"
 
+#define HV_MAX_CONTIGUOUS_ALLOCATION_PAGES	8
+
 /*
  * Doorbell connection_info flags.
  */




* [PATCH v2 4/4] mshv: Handle insufficient root memory hypervisor statuses
From: Stanislav Kinsburskii @ 2026-02-02 17:59 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177005499596.120041.5908089206606113719.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

When creating guest partition objects, the hypervisor may fail to
allocate root partition pages and return an insufficient memory status.
In this case, deposit memory using the root partition ID instead.

Note: This error should never occur in a guest or L1VH partition context.
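
The redirection described above can be sketched as a small standalone
C function. This is an illustration only: struct deposit_plan and
plan_deposit() are hypothetical names, the is_root flag stands in for
hv_root_partition(), the value used for HV_PARTITION_ID_SELF is a
placeholder, and a one-page default for the non-contiguous root status
is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status values copied from the diff in this patch. */
#define HV_STATUS_INSUFFICIENT_MEMORY                 0xB
#define HV_STATUS_INSUFFICIENT_ROOT_MEMORY            0x73
#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY      0x75
#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_ROOT_MEMORY 0x83
#define HV_MAX_CONTIGUOUS_ALLOCATION_PAGES            8
/* Placeholder value for this sketch only. */
#define HV_PARTITION_ID_SELF ((uint64_t)-1)

/* Hypothetical result type: where to deposit and how much. */
struct deposit_plan {
	uint64_t partition_id;
	uint32_t num_pages;
};

/* Returns 0 and fills *plan, or -1 for a status that must not be retried. */
int plan_deposit(uint16_t result, uint64_t partition_id, bool is_root,
		 struct deposit_plan *plan)
{
	uint32_t num_pages = 1;	/* assumption: single page by default */

	assert(plan);

	switch (result) {
	case HV_STATUS_INSUFFICIENT_MEMORY:
		break;
	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
		num_pages = HV_MAX_CONTIGUOUS_ALLOCATION_PAGES;
		break;
	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_ROOT_MEMORY:
		num_pages = HV_MAX_CONTIGUOUS_ALLOCATION_PAGES;
		/* fall through */
	case HV_STATUS_INSUFFICIENT_ROOT_MEMORY:
		/* Root statuses are only expected on the root partition. */
		if (!is_root)
			return -1;
		/* Deposit to the root itself, not to the guest. */
		partition_id = HV_PARTITION_ID_SELF;
		break;
	default:
		return -1;
	}

	plan->partition_id = partition_id;
	plan->num_pages = num_pages;
	return 0;
}
```

Keeping the target rewrite inside the helper means the existing retry
loops need no knowledge of which partition ran out of memory.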

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/hv_common.c      |    2 +
 drivers/hv/hv_proc.c        |   14 ++++++++++
 include/hyperv/hvgdk_mini.h |   58 ++++++++++++++++++++++---------------------
 3 files changed, 46 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index c7f63c9de503..cab0d1733607 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -792,6 +792,8 @@ static const struct hv_status_info hv_status_infos[] = {
 	_STATUS_INFO(HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE,	-EIO),
 	_STATUS_INFO(HV_STATUS_INSUFFICIENT_MEMORY,		-ENOMEM),
 	_STATUS_INFO(HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY,	-ENOMEM),
+	_STATUS_INFO(HV_STATUS_INSUFFICIENT_ROOT_MEMORY,	-ENOMEM),
+	_STATUS_INFO(HV_STATUS_INSUFFICIENT_CONTIGUOUS_ROOT_MEMORY,	-ENOMEM),
 	_STATUS_INFO(HV_STATUS_INVALID_PARTITION_ID,		-EINVAL),
 	_STATUS_INFO(HV_STATUS_INVALID_VP_INDEX,		-EINVAL),
 	_STATUS_INFO(HV_STATUS_NOT_FOUND,			-EIO),
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index dfa27be66ff7..935129e0b39d 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -122,6 +122,18 @@ int hv_deposit_memory_node(int node, u64 partition_id,
 	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
 		num_pages = HV_MAX_CONTIGUOUS_ALLOCATION_PAGES;
 		break;
+
+	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_ROOT_MEMORY:
+		num_pages = HV_MAX_CONTIGUOUS_ALLOCATION_PAGES;
+		fallthrough;
+	case HV_STATUS_INSUFFICIENT_ROOT_MEMORY:
+		if (!hv_root_partition()) {
+			hv_status_err(hv_status, "Unexpected root memory deposit\n");
+			return -ENOMEM;
+		}
+		partition_id = HV_PARTITION_ID_SELF;
+		break;
+
 	default:
 		hv_status_err(hv_status, "Unexpected!\n");
 		return -ENOMEM;
@@ -135,6 +147,8 @@ bool hv_result_needs_memory(u64 status)
 	switch (hv_result(status)) {
 	case HV_STATUS_INSUFFICIENT_MEMORY:
 	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY:
+	case HV_STATUS_INSUFFICIENT_ROOT_MEMORY:
+	case HV_STATUS_INSUFFICIENT_CONTIGUOUS_ROOT_MEMORY:
 		return true;
 	}
 	return false;
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 70f22ef44948..5b74a857ef43 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -14,34 +14,36 @@ struct hv_u128 {
 } __packed;
 
 /* NOTE: when adding below, update hv_result_to_string() */
-#define HV_STATUS_SUCCESS			    0x0
-#define HV_STATUS_INVALID_HYPERCALL_CODE	    0x2
-#define HV_STATUS_INVALID_HYPERCALL_INPUT	    0x3
-#define HV_STATUS_INVALID_ALIGNMENT		    0x4
-#define HV_STATUS_INVALID_PARAMETER		    0x5
-#define HV_STATUS_ACCESS_DENIED			    0x6
-#define HV_STATUS_INVALID_PARTITION_STATE	    0x7
-#define HV_STATUS_OPERATION_DENIED		    0x8
-#define HV_STATUS_UNKNOWN_PROPERTY		    0x9
-#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE	    0xA
-#define HV_STATUS_INSUFFICIENT_MEMORY		    0xB
-#define HV_STATUS_INVALID_PARTITION_ID		    0xD
-#define HV_STATUS_INVALID_VP_INDEX		    0xE
-#define HV_STATUS_NOT_FOUND			    0x10
-#define HV_STATUS_INVALID_PORT_ID		    0x11
-#define HV_STATUS_INVALID_CONNECTION_ID		    0x12
-#define HV_STATUS_INSUFFICIENT_BUFFERS		    0x13
-#define HV_STATUS_NOT_ACKNOWLEDGED		    0x14
-#define HV_STATUS_INVALID_VP_STATE		    0x15
-#define HV_STATUS_NO_RESOURCES			    0x1D
-#define HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED   0x20
-#define HV_STATUS_INVALID_LP_INDEX		    0x41
-#define HV_STATUS_INVALID_REGISTER_VALUE	    0x50
-#define HV_STATUS_OPERATION_FAILED		    0x71
-#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY    0x75
-#define HV_STATUS_TIME_OUT			    0x78
-#define HV_STATUS_CALL_PENDING			    0x79
-#define HV_STATUS_VTL_ALREADY_ENABLED		    0x86
+#define HV_STATUS_SUCCESS				0x0
+#define HV_STATUS_INVALID_HYPERCALL_CODE		0x2
+#define HV_STATUS_INVALID_HYPERCALL_INPUT		0x3
+#define HV_STATUS_INVALID_ALIGNMENT			0x4
+#define HV_STATUS_INVALID_PARAMETER			0x5
+#define HV_STATUS_ACCESS_DENIED				0x6
+#define HV_STATUS_INVALID_PARTITION_STATE		0x7
+#define HV_STATUS_OPERATION_DENIED			0x8
+#define HV_STATUS_UNKNOWN_PROPERTY			0x9
+#define HV_STATUS_PROPERTY_VALUE_OUT_OF_RANGE		0xA
+#define HV_STATUS_INSUFFICIENT_MEMORY			0xB
+#define HV_STATUS_INVALID_PARTITION_ID			0xD
+#define HV_STATUS_INVALID_VP_INDEX			0xE
+#define HV_STATUS_NOT_FOUND				0x10
+#define HV_STATUS_INVALID_PORT_ID			0x11
+#define HV_STATUS_INVALID_CONNECTION_ID			0x12
+#define HV_STATUS_INSUFFICIENT_BUFFERS			0x13
+#define HV_STATUS_NOT_ACKNOWLEDGED			0x14
+#define HV_STATUS_INVALID_VP_STATE			0x15
+#define HV_STATUS_NO_RESOURCES				0x1D
+#define HV_STATUS_PROCESSOR_FEATURE_NOT_SUPPORTED	0x20
+#define HV_STATUS_INVALID_LP_INDEX			0x41
+#define HV_STATUS_INVALID_REGISTER_VALUE		0x50
+#define HV_STATUS_OPERATION_FAILED			0x71
+#define HV_STATUS_INSUFFICIENT_ROOT_MEMORY		0x73
+#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_MEMORY	0x75
+#define HV_STATUS_TIME_OUT				0x78
+#define HV_STATUS_CALL_PENDING				0x79
+#define HV_STATUS_INSUFFICIENT_CONTIGUOUS_ROOT_MEMORY	0x83
+#define HV_STATUS_VTL_ALREADY_ENABLED			0x86
 
 /*
  * The Hyper-V TimeRefCount register and the TSC




* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-02 18:09 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXzmMInsNSvFvBF1@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > hypervisor deposited pages.
> > > > > > 
> > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > management is implemented.
> > > > > 
> > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > and would work without any issue for L1VH.
> > > > > 
> > > > 
> > > > No, it won't work, and hypervisor-deposited pages won't be withdrawn.
> > > 
> > > All pages that were deposited in the context of a guest partition (i.e.
> > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > right? What other deposited pages would be left?
> > > 
> > 
> > The driver deposits two types of pages: one for the guests (withdrawn
> > upon guest shutdown) and the other for the host itself (never
> > withdrawn).
> > See hv_call_create_partition, for example: it deposits pages for the
> > host partition.
> 
> Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> Also, can't we forcefully kill all running partitions in module_exit and
> then reclaim memory? Would this help with kernel consistency
> irrespective of userspace behavior?
> 

First, module_exit is not called during kexec. Second, forcefully
killing all partitions during a kexec reboot would be bulky,
error-prone, and slow. It also does not guarantee robust behavior. Too
many things can go wrong, and we could still end up in the same broken
state.

To reiterate: today, the only safe way to use kexec is to avoid any
shared state between the kernel and the hypervisor. In other words, that
state should never be created, or it must be destroyed before issuing
kexec.
Neither of these conditions is controlled by our driver, so the only safe
option for now is to disable kexec.

Thanks,
Stanislav


> Thanks,
> Anirudh.
> 
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh.
> > > 
> > > > Also, kernel consistency must not depend on userspace behavior.
> > > > 
> > > > > Also, I don't think it is reasonable at all that someone needs to
> > > > > disable basic kernel functionality such as kexec in order to use our
> > > > > driver.
> > > > > 
> > > > 
> > > > It's a temporary measure until proper page lifecycle management is
> > > > supported in the driver.
> > > > Mutual exclusion of the driver and kexec is a given and thus should be
> > > > explicitly stated in the Kconfig.
> > > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > Thanks,
> > > > > Anirudh.
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > ---
> > > > > >  drivers/hv/Kconfig |    1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > --- a/drivers/hv/Kconfig
> > > > > > +++ b/drivers/hv/Kconfig
> > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > >  	# no particular order, making it impossible to reassemble larger pages
> > > > > >  	depends on PAGE_SIZE_4KB
> > > > > > +	depends on !KEXEC
> > > > > >  	select EVENTFD
> > > > > >  	select VIRT_XFER_TO_GUEST_WORK
> > > > > >  	select HMM_MIRROR
> > > > > > 
> > > > > > 


* RE: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Michael Kelley @ 2026-02-02 18:26 UTC (permalink / raw)
  To: Stanislav Kinsburskii, mhkelley58@gmail.com
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <aYDcLRhxx9wXRXBG@skinsburskii.localdomain>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, February 2, 2026 9:18 AM
> 
> On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > Huge page mappings in the guest physical address space depend on having
> > matching alignment of the userspace address in the parent partition and
> > of the guest physical address. Add a comment that captures this
> > information. See the link to the mailing list thread.
> >
> > No code or functional change.
> >
> > Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
> > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> > ---
> >  drivers/hv/mshv_root_main.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 681b58154d5e..bc738ff4508e 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
> >  	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
> >  		return mshv_unmap_user_memory(partition, mem);
> >
> > +	/*
> > +	 * If the userspace_addr and the guest physical address (as derived
> > +	 * from the guest_pfn) have the same alignment modulo PMD huge page
> > +	 * size, the MSHV driver can map any PMD huge pages to the guest
> > +	 * physical address space as PMD huge pages. If the alignments do
> > +	 * not match, PMD huge pages must be mapped as single pages in the
> > +	 * guest physical address space. The MSHV driver does not enforce
> > +	 * that the alignments match, and it invokes the hypervisor to set
> > +	 * up correct functional mappings either way. See mshv_chunk_stride().
> > +	 * The caller of the ioctl is responsible for providing userspace_addr
> > +	 * and guest_pfn values with matching alignments if it wants the guest
> > +	 * to get the performance benefits of PMD huge page mappings of its
> > +	 * physical address space to real system memory.
> > +	 */
> 
> Thanks. However, I'd suggest reducing this comment a lot and putting the
> details into the commit message instead. Also, why this place? Why not
> make it part of the function description instead, for example?

In general, I'm very much an advocate of putting a bit more detail into code
comments, so that someone new reading the code has a chance of figuring
out what's going on without having to search through the commit history
and read commit messages. The commit history is certainly useful for the
historical record, and especially how things have changed over time. But for
"how non-obvious things work now", I like to see that in the code comments.

As for where to put the comment, I'm flexible. I thought about placing it
outside the function as a "header" (which is what I think you mean by the
"function description"), but the function handles both "map" and "unmap"
operations, and this comment applies only to "map".  Hence I put it after
the test for whether we're doing "map" vs. "unmap".  But I wouldn't object
to it being placed as a function description, though the text would need to be
enhanced to more broadly be a function description instead of just a comment
about a specific aspect of "map" behavior.
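
For readers following along, the alignment condition in the quoted
comment can be sketched as a standalone check. This is only an
illustration and not the driver's code: pmd_mapping_possible() and the
SKETCH_* constants are hypothetical, and the 2 MiB PMD size is an
assumption that holds for 4 KiB base pages on x86-64 and arm64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 4 KiB base pages; MSHV_ROOT depends on PAGE_SIZE_4KB. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
/* Assumption: 2 MiB PMD huge pages. */
#define SKETCH_PMD_SIZE   (UINT64_C(1) << 21)

/*
 * Returns true when userspace_addr and the guest physical address
 * derived from guest_pfn have the same offset within a PMD-sized
 * region, which is the precondition for mapping PMD huge pages as
 * huge pages in the guest physical address space.
 */
bool pmd_mapping_possible(uint64_t userspace_addr, uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << SKETCH_PAGE_SHIFT;

	/* The sketch expects a page-aligned userspace address. */
	assert((userspace_addr & (SKETCH_PAGE_SIZE - 1)) == 0);

	/* Equal low bits means equal alignment modulo the PMD size. */
	return ((userspace_addr ^ gpa) & (SKETCH_PMD_SIZE - 1)) == 0;
}
```

A VMM could run a check like this before issuing the ioctl and adjust
its mmap() placement if it returns false.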

Michael

> 
> Thanks,
> Stanislav
> 
> >  	return mshv_map_user_memory(partition, mem);
> >  }
> >
> > --
> > 2.25.1


* [PATCH v2 0/2] ARM64 support for doorbell and intercept SINTs
From: Anirudh Rayabharam @ 2026-02-02 18:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh

From: "Anirudh Rayabharam (Microsoft)" <anirudh@anirudhrb.com>

On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.

On arm64, the INTID for SINTs should be in the SGI or PPI range. The
hypervisor exposes a virtual device in ACPI that reserves a PPI for
this use. Introduce a platform_driver that binds to this ACPI device
and obtains the interrupt vector that can be used for SINTs.

Changes in v2:
Addressed review comments:
  - Moved more stuff into mshv_synic.c
  - Code simplifications
  - Removed unnecessary debug prints

v1: https://lore.kernel.org/linux-hyperv/20260128160437.3342167-1-anirudh@anirudhrb.com/

Anirudh Rayabharam (Microsoft) (2):
  mshv: refactor synic init and cleanup
  mshv: add arm64 support for doorbell & intercept SINTs

 drivers/hv/mshv_root.h      |   5 +-
 drivers/hv/mshv_root_main.c |  59 ++-------
 drivers/hv/mshv_synic.c     | 232 ++++++++++++++++++++++++++++++++++--
 3 files changed, 230 insertions(+), 66 deletions(-)

-- 
2.34.1


* [PATCH v2 1/2] mshv: refactor synic init and cleanup
From: Anirudh Rayabharam @ 2026-02-02 18:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh
In-Reply-To: <20260202182706.648192-1-anirudh@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

Rename mshv_synic_init() to mshv_synic_cpu_init() and
mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
these functions handle per-cpu synic setup and teardown.

Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
Move all the synic-related setup out of mshv_parent_partition_init() and
into these functions.

Move the reboot notifier to mshv_synic.c because it currently only
operates on the synic cpuhp state.

Move synic_pages out of the global mshv_root since its use is now
completely local to mshv_synic.c.

This is in preparation for the next patch which will add more stuff to
mshv_synic_init().

No functional change.
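
The init/cleanup ordering that results from this refactoring can be
modelled in a few lines of userspace C. This is a simplified sketch,
not the kernel code: struct synic_state, synic_init_sketch(),
synic_cleanup_sketch() and the fail_step parameter are hypothetical
and only mimic the three setup steps and their unwinding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three resources set up at init. */
struct synic_state {
	bool pages_allocated;      /* models alloc_percpu() */
	bool cpuhp_registered;     /* models cpuhp_setup_state() */
	bool reboot_nb_registered; /* models register_reboot_notifier() */
};

/* fail_step: 0 means no failure, 1..3 selects the step that fails. */
int synic_init_sketch(struct synic_state *s, bool is_root, int fail_step)
{
	assert(s);

	if (fail_step == 1)
		return -1;
	s->pages_allocated = true;

	if (fail_step == 2)
		goto free_pages;
	s->cpuhp_registered = true;

	/* The reboot notifier is only registered on the root partition. */
	if (is_root) {
		if (fail_step == 3)
			goto remove_cpuhp;
		s->reboot_nb_registered = true;
	}
	return 0;

	/* Unwind in the reverse order of setup. */
remove_cpuhp:
	s->cpuhp_registered = false;
free_pages:
	s->pages_allocated = false;
	return -1;
}

/* Mirrors the teardown order used by the cleanup path. */
void synic_cleanup_sketch(struct synic_state *s, bool is_root)
{
	assert(s);

	if (is_root)
		s->reboot_nb_registered = false;
	s->cpuhp_registered = false;
	s->pages_allocated = false;
}
```

With everything behind one init/cleanup pair, the caller needs a single
error label instead of separate ones for the cpuhp state and the pages.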

Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_root.h      |  5 ++-
 drivers/hv/mshv_root_main.c | 59 +++++-------------------------
 drivers/hv/mshv_synic.c     | 71 +++++++++++++++++++++++++++++++++----
 3 files changed, 75 insertions(+), 60 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..26e0320c8097 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -183,7 +183,6 @@ struct hv_synic_pages {
 };
 
 struct mshv_root {
-	struct hv_synic_pages __percpu *synic_pages;
 	spinlock_t pt_ht_lock;
 	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
 	struct hv_partition_property_vmm_capabilities vmm_caps;
@@ -242,8 +241,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
 void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
 
 void mshv_isr(void);
-int mshv_synic_init(unsigned int cpu);
-int mshv_synic_cleanup(unsigned int cpu);
+int mshv_synic_init(struct device *dev);
+void mshv_synic_cleanup(void);
 
 static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
 {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 681b58154d5e..7c1666456e78 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2035,7 +2035,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int mshv_cpuhp_online;
 static int mshv_root_sched_online;
 
 static const char *scheduler_type_to_string(enum hv_scheduler_type type)
@@ -2198,40 +2197,14 @@ root_scheduler_deinit(void)
 	free_percpu(root_scheduler_output);
 }
 
-static int mshv_reboot_notify(struct notifier_block *nb,
-			      unsigned long code, void *unused)
-{
-	cpuhp_remove_state(mshv_cpuhp_online);
-	return 0;
-}
-
-struct notifier_block mshv_reboot_nb = {
-	.notifier_call = mshv_reboot_notify,
-};
-
 static void mshv_root_partition_exit(void)
 {
-	unregister_reboot_notifier(&mshv_reboot_nb);
 	root_scheduler_deinit();
 }
 
 static int __init mshv_root_partition_init(struct device *dev)
 {
-	int err;
-
-	err = root_scheduler_init(dev);
-	if (err)
-		return err;
-
-	err = register_reboot_notifier(&mshv_reboot_nb);
-	if (err)
-		goto root_sched_deinit;
-
-	return 0;
-
-root_sched_deinit:
-	root_scheduler_deinit();
-	return err;
+	return root_scheduler_init(dev);
 }
 
 static void mshv_init_vmm_caps(struct device *dev)
@@ -2276,31 +2249,18 @@ static int __init mshv_parent_partition_init(void)
 			MSHV_HV_MAX_VERSION);
 	}
 
-	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
-	if (!mshv_root.synic_pages) {
-		dev_err(dev, "Failed to allocate percpu synic page\n");
-		ret = -ENOMEM;
+	ret = mshv_synic_init(dev);
+	if (ret)
 		goto device_deregister;
-	}
-
-	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
-				mshv_synic_init,
-				mshv_synic_cleanup);
-	if (ret < 0) {
-		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
-		goto free_synic_pages;
-	}
-
-	mshv_cpuhp_online = ret;
 
 	ret = mshv_retrieve_scheduler_type(dev);
 	if (ret)
-		goto remove_cpu_state;
+		goto synic_cleanup;
 
 	if (hv_root_partition())
 		ret = mshv_root_partition_init(dev);
 	if (ret)
-		goto remove_cpu_state;
+		goto synic_cleanup;
 
 	mshv_init_vmm_caps(dev);
 
@@ -2318,10 +2278,8 @@ static int __init mshv_parent_partition_init(void)
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
-remove_cpu_state:
-	cpuhp_remove_state(mshv_cpuhp_online);
-free_synic_pages:
-	free_percpu(mshv_root.synic_pages);
+synic_cleanup:
+	mshv_synic_cleanup();
 device_deregister:
 	misc_deregister(&mshv_dev);
 	return ret;
@@ -2335,8 +2293,7 @@ static void __exit mshv_parent_partition_exit(void)
 	mshv_irqfd_wq_cleanup();
 	if (hv_root_partition())
 		mshv_root_partition_exit();
-	cpuhp_remove_state(mshv_cpuhp_online);
-	free_percpu(mshv_root.synic_pages);
+	mshv_synic_cleanup();
 }
 
 module_init(mshv_parent_partition_init);
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f8b0337cdc82..98c58755846d 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -12,11 +12,16 @@
 #include <linux/mm.h>
 #include <linux/io.h>
 #include <linux/random.h>
+#include <linux/cpuhotplug.h>
+#include <linux/reboot.h>
 #include <asm/mshyperv.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
 
+static int synic_cpuhp_online;
+static struct hv_synic_pages __percpu *synic_pages;
+
 static u32 synic_event_ring_get_queued_port(u32 sint_index)
 {
 	struct hv_synic_event_ring_page **event_ring_page;
@@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
 	u32 message;
 	u8 tail;
 
-	spages = this_cpu_ptr(mshv_root.synic_pages);
+	spages = this_cpu_ptr(synic_pages);
 	event_ring_page = &spages->synic_event_ring_page;
 	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
 
@@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
 
 void mshv_isr(void)
 {
-	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_message *msg;
 	bool handled;
@@ -446,7 +451,7 @@ void mshv_isr(void)
 	}
 }
 
-int mshv_synic_init(unsigned int cpu)
+static int mshv_synic_cpu_init(unsigned int cpu)
 {
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
@@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
 	union hv_synic_sint sint;
 #endif
 	union hv_synic_scontrol sctrl;
-	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 			&spages->synic_event_flags_page;
@@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
 	return -EFAULT;
 }
 
-int mshv_synic_cleanup(unsigned int cpu)
+static int mshv_synic_cpu_exit(unsigned int cpu)
 {
 	union hv_synic_sint sint;
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
 	union hv_synic_scontrol sctrl;
-	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
+	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 		&spages->synic_event_flags_page;
@@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
 
 	mshv_portid_free(doorbell_portid);
 }
+
+static int mshv_synic_reboot_notify(struct notifier_block *nb,
+			      unsigned long code, void *unused)
+{
+	cpuhp_remove_state(synic_cpuhp_online);
+	return 0;
+}
+
+static struct notifier_block mshv_synic_reboot_nb = {
+	.notifier_call = mshv_synic_reboot_notify,
+};
+
+int __init mshv_synic_init(struct device *dev)
+{
+	int ret = 0;
+
+	synic_pages = alloc_percpu(struct hv_synic_pages);
+	if (!synic_pages) {
+		dev_err(dev, "Failed to allocate percpu synic page\n");
+		return -ENOMEM;
+	}
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
+				mshv_synic_cpu_init,
+				mshv_synic_cpu_exit);
+	if (ret < 0) {
+		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
+		goto free_synic_pages;
+	}
+
+	synic_cpuhp_online = ret;
+
+	if (hv_root_partition()) {
+		ret = register_reboot_notifier(&mshv_synic_reboot_nb);
+		if (ret)
+			goto remove_cpuhp_state;
+	}
+
+	return 0;
+
+remove_cpuhp_state:
+	cpuhp_remove_state(synic_cpuhp_online);
+free_synic_pages:
+	free_percpu(synic_pages);
+	return ret;
+}
+
+void mshv_synic_cleanup(void)
+{
+	if (hv_root_partition())
+		unregister_reboot_notifier(&mshv_synic_reboot_nb);
+	cpuhp_remove_state(synic_cpuhp_online);
+	free_percpu(synic_pages);
+}
-- 
2.34.1



* [PATCH v2 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-02-02 18:27 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh
In-Reply-To: <20260202182706.648192-1-anirudh@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.

On arm64, the INTID for SINTs should be in the SGI or PPI range. The
hypervisor exposes a virtual device in ACPI that reserves a PPI for
this use. Introduce a platform_driver that binds to this ACPI device
and obtains the interrupt vector that can be used for SINTs.

To better unify x86 and arm64 paths, introduce mshv_sint_vector_init() that
either registers the platform_driver and obtains the INTID (arm64) or
just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
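
The selection logic described above can be sketched in standalone C.
This is only a model of the idea, not the code in the diff:
sint_vector_select() and its parameters are hypothetical, and the range
check relies on the GIC convention that SGIs use INTIDs 0-15 and PPIs
use INTIDs 16-31.

```c
#include <assert.h>
#include <stdbool.h>

/* GIC convention: SGIs are INTIDs 0-15, PPIs are INTIDs 16-31. */
#define SKETCH_MAX_SGI_PPI_INTID 31

/*
 * Hypothetical helper: pick the vector to program into the SINT
 * registers. When the architecture has a fixed callback vector (the
 * x86 case), use it directly. Otherwise (the arm64 case) use the INTID
 * discovered from the ACPI device, provided it is an SGI or PPI.
 * Returns the vector, or -1 if no usable vector is available.
 */
int sint_vector_select(bool has_fixed_vector, int fixed_vector,
		       int acpi_intid)
{
	if (has_fixed_vector) {
		assert(fixed_vector >= 0);
		return fixed_vector;
	}

	if (acpi_intid < 0 || acpi_intid > SKETCH_MAX_SGI_PPI_INTID)
		return -1;

	return acpi_intid;
}
```

Whichever branch is taken, the per-cpu init path can then program the
same variable into both the intercept and doorbell SINTs.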

Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_synic.c | 163 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 156 insertions(+), 7 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 98c58755846d..de5fee6e9f29 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -10,17 +10,24 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
+#include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/random.h>
 #include <linux/cpuhotplug.h>
 #include <linux/reboot.h>
 #include <asm/mshyperv.h>
+#include <linux/platform_device.h>
+#include <linux/acpi.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
 
 static int synic_cpuhp_online;
 static struct hv_synic_pages __percpu *synic_pages;
+static int mshv_sint_vector = -1; /* hwirq for the SynIC SINTs */
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+static int mshv_sint_irq = -1; /* Linux IRQ for mshv_sint_vector */
+#endif
 
 static u32 synic_event_ring_get_queued_port(u32 sint_index)
 {
@@ -456,9 +463,7 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
-#ifdef HYPERVISOR_CALLBACK_VECTOR
 	union hv_synic_sint sint;
-#endif
 	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
@@ -501,10 +506,13 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
-#ifdef HYPERVISOR_CALLBACK_VECTOR
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+	enable_percpu_irq(mshv_sint_irq, 0);
+#endif
+
 	/* Enable intercepts */
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = mshv_sint_vector;
 	sint.masked = false;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
@@ -512,13 +520,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 
 	/* Doorbell SINT */
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = mshv_sint_vector;
 	sint.masked = false;
 	sint.as_intercept = 1;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
-#endif
 
 	/* Enable global synic bit */
 	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
@@ -573,6 +580,10 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
 
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+	disable_percpu_irq(mshv_sint_irq);
+#endif
+
 	/* Disable Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
@@ -680,14 +691,149 @@ static struct notifier_block mshv_synic_reboot_nb = {
 	.notifier_call = mshv_synic_reboot_notify,
 };
 
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+#ifdef CONFIG_ACPI
+static long __percpu *mshv_evt;
+
+static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
+{
+	struct resource r;
+
+	if (res->type == ACPI_RESOURCE_TYPE_EXTENDED_IRQ) {
+		if (!acpi_dev_resource_interrupt(res, 0, &r)) {
+			pr_err("Unable to parse MSHV ACPI interrupt\n");
+			return AE_ERROR;
+		}
+		/* ARM64 INTID */
+		mshv_sint_vector = res->data.extended_irq.interrupts[0];
+		/* Linux IRQ number */
+		mshv_sint_irq = r.start;
+	}
+
+	return AE_OK;
+}
+
+static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
+{
+	mshv_isr();
+	return IRQ_HANDLED;
+}
+
+static int mshv_sint_probe(struct platform_device *pdev)
+{
+	acpi_status result;
+	int ret;
+	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
+
+	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
+					mshv_walk_resources, NULL);
+	if (ACPI_FAILURE(result)) {
+		ret = -ENODEV;
+		goto out_fail;
+	}
+
+	mshv_evt = alloc_percpu(long);
+	if (!mshv_evt) {
+		ret = -ENOMEM;
+		goto out_fail;
+	}
+
+	ret = request_percpu_irq(mshv_sint_irq, mshv_percpu_isr, "MSHV",
+		mshv_evt);
+	if (ret)
+		goto free_evt;
+
+	return 0;
+
+free_evt:
+	free_percpu(mshv_evt);
+out_fail:
+	mshv_sint_vector = -1;
+	mshv_sint_irq = -1;
+	return ret;
+}
+
+static void mshv_sint_remove(struct platform_device *pdev)
+{
+	free_percpu_irq(mshv_sint_irq, mshv_evt);
+	free_percpu(mshv_evt);
+}
+#else
+static int mshv_sint_probe(struct platform_device *pdev)
+{
+	return -ENODEV;
+}
+
+static void mshv_sint_remove(struct platform_device *pdev)
+{
+}
+#endif
+
+static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
+	{"MSFT1003", 0},
+	{"", 0},
+};
+
+static struct platform_driver mshv_sint_drv = {
+	.probe = mshv_sint_probe,
+	.remove = mshv_sint_remove,
+	.driver = {
+		.name = "mshv_sint",
+		.acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
+		.probe_type = PROBE_FORCE_SYNCHRONOUS,
+	},
+};
+
+static int __init mshv_sint_vector_init(void)
+{
+	int ret;
+
+	if (acpi_disabled)
+		return -ENODEV;
+
+	ret = platform_driver_register(&mshv_sint_drv);
+	if (ret)
+		return ret;
+
+	if (mshv_sint_vector == -1 || mshv_sint_irq == -1) {
+		platform_driver_unregister(&mshv_sint_drv);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static void mshv_sint_vector_cleanup(void)
+{
+	platform_driver_unregister(&mshv_sint_drv);
+}
+#else /* HYPERVISOR_CALLBACK_VECTOR */
+static int __init mshv_sint_vector_init(void)
+{
+	mshv_sint_vector = HYPERVISOR_CALLBACK_VECTOR;
+	return 0;
+}
+
+static void mshv_sint_vector_cleanup(void)
+{
+}
+#endif /* HYPERVISOR_CALLBACK_VECTOR */
+
 int __init mshv_synic_init(struct device *dev)
 {
 	int ret = 0;
 
+	ret = mshv_sint_vector_init();
+	if (ret) {
+		dev_err(dev, "Failed to get MSHV SINT vector: %i\n", ret);
+		return ret;
+	}
+
 	synic_pages = alloc_percpu(struct hv_synic_pages);
 	if (!synic_pages) {
 		dev_err(dev, "Failed to allocate percpu synic page\n");
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto sint_vector_cleanup;
 	}
 
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
@@ -712,6 +858,8 @@ int __init mshv_synic_init(struct device *dev)
 	cpuhp_remove_state(synic_cpuhp_online);
 free_synic_pages:
 	free_percpu(synic_pages);
+sint_vector_cleanup:
+	mshv_sint_vector_cleanup();
 	return ret;
 }
 
@@ -721,4 +869,5 @@ void mshv_synic_cleanup(void)
 		unregister_reboot_notifier(&mshv_synic_reboot_nb);
 	cpuhp_remove_state(synic_cpuhp_online);
 	free_percpu(synic_pages);
+	mshv_sint_vector_cleanup();
 }
-- 
2.34.1



* Re: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Stanislav Kinsburskii @ 2026-02-02 18:56 UTC (permalink / raw)
  To: Michael Kelley
  Cc: mhkelley58@gmail.com, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41570BBE17C50675E94789FDD49AA@SN6PR02MB4157.namprd02.prod.outlook.com>

On Mon, Feb 02, 2026 at 06:26:42PM +0000, Michael Kelley wrote:
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Monday, February 2, 2026 9:18 AM
> > 
> > On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
> > > From: Michael Kelley <mhklinux@outlook.com>
> > >
> > > Huge page mappings in the guest physical address space depend on having
> > > matching alignment of the userspace address in the parent partition and
> > > of the guest physical address. Add a comment that captures this
> > > information. See the link to the mailing list thread.
> > >
> > > No code or functional change.
> > >
> > > Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
> > > Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> > > ---
> > >  drivers/hv/mshv_root_main.c | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > >
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 681b58154d5e..bc738ff4508e 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
> > >  	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
> > >  		return mshv_unmap_user_memory(partition, mem);
> > >
> > > +	/*
> > > +	 * If the userspace_addr and the guest physical address (as derived
> > > +	 * from the guest_pfn) have the same alignment modulo PMD huge page
> > > +	 * size, the MSHV driver can map any PMD huge pages to the guest
> > > +	 * physical address space as PMD huge pages. If the alignments do
> > > +	 * not match, PMD huge pages must be mapped as single pages in the
> > > +	 * guest physical address space. The MSHV driver does not enforce
> > > +	 * that the alignments match, and it invokes the hypervisor to set
> > > +	 * up correct functional mappings either way. See mshv_chunk_stride().
> > > +	 * The caller of the ioctl is responsible for providing userspace_addr
> > > +	 * and guest_pfn values with matching alignments if it wants the guest
> > > +	 * to get the performance benefits of PMD huge page mappings of its
> > > +	 * physical address space to real system memory.
> > > +	 */
> > 
> > Thanks. However, I'd suggest reducing this comment a lot and putting the
> > details into the commit message instead. Also, why this place? Why not
> > make it part of the function description instead, for example?
> 
> In general, I'm very much an advocate of putting a bit more detail into code
> comments, so that someone new reading the code has a chance of figuring
> out what's going on without having to search through the commit history
> and read commit messages. The commit history is certainly useful for the
> historical record, and especially how things have changed over time. But for
> "how non-obvious things work now", I like to see that in the code comments.
> 

This approach is not well aligned with the existing kernel coding style.
It is common to answer the “why” question in the commit message.
Code comments should focus on “what” the code does.

https://www.kernel.org/doc/html/latest/process/coding-style.html

For more details, it is common to use `git blame` to learn the context
of a change when needed.

> As for where to put the comment, I'm flexible. I thought about placing it
> outside the function as a "header" (which is what I think you mean by the
> "function description"), but the function handles both "map" and "unmap"
> operations, and this comment applies only to "map".  Hence I put it after
> the test for whether we're doing "map" vs. "unmap".  But I wouldn't object
> to it being placed as a function description, though the text would need to be
> enhanced to more broadly be a function description instead of just a comment
> about a specific aspect of "map" behavior.
> 

As for the location, since this documents the userspace API, I would
rather place it above the function as part of the function description.
Even though the function handles both map and unmap, unmap also deals
with huge pages.
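
For reference, the alignment condition described in the proposed comment
can be sketched in plain userspace C. This is only an illustration of the
"same alignment modulo PMD huge page size" rule; it is not the driver's
mshv_chunk_stride() logic. The SKETCH_* constants and the function name
are hypothetical and assume 4 KiB base pages with 2 MiB PMD huge pages.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for this sketch: 4 KiB base pages, 2 MiB PMD huge pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SHIFT  21
#define SKETCH_PMD_SIZE   (UINT64_C(1) << SKETCH_PMD_SHIFT)

/*
 * Return true when userspace_addr and the guest physical address derived
 * from guest_pfn share the same offset within a PMD-sized region, i.e.
 * their alignments match modulo the PMD huge page size.
 */
static bool sketch_pmd_alignment_matches(uint64_t userspace_addr,
					 uint64_t guest_pfn)
{
	uint64_t gpa = guest_pfn << SKETCH_PAGE_SHIFT;

	/* XOR leaves only differing bits; mask keeps the in-PMD offset. */
	return ((userspace_addr ^ gpa) & (SKETCH_PMD_SIZE - 1)) == 0;
}
```

With these assumed sizes, userspace_addr 0x7f0000200000 with guest_pfn
0x200 satisfies the condition, while userspace_addr 0x7f0000201000 with
the same guest_pfn does not.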

Thanks,
Stanislav

> Michael
> 
> > 
> > Thanks,
> > Stanislav
> > 
> > >  	return mshv_map_user_memory(partition, mem);
> > >  }
> > >
> > > --
> > > 2.25.1


* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-02-02 19:01 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYDaaIK0J4SjvnCe@skinsburskii.localdomain>

On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > hypervisor deposited pages.
> > > > > > > > > 
> > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > management is implemented.
> > > > > > > > 
> > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > and would work without any issue for L1VH.
> > > > > > > > 
> > > > > > > 
> > > > > > > No, it won't work and hypervisor-deposited pages won't be withdrawn.
> > > > > > 
> > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > right? What other deposited pages would be left?
> > > > > > 
> > > > > 
> > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > upon guest shutdown) and the other - for the host itself (never
> > > > > withdrawn).
> > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > host partition.
> > > > 
> > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > then reclaim memory? Would this help with kernel consistency
> > > > irrespective of userspace behavior?
> > > > 
> > > 
> > > It would, but this is sloppy and cannot be a long-term solution.
> > > 
> > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > may still crash.
> > 
> > Actually guests won't be running by the time we reach our module_exit
> > function during a kexec. Userspace processes would've been killed by
> > then.
> > 
> 
> No, they will not: "kexec -e" doesn't kill user processes.
> We must not rely on OS to do graceful shutdown before doing
> kexec.

I see, kexec -e is too brutal. Something like systemctl kexec is
more graceful and is probably used more commonly. In this case at least
we could register a reboot notifier and attempt to clean things up.

I think it is better to support kexec to this extent rather than
disabling it entirely.

> 
> > Also, why is this sloppy? Isn't this what module_exit should be
> > doing anyway? If someone unloads our module we should be trying to
> > clean everything up (including killing guests) and reclaim memory.
> > 
> 
> Kexec does not unload modules, but it wouldn't really matter even if it
> did.
> There are other means to plug into the reboot flow, but none of them
> is robust or reliable.
> 
> > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > stop the kexec.
> > 
> 
> By killing the whole system? This is not a good user experience and I
> don't see how this can be justified.

It is justified because, as you said, once we reach that failure we can
no longer guarantee integrity. So BUG() makes sense. This BUG() would
cause the system to go for a full reboot and restore integrity.

> 
> > This is a better solution than disabling KEXEC outright: our
> > driver makes the best possible effort to make kexec work.
> > 
> 
> How is an unreliable feature leading to potential system crashes better
> than disabling kexec outright?

Because there are ways of using the feature reliably. What if someone
has MSHV_ROOT enabled but never starts a VM? (Just because someone has our
driver enabled in the kernel doesn't mean they're using it.) What about crash
dump?

It is far better to support some of these scenarios and be unreliable in
some corner cases rather than disabling the feature completely.

Also, I'm curious if any other driver in the kernel has ever done this
(force disable KEXEC).

> 
> It's the complete opposite for me: the latter provides limited but
> robust functionality, while the former provides unreliable and
> unpredictable behavior.
> 
> > > 
> > > There are two long-term solutions:
> > >  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > 
> > I honestly think we should focus efforts on making kexec work rather
> > than finding ways to prevent it.
> > 
> 
> There is no argument about it. But until we have it fixed properly, we
> have two options: either disable kexec or stop claiming we have our
> driver up and ready for external customers. Given the importance of
> this driver for current projects, I believe the better way would be to
> explicitly limit the functionality instead of postponing the
> productization of the driver.

It is okay to claim our driver as ready even if it doesn't support all
kexec cases. If we can support the common cases such as crash dump and
maybe kexec based servicing (pretty sure people do systemctl kexec and
not kexec -e for this with proper teardown) we can claim that our driver
is ready for general use.

Thanks,
Anirudh.



* Re: [PATCH v2 1/2] mshv: refactor synic init and cleanup
From: Stanislav Kinsburskii @ 2026-02-02 19:07 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260202182706.648192-2-anirudh@anirudhrb.com>

On Mon, Feb 02, 2026 at 06:27:05PM +0000, Anirudh Rayabharam wrote:
> From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 
> Rename mshv_synic_init() to mshv_synic_cpu_init() and
> mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
> these functions handle per-cpu synic setup and teardown.
> 
> Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
> Move all the synic related setup from mshv_parent_partition_init.
> 
> Move the reboot notifier to mshv_synic.c because it currently only
> operates on the synic cpuhp state.
> 
> Move out synic_pages from the global mshv_root since its use is now
> completely local to mshv_synic.c.
> 
> This is in preparation for the next patch which will add more stuff to
> mshv_synic_init().
> 
> No functional change.
> 
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
>  drivers/hv/mshv_root.h      |  5 ++-
>  drivers/hv/mshv_root_main.c | 59 +++++-------------------------
>  drivers/hv/mshv_synic.c     | 71 +++++++++++++++++++++++++++++++++----
>  3 files changed, 75 insertions(+), 60 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 3c1d88b36741..26e0320c8097 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -183,7 +183,6 @@ struct hv_synic_pages {
>  };
>  
>  struct mshv_root {
> -	struct hv_synic_pages __percpu *synic_pages;
>  	spinlock_t pt_ht_lock;
>  	DECLARE_HASHTABLE(pt_htable, MSHV_PARTITIONS_HASH_BITS);
>  	struct hv_partition_property_vmm_capabilities vmm_caps;
> @@ -242,8 +241,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
>  void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
>  
>  void mshv_isr(void);
> -int mshv_synic_init(unsigned int cpu);
> -int mshv_synic_cleanup(unsigned int cpu);
> +int mshv_synic_init(struct device *dev);
> +void mshv_synic_cleanup(void);
>  
>  static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
>  {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 681b58154d5e..7c1666456e78 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2035,7 +2035,6 @@ mshv_dev_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> -static int mshv_cpuhp_online;
>  static int mshv_root_sched_online;
>  
>  static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> @@ -2198,40 +2197,14 @@ root_scheduler_deinit(void)
>  	free_percpu(root_scheduler_output);
>  }
>  
> -static int mshv_reboot_notify(struct notifier_block *nb,
> -			      unsigned long code, void *unused)
> -{
> -	cpuhp_remove_state(mshv_cpuhp_online);
> -	return 0;
> -}
> -

Unrelated to the change, but it would be great to get rid of this
notifier altogether and just do the cleanup in the device shutdown hook.
This is a cleaner approach, as this is a device driver and we do have the
device in hand.
Do you think you could make this change part of this series?

> -struct notifier_block mshv_reboot_nb = {
> -	.notifier_call = mshv_reboot_notify,
> -};
> -
>  static void mshv_root_partition_exit(void)
>  {
> -	unregister_reboot_notifier(&mshv_reboot_nb);
>  	root_scheduler_deinit();
>  }
>  
>  static int __init mshv_root_partition_init(struct device *dev)
>  {
> -	int err;
> -
> -	err = root_scheduler_init(dev);
> -	if (err)
> -		return err;
> -
> -	err = register_reboot_notifier(&mshv_reboot_nb);
> -	if (err)
> -		goto root_sched_deinit;
> -
> -	return 0;
> -
> -root_sched_deinit:
> -	root_scheduler_deinit();
> -	return err;
> +	return root_scheduler_init(dev);
>  }
>  

This conflicts with the "mshv: Add support for integrated scheduler"
patch out there.
Perhaps we should ask Wei to merge that change first.

>  static void mshv_init_vmm_caps(struct device *dev)
> @@ -2276,31 +2249,18 @@ static int __init mshv_parent_partition_init(void)
>  			MSHV_HV_MAX_VERSION);
>  	}
>  
> -	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
> -	if (!mshv_root.synic_pages) {
> -		dev_err(dev, "Failed to allocate percpu synic page\n");
> -		ret = -ENOMEM;
> +	ret = mshv_synic_init(dev);
> +	if (ret)
>  		goto device_deregister;
> -	}
> -
> -	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> -				mshv_synic_init,
> -				mshv_synic_cleanup);
> -	if (ret < 0) {
> -		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> -		goto free_synic_pages;
> -	}
> -
> -	mshv_cpuhp_online = ret;
>  
>  	ret = mshv_retrieve_scheduler_type(dev);
>  	if (ret)
> -		goto remove_cpu_state;
> +		goto synic_cleanup;
>  
>  	if (hv_root_partition())
>  		ret = mshv_root_partition_init(dev);
>  	if (ret)
> -		goto remove_cpu_state;
> +		goto synic_cleanup;
>  
>  	mshv_init_vmm_caps(dev);
>  
> @@ -2318,10 +2278,8 @@ static int __init mshv_parent_partition_init(void)
>  exit_partition:
>  	if (hv_root_partition())
>  		mshv_root_partition_exit();
> -remove_cpu_state:
> -	cpuhp_remove_state(mshv_cpuhp_online);
> -free_synic_pages:
> -	free_percpu(mshv_root.synic_pages);
> +synic_cleanup:
> +	mshv_synic_cleanup();
>  device_deregister:
>  	misc_deregister(&mshv_dev);
>  	return ret;
> @@ -2335,8 +2293,7 @@ static void __exit mshv_parent_partition_exit(void)
>  	mshv_irqfd_wq_cleanup();
>  	if (hv_root_partition())
>  		mshv_root_partition_exit();
> -	cpuhp_remove_state(mshv_cpuhp_online);
> -	free_percpu(mshv_root.synic_pages);
> +	mshv_synic_cleanup();
>  }
>  
>  module_init(mshv_parent_partition_init);
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index f8b0337cdc82..98c58755846d 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -12,11 +12,16 @@
>  #include <linux/mm.h>
>  #include <linux/io.h>
>  #include <linux/random.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/reboot.h>
>  #include <asm/mshyperv.h>
>  
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
>  
> +static int synic_cpuhp_online;
> +static struct hv_synic_pages __percpu *synic_pages;
> +
>  static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  {
>  	struct hv_synic_event_ring_page **event_ring_page;
> @@ -26,7 +31,7 @@ static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  	u32 message;
>  	u8 tail;
>  
> -	spages = this_cpu_ptr(mshv_root.synic_pages);
> +	spages = this_cpu_ptr(synic_pages);
>  	event_ring_page = &spages->synic_event_ring_page;
>  	synic_eventring_tail = (u8 **)this_cpu_ptr(hv_synic_eventring_tail);
>  
> @@ -393,7 +398,7 @@ mshv_intercept_isr(struct hv_message *msg)
>  
>  void mshv_isr(void)
>  {
> -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_message *msg;
>  	bool handled;
> @@ -446,7 +451,7 @@ void mshv_isr(void)
>  	}
>  }
>  
> -int mshv_synic_init(unsigned int cpu)
> +static int mshv_synic_cpu_init(unsigned int cpu)
>  {
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
> @@ -455,7 +460,7 @@ int mshv_synic_init(unsigned int cpu)
>  	union hv_synic_sint sint;
>  #endif
>  	union hv_synic_scontrol sctrl;
> -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  			&spages->synic_event_flags_page;
> @@ -542,14 +547,14 @@ int mshv_synic_init(unsigned int cpu)
>  	return -EFAULT;
>  }
>  
> -int mshv_synic_cleanup(unsigned int cpu)
> +static int mshv_synic_cpu_exit(unsigned int cpu)
>  {
>  	union hv_synic_sint sint;
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
>  	union hv_synic_scontrol sctrl;
> -	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> +	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
>  	struct hv_synic_event_flags_page **event_flags_page =
>  		&spages->synic_event_flags_page;
> @@ -663,3 +668,57 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
>  
>  	mshv_portid_free(doorbell_portid);
>  }
> +
> +static int mshv_synic_reboot_notify(struct notifier_block *nb,
> +			      unsigned long code, void *unused)
> +{
> +	cpuhp_remove_state(synic_cpuhp_online);
> +	return 0;
> +}
> +
> +static struct notifier_block mshv_synic_reboot_nb = {
> +	.notifier_call = mshv_synic_reboot_notify,
> +};
> +
> +int __init mshv_synic_init(struct device *dev)
> +{
> +	int ret = 0;
> +
> +	synic_pages = alloc_percpu(struct hv_synic_pages);
> +	if (!synic_pages) {
> +		dev_err(dev, "Failed to allocate percpu synic page\n");
> +		return -ENOMEM;
> +	}
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> +				mshv_synic_cpu_init,
> +				mshv_synic_cpu_exit);
> +	if (ret < 0) {
> +		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
> +		goto free_synic_pages;
> +	}
> +
> +	synic_cpuhp_online = ret;
> +
> +	if (hv_root_partition()) {

Nit: it's probably better to branch in the notifier itself.
It will introduce an additional object, but the branching will be in one
place instead of two, and it will also make the code simpler and easier to
read.
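
A rough userspace model of that shape, assuming the notifier is always
registered and the root-partition check lives inside the callback. All
sketch_* names are hypothetical stand-ins: sketch_is_root models
hv_root_partition() and the counter models the cpuhp_remove_state() call.

```c
#include <stdbool.h>

/* Stand-in for the kernel's NOTIFY_DONE return value. */
#define SKETCH_NOTIFY_DONE 0

/* Models hv_root_partition(); set by the caller in this sketch. */
static bool sketch_is_root;

/* Counts how often the modelled cpuhp_remove_state() would have run. */
static int sketch_cpuhp_removed;

/*
 * The only place that branches on the partition type. Registration and
 * unregistration of the notifier can then be unconditional.
 */
static int sketch_reboot_notify(unsigned long code)
{
	(void)code;

	if (!sketch_is_root)
		return SKETCH_NOTIFY_DONE;

	sketch_cpuhp_removed++;	/* models cpuhp_remove_state() */
	return SKETCH_NOTIFY_DONE;
}
```

The trade-off is that a non-root parent partition carries a registered
notifier that does nothing, in exchange for dropping the two
hv_root_partition() checks around register/unregister.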

Thanks
Stanislav.

> +		ret = register_reboot_notifier(&mshv_synic_reboot_nb);
> +		if (ret)
> +			goto remove_cpuhp_state;
> +	}
> +
> +	return 0;
> +
> +remove_cpuhp_state:
> +	cpuhp_remove_state(synic_cpuhp_online);
> +free_synic_pages:
> +	free_percpu(synic_pages);
> +	return ret;
> +}
> +
> +void mshv_synic_cleanup(void)
> +{
> +	if (hv_root_partition())
> +		unregister_reboot_notifier(&mshv_synic_reboot_nb);
> +	cpuhp_remove_state(synic_cpuhp_online);
> +	free_percpu(synic_pages);
> +}
> -- 
> 2.34.1
> 


* Re: [PATCH v2 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Stanislav Kinsburskii @ 2026-02-02 19:13 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260202182706.648192-3-anirudh@anirudhrb.com>

On Mon, Feb 02, 2026 at 06:27:06PM +0000, Anirudh Rayabharam wrote:
> From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 
> On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
> interrupts (SINTs) from the hypervisor for doorbells and intercepts.
> There is no such vector reserved for arm64.
> 
> On arm64, the INTID for SINTs should be in the SGI or PPI range. The
> hypervisor exposes a virtual device in the ACPI that reserves a
> PPI for this use. Introduce a platform_driver that binds to this ACPI
> device and obtains the interrupt vector that can be used for SINTs.
> 
> To better unify x86 and arm64 paths, introduce mshv_sint_vector_init() that
> either registers the platform_driver and obtains the INTID (arm64) or
> just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
> 
> Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
>  drivers/hv/mshv_synic.c | 163 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 156 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 98c58755846d..de5fee6e9f29 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -10,17 +10,24 @@
>  #include <linux/kernel.h>
>  #include <linux/slab.h>
>  #include <linux/mm.h>
> +#include <linux/interrupt.h>
>  #include <linux/io.h>
>  #include <linux/random.h>
>  #include <linux/cpuhotplug.h>
>  #include <linux/reboot.h>
>  #include <asm/mshyperv.h>
> +#include <linux/platform_device.h>
> +#include <linux/acpi.h>
>  
>  #include "mshv_eventfd.h"
>  #include "mshv.h"
>  
>  static int synic_cpuhp_online;
>  static struct hv_synic_pages __percpu *synic_pages;
> +static int mshv_sint_vector = -1; /* hwirq for the SynIC SINTs */
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +static int mshv_sint_irq = -1; /* Linux IRQ for mshv_sint_vector */
> +#endif
>  
>  static u32 synic_event_ring_get_queued_port(u32 sint_index)
>  {
> @@ -456,9 +463,7 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  	union hv_synic_simp simp;
>  	union hv_synic_siefp siefp;
>  	union hv_synic_sirbp sirbp;
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
>  	union hv_synic_sint sint;
> -#endif
>  	union hv_synic_scontrol sctrl;
>  	struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
>  	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> @@ -501,10 +506,13 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  
>  	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>  
> -#ifdef HYPERVISOR_CALLBACK_VECTOR
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +	enable_percpu_irq(mshv_sint_irq, 0);
> +#endif
> +
>  	/* Enable intercepts */
>  	sint.as_uint64 = 0;
> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.vector = mshv_sint_vector;
>  	sint.masked = false;
>  	sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
> @@ -512,13 +520,12 @@ static int mshv_synic_cpu_init(unsigned int cpu)
>  
>  	/* Doorbell SINT */
>  	sint.as_uint64 = 0;
> -	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
> +	sint.vector = mshv_sint_vector;
>  	sint.masked = false;
>  	sint.as_intercept = 1;
>  	sint.auto_eoi = hv_recommend_using_aeoi();
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
> -#endif
>  
>  	/* Enable global synic bit */
>  	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> @@ -573,6 +580,10 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
>  	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
>  			      sint.as_uint64);
>  
> +#ifndef HYPERVISOR_CALLBACK_VECTOR
> +	disable_percpu_irq(mshv_sint_irq);
> +#endif
> +
>  	/* Disable Synic's event ring page */
>  	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
>  	sirbp.sirbp_enabled = false;
> @@ -680,14 +691,149 @@ static struct notifier_block mshv_synic_reboot_nb = {
>  	.notifier_call = mshv_synic_reboot_notify,
>  };
>  
> +#ifndef HYPERVISOR_CALLBACK_VECTOR

You have introduced 4 ifdef branches (one around the variable and three
in mshv_synic_cpu_init) and then you still have a big ifdef branch here.

Why is it better than simply introducing two different
mshv_synic_cpu_init functions and having a single big ifdef instead
(especially given that this code is arch-specific anyway and thus won't
bloat the binary)?

This will also allow getting rid of the redundant mshv_sint_vector
variable on x86.
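
A minimal userspace sketch of that layout, assuming the shared SINT
programming is factored into a common helper. Every sketch_* name is
hypothetical, and the struct only models the two fields relevant here.
Building with -DHYPERVISOR_CALLBACK_VECTOR=<n> models the x86 side;
without it, the arm64 side is compiled.

```c
/* Models the vector and masked fields of a SINT register. */
struct sketch_sint {
	int vector;
	int masked;
};

/* Shared by both flavours: program a SINT with the given vector. */
static void sketch_program_sint(struct sketch_sint *sint, int vector)
{
	sint->vector = vector;
	sint->masked = 0;
}

#ifdef HYPERVISOR_CALLBACK_VECTOR
/* x86 flavour: the vector is a compile-time constant, no variable. */
static int sketch_synic_cpu_init(struct sketch_sint *sint)
{
	sketch_program_sint(sint, HYPERVISOR_CALLBACK_VECTOR);
	return 0;
}
#else
/* arm64 flavour: the INTID is discovered at probe time. */
static int sketch_sint_intid = -1;
static int sketch_percpu_irq_enabled;

static int sketch_synic_cpu_init(struct sketch_sint *sint)
{
	if (sketch_sint_intid < 0)
		return -1;	/* probe has not supplied an INTID yet */

	sketch_percpu_irq_enabled = 1;	/* models enable_percpu_irq() */
	sketch_program_sint(sint, sketch_sint_intid);
	return 0;
}
#endif
```

Only the arm64 side keeps state for the vector, and the per-CPU init
body itself contains no preprocessor conditionals.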

Thanks,
Stanislav

> +#ifdef CONFIG_ACPI
> +static long __percpu *mshv_evt;
> +
> +static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
> +{
> +	struct resource r;
> +
> +	if (res->type == ACPI_RESOURCE_TYPE_EXTENDED_IRQ) {
> +		if (!acpi_dev_resource_interrupt(res, 0, &r)) {
> +			pr_err("Unable to parse MSHV ACPI interrupt\n");
> +			return AE_ERROR;
> +		}
> +		/* ARM64 INTID */
> +		mshv_sint_vector = res->data.extended_irq.interrupts[0];
> +		/* Linux IRQ number */
> +		mshv_sint_irq = r.start;
> +	}
> +
> +	return AE_OK;
> +}
> +
> +static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
> +{
> +	mshv_isr();
> +	return IRQ_HANDLED;
> +}
> +
> +static int mshv_sint_probe(struct platform_device *pdev)
> +{
> +	acpi_status result;
> +	int ret;
> +	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
> +
> +	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
> +					mshv_walk_resources, NULL);
> +	if (ACPI_FAILURE(result)) {
> +		ret = -ENODEV;
> +		goto out_fail;
> +	}
> +
> +	mshv_evt = alloc_percpu(long);
> +	if (!mshv_evt) {
> +		ret = -ENOMEM;
> +		goto out_fail;
> +	}
> +
> +	ret = request_percpu_irq(mshv_sint_irq, mshv_percpu_isr, "MSHV",
> +		mshv_evt);
> +	if (ret)
> +		goto free_evt;
> +
> +	return 0;
> +
> +free_evt:
> +	free_percpu(mshv_evt);
> +out_fail:
> +	mshv_sint_vector = -1;
> +	mshv_sint_irq = -1;
> +	return ret;
> +}
> +
> +static void mshv_sint_remove(struct platform_device *pdev)
> +{
> +	free_percpu_irq(mshv_sint_irq, mshv_evt);
> +	free_percpu(mshv_evt);
> +}
> +#else
> +static int mshv_sint_probe(struct platform_device *pdev)
> +{
> +	return -ENODEV;
> +}
> +
> +static void mshv_sint_remove(struct platform_device *pdev)
> +{
> +}
> +#endif
> +
> +static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
> +	{"MSFT1003", 0},
> +	{"", 0},
> +};
> +
> +static struct platform_driver mshv_sint_drv = {
> +	.probe = mshv_sint_probe,
> +	.remove = mshv_sint_remove,
> +	.driver = {
> +		.name = "mshv_sint",
> +		.acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
> +		.probe_type = PROBE_FORCE_SYNCHRONOUS,
> +	},
> +};
> +
> +static int __init mshv_sint_vector_init(void)
> +{
> +	int ret;
> +
> +	if (acpi_disabled)
> +		return -ENODEV;
> +
> +	ret = platform_driver_register(&mshv_sint_drv);
> +	if (ret)
> +		return ret;
> +
> +	if (mshv_sint_vector == -1 || mshv_sint_irq == -1) {
> +		platform_driver_unregister(&mshv_sint_drv);
> +		return -ENODEV;
> +	}
> +
> +	return 0;
> +}
> +
> +static void mshv_sint_vector_cleanup(void)
> +{
> +	platform_driver_unregister(&mshv_sint_drv);
> +}
> +#else /* HYPERVISOR_CALLBACK_VECTOR */
> +static int __init mshv_sint_vector_init(void)
> +{
> +	mshv_sint_vector = HYPERVISOR_CALLBACK_VECTOR;
> +	return 0;
> +}
> +
> +static void mshv_sint_vector_cleanup(void)
> +{
> +}
> +#endif /* HYPERVISOR_CALLBACK_VECTOR */
> +
>  int __init mshv_synic_init(struct device *dev)
>  {
>  	int ret = 0;
>  
> +	ret = mshv_sint_vector_init();
> +	if (ret) {
> +		dev_err(dev, "Failed to get MSHV SINT vector: %i\n", ret);
> +		return ret;
> +	}
> +
>  	synic_pages = alloc_percpu(struct hv_synic_pages);
>  	if (!synic_pages) {
>  		dev_err(dev, "Failed to allocate percpu synic page\n");
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto sint_vector_cleanup;
>  	}
>  
>  	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
> @@ -712,6 +858,8 @@ int __init mshv_synic_init(struct device *dev)
>  	cpuhp_remove_state(synic_cpuhp_online);
>  free_synic_pages:
>  	free_percpu(synic_pages);
> +sint_vector_cleanup:
> +	mshv_sint_vector_cleanup();
>  	return ret;
>  }
>  
> @@ -721,4 +869,5 @@ void mshv_synic_cleanup(void)
>  		unregister_reboot_notifier(&mshv_synic_reboot_nb);
>  	cpuhp_remove_state(synic_cpuhp_online);
>  	free_percpu(synic_pages);
> +	mshv_sint_vector_cleanup();
>  }
> -- 
> 2.34.1
> 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-02-02 19:18 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYD0bafU3UYuSvDW@anirudh-surface.localdomain>

On Mon, Feb 02, 2026 at 07:01:01PM +0000, Anirudh Rayabharam wrote:
> On Mon, Feb 02, 2026 at 09:10:00AM -0800, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 08:32:45PM +0000, Anirudh Rayabharam wrote:
> > > On Fri, Jan 30, 2026 at 10:46:45AM -0800, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 05:11:12PM +0000, Anirudh Rayabharam wrote:
> > > > > On Wed, Jan 28, 2026 at 03:11:14PM -0800, Stanislav Kinsburskii wrote:
> > > > > > On Wed, Jan 28, 2026 at 04:16:31PM +0000, Anirudh Rayabharam wrote:
> > > > > > > On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> > > > > > > > On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > > > > > > > > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > > > hypervisor deposited pages.
> > > > > > > > > > 
> > > > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > > > management is implemented.
> > > > > > > > > 
> > > > > > > > > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > > > > > > > > and would work without any issue for L1VH.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > No, it won't work and hypervisor deposited pages won't be withdrawn.
> > > > > > > 
> > > > > > > All pages that were deposited in the context of a guest partition (i.e.
> > > > > > > with the guest partition ID), would be withdrawn when you kill the VMs,
> > > > > > > right? What other deposited pages would be left?
> > > > > > > 
> > > > > > 
> > > > > > The driver deposits two types of pages: one for the guests (withdrawn
> > > > > > upon guest shutdown) and the other - for the host itself (never
> > > > > > withdrawn).
> > > > > > See hv_call_create_partition, for example: it deposits pages for the
> > > > > > host partition.
> > > > > 
> > > > > Hmm.. I see. Is it not possible to reclaim this memory in module_exit?
> > > > > Also, can't we forcefully kill all running partitions in module_exit and
> > > > > then reclaim memory? Would this help with kernel consistency
> > > > > irrespective of userspace behavior?
> > > > > 
> > > > 
> > > > It would, but this is sloppy and cannot be a long-term solution.
> > > > 
> > > > It is also not reliable. We have no hook to prevent kexec. So if we fail
> > > > to kill the guest or reclaim the memory for any reason, the new kernel
> > > > may still crash.
> > > 
> > > Actually guests won't be running by the time we reach our module_exit
> > > function during a kexec. Userspace processes would've been killed by
> > > then.
> > > 
> > 
> > No, they will not: "kexec -e" doesn't kill user processes.
> > We must not rely on OS to do graceful shutdown before doing
> > kexec.
> 
> I see kexec -e is too brutal. Something like systemctl kexec is
> more graceful and is probably used more commonly. In this case at least
> we could register a reboot notifier and attempt to clean things up.
> 
> I think it is better to support kexec to this extent rather than
> disabling it entirely.
> 

You do understand that once our kernel is released to third parties, we
can’t control how they will use kexec, right?

kexec -e is a valid, existing option, and we have to account for it. Yet
again, L1VH will be used by arbitrary third parties out there, not just
by us.

We can’t say the kernel supports MSHV until we close these gaps. We must
not depend on user space to keep the kernel safe.

Do you agree?

Thanks,
Stanislav

> > 
> > > Also, why is this sloppy? Isn't this what module_exit should be
> > > doing anyway? If someone unloads our module we should be trying to
> > > clean everything up (including killing guests) and reclaim memory.
> > > 
> > 
> > Kexec does not unload modules, but it doesn't really matter even if it
> > would.
> > There are other means to plug into the reboot flow, but neither of them
> > is robust or reliable.
> > 
> > > In any case, we can BUG() out if we fail to reclaim the memory. That would
> > > stop the kexec.
> > > 
> > 
> > By killing the whole system? This is not a good user experience and I
> > don't see how can this be justified.
> 
> It is justified because, as you said, once we reach that failure we can
> no longer guarantee integrity. So BUG() makes sense. This BUG() would
> cause the system to go for a full reboot and restore integrity.
> 
> > 
> > > This is a better solution since instead of disabling KEXEC outright: our
> > > driver made the best possible efforts to make kexec work.
> > > 
> > 
> > How is an unreliable feature leading to potential system crashes better
> > than disabling kexec outright?
> 
> Because there are ways of using the feature reliably. What if someone
> has MSHV_ROOT enabled but never starts a VM? (Just because someone has our
> driver enabled in the kernel doesn't mean they're using it.) What about crash
> dump?
> 
> It is far better to support some of these scenarios and be unreliable in
> some corner cases rather than disabling the feature completely.
> 
> Also, I'm curious if any other driver in the kernel has ever done this
> (force disable KEXEC).
> 
> > 
> > It's a complete opposite story for me: the latter provides a limited,
> > but robust functionality, while the former provides an unreliable and
> > unpredictable behavior.
> > 
> > > > 
> > > > There are two long-term solutions:
> > > >  1. Add a way to prevent kexec when there is shared state between the hypervisor and the kernel.
> > > 
> > > I honestly think we should focus efforts on making kexec work rather
> > > than finding ways to prevent it.
> > > 
> > 
> > There is no argument about it. But until we have it fixed properly, we
> > have two options: either disable kexec or stop claiming we have our
> > driver up and ready for external customers. Given the importance of
> > this driver for current projects, I believe the better way would be to
> > explicitly limit the functionality instead of postponing the
> > productization of the driver.
> 
> It is okay to claim our driver as ready even if it doesn't support all
> kexec cases. If we can support the common cases such as crash dump and
> maybe kexec based servicing (pretty sure people do systemctl kexec and
> not kexec -e for this with proper teardown) we can claim that our driver
> is ready for general use.
> 
> Thanks,
> Anirudh.

^ permalink raw reply

* Re: [PATCH v3] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-02-02 19:24 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXzuxHwLofHaW-Xe@anirudh-surface.localdomain>

On Fri, Jan 30, 2026 at 05:47:48PM +0000, Anirudh Rayabharam wrote:
> On Fri, Jan 30, 2026 at 04:04:14PM +0000, Stanislav Kinsburskii wrote:
> > Query the hypervisor for integrated scheduler support and use it if
> > configured.
> > 
> > Microsoft Hypervisor originally provided two schedulers: root and core. The
> > root scheduler allows the root partition to schedule guest vCPUs across
> > physical cores, supporting both time slicing and CPU affinity (e.g., via
> > cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
> > scheduling entirely to the hypervisor.
> > 
> > Direct virtualization introduces a new privileged guest partition type - L1
> > Virtual Host (L1VH) — which can create child partitions from its own
> > resources. These child partitions are effectively siblings, scheduled by
> > the hypervisor's core scheduler. This prevents the L1VH parent from setting
> > affinity or time slicing for its own processes or guest VPs. While cgroups,
> > CFS, and cpuset controllers can still be used, their effectiveness is
> > unpredictable, as the core scheduler swaps vCPUs according to its own logic
> > (typically round-robin across all allocated physical CPUs). As a result,
> > the system may appear to "steal" time from the L1VH and its children.
> > 
> > To address this, Microsoft Hypervisor introduces the integrated scheduler.
> > This allows an L1VH partition to schedule its own vCPUs and those of its
> 
> How could an L1VH partition schedule its own vCPUs?
> 

By means of the integrated scheduler. Or, from another perspective,
the same way any other root partition does: by placing load on a
particular core and halting when there is nothing to do.

> > guests across its "physical" cores, effectively emulating root scheduler
> > behavior within the L1VH, while retaining core scheduler behavior for the
> > rest of the system.
> > 
> > The integrated scheduler is controlled by the root partition and gated by
> > the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
> > supports the integrated scheduler. The L1VH partition must then check if it
> > is enabled by querying the corresponding extended partition property. If
> > this property is true, the L1VH partition must use the root scheduler
> > logic; otherwise, it must use the core scheduler. This requirement makes
> > reading VMM capabilities in L1VH partition a requirement too.
> > 
> > Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |   85 +++++++++++++++++++++++++++----------------
> >  include/hyperv/hvhdk_mini.h |    7 +++-
> >  2 files changed, 59 insertions(+), 33 deletions(-)
> > 
> > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > index 1134a82c7881..6a6bf641b352 100644
> > --- a/drivers/hv/mshv_root_main.c
> > +++ b/drivers/hv/mshv_root_main.c
> > @@ -2053,6 +2053,32 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
> >  	};
> >  }
> >  
> > +static int __init l1vh_retrive_scheduler_type(enum hv_scheduler_type *out)
> 
> typo: retrieve*
> 

Thanks, will fix.

> > +{
> > +	u64 integrated_sched_enabled;
> > +	int ret;
> > +
> > +	*out = HV_SCHEDULER_TYPE_CORE_SMT;
> > +
> > +	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
> > +		return 0;
> > +
> > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > +						HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
> > +						0, &integrated_sched_enabled,
> > +						sizeof(integrated_sched_enabled));
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (integrated_sched_enabled)
> > +		*out = HV_SCHEDULER_TYPE_ROOT;
> > +
> > +	pr_debug("%s: integrated scheduler property read: ret=%d value=%llu\n",
> > +		 __func__, ret, integrated_sched_enabled);
> 
> ret is always 0 here, right? We don't need to bother printing then.
> 

Oh yes, good point. Will fix.

> > +
> > +	return 0;
> > +}
> > +
> >  /* TODO move this to hv_common.c when needed outside */
> >  static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> >  {
> > @@ -2085,13 +2111,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
> >  /* Retrieve and stash the supported scheduler type */
> >  static int __init mshv_retrieve_scheduler_type(struct device *dev)
> >  {
> > -	int ret = 0;
> > +	int ret;
> >  
> >  	if (hv_l1vh_partition())
> > -		hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
> > +		ret = l1vh_retrive_scheduler_type(&hv_scheduler_type);
> >  	else
> >  		ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
> > -
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -2211,42 +2236,29 @@ struct notifier_block mshv_reboot_nb = {
> >  static void mshv_root_partition_exit(void)
> >  {
> >  	unregister_reboot_notifier(&mshv_reboot_nb);
> > -	root_scheduler_deinit();
> >  }
> >  
> >  static int __init mshv_root_partition_init(struct device *dev)
> >  {
> > -	int err;
> > -
> > -	err = root_scheduler_init(dev);
> > -	if (err)
> > -		return err;
> > -
> > -	err = register_reboot_notifier(&mshv_reboot_nb);
> > -	if (err)
> > -		goto root_sched_deinit;
> > -
> > -	return 0;
> > -
> > -root_sched_deinit:
> > -	root_scheduler_deinit();
> > -	return err;
> > +	return register_reboot_notifier(&mshv_reboot_nb);
> >  }
> >  
> > -static void mshv_init_vmm_caps(struct device *dev)
> > +static int __init mshv_init_vmm_caps(struct device *dev)
> >  {
> > -	/*
> > -	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
> > -	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
> > -	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
> > -	 */
> > -	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > -					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > -					      0, &mshv_root.vmm_caps,
> > -					      sizeof(mshv_root.vmm_caps)))
> > -		dev_warn(dev, "Unable to get VMM capabilities\n");
> > +	int ret;
> > +
> > +	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
> > +						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
> > +						0, &mshv_root.vmm_caps,
> > +						sizeof(mshv_root.vmm_caps));
> > +	if (ret && hv_l1vh_partition()) {
> > +		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
> > +		return ret;
> 
> I don't think we need to fail here. If there are not VMM caps available,
> that means integrated scheduler is not supported by the hypervisor, so
> fall back to core scheduler.
> 

I believe we discussed this in a personal conversation earlier.
Let me know if we need to discuss it further.

Thanks,
Stanislav

> Thanks,
> Anirudh
> 
> > +	}
> >  
> >  	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
> > +
> > +	return 0;
> >  }
> >  
> >  static int __init mshv_parent_partition_init(void)
> > @@ -2292,6 +2304,10 @@ static int __init mshv_parent_partition_init(void)
> >  
> >  	mshv_cpuhp_online = ret;
> >  
> > +	ret = mshv_init_vmm_caps(dev);
> > +	if (ret)
> > +		goto remove_cpu_state;
> > +
> >  	ret = mshv_retrieve_scheduler_type(dev);
> >  	if (ret)
> >  		goto remove_cpu_state;
> > @@ -2301,11 +2317,13 @@ static int __init mshv_parent_partition_init(void)
> >  	if (ret)
> >  		goto remove_cpu_state;
> >  
> > -	mshv_init_vmm_caps(dev);
> > +	ret = root_scheduler_init(dev);
> > +	if (ret)
> > +		goto exit_partition;
> >  
> >  	ret = mshv_irqfd_wq_init();
> >  	if (ret)
> > -		goto exit_partition;
> > +		goto deinit_root_scheduler;
> >  
> >  	spin_lock_init(&mshv_root.pt_ht_lock);
> >  	hash_init(mshv_root.pt_htable);
> > @@ -2314,6 +2332,8 @@ static int __init mshv_parent_partition_init(void)
> >  
> >  	return 0;
> >  
> > +deinit_root_scheduler:
> > +	root_scheduler_deinit();
> >  exit_partition:
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> > @@ -2332,6 +2352,7 @@ static void __exit mshv_parent_partition_exit(void)
> >  	mshv_port_table_fini();
> >  	misc_deregister(&mshv_dev);
> >  	mshv_irqfd_wq_cleanup();
> > +	root_scheduler_deinit();
> >  	if (hv_root_partition())
> >  		mshv_root_partition_exit();
> >  	cpuhp_remove_state(mshv_cpuhp_online);
> > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> > index 41a29bf8ec14..c0300910808b 100644
> > --- a/include/hyperv/hvhdk_mini.h
> > +++ b/include/hyperv/hvhdk_mini.h
> > @@ -87,6 +87,9 @@ enum hv_partition_property_code {
> >  	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
> >  	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
> >  
> > +	/* Integrated scheduling properties */
> > +	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
> > +
> >  	/* Resource properties */
> >  	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
> >  	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
> > @@ -102,7 +105,7 @@ enum hv_partition_property_code {
> >  };
> >  
> >  #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
> > -#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	59
> > +#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
> >  
> >  struct hv_partition_property_vmm_capabilities {
> >  	u16 bank_count;
> > @@ -119,6 +122,8 @@ struct hv_partition_property_vmm_capabilities {
> >  			u64 reservedbit3: 1;
> >  #endif
> >  			u64 assignable_synthetic_proc_features: 1;
> > +			u64 reservedbit5: 1;
> > +			u64 vmm_enable_integrated_scheduler : 1;
> >  			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
> >  		} __packed;
> >  	};
> > 
> > 

^ permalink raw reply

* [PATCH v4] mshv: Add support for integrated scheduler
From: Stanislav Kinsburskii @ 2026-02-02 19:26 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel

Query the hypervisor for integrated scheduler support and use it if
configured.

Microsoft Hypervisor originally provided two schedulers: root and core. The
root scheduler allows the root partition to schedule guest vCPUs across
physical cores, supporting both time slicing and CPU affinity (e.g., via
cgroups). In contrast, the core scheduler delegates vCPU-to-physical-core
scheduling entirely to the hypervisor.

Direct virtualization introduces a new privileged guest partition type, L1
Virtual Host (L1VH), which can create child partitions from its own
resources. These child partitions are effectively siblings, scheduled by
the hypervisor's core scheduler. This prevents the L1VH parent from setting
affinity or time slicing for its own processes or guest VPs. While cgroups,
CFS, and cpuset controllers can still be used, their effectiveness is
unpredictable, as the core scheduler swaps vCPUs according to its own logic
(typically round-robin across all allocated physical CPUs). As a result,
the system may appear to "steal" time from the L1VH and its children.

To address this, Microsoft Hypervisor introduces the integrated scheduler.
This allows an L1VH partition to schedule its own vCPUs and those of its
guests across its "physical" cores, effectively emulating root scheduler
behavior within the L1VH, while retaining core scheduler behavior for the
rest of the system.

The integrated scheduler is controlled by the root partition and gated by
the vmm_enable_integrated_scheduler capability bit. If set, the hypervisor
supports the integrated scheduler. The L1VH partition must then check if it
is enabled by querying the corresponding extended partition property. If
this property is true, the L1VH partition must use the root scheduler
logic; otherwise, it must use the core scheduler. As a result, reading the
VMM capabilities becomes mandatory in an L1VH partition.

Signed-off-by: Andreea Pintilie <anpintil@microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c |   82 ++++++++++++++++++++++++++-----------------
 include/hyperv/hvhdk_mini.h |    7 +++-
 2 files changed, 56 insertions(+), 33 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..006dd9e68c27 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2053,6 +2053,29 @@ static const char *scheduler_type_to_string(enum hv_scheduler_type type)
 	};
 }
 
+static int __init l1vh_retrieve_scheduler_type(enum hv_scheduler_type *out)
+{
+	u64 integrated_sched_enabled;
+	int ret;
+
+	*out = HV_SCHEDULER_TYPE_CORE_SMT;
+
+	if (!mshv_root.vmm_caps.vmm_enable_integrated_scheduler)
+		return 0;
+
+	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+						HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED,
+						0, &integrated_sched_enabled,
+						sizeof(integrated_sched_enabled));
+	if (ret)
+		return ret;
+
+	if (integrated_sched_enabled)
+		*out = HV_SCHEDULER_TYPE_ROOT;
+
+	return 0;
+}
+
 /* TODO move this to hv_common.c when needed outside */
 static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
 {
@@ -2085,13 +2108,12 @@ static int __init hv_retrieve_scheduler_type(enum hv_scheduler_type *out)
 /* Retrieve and stash the supported scheduler type */
 static int __init mshv_retrieve_scheduler_type(struct device *dev)
 {
-	int ret = 0;
+	int ret;
 
 	if (hv_l1vh_partition())
-		hv_scheduler_type = HV_SCHEDULER_TYPE_CORE_SMT;
+		ret = l1vh_retrieve_scheduler_type(&hv_scheduler_type);
 	else
 		ret = hv_retrieve_scheduler_type(&hv_scheduler_type);
-
 	if (ret)
 		return ret;
 
@@ -2211,42 +2233,29 @@ struct notifier_block mshv_reboot_nb = {
 static void mshv_root_partition_exit(void)
 {
 	unregister_reboot_notifier(&mshv_reboot_nb);
-	root_scheduler_deinit();
 }
 
 static int __init mshv_root_partition_init(struct device *dev)
 {
-	int err;
-
-	err = root_scheduler_init(dev);
-	if (err)
-		return err;
-
-	err = register_reboot_notifier(&mshv_reboot_nb);
-	if (err)
-		goto root_sched_deinit;
-
-	return 0;
-
-root_sched_deinit:
-	root_scheduler_deinit();
-	return err;
+	return register_reboot_notifier(&mshv_reboot_nb);
 }
 
-static void mshv_init_vmm_caps(struct device *dev)
+static int __init mshv_init_vmm_caps(struct device *dev)
 {
-	/*
-	 * This can only fail here if HVCALL_GET_PARTITION_PROPERTY_EX or
-	 * HV_PARTITION_PROPERTY_VMM_CAPABILITIES are not supported. In that
-	 * case it's valid to proceed as if all vmm_caps are disabled (zero).
-	 */
-	if (hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
-					      HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
-					      0, &mshv_root.vmm_caps,
-					      sizeof(mshv_root.vmm_caps)))
-		dev_warn(dev, "Unable to get VMM capabilities\n");
+	int ret;
+
+	ret = hv_call_get_partition_property_ex(HV_PARTITION_ID_SELF,
+						HV_PARTITION_PROPERTY_VMM_CAPABILITIES,
+						0, &mshv_root.vmm_caps,
+						sizeof(mshv_root.vmm_caps));
+	if (ret && hv_l1vh_partition()) {
+		dev_err(dev, "Failed to get VMM capabilities: %d\n", ret);
+		return ret;
+	}
 
 	dev_dbg(dev, "vmm_caps = %#llx\n", mshv_root.vmm_caps.as_uint64[0]);
+
+	return 0;
 }
 
 static int __init mshv_parent_partition_init(void)
@@ -2292,6 +2301,10 @@ static int __init mshv_parent_partition_init(void)
 
 	mshv_cpuhp_online = ret;
 
+	ret = mshv_init_vmm_caps(dev);
+	if (ret)
+		goto remove_cpu_state;
+
 	ret = mshv_retrieve_scheduler_type(dev);
 	if (ret)
 		goto remove_cpu_state;
@@ -2301,11 +2314,13 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto remove_cpu_state;
 
-	mshv_init_vmm_caps(dev);
+	ret = root_scheduler_init(dev);
+	if (ret)
+		goto exit_partition;
 
 	ret = mshv_irqfd_wq_init();
 	if (ret)
-		goto exit_partition;
+		goto deinit_root_scheduler;
 
 	spin_lock_init(&mshv_root.pt_ht_lock);
 	hash_init(mshv_root.pt_htable);
@@ -2314,6 +2329,8 @@ static int __init mshv_parent_partition_init(void)
 
 	return 0;
 
+deinit_root_scheduler:
+	root_scheduler_deinit();
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
@@ -2332,6 +2349,7 @@ static void __exit mshv_parent_partition_exit(void)
 	mshv_port_table_fini();
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
+	root_scheduler_deinit();
 	if (hv_root_partition())
 		mshv_root_partition_exit();
 	cpuhp_remove_state(mshv_cpuhp_online);
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 41a29bf8ec14..c0300910808b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -87,6 +87,9 @@ enum hv_partition_property_code {
 	HV_PARTITION_PROPERTY_PRIVILEGE_FLAGS			= 0x00010000,
 	HV_PARTITION_PROPERTY_SYNTHETIC_PROC_FEATURES		= 0x00010001,
 
+	/* Integrated scheduling properties */
+	HV_PARTITION_PROPERTY_INTEGRATED_SCHEDULER_ENABLED	= 0x00020005,
+
 	/* Resource properties */
 	HV_PARTITION_PROPERTY_GPA_PAGE_ACCESS_TRACKING		= 0x00050005,
 	HV_PARTITION_PROPERTY_UNIMPLEMENTED_MSR_ACTION		= 0x00050017,
@@ -102,7 +105,7 @@ enum hv_partition_property_code {
 };
 
 #define HV_PARTITION_VMM_CAPABILITIES_BANK_COUNT		1
-#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	59
+#define HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT	57
 
 struct hv_partition_property_vmm_capabilities {
 	u16 bank_count;
@@ -119,6 +122,8 @@ struct hv_partition_property_vmm_capabilities {
 			u64 reservedbit3: 1;
 #endif
 			u64 assignable_synthetic_proc_features: 1;
+			u64 reservedbit5: 1;
+			u64 vmm_enable_integrated_scheduler : 1;
 			u64 reserved0: HV_PARTITION_VMM_CAPABILITIES_RESERVED_BITFIELD_COUNT;
 		} __packed;
 	};
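
For readers following the selection logic described in the commit message,
here is a minimal user-space sketch of it. It is not the driver code: the
sketch_* names, the callback type and the canned query helpers are
hypothetical, the error value -5 is arbitrary, and the bit position assumes
bitfields are allocated from the least significant bit. The count of seven
named bits is inferred from the reservedbit3 and reservedbit5 field names,
since the first three bits are not visible in the hunk.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's enum hv_scheduler_type values. */
enum sketch_sched_type { SKETCH_SCHED_CORE_SMT, SKETCH_SCHED_ROOT };

/*
 * The hunk places vmm_enable_integrated_scheduler right after reservedbit5,
 * i.e. at bit 6, assuming LSB-first bitfield allocation (as GCC and Clang
 * do on little-endian targets).
 */
#define SKETCH_CAP_INTEGRATED_SCHED	(1ULL << 6)

/* Seven single-bit fields plus the shrunken reserved field fill one u64. */
_Static_assert(7 + 57 == 64, "capability bank must stay 64 bits wide");

/* Hypothetical callback standing in for the partition property hypercall. */
typedef int (*sketch_query_fn)(uint64_t *value);

static int sketch_l1vh_pick_scheduler(uint64_t vmm_caps, sketch_query_fn query,
				      enum sketch_sched_type *out)
{
	uint64_t enabled = 0;
	int ret;

	/* Default: the hypervisor's core scheduler. */
	*out = SKETCH_SCHED_CORE_SMT;

	/* Capability bit clear: the property is never queried. */
	if (!(vmm_caps & SKETCH_CAP_INTEGRATED_SCHED))
		return 0;

	ret = query(&enabled);
	if (ret)
		return ret;

	if (enabled)
		*out = SKETCH_SCHED_ROOT;
	return 0;
}

/* Canned property queries for trying the helper out. */
static int sketch_query_on(uint64_t *value)   { *value = 1; return 0; }
static int sketch_query_off(uint64_t *value)  { *value = 0; return 0; }
static int sketch_query_fail(uint64_t *value) { (void)value; return -5; }
```

The three outcomes are: capability bit clear or property false gives the
core scheduler, property true gives the root scheduler logic, and a failed
property query is propagated to the caller.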



^ permalink raw reply related

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-02-02 20:15 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Anirudh Rayabharam, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <aYDUOeXIoOV4qtRk@skinsburskii.localdomain>

On 2/2/26 08:43, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 11:47:48AM -0800, Mukesh R wrote:
>> On 1/30/26 10:41, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 30, 2026 at 05:17:52PM +0000, Anirudh Rayabharam wrote:
>>>> On Thu, Jan 29, 2026 at 06:59:31PM -0800, Mukesh R wrote:
>>>>> On 1/28/26 15:08, Stanislav Kinsburskii wrote:
>>>>>> On Tue, Jan 27, 2026 at 11:56:02AM -0800, Mukesh R wrote:
>>>>>>> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
>>>>>>>> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>>>>>>>>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>>>>>>>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>>>>>>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>>>>>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>>>>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>>>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>>>>>>>>> management is implemented.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>          drivers/hv/Kconfig |    1 +
>>>>>>>>>>>>>>>>          1 file changed, 1 insertion(+)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>>>>>>>>          	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>>>>>>>>          	# no particular order, making it impossible to reassemble larger pages
>>>>>>>>>>>>>>>>          	depends on PAGE_SIZE_4KB
>>>>>>>>>>>>>>>> +	depends on !KEXEC
>>>>>>>>>>>>>>>>          	select EVENTFD
>>>>>>>>>>>>>>>>          	select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>>>>>>>>          	select HMM_MIRROR
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>>>>>>>>> and it was fine?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>>>>>>>>> will be affected as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>>>>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>>>>>>>>> further development would be without a very important and major feature,
>>>>>>>>>>>>> right?
>>>>>>>>>>>>
>>>>>>>>>>>> This is an option. But until it's implemented and merged, a user mshv
>>>>>>>>>>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>>>>>>>>>>> The system may crash at any time after kexec, depending on whether the
>>>>>>>>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>>>>>>>>> bad user experience.
>>>>>>>>>>>
>>>>>>>>>>> I understand that. But with this we cannot collect core and debug any
>>>>>>>>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>>>>>>>>> for update via notifier or some other quick hack. Did you already
>>>>>>>>>>> explore that and didn't find anything, hence this?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This quick hack you mention isn't quick in the upstream kernel as there
>>>>>>>>>> is no hook to interrupt kexec process except the live update one.
>>>>>>>>>
>>>>>>>>> That's the one we want to interrupt and block right? crash kexec
>>>>>>>>> is ok and should be allowed. We can document we don't support kexec
>>>>>>>>> for update for now.
>>>>>>>>>
>>>>>>>>>> I sent an RFC for that one but given today's conversation details it
>>>>>>>>>> won't be accepted as is.
>>>>>>>>>
>>>>>>>>> Are you talking about this?
>>>>>>>>>
>>>>>>>>>             "mshv: Add kexec safety for deposited pages"
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>>> Making mshv mutually exclusive with kexec is the only viable option for
>>>>>>>>>> now given time constraints.
>>>>>>>>>> It is intended to be replaced with proper page lifecycle management in
>>>>>>>>>> the future.
>>>>>>>>>
>>>>>>>>> Yeah, that could take a long time and imo we cannot just disable KEXEC
>>>>>>>>> completely. What we want is to just block kexec for updates from some
>>>>>>>>> mshv file for now; we can print during boot that kexec for updates is
>>>>>>>>> not supported on mshv. Hope that makes sense.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The trade-off here is between disabling kexec support and having the
>>>>>>>> kernel crash after kexec in a non-obvious way. This affects both regular
>>>>>>>> kexec and crash kexec.
>>>>>>>
>>>>>>> crash kexec on baremetal is not affected, hence disabling that
>>>>>>> doesn't make sense as we can't debug crashes then on bm.
>>>>>>>
>>>>>>
>>>>>> Bare metal support is not currently relevant, as it is not available.
>>>>>> This is the upstream kernel, and this driver will be accessible to
>>>>>> third-party customers beginning with kernel 6.19 for running their
>>>>>> kernels in Azure L1VH, so consistency is required.
>>>>>
>>>>> Well, without crashdump support, customers will not be running anything
>>>>> anywhere.
>>>>
>>>> This is my concern too. I don't think customers will be particularly
>>>> happy that kexec doesn't work with our driver.
>>>>
>>>
>>> I wasn't clear earlier, so let me restate it. Today, kexec is not
>>> supported in L1VH. This is a bug we have not fixed yet. Disabling kexec
>>> is not a long-term solution. But it is better to disable it explicitly
>>> than to have kernel crashes after kexec.
>>
>> I don't think there is disagreement on this. The undesired part is turning
>> off KEXEC config completely.
>>
> 
> There is no disagreement on this either. If you have a better solution
> that can be implemented and merged before next kernel merge window,
> please propose it. Otherwise, this patch will remain as is for now.

Like I said previously, I'll explore a bit. I think I found something,
but I need to test it a bit and get a second opinion on it. I am
not convinced this absolutely has to be in this merge window, as it only
involves MSHV for L1VH and has been like this all this time. Moreover,
other things like makedumpfile are broken on L1VH. But Wei can make the
final decision.

Thanks,
-Mukesh

> Thanks,
> Stanislav
> 
>> Thanks,
>> -Mukesh
>>
>>
>>> This does not mean the bug should not be fixed. But the upstream kernel
>>> has its own policies and merge windows. For kernel 6.19, it is better to
>>> have a clear kexec error than random crashes after kexec.
>>>
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> Anirudh
>>>>
>>>>>
>>>>> Thanks,
>>>>> -Mukesh
>>>>>
>>>>>> Thanks,
>>>>>> Stanislav
>>>>>>
>>>>>>> Let me think and explore a bit, and if I come up with something, I'll
>>>>>>> send a patch here. If nothing, then we can do this as last resort.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Mukesh
>>>>>>>
>>>>>>>
>>>>>>>> It's a pity we can't apply a quick hack to disable only regular kexec.
>>>>>>>> However, since crash kexec would hit the same issues, until we have a
>>>>>>>> proper state transition for deposited pages, the best workaround for now
>>>>>>>> is to reset the hypervisor state on every kexec, which needs design,
>>>>>>>> work, and testing.
>>>>>>>>
>>>>>>>> Disabling kexec is the only consistent way to handle this in the
>>>>>>>> upstream kernel at the moment.
>>>>>>>>
>>>>>>>> Thanks, Stanislav
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -Mukesh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Stanislav
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -Mukesh
>>>>>>>>>>>
>>>>>>>>>>>> Therefore it should be explicitly forbidden as it's essentially not
>>>>>>>>>>>> supported yet.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> -Mukesh
>>>>>
>>


^ permalink raw reply

* Re: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
From: kernel test robot @ 2026-02-02 22:31 UTC (permalink / raw)
  To: Shashank Balaji, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Broadcom internal kernel review list, Jan Kiszka,
	Paolo Bonzini, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky
  Cc: llvm, oe-kbuild-all, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Shashank Balaji,
	Daniel Palmer, Tim Bird
In-Reply-To: <20260202-x2apic-fix-v1-1-71c8f488a88b@sony.com>

Hi Shashank,

kernel test robot noticed the following build errors:

[auto build test ERROR on 18f7fcd5e69a04df57b563360b88be72471d6b62]

url:    https://github.com/intel-lab-lkp/linux/commits/Shashank-Balaji/x86-x2apic-disable-x2apic-on-resume-if-the-kernel-expects-so/20260202-181147
base:   18f7fcd5e69a04df57b563360b88be72471d6b62
patch link:    https://lore.kernel.org/r/20260202-x2apic-fix-v1-1-71c8f488a88b%40sony.com
patch subject: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
config: i386-randconfig-001-20260202 (https://download.01.org/0day-ci/archive/20260203/202602030600.jFhsJyEC-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260203/202602030600.jFhsJyEC-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602030600.jFhsJyEC-lkp@intel.com/

All errors (new ones prefixed by >>):

>> arch/x86/kernel/apic/apic.c:2463:3: error: call to undeclared function '__x2apic_disable'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2463 |                 __x2apic_disable();
         |                 ^
   arch/x86/kernel/apic/apic.c:2463:3: note: did you mean '__x2apic_enable'?
   arch/x86/kernel/apic/apic.c:1896:20: note: '__x2apic_enable' declared here
    1896 | static inline void __x2apic_enable(void) { }
         |                    ^
   1 error generated.


vim +/__x2apic_disable +2463 arch/x86/kernel/apic/apic.c

  2435	
  2436	static void lapic_resume(void *data)
  2437	{
  2438		unsigned int l, h;
  2439		unsigned long flags;
  2440		int maxlvt;
  2441	
  2442		if (!apic_pm_state.active)
  2443			return;
  2444	
  2445		local_irq_save(flags);
  2446	
  2447		/*
  2448		 * IO-APIC and PIC have their own resume routines.
  2449		 * We just mask them here to make sure the interrupt
  2450		 * subsystem is completely quiet while we enable x2apic
  2451		 * and interrupt-remapping.
  2452		 */
  2453		mask_ioapic_entries();
  2454		legacy_pic->mask_all();
  2455	
  2456		if (x2apic_mode) {
  2457			__x2apic_enable();
  2458		} else {
  2459			/*
  2460			 * x2apic may have been re-enabled by the
  2461			 * firmware on resuming from s2ram
  2462			 */
> 2463			__x2apic_disable();
  2464	
  2465			/*
  2466			 * Make sure the APICBASE points to the right address
  2467			 *
  2468			 * FIXME! This will be wrong if we ever support suspend on
  2469			 * SMP! We'll need to do this as part of the CPU restore!
  2470			 */
  2471			if (boot_cpu_data.x86 >= 6) {
  2472				rdmsr(MSR_IA32_APICBASE, l, h);
  2473				l &= ~MSR_IA32_APICBASE_BASE;
  2474				l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr;
  2475				wrmsr(MSR_IA32_APICBASE, l, h);
  2476			}
  2477		}
  2478	
  2479		maxlvt = lapic_get_maxlvt();
  2480		apic_write(APIC_LVTERR, ERROR_APIC_VECTOR | APIC_LVT_MASKED);
  2481		apic_write(APIC_ID, apic_pm_state.apic_id);
  2482		apic_write(APIC_DFR, apic_pm_state.apic_dfr);
  2483		apic_write(APIC_LDR, apic_pm_state.apic_ldr);
  2484		apic_write(APIC_TASKPRI, apic_pm_state.apic_taskpri);
  2485		apic_write(APIC_SPIV, apic_pm_state.apic_spiv);
  2486		apic_write(APIC_LVT0, apic_pm_state.apic_lvt0);
  2487		apic_write(APIC_LVT1, apic_pm_state.apic_lvt1);
  2488	#ifdef CONFIG_X86_THERMAL_VECTOR
  2489		if (maxlvt >= 5)
  2490			apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
  2491	#endif
  2492	#ifdef CONFIG_X86_MCE_INTEL
  2493		if (maxlvt >= 6)
  2494			apic_write(APIC_LVTCMCI, apic_pm_state.apic_cmci);
  2495	#endif
  2496		if (maxlvt >= 4)
  2497			apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
  2498		apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
  2499		apic_write(APIC_TDCR, apic_pm_state.apic_tdcr);
  2500		apic_write(APIC_TMICT, apic_pm_state.apic_tmict);
  2501		apic_write(APIC_ESR, 0);
  2502		apic_read(APIC_ESR);
  2503		apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr);
  2504		apic_write(APIC_ESR, 0);
  2505		apic_read(APIC_ESR);
  2506	
  2507		irq_remapping_reenable(x2apic_mode);
  2508	
  2509		local_irq_restore(flags);
  2510	}
  2511	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* RE: [EXTERNAL] [PATCH] scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
From: Long Li @ 2026-02-02 23:47 UTC (permalink / raw)
  To: Jan Kiszka, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen,
	linux-hyperv@vger.kernel.org
  Cc: linux-scsi@vger.kernel.org, Linux Kernel Mailing List,
	Florian Bezdeka, RT, Mitchell Levy
In-Reply-To: <0c7fb5cd-fb21-4760-8593-e04bade84744@siemens.com>

> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> This resolves the following splat and lock-up when running with PREEMPT_RT
> enabled on Hyper-V:

Hi Jan,

It's interesting to know the use-case of running a RT kernel over Hyper-V.

Can you give an example?

As far as I know, Hyper-V makes no RT guarantees of scheduling VPs for a VM.

Thanks,
Long

> 
> [  415.140818] BUG: scheduling while atomic: stress-ng-iomix/1048/0x00000002
> [  415.140822] INFO: lockdep is turned off.
> [  415.140823] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec ghash_clmulni_intel aesni_intel rapl binfmt_misc nls_ascii nls_cp437 vfat fat snd_pcm hyperv_drm snd_timer drm_client_lib drm_shmem_helper snd sg soundcore drm_kms_helper pcspkr hv_balloon hv_utils evdev joydev drm configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vsock vmw_vmci efivarfs autofs4 ext4 crc16 mbcache jbd2 sr_mod sd_mod cdrom hv_storvsc serio_raw hid_generic scsi_transport_fc hid_hyperv scsi_mod hid hv_netvsc hyperv_keyboard scsi_common
> [  415.140846] Preemption disabled at:
> [  415.140847] [<ffffffffc0656171>] storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
> [  415.140854] CPU: 8 UID: 0 PID: 1048 Comm: stress-ng-iomix Not tainted 6.19.0-rc7 #30 PREEMPT_{RT,(full)}
> [  415.140856] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/04/2024
> [  415.140857] Call Trace:
> [  415.140861]  <TASK>
> [  415.140861]  ? storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
> [  415.140863]  dump_stack_lvl+0x91/0xb0
> [  415.140870]  __schedule_bug+0x9c/0xc0
> [  415.140875]  __schedule+0xdf6/0x1300
> [  415.140877]  ? rtlock_slowlock_locked+0x56c/0x1980
> [  415.140879]  ? rcu_is_watching+0x12/0x60
> [  415.140883]  schedule_rtlock+0x21/0x40
> [  415.140885]  rtlock_slowlock_locked+0x502/0x1980
> [  415.140891]  rt_spin_lock+0x89/0x1e0
> [  415.140893]  hv_ringbuffer_write+0x87/0x2a0
> [  415.140899]  vmbus_sendpacket_mpb_desc+0xb6/0xe0
> [  415.140900]  ? rcu_is_watching+0x12/0x60
> [  415.140902]  storvsc_queuecommand+0x669/0xbe0 [hv_storvsc]
> [  415.140904]  ? HARDIRQ_verbose+0x10/0x10
> [  415.140908]  ? __rq_qos_issue+0x28/0x40
> [  415.140911]  scsi_queue_rq+0x760/0xd80 [scsi_mod]
> [  415.140926]  __blk_mq_issue_directly+0x4a/0xc0
> [  415.140928]  blk_mq_issue_direct+0x87/0x2b0
> [  415.140931]  blk_mq_dispatch_queue_requests+0x120/0x440
> [  415.140933]  blk_mq_flush_plug_list+0x7a/0x1a0
> [  415.140935]  __blk_flush_plug+0xf4/0x150
> [  415.140940]  __submit_bio+0x2b2/0x5c0
> [  415.140944]  ? submit_bio_noacct_nocheck+0x272/0x360
> [  415.140946]  submit_bio_noacct_nocheck+0x272/0x360
> [  415.140951]  ext4_read_bh_lock+0x3e/0x60 [ext4]
> [  415.140995]  ext4_block_write_begin+0x396/0x650 [ext4]
> [  415.141018]  ? __pfx_ext4_da_get_block_prep+0x10/0x10 [ext4]
> [  415.141038]  ext4_da_write_begin+0x1c4/0x350 [ext4]
> [  415.141060]  generic_perform_write+0x14e/0x2c0
> [  415.141065]  ext4_buffered_write_iter+0x6b/0x120 [ext4]
> [  415.141083]  vfs_write+0x2ca/0x570
> [  415.141087]  ksys_write+0x76/0xf0
> [  415.141089]  do_syscall_64+0x99/0x1490
> [  415.141093]  ? rcu_is_watching+0x12/0x60
> [  415.141095]  ? finish_task_switch.isra.0+0xdf/0x3d0
> [  415.141097]  ? rcu_is_watching+0x12/0x60
> [  415.141098]  ? lock_release+0x1f0/0x2a0
> [  415.141100]  ? rcu_is_watching+0x12/0x60
> [  415.141101]  ? finish_task_switch.isra.0+0xe4/0x3d0
> [  415.141103]  ? rcu_is_watching+0x12/0x60
> [  415.141104]  ? __schedule+0xb34/0x1300
> [  415.141106]  ? hrtimer_try_to_cancel+0x1d/0x170
> [  415.141109]  ? do_nanosleep+0x8b/0x160
> [  415.141111]  ? hrtimer_nanosleep+0x89/0x100
> [  415.141114]  ? __pfx_hrtimer_wakeup+0x10/0x10
> [  415.141116]  ? xfd_validate_state+0x26/0x90
> [  415.141118]  ? rcu_is_watching+0x12/0x60
> [  415.141120]  ? do_syscall_64+0x1e0/0x1490
> [  415.141121]  ? do_syscall_64+0x1e0/0x1490
> [  415.141123]  ? rcu_is_watching+0x12/0x60
> [  415.141124]  ? do_syscall_64+0x1e0/0x1490
> [  415.141125]  ? do_syscall_64+0x1e0/0x1490
> [  415.141127]  ? irqentry_exit+0x140/0x7e0
> [  415.141129]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> get_cpu() disables preemption while the spinlock hv_ringbuffer_write is using
> is converted to an rt-mutex under PREEMPT_RT.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> 
> This is likely just the tip of an iceberg, see specifically [1], but if you never start
> addressing it, it will continue to crash ships, even if those are only on test
> cruises (we are fully aware that Hyper-V provides no RT guarantees for
> guests). A pragmatic alternative to that would be a simple
> 
> config HYPERV
>     depends on !PREEMPT_RT
> 
> Please share your thoughts if this fix is worth it, or if we should better stop
> looking at the next splats that show up after it. We are currently considering
> threading some of the hv platform IRQs under PREEMPT_RT as a potential next
> step.
> 
> TIA!
> 
> [1]
> https://lore.kernel.org/all/20230809-b4-rt_preempt-fix-v1-0-7283bbdc8b14@gmail.com/
> 
>  drivers/scsi/storvsc_drv.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index b43d876747b7..68c837146b9e 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1855,8 +1855,9 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
>  	cmd_request->payload_sz = payload_sz;
> 
>  	/* Invokes the vsc to start an IO */
> -	ret = storvsc_do_io(dev, cmd_request, get_cpu());
> -	put_cpu();
> +	migrate_disable();
> +	ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
> +	migrate_enable();
> 
>  	if (ret)
>  		scsi_dma_unmap(scmnd);
> --
> 2.51.0

^ permalink raw reply

* Re: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
From: Shashank Balaji @ 2026-02-03  0:24 UTC (permalink / raw)
  To: kernel test robot
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Suresh Siddha, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Jan Kiszka, Paolo Bonzini,
	Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky, llvm,
	oe-kbuild-all, linux-kernel, linux-hyperv, virtualization,
	jailhouse-dev, kvm, xen-devel, Rahul Bukte, Daniel Palmer,
	Tim Bird
In-Reply-To: <202602030600.jFhsJyEC-lkp@intel.com>

On Tue, Feb 03, 2026 at 06:31:40AM +0800, kernel test robot wrote:
> Hi Shashank,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on 18f7fcd5e69a04df57b563360b88be72471d6b62]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Shashank-Balaji/x86-x2apic-disable-x2apic-on-resume-if-the-kernel-expects-so/20260202-181147
> base:   18f7fcd5e69a04df57b563360b88be72471d6b62
> patch link:    https://lore.kernel.org/r/20260202-x2apic-fix-v1-1-71c8f488a88b%40sony.com
> patch subject: [PATCH 1/3] x86/x2apic: disable x2apic on resume if the kernel expects so
> config: i386-randconfig-001-20260202 (https://download.01.org/0day-ci/archive/20260203/202602030600.jFhsJyEC-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260203/202602030600.jFhsJyEC-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202602030600.jFhsJyEC-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
> >> arch/x86/kernel/apic/apic.c:2463:3: error: call to undeclared function '__x2apic_disable'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
>     2463 |                 __x2apic_disable();
>          |                 ^
>    arch/x86/kernel/apic/apic.c:2463:3: note: did you mean '__x2apic_enable'?
>    arch/x86/kernel/apic/apic.c:1896:20: note: '__x2apic_enable' declared here
>     1896 | static inline void __x2apic_enable(void) { }
>          |                    ^
>    1 error generated.

This happens when CONFIG_X86_X2APIC is disabled. This patch fixes it,
which I'll include in v2:

diff --git i/arch/x86/kernel/apic/apic.c w/arch/x86/kernel/apic/apic.c
index 8820b631f8a2..06cce23b89c1 100644
--- i/arch/x86/kernel/apic/apic.c
+++ w/arch/x86/kernel/apic/apic.c
@@ -1894,6 +1894,7 @@ void __init check_x2apic(void)

 static inline void try_to_enable_x2apic(int remap_mode) { }
 static inline void __x2apic_enable(void) { }
+static inline void __x2apic_disable(void) {}
 #endif /* !CONFIG_X86_X2APIC */

 void __init enable_IR_x2apic(void)
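
For readers outside the kernel tree, the stub pattern behind this fix can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the apic.c code: DEMO_FEATURE_X2APIC stands in for
CONFIG_X86_X2APIC, and every demo_* name is hypothetical.

```c
/* State the common code inspects; present in both configurations. */
static int demo_x2apic_enabled;

#ifdef DEMO_FEATURE_X2APIC	/* hypothetical stand-in for CONFIG_X86_X2APIC */
static inline void demo_x2apic_enable(void)  { demo_x2apic_enabled = 1; }
static inline void demo_x2apic_disable(void) { demo_x2apic_enabled = 0; }
#else
/*
 * Feature compiled out: every helper that common code calls still needs
 * an empty stub here, otherwise the call site fails to build with an
 * implicit-function-declaration error like the one reported above.
 */
static inline void demo_x2apic_enable(void)  { }
static inline void demo_x2apic_disable(void) { }
#endif /* !DEMO_FEATURE_X2APIC */

/* Common code built in both configurations (hypothetical resume path). */
int demo_resume(int want_x2apic)
{
	if (want_x2apic)
		demo_x2apic_enable();
	else
		demo_x2apic_disable();	/* needs the stub when compiled out */

	return demo_x2apic_enabled;
}
```

Dropping the demo_x2apic_disable() stub from the #else branch and building
without -DDEMO_FEATURE_X2APIC should produce the same class of error as the
robot's report.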

^ permalink raw reply related

* Re: [PATCH 1/1] mshv: Add comment about huge page mappings in guest physical address space
From: Mukesh R @ 2026-02-03  1:09 UTC (permalink / raw)
  To: Stanislav Kinsburskii, mhkelley58
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aYDcLRhxx9wXRXBG@skinsburskii.localdomain>

On 2/2/26 09:17, Stanislav Kinsburskii wrote:
> On Mon, Feb 02, 2026 at 08:51:01AM -0800, mhkelley58@gmail.com wrote:
>> From: Michael Kelley <mhklinux@outlook.com>
>>
>> Huge page mappings in the guest physical address space depend on having
>> matching alignment of the userspace address in the parent partition and
>> of the guest physical address. Add a comment that captures this
>> information. See the link to the mailing list thread.
>>
>> No code or functional change.
>>
>> Link: https://lore.kernel.org/linux-hyperv/aUrC94YvscoqBzh3@skinsburskii.localdomain/T/#m0871d2cae9b297fd397ddb8459e534981307c7dc
>> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
>> ---
>>   drivers/hv/mshv_root_main.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index 681b58154d5e..bc738ff4508e 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -1389,6 +1389,20 @@ mshv_partition_ioctl_set_memory(struct mshv_partition *partition,
>>   	if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP))
>>   		return mshv_unmap_user_memory(partition, mem);
>>   
>> +	/*
>> +	 * If the userspace_addr and the guest physical address (as derived
>> +	 * from the guest_pfn) have the same alignment modulo PMD huge page
>> +	 * size, the MSHV driver can map any PMD huge pages to the guest
>> +	 * physical address space as PMD huge pages. If the alignments do
>> +	 * not match, PMD huge pages must be mapped as single pages in the
>> +	 * guest physical address space. The MSHV driver does not enforce
>> +	 * that the alignments match, and it invokes the hypervisor to set
>> +	 * up correct functional mappings either way. See mshv_chunk_stride().
>> +	 * The caller of the ioctl is responsible for providing userspace_addr
>> +	 * and guest_pfn values with matching alignments if it wants the guest
>> +	 * to get the performance benefits of PMD huge page mappings of its
>> +	 * physical address space to real system memory.
>> +	 */
> 
> Thanks. However, I'd suggest reducing this comment a lot and putting the
> details into the commit message instead. Also, why this place? Why not a
> part of the function description instead, for example?

Fwiw, I also prefer this in the function prologue. IMO, larger comments
belong outside the function rather than inside, except of course in cases
where it has to be that way. This makes functions easier to study.

Thanks,
-Mukesh



> Thanks,
> Stanislav
> 
>>   	return mshv_map_user_memory(partition, mem);
>>   }
>>   
>> -- 
>> 2.25.1


^ permalink raw reply
