Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Stanislav Kinsburskii @ 2026-01-27 18:46 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <c40e6dc8-8e42-b0f3-f8e5-3c637adb7f13@linux.microsoft.com>

On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
> On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> > > On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > 
> > > > > Add a new file to implement management of device domains, mapping and
> > > > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > > > parent. This also implements direct attach mechanism for PCI passthru,
> > > > > and it is also made to work within the VFIO framework.
> > > > > 
> > > > > At a high level, during boot the hypervisor creates a default identity
> > > > > domain and attaches all devices to it. This nicely maps to Linux iommu
> > > > > subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > > > PCI passthru:
> > > > > 
> > > > >     1. Device Domain: root must create a device domain in the hypervisor,
> > > > >        and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > > >        All hypervisor communications use device id of type PCI for
> > > > >        identifying and referencing the device.
> > > > > 
> > > > >     2. Direct Attach: the hypervisor will simply use the guest's HW
> > > > >        page table for mappings, thus the host need not do map/unmap
> > > > >        device memory hypercalls. As such, direct attach passthru setup
> > > > >        during guest boot is extremely fast. A direct attached device
> > > > >        must be referenced via logical device id and not via the PCI
> > > > >        device id.
> > > > > 
> > > > > At present, L1VH root/parent only supports direct attaches. Also direct
> > > > > attach is default in non-L1VH cases because there are some significant
> > > > > performance issues with device domain implementation currently for guests
> > > > > with higher RAM (say more than 8GB), and that unfortunately cannot be
> > > > > addressed in the short term.
> > > > > 
> > > > 
> > > > <snip>
> > > > 
> > 
> > <snip>
> > 
> > > > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > > > +{
> > > > > +	struct pci_dev *pdev;
> > > > > +	struct hv_domain *hvdom = to_hv_domain(immdom);
> > > > > +
> > > > > +	/* See the attach function, only PCI devices for now */
> > > > > +	if (!dev_is_pci(dev))
> > > > > +		return;
> > > > > +
> > > > > +	if (hvdom->num_attchd == 0)
> > > > > +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > > > +
> > > > > +	pdev = to_pci_dev(dev);
> > > > > +
> > > > > +	if (hvdom->attached_dom) {
> > > > > +		hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > > > +
> > > > > +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > > > +		 * next.
> > > > > +		 */
> > > > > +	} else {
> > > > > +		hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > > > +	}
> > > > > +
> > > > > +	hvdom->num_attchd--;
> > > > 
> > > > Shouldn't this be modified iff the detach succeeded?
> > > 
> > > We want to still free the domain and not let it get stuck. The purpose
> > > is more to make sure detach was called before domain free.
> > > 
> > 
> > How can one debug subseqent errors if num_attchd is decremented
> > unconditionally? In reality the device is left attached, but the related
> > kernel metadata is gone.
> 
> Error is printed in case of failed detach. If there is panic, at least
> you can get some info about the device. Metadata in hypervisor is
> around if failed.
> 

With this approach the only thing left is a kernel message.
But if the state is kept intact, one could collect a kernel core and
analyze it.

And note, that there won't be a hypervisor core by default: our main
context with the usptreamed version of the driver is L1VH and a kernel
core is the only thing a third party customer can provide for our
analysis.

Thanks,
Stanislav


^ permalink raw reply

* Re: [PATCH 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Stanislav Kinsburskii @ 2026-01-27 18:30 UTC (permalink / raw)
  To: Mukesh R
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <df21ce10-3cd5-9d78-a3ce-84c4b1ff9275@linux.microsoft.com>

On Mon, Jan 26, 2026 at 06:06:23PM -0800, Mukesh R wrote:
> On 1/25/26 14:41, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 04:33:39PM -0800, Mukesh R wrote:
> > > On 1/22/26 17:35, Stanislav Kinsburskii wrote:
> > > > Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
> > > > functions to handle memory deposition with proper error handling.
> > > > 
> > > > The new hv_deposit_memory_node() function takes the hypervisor status
> > > > as a parameter and validates it before depositing pages. It checks for
> > > > HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
> > > > unexpected status codes.
> > > > 
> > > > This is a precursor patch to new out-of-memory error codes support.
> > > > No functional changes intended.
> > > > 
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >    drivers/hv/hv_proc.c           |   22 ++++++++++++++++++++--
> > > >    drivers/hv/mshv_root_hv_call.c |   25 +++++++++----------------
> > > >    drivers/hv/mshv_root_main.c    |    3 +--
> > > >    include/asm-generic/mshyperv.h |   10 ++++++++++
> > > >    4 files changed, 40 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> > > > index 80c66d1c74d5..c0c2bfc80d77 100644
> > > > --- a/drivers/hv/hv_proc.c
> > > > +++ b/drivers/hv/hv_proc.c
> > > > @@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
> > > >    }
> > > >    EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
> > > > +int hv_deposit_memory_node(int node, u64 partition_id,
> > > > +			   u64 hv_status)
> > > > +{
> > > > +	u32 num_pages;
> > > > +
> > > > +	switch (hv_result(hv_status)) {
> > > > +	case HV_STATUS_INSUFFICIENT_MEMORY:
> > > > +		num_pages = 1;
> > > > +		break;
> > > > +	default:
> > > > +		hv_status_err(hv_status, "Unexpected!\n");
> > > > +		return -ENOMEM;
> > > > +	}
> > > > +	return hv_call_deposit_pages(node, partition_id, num_pages);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
> > > > +
> > > 
> > > Different hypercalls may want to deposit different number of pages in one
> > > shot. As feature evolves, page sizes get mixed, we'd almost need that
> > > flexibility. So, imo, either we just don't do this for now, or add num pages
> > > parameter to be passed down.
> > > 
> > 
> > What you do mean by "page sizes get mixed"?
> > A helper to deposit num pages already exists: its
> > hv_call_deposit_pages().
> 
> My point, you are removing number of pages, and we may want to keep
> that so one can quickly play around and change them.
> 
> -                       ret = hv_call_deposit_pages(NUMA_NO_NODE,
> -                                                   pt_id, 1);
> +                       ret = hv_deposit_memory(pt_id, status);
> 
> For example, in hv_call_initialize_partition() we may realize after
> some analysis that depositing 2 pages or 4 pages is much better.
> 

We have been using this 1-page deposit logic from the beginning. To
change the number of pages, simply replace hv_deposit_memory with
hv_call_deposit_pages and specify the desired number of pages.

The proposed approach reduces code duplication and is less error-prone,
as there are multiple error codes to handle. Consolidating the logic
also makes the driver more robust.


Thanks,  Stanislav

> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > -Mukesh
> > > 
> > > 
> > > 
> > > >    bool hv_result_oom(u64 status)
> > > >    {
> > > >    	switch (hv_result(status)) {
> > > > @@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
> > > >    			}
> > > >    			break;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
> > > > +		ret = hv_deposit_memory_node(node, hv_current_partition_id,
> > > > +					     status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> > > >    			}
> > > >    			break;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(node, partition_id, 1);
> > > > +		ret = hv_deposit_memory_node(node, partition_id, status);
> > > >    	} while (!ret);
> > > > diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> > > > index 58c5cbf2e567..06f2bac8039d 100644
> > > > --- a/drivers/hv/mshv_root_hv_call.c
> > > > +++ b/drivers/hv/mshv_root_hv_call.c
> > > > @@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
> > > >    			break;
> > > >    		}
> > > >    		local_irq_restore(irq_flags);
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -					    hv_current_partition_id, 1);
> > > > +		ret = hv_deposit_memory(hv_current_partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
> > > >    			ret = hv_result_to_errno(status);
> > > >    			break;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> > > > +		ret = hv_deposit_memory(partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
> > > >    		}
> > > >    		local_irq_restore(flags);
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -					    partition_id, 1);
> > > > +		ret = hv_deposit_memory(partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
> > > >    		}
> > > >    		local_irq_restore(flags);
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -					    partition_id, 1);
> > > > +		ret = hv_deposit_memory(partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> > > >    		local_irq_restore(flags);
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
> > > > +		ret = hv_deposit_memory(partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> > > >    			ret = hv_result_to_errno(status);
> > > >    			break;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
> > > > -
> > > > +		ret = hv_deposit_memory(port_partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> > > >    			ret = hv_result_to_errno(status);
> > > >    			break;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -					    connection_partition_id, 1);
> > > > +		ret = hv_deposit_memory(connection_partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
> > > >    			break;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -					    hv_current_partition_id, 1);
> > > > +		ret = hv_deposit_memory(hv_current_partition_id, status);
> > > >    	} while (!ret);
> > > >    	return ret;
> > > > @@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
> > > >    			return ret;
> > > >    		}
> > > > -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -					    hv_current_partition_id, 1);
> > > > +		ret = hv_deposit_memory(hv_current_partition_id, status);
> > > >    		if (ret)
> > > >    			return ret;
> > > >    	} while (!ret);
> > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > index f4697497f83e..5fc572e31cd7 100644
> > > > --- a/drivers/hv/mshv_root_main.c
> > > > +++ b/drivers/hv/mshv_root_main.c
> > > > @@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
> > > >    		if (!hv_result_oom(status))
> > > >    			ret = hv_result_to_errno(status);
> > > >    		else
> > > > -			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > > -						    pt_id, 1);
> > > > +			ret = hv_deposit_memory(pt_id, status);
> > > >    	} while (!ret);
> > > >    	args.status = hv_result(status);
> > > > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > > > index b73352a7fc9e..c8e8976839f8 100644
> > > > --- a/include/asm-generic/mshyperv.h
> > > > +++ b/include/asm-generic/mshyperv.h
> > > > @@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
> > > >    }
> > > >    bool hv_result_oom(u64 status);
> > > > +int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
> > > >    int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> > > >    int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> > > >    int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> > > > @@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
> > > >    static inline bool hv_l1vh_partition(void) { return false; }
> > > >    static inline bool hv_parent_partition(void) { return false; }
> > > >    static inline bool hv_result_oom(u64 status) { return false; }
> > > > +static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
> > > > +{
> > > > +	return -EOPNOTSUPP;
> > > > +}
> > > >    static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
> > > >    {
> > > >    	return -EOPNOTSUPP;
> > > > @@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
> > > >    }
> > > >    #endif /* CONFIG_MSHV_ROOT */
> > > > +static inline int hv_deposit_memory(u64 partition_id, u64 status)
> > > > +{
> > > > +	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
> > > > +}
> > > > +
> > > >    #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
> > > >    u8 __init get_vtl(void);
> > > >    #else
> > > > 
> > > > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-27 17:47 UTC (permalink / raw)
  To: Mukesh R
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <257ad7f1-5dc0-2644-41c3-960c396caa38@linux.microsoft.com>

On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> > > On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> > > > > On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > > > > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > > > hypervisor deposited pages.
> > > > > > > > 
> > > > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > > > management is implemented.
> > > > > > > > 
> > > > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > > > ---
> > > > > > > >      drivers/hv/Kconfig |    1 +
> > > > > > > >      1 file changed, 1 insertion(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > > > --- a/drivers/hv/Kconfig
> > > > > > > > +++ b/drivers/hv/Kconfig
> > > > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > > > >      	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > > > >      	# no particular order, making it impossible to reassemble larger pages
> > > > > > > >      	depends on PAGE_SIZE_4KB
> > > > > > > > +	depends on !KEXEC
> > > > > > > >      	select EVENTFD
> > > > > > > >      	select VIRT_XFER_TO_GUEST_WORK
> > > > > > > >      	select HMM_MIRROR
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > > > > > implying that crash dump might be involved. Or did you test kdump
> > > > > > > and it was fine?
> > > > > > > 
> > > > > > 
> > > > > > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > > > > > will be affected as well.
> > > > > 
> > > > > So not sure I understand the reason for this patch. We can just block
> > > > > kexec if there are any VMs running, right? Doing this would mean any
> > > > > further developement would be without a ver important and major feature,
> > > > > right?
> > > > 
> > > > This is an option. But until it's implemented and merged, a user mshv
> > > > driver gets into a situation where kexec is broken in a non-obvious way.
> > > > The system may crash at any time after kexec, depending on whether the
> > > > new kernel touches the pages deposited to hypervisor or not. This is a
> > > > bad user experience.
> > > 
> > > I understand that. But with this we cannot collect core and debug any
> > > crashes. I was thinking there would be a quick way to prohibit kexec
> > > for update via notifier or some other quick hack. Did you already
> > > explore that and didn't find anything, hence this?
> > > 
> > 
> > This quick hack you mention isn't quick in the upstream kernel as there
> > is no hook to interrupt kexec process except the live update one.
> 
> That's the one we want to interrupt and block right? crash kexec
> is ok and should be allowed. We can document we don't support kexec
> for update for now.
> 
> > I sent an RFC for that one but given todays conversation details is
> > won't be accepted as is.
> 
> Are you taking about this?
> 
>         "mshv: Add kexec safety for deposited pages"
> 

Yes.

> > Making mshv mutually exclusive with kexec is the only viable option for
> > now given time constraints.
> > It is intended to be replaced with proper page lifecycle management in
> > the future.
> 
> Yeah, that could take a long time and imo we cannot just disable KEXEC
> completely. What we want is just block kexec for updates from some
> mshv file for now, we an print during boot that kexec for updates is
> not supported on mshv. Hope that makes sense.
> 

The trade-off here is between disabling kexec support and having the
kernel crash after kexec in a non-obvious way. This affects both regular
kexec and crash kexec.

It’s a pity we can’t apply a quick hack to disable only regular kexec.
However, since crash kexec would hit the same issues, until we have a
proper state transition for deposted pages, the best workaround for now
is to reset the hypervisor state on every kexec, which needs design,
work, and testing.

Disabling kexec is the only consistent way to handle this in the
upstream kernel at the moment.

Thanks, Stanislav


> Thanks,
> -Mukesh
> 
> 
> 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > -Mukesh
> > > 
> > > > Therefor it should be explicitly forbidden as it's essentially not
> > > > supported yet.
> > > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > 
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > > 
> > > > > > > Thanks,
> > > > > > > -Mukesh

^ permalink raw reply

* RE: [PATCH v5 6/7] mshv: Add data for printing stats page counters
From: Michael Kelley @ 2026-01-27 16:57 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
	paekkaladevi@linux.microsoft.com
In-Reply-To: <20260126205603.404655-7-nunodasneves@linux.microsoft.com>

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, January 26, 2026 12:56 PM
> 
> Introduce mshv_debugfs_counters.c, containing static data
> corresponding to HV_*_COUNTER enums in the hypervisor source.
> Defining the enum members as an array instead makes more sense,
> since it will be iterated over to print counter information to
> debugfs.
> 
> Include hypervisor, logical processor, partition, and virtual
> processor counters.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++++++++++++
>  1 file changed, 490 insertions(+)
>  create mode 100644 drivers/hv/mshv_debugfs_counters.c

I'm getting these warnings when using "git am" to apply the patch:

root:~/linux-next20260109# git am /root/nunodebugfsv5/0006_mshv_add_data_for_printing_stats_page_counters.patch
Applying: mshv: Add data for printing stats page counters
.git/rebase-apply/patch:99: trailing whitespace.
        [51] = "LpPerfmonInterruptCount",
.git/rebase-apply/patch:116: trailing whitespace.
        [46] = "LpRunningPriority",
.git/rebase-apply/patch:499: trailing whitespace.
        [105] = "VpExpressSchedulingCount",
warning: 3 lines add whitespace errors.

If I open the patch file in 'vim' with the option to show trailing whitespace,
there is clearly some spurious whitespace after the comma on each of
these three lines.

scripts/checkpatch.pl also shows these trailing whitespace errors.

Michael

> 
> diff --git a/drivers/hv/mshv_debugfs_counters.c
> b/drivers/hv/mshv_debugfs_counters.c
> new file mode 100644
> index 000000000000..838af4673dd1
> --- /dev/null
> +++ b/drivers/hv/mshv_debugfs_counters.c
> @@ -0,0 +1,490 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2026, Microsoft Corporation.
> + *
> + * Data for printing stats page counters via debugfs.
> + *
> + * Authors: Microsoft Linux virtualization team
> + */
> +
> +/*
> + * For simplicity, this file is included directly in mshv_debugfs.c.
> + * If these are ever needed elsewhere they should be compiled separately.
> + * Ensure this file is not used twice by accident.
> + */
> +#ifndef MSHV_DEBUGFS_C
> +#error "This file should only be included in mshv_debugfs.c"
> +#endif
> +
> +/* HV_HYPERVISOR_COUNTER */
> +static char *hv_hypervisor_counters[] = {
> +	[1] = "HvLogicalProcessors",
> +	[2] = "HvPartitions",
> +	[3] = "HvTotalPages",
> +	[4] = "HvVirtualProcessors",
> +	[5] = "HvMonitoredNotifications",
> +	[6] = "HvModernStandbyEntries",
> +	[7] = "HvPlatformIdleTransitions",
> +	[8] = "HvHypervisorStartupCost",
> +
> +	[10] = "HvIOSpacePages",
> +	[11] = "HvNonEssentialPagesForDump",
> +	[12] = "HvSubsumedPages",
> +};
> +
> +/* HV_CPU_COUNTER */
> +static char *hv_lp_counters[] = {
> +	[1] = "LpGlobalTime",
> +	[2] = "LpTotalRunTime",
> +	[3] = "LpHypervisorRunTime",
> +	[4] = "LpHardwareInterrupts",
> +	[5] = "LpContextSwitches",
> +	[6] = "LpInterProcessorInterrupts",
> +	[7] = "LpSchedulerInterrupts",
> +	[8] = "LpTimerInterrupts",
> +	[9] = "LpInterProcessorInterruptsSent",
> +	[10] = "LpProcessorHalts",
> +	[11] = "LpMonitorTransitionCost",
> +	[12] = "LpContextSwitchTime",
> +	[13] = "LpC1TransitionsCount",
> +	[14] = "LpC1RunTime",
> +	[15] = "LpC2TransitionsCount",
> +	[16] = "LpC2RunTime",
> +	[17] = "LpC3TransitionsCount",
> +	[18] = "LpC3RunTime",
> +	[19] = "LpRootVpIndex",
> +	[20] = "LpIdleSequenceNumber",
> +	[21] = "LpGlobalTscCount",
> +	[22] = "LpActiveTscCount",
> +	[23] = "LpIdleAccumulation",
> +	[24] = "LpReferenceCycleCount0",
> +	[25] = "LpActualCycleCount0",
> +	[26] = "LpReferenceCycleCount1",
> +	[27] = "LpActualCycleCount1",
> +	[28] = "LpProximityDomainId",
> +	[29] = "LpPostedInterruptNotifications",
> +	[30] = "LpBranchPredictorFlushes",
> +#if IS_ENABLED(CONFIG_X86_64)
> +	[31] = "LpL1DataCacheFlushes",
> +	[32] = "LpImmediateL1DataCacheFlushes",
> +	[33] = "LpMbFlushes",
> +	[34] = "LpCounterRefreshSequenceNumber",
> +	[35] = "LpCounterRefreshReferenceTime",
> +	[36] = "LpIdleAccumulationSnapshot",
> +	[37] = "LpActiveTscCountSnapshot",
> +	[38] = "LpHwpRequestContextSwitches",
> +	[39] = "LpPlaceholder1",
> +	[40] = "LpPlaceholder2",
> +	[41] = "LpPlaceholder3",
> +	[42] = "LpPlaceholder4",
> +	[43] = "LpPlaceholder5",
> +	[44] = "LpPlaceholder6",
> +	[45] = "LpPlaceholder7",
> +	[46] = "LpPlaceholder8",
> +	[47] = "LpPlaceholder9",
> +	[48] = "LpSchLocalRunListSize",
> +	[49] = "LpReserveGroupId",
> +	[50] = "LpRunningPriority",
> +	[51] = "LpPerfmonInterruptCount",
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	[31] = "LpCounterRefreshSequenceNumber",
> +	[32] = "LpCounterRefreshReferenceTime",
> +	[33] = "LpIdleAccumulationSnapshot",
> +	[34] = "LpActiveTscCountSnapshot",
> +	[35] = "LpHwpRequestContextSwitches",
> +	[36] = "LpPlaceholder2",
> +	[37] = "LpPlaceholder3",
> +	[38] = "LpPlaceholder4",
> +	[39] = "LpPlaceholder5",
> +	[40] = "LpPlaceholder6",
> +	[41] = "LpPlaceholder7",
> +	[42] = "LpPlaceholder8",
> +	[43] = "LpPlaceholder9",
> +	[44] = "LpSchLocalRunListSize",
> +	[45] = "LpReserveGroupId",
> +	[46] = "LpRunningPriority",
> +#endif
> +};
> +
> +/* HV_PROCESS_COUNTER */
> +static char *hv_partition_counters[] = {
> +	[1] = "PtVirtualProcessors",
> +
> +	[3] = "PtTlbSize",
> +	[4] = "PtAddressSpaces",
> +	[5] = "PtDepositedPages",
> +	[6] = "PtGpaPages",
> +	[7] = "PtGpaSpaceModifications",
> +	[8] = "PtVirtualTlbFlushEntires",
> +	[9] = "PtRecommendedTlbSize",
> +	[10] = "PtGpaPages4K",
> +	[11] = "PtGpaPages2M",
> +	[12] = "PtGpaPages1G",
> +	[13] = "PtGpaPages512G",
> +	[14] = "PtDevicePages4K",
> +	[15] = "PtDevicePages2M",
> +	[16] = "PtDevicePages1G",
> +	[17] = "PtDevicePages512G",
> +	[18] = "PtAttachedDevices",
> +	[19] = "PtDeviceInterruptMappings",
> +	[20] = "PtIoTlbFlushes",
> +	[21] = "PtIoTlbFlushCost",
> +	[22] = "PtDeviceInterruptErrors",
> +	[23] = "PtDeviceDmaErrors",
> +	[24] = "PtDeviceInterruptThrottleEvents",
> +	[25] = "PtSkippedTimerTicks",
> +	[26] = "PtPartitionId",
> +#if IS_ENABLED(CONFIG_X86_64)
> +	[27] = "PtNestedTlbSize",
> +	[28] = "PtRecommendedNestedTlbSize",
> +	[29] = "PtNestedTlbFreeListSize",
> +	[30] = "PtNestedTlbTrimmedPages",
> +	[31] = "PtPagesShattered",
> +	[32] = "PtPagesRecombined",
> +	[33] = "PtHwpRequestValue",
> +	[34] = "PtAutoSuspendEnableTime",
> +	[35] = "PtAutoSuspendTriggerTime",
> +	[36] = "PtAutoSuspendDisableTime",
> +	[37] = "PtPlaceholder1",
> +	[38] = "PtPlaceholder2",
> +	[39] = "PtPlaceholder3",
> +	[40] = "PtPlaceholder4",
> +	[41] = "PtPlaceholder5",
> +	[42] = "PtPlaceholder6",
> +	[43] = "PtPlaceholder7",
> +	[44] = "PtPlaceholder8",
> +	[45] = "PtHypervisorStateTransferGeneration",
> +	[46] = "PtNumberofActiveChildPartitions",
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	[27] = "PtHwpRequestValue",
> +	[28] = "PtAutoSuspendEnableTime",
> +	[29] = "PtAutoSuspendTriggerTime",
> +	[30] = "PtAutoSuspendDisableTime",
> +	[31] = "PtPlaceholder1",
> +	[32] = "PtPlaceholder2",
> +	[33] = "PtPlaceholder3",
> +	[34] = "PtPlaceholder4",
> +	[35] = "PtPlaceholder5",
> +	[36] = "PtPlaceholder6",
> +	[37] = "PtPlaceholder7",
> +	[38] = "PtPlaceholder8",
> +	[39] = "PtHypervisorStateTransferGeneration",
> +	[40] = "PtNumberofActiveChildPartitions",
> +#endif
> +};
> +
> +/* HV_THREAD_COUNTER */
> +static char *hv_vp_counters[] = {
> +	[1] = "VpTotalRunTime",
> +	[2] = "VpHypervisorRunTime",
> +	[3] = "VpRemoteNodeRunTime",
> +	[4] = "VpNormalizedRunTime",
> +	[5] = "VpIdealCpu",
> +
> +	[7] = "VpHypercallsCount",
> +	[8] = "VpHypercallsTime",
> +#if IS_ENABLED(CONFIG_X86_64)
> +	[9] = "VpPageInvalidationsCount",
> +	[10] = "VpPageInvalidationsTime",
> +	[11] = "VpControlRegisterAccessesCount",
> +	[12] = "VpControlRegisterAccessesTime",
> +	[13] = "VpIoInstructionsCount",
> +	[14] = "VpIoInstructionsTime",
> +	[15] = "VpHltInstructionsCount",
> +	[16] = "VpHltInstructionsTime",
> +	[17] = "VpMwaitInstructionsCount",
> +	[18] = "VpMwaitInstructionsTime",
> +	[19] = "VpCpuidInstructionsCount",
> +	[20] = "VpCpuidInstructionsTime",
> +	[21] = "VpMsrAccessesCount",
> +	[22] = "VpMsrAccessesTime",
> +	[23] = "VpOtherInterceptsCount",
> +	[24] = "VpOtherInterceptsTime",
> +	[25] = "VpExternalInterruptsCount",
> +	[26] = "VpExternalInterruptsTime",
> +	[27] = "VpPendingInterruptsCount",
> +	[28] = "VpPendingInterruptsTime",
> +	[29] = "VpEmulatedInstructionsCount",
> +	[30] = "VpEmulatedInstructionsTime",
> +	[31] = "VpDebugRegisterAccessesCount",
> +	[32] = "VpDebugRegisterAccessesTime",
> +	[33] = "VpPageFaultInterceptsCount",
> +	[34] = "VpPageFaultInterceptsTime",
> +	[35] = "VpGuestPageTableMaps",
> +	[36] = "VpLargePageTlbFills",
> +	[37] = "VpSmallPageTlbFills",
> +	[38] = "VpReflectedGuestPageFaults",
> +	[39] = "VpApicMmioAccesses",
> +	[40] = "VpIoInterceptMessages",
> +	[41] = "VpMemoryInterceptMessages",
> +	[42] = "VpApicEoiAccesses",
> +	[43] = "VpOtherMessages",
> +	[44] = "VpPageTableAllocations",
> +	[45] = "VpLogicalProcessorMigrations",
> +	[46] = "VpAddressSpaceEvictions",
> +	[47] = "VpAddressSpaceSwitches",
> +	[48] = "VpAddressDomainFlushes",
> +	[49] = "VpAddressSpaceFlushes",
> +	[50] = "VpGlobalGvaRangeFlushes",
> +	[51] = "VpLocalGvaRangeFlushes",
> +	[52] = "VpPageTableEvictions",
> +	[53] = "VpPageTableReclamations",
> +	[54] = "VpPageTableResets",
> +	[55] = "VpPageTableValidations",
> +	[56] = "VpApicTprAccesses",
> +	[57] = "VpPageTableWriteIntercepts",
> +	[58] = "VpSyntheticInterrupts",
> +	[59] = "VpVirtualInterrupts",
> +	[60] = "VpApicIpisSent",
> +	[61] = "VpApicSelfIpisSent",
> +	[62] = "VpGpaSpaceHypercalls",
> +	[63] = "VpLogicalProcessorHypercalls",
> +	[64] = "VpLongSpinWaitHypercalls",
> +	[65] = "VpOtherHypercalls",
> +	[66] = "VpSyntheticInterruptHypercalls",
> +	[67] = "VpVirtualInterruptHypercalls",
> +	[68] = "VpVirtualMmuHypercalls",
> +	[69] = "VpVirtualProcessorHypercalls",
> +	[70] = "VpHardwareInterrupts",
> +	[71] = "VpNestedPageFaultInterceptsCount",
> +	[72] = "VpNestedPageFaultInterceptsTime",
> +	[73] = "VpPageScans",
> +	[74] = "VpLogicalProcessorDispatches",
> +	[75] = "VpWaitingForCpuTime",
> +	[76] = "VpExtendedHypercalls",
> +	[77] = "VpExtendedHypercallInterceptMessages",
> +	[78] = "VpMbecNestedPageTableSwitches",
> +	[79] = "VpOtherReflectedGuestExceptions",
> +	[80] = "VpGlobalIoTlbFlushes",
> +	[81] = "VpGlobalIoTlbFlushCost",
> +	[82] = "VpLocalIoTlbFlushes",
> +	[83] = "VpLocalIoTlbFlushCost",
> +	[84] = "VpHypercallsForwardedCount",
> +	[85] = "VpHypercallsForwardingTime",
> +	[86] = "VpPageInvalidationsForwardedCount",
> +	[87] = "VpPageInvalidationsForwardingTime",
> +	[88] = "VpControlRegisterAccessesForwardedCount",
> +	[89] = "VpControlRegisterAccessesForwardingTime",
> +	[90] = "VpIoInstructionsForwardedCount",
> +	[91] = "VpIoInstructionsForwardingTime",
> +	[92] = "VpHltInstructionsForwardedCount",
> +	[93] = "VpHltInstructionsForwardingTime",
> +	[94] = "VpMwaitInstructionsForwardedCount",
> +	[95] = "VpMwaitInstructionsForwardingTime",
> +	[96] = "VpCpuidInstructionsForwardedCount",
> +	[97] = "VpCpuidInstructionsForwardingTime",
> +	[98] = "VpMsrAccessesForwardedCount",
> +	[99] = "VpMsrAccessesForwardingTime",
> +	[100] = "VpOtherInterceptsForwardedCount",
> +	[101] = "VpOtherInterceptsForwardingTime",
> +	[102] = "VpExternalInterruptsForwardedCount",
> +	[103] = "VpExternalInterruptsForwardingTime",
> +	[104] = "VpPendingInterruptsForwardedCount",
> +	[105] = "VpPendingInterruptsForwardingTime",
> +	[106] = "VpEmulatedInstructionsForwardedCount",
> +	[107] = "VpEmulatedInstructionsForwardingTime",
> +	[108] = "VpDebugRegisterAccessesForwardedCount",
> +	[109] = "VpDebugRegisterAccessesForwardingTime",
> +	[110] = "VpPageFaultInterceptsForwardedCount",
> +	[111] = "VpPageFaultInterceptsForwardingTime",
> +	[112] = "VpVmclearEmulationCount",
> +	[113] = "VpVmclearEmulationTime",
> +	[114] = "VpVmptrldEmulationCount",
> +	[115] = "VpVmptrldEmulationTime",
> +	[116] = "VpVmptrstEmulationCount",
> +	[117] = "VpVmptrstEmulationTime",
> +	[118] = "VpVmreadEmulationCount",
> +	[119] = "VpVmreadEmulationTime",
> +	[120] = "VpVmwriteEmulationCount",
> +	[121] = "VpVmwriteEmulationTime",
> +	[122] = "VpVmxoffEmulationCount",
> +	[123] = "VpVmxoffEmulationTime",
> +	[124] = "VpVmxonEmulationCount",
> +	[125] = "VpVmxonEmulationTime",
> +	[126] = "VpNestedVMEntriesCount",
> +	[127] = "VpNestedVMEntriesTime",
> +	[128] = "VpNestedSLATSoftPageFaultsCount",
> +	[129] = "VpNestedSLATSoftPageFaultsTime",
> +	[130] = "VpNestedSLATHardPageFaultsCount",
> +	[131] = "VpNestedSLATHardPageFaultsTime",
> +	[132] = "VpInvEptAllContextEmulationCount",
> +	[133] = "VpInvEptAllContextEmulationTime",
> +	[134] = "VpInvEptSingleContextEmulationCount",
> +	[135] = "VpInvEptSingleContextEmulationTime",
> +	[136] = "VpInvVpidAllContextEmulationCount",
> +	[137] = "VpInvVpidAllContextEmulationTime",
> +	[138] = "VpInvVpidSingleContextEmulationCount",
> +	[139] = "VpInvVpidSingleContextEmulationTime",
> +	[140] = "VpInvVpidSingleAddressEmulationCount",
> +	[141] = "VpInvVpidSingleAddressEmulationTime",
> +	[142] = "VpNestedTlbPageTableReclamations",
> +	[143] = "VpNestedTlbPageTableEvictions",
> +	[144] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
> +	[145] = "VpFlushGuestPhysicalAddressListHypercalls",
> +	[146] = "VpPostedInterruptNotifications",
> +	[147] = "VpPostedInterruptScans",
> +	[148] = "VpTotalCoreRunTime",
> +	[149] = "VpMaximumRunTime",
> +	[150] = "VpHwpRequestContextSwitches",
> +	[151] = "VpWaitingForCpuTimeBucket0",
> +	[152] = "VpWaitingForCpuTimeBucket1",
> +	[153] = "VpWaitingForCpuTimeBucket2",
> +	[154] = "VpWaitingForCpuTimeBucket3",
> +	[155] = "VpWaitingForCpuTimeBucket4",
> +	[156] = "VpWaitingForCpuTimeBucket5",
> +	[157] = "VpWaitingForCpuTimeBucket6",
> +	[158] = "VpVmloadEmulationCount",
> +	[159] = "VpVmloadEmulationTime",
> +	[160] = "VpVmsaveEmulationCount",
> +	[161] = "VpVmsaveEmulationTime",
> +	[162] = "VpGifInstructionEmulationCount",
> +	[163] = "VpGifInstructionEmulationTime",
> +	[164] = "VpEmulatedErrataSvmInstructions",
> +	[165] = "VpPlaceholder1",
> +	[166] = "VpPlaceholder2",
> +	[167] = "VpPlaceholder3",
> +	[168] = "VpPlaceholder4",
> +	[169] = "VpPlaceholder5",
> +	[170] = "VpPlaceholder6",
> +	[171] = "VpPlaceholder7",
> +	[172] = "VpPlaceholder8",
> +	[173] = "VpContentionTime",
> +	[174] = "VpWakeUpTime",
> +	[175] = "VpSchedulingPriority",
> +	[176] = "VpRdpmcInstructionsCount",
> +	[177] = "VpRdpmcInstructionsTime",
> +	[178] = "VpPerfmonPmuMsrAccessesCount",
> +	[179] = "VpPerfmonLbrMsrAccessesCount",
> +	[180] = "VpPerfmonIptMsrAccessesCount",
> +	[181] = "VpPerfmonInterruptCount",
> +	[182] = "VpVtl1DispatchCount",
> +	[183] = "VpVtl2DispatchCount",
> +	[184] = "VpVtl2DispatchBucket0",
> +	[185] = "VpVtl2DispatchBucket1",
> +	[186] = "VpVtl2DispatchBucket2",
> +	[187] = "VpVtl2DispatchBucket3",
> +	[188] = "VpVtl2DispatchBucket4",
> +	[189] = "VpVtl2DispatchBucket5",
> +	[190] = "VpVtl2DispatchBucket6",
> +	[191] = "VpVtl1RunTime",
> +	[192] = "VpVtl2RunTime",
> +	[193] = "VpIommuHypercalls",
> +	[194] = "VpCpuGroupHypercalls",
> +	[195] = "VpVsmHypercalls",
> +	[196] = "VpEventLogHypercalls",
> +	[197] = "VpDeviceDomainHypercalls",
> +	[198] = "VpDepositHypercalls",
> +	[199] = "VpSvmHypercalls",
> +	[200] = "VpBusLockAcquisitionCount",
> +	[201] = "VpLoadAvg",
> +	[202] = "VpRootDispatchThreadBlocked",
> +	[203] = "VpIdleCpuTime",
> +	[204] = "VpWaitingForCpuTimeBucket7",
> +	[205] = "VpWaitingForCpuTimeBucket8",
> +	[206] = "VpWaitingForCpuTimeBucket9",
> +	[207] = "VpWaitingForCpuTimeBucket10",
> +	[208] = "VpWaitingForCpuTimeBucket11",
> +	[209] = "VpWaitingForCpuTimeBucket12",
> +	[210] = "VpHierarchicalSuspendTime",
> +	[211] = "VpExpressSchedulingAttempts",
> +	[212] = "VpExpressSchedulingCount",
> +#elif IS_ENABLED(CONFIG_ARM64)
> +	[9] = "VpSysRegAccessesCount",
> +	[10] = "VpSysRegAccessesTime",
> +	[11] = "VpSmcInstructionsCount",
> +	[12] = "VpSmcInstructionsTime",
> +	[13] = "VpOtherInterceptsCount",
> +	[14] = "VpOtherInterceptsTime",
> +	[15] = "VpExternalInterruptsCount",
> +	[16] = "VpExternalInterruptsTime",
> +	[17] = "VpPendingInterruptsCount",
> +	[18] = "VpPendingInterruptsTime",
> +	[19] = "VpGuestPageTableMaps",
> +	[20] = "VpLargePageTlbFills",
> +	[21] = "VpSmallPageTlbFills",
> +	[22] = "VpReflectedGuestPageFaults",
> +	[23] = "VpMemoryInterceptMessages",
> +	[24] = "VpOtherMessages",
> +	[25] = "VpLogicalProcessorMigrations",
> +	[26] = "VpAddressDomainFlushes",
> +	[27] = "VpAddressSpaceFlushes",
> +	[28] = "VpSyntheticInterrupts",
> +	[29] = "VpVirtualInterrupts",
> +	[30] = "VpApicSelfIpisSent",
> +	[31] = "VpGpaSpaceHypercalls",
> +	[32] = "VpLogicalProcessorHypercalls",
> +	[33] = "VpLongSpinWaitHypercalls",
> +	[34] = "VpOtherHypercalls",
> +	[35] = "VpSyntheticInterruptHypercalls",
> +	[36] = "VpVirtualInterruptHypercalls",
> +	[37] = "VpVirtualMmuHypercalls",
> +	[38] = "VpVirtualProcessorHypercalls",
> +	[39] = "VpHardwareInterrupts",
> +	[40] = "VpNestedPageFaultInterceptsCount",
> +	[41] = "VpNestedPageFaultInterceptsTime",
> +	[42] = "VpLogicalProcessorDispatches",
> +	[43] = "VpWaitingForCpuTime",
> +	[44] = "VpExtendedHypercalls",
> +	[45] = "VpExtendedHypercallInterceptMessages",
> +	[46] = "VpMbecNestedPageTableSwitches",
> +	[47] = "VpOtherReflectedGuestExceptions",
> +	[48] = "VpGlobalIoTlbFlushes",
> +	[49] = "VpGlobalIoTlbFlushCost",
> +	[50] = "VpLocalIoTlbFlushes",
> +	[51] = "VpLocalIoTlbFlushCost",
> +	[52] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
> +	[53] = "VpFlushGuestPhysicalAddressListHypercalls",
> +	[54] = "VpPostedInterruptNotifications",
> +	[55] = "VpPostedInterruptScans",
> +	[56] = "VpTotalCoreRunTime",
> +	[57] = "VpMaximumRunTime",
> +	[58] = "VpWaitingForCpuTimeBucket0",
> +	[59] = "VpWaitingForCpuTimeBucket1",
> +	[60] = "VpWaitingForCpuTimeBucket2",
> +	[61] = "VpWaitingForCpuTimeBucket3",
> +	[62] = "VpWaitingForCpuTimeBucket4",
> +	[63] = "VpWaitingForCpuTimeBucket5",
> +	[64] = "VpWaitingForCpuTimeBucket6",
> +	[65] = "VpHwpRequestContextSwitches",
> +	[66] = "VpPlaceholder2",
> +	[67] = "VpPlaceholder3",
> +	[68] = "VpPlaceholder4",
> +	[69] = "VpPlaceholder5",
> +	[70] = "VpPlaceholder6",
> +	[71] = "VpPlaceholder7",
> +	[72] = "VpPlaceholder8",
> +	[73] = "VpContentionTime",
> +	[74] = "VpWakeUpTime",
> +	[75] = "VpSchedulingPriority",
> +	[76] = "VpVtl1DispatchCount",
> +	[77] = "VpVtl2DispatchCount",
> +	[78] = "VpVtl2DispatchBucket0",
> +	[79] = "VpVtl2DispatchBucket1",
> +	[80] = "VpVtl2DispatchBucket2",
> +	[81] = "VpVtl2DispatchBucket3",
> +	[82] = "VpVtl2DispatchBucket4",
> +	[83] = "VpVtl2DispatchBucket5",
> +	[84] = "VpVtl2DispatchBucket6",
> +	[85] = "VpVtl1RunTime",
> +	[86] = "VpVtl2RunTime",
> +	[87] = "VpIommuHypercalls",
> +	[88] = "VpCpuGroupHypercalls",
> +	[89] = "VpVsmHypercalls",
> +	[90] = "VpEventLogHypercalls",
> +	[91] = "VpDeviceDomainHypercalls",
> +	[92] = "VpDepositHypercalls",
> +	[93] = "VpSvmHypercalls",
> +	[94] = "VpLoadAvg",
> +	[95] = "VpRootDispatchThreadBlocked",
> +	[96] = "VpIdleCpuTime",
> +	[97] = "VpWaitingForCpuTimeBucket7",
> +	[98] = "VpWaitingForCpuTimeBucket8",
> +	[99] = "VpWaitingForCpuTimeBucket9",
> +	[100] = "VpWaitingForCpuTimeBucket10",
> +	[101] = "VpWaitingForCpuTimeBucket11",
> +	[102] = "VpWaitingForCpuTimeBucket12",
> +	[103] = "VpHierarchicalSuspendTime",
> +	[104] = "VpExpressSchedulingAttempts",
> +	[105] = "VpExpressSchedulingCount",
> +#endif
> +};
> --
> 2.34.1


^ permalink raw reply

* RE: [PATCH v5 3/7] mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
From: Michael Kelley @ 2026-01-27 16:57 UTC (permalink / raw)
  To: Nuno Das Neves, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, skinsburskii@linux.microsoft.com
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
	paekkaladevi@linux.microsoft.com
In-Reply-To: <20260126205603.404655-4-nunodasneves@linux.microsoft.com>

From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Monday, January 26, 2026 12:56 PM
> 
> From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> 
> These functions are currently only used to map child partition VP stats,
> on root partition. However, they will soon be used on L1VH, and and also

Duplicate word "and".

Remember to run checkpatch.pl over your patches -- this duplicate
word was flagged by checkpatch.

Michael

> used for mapping the host's own VP stats.
> 
> Introduce a helper is_l1vh_parent() to determine whether we are mapping
> our own VP stats. In this case, do not attempt to map the PARENT area.
> Note this is a different case than mapping PARENT on an older hypervisor
> where it is not available at all, so must be handled separately.
> 
> On unmap, pass the stats pages since on L1VH the kernel allocates them
> and they must be freed in hv_unmap_stats_page().
> 
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root.h      | 10 ++++++
>  drivers/hv/mshv_root_main.c | 61 ++++++++++++++++++++++++++-----------
>  2 files changed, 54 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 05ba1f716f9e..e4912b0618fa 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -254,6 +254,16 @@ struct mshv_partition *mshv_partition_get(struct
> mshv_partition *partition);
>  void mshv_partition_put(struct mshv_partition *partition);
>  struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
> 
> +static inline bool is_l1vh_parent(u64 partition_id)
> +{
> +	return hv_l1vh_partition() && (partition_id == HV_PARTITION_ID_SELF);
> +}
> +
> +int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
> +		      struct hv_stats_page **stats_pages);
> +void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
> +			 struct hv_stats_page **stats_pages);
> +
>  /* hypercalls */
> 
>  int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index be5ad0fbfbee..faca3cc63e79 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -956,23 +956,36 @@ mshv_vp_release(struct inode *inode, struct file *filp)
>  	return 0;
>  }
> 
> -static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
> -				struct hv_stats_page *stats_pages[])
> +void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
> +			 struct hv_stats_page *stats_pages[])
>  {
>  	union hv_stats_object_identity identity = {
>  		.vp.partition_id = partition_id,
>  		.vp.vp_index = vp_index,
>  	};
> +	int err;
> 
>  	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
> -	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
> -
> -	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
> -	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
> +	err = hv_unmap_stats_page(HV_STATS_OBJECT_VP,
> +				  stats_pages[HV_STATS_AREA_SELF],
> +				  &identity);
> +	if (err)
> +		pr_err("%s: failed to unmap partition %llu vp %u self stats, err: %d\n",
> +		       __func__, partition_id, vp_index, err);
> +
> +	if (stats_pages[HV_STATS_AREA_PARENT] !=
> stats_pages[HV_STATS_AREA_SELF]) {
> +		identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
> +		err = hv_unmap_stats_page(HV_STATS_OBJECT_VP,
> +					  stats_pages[HV_STATS_AREA_PARENT],
> +					  &identity);
> +		if (err)
> +			pr_err("%s: failed to unmap partition %llu vp %u parent stats,
> err: %d\n",
> +			       __func__, partition_id, vp_index, err);
> +	}
>  }
> 
> -static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
> -			     struct hv_stats_page *stats_pages[])
> +int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
> +		      struct hv_stats_page *stats_pages[])
>  {
>  	union hv_stats_object_identity identity = {
>  		.vp.partition_id = partition_id,
> @@ -983,23 +996,37 @@ static int mshv_vp_stats_map(u64 partition_id, u32
> vp_index,
>  	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
>  	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
>  				&stats_pages[HV_STATS_AREA_SELF]);
> -	if (err)
> +	if (err) {
> +		pr_err("%s: failed to map partition %llu vp %u self stats, err: %d\n",
> +		       __func__, partition_id, vp_index, err);
>  		return err;
> +	}
> 
> -	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
> -	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
> -				&stats_pages[HV_STATS_AREA_PARENT]);
> -	if (err)
> -		goto unmap_self;
> -
> -	if (!stats_pages[HV_STATS_AREA_PARENT])
> +	/*
> +	 * L1VH partition cannot access its vp stats in parent area.
> +	 */
> +	if (is_l1vh_parent(partition_id)) {
>  		stats_pages[HV_STATS_AREA_PARENT] =
> stats_pages[HV_STATS_AREA_SELF];
> +	} else {
> +		identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
> +		err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
> +					&stats_pages[HV_STATS_AREA_PARENT]);
> +		if (err) {
> +			pr_err("%s: failed to map partition %llu vp %u parent stats, err:
> %d\n",
> +			       __func__, partition_id, vp_index, err);
> +			goto unmap_self;
> +		}
> +		if (!stats_pages[HV_STATS_AREA_PARENT])
> +			stats_pages[HV_STATS_AREA_PARENT] =
> stats_pages[HV_STATS_AREA_SELF];
> +	}
> 
>  	return 0;
> 
>  unmap_self:
>  	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
> -	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
> +	hv_unmap_stats_page(HV_STATS_OBJECT_VP,
> +			    stats_pages[HV_STATS_AREA_SELF],
> +			    &identity);
>  	return err;
>  }
> 
> --
> 2.34.1


^ permalink raw reply

* Re: [PATCH net-next v16 00/12] vsock: add namespace support to vhost-vsock and loopback
From: patchwork-bot+netdevbpf @ 2026-01-27 10:00 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: sgarzare, davem, edumazet, kuba, pabeni, horms, stefanha, mst,
	jasowang, eperezma, xuanzhuo, kys, haiyangz, wei.liu, decui,
	bryan-bt.tan, vishnu.dasa, bcm-kernel-feedback-list, shuah,
	longli, corbet, linux-kernel, virtualization, netdev, kvm,
	linux-hyperv, linux-kselftest, berrange, sargun, linux-doc,
	bobbyeshleman
In-Reply-To: <20260121-vsock-vmtest-v16-0-2859a7512097@meta.com>

Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Wed, 21 Jan 2026 14:11:40 -0800 you wrote:
> This series adds namespace support to vhost-vsock and loopback. It does
> not add namespaces to any of the other guest transports (virtio-vsock,
> hyperv, or vmci).
> 
> The current revision supports two modes: local and global. Local
> mode is complete isolation of namespaces, while global mode is complete
> sharing between namespaces of CIDs (the original behavior).
> 
> [...]

Here is the summary with links:
  - [net-next,v16,01/12] vsock: add netns to vsock core
    https://git.kernel.org/netdev/net-next/c/eafb64f40ca4
  - [net-next,v16,02/12] virtio: set skb owner of virtio_transport_reset_no_sock() reply
    https://git.kernel.org/netdev/net-next/c/a6ae12a599e0
  - [net-next,v16,03/12] vsock: add netns support to virtio transports
    https://git.kernel.org/netdev/net-next/c/a69686327e42
  - [net-next,v16,04/12] selftests/vsock: increase timeout to 1200
    https://git.kernel.org/netdev/net-next/c/873e7de9f9a3
  - [net-next,v16,05/12] selftests/vsock: add namespace helpers to vmtest.sh
    https://git.kernel.org/netdev/net-next/c/423ec6383edb
  - [net-next,v16,06/12] selftests/vsock: prepare vm management helpers for namespaces
    https://git.kernel.org/netdev/net-next/c/fd1b41725d58
  - [net-next,v16,07/12] selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers
    https://git.kernel.org/netdev/net-next/c/4e870ac81df7
  - [net-next,v16,08/12] selftests/vsock: use ss to wait for listeners instead of /proc/net
    https://git.kernel.org/netdev/net-next/c/7418f3bb3aa2
  - [net-next,v16,09/12] selftests/vsock: add tests for proc sys vsock ns_mode
    https://git.kernel.org/netdev/net-next/c/06cf7895abf9
  - [net-next,v16,10/12] selftests/vsock: add namespace tests for CID collisions
    https://git.kernel.org/netdev/net-next/c/605caec5adc2
  - [net-next,v16,11/12] selftests/vsock: add tests for host <-> vm connectivity with namespaces
    https://git.kernel.org/netdev/net-next/c/0424ee7c3a17
  - [net-next,v16,12/12] selftests/vsock: add tests for namespace deletion
    https://git.kernel.org/netdev/net-next/c/b3b7b33264c6

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
From: Mukesh R @ 2026-01-27  3:07 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <aXevWXolgNrrLltF@skinsburskii.localdomain>

On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
>> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>
>>>> Upon guest access, in case of missing mmio mapping, the hypervisor
>>>> generates an unmapped gpa intercept. In this path, lookup the PCI
>>>> resource pfn for the guest gpa, and ask the hypervisor to map it
>>>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>>>> and obtained via fixup_user_fault call (similar to KVM).
>>>>
>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>> ---
>>>>    drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 115 insertions(+)
>>>>
>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>> index 03f3aa9f5541..4c8bc7cd0888 100644
>>>> --- a/drivers/hv/mshv_root_main.c
>>>> +++ b/drivers/hv/mshv_root_main.c
>>>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>>>>    	};
>>>>    } __packed;
>>>> +bool hv_nofull_mmio;   /* don't map entire mmio region upon fault */
>>>> +static int __init setup_hv_full_mmio(char *str)
>>>> +{
>>>> +	hv_nofull_mmio = true;
>>>> +	return 0;
>>>> +}
>>>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>>>> +
>>>>    struct mshv_root mshv_root;
>>>>    enum hv_scheduler_type hv_scheduler_type;
>>>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>>>>    }
>>>>    #ifdef CONFIG_X86_64
>>>> +
>>>> +/*
>>>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>>>> + * else just return -errno.
>>>> + */
>>>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>>>> +				       u64 *mmio_pfnp)
>>>> +{
>>>> +	struct vm_area_struct *vma;
>>>> +	bool is_mmio;
>>>> +	u64 uaddr;
>>>> +	struct mshv_mem_region *mreg;
>>>> +	struct follow_pfnmap_args pfnmap_args;
>>>> +	int rc = -EINVAL;
>>>> +
>>>> +	/*
>>>> +	 * Do not allow mem region to be deleted beneath us. VFIO uses
>>>> +	 * useraddr vma to lookup pci bar pfn.
>>>> +	 */
>>>> +	spin_lock(&pt->pt_mem_regions_lock);
>>>> +
>>>> +	/* Get the region again under the lock */
>>>> +	mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>> +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>> +		goto unlock_pt_out;
>>>> +
>>>> +	uaddr = mreg->start_uaddr +
>>>> +		((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>>>> +
>>>> +	mmap_read_lock(current->mm);
>>>
>>> Semaphore can't be taken under spinlock.
> 
>>
>> Yeah, something didn't feel right here and I meant to recheck, now regret
>> rushing to submit the patch.
>>
>> Rethinking, I think the pt_mem_regions_lock is not needed to protect
>> the uaddr because unmap will properly serialize via the mm lock.
>>
>>
>>>> +	vma = vma_lookup(current->mm, uaddr);
>>>> +	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>>>
>>> Why this check is needed again?
>>
>> To make sure region did not change. This check is under lock.
>>
> 
> How can this happen? One can't change VMA type without unmapping it
> first. And unmapping it leads to a kernel MMIO region state dangling
> around without corresponding user space mapping.

Right, and vm_flags would not be mmio expected then.

> This is similar to dangling pinned regions and should likely be
> addressed the same way by utilizing MMU notifiers to destpoy memoty
> regions is VMA is detached.

I don't think we need that. Either it succeeds if the region did not
change at all, or just fails.


>>> The region type is stored on the region itself.
>>> And the type is checked on the caller side.
>>>
>>>> +	if (!is_mmio)
>>>> +		goto unlock_mmap_out;
>>>> +
>>>> +	pfnmap_args.vma = vma;
>>>> +	pfnmap_args.address = uaddr;
>>>> +
>>>> +	rc = follow_pfnmap_start(&pfnmap_args);
>>>> +	if (rc) {
>>>> +		rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>>>> +				      NULL);
>>>> +		if (rc)
>>>> +			goto unlock_mmap_out;
>>>> +
>>>> +		rc = follow_pfnmap_start(&pfnmap_args);
>>>> +		if (rc)
>>>> +			goto unlock_mmap_out;
>>>> +	}
>>>> +
>>>> +	*mmio_pfnp = pfnmap_args.pfn;
>>>> +	follow_pfnmap_end(&pfnmap_args);
>>>> +d
>>>> +unlock_mmap_out:
>>>> +	mmap_read_unlock(current->mm);
>>>> +unlock_pt_out:
>>>> +	spin_unlock(&pt->pt_mem_regions_lock);
>>>> +	return rc;
>>>> +}
>>>> +
>>>> +/*
>>>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>>>> + * and resolve if possible.
>>>> + * Returns: True if valid mmio intercept and it was handled, else false
>>>> + */
>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>>>> +{
>>>> +	struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>>>> +	struct hv_x64_memory_intercept_message *msg;
>>>> +	union hv_x64_memory_access_info accinfo;
>>>> +	u64 gfn, mmio_spa, numpgs;
>>>> +	struct mshv_mem_region *mreg;
>>>> +	int rc;
>>>> +	struct mshv_partition *pt = vp->vp_partition;
>>>> +
>>>> +	msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>>>> +	accinfo = msg->memory_access_info;
>>>> +
>>>> +	if (!accinfo.gva_gpa_valid)
>>>> +		return false;
>>>> +
>>>> +	/* Do a fast check and bail if non mmio intercept */
>>>> +	gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>>>> +	mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>
>>> This call needs to be protected by the spinlock.
>>
>> This is sorta fast path to bail. We recheck under partition lock above.
>>
> 
> Accessing the list of regions without lock is unsafe.

I am not sure why? This check is done by a vcpu thread, so regions
will not have just gone away.

Thanks,
-Mukesh


> Thanks,
> Stanislav
> 
>> Thanks,
>> -Mukesh
>>
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>> +		return false;
>>>> +
>>>> +	rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>>>> +	if (rc)
>>>> +		return false;
>>>> +
>>>> +	if (!hv_nofull_mmio) {		/* default case */
>>>> +		gfn = mreg->start_gfn;
>>>> +		mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>>>> +		numpgs = mreg->nr_pages;
>>>> +	} else
>>>> +		numpgs = 1;
>>>> +
>>>> +	rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>>>> +
>>>> +	return rc == 0;
>>>> +}
>>>> +
>>>>    static struct mshv_mem_region *
>>>>    mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>>>>    {
>>>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>>>    	return ret;
>>>>    }
>>>> +
>>>>    #else  /* CONFIG_X86_64 */
>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>>>>    static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>>>>    #endif /* CONFIG_X86_64 */
>>>>    static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>>>>    {
>>>>    	switch (vp->vp_intercept_msg_page->header.message_type) {
>>>> +	case HVMSG_UNMAPPED_GPA:
>>>> +		return mshv_handle_unmapped_gpa(vp);
>>>>    	case HVMSG_GPA_INTERCEPT:
>>>>    		return mshv_handle_gpa_intercept(vp);
>>>>    	}
>>>> -- 
>>>> 2.51.2.vfs.0.1
>>>>


^ permalink raw reply

* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Mukesh R @ 2026-01-27  3:02 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <aXeO7wh7bpacJ1Sk@skinsburskii.localdomain>

On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
>> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>
>>>> Add a new file to implement management of device domains, mapping and
>>>> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
>>>> framework for PCI passthru on Hyper-V running Linux as root or L1VH
>>>> parent. This also implements direct attach mechanism for PCI passthru,
>>>> and it is also made to work within the VFIO framework.
>>>>
>>>> At a high level, during boot the hypervisor creates a default identity
>>>> domain and attaches all devices to it. This nicely maps to Linux iommu
>>>> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
>>>> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
>>>> during boot. As mentioned previously, Hyper-V supports two ways to do
>>>> PCI passthru:
>>>>
>>>>     1. Device Domain: root must create a device domain in the hypervisor,
>>>>        and do map/unmap hypercalls for mapping and unmapping guest RAM.
>>>>        All hypervisor communications use device id of type PCI for
>>>>        identifying and referencing the device.
>>>>
>>>>     2. Direct Attach: the hypervisor will simply use the guest's HW
>>>>        page table for mappings, thus the host need not do map/unmap
>>>>        device memory hypercalls. As such, direct attach passthru setup
>>>>        during guest boot is extremely fast. A direct attached device
>>>>        must be referenced via logical device id and not via the PCI
>>>>        device id.
>>>>
>>>> At present, L1VH root/parent only supports direct attaches. Also direct
>>>> attach is default in non-L1VH cases because there are some significant
>>>> performance issues with device domain implementation currently for guests
>>>> with higher RAM (say more than 8GB), and that unfortunately cannot be
>>>> addressed in the short term.
>>>>
>>>
>>> <snip>
>>>
> 
> <snip>
> 
>>>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
>>>> +{
>>>> +	struct pci_dev *pdev;
>>>> +	struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +
>>>> +	/* See the attach function, only PCI devices for now */
>>>> +	if (!dev_is_pci(dev))
>>>> +		return;
>>>> +
>>>> +	if (hvdom->num_attchd == 0)
>>>> +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
>>>> +
>>>> +	pdev = to_pci_dev(dev);
>>>> +
>>>> +	if (hvdom->attached_dom) {
>>>> +		hv_iommu_det_dev_from_guest(hvdom, pdev);
>>>> +
>>>> +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
>>>> +		 * next.
>>>> +		 */
>>>> +	} else {
>>>> +		hv_iommu_det_dev_from_dom(hvdom, pdev);
>>>> +	}
>>>> +
>>>> +	hvdom->num_attchd--;
>>>
>>> Shouldn't this be modified iff the detach succeeded?
>>
>> We want to still free the domain and not let it get stuck. The purpose
>> is more to make sure detach was called before domain free.
>>
> 
> How can one debug subseqent errors if num_attchd is decremented
> unconditionally? In reality the device is left attached, but the related
> kernel metadata is gone.

Error is printed in case of failed detach. If there is panic, at least
you can get some info about the device. Metadata in hypervisor is
around if failed.

>>>> +}
>>>> +
>>>> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
>>>> +				     unsigned long iova, phys_addr_t paddr,
>>>> +				     size_t size, u32 flags)
>>>> +{
>>>> +	unsigned long irqflags;
>>>> +	struct hv_iommu_mapping *mapping;
>>>> +
>>>> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
>>>> +	if (!mapping)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	mapping->paddr = paddr;
>>>> +	mapping->iova.start = iova;
>>>> +	mapping->iova.last = iova + size - 1;
>>>> +	mapping->flags = flags;
>>>> +
>>>> +	spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
>>>> +	interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
>>>> +	spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
>>>> +					unsigned long iova, size_t size)
>>>> +{
>>>> +	unsigned long flags;
>>>> +	size_t unmapped = 0;
>>>> +	unsigned long last = iova + size - 1;
>>>> +	struct hv_iommu_mapping *mapping = NULL;
>>>> +	struct interval_tree_node *node, *next;
>>>> +
>>>> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
>>>> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
>>>> +	while (next) {
>>>> +		node = next;
>>>> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
>>>> +		next = interval_tree_iter_next(node, iova, last);
>>>> +
>>>> +		/* Trying to split a mapping? Not supported for now. */
>>>> +		if (mapping->iova.start < iova)
>>>> +			break;
>>>> +
>>>> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
>>>> +
>>>> +		interval_tree_remove(node, &hvdom->mappings_tree);
>>>> +		kfree(mapping);
>>>> +	}
>>>> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>>>> +
>>>> +	return unmapped;
>>>> +}
>>>> +
>>>> +/* Return: must return exact status from the hypercall without changes */
>>>> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
>>>> +			    unsigned long iova, phys_addr_t paddr,
>>>> +			    unsigned long npages, u32 map_flags)
>>>> +{
>>>> +	u64 status;
>>>> +	int i;
>>>> +	struct hv_input_map_device_gpa_pages *input;
>>>> +	unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
>>>> +
>>>> +	local_irq_save(flags);
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	memset(input, 0, sizeof(*input));
>>>> +
>>>> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>>>> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> +	input->device_domain.domain_id.id = hvdom->domid_num;
>>>> +	input->map_flags = map_flags;
>>>> +	input->target_device_va_base = iova;
>>>> +
>>>> +	pfn = paddr >> HV_HYP_PAGE_SHIFT;
>>>> +	for (i = 0; i < npages; i++, pfn++)
>>>> +		input->gpa_page_list[i] = pfn;
>>>> +
>>>> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
>>>> +				     input, NULL);
>>>> +
>>>> +	local_irq_restore(flags);
>>>> +	return status;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The core VFIO code loops over memory ranges calling this function with
>>>> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
>>>> + */
>>>> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
>>>> +			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
>>>> +			      int prot, gfp_t gfp, size_t *mapped)
>>>> +{
>>>> +	u32 map_flags;
>>>> +	int ret;
>>>> +	u64 status;
>>>> +	unsigned long npages, done = 0;
>>>> +	struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +	size_t size = pgsize * pgcount;
>>>> +
>>>> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
>>>> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
>>>> +
>>>> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	if (hvdom->attached_dom) {
>>>> +		*mapped = size;
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	npages = size >> HV_HYP_PAGE_SHIFT;
>>>> +	while (done < npages) {
>>>> +		ulong completed, remain = npages - done;
>>>> +
>>>> +		status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
>>>> +					  map_flags);
>>>> +
>>>> +		completed = hv_repcomp(status);
>>>> +		done = done + completed;
>>>> +		iova = iova + (completed << HV_HYP_PAGE_SHIFT);
>>>> +		paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
>>>> +
>>>> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
>>>> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>> +						    hv_current_partition_id,
>>>> +						    256);
>>>> +			if (ret)
>>>> +				break;
>>>> +		}
>>>> +		if (!hv_result_success(status))
>>>> +			break;
>>>> +	}
>>>> +
>>>> +	if (!hv_result_success(status)) {
>>>> +		size_t done_size = done << HV_HYP_PAGE_SHIFT;
>>>> +
>>>> +		hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
>>>> +			      done, npages, iova);
>>>> +		/*
>>>> +		 * lookup tree has all mappings [0 - size-1]. Below unmap will
>>>> +		 * only remove from [0 - done], we need to remove second chunk
>>>> +		 * [done+1 - size-1].
>>>> +		 */
>>>> +		hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
>>>> +		hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
>>>> +				     done, NULL);
>>>> +		if (mapped)
>>>> +			*mapped = 0;
>>>> +	} else
>>>> +		if (mapped)
>>>> +			*mapped = size;
>>>> +
>>>> +	return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
>>>> +				   size_t pgsize, size_t pgcount,
>>>> +				   struct iommu_iotlb_gather *gather)
>>>> +{
>>>> +	unsigned long flags, npages;
>>>> +	struct hv_input_unmap_device_gpa_pages *input;
>>>> +	u64 status;
>>>> +	struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +	size_t unmapped, size = pgsize * pgcount;
>>>> +
>>>> +	unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
>>>> +	if (unmapped < size)
>>>> +		pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
>>>> +		       __func__, iova, unmapped, size);
>>>> +
>>>> +	if (hvdom->attached_dom)
>>>> +		return size;
>>>> +
>>>> +	npages = size >> HV_HYP_PAGE_SHIFT;
>>>> +
>>>> +	local_irq_save(flags);
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	memset(input, 0, sizeof(*input));
>>>> +
>>>> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>>>> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> +	input->device_domain.domain_id.id = hvdom->domid_num;
>>>> +	input->target_device_va_base = iova;
>>>> +
>>>> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
>>>> +				     0, input, NULL);
>>>> +	local_irq_restore(flags);
>>>> +
>>>> +	if (!hv_result_success(status))
>>>> +		hv_status_err(status, "\n");
>>>> +
>>>
>>> There is some inconsistency in namings and behaviour of paired
>>> functions:
>>> 1. The pair of hv_iommu_unmap_pages is called hv_iommu_map_pgs
>>
>> The pair of hv_iommu_unmap_pages is hv_iommu_map_pages right above.
>> hv_iommu_map_pgs could be renamed to hv_iommu_map_pgs_hcall I suppose.
>>
> 
> Hv_iommu_map_pages is a wrapper around hv_iommu_map_pgs while
> hv_iommu_unmap_pages is a wrapper around the correspodning hypercall.
> That's the inconsistency I meant.

Unmap does not need intermediate function.

>>> 2. hv_iommu_map_pgs doesn't print status in case of error.

  We print error upon its failure in hv_iommu_map_pages():

          if (!hv_result_success(status)) {
                 size_t done_size = done << HV_HYP_PAGE_SHIFT;
                 hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
                               done, npages, iova);


>> it does:
>>              hv_status_err(status, "\n");  <==============
> 
> It does not. I guess you are confusing it with some other function.
> Here is the function:
> 
>>
>>
>>> It would be much better to keep this code consistent.
>>>
>>>> +	return unmapped;
>>>> +}
>>>> +
>>>> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
>>>> +					 dma_addr_t iova)
>>>> +{
>>>> +	u64 paddr = 0;
>>>> +	unsigned long flags;
>>>> +	struct hv_iommu_mapping *mapping;
>>>> +	struct interval_tree_node *node;
>>>> +	struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +
>>>> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
>>>> +	node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
>>>> +	if (node) {
>>>> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
>>>> +		paddr = mapping->paddr + (iova - mapping->iova.start);
>>>> +	}
>>>> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>>>> +
>>>> +	return paddr;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Currently, hypervisor does not provide list of devices it is using
>>>> + * dynamically. So use this to allow users to manually specify devices that
>>>> + * should be skipped. (eg. hypervisor debugger using some network device).
>>>> + */
>>>> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>>>> +{
>>>> +	if (!dev_is_pci(dev))
>>>> +		return ERR_PTR(-ENODEV);
>>>> +
>>>> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
>>>> +		int rc, pos = 0;
>>>> +		int parsed;
>>>> +		int segment, bus, slot, func;
>>>> +		struct pci_dev *pdev = to_pci_dev(dev);
>>>> +
>>>> +		do {
>>>> +			parsed = 0;
>>>> +
>>>> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
>>>> +				    &segment, &bus, &slot, &func, &parsed);
>>>> +			if (rc)
>>>> +				break;
>>>> +			if (parsed <= 0)
>>>> +				break;
>>>> +
>>>> +			if (pci_domain_nr(pdev->bus) == segment &&
>>>> +			    pdev->bus->number == bus &&
>>>> +			    PCI_SLOT(pdev->devfn) == slot &&
>>>> +			    PCI_FUNC(pdev->devfn) == func) {
>>>> +
>>>> +				dev_info(dev, "skipped by Hyper-V IOMMU\n");
>>>> +				return ERR_PTR(-ENODEV);
>>>> +			}
>>>> +			pos += parsed;
>>>> +
>>>> +		} while (pci_devs_to_skip[pos]);
>>>> +	}
>>>> +
>>>> +	/* Device will be explicitly attached to the default domain, so no need
>>>> +	 * to do dev_iommu_priv_set() here.
>>>> +	 */
>>>> +
>>>> +	return &hv_virt_iommu;
>>>> +}
>>>> +
>>>> +static void hv_iommu_probe_finalize(struct device *dev)
>>>> +{
>>>> +	struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
>>>> +
>>>> +	if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
>>>> +		iommu_setup_dma_ops(dev);
>>>> +	else
>>>> +		set_dma_ops(dev, NULL);
>>>> +}
>>>> +
>>>> +static void hv_iommu_release_device(struct device *dev)
>>>> +{
>>>> +	struct hv_domain *hvdom = dev_iommu_priv_get(dev);
>>>> +
>>>> +	/* Need to detach device from device domain if necessary. */
>>>> +	if (hvdom)
>>>> +		hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
>>>> +
>>>> +	dev_iommu_priv_set(dev, NULL);
>>>> +	set_dma_ops(dev, NULL);
>>>> +}
>>>> +
>>>> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
>>>> +{
>>>> +	if (dev_is_pci(dev))
>>>> +		return pci_device_group(dev);
>>>> +	else
>>>> +		return generic_device_group(dev);
>>>> +}
>>>> +
>>>> +static int hv_iommu_def_domain_type(struct device *dev)
>>>> +{
>>>> +	/* The hypervisor always creates this by default during boot */
>>>> +	return IOMMU_DOMAIN_IDENTITY;
>>>> +}
>>>> +
>>>> +static struct iommu_ops hv_iommu_ops = {
>>>> +	.capable	    = hv_iommu_capable,
>>>> +	.domain_alloc_identity	= hv_iommu_domain_alloc_identity,
>>>> +	.domain_alloc_paging	= hv_iommu_domain_alloc_paging,
>>>> +	.probe_device	    = hv_iommu_probe_device,
>>>> +	.probe_finalize     = hv_iommu_probe_finalize,
>>>> +	.release_device     = hv_iommu_release_device,
>>>> +	.def_domain_type    = hv_iommu_def_domain_type,
>>>> +	.device_group	    = hv_iommu_device_group,
>>>> +	.default_domain_ops = &(const struct iommu_domain_ops) {
>>>> +		.attach_dev   = hv_iommu_attach_dev,
>>>> +		.map_pages    = hv_iommu_map_pages,
>>>> +		.unmap_pages  = hv_iommu_unmap_pages,
>>>> +		.iova_to_phys = hv_iommu_iova_to_phys,
>>>> +		.free	      = hv_iommu_domain_free,
>>>> +	},
>>>> +	.owner		    = THIS_MODULE,
>>>> +};
>>>> +
>>>> +static void __init hv_initialize_special_domains(void)
>>>> +{
>>>> +	hv_def_identity_dom.iommu_dom.geometry = default_geometry;
>>>> +	hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
>>>
>>> hv_def_identity_dom is a static global variable.
>>> Why not initialize hv_def_identity_dom upon definition instead of
>>> introducing a new function?
>>
>> Originally, it was function. I changed it static, but during 6.6
>> review I changed it back to function.  I can't remember why, but is
>> pretty harmless. We may add more domains, for example null domain to the
>> initilization in future.
>>
>>>> +}
>>>> +
>>>> +static int __init hv_iommu_init(void)
>>>> +{
>>>> +	int ret;
>>>> +	struct iommu_device *iommup = &hv_virt_iommu;
>>>> +
>>>> +	if (!hv_is_hyperv_initialized())
>>>> +		return -ENODEV;
>>>> +
>>>> +	ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
>>>> +	if (ret) {
>>>> +		pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
>>>> +		return ret;
>>>> +	}
>>>> +
>>>> +	/* This must come before iommu_device_register because the latter calls
>>>> +	 * into the hooks.
>>>> +	 */
>>>> +	hv_initialize_special_domains();
>>>> +
>>>> +	ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
>>>
>>> It looks weird to initialize an object after creating sysfs entries for
>>> it.
>>> It should be the other way around.
>>
>> Not sure if it should be, much easier to remove sysfs entry than other
>> cleanup, even tho iommu_device_unregister is there. I am sure we'll add
>> more code here, probably why it was originally done this way.
>>
> 
> Sysfs provides user space access to kernel objects. If the object is not
> initialized, it's not only a useless sysfs entry, but also a potential
> cause for kernel panic if user space will try to access this entry
> before the object is initialized.

I hear you... but,
   o there is nothing under sysfs to be accessed when created
   o it is during boot
   o it should almost never fail...
   o iommu_device_sysfs_remove is much more light weight than
     iommu_device_unregister
   o i expect more to be added there as we enhance it

Thanks,
-Mukesh


> Thanks,
> Stanislav
> 
> 
>> Thanks,
>> -Mukesh
>>
>>
>> ... snip........


^ permalink raw reply

* Re: [PATCH 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Mukesh R @ 2026-01-27  2:06 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXacAQP3gjZ1gSLs@skinsburskii.localdomain>

On 1/25/26 14:41, Stanislav Kinsburskii wrote:
> On Fri, Jan 23, 2026 at 04:33:39PM -0800, Mukesh R wrote:
>> On 1/22/26 17:35, Stanislav Kinsburskii wrote:
>>> Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
>>> functions to handle memory deposition with proper error handling.
>>>
>>> The new hv_deposit_memory_node() function takes the hypervisor status
>>> as a parameter and validates it before depositing pages. It checks for
>>> HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
>>> unexpected status codes.
>>>
>>> This is a precursor patch to new out-of-memory error codes support.
>>> No functional changes intended.
>>>
>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>> ---
>>>    drivers/hv/hv_proc.c           |   22 ++++++++++++++++++++--
>>>    drivers/hv/mshv_root_hv_call.c |   25 +++++++++----------------
>>>    drivers/hv/mshv_root_main.c    |    3 +--
>>>    include/asm-generic/mshyperv.h |   10 ++++++++++
>>>    4 files changed, 40 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>>> index 80c66d1c74d5..c0c2bfc80d77 100644
>>> --- a/drivers/hv/hv_proc.c
>>> +++ b/drivers/hv/hv_proc.c
>>> @@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>    }
>>>    EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>>> +int hv_deposit_memory_node(int node, u64 partition_id,
>>> +			   u64 hv_status)
>>> +{
>>> +	u32 num_pages;
>>> +
>>> +	switch (hv_result(hv_status)) {
>>> +	case HV_STATUS_INSUFFICIENT_MEMORY:
>>> +		num_pages = 1;
>>> +		break;
>>> +	default:
>>> +		hv_status_err(hv_status, "Unexpected!\n");
>>> +		return -ENOMEM;
>>> +	}
>>> +	return hv_call_deposit_pages(node, partition_id, num_pages);
>>> +}
>>> +EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
>>> +
>>
>> Different hypercalls may want to deposit different number of pages in one
>> shot. As feature evolves, page sizes get mixed, we'd almost need that
>> flexibility. So, imo, either we just don't do this for now, or add num pages
>> parameter to be passed down.
>>
> 
> What you do mean by "page sizes get mixed"?
> A helper to deposit num pages already exists: its
> hv_call_deposit_pages().

My point, you are removing number of pages, and we may want to keep
that so one can quickly play around and change them.

-                       ret = hv_call_deposit_pages(NUMA_NO_NODE,
-                                                   pt_id, 1);
+                       ret = hv_deposit_memory(pt_id, status);

For example, in hv_call_initialize_partition() we may realize after
some analysis that depositing 2 pages or 4 pages is much better.

> Thanks,
> Stanislav
> 
>> Thanks,
>> -Mukesh
>>
>>
>>
>>>    bool hv_result_oom(u64 status)
>>>    {
>>>    	switch (hv_result(status)) {
>>> @@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>>    			}
>>>    			break;
>>>    		}
>>> -		ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
>>> +		ret = hv_deposit_memory_node(node, hv_current_partition_id,
>>> +					     status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>>    			}
>>>    			break;
>>>    		}
>>> -		ret = hv_call_deposit_pages(node, partition_id, 1);
>>> +		ret = hv_deposit_memory_node(node, partition_id, status);
>>>    	} while (!ret);
>>> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
>>> index 58c5cbf2e567..06f2bac8039d 100644
>>> --- a/drivers/hv/mshv_root_hv_call.c
>>> +++ b/drivers/hv/mshv_root_hv_call.c
>>> @@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
>>>    			break;
>>>    		}
>>>    		local_irq_restore(irq_flags);
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -					    hv_current_partition_id, 1);
>>> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
>>>    			ret = hv_result_to_errno(status);
>>>    			break;
>>>    		}
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>>> +		ret = hv_deposit_memory(partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
>>>    		}
>>>    		local_irq_restore(flags);
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -					    partition_id, 1);
>>> +		ret = hv_deposit_memory(partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
>>>    		}
>>>    		local_irq_restore(flags);
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -					    partition_id, 1);
>>> +		ret = hv_deposit_memory(partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
>>>    		local_irq_restore(flags);
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>>> +		ret = hv_deposit_memory(partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
>>>    			ret = hv_result_to_errno(status);
>>>    			break;
>>>    		}
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
>>> -
>>> +		ret = hv_deposit_memory(port_partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
>>>    			ret = hv_result_to_errno(status);
>>>    			break;
>>>    		}
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -					    connection_partition_id, 1);
>>> +		ret = hv_deposit_memory(connection_partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
>>>    			break;
>>>    		}
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -					    hv_current_partition_id, 1);
>>> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>>>    	} while (!ret);
>>>    	return ret;
>>> @@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
>>>    			return ret;
>>>    		}
>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -					    hv_current_partition_id, 1);
>>> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>>>    		if (ret)
>>>    			return ret;
>>>    	} while (!ret);
>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>> index f4697497f83e..5fc572e31cd7 100644
>>> --- a/drivers/hv/mshv_root_main.c
>>> +++ b/drivers/hv/mshv_root_main.c
>>> @@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
>>>    		if (!hv_result_oom(status))
>>>    			ret = hv_result_to_errno(status);
>>>    		else
>>> -			ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>> -						    pt_id, 1);
>>> +			ret = hv_deposit_memory(pt_id, status);
>>>    	} while (!ret);
>>>    	args.status = hv_result(status);
>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>> index b73352a7fc9e..c8e8976839f8 100644
>>> --- a/include/asm-generic/mshyperv.h
>>> +++ b/include/asm-generic/mshyperv.h
>>> @@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
>>>    }
>>>    bool hv_result_oom(u64 status);
>>> +int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
>>>    int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>>>    int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>>>    int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>>> @@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
>>>    static inline bool hv_l1vh_partition(void) { return false; }
>>>    static inline bool hv_parent_partition(void) { return false; }
>>>    static inline bool hv_result_oom(u64 status) { return false; }
>>> +static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
>>> +{
>>> +	return -EOPNOTSUPP;
>>> +}
>>>    static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>    {
>>>    	return -EOPNOTSUPP;
>>> @@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
>>>    }
>>>    #endif /* CONFIG_MSHV_ROOT */
>>> +static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>> +{
>>> +	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>>> +}
>>> +
>>>    #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>>    u8 __init get_vtl(void);
>>>    #else
>>>
>>>


^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-27  1:39 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXgFFz7YuJJQabyp@skinsburskii.localdomain>

On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>> hypervisor deposited pages.
>>>>>>>
>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>> management is implemented.
>>>>>>>
>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>> ---
>>>>>>>      drivers/hv/Kconfig |    1 +
>>>>>>>      1 file changed, 1 insertion(+)
>>>>>>>
>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>      	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>      	# no particular order, making it impossible to reassemble larger pages
>>>>>>>      	depends on PAGE_SIZE_4KB
>>>>>>> +	depends on !KEXEC
>>>>>>>      	select EVENTFD
>>>>>>>      	select VIRT_XFER_TO_GUEST_WORK
>>>>>>>      	select HMM_MIRROR
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>> and it was fine?
>>>>>>
>>>>>
>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>> will be affected as well.
>>>>
>>>> So not sure I understand the reason for this patch. We can just block
>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>> further developement would be without a ver important and major feature,
>>>> right?
>>>
>>> This is an option. But until it's implemented and merged, a user mshv
>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>> The system may crash at any time after kexec, depending on whether the
>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>> bad user experience.
>>
>> I understand that. But with this we cannot collect core and debug any
>> crashes. I was thinking there would be a quick way to prohibit kexec
>> for update via notifier or some other quick hack. Did you already
>> explore that and didn't find anything, hence this?
>>
> 
> This quick hack you mention isn't quick in the upstream kernel as there
> is no hook to interrupt kexec process except the live update one.

That's the one we want to interrupt and block right? crash kexec
is ok and should be allowed. We can document we don't support kexec
for update for now.

> I sent an RFC for that one but given todays conversation details is
> won't be accepted as is.

Are you taking about this?

         "mshv: Add kexec safety for deposited pages"

> Making mshv mutually exclusive with kexec is the only viable option for
> now given time constraints.
> It is intended to be replaced with proper page lifecycle management in
> the future.

Yeah, that could take a long time and imo we cannot just disable KEXEC
completely. What we want is just block kexec for updates from some
mshv file for now, we an print during boot that kexec for updates is
not supported on mshv. Hope that makes sense.

Thanks,
-Mukesh



> Thanks,
> Stanislav
> 
>> Thanks,
>> -Mukesh
>>
>>> Therefor it should be explicitly forbidden as it's essentially not
>>> supported yet.
>>>
>>> Thanks,
>>> Stanislav
>>>
>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>> Thanks,
>>>>>> -Mukesh


^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-27  0:21 UTC (permalink / raw)
  To: Mukesh R
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <2b42997d-7cc0-56ba-e1ca-a8640ce71ea9@linux.microsoft.com>

On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> > > On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > > > hypervisor deposited pages.
> > > > > > 
> > > > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > > > management is implemented.
> > > > > > 
> > > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > > ---
> > > > > >     drivers/hv/Kconfig |    1 +
> > > > > >     1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > > > --- a/drivers/hv/Kconfig
> > > > > > +++ b/drivers/hv/Kconfig
> > > > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > > > >     	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > > > >     	# no particular order, making it impossible to reassemble larger pages
> > > > > >     	depends on PAGE_SIZE_4KB
> > > > > > +	depends on !KEXEC
> > > > > >     	select EVENTFD
> > > > > >     	select VIRT_XFER_TO_GUEST_WORK
> > > > > >     	select HMM_MIRROR
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > > > implying that crash dump might be involved. Or did you test kdump
> > > > > and it was fine?
> > > > > 
> > > > 
> > > > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > > > will be affected as well.
> > > 
> > > So not sure I understand the reason for this patch. We can just block
> > > kexec if there are any VMs running, right? Doing this would mean any
> > > further developement would be without a ver important and major feature,
> > > right?
> > 
> > This is an option. But until it's implemented and merged, a user mshv
> > driver gets into a situation where kexec is broken in a non-obvious way.
> > The system may crash at any time after kexec, depending on whether the
> > new kernel touches the pages deposited to hypervisor or not. This is a
> > bad user experience.
> 
> I understand that. But with this we cannot collect core and debug any
> crashes. I was thinking there would be a quick way to prohibit kexec
> for update via notifier or some other quick hack. Did you already
> explore that and didn't find anything, hence this?
> 

This quick hack you mention isn't quick in the upstream kernel as there
is no hook to interrupt kexec process except the live update one.
I sent an RFC for that one but given todays conversation details is
won't be accepted as is.
Making mshv mutually exclusive with kexec is the only viable option for
now given time constraints.
It is intended to be replaced with proper page lifecycle management in
the future.

Thanks,
Stanislav

> Thanks,
> -Mukesh
> 
> > Therefor it should be explicitly forbidden as it's essentially not
> > supported yet.
> > 
> > Thanks,
> > Stanislav
> > 
> > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > Thanks,
> > > > > -Mukesh

^ permalink raw reply

* Re: [PATCH v5 0/2] Add VMBus message connection ID support via DeviceTree
From: Hardik Garg @ 2026-01-27  0:12 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, krzk+dt, robh, conor+dt, mhklinux
  Cc: devicetree, linux-hyperv, linux-kernel, ssengar, longli,
	Naman Jain, hargar, jacob.pan
In-Reply-To: <58cb22cb-b0c8-4694-b9e4-971aa7f0f972@linux.microsoft.com>

Just a gentle ping on this patch in case it got lost. I want to
check if anyone had a chance to look at it, or if there is
anything I should update or clarify.

I’ve also noticed a formatting issue in patch 2/2, which I
will address when sending a v6 along with any feedback.



Thanks,
Hardik

On 12/23/2025 3:05 PM, Hardik Garg wrote:
> This patch series adds support for reading the VMBus message
> connection ID from DeviceTree. The connection-id determines which
> hypervisor communication channel the guest should use to talk to
> the VMBus host.
>
> Changes in v5:
> - Updated subject line and commit description to clarify what
>   connection ID is and why DeviceTree support is required
> - Addressed reviewer feedback about zero handling and binding
>   constraints
> - Revised binding description to clarify version-based selection
>   instead of using "defaults" language
> - Fixed checkpatch warnings (indentation and alignment) 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-26 23:07 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXfSDm-4BjPPZMNu@skinsburskii.localdomain>

On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>> hypervisor deposited pages.
>>>>>
>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>> management is implemented.
>>>>>
>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>> ---
>>>>>     drivers/hv/Kconfig |    1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>> --- a/drivers/hv/Kconfig
>>>>> +++ b/drivers/hv/Kconfig
>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>     	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>     	# no particular order, making it impossible to reassemble larger pages
>>>>>     	depends on PAGE_SIZE_4KB
>>>>> +	depends on !KEXEC
>>>>>     	select EVENTFD
>>>>>     	select VIRT_XFER_TO_GUEST_WORK
>>>>>     	select HMM_MIRROR
>>>>>
>>>>>
>>>>
>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>> implying that crash dump might be involved. Or did you test kdump
>>>> and it was fine?
>>>>
>>>
>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>> will be affected as well.
>>
>> So not sure I understand the reason for this patch. We can just block
>> kexec if there are any VMs running, right? Doing this would mean any
>> further developement would be without a ver important and major feature,
>> right?
> 
> This is an option. But until it's implemented and merged, a user mshv
> driver gets into a situation where kexec is broken in a non-obvious way.
> The system may crash at any time after kexec, depending on whether the
> new kernel touches the pages deposited to hypervisor or not. This is a
> bad user experience.

I understand that. But with this we cannot collect core and debug any
crashes. I was thinking there would be a quick way to prohibit kexec
for update via notifier or some other quick hack. Did you already
explore that and didn't find anything, hence this?

Thanks,
-Mukesh

> Therefor it should be explicitly forbidden as it's essentially not
> supported yet.
> 
> Thanks,
> Stanislav
> 
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> -Mukesh


^ permalink raw reply

* [PATCH v5 7/7] mshv: Add debugfs to view hypervisor statistics
From: Nuno Das Neves @ 2026-01-26 20:56 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves, Jinank Jain
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

Introduce a debugfs interface to expose root and child partition stats
when running with mshv_root.

Create a debugfs directory "mshv" containing 'stats' files organized by
type and id. A stats file contains a number of counters depending on
its type. e.g. an excerpt from a VP stats file:

TotalRunTime                  : 1997602722
HypervisorRunTime             : 649671371
RemoteNodeRunTime             : 0
NormalizedRunTime             : 1997602721
IdealCpu                      : 0
HypercallsCount               : 1708169
HypercallsTime                : 111914774
PageInvalidationsCount        : 0
PageInvalidationsTime         : 0

On a root partition with some active child partitions, the entire
directory structure may look like:

mshv/
  stats             # hypervisor stats
  lp/               # logical processors
    0/              # LP id
      stats         # LP 0 stats
    1/
    2/
    3/
  partition/        # partition stats
    1/              # root partition id
      stats         # root partition stats
      vp/           # root virtual processors
        0/          # root VP id
          stats     # root VP 0 stats
        1/
        2/
        3/
    42/             # child partition id
      stats         # child partition stats
      vp/           # child VPs
        0/          # child VP id
          stats     # child VP 0 stats
        1/
    43/
    55/

On L1VH, some stats are not present as it does not own the hardware
like the root partition does:
- The hypervisor and lp stats are not present
- L1VH's partition directory is named "self" because it can't get its
  own id
- Some of L1VH's partition and VP stats fields are not populated, because
  it can't map its own HV_STATS_AREA_PARENT page.

Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Co-developed-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Makefile         |   1 +
 drivers/hv/mshv_debugfs.c   | 726 ++++++++++++++++++++++++++++++++++++
 drivers/hv/mshv_root.h      |  34 ++
 drivers/hv/mshv_root_main.c |  26 +-
 4 files changed, 785 insertions(+), 2 deletions(-)
 create mode 100644 drivers/hv/mshv_debugfs.c

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index a49f93c2d245..2593711c3628 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -15,6 +15,7 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
 	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
 mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
new file mode 100644
index 000000000000..4553163e8665
--- /dev/null
+++ b/drivers/hv/mshv_debugfs.c
@@ -0,0 +1,726 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * The /sys/kernel/debug/mshv directory contents.
+ * Contains various statistics data, provided by the hypervisor.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+#include <linux/debugfs.h>
+#include <linux/stringify.h>
+#include <asm/mshyperv.h>
+#include <linux/slab.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+/* Ensure this file is not used elsewhere by accident */
+#define MSHV_DEBUGFS_C
+#include "mshv_debugfs_counters.c"
+
+#define U32_BUF_SZ 11
+#define U64_BUF_SZ 21
+/* Only support SELF and PARENT areas */
+#define NUM_STATS_AREAS 2
+static_assert(HV_STATS_AREA_SELF == 0 && HV_STATS_AREA_PARENT == 1,
+	      "SELF and PARENT areas must be usable as indices into an array of size NUM_STATS_AREAS");
+/* HV_HYPERVISOR_COUNTER */
+#define HV_HYPERVISOR_COUNTER_LOGICAL_PROCESSORS 1
+
+static struct dentry *mshv_debugfs;
+static struct dentry *mshv_debugfs_partition;
+static struct dentry *mshv_debugfs_lp;
+static struct dentry **parent_vp_stats;
+static struct dentry *parent_partition_stats;
+
+static u64 mshv_lps_count;
+static struct hv_stats_page **mshv_lps_stats;
+
+static int lp_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page *stats = m->private;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_lp_counters); idx++) {
+		char *name = hv_lp_counters[idx];
+
+		if (!name)
+			continue;
+		seq_printf(m, "%-32s: %llu\n", name, stats->data[idx]);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(lp_stats);
+
+static void mshv_lp_stats_unmap(u32 lp_index)
+{
+	union hv_stats_object_identity identity = {
+		.lp.lp_index = lp_index,
+		.lp.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	int err;
+
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR,
+				  mshv_lps_stats[lp_index], &identity);
+	if (err)
+		pr_err("%s: failed to unmap logical processor %u stats, err: %d\n",
+		       __func__, lp_index, err);
+
+	mshv_lps_stats[lp_index] = NULL;
+}
+
+static struct hv_stats_page * __init mshv_lp_stats_map(u32 lp_index)
+{
+	union hv_stats_object_identity identity = {
+		.lp.lp_index = lp_index,
+		.lp.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	struct hv_stats_page *stats;
+	int err;
+
+	err = hv_map_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR, &identity,
+				&stats);
+	if (err) {
+		pr_err("%s: failed to map logical processor %u stats, err: %d\n",
+		       __func__, lp_index, err);
+		return ERR_PTR(err);
+	}
+	mshv_lps_stats[lp_index] = stats;
+
+	return stats;
+}
+
+static struct hv_stats_page * __init lp_debugfs_stats_create(u32 lp_index,
+							     struct dentry *parent)
+{
+	struct dentry *dentry;
+	struct hv_stats_page *stats;
+
+	stats = mshv_lp_stats_map(lp_index);
+	if (IS_ERR(stats))
+		return stats;
+
+	dentry = debugfs_create_file("stats", 0400, parent,
+				     stats, &lp_stats_fops);
+	if (IS_ERR(dentry)) {
+		mshv_lp_stats_unmap(lp_index);
+		return ERR_CAST(dentry);
+	}
+	return stats;
+}
+
+static int __init lp_debugfs_create(u32 lp_index, struct dentry *parent)
+{
+	struct dentry *idx;
+	char lp_idx_str[U32_BUF_SZ];
+	struct hv_stats_page *stats;
+	int err;
+
+	sprintf(lp_idx_str, "%u", lp_index);
+
+	idx = debugfs_create_dir(lp_idx_str, parent);
+	if (IS_ERR(idx))
+		return PTR_ERR(idx);
+
+	stats = lp_debugfs_stats_create(lp_index, idx);
+	if (IS_ERR(stats)) {
+		err = PTR_ERR(stats);
+		goto remove_debugfs_lp_idx;
+	}
+
+	return 0;
+
+remove_debugfs_lp_idx:
+	debugfs_remove_recursive(idx);
+	return err;
+}
+
+static void mshv_debugfs_lp_remove(void)
+{
+	int lp_index;
+
+	debugfs_remove_recursive(mshv_debugfs_lp);
+
+	for (lp_index = 0; lp_index < mshv_lps_count; lp_index++)
+		mshv_lp_stats_unmap(lp_index);
+
+	kfree(mshv_lps_stats);
+	mshv_lps_stats = NULL;
+}
+
+static int __init mshv_debugfs_lp_create(struct dentry *parent)
+{
+	struct dentry *lp_dir;
+	int err, lp_index;
+
+	mshv_lps_stats = kcalloc(mshv_lps_count,
+				 sizeof(*mshv_lps_stats),
+				 GFP_KERNEL_ACCOUNT);
+
+	if (!mshv_lps_stats)
+		return -ENOMEM;
+
+	lp_dir = debugfs_create_dir("lp", parent);
+	if (IS_ERR(lp_dir)) {
+		err = PTR_ERR(lp_dir);
+		goto free_lp_stats;
+	}
+
+	for (lp_index = 0; lp_index < mshv_lps_count; lp_index++) {
+		err = lp_debugfs_create(lp_index, lp_dir);
+		if (err)
+			goto remove_debugfs_lps;
+	}
+
+	mshv_debugfs_lp = lp_dir;
+
+	return 0;
+
+remove_debugfs_lps:
+	for (lp_index -= 1; lp_index >= 0; lp_index--)
+		mshv_lp_stats_unmap(lp_index);
+	debugfs_remove_recursive(lp_dir);
+free_lp_stats:
+	kfree(mshv_lps_stats);
+	mshv_lps_stats = NULL;
+
+	return err;
+}
+
+static int vp_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page **pstats = m->private;
+	u64 parent_val, self_val;
+	int idx;
+
+	/*
+	 * For VP and partition stats, there may be two stats areas mapped,
+	 * SELF and PARENT. These refer to the privilege level of the data in
+	 * each page. Some fields may be 0 in SELF and nonzero in PARENT, or
+	 * vice versa.
+	 *
+	 * Hence, prioritize printing from the PARENT page (more privileged
+	 * data), but use the value from the SELF page if the PARENT value is
+	 * 0.
+	 */
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_vp_counters); idx++) {
+		char *name = hv_vp_counters[idx];
+
+		if (!name)
+			continue;
+
+		parent_val = pstats[HV_STATS_AREA_PARENT]->data[idx];
+		self_val = pstats[HV_STATS_AREA_SELF]->data[idx];
+		seq_printf(m, "%-43s: %llu\n", name,
+			   parent_val ? parent_val : self_val);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(vp_stats);
+
+static void vp_debugfs_remove(struct dentry *vp_stats)
+{
+	debugfs_remove_recursive(vp_stats->d_parent);
+}
+
+static int vp_debugfs_create(u64 partition_id, u32 vp_index,
+			     struct hv_stats_page **pstats,
+			     struct dentry **vp_stats_ptr,
+			     struct dentry *parent)
+{
+	struct dentry *vp_idx_dir, *d;
+	char vp_idx_str[U32_BUF_SZ];
+	int err;
+
+	sprintf(vp_idx_str, "%u", vp_index);
+
+	vp_idx_dir = debugfs_create_dir(vp_idx_str, parent);
+	if (IS_ERR(vp_idx_dir))
+		return PTR_ERR(vp_idx_dir);
+
+	d = debugfs_create_file("stats", 0400, vp_idx_dir,
+				     pstats, &vp_stats_fops);
+	if (IS_ERR(d)) {
+		err = PTR_ERR(d);
+		goto remove_debugfs_vp_idx;
+	}
+
+	*vp_stats_ptr = d;
+
+	return 0;
+
+remove_debugfs_vp_idx:
+	debugfs_remove_recursive(vp_idx_dir);
+	return err;
+}
+
+static int partition_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page **pstats = m->private;
+	u64 parent_val, self_val;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_partition_counters); idx++) {
+		char *name = hv_partition_counters[idx];
+
+		if (!name)
+			continue;
+
+		parent_val = pstats[HV_STATS_AREA_PARENT]->data[idx];
+		self_val = pstats[HV_STATS_AREA_SELF]->data[idx];
+		seq_printf(m, "%-37s: %llu\n", name,
+			   parent_val ? parent_val : self_val);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(partition_stats);
+
+static void mshv_partition_stats_unmap(u64 partition_id,
+				       struct hv_stats_page *stats_page,
+				       enum hv_stats_area_type stats_area_type)
+{
+	union hv_stats_object_identity identity = {
+		.partition.partition_id = partition_id,
+		.partition.stats_area_type = stats_area_type,
+	};
+	int err;
+
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_PARTITION, stats_page,
+				  &identity);
+	if (err)
+		pr_err("%s: failed to unmap partition %lld %s stats, err: %d\n",
+		       __func__, partition_id,
+		       (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
+		       err);
+}
+
+static struct hv_stats_page *mshv_partition_stats_map(u64 partition_id,
+						      enum hv_stats_area_type stats_area_type)
+{
+	union hv_stats_object_identity identity = {
+		.partition.partition_id = partition_id,
+		.partition.stats_area_type = stats_area_type,
+	};
+	struct hv_stats_page *stats;
+	int err;
+
+	err = hv_map_stats_page(HV_STATS_OBJECT_PARTITION, &identity, &stats);
+	if (err) {
+		pr_err("%s: failed to map partition %lld %s stats, err: %d\n",
+		       __func__, partition_id,
+		       (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
+		       err);
+		return ERR_PTR(err);
+	}
+	return stats;
+}
+
+static int mshv_debugfs_partition_stats_create(u64 partition_id,
+					    struct dentry **partition_stats_ptr,
+					    struct dentry *parent)
+{
+	struct dentry *dentry;
+	struct hv_stats_page **pstats;
+	int err;
+
+	pstats = kcalloc(NUM_STATS_AREAS, sizeof(struct hv_stats_page *),
+			 GFP_KERNEL_ACCOUNT);
+	if (!pstats)
+		return -ENOMEM;
+
+	pstats[HV_STATS_AREA_SELF] = mshv_partition_stats_map(partition_id,
+							      HV_STATS_AREA_SELF);
+	if (IS_ERR(pstats[HV_STATS_AREA_SELF])) {
+		err = PTR_ERR(pstats[HV_STATS_AREA_SELF]);
+		goto cleanup;
+	}
+
+	/*
+	 * L1VH partition cannot access its partition stats in parent area.
+	 */
+	if (is_l1vh_parent(partition_id)) {
+		pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
+	} else {
+		pstats[HV_STATS_AREA_PARENT] = mshv_partition_stats_map(partition_id,
+									HV_STATS_AREA_PARENT);
+		if (IS_ERR(pstats[HV_STATS_AREA_PARENT])) {
+			err = PTR_ERR(pstats[HV_STATS_AREA_PARENT]);
+			goto unmap_self;
+		}
+		if (!pstats[HV_STATS_AREA_PARENT])
+			pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
+	}
+
+	dentry = debugfs_create_file("stats", 0400, parent,
+				     pstats, &partition_stats_fops);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto unmap_partition_stats;
+	}
+
+	*partition_stats_ptr = dentry;
+	return 0;
+
+unmap_partition_stats:
+	if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF])
+		mshv_partition_stats_unmap(partition_id, pstats[HV_STATS_AREA_PARENT],
+					   HV_STATS_AREA_PARENT);
+unmap_self:
+	mshv_partition_stats_unmap(partition_id, pstats[HV_STATS_AREA_SELF],
+				   HV_STATS_AREA_SELF);
+cleanup:
+	kfree(pstats);
+	return err;
+}
+
+static void partition_debugfs_remove(u64 partition_id, struct dentry *dentry)
+{
+	struct hv_stats_page **pstats = NULL;
+
+	pstats = dentry->d_inode->i_private;
+
+	debugfs_remove_recursive(dentry->d_parent);
+
+	if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF]) {
+		mshv_partition_stats_unmap(partition_id,
+					   pstats[HV_STATS_AREA_PARENT],
+					   HV_STATS_AREA_PARENT);
+	}
+
+	mshv_partition_stats_unmap(partition_id,
+				   pstats[HV_STATS_AREA_SELF],
+				   HV_STATS_AREA_SELF);
+
+	kfree(pstats);
+}
+
+static int partition_debugfs_create(u64 partition_id,
+				    struct dentry **vp_dir_ptr,
+				    struct dentry **partition_stats_ptr,
+				    struct dentry *parent)
+{
+	char part_id_str[U64_BUF_SZ];
+	struct dentry *part_id_dir, *vp_dir;
+	int err;
+
+	if (is_l1vh_parent(partition_id))
+		sprintf(part_id_str, "self");
+	else
+		sprintf(part_id_str, "%llu", partition_id);
+
+	part_id_dir = debugfs_create_dir(part_id_str, parent);
+	if (IS_ERR(part_id_dir))
+		return PTR_ERR(part_id_dir);
+
+	vp_dir = debugfs_create_dir("vp", part_id_dir);
+	if (IS_ERR(vp_dir)) {
+		err = PTR_ERR(vp_dir);
+		goto remove_debugfs_partition_id;
+	}
+
+	err = mshv_debugfs_partition_stats_create(partition_id,
+						  partition_stats_ptr,
+						  part_id_dir);
+	if (err)
+		goto remove_debugfs_partition_id;
+
+	*vp_dir_ptr = vp_dir;
+
+	return 0;
+
+remove_debugfs_partition_id:
+	debugfs_remove_recursive(part_id_dir);
+	return err;
+}
+
+static void parent_vp_debugfs_remove(u32 vp_index,
+				     struct dentry *vp_stats_ptr)
+{
+	struct hv_stats_page **pstats;
+
+	pstats = vp_stats_ptr->d_inode->i_private;
+	vp_debugfs_remove(vp_stats_ptr);
+	mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
+	kfree(pstats);
+}
+
+static void mshv_debugfs_parent_partition_remove(void)
+{
+	int idx;
+
+	for_each_online_cpu(idx)
+		parent_vp_debugfs_remove(hv_vp_index[idx],
+					 parent_vp_stats[idx]);
+
+	partition_debugfs_remove(hv_current_partition_id,
+				 parent_partition_stats);
+	kfree(parent_vp_stats);
+	parent_vp_stats = NULL;
+	parent_partition_stats = NULL;
+}
+
+static int __init parent_vp_debugfs_create(u32 vp_index,
+					   struct dentry **vp_stats_ptr,
+					   struct dentry *parent)
+{
+	struct hv_stats_page **pstats;
+	int err;
+
+	pstats = kcalloc(NUM_STATS_AREAS, sizeof(struct hv_stats_page *),
+			 GFP_KERNEL_ACCOUNT);
+	if (!pstats)
+		return -ENOMEM;
+
+	err = mshv_vp_stats_map(hv_current_partition_id, vp_index, pstats);
+	if (err)
+		goto cleanup;
+
+	err = vp_debugfs_create(hv_current_partition_id, vp_index, pstats,
+				vp_stats_ptr, parent);
+	if (err)
+		goto unmap_vp_stats;
+
+	return 0;
+
+unmap_vp_stats:
+	mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
+cleanup:
+	kfree(pstats);
+	return err;
+}
+
+static int __init mshv_debugfs_parent_partition_create(void)
+{
+	struct dentry *vp_dir;
+	int err, idx, i;
+
+	mshv_debugfs_partition = debugfs_create_dir("partition",
+						     mshv_debugfs);
+	if (IS_ERR(mshv_debugfs_partition))
+		return PTR_ERR(mshv_debugfs_partition);
+
+	err = partition_debugfs_create(hv_current_partition_id,
+				       &vp_dir,
+				       &parent_partition_stats,
+				       mshv_debugfs_partition);
+	if (err)
+		goto remove_debugfs_partition;
+
+	parent_vp_stats = kcalloc(nr_cpu_ids, sizeof(*parent_vp_stats),
+				  GFP_KERNEL);
+	if (!parent_vp_stats) {
+		err = -ENOMEM;
+		goto remove_debugfs_partition;
+	}
+
+	for_each_online_cpu(idx) {
+		err = parent_vp_debugfs_create(hv_vp_index[idx],
+					       &parent_vp_stats[idx],
+					       vp_dir);
+		if (err)
+			goto remove_debugfs_partition_vp;
+	}
+
+	return 0;
+
+remove_debugfs_partition_vp:
+	for_each_online_cpu(i) {
+		if (i >= idx)
+			break;
+		parent_vp_debugfs_remove(i, parent_vp_stats[i]);
+	}
+	partition_debugfs_remove(hv_current_partition_id,
+				 parent_partition_stats);
+
+	kfree(parent_vp_stats);
+	parent_vp_stats = NULL;
+	parent_partition_stats = NULL;
+
+remove_debugfs_partition:
+	debugfs_remove_recursive(mshv_debugfs_partition);
+	mshv_debugfs_partition = NULL;
+	return err;
+}
+
+static int hv_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page *stats = m->private;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_hypervisor_counters); idx++) {
+		char *name = hv_hypervisor_counters[idx];
+
+		if (!name)
+			continue;
+		seq_printf(m, "%-27s: %llu\n", name, stats->data[idx]);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(hv_stats);
+
+static void mshv_hv_stats_unmap(void)
+{
+	union hv_stats_object_identity identity = {
+		.hv.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	int err;
+
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_HYPERVISOR, NULL, &identity);
+	if (err)
+		pr_err("%s: failed to unmap hypervisor stats: %d\n",
+		       __func__, err);
+}
+
+static void * __init mshv_hv_stats_map(void)
+{
+	union hv_stats_object_identity identity = {
+		.hv.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	struct hv_stats_page *stats;
+	int err;
+
+	err = hv_map_stats_page(HV_STATS_OBJECT_HYPERVISOR, &identity, &stats);
+	if (err) {
+		pr_err("%s: failed to map hypervisor stats: %d\n",
+		       __func__, err);
+		return ERR_PTR(err);
+	}
+	return stats;
+}
+
+static int __init mshv_debugfs_hv_stats_create(struct dentry *parent)
+{
+	struct dentry *dentry;
+	u64 *stats;
+	int err;
+
+	stats = mshv_hv_stats_map();
+	if (IS_ERR(stats))
+		return PTR_ERR(stats);
+
+	dentry = debugfs_create_file("stats", 0400, parent,
+				     stats, &hv_stats_fops);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		pr_err("%s: failed to create hypervisor stats dentry: %d\n",
+		       __func__, err);
+		goto unmap_hv_stats;
+	}
+
+	mshv_lps_count = stats[HV_HYPERVISOR_COUNTER_LOGICAL_PROCESSORS];
+
+	return 0;
+
+unmap_hv_stats:
+	mshv_hv_stats_unmap();
+	return err;
+}
+
+int mshv_debugfs_vp_create(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+
+	if (!mshv_debugfs)
+		return 0;
+
+	return vp_debugfs_create(p->pt_id, vp->vp_index,
+				 vp->vp_stats_pages,
+				 &vp->vp_stats_dentry,
+				 p->pt_vp_dentry);
+}
+
+void mshv_debugfs_vp_remove(struct mshv_vp *vp)
+{
+	if (!mshv_debugfs)
+		return;
+
+	vp_debugfs_remove(vp->vp_stats_dentry);
+}
+
+int mshv_debugfs_partition_create(struct mshv_partition *partition)
+{
+	int err;
+
+	if (!mshv_debugfs)
+		return 0;
+
+	err = partition_debugfs_create(partition->pt_id,
+				       &partition->pt_vp_dentry,
+				       &partition->pt_stats_dentry,
+				       mshv_debugfs_partition);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+void mshv_debugfs_partition_remove(struct mshv_partition *partition)
+{
+	if (!mshv_debugfs)
+		return;
+
+	partition_debugfs_remove(partition->pt_id,
+				 partition->pt_stats_dentry);
+}
+
+int __init mshv_debugfs_init(void)
+{
+	int err;
+
+	mshv_debugfs = debugfs_create_dir("mshv", NULL);
+	if (IS_ERR(mshv_debugfs)) {
+		pr_err("%s: failed to create debugfs directory\n", __func__);
+		return PTR_ERR(mshv_debugfs);
+	}
+
+	if (hv_root_partition()) {
+		err = mshv_debugfs_hv_stats_create(mshv_debugfs);
+		if (err)
+			goto remove_mshv_dir;
+
+		err = mshv_debugfs_lp_create(mshv_debugfs);
+		if (err)
+			goto unmap_hv_stats;
+	}
+
+	err = mshv_debugfs_parent_partition_create();
+	if (err)
+		goto unmap_lp_stats;
+
+	return 0;
+
+unmap_lp_stats:
+	if (hv_root_partition()) {
+		mshv_debugfs_lp_remove();
+		mshv_debugfs_lp = NULL;
+	}
+unmap_hv_stats:
+	if (hv_root_partition())
+		mshv_hv_stats_unmap();
+remove_mshv_dir:
+	debugfs_remove_recursive(mshv_debugfs);
+	mshv_debugfs = NULL;
+	return err;
+}
+
+void mshv_debugfs_exit(void)
+{
+	mshv_debugfs_parent_partition_remove();
+
+	if (hv_root_partition()) {
+		mshv_debugfs_lp_remove();
+		mshv_debugfs_lp = NULL;
+		mshv_hv_stats_unmap();
+	}
+
+	debugfs_remove_recursive(mshv_debugfs);
+	mshv_debugfs = NULL;
+	mshv_debugfs_partition = NULL;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e4912b0618fa..7332d9af8373 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -52,6 +52,9 @@ struct mshv_vp {
 		unsigned int kicked_by_hv;
 		wait_queue_head_t vp_suspend_queue;
 	} run;
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+	struct dentry *vp_stats_dentry;
+#endif
 };
 
 #define vp_fmt(fmt) "p%lluvp%u: " fmt
@@ -136,6 +139,10 @@ struct mshv_partition {
 	u64 isolation_type;
 	bool import_completed;
 	bool pt_initialized;
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+	struct dentry *pt_stats_dentry;
+	struct dentry *pt_vp_dentry;
+#endif
 };
 
 #define pt_fmt(fmt) "p%llu: " fmt
@@ -327,6 +334,33 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
 				      void *property_value, size_t property_value_sz);
 
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+int __init mshv_debugfs_init(void);
+void mshv_debugfs_exit(void);
+
+int mshv_debugfs_partition_create(struct mshv_partition *partition);
+void mshv_debugfs_partition_remove(struct mshv_partition *partition);
+int mshv_debugfs_vp_create(struct mshv_vp *vp);
+void mshv_debugfs_vp_remove(struct mshv_vp *vp);
+#else
+static inline int __init mshv_debugfs_init(void)
+{
+	return 0;
+}
+static inline void mshv_debugfs_exit(void) { }
+
+static inline int mshv_debugfs_partition_create(struct mshv_partition *partition)
+{
+	return 0;
+}
+static inline void mshv_debugfs_partition_remove(struct mshv_partition *partition) { }
+static inline int mshv_debugfs_vp_create(struct mshv_vp *vp)
+{
+	return 0;
+}
+static inline void mshv_debugfs_vp_remove(struct mshv_vp *vp) { }
+#endif
+
 extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 414d9cee5252..3a43e41e16a1 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1095,6 +1095,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 
 	memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
 
+	ret = mshv_debugfs_vp_create(vp);
+	if (ret)
+		goto put_partition;
+
 	/*
 	 * Keep anon_inode_getfd last: it installs fd in the file struct and
 	 * thus makes the state accessible in user space.
@@ -1102,7 +1106,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
 			       O_RDWR | O_CLOEXEC);
 	if (ret < 0)
-		goto put_partition;
+		goto remove_debugfs_vp;
 
 	/* already exclusive with the partition mutex for all ioctls */
 	partition->pt_vp_count++;
@@ -1110,6 +1114,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 
 	return ret;
 
+remove_debugfs_vp:
+	mshv_debugfs_vp_remove(vp);
 put_partition:
 	mshv_partition_put(partition);
 free_vp:
@@ -1552,10 +1558,16 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
 	if (ret)
 		goto withdraw_mem;
 
+	ret = mshv_debugfs_partition_create(partition);
+	if (ret)
+		goto finalize_partition;
+
 	partition->pt_initialized = true;
 
 	return 0;
 
+finalize_partition:
+	hv_call_finalize_partition(partition->pt_id);
 withdraw_mem:
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
 
@@ -1735,6 +1747,7 @@ static void destroy_partition(struct mshv_partition *partition)
 			if (!vp)
 				continue;
 
+			mshv_debugfs_vp_remove(vp);
 			mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
 					    vp->vp_stats_pages);
 
@@ -1768,6 +1781,8 @@ static void destroy_partition(struct mshv_partition *partition)
 			partition->pt_vp_array[i] = NULL;
 		}
 
+		mshv_debugfs_partition_remove(partition);
+
 		/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
 		hv_call_finalize_partition(partition->pt_id);
 
@@ -2313,10 +2328,14 @@ static int __init mshv_parent_partition_init(void)
 
 	mshv_init_vmm_caps(dev);
 
-	ret = mshv_irqfd_wq_init();
+	ret = mshv_debugfs_init();
 	if (ret)
 		goto exit_partition;
 
+	ret = mshv_irqfd_wq_init();
+	if (ret)
+		goto exit_debugfs;
+
 	spin_lock_init(&mshv_root.pt_ht_lock);
 	hash_init(mshv_root.pt_htable);
 
@@ -2324,6 +2343,8 @@ static int __init mshv_parent_partition_init(void)
 
 	return 0;
 
+exit_debugfs:
+	mshv_debugfs_exit();
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
@@ -2340,6 +2361,7 @@ static void __exit mshv_parent_partition_exit(void)
 {
 	hv_setup_mshv_handler(NULL);
 	mshv_port_table_fini();
+	mshv_debugfs_exit();
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
 	if (hv_root_partition())
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 6/7] mshv: Add data for printing stats page counters
From: Nuno Das Neves @ 2026-01-26 20:56 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

Introduce mshv_debugfs_counters.c, containing static data
corresponding to HV_*_COUNTER enums in the hypervisor source.
Defining the enum members as an array instead makes more sense,
since it will be iterated over to print counter information to
debugfs.

Include hypervisor, logical processor, partition, and virtual
processor counters.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++++++++++++
 1 file changed, 490 insertions(+)
 create mode 100644 drivers/hv/mshv_debugfs_counters.c

diff --git a/drivers/hv/mshv_debugfs_counters.c b/drivers/hv/mshv_debugfs_counters.c
new file mode 100644
index 000000000000..838af4673dd1
--- /dev/null
+++ b/drivers/hv/mshv_debugfs_counters.c
@@ -0,0 +1,490 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Data for printing stats page counters via debugfs.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+/*
+ * For simplicity, this file is included directly in mshv_debugfs.c.
+ * If these are ever needed elsewhere they should be compiled separately.
+ * Ensure this file is not used twice by accident.
+ */
+#ifndef MSHV_DEBUGFS_C
+#error "This file should only be included in mshv_debugfs.c"
+#endif
+
+/* HV_HYPERVISOR_COUNTER */
+static char *hv_hypervisor_counters[] = {
+	[1] = "HvLogicalProcessors",
+	[2] = "HvPartitions",
+	[3] = "HvTotalPages",
+	[4] = "HvVirtualProcessors",
+	[5] = "HvMonitoredNotifications",
+	[6] = "HvModernStandbyEntries",
+	[7] = "HvPlatformIdleTransitions",
+	[8] = "HvHypervisorStartupCost",
+
+	[10] = "HvIOSpacePages",
+	[11] = "HvNonEssentialPagesForDump",
+	[12] = "HvSubsumedPages",
+};
+
+/* HV_CPU_COUNTER */
+static char *hv_lp_counters[] = {
+	[1] = "LpGlobalTime",
+	[2] = "LpTotalRunTime",
+	[3] = "LpHypervisorRunTime",
+	[4] = "LpHardwareInterrupts",
+	[5] = "LpContextSwitches",
+	[6] = "LpInterProcessorInterrupts",
+	[7] = "LpSchedulerInterrupts",
+	[8] = "LpTimerInterrupts",
+	[9] = "LpInterProcessorInterruptsSent",
+	[10] = "LpProcessorHalts",
+	[11] = "LpMonitorTransitionCost",
+	[12] = "LpContextSwitchTime",
+	[13] = "LpC1TransitionsCount",
+	[14] = "LpC1RunTime",
+	[15] = "LpC2TransitionsCount",
+	[16] = "LpC2RunTime",
+	[17] = "LpC3TransitionsCount",
+	[18] = "LpC3RunTime",
+	[19] = "LpRootVpIndex",
+	[20] = "LpIdleSequenceNumber",
+	[21] = "LpGlobalTscCount",
+	[22] = "LpActiveTscCount",
+	[23] = "LpIdleAccumulation",
+	[24] = "LpReferenceCycleCount0",
+	[25] = "LpActualCycleCount0",
+	[26] = "LpReferenceCycleCount1",
+	[27] = "LpActualCycleCount1",
+	[28] = "LpProximityDomainId",
+	[29] = "LpPostedInterruptNotifications",
+	[30] = "LpBranchPredictorFlushes",
+#if IS_ENABLED(CONFIG_X86_64)
+	[31] = "LpL1DataCacheFlushes",
+	[32] = "LpImmediateL1DataCacheFlushes",
+	[33] = "LpMbFlushes",
+	[34] = "LpCounterRefreshSequenceNumber",
+	[35] = "LpCounterRefreshReferenceTime",
+	[36] = "LpIdleAccumulationSnapshot",
+	[37] = "LpActiveTscCountSnapshot",
+	[38] = "LpHwpRequestContextSwitches",
+	[39] = "LpPlaceholder1",
+	[40] = "LpPlaceholder2",
+	[41] = "LpPlaceholder3",
+	[42] = "LpPlaceholder4",
+	[43] = "LpPlaceholder5",
+	[44] = "LpPlaceholder6",
+	[45] = "LpPlaceholder7",
+	[46] = "LpPlaceholder8",
+	[47] = "LpPlaceholder9",
+	[48] = "LpSchLocalRunListSize",
+	[49] = "LpReserveGroupId",
+	[50] = "LpRunningPriority",
+	[51] = "LpPerfmonInterruptCount",	
+#elif IS_ENABLED(CONFIG_ARM64)
+	[31] = "LpCounterRefreshSequenceNumber",
+	[32] = "LpCounterRefreshReferenceTime",
+	[33] = "LpIdleAccumulationSnapshot",
+	[34] = "LpActiveTscCountSnapshot",
+	[35] = "LpHwpRequestContextSwitches",
+	[36] = "LpPlaceholder2",
+	[37] = "LpPlaceholder3",
+	[38] = "LpPlaceholder4",
+	[39] = "LpPlaceholder5",
+	[40] = "LpPlaceholder6",
+	[41] = "LpPlaceholder7",
+	[42] = "LpPlaceholder8",
+	[43] = "LpPlaceholder9",
+	[44] = "LpSchLocalRunListSize",
+	[45] = "LpReserveGroupId",
+	[46] = "LpRunningPriority",	
+#endif
+};
+
+/* HV_PROCESS_COUNTER */
+static char *hv_partition_counters[] = {
+	[1] = "PtVirtualProcessors",
+
+	[3] = "PtTlbSize",
+	[4] = "PtAddressSpaces",
+	[5] = "PtDepositedPages",
+	[6] = "PtGpaPages",
+	[7] = "PtGpaSpaceModifications",
+	[8] = "PtVirtualTlbFlushEntires",
+	[9] = "PtRecommendedTlbSize",
+	[10] = "PtGpaPages4K",
+	[11] = "PtGpaPages2M",
+	[12] = "PtGpaPages1G",
+	[13] = "PtGpaPages512G",
+	[14] = "PtDevicePages4K",
+	[15] = "PtDevicePages2M",
+	[16] = "PtDevicePages1G",
+	[17] = "PtDevicePages512G",
+	[18] = "PtAttachedDevices",
+	[19] = "PtDeviceInterruptMappings",
+	[20] = "PtIoTlbFlushes",
+	[21] = "PtIoTlbFlushCost",
+	[22] = "PtDeviceInterruptErrors",
+	[23] = "PtDeviceDmaErrors",
+	[24] = "PtDeviceInterruptThrottleEvents",
+	[25] = "PtSkippedTimerTicks",
+	[26] = "PtPartitionId",
+#if IS_ENABLED(CONFIG_X86_64)
+	[27] = "PtNestedTlbSize",
+	[28] = "PtRecommendedNestedTlbSize",
+	[29] = "PtNestedTlbFreeListSize",
+	[30] = "PtNestedTlbTrimmedPages",
+	[31] = "PtPagesShattered",
+	[32] = "PtPagesRecombined",
+	[33] = "PtHwpRequestValue",
+	[34] = "PtAutoSuspendEnableTime",
+	[35] = "PtAutoSuspendTriggerTime",
+	[36] = "PtAutoSuspendDisableTime",
+	[37] = "PtPlaceholder1",
+	[38] = "PtPlaceholder2",
+	[39] = "PtPlaceholder3",
+	[40] = "PtPlaceholder4",
+	[41] = "PtPlaceholder5",
+	[42] = "PtPlaceholder6",
+	[43] = "PtPlaceholder7",
+	[44] = "PtPlaceholder8",
+	[45] = "PtHypervisorStateTransferGeneration",
+	[46] = "PtNumberofActiveChildPartitions",
+#elif IS_ENABLED(CONFIG_ARM64)
+	[27] = "PtHwpRequestValue",
+	[28] = "PtAutoSuspendEnableTime",
+	[29] = "PtAutoSuspendTriggerTime",
+	[30] = "PtAutoSuspendDisableTime",
+	[31] = "PtPlaceholder1",
+	[32] = "PtPlaceholder2",
+	[33] = "PtPlaceholder3",
+	[34] = "PtPlaceholder4",
+	[35] = "PtPlaceholder5",
+	[36] = "PtPlaceholder6",
+	[37] = "PtPlaceholder7",
+	[38] = "PtPlaceholder8",
+	[39] = "PtHypervisorStateTransferGeneration",
+	[40] = "PtNumberofActiveChildPartitions",
+#endif
+};
+
+/* HV_THREAD_COUNTER */
+static char *hv_vp_counters[] = {
+	[1] = "VpTotalRunTime",
+	[2] = "VpHypervisorRunTime",
+	[3] = "VpRemoteNodeRunTime",
+	[4] = "VpNormalizedRunTime",
+	[5] = "VpIdealCpu",
+
+	[7] = "VpHypercallsCount",
+	[8] = "VpHypercallsTime",
+#if IS_ENABLED(CONFIG_X86_64)
+	[9] = "VpPageInvalidationsCount",
+	[10] = "VpPageInvalidationsTime",
+	[11] = "VpControlRegisterAccessesCount",
+	[12] = "VpControlRegisterAccessesTime",
+	[13] = "VpIoInstructionsCount",
+	[14] = "VpIoInstructionsTime",
+	[15] = "VpHltInstructionsCount",
+	[16] = "VpHltInstructionsTime",
+	[17] = "VpMwaitInstructionsCount",
+	[18] = "VpMwaitInstructionsTime",
+	[19] = "VpCpuidInstructionsCount",
+	[20] = "VpCpuidInstructionsTime",
+	[21] = "VpMsrAccessesCount",
+	[22] = "VpMsrAccessesTime",
+	[23] = "VpOtherInterceptsCount",
+	[24] = "VpOtherInterceptsTime",
+	[25] = "VpExternalInterruptsCount",
+	[26] = "VpExternalInterruptsTime",
+	[27] = "VpPendingInterruptsCount",
+	[28] = "VpPendingInterruptsTime",
+	[29] = "VpEmulatedInstructionsCount",
+	[30] = "VpEmulatedInstructionsTime",
+	[31] = "VpDebugRegisterAccessesCount",
+	[32] = "VpDebugRegisterAccessesTime",
+	[33] = "VpPageFaultInterceptsCount",
+	[34] = "VpPageFaultInterceptsTime",
+	[35] = "VpGuestPageTableMaps",
+	[36] = "VpLargePageTlbFills",
+	[37] = "VpSmallPageTlbFills",
+	[38] = "VpReflectedGuestPageFaults",
+	[39] = "VpApicMmioAccesses",
+	[40] = "VpIoInterceptMessages",
+	[41] = "VpMemoryInterceptMessages",
+	[42] = "VpApicEoiAccesses",
+	[43] = "VpOtherMessages",
+	[44] = "VpPageTableAllocations",
+	[45] = "VpLogicalProcessorMigrations",
+	[46] = "VpAddressSpaceEvictions",
+	[47] = "VpAddressSpaceSwitches",
+	[48] = "VpAddressDomainFlushes",
+	[49] = "VpAddressSpaceFlushes",
+	[50] = "VpGlobalGvaRangeFlushes",
+	[51] = "VpLocalGvaRangeFlushes",
+	[52] = "VpPageTableEvictions",
+	[53] = "VpPageTableReclamations",
+	[54] = "VpPageTableResets",
+	[55] = "VpPageTableValidations",
+	[56] = "VpApicTprAccesses",
+	[57] = "VpPageTableWriteIntercepts",
+	[58] = "VpSyntheticInterrupts",
+	[59] = "VpVirtualInterrupts",
+	[60] = "VpApicIpisSent",
+	[61] = "VpApicSelfIpisSent",
+	[62] = "VpGpaSpaceHypercalls",
+	[63] = "VpLogicalProcessorHypercalls",
+	[64] = "VpLongSpinWaitHypercalls",
+	[65] = "VpOtherHypercalls",
+	[66] = "VpSyntheticInterruptHypercalls",
+	[67] = "VpVirtualInterruptHypercalls",
+	[68] = "VpVirtualMmuHypercalls",
+	[69] = "VpVirtualProcessorHypercalls",
+	[70] = "VpHardwareInterrupts",
+	[71] = "VpNestedPageFaultInterceptsCount",
+	[72] = "VpNestedPageFaultInterceptsTime",
+	[73] = "VpPageScans",
+	[74] = "VpLogicalProcessorDispatches",
+	[75] = "VpWaitingForCpuTime",
+	[76] = "VpExtendedHypercalls",
+	[77] = "VpExtendedHypercallInterceptMessages",
+	[78] = "VpMbecNestedPageTableSwitches",
+	[79] = "VpOtherReflectedGuestExceptions",
+	[80] = "VpGlobalIoTlbFlushes",
+	[81] = "VpGlobalIoTlbFlushCost",
+	[82] = "VpLocalIoTlbFlushes",
+	[83] = "VpLocalIoTlbFlushCost",
+	[84] = "VpHypercallsForwardedCount",
+	[85] = "VpHypercallsForwardingTime",
+	[86] = "VpPageInvalidationsForwardedCount",
+	[87] = "VpPageInvalidationsForwardingTime",
+	[88] = "VpControlRegisterAccessesForwardedCount",
+	[89] = "VpControlRegisterAccessesForwardingTime",
+	[90] = "VpIoInstructionsForwardedCount",
+	[91] = "VpIoInstructionsForwardingTime",
+	[92] = "VpHltInstructionsForwardedCount",
+	[93] = "VpHltInstructionsForwardingTime",
+	[94] = "VpMwaitInstructionsForwardedCount",
+	[95] = "VpMwaitInstructionsForwardingTime",
+	[96] = "VpCpuidInstructionsForwardedCount",
+	[97] = "VpCpuidInstructionsForwardingTime",
+	[98] = "VpMsrAccessesForwardedCount",
+	[99] = "VpMsrAccessesForwardingTime",
+	[100] = "VpOtherInterceptsForwardedCount",
+	[101] = "VpOtherInterceptsForwardingTime",
+	[102] = "VpExternalInterruptsForwardedCount",
+	[103] = "VpExternalInterruptsForwardingTime",
+	[104] = "VpPendingInterruptsForwardedCount",
+	[105] = "VpPendingInterruptsForwardingTime",
+	[106] = "VpEmulatedInstructionsForwardedCount",
+	[107] = "VpEmulatedInstructionsForwardingTime",
+	[108] = "VpDebugRegisterAccessesForwardedCount",
+	[109] = "VpDebugRegisterAccessesForwardingTime",
+	[110] = "VpPageFaultInterceptsForwardedCount",
+	[111] = "VpPageFaultInterceptsForwardingTime",
+	[112] = "VpVmclearEmulationCount",
+	[113] = "VpVmclearEmulationTime",
+	[114] = "VpVmptrldEmulationCount",
+	[115] = "VpVmptrldEmulationTime",
+	[116] = "VpVmptrstEmulationCount",
+	[117] = "VpVmptrstEmulationTime",
+	[118] = "VpVmreadEmulationCount",
+	[119] = "VpVmreadEmulationTime",
+	[120] = "VpVmwriteEmulationCount",
+	[121] = "VpVmwriteEmulationTime",
+	[122] = "VpVmxoffEmulationCount",
+	[123] = "VpVmxoffEmulationTime",
+	[124] = "VpVmxonEmulationCount",
+	[125] = "VpVmxonEmulationTime",
+	[126] = "VpNestedVMEntriesCount",
+	[127] = "VpNestedVMEntriesTime",
+	[128] = "VpNestedSLATSoftPageFaultsCount",
+	[129] = "VpNestedSLATSoftPageFaultsTime",
+	[130] = "VpNestedSLATHardPageFaultsCount",
+	[131] = "VpNestedSLATHardPageFaultsTime",
+	[132] = "VpInvEptAllContextEmulationCount",
+	[133] = "VpInvEptAllContextEmulationTime",
+	[134] = "VpInvEptSingleContextEmulationCount",
+	[135] = "VpInvEptSingleContextEmulationTime",
+	[136] = "VpInvVpidAllContextEmulationCount",
+	[137] = "VpInvVpidAllContextEmulationTime",
+	[138] = "VpInvVpidSingleContextEmulationCount",
+	[139] = "VpInvVpidSingleContextEmulationTime",
+	[140] = "VpInvVpidSingleAddressEmulationCount",
+	[141] = "VpInvVpidSingleAddressEmulationTime",
+	[142] = "VpNestedTlbPageTableReclamations",
+	[143] = "VpNestedTlbPageTableEvictions",
+	[144] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
+	[145] = "VpFlushGuestPhysicalAddressListHypercalls",
+	[146] = "VpPostedInterruptNotifications",
+	[147] = "VpPostedInterruptScans",
+	[148] = "VpTotalCoreRunTime",
+	[149] = "VpMaximumRunTime",
+	[150] = "VpHwpRequestContextSwitches",
+	[151] = "VpWaitingForCpuTimeBucket0",
+	[152] = "VpWaitingForCpuTimeBucket1",
+	[153] = "VpWaitingForCpuTimeBucket2",
+	[154] = "VpWaitingForCpuTimeBucket3",
+	[155] = "VpWaitingForCpuTimeBucket4",
+	[156] = "VpWaitingForCpuTimeBucket5",
+	[157] = "VpWaitingForCpuTimeBucket6",
+	[158] = "VpVmloadEmulationCount",
+	[159] = "VpVmloadEmulationTime",
+	[160] = "VpVmsaveEmulationCount",
+	[161] = "VpVmsaveEmulationTime",
+	[162] = "VpGifInstructionEmulationCount",
+	[163] = "VpGifInstructionEmulationTime",
+	[164] = "VpEmulatedErrataSvmInstructions",
+	[165] = "VpPlaceholder1",
+	[166] = "VpPlaceholder2",
+	[167] = "VpPlaceholder3",
+	[168] = "VpPlaceholder4",
+	[169] = "VpPlaceholder5",
+	[170] = "VpPlaceholder6",
+	[171] = "VpPlaceholder7",
+	[172] = "VpPlaceholder8",
+	[173] = "VpContentionTime",
+	[174] = "VpWakeUpTime",
+	[175] = "VpSchedulingPriority",
+	[176] = "VpRdpmcInstructionsCount",
+	[177] = "VpRdpmcInstructionsTime",
+	[178] = "VpPerfmonPmuMsrAccessesCount",
+	[179] = "VpPerfmonLbrMsrAccessesCount",
+	[180] = "VpPerfmonIptMsrAccessesCount",
+	[181] = "VpPerfmonInterruptCount",
+	[182] = "VpVtl1DispatchCount",
+	[183] = "VpVtl2DispatchCount",
+	[184] = "VpVtl2DispatchBucket0",
+	[185] = "VpVtl2DispatchBucket1",
+	[186] = "VpVtl2DispatchBucket2",
+	[187] = "VpVtl2DispatchBucket3",
+	[188] = "VpVtl2DispatchBucket4",
+	[189] = "VpVtl2DispatchBucket5",
+	[190] = "VpVtl2DispatchBucket6",
+	[191] = "VpVtl1RunTime",
+	[192] = "VpVtl2RunTime",
+	[193] = "VpIommuHypercalls",
+	[194] = "VpCpuGroupHypercalls",
+	[195] = "VpVsmHypercalls",
+	[196] = "VpEventLogHypercalls",
+	[197] = "VpDeviceDomainHypercalls",
+	[198] = "VpDepositHypercalls",
+	[199] = "VpSvmHypercalls",
+	[200] = "VpBusLockAcquisitionCount",
+	[201] = "VpLoadAvg",
+	[202] = "VpRootDispatchThreadBlocked",
+	[203] = "VpIdleCpuTime",
+	[204] = "VpWaitingForCpuTimeBucket7",
+	[205] = "VpWaitingForCpuTimeBucket8",
+	[206] = "VpWaitingForCpuTimeBucket9",
+	[207] = "VpWaitingForCpuTimeBucket10",
+	[208] = "VpWaitingForCpuTimeBucket11",
+	[209] = "VpWaitingForCpuTimeBucket12",
+	[210] = "VpHierarchicalSuspendTime",
+	[211] = "VpExpressSchedulingAttempts",
+	[212] = "VpExpressSchedulingCount",
+#elif IS_ENABLED(CONFIG_ARM64)
+	[9] = "VpSysRegAccessesCount",
+	[10] = "VpSysRegAccessesTime",
+	[11] = "VpSmcInstructionsCount",
+	[12] = "VpSmcInstructionsTime",
+	[13] = "VpOtherInterceptsCount",
+	[14] = "VpOtherInterceptsTime",
+	[15] = "VpExternalInterruptsCount",
+	[16] = "VpExternalInterruptsTime",
+	[17] = "VpPendingInterruptsCount",
+	[18] = "VpPendingInterruptsTime",
+	[19] = "VpGuestPageTableMaps",
+	[20] = "VpLargePageTlbFills",
+	[21] = "VpSmallPageTlbFills",
+	[22] = "VpReflectedGuestPageFaults",
+	[23] = "VpMemoryInterceptMessages",
+	[24] = "VpOtherMessages",
+	[25] = "VpLogicalProcessorMigrations",
+	[26] = "VpAddressDomainFlushes",
+	[27] = "VpAddressSpaceFlushes",
+	[28] = "VpSyntheticInterrupts",
+	[29] = "VpVirtualInterrupts",
+	[30] = "VpApicSelfIpisSent",
+	[31] = "VpGpaSpaceHypercalls",
+	[32] = "VpLogicalProcessorHypercalls",
+	[33] = "VpLongSpinWaitHypercalls",
+	[34] = "VpOtherHypercalls",
+	[35] = "VpSyntheticInterruptHypercalls",
+	[36] = "VpVirtualInterruptHypercalls",
+	[37] = "VpVirtualMmuHypercalls",
+	[38] = "VpVirtualProcessorHypercalls",
+	[39] = "VpHardwareInterrupts",
+	[40] = "VpNestedPageFaultInterceptsCount",
+	[41] = "VpNestedPageFaultInterceptsTime",
+	[42] = "VpLogicalProcessorDispatches",
+	[43] = "VpWaitingForCpuTime",
+	[44] = "VpExtendedHypercalls",
+	[45] = "VpExtendedHypercallInterceptMessages",
+	[46] = "VpMbecNestedPageTableSwitches",
+	[47] = "VpOtherReflectedGuestExceptions",
+	[48] = "VpGlobalIoTlbFlushes",
+	[49] = "VpGlobalIoTlbFlushCost",
+	[50] = "VpLocalIoTlbFlushes",
+	[51] = "VpLocalIoTlbFlushCost",
+	[52] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
+	[53] = "VpFlushGuestPhysicalAddressListHypercalls",
+	[54] = "VpPostedInterruptNotifications",
+	[55] = "VpPostedInterruptScans",
+	[56] = "VpTotalCoreRunTime",
+	[57] = "VpMaximumRunTime",
+	[58] = "VpWaitingForCpuTimeBucket0",
+	[59] = "VpWaitingForCpuTimeBucket1",
+	[60] = "VpWaitingForCpuTimeBucket2",
+	[61] = "VpWaitingForCpuTimeBucket3",
+	[62] = "VpWaitingForCpuTimeBucket4",
+	[63] = "VpWaitingForCpuTimeBucket5",
+	[64] = "VpWaitingForCpuTimeBucket6",
+	[65] = "VpHwpRequestContextSwitches",
+	[66] = "VpPlaceholder2",
+	[67] = "VpPlaceholder3",
+	[68] = "VpPlaceholder4",
+	[69] = "VpPlaceholder5",
+	[70] = "VpPlaceholder6",
+	[71] = "VpPlaceholder7",
+	[72] = "VpPlaceholder8",
+	[73] = "VpContentionTime",
+	[74] = "VpWakeUpTime",
+	[75] = "VpSchedulingPriority",
+	[76] = "VpVtl1DispatchCount",
+	[77] = "VpVtl2DispatchCount",
+	[78] = "VpVtl2DispatchBucket0",
+	[79] = "VpVtl2DispatchBucket1",
+	[80] = "VpVtl2DispatchBucket2",
+	[81] = "VpVtl2DispatchBucket3",
+	[82] = "VpVtl2DispatchBucket4",
+	[83] = "VpVtl2DispatchBucket5",
+	[84] = "VpVtl2DispatchBucket6",
+	[85] = "VpVtl1RunTime",
+	[86] = "VpVtl2RunTime",
+	[87] = "VpIommuHypercalls",
+	[88] = "VpCpuGroupHypercalls",
+	[89] = "VpVsmHypercalls",
+	[90] = "VpEventLogHypercalls",
+	[91] = "VpDeviceDomainHypercalls",
+	[92] = "VpDepositHypercalls",
+	[93] = "VpSvmHypercalls",
+	[94] = "VpLoadAvg",
+	[95] = "VpRootDispatchThreadBlocked",
+	[96] = "VpIdleCpuTime",
+	[97] = "VpWaitingForCpuTimeBucket7",
+	[98] = "VpWaitingForCpuTimeBucket8",
+	[99] = "VpWaitingForCpuTimeBucket9",
+	[100] = "VpWaitingForCpuTimeBucket10",
+	[101] = "VpWaitingForCpuTimeBucket11",
+	[102] = "VpWaitingForCpuTimeBucket12",
+	[103] = "VpHierarchicalSuspendTime",
+	[104] = "VpExpressSchedulingAttempts",
+	[105] = "VpExpressSchedulingCount",	
+#endif
+};
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 5/7] mshv: Update hv_stats_page definitions
From: Nuno Das Neves @ 2026-01-26 20:56 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

hv_stats_page belongs in hvhdk.h, move it there.

It does not require a union to access the data for different counters,
just use a single u64 array for simplicity and to match the Windows
definitions.

While at it, correct the ARM64 value for VpRootDispatchThreadBlocked.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c | 27 ++++++++-------------------
 include/hyperv/hvhdk.h      |  8 ++++++++
 2 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index fbfc9e7d9fa4..414d9cee5252 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -39,22 +39,12 @@ MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
 
-/* TODO move this to another file when debugfs code is added */
-enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */
-#if defined(CONFIG_X86)
-	VpRootDispatchThreadBlocked			= 202,
+/* HV_THREAD_COUNTER */
+#if defined(CONFIG_X86_64)
+#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 202
 #elif defined(CONFIG_ARM64)
-	VpRootDispatchThreadBlocked			= 94,
+#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 95
 #endif
-	VpStatsMaxCounter
-};
-
-struct hv_stats_page {
-	union {
-		u64 vp_cntrs[VpStatsMaxCounter];		/* VP counters */
-		u8 data[HV_HYP_PAGE_SIZE];
-	};
-} __packed;
 
 struct mshv_root mshv_root;
 
@@ -485,12 +475,11 @@ static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
 static bool mshv_vp_dispatch_thread_blocked(struct mshv_vp *vp)
 {
 	struct hv_stats_page **stats = vp->vp_stats_pages;
-	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->vp_cntrs;
-	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->vp_cntrs;
+	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->data;
+	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->data;
 
-	if (self_vp_cntrs[VpRootDispatchThreadBlocked])
-		return self_vp_cntrs[VpRootDispatchThreadBlocked];
-	return parent_vp_cntrs[VpRootDispatchThreadBlocked];
+	return parent_vp_cntrs[HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED] ||
+	       self_vp_cntrs[HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED];
 }
 
 static int
diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 469186df7826..ac501969105c 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -10,6 +10,14 @@
 #include "hvhdk_mini.h"
 #include "hvgdk.h"
 
+/*
+ * Hypervisor statistics page format
+ */
+struct hv_stats_page {
+	u64 data[HV_HYP_PAGE_SIZE / sizeof(u64)];
+} __packed;
+
+
 /* Bits for dirty mask of hv_vp_register_page */
 #define HV_X64_REGISTER_CLASS_GENERAL	0
 #define HV_X64_REGISTER_CLASS_IP	1
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 4/7] mshv: Always map child vp stats pages regardless of scheduler type
From: Nuno Das Neves @ 2026-01-26 20:56 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Currently vp->vp_stats_pages is only used by the root scheduler for fast
interrupt injection.

Soon, vp_stats_pages will also be needed for exposing child VP stats to
userspace via debugfs. Mapping the pages a second time to a different
address causes an error on L1VH.

Remove the scheduler requirement and always map the vp stats pages.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index faca3cc63e79..fbfc9e7d9fa4 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1077,16 +1077,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 			goto unmap_register_page;
 	}
 
-	/*
-	 * This mapping of the stats page is for detecting if dispatch thread
-	 * is blocked - only relevant for root scheduler
-	 */
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT) {
-		ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
-					stats_pages);
-		if (ret)
-			goto unmap_ghcb_page;
-	}
+	ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
+				stats_pages);
+	if (ret)
+		goto unmap_ghcb_page;
 
 	vp = kzalloc(sizeof(*vp), GFP_KERNEL);
 	if (!vp)
@@ -1110,8 +1104,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
 		vp->vp_ghcb_page = page_to_virt(ghcb_page);
 
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
+	memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
 
 	/*
 	 * Keep anon_inode_getfd last: it installs fd in the file struct and
@@ -1133,8 +1126,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 free_vp:
 	kfree(vp);
 unmap_stats_pages:
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages);
+	mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages);
 unmap_ghcb_page:
 	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
 		hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
@@ -1754,9 +1746,8 @@ static void destroy_partition(struct mshv_partition *partition)
 			if (!vp)
 				continue;
 
-			if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-				mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
-						    vp->vp_stats_pages);
+			mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
+					    vp->vp_stats_pages);
 
 			if (vp->vp_register_page) {
 				(void)hv_unmap_vp_state_page(partition->pt_id,
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 3/7] mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
From: Nuno Das Neves @ 2026-01-26 20:55 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

These functions are currently only used to map child partition VP stats,
on root partition. However, they will soon be used on L1VH, and and also
used for mapping the host's own VP stats.

Introduce a helper is_l1vh_parent() to determine whether we are mapping
our own VP stats. In this case, do not attempt to map the PARENT area.
Note this is a different case than mapping PARENT on an older hypervisor
where it is not available at all, so must be handled separately.

On unmap, pass the stats pages since on L1VH the kernel allocates them
and they must be freed in hv_unmap_stats_page().

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root.h      | 10 ++++++
 drivers/hv/mshv_root_main.c | 61 ++++++++++++++++++++++++++-----------
 2 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 05ba1f716f9e..e4912b0618fa 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -254,6 +254,16 @@ struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
 void mshv_partition_put(struct mshv_partition *partition);
 struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
 
+static inline bool is_l1vh_parent(u64 partition_id)
+{
+	return hv_l1vh_partition() && (partition_id == HV_PARTITION_ID_SELF);
+}
+
+int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
+		      struct hv_stats_page **stats_pages);
+void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
+			 struct hv_stats_page **stats_pages);
+
 /* hypercalls */
 
 int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index be5ad0fbfbee..faca3cc63e79 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -956,23 +956,36 @@ mshv_vp_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
-				struct hv_stats_page *stats_pages[])
+void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
+			 struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
 		.vp.vp_index = vp_index,
 	};
+	int err;
 
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
-
-	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
-	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_VP,
+				  stats_pages[HV_STATS_AREA_SELF],
+				  &identity);
+	if (err)
+		pr_err("%s: failed to unmap partition %llu vp %u self stats, err: %d\n",
+		       __func__, partition_id, vp_index, err);
+
+	if (stats_pages[HV_STATS_AREA_PARENT] != stats_pages[HV_STATS_AREA_SELF]) {
+		identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
+		err = hv_unmap_stats_page(HV_STATS_OBJECT_VP,
+					  stats_pages[HV_STATS_AREA_PARENT],
+					  &identity);
+		if (err)
+			pr_err("%s: failed to unmap partition %llu vp %u parent stats, err: %d\n",
+			       __func__, partition_id, vp_index, err);
+	}
 }
 
-static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
-			     struct hv_stats_page *stats_pages[])
+int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
+		      struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
@@ -983,23 +996,37 @@ static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
 	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
 				&stats_pages[HV_STATS_AREA_SELF]);
-	if (err)
+	if (err) {
+		pr_err("%s: failed to map partition %llu vp %u self stats, err: %d\n",
+		       __func__, partition_id, vp_index, err);
 		return err;
+	}
 
-	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
-	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
-				&stats_pages[HV_STATS_AREA_PARENT]);
-	if (err)
-		goto unmap_self;
-
-	if (!stats_pages[HV_STATS_AREA_PARENT])
+	/*
+	 * L1VH partition cannot access its vp stats in parent area.
+	 */
+	if (is_l1vh_parent(partition_id)) {
 		stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+	} else {
+		identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
+		err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
+					&stats_pages[HV_STATS_AREA_PARENT]);
+		if (err) {
+			pr_err("%s: failed to map partition %llu vp %u parent stats, err: %d\n",
+			       __func__, partition_id, vp_index, err);
+			goto unmap_self;
+		}
+		if (!stats_pages[HV_STATS_AREA_PARENT])
+			stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+	}
 
 	return 0;
 
 unmap_self:
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
+	hv_unmap_stats_page(HV_STATS_OBJECT_VP,
+			    stats_pages[HV_STATS_AREA_SELF],
+			    &identity);
 	return err;
 }
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 2/7] mshv: Use typed hv_stats_page pointers
From: Nuno Das Neves @ 2026-01-26 20:55 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Refactor all relevant functions to use struct hv_stats_page pointers
instead of void pointers for stats page mapping and unmapping thus
improving type safety and code clarity across the Hyper-V stats mapping
APIs.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root.h         |  5 +++--
 drivers/hv/mshv_root_hv_call.c | 12 +++++++-----
 drivers/hv/mshv_root_main.c    |  8 ++++----
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..05ba1f716f9e 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -307,8 +307,9 @@ int hv_call_disconnect_port(u64 connection_partition_id,
 int hv_call_notify_port_ring_empty(u32 sint_index);
 int hv_map_stats_page(enum hv_stats_object_type type,
 		      const union hv_stats_object_identity *identity,
-		      void **addr);
-int hv_unmap_stats_page(enum hv_stats_object_type type, void *page_addr,
+		      struct hv_stats_page **addr);
+int hv_unmap_stats_page(enum hv_stats_object_type type,
+			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity);
 int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 				   u64 page_struct_count, u32 host_access,
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 1f93b94d7580..daee036e48bc 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -890,9 +890,10 @@ hv_stats_get_area_type(enum hv_stats_object_type type,
  * caller should check for this case and instead fallback to the SELF area
  * alone.
  */
-static int hv_call_map_stats_page(enum hv_stats_object_type type,
-				  const union hv_stats_object_identity *identity,
-				  void **addr)
+static int
+hv_call_map_stats_page(enum hv_stats_object_type type,
+		       const union hv_stats_object_identity *identity,
+		       struct hv_stats_page **addr)
 {
 	unsigned long flags;
 	struct hv_input_map_stats_page *input;
@@ -942,7 +943,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 
 int hv_map_stats_page(enum hv_stats_object_type type,
 		      const union hv_stats_object_identity *identity,
-		      void **addr)
+		      struct hv_stats_page **addr)
 {
 	int ret;
 	struct page *allocated_page = NULL;
@@ -990,7 +991,8 @@ static int hv_call_unmap_stats_page(enum hv_stats_object_type type,
 	return hv_result_to_errno(status);
 }
 
-int hv_unmap_stats_page(enum hv_stats_object_type type, void *page_addr,
+int hv_unmap_stats_page(enum hv_stats_object_type type,
+			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity)
 {
 	int ret;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1777778f84b8..be5ad0fbfbee 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -957,7 +957,7 @@ mshv_vp_release(struct inode *inode, struct file *filp)
 }
 
 static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
-				void *stats_pages[])
+				struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
@@ -972,7 +972,7 @@ static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
 }
 
 static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
-			     void *stats_pages[])
+			     struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
@@ -1010,7 +1010,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	struct mshv_create_vp args;
 	struct mshv_vp *vp;
 	struct page *intercept_msg_page, *register_page, *ghcb_page;
-	void *stats_pages[2];
+	struct hv_stats_page *stats_pages[2];
 	long ret;
 
 	if (copy_from_user(&args, arg, sizeof(args)))
@@ -1729,7 +1729,7 @@ static void destroy_partition(struct mshv_partition *partition)
 
 			if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
 				mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
-						    (void **)vp->vp_stats_pages);
+						    vp->vp_stats_pages);
 
 			if (vp->vp_register_page) {
 				(void)hv_unmap_vp_state_page(partition->pt_id,
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 1/7] mshv: Ignore second stats page map result failure
From: Nuno Das Neves @ 2026-01-26 20:55 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260126205603.404655-1-nunodasneves@linux.microsoft.com>

From: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>

Older versions of the hypervisor do not have a concept of separate SELF
and PARENT stats areas. In this case, mapping the HV_STATS_AREA_SELF page
is sufficient - it's the only page and it contains all available stats.

Mapping HV_STATS_AREA_PARENT returns HV_STATUS_INVALID_PARAMETER which
currently causes module init to fail on older hypevisor versions.

Detect this case and gracefully fall back to populating
stats_pages[HV_STATS_AREA_PARENT] with the already-mapped SELF page.

Add comments to clarify the behavior, including a clarification of why
this isn't needed for hv_call_map_stats_page2() which always supports
PARENT and SELF areas.

Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c | 52 +++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root_main.c    |  3 ++
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 598eaff4ff29..1f93b94d7580 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -813,6 +813,13 @@ hv_call_notify_port_ring_empty(u32 sint_index)
 	return hv_result_to_errno(status);
 }
 
+/*
+ * Equivalent of hv_call_map_stats_page() for cases when the caller provides
+ * the map location.
+ *
+ * NOTE: This is a newer hypercall that always supports SELF and PARENT stats
+ * areas, unlike hv_call_map_stats_page().
+ */
 static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 				   const union hv_stats_object_identity *identity,
 				   u64 map_location)
@@ -855,6 +862,34 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 	return ret;
 }
 
+static int
+hv_stats_get_area_type(enum hv_stats_object_type type,
+		       const union hv_stats_object_identity *identity)
+{
+	switch (type) {
+	case HV_STATS_OBJECT_HYPERVISOR:
+		return identity->hv.stats_area_type;
+	case HV_STATS_OBJECT_LOGICAL_PROCESSOR:
+		return identity->lp.stats_area_type;
+	case HV_STATS_OBJECT_PARTITION:
+		return identity->partition.stats_area_type;
+	case HV_STATS_OBJECT_VP:
+		return identity->vp.stats_area_type;
+	}
+
+	return -EINVAL;
+}
+
+/*
+ * Map a stats page, where the page location is provided by the hypervisor.
+ *
+ * NOTE: The concept of separate SELF and PARENT stats areas does not exist on
+ * older hypervisor versions. All the available stats information can be found
+ * on the SELF page. When attempting to map the PARENT area on a hypervisor
+ * that doesn't support it, return "success" but with a NULL address. The
+ * caller should check for this case and instead fallback to the SELF area
+ * alone.
+ */
 static int hv_call_map_stats_page(enum hv_stats_object_type type,
 				  const union hv_stats_object_identity *identity,
 				  void **addr)
@@ -863,7 +898,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 	struct hv_input_map_stats_page *input;
 	struct hv_output_map_stats_page *output;
 	u64 status, pfn;
-	int ret = 0;
+	int hv_status, ret = 0;
 
 	do {
 		local_irq_save(flags);
@@ -878,11 +913,20 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 		pfn = output->map_location;
 
 		local_irq_restore(flags);
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
-			ret = hv_result_to_errno(status);
+
+		hv_status = hv_result(status);
+		if (hv_status != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (hv_result_success(status))
 				break;
-			return ret;
+
+			if (hv_stats_get_area_type(type, identity) == HV_STATS_AREA_PARENT &&
+			    hv_status == HV_STATUS_INVALID_PARAMETER) {
+				*addr = NULL;
+				return 0;
+			}
+
+			hv_status_debug(status, "\n");
+			return hv_result_to_errno(status);
 		}
 
 		ret = hv_call_deposit_pages(NUMA_NO_NODE,
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..1777778f84b8 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -992,6 +992,9 @@ static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
 	if (err)
 		goto unmap_self;
 
+	if (!stats_pages[HV_STATS_AREA_PARENT])
+		stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+
 	return 0;
 
 unmap_self:
-- 
2.34.1


^ permalink raw reply related

* [PATCH v5 0/7] mshv: Debugfs interface for mshv_root
From: Nuno Das Neves @ 2026-01-26 20:55 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves

Expose hypervisor, logical processor, partition, and virtual processor
statistics via debugfs. These are provided by mapping 'stats' pages via
hypercall.

Patch #1: Update hv_call_map_stats_page() to return success when
          HV_STATS_AREA_PARENT is unavailable, which is the case on some
          hypervisor versions, where it can fall back to HV_STATS_AREA_SELF
Patch #2: Use struct hv_stats_page pointers instead of void *
Patch #3: Make mshv_vp_stats_map/unmap() more flexible to use with debugfs code
Patch #4: Always map vp stats page regardless of scheduler, to reuse in debugfs
Patch #5: Change to hv_stats_page definition and VpRootDispatchThreadBlocked
Patch #6: Introduce the definitions needed for the various stats pages
Patch #7: Add mshv_debugfs.c, and integrate it with the mshv_root driver to
          expose the partition and VP stats.

---
Changes in v5:
- Rename hv_counters.c to mshv_debugfs_counters.c [Michael]
- Clarify unusual inclusion of mshv_debugfs_counters.c with comment. After
  discussion it is still included directly to keep things simple. Including
  arrays with unspecified size via a header means sizeof() cannot be used on
  the array.
- Error if mshv_debugfs_counters.c is included elsewhere than mshv_debugfs.c
- Use array index as stats page index to save space [Stanislav]
- Enforce HV_STATS_AREA_PARENT and SELF fit in NUM_STATS_AREAS with
  static_assert and clarify with comment [Michael]
- Return to using lp count from hv stats page for mshv_lps_count [Michael]
- Use nr_cpu_ids instead of num_possible_cpus() [Michael]
- Set mshv_lps_stats[idx] and the array itself to NULL on unmap and cleanup
  [Michael]
- Rename HvLogicalProcessors and VpRootDispatchThreadBlocked to Linux style
- Translate Linux cpu index to vp index via hv_vp_index on partition destroy
  [Michael]
- Minor formatting cleanups [Michael]

Changes in v4:
- Put the counters definitions in static arrays in hv_counters.c, instead of as
  enums in hvhdk.h [Michael]
- Due to the above, add an additional patch (#5) to simplify hv_stats_page, and
  retain the enum definition at the top of mshv_root_main.c for use with
  VpRootDispatchThreadBlocked. That is the only remaining use of the counter
  enum.
- Due to the above, use num_present_cpus() as the number of LPs to map stats
  pages for - this number shouldn't change at runtime because the hypervisor
  doesn't support hotplug for root partition.

Changes in v3:
- Add 3 small refactor/cleanup patches (patches 2,3,4) from Stanislav. These
  simplify some of the debugfs code, and fix issues with mapping VP stats on
  L1VH.
- Fix cleanup of parent stats dentries on module removal (via squashing some
  internal patches into patch #6) [Praveen]
- Remove unused goto label [Stanislav, kernel bot]
- Use struct hv_stats_page * instead of void * in mshv_debugfs.c [Stanislav]
- Remove some redundant variables [Stanislav]
- Rename debugfs dentry fields for brevity [Stanislav]
- Use ERR_CAST() for the dentry error pointer returned from
  lp_debugfs_stats_create() [Stanislav]
- Fix leak of pages allocated for lp stats mappings by storing them in an array
  [Michael]
- Add comments to clarify PARENT vs SELF usage and edge cases [Michael]
- Add VpLoadAvg for x86 and print the stat [Michael]
- Add NUM_STATS_AREAS for array sizing in mshv_debugfs.c [Michael]

Changes in v2:
- Remove unnecessary pr_debug_once() in patch 1 [Stanislav Kinsburskii]
- CONFIG_X86 -> CONFIG_X86_64 in patch 2 [Stanislav Kinsburskii]

---
Nuno Das Neves (3):
  mshv: Update hv_stats_page definitions
  mshv: Add data for printing stats page counters
  mshv: Add debugfs to view hypervisor statistics

Purna Pavan Chandra Aekkaladevi (1):
  mshv: Ignore second stats page map result failure

Stanislav Kinsburskii (3):
  mshv: Use typed hv_stats_page pointers
  mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
  mshv: Always map child vp stats pages regardless of scheduler type

 drivers/hv/Makefile                |   1 +
 drivers/hv/mshv_debugfs.c          | 726 +++++++++++++++++++++++++++++
 drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++
 drivers/hv/mshv_root.h             |  49 +-
 drivers/hv/mshv_root_hv_call.c     |  64 ++-
 drivers/hv/mshv_root_main.c        | 140 +++---
 include/hyperv/hvhdk.h             |   8 +
 7 files changed, 1413 insertions(+), 65 deletions(-)
 create mode 100644 drivers/hv/mshv_debugfs.c
 create mode 100644 drivers/hv/mshv_debugfs_counters.c

-- 
2.34.1


^ permalink raw reply

* Re: [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
From: Stanislav Kinsburskii @ 2026-01-26 20:50 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <a2e54fff-3cbb-e332-c35e-7520c36eceed@linux.microsoft.com>

On Fri, Jan 23, 2026 at 04:42:54PM -0800, Mukesh R wrote:
> On 1/20/26 14:22, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:25PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > 
> > > On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> > > interrupts, etc need a device id as a parameter. This device id refers
> > > to that specific device during the lifetime of passthru.
> > > 
> > > An L1VH VM only contains VMBus based devices. A device id for a VMBus
> > > device is slightly different in that it uses the hv_pcibus_device info
> > > for building it to make sure it matches exactly what the hypervisor
> > > expects. This VMBus based device id is needed when attaching devices in
> > > an L1VH based guest VM. Before building it, a check is done to make sure
> > > the device is a valid VMBus device.
> > > 
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > >   arch/x86/include/asm/mshyperv.h     |  2 ++
> > >   drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
> > >   2 files changed, 31 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > > index eef4c3a5ba28..0d7fdfb25e76 100644
> > > --- a/arch/x86/include/asm/mshyperv.h
> > > +++ b/arch/x86/include/asm/mshyperv.h
> > > @@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
> > >   static inline void hv_apic_init(void) {}
> > >   #endif
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > > +
> > >   struct irq_domain *hv_create_pci_msi_domain(void);
> > >   int hv_map_msi_interrupt(struct irq_data *data,
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index 8bc6a38c9b5a..40f0b06bb966 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
> > >   #define DELIVERY_MODE		APIC_DELIVERY_MODE_FIXED
> > >   #define HV_MSI_CHIP_FLAGS	MSI_CHIP_FLAG_SET_ACK
> > > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> > > +
> > 
> > Why not moving this static function definition above the called instead of
> > defining the prototype?
> 
> Did you see the function implementation? It has other dependencies that
> are later, it would need code reorg.
> 

Why not placing the caller side after the function definition then?

Thanks,
Stanislav

> Thanks,
> -Mukesh
> 
> 
> > >   static int hv_pci_irqchip_init(void)
> > >   {
> > >   	return 0;
> > > @@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> > >   #define hv_msi_prepare		pci_msi_prepare
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > > +{
> > > +	u64 u64val;
> > 
> > This variable is redundant.
> 
> Not really. It helps with debug by putting a quick print, and is
> harmless.
> 
> > > +	struct hv_pcibus_device *hbus;
> > > +	struct pci_bus *pbus = pdev->bus;
> > > +
> > > +	if (!hv_vmbus_pci_device(pbus))
> > > +		return 0;
> > > +
> > > +	hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > > +	u64val = (hbus->hdev->dev_instance.b[5] << 24) |
> > > +		 (hbus->hdev->dev_instance.b[4] << 16) |
> > > +		 (hbus->hdev->dev_instance.b[7] << 8) |
> > > +		 (hbus->hdev->dev_instance.b[6] & 0xf8) |
> > > +		 PCI_FUNC(pdev->devfn);
> > > +
> > 
> > It looks like this value always fits into 32 bit, so what is the value
> > in returning 64 bit?
> 
> The ABI has device id defined as 64bits where this is assigned.
> 
> Thanks,
> -Mukesh
> 
> 
> 
> 
> > Thanks,
> > Stanislav
> > 
> > > +	return u64val;
> > > +}
> > > +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> > > +
> > >   /**
> > >    * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
> > >    * affinity.
> > > @@ -1404,6 +1426,13 @@ static struct pci_ops hv_pcifront_ops = {
> > >   	.write = hv_pcifront_write_config,
> > >   };
> > > +#ifdef CONFIG_X86
> > > +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> > > +{
> > > +	return pbus->ops == &hv_pcifront_ops;
> > > +}
> > > +#endif /* CONFIG_X86 */
> > > +
> > >   /*
> > >    * Paravirtual backchannel
> > >    *
> > > -- 
> > > 2.51.2.vfs.0.1
> > > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-26 20:46 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <xyzkeqng3767mlpzu7xbmgobjr6ob2wp2brocmjczbbl4dypxh@wkibga46f33c>

On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > runtime and never withdraws them. This creates a fundamental incompatibility
> > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > hypervisor deposited pages.
> > 
> > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > management is implemented.
> 
> Someone might want to stop all guest VMs and do a kexec. Which is valid
> and would work without any issue for L1VH.
> 

No, it won't work and hypervsisor depostied pages won't be withdrawn.
Also, kernel consisntency must no depend on use space behavior. 

> Also, I don't think it is reasonable at all that someone needs to
> disable basic kernel functionality such as kexec in order to use our
> driver.
> 

It's a temporary measure until proper page lifecycle management is
supported in the driver.
Mutual exclusion of the driver and kexec is given and thus should be
expclitily stated in the Kconfig.

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/Kconfig |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > index 7937ac0cbd0f..cfd4501db0fa 100644
> > --- a/drivers/hv/Kconfig
> > +++ b/drivers/hv/Kconfig
> > @@ -74,6 +74,7 @@ config MSHV_ROOT
> >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> >  	# no particular order, making it impossible to reassemble larger pages
> >  	depends on PAGE_SIZE_4KB
> > +	depends on !KEXEC
> >  	select EVENTFD
> >  	select VIRT_XFER_TO_GUEST_WORK
> >  	select HMM_MIRROR
> > 
> > 

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Stanislav Kinsburskii @ 2026-01-26 20:43 UTC (permalink / raw)
  To: Mukesh R
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <890506f6-9b91-5d59-8c98-086cf5d206bb@linux.microsoft.com>

On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> > > On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> > > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > > hypervisor deposited pages.
> > > > 
> > > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > > management is implemented.
> > > > 
> > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > ---
> > > >    drivers/hv/Kconfig |    1 +
> > > >    1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > > --- a/drivers/hv/Kconfig
> > > > +++ b/drivers/hv/Kconfig
> > > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > > >    	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > > >    	# no particular order, making it impossible to reassemble larger pages
> > > >    	depends on PAGE_SIZE_4KB
> > > > +	depends on !KEXEC
> > > >    	select EVENTFD
> > > >    	select VIRT_XFER_TO_GUEST_WORK
> > > >    	select HMM_MIRROR
> > > > 
> > > > 
> > > 
> > > Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> > > implying that crash dump might be involved. Or did you test kdump
> > > and it was fine?
> > > 
> > 
> > Yes, it will. Crash kexec depends on normal kexec functionality, so it
> > will be affected as well.
> 
> So not sure I understand the reason for this patch. We can just block
> kexec if there are any VMs running, right? Doing this would mean any
> further developement would be without a ver important and major feature,
> right?

This is an option. But until it's implemented and merged, a user mshv
driver gets into a situation where kexec is broken in a non-obvious way.
The system may crash at any time after kexec, depending on whether the
new kernel touches the pages deposited to hypervisor or not. This is a
bad user experience.
Therefor it should be explicitly forbidden as it's essentially not
supported yet.

Thanks,
Stanislav

> 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > -Mukesh

^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-26 20:20 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXabnnCV50Thv9tZ@skinsburskii.localdomain>

On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>> hypervisor deposited pages.
>>>
>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>> management is implemented.
>>>
>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>> ---
>>>    drivers/hv/Kconfig |    1 +
>>>    1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>> --- a/drivers/hv/Kconfig
>>> +++ b/drivers/hv/Kconfig
>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>    	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>    	# no particular order, making it impossible to reassemble larger pages
>>>    	depends on PAGE_SIZE_4KB
>>> +	depends on !KEXEC
>>>    	select EVENTFD
>>>    	select VIRT_XFER_TO_GUEST_WORK
>>>    	select HMM_MIRROR
>>>
>>>
>>
>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>> implying that crash dump might be involved. Or did you test kdump
>> and it was fine?
>>
> 
> Yes, it will. Crash kexec depends on normal kexec functionality, so it
> will be affected as well.

So not sure I understand the reason for this patch. We can just block
kexec if there are any VMs running, right? Doing this would mean any
further developement would be without a ver important and major feature,
right?

> Thanks,
> Stanislav
> 
>> Thanks,
>> -Mukesh


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox