Linux-HyperV List
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Stanislav Kinsburskii @ 2026-01-27 18:46 UTC
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <c40e6dc8-8e42-b0f3-f8e5-3c637adb7f13@linux.microsoft.com>

On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
> On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> > > On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > 
> > > > > Add a new file to implement management of device domains, mapping and
> > > > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > > > parent. This also implements direct attach mechanism for PCI passthru,
> > > > > and it is also made to work within the VFIO framework.
> > > > > 
> > > > > At a high level, during boot the hypervisor creates a default identity
> > > > > domain and attaches all devices to it. This nicely maps to Linux iommu
> > > > > subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > > > PCI passthru:
> > > > > 
> > > > >     1. Device Domain: root must create a device domain in the hypervisor,
> > > > >        and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > > >        All hypervisor communications use device id of type PCI for
> > > > >        identifying and referencing the device.
> > > > > 
> > > > >     2. Direct Attach: the hypervisor will simply use the guest's HW
> > > > >        page table for mappings, thus the host need not do map/unmap
> > > > >        device memory hypercalls. As such, direct attach passthru setup
> > > > >        during guest boot is extremely fast. A direct attached device
> > > > >        must be referenced via logical device id and not via the PCI
> > > > >        device id.
> > > > > 
> > > > > At present, L1VH root/parent only supports direct attaches. Also direct
> > > > > attach is default in non-L1VH cases because there are some significant
> > > > > performance issues with device domain implementation currently for guests
> > > > > with higher RAM (say more than 8GB), and that unfortunately cannot be
> > > > > addressed in the short term.
> > > > > 
> > > > 
> > > > <snip>
> > > > 
> > 
> > <snip>
> > 
> > > > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > > > +{
> > > > > +	struct pci_dev *pdev;
> > > > > +	struct hv_domain *hvdom = to_hv_domain(immdom);
> > > > > +
> > > > > +	/* See the attach function, only PCI devices for now */
> > > > > +	if (!dev_is_pci(dev))
> > > > > +		return;
> > > > > +
> > > > > +	if (hvdom->num_attchd == 0)
> > > > > +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > > > +
> > > > > +	pdev = to_pci_dev(dev);
> > > > > +
> > > > > +	if (hvdom->attached_dom) {
> > > > > +		hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > > > +
> > > > > +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > > > +		 * next.
> > > > > +		 */
> > > > > +	} else {
> > > > > +		hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > > > +	}
> > > > > +
> > > > > +	hvdom->num_attchd--;
> > > > 
> > > > Shouldn't this be modified iff the detach succeeded?
> > > 
> > > We want to still free the domain and not let it get stuck. The purpose
> > > is more to make sure detach was called before domain free.
> > > 
> > 
> > How can one debug subsequent errors if num_attchd is decremented
> > unconditionally? In reality the device is left attached, but the related
> > kernel metadata is gone.
> 
> Error is printed in case of failed detach. If there is panic, at least
> you can get some info about the device. Metadata in hypervisor is
> around if failed.
> 

With this approach the only thing left is a kernel message.
But if the state is kept intact, one could collect a kernel core and
analyze it.

And note that there won't be a hypervisor core by default: our main
context with the upstreamed version of the driver is L1VH, and a kernel
core is the only thing a third-party customer can provide for our
analysis.
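
To illustrate the accounting being asked for, here is a minimal userspace
sketch, not the driver code. The names model_domain, model_detach_dev and
the detach_fn callback are hypothetical stand-ins, and the int return value
is an assumption made for the model (the posted hv_iommu_detach_dev returns
void). The count is dropped only after the detach is confirmed, so a failed
detach leaves the state visible in a core:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for struct hv_domain. */
struct model_domain {
	unsigned int num_attached;	/* devices believed to be attached */
};

/* Stand-in for the detach hypercall: returns 0 or a negative errno. */
typedef int (*detach_fn)(int devid);

static int detach_ok(int devid)   { (void)devid; return 0; }
static int detach_fail(int devid) { (void)devid; return -EIO; }

/*
 * Drop the attach count only when the detach was confirmed. On failure
 * the count is left alone, so a later domain free still sees the domain
 * as busy and the bookkeeping matches what the hypervisor holds.
 */
static int model_detach_dev(struct model_domain *dom, int devid,
			    detach_fn do_detach)
{
	int rc;

	if (dom->num_attached == 0) {
		fprintf(stderr, "dev %d: detach with zero attach count\n", devid);
		return -EINVAL;
	}

	rc = do_detach(devid);
	if (rc) {
		fprintf(stderr, "dev %d: detach failed: %d\n", devid, rc);
		return rc;
	}

	dom->num_attached--;
	return 0;
}
```

With this shape, a domain whose detach failed keeps a non-zero count and is
refused by the "busy domain" check in the free path instead of being
silently released.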

Thanks,
Stanislav



* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
From: Stanislav Kinsburskii @ 2026-01-27 18:57 UTC
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <f39a501e-478f-66ff-26c8-229ca3991f4f@linux.microsoft.com>

On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
> On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> > > On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > 
> > > > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > > > generates an unmapped gpa intercept. In this path, lookup the PCI
> > > > > resource pfn for the guest gpa, and ask the hypervisor to map it
> > > > > via hypercall. The PCI resource pfn is maintained by the VFIO driver,
> > > > > and obtained via fixup_user_fault call (similar to KVM).
> > > > > 
> > > > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > ---
> > > > >    drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > > >    1 file changed, 115 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > > >    	};
> > > > >    } __packed;
> > > > > +bool hv_nofull_mmio;   /* don't map entire mmio region upon fault */
> > > > > +static int __init setup_hv_full_mmio(char *str)
> > > > > +{
> > > > > +	hv_nofull_mmio = true;
> > > > > +	return 0;
> > > > > +}
> > > > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > > > +
> > > > >    struct mshv_root mshv_root;
> > > > >    enum hv_scheduler_type hv_scheduler_type;
> > > > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > >    }
> > > > >    #ifdef CONFIG_X86_64
> > > > > +
> > > > > +/*
> > > > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > > > + * else just return -errno.
> > > > > + */
> > > > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > > > +				       u64 *mmio_pfnp)
> > > > > +{
> > > > > +	struct vm_area_struct *vma;
> > > > > +	bool is_mmio;
> > > > > +	u64 uaddr;
> > > > > +	struct mshv_mem_region *mreg;
> > > > > +	struct follow_pfnmap_args pfnmap_args;
> > > > > +	int rc = -EINVAL;
> > > > > +
> > > > > +	/*
> > > > > +	 * Do not allow mem region to be deleted beneath us. VFIO uses
> > > > > +	 * useraddr vma to lookup pci bar pfn.
> > > > > +	 */
> > > > > +	spin_lock(&pt->pt_mem_regions_lock);
> > > > > +
> > > > > +	/* Get the region again under the lock */
> > > > > +	mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > +		goto unlock_pt_out;
> > > > > +
> > > > > +	uaddr = mreg->start_uaddr +
> > > > > +		((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > > > +
> > > > > +	mmap_read_lock(current->mm);
> > > > 
> > > > Semaphore can't be taken under spinlock.
> > 
> > > 
> > > Yeah, something didn't feel right here and I meant to recheck, now regret
> > > rushing to submit the patch.
> > > 
> > > Rethinking, I think the pt_mem_regions_lock is not needed to protect
> > > the uaddr because unmap will properly serialize via the mm lock.
> > > 
> > > 
> > > > > +	vma = vma_lookup(current->mm, uaddr);
> > > > > +	is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> > > > 
> > > > Why is this check needed again?
> > > 
> > > To make sure region did not change. This check is under lock.
> > > 
> > 
> > How can this happen? One can't change VMA type without unmapping it
> > first. And unmapping it leads to a kernel MMIO region state dangling
> > around without corresponding user space mapping.
> 
> Right, and vm_flags would not be mmio expected then.
> 
> > This is similar to dangling pinned regions and should likely be
> > addressed the same way by utilizing MMU notifiers to destroy memory
> > regions if the VMA is detached.
> 
> I don't think we need that. Either it succeeds if the region did not
> change at all, or just fails.
> 

I'm afraid we do: if the driver mapped a page with the previous
memory region and the region is then unmapped, the page will stay
mapped in the hypervisor but will be considered free by the kernel, which
in turn will lead to a GPF upon the next allocation.

With pinned regions the issue is similar but less severe: pages can't
be released by user space unmapping and thus will simply be leaked, but
the system stays intact.

MMIO regions are similar to movable regions in this regard: they don't
reference the user pages, and thus this guest region replacement is a
straight way to a kernel panic.

> 
> > > > The region type is stored on the region itself.
> > > > And the type is checked on the caller side.
> > > > 
> > > > > +	if (!is_mmio)
> > > > > +		goto unlock_mmap_out;
> > > > > +
> > > > > +	pfnmap_args.vma = vma;
> > > > > +	pfnmap_args.address = uaddr;
> > > > > +
> > > > > +	rc = follow_pfnmap_start(&pfnmap_args);
> > > > > +	if (rc) {
> > > > > +		rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > > > +				      NULL);
> > > > > +		if (rc)
> > > > > +			goto unlock_mmap_out;
> > > > > +
> > > > > +		rc = follow_pfnmap_start(&pfnmap_args);
> > > > > +		if (rc)
> > > > > +			goto unlock_mmap_out;
> > > > > +	}
> > > > > +
> > > > > +	*mmio_pfnp = pfnmap_args.pfn;
> > > > > +	follow_pfnmap_end(&pfnmap_args);
> > > > > +
> > > > > +unlock_mmap_out:
> > > > > +	mmap_read_unlock(current->mm);
> > > > > +unlock_pt_out:
> > > > > +	spin_unlock(&pt->pt_mem_regions_lock);
> > > > > +	return rc;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > > > + * and resolve if possible.
> > > > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > > > + */
> > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > > > +{
> > > > > +	struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > > > +	struct hv_x64_memory_intercept_message *msg;
> > > > > +	union hv_x64_memory_access_info accinfo;
> > > > > +	u64 gfn, mmio_spa, numpgs;
> > > > > +	struct mshv_mem_region *mreg;
> > > > > +	int rc;
> > > > > +	struct mshv_partition *pt = vp->vp_partition;
> > > > > +
> > > > > +	msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > > > +	accinfo = msg->memory_access_info;
> > > > > +
> > > > > +	if (!accinfo.gva_gpa_valid)
> > > > > +		return false;
> > > > > +
> > > > > +	/* Do a fast check and bail if non mmio intercept */
> > > > > +	gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > > > +	mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > 
> > > > This call needs to be protected by the spinlock.
> > > 
> > > This is sorta fast path to bail. We recheck under partition lock above.
> > > 
> > 
> > Accessing the list of regions without lock is unsafe.
> 
> I am not sure why? This check is done by a vcpu thread, so regions
> will not have just gone away.
> 

This is a shared resource. Multiple VP threads can get into this function
simultaneously, so there is a race already. But that one we can live
with without locking, as they don't mutate the list of regions.

The issue arises when the VMM adds or removes another region, as that
mutates the list and races with VP threads doing this lookup.
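
The ordering discussed in this thread (walk the region list only under the
list lock, and never take a sleeping lock while holding it) can be sketched
in userspace as follows. This is an illustrative model under stated
assumptions, not the mshv code: model_spinlock, model_region,
model_partition and mmio_gfn_to_uaddr are hypothetical names, and a C11
atomic_flag stands in for pt_mem_regions_lock. The needed fields are copied
out under the lock, and the lock is dropped before the caller would go on
to take something like mmap_read_lock():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12

/* Hypothetical spinlock model built on a C11 atomic_flag. */
struct model_spinlock { atomic_flag locked; };

static void model_spin_lock(struct model_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
		;	/* busy-wait, as a spinlock would */
}

static void model_spin_unlock(struct model_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

struct model_region {
	uint64_t start_gfn;
	uint64_t nr_pages;
	uint64_t start_uaddr;
	bool is_mmio;
	struct model_region *next;
};

struct model_partition {
	struct model_spinlock regions_lock;	/* protects the regions list */
	struct model_region *regions;
};

static void model_partition_init(struct model_partition *pt,
				 struct model_region *head)
{
	atomic_flag_clear(&pt->regions_lock.locked);
	pt->regions = head;
}

/*
 * Resolve a guest frame number to the user address backing it. The list
 * is walked and the result computed under the lock, and the lock is
 * released before returning, so the caller holds no spinlock when it
 * later takes a sleeping lock.
 */
static bool mmio_gfn_to_uaddr(struct model_partition *pt, uint64_t gfn,
			      uint64_t *uaddr)
{
	struct model_region *r;
	bool found = false;

	model_spin_lock(&pt->regions_lock);
	for (r = pt->regions; r; r = r->next) {
		if (gfn >= r->start_gfn && gfn - r->start_gfn < r->nr_pages) {
			if (r->is_mmio) {
				*uaddr = r->start_uaddr +
					 ((gfn - r->start_gfn) << MODEL_PAGE_SHIFT);
				found = true;
			}
			break;
		}
	}
	model_spin_unlock(&pt->regions_lock);

	return found;
}
```

Copying the address out protects only the list walk. It does not by itself
address a region being removed after the lock is dropped, which is the
separate lifetime problem raised earlier in this reply.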

Thanks,
Stanislav


> Thanks,
> -Mukesh
> 
> 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > -Mukesh
> > > 
> > > 
> > > > Thanks,
> > > > Stanislav
> > > > 
> > > > > +	if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > +		return false;
> > > > > +
> > > > > +	rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > > > +	if (rc)
> > > > > +		return false;
> > > > > +
> > > > > +	if (!hv_nofull_mmio) {		/* default case */
> > > > > +		gfn = mreg->start_gfn;
> > > > > +		mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > > > +		numpgs = mreg->nr_pages;
> > > > > +	} else
> > > > > +		numpgs = 1;
> > > > > +
> > > > > +	rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > > > +
> > > > > +	return rc == 0;
> > > > > +}
> > > > > +
> > > > >    static struct mshv_mem_region *
> > > > >    mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > >    {
> > > > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > >    	return ret;
> > > > >    }
> > > > > +
> > > > >    #else  /* CONFIG_X86_64 */
> > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > > >    static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > >    #endif /* CONFIG_X86_64 */
> > > > >    static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > >    {
> > > > >    	switch (vp->vp_intercept_msg_page->header.message_type) {
> > > > > +	case HVMSG_UNMAPPED_GPA:
> > > > > +		return mshv_handle_unmapped_gpa(vp);
> > > > >    	case HVMSG_GPA_INTERCEPT:
> > > > >    		return mshv_handle_gpa_intercept(vp);
> > > > >    	}
> > > > > -- 
> > > > > 2.51.2.vfs.0.1
> > > > > 


* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Jacob Pan @ 2026-01-27 19:21 UTC
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <34de2049-912e-fc9e-9fc1-727fade0480f@linux.microsoft.com>

Hi Mukesh,

On Fri, 23 Jan 2026 18:01:29 -0800
Mukesh R <mrathor@linux.microsoft.com> wrote:

> On 1/21/26 21:18, Jacob Pan wrote:
> > Hi Mukesh,
> > 
> > On Mon, 19 Jan 2026 22:42:27 -0800
> > Mukesh R <mrathor@linux.microsoft.com> wrote:
> >   
> >> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> >>
> >> Add a new file to implement management of device domains, mapping
> >> and unmapping of iommu memory, and other iommu_ops to fit within
> >> the VFIO framework for PCI passthru on Hyper-V running Linux as
> >> root or L1VH parent. This also implements direct attach mechanism
> >> for PCI passthru, and it is also made to work within the VFIO
> >> framework.
> >>
> >> At a high level, during boot the hypervisor creates a default
> >> identity domain and attaches all devices to it. This nicely maps
> >> to Linux iommu subsystem IOMMU_DOMAIN_IDENTITY domain. As a
> >> result, Linux does not need to explicitly ask Hyper-V to attach
> >> devices and do maps/unmaps during boot. As mentioned previously,
> >> Hyper-V supports two ways to do PCI passthru:
> >>
> >>    1. Device Domain: root must create a device domain in the
> >> hypervisor, and do map/unmap hypercalls for mapping and unmapping
> >> guest RAM. All hypervisor communications use device id of type PCI
> >> for identifying and referencing the device.
> >>
> >>    2. Direct Attach: the hypervisor will simply use the guest's HW
> >>       page table for mappings, thus the host need not do map/unmap
> >>       device memory hypercalls. As such, direct attach passthru
> >> setup during guest boot is extremely fast. A direct attached device
> >>       must be referenced via logical device id and not via the PCI
> >>       device id.
> >>
> >> At present, L1VH root/parent only supports direct attaches. Also
> >> direct attach is default in non-L1VH cases because there are some
> >> significant performance issues with device domain implementation
> >> currently for guests with higher RAM (say more than 8GB), and that
> >> unfortunately cannot be addressed in the short term.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >>   MAINTAINERS                     |   1 +
> >>   arch/x86/include/asm/mshyperv.h |   7 +-
> >>   arch/x86/kernel/pci-dma.c       |   2 +
> >>   drivers/iommu/Makefile          |   2 +-
> >>   drivers/iommu/hyperv-iommu.c    | 876
> >> ++++++++++++++++++++++++++++++++ include/linux/hyperv.h          |
> >> 6 + 6 files changed, 890 insertions(+), 4 deletions(-)
> >>   create mode 100644 drivers/iommu/hyperv-iommu.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 381a0e086382..63160cee942c 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -11741,6 +11741,7 @@ F:	drivers/hid/hid-hyperv.c
> >>   F:	drivers/hv/
> >>   F:	drivers/infiniband/hw/mana/
> >>   F:	drivers/input/serio/hyperv-keyboard.c
> >> +F:	drivers/iommu/hyperv-iommu.c  
> > Given we are also developing a guest iommu driver on hyperv, I
> > think it is more clear to name them accordingly. Perhaps,
> > hyperv-iommu-root.c?  
> 
> well, l1vh is not quite root, more like a parent. But we've been using
> l1vh root loosely to mean l1vh parent. so probably ok to rename it
> to hyperv-iommu-root.c. I prefer not calling it parent or something
> like that.
yeah, something specific and different than the guest driver will do.

> >>   F:	drivers/iommu/hyperv-irq.c
> >>   F:	drivers/net/ethernet/microsoft/
> >>   F:	drivers/net/hyperv/
> >> diff --git a/arch/x86/include/asm/mshyperv.h
> >> b/arch/x86/include/asm/mshyperv.h index 97477c5a8487..e4ccdbbf1d12
> >> 100644 --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
> >>   #endif
> >>   
> >>   #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> >> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> >> -{ return false; }       /* temporary */
> >> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
> >>   u64 hv_build_devid_oftype(struct pci_dev *pdev, enum
> >> hv_device_type type); +u64 hv_iommu_get_curr_partid(void);
> >>   #else	/* CONFIG_HYPERV_IOMMU */
> >>   static inline bool hv_pcidev_is_attached_dev(struct pci_dev
> >> *pdev) { return false; }
> >> -
> >>   static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
> >>   				       enum hv_device_type type)
> >>   { return 0; }
> >> +static inline u64 hv_iommu_get_curr_partid(void)
> >> +{ return HV_PARTITION_ID_INVALID; }
> >>   
> >>   #endif	/* CONFIG_HYPERV_IOMMU */
> >>   
> >> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> >> index 6267363e0189..cfeee6505e17 100644
> >> --- a/arch/x86/kernel/pci-dma.c
> >> +++ b/arch/x86/kernel/pci-dma.c
> >> @@ -8,6 +8,7 @@
> >>   #include <linux/gfp.h>
> >>   #include <linux/pci.h>
> >>   #include <linux/amd-iommu.h>
> >> +#include <linux/hyperv.h>
> >>   
> >>   #include <asm/proto.h>
> >>   #include <asm/dma.h>
> >> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
> >>   	gart_iommu_hole_init();
> >>   	amd_iommu_detect();
> >>   	detect_intel_iommu();
> >> +	hv_iommu_detect();  
> >
> > Will this driver be x86 only?  
> Yes for now.
If there is nothing x86 specific in this driver (assuming the
hypercalls here are not x86 only), maybe you can move to the generic
startup code.

> >>   	swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
> >>   }
> >>   
> >> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> >> index 598c39558e7d..cc9774864b00 100644
> >> --- a/drivers/iommu/Makefile
> >> +++ b/drivers/iommu/Makefile
> >> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
> >>   obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> >>   obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> >>   obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> >> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
> >> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o  
> > DMA and IRQ remapping should be separate  
> 
> not sure i follow.
In IOMMU subsystem, DMA remapping and IRQ remapping can be turned
on/off independently. e.g. you could have an option to turn on IRQ
remapping w/o DMA remapping. But here you tied them together.

> 
> >>   obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> >>   obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> >>   obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> >> diff --git a/drivers/iommu/hyperv-iommu.c
> >> b/drivers/iommu/hyperv-iommu.c new file mode 100644
> >> index 000000000000..548483fec6b1
> >> --- /dev/null
> >> +++ b/drivers/iommu/hyperv-iommu.c
> >> @@ -0,0 +1,876 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Hyper-V root vIOMMU driver.
> >> + * Copyright (C) 2026, Microsoft, Inc.
> >> + */
> >> +
> >> +#include <linux/module.h>  
> > I don't think this is needed since this driver cannot be a module
> >   
> >> +#include <linux/pci.h>
> >> +#include <linux/dmar.h>  
> > should not depend on Intel's DMAR
> >   
> >> +#include <linux/dma-map-ops.h>
> >> +#include <linux/interval_tree.h>
> >> +#include <linux/hyperv.h>
> >> +#include "dma-iommu.h"
> >> +#include <asm/iommu.h>
> >> +#include <asm/mshyperv.h>
> >> +
> >> +/* We will not claim these PCI devices, eg hypervisor needs it for
> >> debugger */ +static char *pci_devs_to_skip;
> >> +static int __init hv_iommu_setup_skip(char *str)
> >> +{
> >> +	pci_devs_to_skip = str;
> >> +
> >> +	return 0;
> >> +}
> >> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> >> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> >> +
> >> +bool hv_no_attdev;	 /* disable direct device attach for
> >> passthru */ +EXPORT_SYMBOL_GPL(hv_no_attdev);
> >> +static int __init setup_hv_no_attdev(char *str)
> >> +{
> >> +	hv_no_attdev = true;
> >> +	return 0;
> >> +}
> >> +__setup("hv_no_attdev", setup_hv_no_attdev);
> >> +
> >> +/* Iommu device that we export to the world. HyperV supports max
> >> of one */ +static struct iommu_device hv_virt_iommu;
> >> +
> >> +struct hv_domain {
> >> +	struct iommu_domain iommu_dom;
> >> +	u32 domid_num;			      /* as opposed
> >> to domain_id.type */
> >> +	u32 num_attchd;		      /* number of
> >> currently attached devices */  
> > rename to num_dev_attached?
> >   
> >> +	bool attached_dom;		      /* is this direct
> >> attached dom? */
> >> +	spinlock_t mappings_lock;	      /* protects
> >> mappings_tree */
> >> +	struct rb_root_cached mappings_tree;  /* iova to pa lookup
> >> tree */ +};
> >> +
> >> +#define to_hv_domain(d) container_of(d, struct hv_domain,
> >> iommu_dom) +
> >> +struct hv_iommu_mapping {
> >> +	phys_addr_t paddr;
> >> +	struct interval_tree_node iova;
> >> +	u32 flags;
> >> +};
> >> +
> >> +/*
> >> + * By default, during boot the hypervisor creates one Stage 2 (S2)
> >> default
> >> + * domain. Stage 2 means that the page table is controlled by the
> >> hypervisor.
> >> + *   S2 default: access to entire root partition memory. This for
> >> us easily
> >> + *		 maps to IOMMU_DOMAIN_IDENTITY in the iommu
> >> subsystem, and
> >> + *		 is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the
> >> hypervisor.
> >> + *
> >> + * Device Management:
> >> + *   There are two ways to manage device attaches to domains:
> >> + *     1. Domain Attach: A device domain is created in the
> >> hypervisor, the
> >> + *			 device is attached to this domain, and
> >> then memory
> >> + *			 ranges are mapped in the map callbacks.
> >> + *     2. Direct Attach: No need to create a domain in the
> >> hypervisor for direct
> >> + *			 attached devices. A hypercall is made
> >> to tell the
> >> + *			 hypervisor to attach the device to a
> >> guest. There is
> >> + *			 no need for explicit memory mappings
> >> because the
> >> + *			 hypervisor will just use the guest HW
> >> page table.
> >> + *
> >> + * Since a direct attach is much faster, it is the default. This
> >> can be
> >> + * changed via hv_no_attdev.
> >> + *
> >> + * L1VH: hypervisor only supports direct attach.
> >> + */
> >> +
> >> +/*
> >> + * Create dummy domain to correspond to hypervisor prebuilt
> >> default identity
> >> + * domain (dummy because we do not make hypercall to create them).
> >> + */
> >> +static struct hv_domain hv_def_identity_dom;
> >> +
> >> +static bool hv_special_domain(struct hv_domain *hvdom)
> >> +{
> >> +	return hvdom == &hv_def_identity_dom;
> >> +}
> >> +
> >> +struct iommu_domain_geometry default_geometry = (struct
> >> iommu_domain_geometry) {
> >> +	.aperture_start = 0,
> >> +	.aperture_end = -1UL,
> >> +	.force_aperture = true,
> >> +};
> >> +
> >> +/*
> >> + * Since the relevant hypercalls can only fit less than 512 PFNs
> >> in the pfn
> >> + * array, report 1M max.
> >> + */
> >> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
> >> +
> >> +static u32 unique_id;	      /* unique numeric id of a new
> >> domain */ +
> >> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
> >> +				struct device *dev);
> >> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom,
> >> ulong iova,
> >> +				   size_t pgsize, size_t pgcount,
> >> +				   struct iommu_iotlb_gather
> >> *gather); +
> >> +/*
> >> + * If the current thread is a VMM thread, return the partition id
> >> of the VM it
> >> + * is managing, else return HV_PARTITION_ID_INVALID.
> >> + */
> >> +u64 hv_iommu_get_curr_partid(void)
> >> +{
> >> +	u64 (*fn)(pid_t pid);
> >> +	u64 partid;
> >> +
> >> +	fn = symbol_get(mshv_pid_to_partid);
> >> +	if (!fn)
> >> +		return HV_PARTITION_ID_INVALID;
> >> +
> >> +	partid = fn(current->tgid);
> >> +	symbol_put(mshv_pid_to_partid);
> >> +
> >> +	return partid;
> >> +}  
> > This function is not iommu specific. Maybe move it to mshv code?  
> 
> Well, it is getting the information from mshv by calling a function
> there for iommu, and is not needed if no HYPER_IOMMU. So this is
> probably the best place for it.
> 
ok, maybe move it to mshv after we have a second user. But the function
name can be just hv_get_curr_partid(void), no?

> >> +
> >> +/* If this is a VMM thread, then this domain is for a guest VM */
> >> +static bool hv_curr_thread_is_vmm(void)
> >> +{
> >> +	return hv_iommu_get_curr_partid() !=
> >> HV_PARTITION_ID_INVALID; +}
> >> +
> >> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap
> >> cap) +{
> >> +	switch (cap) {
> >> +	case IOMMU_CAP_CACHE_COHERENCY:
> >> +		return true;
> >> +	default:
> >> +		return false;
> >> +	}
> >> +	return false;
> >> +}
> >> +
> >> +/*
> >> + * Check if given pci device is a direct attached device. Caller
> >> must have
> >> + * verified pdev is a valid pci device.
> >> + */
> >> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> >> +{
> >> +	struct iommu_domain *iommu_domain;
> >> +	struct hv_domain *hvdom;
> >> +	struct device *dev = &pdev->dev;
> >> +
> >> +	iommu_domain = iommu_get_domain_for_dev(dev);
> >> +	if (iommu_domain) {
> >> +		hvdom = to_hv_domain(iommu_domain);
> >> +		return hvdom->attached_dom;
> >> +	}
> >> +
> >> +	return false;
> >> +}
> >> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);  
> > Attached domain can change anytime, what guarantee does the caller
> > have?  
> 
> Not sure I understand what can change: the device moving from attached
> to non-attached? or the domain getting deleted? In any case, this is
> called from leaf functions, so that should not happen... and it
> will return false if the device did somehow got removed.
> 
I was thinking the device can be attached to a different domain type at
runtime, e.g. via sysfs to identity or DMA. But I guess here it is a static
attachment for either l1vh or root.

> >> +
> >> +/* Create a new device domain in the hypervisor */
> >> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
> >> +{
> >> +	u64 status;
> >> +	unsigned long flags;
> >> +	struct hv_input_device_domain *ddp;
> >> +	struct hv_input_create_device_domain *input;  
> > nit: use consistent coding style, inverse Christmas tree.
> >   
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +
> >> +	ddp = &input->device_domain;
> >> +	ddp->partition_id = HV_PARTITION_ID_SELF;
> >> +	ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> >> +	ddp->domain_id.id = hvdom->domid_num;
> >> +
> >> +
> >> input->create_device_domain_flags.forward_progress_required = 1;
> >> +	input->create_device_domain_flags.inherit_owning_vtl = 0;
> >> +
> >> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN,
> >> input, NULL); +
> >> +	local_irq_restore(flags);
> >> +
> >> +	if (!hv_result_success(status))
> >> +		hv_status_err(status, "\n");
> >> +
> >> +	return hv_result_to_errno(status);
> >> +}
> >> +
> >> +/* During boot, all devices are attached to this */
> >> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct
> >> device *dev) +{
> >> +	return &hv_def_identity_dom.iommu_dom;
> >> +}
> >> +
> >> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
> >> device *dev) +{
> >> +	struct hv_domain *hvdom;
> >> +	int rc;
> >> +
> >> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
> >> !hv_no_attdev) {
> >> +		pr_err("Hyper-V: l1vh iommu does not support host
> >> devices\n");  
> > why is this an error if user input choose not to do direct attach?  
> 
> Like the error message says: on l1vh, direct attaches of host devices
> (eg dpdk) is not supported. and l1vh only does direct attaches. IOW,
> no host devices on l1vh.
> 
This hv_no_attdev flag is really confusing to me. By default
hv_no_attdev is false, which allows direct attach, and you are saying
l1vh allows it.

Why does this flag also control host device attachment in l1vh? If you
can tell the difference between direct host device attach and other
direct attach, why don't you always reject host attach in l1vh?
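
To make the question concrete, the two predicates can be written side by
side. This is a standalone sketch with hypothetical function names. The
first mirrors the condition in the quoted hunk, and the second is the
"always reject" alternative asked about above:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Condition as posted: a paging domain for a host device on L1VH is
 * rejected only while direct attach is enabled (hv_no_attdev unset).
 */
static bool reject_as_posted(bool l1vh, bool is_vmm, bool no_attdev)
{
	return l1vh && !is_vmm && !no_attdev;
}

/*
 * Alternative raised in review: on L1VH, reject host-device domains
 * regardless of the hv_no_attdev boot flag.
 */
static bool reject_always_on_l1vh(bool l1vh, bool is_vmm, bool no_attdev)
{
	(void)no_attdev;	/* deliberately ignored */
	return l1vh && !is_vmm;
}
```

The two differ in exactly one case: L1VH, non-VMM thread, hv_no_attdev set.
As posted, that case is not rejected and falls through to
hv_iommu_create_hyp_devdom(), even though the commit message says L1VH only
supports direct attach.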

> >> +		return NULL;
> >> +	}
> >> +
> >> +	hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> >> +	if (hvdom == NULL)
> >> +		goto out;
> >> +
> >> +	spin_lock_init(&hvdom->mappings_lock);
> >> +	hvdom->mappings_tree = RB_ROOT_CACHED;
> >> +
> >> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT)   /*
> >> ie, 0 */  
> > This is true only when unique_id wraps around, right? Then this
> > driver stops working?  
> 
> Correct. It's a u32, so if my math is right, and a device is attached
> every second, it will take 136 years to wrap! Did i get that right?
> 
This is still an unnecessary vulnerability.
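
As a rough illustration of recycling domain ids instead of counting up
forever, here is a small userspace model. The names model_id_pool,
model_id_alloc and model_id_free and the 64-entry cap are hypothetical and
exist only for this sketch. In the kernel the same effect would more likely
come from the IDA API (ida_alloc_range() with a minimum of 1, and
ida_free() in the domain free path) than from a hand-rolled bitmap:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_DOMAINS 64	/* hypothetical cap, for the sketch only */
#define MODEL_RESERVED_ID 0	/* mirrors HV_DEVICE_DOMAIN_ID_S2_DEFAULT == 0 */

struct model_id_pool {
	uint64_t used;		/* bit n set means id n is in use */
};

/* Return the lowest free id above the reserved one, or -1 if exhausted. */
static int model_id_alloc(struct model_id_pool *p)
{
	int id;

	for (id = MODEL_RESERVED_ID + 1; id < MODEL_MAX_DOMAINS; id++) {
		if (!(p->used & (UINT64_C(1) << id))) {
			p->used |= UINT64_C(1) << id;
			return id;
		}
	}
	return -1;
}

/* Release an id so a later allocation can reuse it. */
static void model_id_free(struct model_id_pool *p, int id)
{
	if (id > MODEL_RESERVED_ID && id < MODEL_MAX_DOMAINS)
		p->used &= ~(UINT64_C(1) << id);
}
```

Because ids are returned on free, the allocator never wraps into the
reserved default id, and a failed domain creation can release its id
without the unique_id-- rollback used in the posted error path.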

> > can you use an IDR for the unique_id and free it as you detach
> > instead of doing this cyclic allocation?
> >   
> >> +		goto out_free;
> >> +
> >> +	hvdom->domid_num = unique_id;
> >> +	hvdom->iommu_dom.geometry = default_geometry;
> >> +	hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> >> +
> >> +	/* For guests, by default we do direct attaches, so no
> >> domain in hyp */
> >> +	if (hv_curr_thread_is_vmm() && !hv_no_attdev)
> >> +		hvdom->attached_dom = true;
> >> +	else {
> >> +		rc = hv_iommu_create_hyp_devdom(hvdom);
> >> +		if (rc)
> >> +			goto out_free_id;
> >> +	}
> >> +
> >> +	return &hvdom->iommu_dom;
> >> +
> >> +out_free_id:
> >> +	unique_id--;
> >> +out_free:
> >> +	kfree(hvdom);
> >> +out:
> >> +	return NULL;
> >> +}
> >> +
> >> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> >> +{
> >> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +	unsigned long flags;
> >> +	u64 status;
> >> +	struct hv_input_delete_device_domain *input;
> >> +
> >> +	if (hv_special_domain(hvdom))
> >> +		return;
> >> +
> >> +	if (hvdom->num_attchd) {
> >> +		pr_err("Hyper-V: can't free busy iommu domain
> >> (%p)\n", immdom);
> >> +		return;
> >> +	}
> >> +
> >> +	if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
> >> +		struct hv_input_device_domain *ddp;
> >> +
> >> +		local_irq_save(flags);
> >> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +		ddp = &input->device_domain;
> >> +		memset(input, 0, sizeof(*input));
> >> +
> >> +		ddp->partition_id = HV_PARTITION_ID_SELF;
> >> +		ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> >> +		ddp->domain_id.id = hvdom->domid_num;
> >> +
> >> +		status =
> >> hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> >> +					 NULL);
> >> +		local_irq_restore(flags);
> >> +
> >> +		if (!hv_result_success(status))
> >> +			hv_status_err(status, "\n");
> >> +	}  
> 
> > you could free the domid here, no?  
> sorry, don't follow what you mean by domid, you mean unique_id?
> 
yes.
> >> +
> >> +	kfree(hvdom);
> >> +}
> >> +
> >> +/* Attach a device to a domain previously created in the hypervisor */
> >> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
> >> +{
> >> +	unsigned long flags;
> >> +	u64 status;
> >> +	enum hv_device_type dev_type;
> >> +	struct hv_input_attach_device_domain *input;
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +
> >> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> >> +	input->device_domain.domain_id.type =
> >> HV_DEVICE_DOMAIN_TYPE_S2;
> >> +	input->device_domain.domain_id.id = hvdom->domid_num;
> >> +
> >> +	/* NB: Upon guest shutdown, device is re-attached to the
> >> default domain
> >> +	 * without explicit detach.
> >> +	 */
> >> +	if (hv_l1vh_partition())
> >> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> >> +	else
> >> +		dev_type = HV_DEVICE_TYPE_PCI;
> >> +
> >> +	input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> >> +
> >> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> >> +	local_irq_restore(flags);
> >> +
> >> +	if (!hv_result_success(status))
> >> +		hv_status_err(status, "\n");
> >> +
> >> +	return hv_result_to_errno(status);
> >> +}
> >> +
> >> +/* Caller must have validated that dev is a valid pci dev */
> >> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
> >> +{
> >> +	struct hv_input_attach_device *input;
> >> +	u64 status;
> >> +	int rc;
> >> +	unsigned long flags;
> >> +	union hv_device_id host_devid;
> >> +	enum hv_device_type dev_type;
> >> +	u64 ptid = hv_iommu_get_curr_partid();
> >> +
> >> +	if (ptid == HV_PARTITION_ID_INVALID) {
> >> +		pr_err("Hyper-V: Invalid partition id in direct
> >> attach\n");
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	if (hv_l1vh_partition())
> >> +		dev_type = HV_DEVICE_TYPE_LOGICAL;
> >> +	else
> >> +		dev_type = HV_DEVICE_TYPE_PCI;
> >> +
> >> +	host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> >> +
> >> +	do {
> >> +		local_irq_save(flags);
> >> +		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +		memset(input, 0, sizeof(*input));
> >> +		input->partition_id = ptid;
> >> +		input->device_id = host_devid;
> >> +
> >> +		/* Hypervisor associates logical_id with this
> >> device, and in
> >> +		 * some hypercalls like retarget interrupts,
> >> logical_id must be
> >> +		 * used instead of the BDF. It is a required
> >> parameter.
> >> +		 */
> >> +		input->attdev_flags.logical_id = 1;
> >> +		input->logical_devid =
> >> +			   hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> >> +
> >> +		status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
> >> +		local_irq_restore(flags);
> >> +
> >> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
> >> +			if (rc)
> >> +				break;
> >> +		}
> >> +	} while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
> >> +
> >> +	if (!hv_result_success(status))
> >> +		hv_status_err(status, "\n");
> >> +
> >> +	return hv_result_to_errno(status);
> >> +}
> >> +
> >> +/* This to attach a device to both host app (like DPDK) and a
> >> guest VM */  
> > The IOMMU driver should be agnostic to the type of consumer,
> > whether a userspace driver or a VM. This comment is not necessary.
> >   
> >> +static int hv_iommu_attach_dev(struct iommu_domain *immdom,
> >> struct device *dev,
> >> +			       struct iommu_domain *old)  
> > This does not match upstream kernel prototype, which kernel version
> > is this based on? I will stop here for now.  
> 
> As I mentioned in the cover letter:
>           Based on: 8f0b4cce4481 (origin/hyperv-next)
> 
Where is this repo?

> which is now 6.19 based.
> 
> > struct iommu_domain_ops {
> > 	int (*attach_dev)(struct iommu_domain *domain, struct device
> > 	*dev);  
> 
> > I think you got it backwards, 6.6 has this. 6.19 has extra parameter.
> 
You are right, this is a very recent change. My bad.

> 
> 
> >> +{
> >> +	struct pci_dev *pdev;
> >> +	int rc;
> >> +	struct hv_domain *hvdom_new = to_hv_domain(immdom);
> >> +	struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
> >> +
> >> +	/* Only allow PCI devices for now */
> >> +	if (!dev_is_pci(dev))
> >> +		return -EINVAL;
> >> +
> >> +	pdev = to_pci_dev(dev);
> >> +
> >> +	/* l1vh does not support host device (eg DPDK) passthru */
> >> +	if (hv_l1vh_partition() && !hv_special_domain(hvdom_new)
> >> &&
> >> +	    !hvdom_new->attached_dom)
> >> +		return -EINVAL;
> >> +
> >> +	/*
> >> +	 * VFIO does not do explicit detach calls, hence check
> >> first if we need
> >> +	 * to detach first. Also, in case of guest shutdown, it's
> >> the VMM
> >> +	 * thread that attaches it back to the
> >> hv_def_identity_dom, and
> >> +	 * hvdom_prev will not be null then. It is null during
> >> boot.
> >> +	 */
> >> +	if (hvdom_prev)
> >> +		if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
> >> +			hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
> >> +
> >> +	if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
> >> +		dev_iommu_priv_set(dev, hvdom_new);  /* sets
> >> "private" field */
> >> +		return 0;
> >> +	}
> >> +
> >> +	if (hvdom_new->attached_dom)
> >> +		rc = hv_iommu_direct_attach_device(pdev);
> >> +	else
> >> +		rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
> >> +
> >> +	if (rc && hvdom_prev) {
> >> +		int rc1;
> >> +
> >> +		if (hvdom_prev->attached_dom)
> >> +			rc1 = hv_iommu_direct_attach_device(pdev);
> >> +		else
> >> +			rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
> >> +
> >> +		if (rc1)
> >> +			pr_err("Hyper-V: iommu could not restore
> >> orig device state.. dev:%s\n",
> >> +			       dev_name(dev));
> >> +	}
> >> +
> >> +	if (rc == 0) {
> >> +		dev_iommu_priv_set(dev, hvdom_new);  /* sets
> >> "private" field */
> >> +		hvdom_new->num_attchd++;
> >> +	}
> >> +
> >> +	return rc;
> >> +}
> >> +
> >> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
> >> +					struct pci_dev *pdev)
> >> +{
> >> +	struct hv_input_detach_device *input;
> >> +	u64 status, log_devid;
> >> +	unsigned long flags;
> >> +
> >> +	log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +
> >> +	input->partition_id = hv_iommu_get_curr_partid();
> >> +	input->logical_devid = log_devid;
> >> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input,
> >> NULL);
> >> +	local_irq_restore(flags);
> >> +
> >> +	if (!hv_result_success(status))
> >> +		hv_status_err(status, "\n");
> >> +}
> >> +
> >> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
> >> +				      struct pci_dev *pdev)
> >> +{
> >> +	u64 status, devid;
> >> +	unsigned long flags;
> >> +	struct hv_input_detach_device_domain *input;
> >> +
> >> +	devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +
> >> +	input->partition_id = HV_PARTITION_ID_SELF;
> >> +	input->device_id.as_uint64 = devid;
> >> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN,
> >> input, NULL);
> >> +	local_irq_restore(flags);
> >> +
> >> +	if (!hv_result_success(status))
> >> +		hv_status_err(status, "\n");
> >> +}
> >> +
> >> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> >> +{
> >> +	struct pci_dev *pdev;
> >> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +
> >> +	/* See the attach function, only PCI devices for now */
> >> +	if (!dev_is_pci(dev))
> >> +		return;
> >> +
> >> +	if (hvdom->num_attchd == 0)
> >> +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> >> +
> >> +	pdev = to_pci_dev(dev);
> >> +
> >> +	if (hvdom->attached_dom) {
> >> +		hv_iommu_det_dev_from_guest(hvdom, pdev);
> >> +
> >> +		/* Do not reset attached_dom, hv_iommu_unmap_pages
> >> happens
> >> +		 * next.
> >> +		 */
> >> +	} else {
> >> +		hv_iommu_det_dev_from_dom(hvdom, pdev);
> >> +	}
> >> +
> >> +	hvdom->num_attchd--;
> >> +}
> >> +
> >> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> >> +				     unsigned long iova,
> >> phys_addr_t paddr,
> >> +				     size_t size, u32 flags)
> >> +{
> >> +	unsigned long irqflags;
> >> +	struct hv_iommu_mapping *mapping;
> >> +
> >> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> >> +	if (!mapping)
> >> +		return -ENOMEM;
> >> +
> >> +	mapping->paddr = paddr;
> >> +	mapping->iova.start = iova;
> >> +	mapping->iova.last = iova + size - 1;
> >> +	mapping->flags = flags;
> >> +
> >> +	spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> >> +	interval_tree_insert(&mapping->iova,
> >> &hvdom->mappings_tree);
> >> +	spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> >> +					unsigned long iova, size_t size)
> >> +{
> >> +	unsigned long flags;
> >> +	size_t unmapped = 0;
> >> +	unsigned long last = iova + size - 1;
> >> +	struct hv_iommu_mapping *mapping = NULL;
> >> +	struct interval_tree_node *node, *next;
> >> +
> >> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> >> +	next = interval_tree_iter_first(&hvdom->mappings_tree,
> >> iova, last);
> >> +	while (next) {
> >> +		node = next;
> >> +		mapping = container_of(node, struct
> >> hv_iommu_mapping, iova);
> >> +		next = interval_tree_iter_next(node, iova, last);
> >> +
> >> +		/* Trying to split a mapping? Not supported for
> >> now. */
> >> +		if (mapping->iova.start < iova)
> >> +			break;
> >> +
> >> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> >> +
> >> +		interval_tree_remove(node, &hvdom->mappings_tree);
> >> +		kfree(mapping);
> >> +	}
> >> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> >> +
> >> +	return unmapped;
> >> +}
> >> +
> >> +/* Return: must return exact status from the hypercall without changes */
> >> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> >> +			    unsigned long iova, phys_addr_t paddr,
> >> +			    unsigned long npages, u32 map_flags)
> >> +{
> >> +	u64 status;
> >> +	int i;
> >> +	struct hv_input_map_device_gpa_pages *input;
> >> +	unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +
> >> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> >> +	input->device_domain.domain_id.type =
> >> HV_DEVICE_DOMAIN_TYPE_S2;
> >> +	input->device_domain.domain_id.id = hvdom->domid_num;
> >> +	input->map_flags = map_flags;
> >> +	input->target_device_va_base = iova;
> >> +
> >> +	pfn = paddr >> HV_HYP_PAGE_SHIFT;
> >> +	for (i = 0; i < npages; i++, pfn++)
> >> +		input->gpa_page_list[i] = pfn;
> >> +
> >> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES,
> >> npages, 0,
> >> +				     input, NULL);
> >> +
> >> +	local_irq_restore(flags);
> >> +	return status;
> >> +}
> >> +
> >> +/*
> >> + * The core VFIO code loops over memory ranges calling this
> >> function with
> >> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in
> >> vfio_iommu_map.
> >> + */
> >> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong
> >> iova,
> >> +			      phys_addr_t paddr, size_t pgsize,
> >> size_t pgcount,
> >> +			      int prot, gfp_t gfp, size_t *mapped)
> >> +{
> >> +	u32 map_flags;
> >> +	int ret;
> >> +	u64 status;
> >> +	unsigned long npages, done = 0;
> >> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +	size_t size = pgsize * pgcount;
> >> +
> >> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
> >> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> >> +
> >> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size,
> >> map_flags);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	if (hvdom->attached_dom) {
> >> +		*mapped = size;
> >> +		return 0;
> >> +	}
> >> +
> >> +	npages = size >> HV_HYP_PAGE_SHIFT;
> >> +	while (done < npages) {
> >> +		ulong completed, remain = npages - done;
> >> +
> >> +		status = hv_iommu_map_pgs(hvdom, iova, paddr,
> >> remain,
> >> +					  map_flags);
> >> +
> >> +		completed = hv_repcomp(status);
> >> +		done = done + completed;
> >> +		iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> >> +		paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> >> +
> >> +		if (hv_result(status) ==
> >> HV_STATUS_INSUFFICIENT_MEMORY) {
> >> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +
> >> hv_current_partition_id,
> >> +						    256);
> >> +			if (ret)
> >> +				break;
> >> +		}
> >> +		if (!hv_result_success(status))
> >> +			break;
> >> +	}
> >> +
> >> +	if (!hv_result_success(status)) {
> >> +		size_t done_size = done << HV_HYP_PAGE_SHIFT;
> >> +
> >> +		hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> >> +			      done, npages, iova);
> >> +		/*
> >> +		 * lookup tree has all mappings [0 - size-1].
> >> Below unmap will
> >> +		 * only remove from [0 - done], we need to remove
> >> second chunk
> >> +		 * [done+1 - size-1].
> >> +		 */
> >> +		hv_iommu_del_tree_mappings(hvdom, iova, size -
> >> done_size);
> >> +		hv_iommu_unmap_pages(immdom, iova - done_size,
> >> pgsize,
> >> +				     done, NULL);
> >> +		if (mapped)
> >> +			*mapped = 0;
> >> +	} else
> >> +		if (mapped)
> >> +			*mapped = size;
> >> +
> >> +	return hv_result_to_errno(status);
> >> +}
> >> +
> >> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> >> +				   size_t pgsize, size_t pgcount,
> >> +				   struct iommu_iotlb_gather *gather)
> >> +{
> >> +	unsigned long flags, npages;
> >> +	struct hv_input_unmap_device_gpa_pages *input;
> >> +	u64 status;
> >> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +	size_t unmapped, size = pgsize * pgcount;
> >> +
> >> +	unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> >> +	if (unmapped < size)
> >> +		pr_err("%s: could not delete all mappings
> >> (%lx:%lx/%lx)\n",
> >> +		       __func__, iova, unmapped, size);
> >> +
> >> +	if (hvdom->attached_dom)
> >> +		return size;
> >> +
> >> +	npages = size >> HV_HYP_PAGE_SHIFT;
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +
> >> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> >> +	input->device_domain.domain_id.type =
> >> HV_DEVICE_DOMAIN_TYPE_S2;
> >> +	input->device_domain.domain_id.id = hvdom->domid_num;
> >> +	input->target_device_va_base = iova;
> >> +
> >> +	status =
> >> hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> >> +				     0, input, NULL);
> >> +	local_irq_restore(flags);
> >> +
> >> +	if (!hv_result_success(status))
> >> +		hv_status_err(status, "\n");
> >> +
> >> +	return unmapped;
> >> +}
> >> +
> >> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain
> >> *immdom,
> >> +					 dma_addr_t iova)
> >> +{
> >> +	u64 paddr = 0;
> >> +	unsigned long flags;
> >> +	struct hv_iommu_mapping *mapping;
> >> +	struct interval_tree_node *node;
> >> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +
> >> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> >> +	node = interval_tree_iter_first(&hvdom->mappings_tree,
> >> iova, iova);
> >> +	if (node) {
> >> +		mapping = container_of(node, struct
> >> hv_iommu_mapping, iova);
> >> +		paddr = mapping->paddr + (iova -
> >> mapping->iova.start);
> >> +	}
> >> +	spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> >> +
> >> +	return paddr;
> >> +}
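The translation above is the mapping's base physical address plus the offset
of the IOVA within the mapping. A minimal userspace sketch of that
arithmetic follows, assuming a linear scan over an array in place of the
interval tree; struct demo_mapping and demo_iova_to_phys() are hypothetical
stand-ins, not names from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct hv_iommu_mapping: one contiguous range. */
struct demo_mapping {
	uint64_t iova_start;	/* first IOVA covered */
	uint64_t iova_last;	/* last IOVA covered (inclusive) */
	uint64_t paddr;		/* physical address backing iova_start */
};

/* Linear scan instead of an interval tree; returns 0 when nothing covers iova. */
static uint64_t demo_iova_to_phys(const struct demo_mapping *maps, size_t n,
				  uint64_t iova)
{
	for (size_t i = 0; i < n; i++) {
		if (iova >= maps[i].iova_start && iova <= maps[i].iova_last)
			return maps[i].paddr + (iova - maps[i].iova_start);
	}
	return 0;
}
```

As in the patch, a return value of 0 doubles as "not mapped", so a mapping
that legitimately resolves to physical address 0 cannot be told apart from a
miss.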
> >> +
> >> +/*
> >> + * Currently, hypervisor does not provide list of devices it is
> >> using
> >> + * dynamically. So use this to allow users to manually specify
> >> devices that
> >> + * should be skipped. (eg. hypervisor debugger using some network
> >> device).
> >> + */
> >> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> >> +{
> >> +	if (!dev_is_pci(dev))
> >> +		return ERR_PTR(-ENODEV);
> >> +
> >> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> >> +		int rc, pos = 0;
> >> +		int parsed;
> >> +		int segment, bus, slot, func;
> >> +		struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> +		do {
> >> +			parsed = 0;
> >> +
> >> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> >> +				    &segment, &bus, &slot, &func, &parsed);
> >> +			if (rc)
> >> +				break;
> >> +			if (parsed <= 0)
> >> +				break;
> >> +
> >> +			if (pci_domain_nr(pdev->bus) == segment &&
> >> +			    pdev->bus->number == bus &&
> >> +			    PCI_SLOT(pdev->devfn) == slot &&
> >> +			    PCI_FUNC(pdev->devfn) == func) {
> >> +
> >> +				dev_info(dev, "skipped by Hyper-V
> >> IOMMU\n");
> >> +				return ERR_PTR(-ENODEV);
> >> +			}
> >> +			pos += parsed;
> >> +
> >> +		} while (pci_devs_to_skip[pos]);
> >> +	}
> >> +
> >> +	/* Device will be explicitly attached to the default
> >> domain, so no need
> >> +	 * to do dev_iommu_priv_set() here.
> >> +	 */
> >> +
> >> +	return &hv_virt_iommu;
> >> +}
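For illustration, here is a userspace sketch of walking a
"(seg:bus:slot.func)" skip list with sscanf() and %n. It assumes the intent
is to keep scanning only while all four fields convert. Since sscanf()
returns the number of converted fields and %n is not counted, the sketch
tests for exactly 4. bdf_in_skip_list() is a hypothetical helper, not part
of the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if seg:bus:slot.func appears in skip_list. */
static bool bdf_in_skip_list(const char *skip_list, unsigned int seg,
			     unsigned int bus, unsigned int slot,
			     unsigned int func)
{
	int pos = 0;

	while (skip_list[pos]) {
		unsigned int s, b, d, f;
		int parsed = 0;

		/*
		 * Four conversions are expected. parsed stays 0 when the
		 * closing ')' is missing, because %n is then never reached.
		 */
		if (sscanf(skip_list + pos, " (%x:%x:%x.%x) %n",
			   &s, &b, &d, &f, &parsed) != 4 || parsed <= 0)
			break;	/* malformed entry: stop scanning */

		if (s == seg && b == bus && d == slot && f == func)
			return true;

		pos += parsed;	/* advance past this entry and trailing blanks */
	}
	return false;
}
```

The sketch reads the fields into unsigned int because %x expects a pointer
to unsigned int.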
> >> +
> >> +static void hv_iommu_probe_finalize(struct device *dev)
> >> +{
> >> +	struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> >> +
> >> +	if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> >> +		iommu_setup_dma_ops(dev);
> >> +	else
> >> +		set_dma_ops(dev, NULL);
> >> +}
> >> +
> >> +static void hv_iommu_release_device(struct device *dev)
> >> +{
> >> +	struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> >> +
> >> +	/* Need to detach device from device domain if necessary.
> >> */
> >> +	if (hvdom)
> >> +		hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> >> +
> >> +	dev_iommu_priv_set(dev, NULL);
> >> +	set_dma_ops(dev, NULL);
> >> +}
> >> +
> >> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> >> +{
> >> +	if (dev_is_pci(dev))
> >> +		return pci_device_group(dev);
> >> +	else
> >> +		return generic_device_group(dev);
> >> +}
> >> +
> >> +static int hv_iommu_def_domain_type(struct device *dev)
> >> +{
> >> +	/* The hypervisor always creates this by default during
> >> boot */
> >> +	return IOMMU_DOMAIN_IDENTITY;
> >> +}
> >> +
> >> +static struct iommu_ops hv_iommu_ops = {
> >> +	.capable	    = hv_iommu_capable,
> >> +	.domain_alloc_identity	=
> >> hv_iommu_domain_alloc_identity,
> >> +	.domain_alloc_paging	=
> >> hv_iommu_domain_alloc_paging,
> >> +	.probe_device	    = hv_iommu_probe_device,
> >> +	.probe_finalize     = hv_iommu_probe_finalize,
> >> +	.release_device     = hv_iommu_release_device,
> >> +	.def_domain_type    = hv_iommu_def_domain_type,
> >> +	.device_group	    = hv_iommu_device_group,
> >> +	.default_domain_ops = &(const struct iommu_domain_ops) {
> >> +		.attach_dev   = hv_iommu_attach_dev,
> >> +		.map_pages    = hv_iommu_map_pages,
> >> +		.unmap_pages  = hv_iommu_unmap_pages,
> >> +		.iova_to_phys = hv_iommu_iova_to_phys,
> >> +		.free	      = hv_iommu_domain_free,
> >> +	},
> >> +	.owner		    = THIS_MODULE,
> >> +};
> >> +
> >> +static void __init hv_initialize_special_domains(void)
> >> +{
> >> +	hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> >> +	hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
> >> +}
> > This could be initialized statically.
> >   
> >> +
> >> +static int __init hv_iommu_init(void)
> >> +{
> >> +	int ret;
> >> +	struct iommu_device *iommup = &hv_virt_iommu;
> >> +
> >> +	if (!hv_is_hyperv_initialized())
> >> +		return -ENODEV;
> >> +
> >> +	ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s",
> >> "hyperv-iommu");
> >> +	if (ret) {
> >> +		pr_err("Hyper-V: iommu_device_sysfs_add failed:
> >> %d\n", ret);
> >> +		return ret;
> >> +	}
> >> +
> >> +	/* This must come before iommu_device_register because the
> >> latter calls
> >> +	 * into the hooks.
> >> +	 */
> >> +	hv_initialize_special_domains();
> >> +
> >> +	ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> >> +	if (ret) {
> >> +		pr_err("Hyper-V: iommu_device_register failed:
> >> %d\n", ret);
> >> +		goto err_sysfs_remove;
> >> +	}
> >> +
> >> +	pr_info("Hyper-V IOMMU initialized\n");
> >> +
> >> +	return 0;
> >> +
> >> +err_sysfs_remove:
> >> +	iommu_device_sysfs_remove(iommup);
> >> +	return ret;
> >> +}
> >> +
> >> +void __init hv_iommu_detect(void)
> >> +{
> >> +	if (no_iommu || iommu_detected)
> >> +		return;
> >> +
> >> +	/* For l1vh, always expose an iommu unit */
> >> +	if (!hv_l1vh_partition())
> >> +		if (!(ms_hyperv.misc_features &
> >> HV_DEVICE_DOMAIN_AVAILABLE))
> >> +			return;
> >> +
> >> +	iommu_detected = 1;
> >> +	x86_init.iommu.iommu_init = hv_iommu_init;
> >> +
> >> +	pci_request_acs();
> >> +}
> >> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> >> index dfc516c1c719..2ad111727e82 100644
> >> --- a/include/linux/hyperv.h
> >> +++ b/include/linux/hyperv.h
> >> @@ -1767,4 +1767,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
> >>  #define HVPFN_DOWN(x)	((x) >> HV_HYP_PAGE_SHIFT)
> >>  #define page_to_hvpfn(page)	(page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
> >> +#ifdef CONFIG_HYPERV_IOMMU
> >> +void __init hv_iommu_detect(void);
> >> +#else
> >> +static inline void hv_iommu_detect(void) { }
> >> +#endif /* CONFIG_HYPERV_IOMMU */
> >> +
> >>   #endif /* _HYPERV_H */  



* Re: [PATCH 2/4] mshv: Introduce hv_deposit_memory helper functions
From: Mukesh R @ 2026-01-27 19:44 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXkEMnDy8UFwJitP@skinsburskii.localdomain>

On 1/27/26 10:30, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 06:06:23PM -0800, Mukesh R wrote:
>> On 1/25/26 14:41, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 23, 2026 at 04:33:39PM -0800, Mukesh R wrote:
>>>> On 1/22/26 17:35, Stanislav Kinsburskii wrote:
>>>>> Introduce hv_deposit_memory_node() and hv_deposit_memory() helper
>>>>> functions to handle memory deposition with proper error handling.
>>>>>
>>>>> The new hv_deposit_memory_node() function takes the hypervisor status
>>>>> as a parameter and validates it before depositing pages. It checks for
>>>>> HV_STATUS_INSUFFICIENT_MEMORY specifically and returns an error for
>>>>> unexpected status codes.
>>>>>
>>>>> This is a precursor patch to new out-of-memory error codes support.
>>>>> No functional changes intended.
>>>>>
>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>> ---
>>>>>     drivers/hv/hv_proc.c           |   22 ++++++++++++++++++++--
>>>>>     drivers/hv/mshv_root_hv_call.c |   25 +++++++++----------------
>>>>>     drivers/hv/mshv_root_main.c    |    3 +--
>>>>>     include/asm-generic/mshyperv.h |   10 ++++++++++
>>>>>     4 files changed, 40 insertions(+), 20 deletions(-)
>>>>>
>>>>> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
>>>>> index 80c66d1c74d5..c0c2bfc80d77 100644
>>>>> --- a/drivers/hv/hv_proc.c
>>>>> +++ b/drivers/hv/hv_proc.c
>>>>> @@ -110,6 +110,23 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
>>>>> +int hv_deposit_memory_node(int node, u64 partition_id,
>>>>> +			   u64 hv_status)
>>>>> +{
>>>>> +	u32 num_pages;
>>>>> +
>>>>> +	switch (hv_result(hv_status)) {
>>>>> +	case HV_STATUS_INSUFFICIENT_MEMORY:
>>>>> +		num_pages = 1;
>>>>> +		break;
>>>>> +	default:
>>>>> +		hv_status_err(hv_status, "Unexpected!\n");
>>>>> +		return -ENOMEM;
>>>>> +	}
>>>>> +	return hv_call_deposit_pages(node, partition_id, num_pages);
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(hv_deposit_memory_node);
>>>>> +
>>>>
>>>> Different hypercalls may want to deposit different number of pages in one
>>>> shot. As feature evolves, page sizes get mixed, we'd almost need that
>>>> flexibility. So, imo, either we just don't do this for now, or add num pages
>>>> parameter to be passed down.
>>>>
>>>
>>> What do you mean by "page sizes get mixed"?
>>> A helper to deposit num pages already exists: it's
>>> hv_call_deposit_pages().
>>
>> My point, you are removing number of pages, and we may want to keep
>> that so one can quickly play around and change them.
>>
>> -                       ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> -                                                   pt_id, 1);
>> +                       ret = hv_deposit_memory(pt_id, status);
>>
>> For example, in hv_call_initialize_partition() we may realize after
>> some analysis that depositing 2 pages or 4 pages is much better.
>>
> 
> We have been using this 1-page deposit logic from the beginning. To
> change the number of pages, simply replace hv_deposit_memory with
> hv_call_deposit_pages and specify the desired number of pages.

You could perhaps rename it to hv_deposit_page().
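
To make the two layers under discussion concrete, here is a minimal
userspace mock: a low-level deposit that takes an explicit page count, and a
status-driven helper on top that always deposits one page and rejects
unexpected statuses. Every demo_* name and status value is hypothetical; the
real helpers are hv_call_deposit_pages() and the proposed
hv_deposit_memory().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical status values; the real ones live in the Hyper-V headers. */
enum demo_status {
	DEMO_SUCCESS = 0,
	DEMO_INSUFFICIENT_MEMORY = 1,
	DEMO_OTHER = 2,
};

static uint32_t demo_pool;	/* pages currently deposited */

/* Mock hypercall: succeeds once at least `need` pages are in the pool. */
static enum demo_status demo_hypercall(uint32_t need)
{
	return demo_pool >= need ? DEMO_SUCCESS : DEMO_INSUFFICIENT_MEMORY;
}

/* Low-level deposit with an explicit count, in the role of hv_call_deposit_pages(). */
static int demo_deposit_pages(uint32_t num_pages)
{
	demo_pool += num_pages;
	return 0;
}

/* Status-driven helper: one page for the known OOM status, error otherwise. */
static int demo_deposit_memory(enum demo_status status)
{
	switch (status) {
	case DEMO_INSUFFICIENT_MEMORY:
		return demo_deposit_pages(1);
	default:
		return -ENOMEM;
	}
}

/* Retry loop; returns the number of deposits made, or a negative errno. */
static int demo_call_with_deposit(uint32_t need)
{
	int ret, rounds = 0;

	do {
		enum demo_status status = demo_hypercall(need);

		if (status == DEMO_SUCCESS)
			return rounds;
		ret = demo_deposit_memory(status);
		rounds++;
	} while (!ret);

	return ret;
}
```

A caller that wants a different batch size would call demo_deposit_pages()
directly inside its own loop, which is the trade-off being debated here.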

> The proposed approach reduces code duplication and is less error-prone,
> as there are multiple error codes to handle. Consolidating the logic
> also makes the driver more robust.
> 
> 
> Thanks,  Stanislav
> 
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>
>>>>
>>>>>     bool hv_result_oom(u64 status)
>>>>>     {
>>>>>     	switch (hv_result(status)) {
>>>>> @@ -155,7 +172,8 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
>>>>>     			}
>>>>>     			break;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(node, hv_current_partition_id, 1);
>>>>> +		ret = hv_deposit_memory_node(node, hv_current_partition_id,
>>>>> +					     status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -197,7 +215,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
>>>>>     			}
>>>>>     			break;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(node, partition_id, 1);
>>>>> +		ret = hv_deposit_memory_node(node, partition_id, status);
>>>>>     	} while (!ret);
>>>>> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
>>>>> index 58c5cbf2e567..06f2bac8039d 100644
>>>>> --- a/drivers/hv/mshv_root_hv_call.c
>>>>> +++ b/drivers/hv/mshv_root_hv_call.c
>>>>> @@ -123,8 +123,7 @@ int hv_call_create_partition(u64 flags,
>>>>>     			break;
>>>>>     		}
>>>>>     		local_irq_restore(irq_flags);
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -					    hv_current_partition_id, 1);
>>>>> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -151,7 +150,7 @@ int hv_call_initialize_partition(u64 partition_id)
>>>>>     			ret = hv_result_to_errno(status);
>>>>>     			break;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>>>>> +		ret = hv_deposit_memory(partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -465,8 +464,7 @@ int hv_call_get_vp_state(u32 vp_index, u64 partition_id,
>>>>>     		}
>>>>>     		local_irq_restore(flags);
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -					    partition_id, 1);
>>>>> +		ret = hv_deposit_memory(partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -525,8 +523,7 @@ int hv_call_set_vp_state(u32 vp_index, u64 partition_id,
>>>>>     		}
>>>>>     		local_irq_restore(flags);
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -					    partition_id, 1);
>>>>> +		ret = hv_deposit_memory(partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -573,7 +570,7 @@ static int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
>>>>>     		local_irq_restore(flags);
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id, 1);
>>>>> +		ret = hv_deposit_memory(partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -722,8 +719,7 @@ hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
>>>>>     			ret = hv_result_to_errno(status);
>>>>>     			break;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE, port_partition_id, 1);
>>>>> -
>>>>> +		ret = hv_deposit_memory(port_partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -776,8 +772,7 @@ hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
>>>>>     			ret = hv_result_to_errno(status);
>>>>>     			break;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -					    connection_partition_id, 1);
>>>>> +		ret = hv_deposit_memory(connection_partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -848,8 +843,7 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
>>>>>     			break;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -					    hv_current_partition_id, 1);
>>>>> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>>>>>     	} while (!ret);
>>>>>     	return ret;
>>>>> @@ -885,8 +879,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
>>>>>     			return ret;
>>>>>     		}
>>>>> -		ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -					    hv_current_partition_id, 1);
>>>>> +		ret = hv_deposit_memory(hv_current_partition_id, status);
>>>>>     		if (ret)
>>>>>     			return ret;
>>>>>     	} while (!ret);
>>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>>> index f4697497f83e..5fc572e31cd7 100644
>>>>> --- a/drivers/hv/mshv_root_main.c
>>>>> +++ b/drivers/hv/mshv_root_main.c
>>>>> @@ -264,8 +264,7 @@ static int mshv_ioctl_passthru_hvcall(struct mshv_partition *partition,
>>>>>     		if (!hv_result_oom(status))
>>>>>     			ret = hv_result_to_errno(status);
>>>>>     		else
>>>>> -			ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>>> -						    pt_id, 1);
>>>>> +			ret = hv_deposit_memory(pt_id, status);
>>>>>     	} while (!ret);
>>>>>     	args.status = hv_result(status);
>>>>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>>>>> index b73352a7fc9e..c8e8976839f8 100644
>>>>> --- a/include/asm-generic/mshyperv.h
>>>>> +++ b/include/asm-generic/mshyperv.h
>>>>> @@ -344,6 +344,7 @@ static inline bool hv_parent_partition(void)
>>>>>     }
>>>>>     bool hv_result_oom(u64 status);
>>>>> +int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
>>>>>     int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>>>>>     int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>>>>>     int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>>>>> @@ -353,6 +354,10 @@ static inline bool hv_root_partition(void) { return false; }
>>>>>     static inline bool hv_l1vh_partition(void) { return false; }
>>>>>     static inline bool hv_parent_partition(void) { return false; }
>>>>>     static inline bool hv_result_oom(u64 status) { return false; }
>>>>> +static inline int hv_deposit_memory_node(int node, u64 partition_id, u64 status)
>>>>> +{
>>>>> +	return -EOPNOTSUPP;
>>>>> +}
>>>>>     static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
>>>>>     {
>>>>>     	return -EOPNOTSUPP;
>>>>> @@ -367,6 +372,11 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
>>>>>     }
>>>>>     #endif /* CONFIG_MSHV_ROOT */
>>>>> +static inline int hv_deposit_memory(u64 partition_id, u64 status)
>>>>> +{
>>>>> +	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
>>>>> +}
>>>>> +
>>>>>     #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
>>>>>     u8 __init get_vtl(void);
>>>>>     #else
>>>>>
>>>>>


^ permalink raw reply

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Mukesh R @ 2026-01-27 19:56 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXj6FXahxZU8QFq0@skinsburskii.localdomain>

On 1/27/26 09:47, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
>> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
>>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
>>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
>>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
>>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
>>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
>>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
>>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
>>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
>>>>>>>>> hypervisor deposited pages.
>>>>>>>>>
>>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
>>>>>>>>> management is implemented.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
>>>>>>>>> ---
>>>>>>>>>       drivers/hv/Kconfig |    1 +
>>>>>>>>>       1 file changed, 1 insertion(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
>>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
>>>>>>>>> --- a/drivers/hv/Kconfig
>>>>>>>>> +++ b/drivers/hv/Kconfig
>>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
>>>>>>>>>       	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>>>>>>>>>       	# no particular order, making it impossible to reassemble larger pages
>>>>>>>>>       	depends on PAGE_SIZE_4KB
>>>>>>>>> +	depends on !KEXEC
>>>>>>>>>       	select EVENTFD
>>>>>>>>>       	select VIRT_XFER_TO_GUEST_WORK
>>>>>>>>>       	select HMM_MIRROR
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
>>>>>>>> implying that crash dump might be involved. Or did you test kdump
>>>>>>>> and it was fine?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
>>>>>>> will be affected as well.
>>>>>>
>>>>>> So not sure I understand the reason for this patch. We can just block
>>>>>> kexec if there are any VMs running, right? Doing this would mean any
>>>>>> further development would be without a very important and major feature,
>>>>>> right?
>>>>>
>>>>> This is an option. But until it's implemented and merged, a user mshv
>>>>> driver gets into a situation where kexec is broken in a non-obvious way.
>>>>> The system may crash at any time after kexec, depending on whether the
>>>>> new kernel touches the pages deposited to hypervisor or not. This is a
>>>>> bad user experience.
>>>>
>>>> I understand that. But with this we cannot collect core and debug any
>>>> crashes. I was thinking there would be a quick way to prohibit kexec
>>>> for update via notifier or some other quick hack. Did you already
>>>> explore that and didn't find anything, hence this?
>>>>
>>>
>>> This quick hack you mention isn't quick in the upstream kernel as there
>>> is no hook to interrupt kexec process except the live update one.
>>
>> That's the one we want to interrupt and block right? crash kexec
>> is ok and should be allowed. We can document we don't support kexec
>> for update for now.
>>
>>> I sent an RFC for that one but given today's conversation details it
>>> won't be accepted as is.
>>
>> Are you talking about this?
>>
>>          "mshv: Add kexec safety for deposited pages"
>>
> 
> Yes.
> 
>>> Making mshv mutually exclusive with kexec is the only viable option for
>>> now given time constraints.
>>> It is intended to be replaced with proper page lifecycle management in
>>> the future.
>>
>> Yeah, that could take a long time and imo we cannot just disable KEXEC
>> completely. What we want is just block kexec for updates from some
>> mshv file for now, we can print during boot that kexec for updates is
>> not supported on mshv. Hope that makes sense.
>>
> 
> The trade-off here is between disabling kexec support and having the
> kernel crash after kexec in a non-obvious way. This affects both regular
> kexec and crash kexec.

crash kexec on baremetal is not affected, hence disabling that
doesn't make sense as we can't debug crashes then on bm.

Let me think and explore a bit, and if I come up with something, I'll
send a patch here. If nothing, then we can do this as last resort.

Thanks,
-Mukesh


> It's a pity we can't apply a quick hack to disable only regular kexec.
> However, since crash kexec would hit the same issues, until we have a
> proper state transition for deposited pages, the best workaround for now
> is to reset the hypervisor state on every kexec, which needs design,
> work, and testing.
> 
> Disabling kexec is the only consistent way to handle this in the
> upstream kernel at the moment.
> 
> Thanks, Stanislav
> 
> 
>> Thanks,
>> -Mukesh
>>
>>
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>> Therefore it should be explicitly forbidden as it's essentially not
>>>>> supported yet.
>>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Stanislav
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Mukesh


^ permalink raw reply

* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Jacob Pan @ 2026-01-27 22:31 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
	mhklinux
In-Reply-To: <20260127112144.00002991@linux.microsoft.com>

Hi Mukesh,

> > >> +
> > >> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
> > >> !hv_no_attdev) {
> > >> +		pr_err("Hyper-V: l1vh iommu does not support
> > >> host devices\n");    
> > > why is this an error if user input choose not to do direct
> > > attach?    
> > 
> > Like the error message says: on l1vh, direct attaches of host
> > devices (eg dpdk) is not supported. and l1vh only does direct
> > attaches. IOW, no host devices on l1vh.
> >   
> This hv_no_attdev flag is really confusing to me, by default
> hv_no_attdev is false, which allows direct attach. And you are saying
> l1vh allows it.
> 
> Why does this flag also control host device attachment in l1vh? If you
> can tell the difference between direct host device attach and other
> direct attach, why don't you always reject host attach in l1vh?
On second thought, if the hv_no_attdev knob is only meant to control
host domain attach vs. direct attach, then it is irrelevant on L1VH.

Would it make more sense to rename this to something like
hv_host_disable_direct_attach? That would better reflect its scope and
allow it to be ignored under L1VH, and reduce the risk of users
misinterpreting or misusing it.
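
To make the proposed scoping concrete, here is a minimal sketch of the
decision it implies. Everything in it is an assumption for illustration:
hv_want_direct_attach() is a hypothetical helper that does not exist in
the patch, and the two booleans stand in for hv_l1vh_partition() and the
renamed knob.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not taken from the patch: decide whether a
 * device should be direct attached.
 *
 * l1vh:                       stand-in for hv_l1vh_partition()
 * host_disable_direct_attach: stand-in for the renamed hv_no_attdev knob
 */
static bool hv_want_direct_attach(bool l1vh, bool host_disable_direct_attach)
{
	/* L1VH only does direct attach, so the knob is ignored there. */
	if (l1vh)
		return true;

	/* Elsewhere the knob selects domain attach over direct attach. */
	return !host_disable_direct_attach;
}
```

With this shape the knob has no effect under L1VH, so the host-device
rejection there would not need to test it at all.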

^ permalink raw reply

* [PATCH rdma-next] MAINTAINERS: Drop RDMA files from Hyper-V section
From: Leon Romanovsky @ 2026-01-28  9:55 UTC (permalink / raw)
  To: Long Li, Konstantin Taranov; +Cc: linux-rdma, linux-hyperv

From: Leon Romanovsky <leonro@nvidia.com>

MAINTAINERS entries are organized by subsystem ownership, and the RDMA
files belong under drivers/infiniband. Remove the overly broad mana_ib
entries from the Hyper-V section, and instead add the Hyper-V mailing list
to CC on mana_ib patches.

This makes get_maintainer.pl behave more sensibly when running it on
mana_ib patches.

Fixes: 428ca2d4c6aa ("MAINTAINERS: Add Long Li as a Hyper-V maintainer")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 MAINTAINERS | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 12f49de7fe03..d2e3353a1d29 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11739,7 +11739,6 @@ F:	arch/x86/kernel/cpu/mshyperv.c
 F:	drivers/clocksource/hyperv_timer.c
 F:	drivers/hid/hid-hyperv.c
 F:	drivers/hv/
-F:	drivers/infiniband/hw/mana/
 F:	drivers/input/serio/hyperv-keyboard.c
 F:	drivers/iommu/hyperv-iommu.c
 F:	drivers/net/ethernet/microsoft/
@@ -11758,7 +11757,6 @@ F:	include/hyperv/hvhdk_mini.h
 F:	include/linux/hyperv.h
 F:	include/net/mana
 F:	include/uapi/linux/hyperv.h
-F:	include/uapi/rdma/mana-abi.h
 F:	net/vmw_vsock/hyperv_transport.c
 F:	tools/hv/
 
@@ -17318,6 +17316,7 @@ MICROSOFT MANA RDMA DRIVER
 M:	Long Li <longli@microsoft.com>
 M:	Konstantin Taranov <kotaranov@microsoft.com>
 L:	linux-rdma@vger.kernel.org
+L:	linux-hyperv@vger.kernel.org
 S:	Supported
 F:	drivers/infiniband/hw/mana/
 F:	include/net/mana

---
base-commit: a01745ccf7c41043c503546cae7ba7b0ff499d38
change-id: 20260128-get-maintainers-fix-a9319fc985c8

Best regards,
--  
Leon Romanovsky <leonro@nvidia.com>


^ permalink raw reply related

* Re: [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
From: Manivannan Sadhasivam @ 2026-01-28 14:03 UTC (permalink / raw)
  To: Mukesh R
  Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
	linux-arch, kys, haiyangz, wei.liu, decui, longli,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
	lpieralisi, kwilczynski, robh, bhelgaas, arnd, nunodasneves,
	mhklinux, romank
In-Reply-To: <20260120064230.3602565-9-mrathor@linux.microsoft.com>

On Mon, Jan 19, 2026 at 10:42:23PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> 
> Main change here is to rename hv_compose_msi_msg to
> hv_vmbus_compose_msi_msg as we introduce hv_compose_msi_msg in upcoming
> patches that builds MSI messages for both VMBus and non-VMBus cases. VMBus
> is not used on baremetal root partition for example.

> While at it, replace
> spaces with tabs and fix some formatting involving excessive line wraps.
>

Don't mix up cleanup changes. Do it in a separate patch.

- Mani
 
> There is no functional change.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 95 +++++++++++++++--------------
>  1 file changed, 48 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..8bc6a38c9b5a 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -30,7 +30,7 @@
>   * function's configuration space is zero.
>   *
>   * The rest of this driver mostly maps PCI concepts onto underlying Hyper-V
> - * facilities.  For instance, the configuration space of a function exposed
> + * facilities.	For instance, the configuration space of a function exposed
>   * by Hyper-V is mapped into a single page of memory space, and the
>   * read and write handlers for config space must be aware of this mechanism.
>   * Similarly, device setup and teardown involves messages sent to and from
> @@ -109,33 +109,33 @@ enum pci_message_type {
>  	/*
>  	 * Version 1.1
>  	 */
> -	PCI_MESSAGE_BASE                = 0x42490000,
> -	PCI_BUS_RELATIONS               = PCI_MESSAGE_BASE + 0,
> -	PCI_QUERY_BUS_RELATIONS         = PCI_MESSAGE_BASE + 1,
> -	PCI_POWER_STATE_CHANGE          = PCI_MESSAGE_BASE + 4,
> +	PCI_MESSAGE_BASE		= 0x42490000,
> +	PCI_BUS_RELATIONS		= PCI_MESSAGE_BASE + 0,
> +	PCI_QUERY_BUS_RELATIONS		= PCI_MESSAGE_BASE + 1,
> +	PCI_POWER_STATE_CHANGE		= PCI_MESSAGE_BASE + 4,
>  	PCI_QUERY_RESOURCE_REQUIREMENTS = PCI_MESSAGE_BASE + 5,
> -	PCI_QUERY_RESOURCE_RESOURCES    = PCI_MESSAGE_BASE + 6,
> -	PCI_BUS_D0ENTRY                 = PCI_MESSAGE_BASE + 7,
> -	PCI_BUS_D0EXIT                  = PCI_MESSAGE_BASE + 8,
> -	PCI_READ_BLOCK                  = PCI_MESSAGE_BASE + 9,
> -	PCI_WRITE_BLOCK                 = PCI_MESSAGE_BASE + 0xA,
> -	PCI_EJECT                       = PCI_MESSAGE_BASE + 0xB,
> -	PCI_QUERY_STOP                  = PCI_MESSAGE_BASE + 0xC,
> -	PCI_REENABLE                    = PCI_MESSAGE_BASE + 0xD,
> -	PCI_QUERY_STOP_FAILED           = PCI_MESSAGE_BASE + 0xE,
> -	PCI_EJECTION_COMPLETE           = PCI_MESSAGE_BASE + 0xF,
> -	PCI_RESOURCES_ASSIGNED          = PCI_MESSAGE_BASE + 0x10,
> -	PCI_RESOURCES_RELEASED          = PCI_MESSAGE_BASE + 0x11,
> -	PCI_INVALIDATE_BLOCK            = PCI_MESSAGE_BASE + 0x12,
> -	PCI_QUERY_PROTOCOL_VERSION      = PCI_MESSAGE_BASE + 0x13,
> -	PCI_CREATE_INTERRUPT_MESSAGE    = PCI_MESSAGE_BASE + 0x14,
> -	PCI_DELETE_INTERRUPT_MESSAGE    = PCI_MESSAGE_BASE + 0x15,
> +	PCI_QUERY_RESOURCE_RESOURCES	= PCI_MESSAGE_BASE + 6,
> +	PCI_BUS_D0ENTRY			= PCI_MESSAGE_BASE + 7,
> +	PCI_BUS_D0EXIT			= PCI_MESSAGE_BASE + 8,
> +	PCI_READ_BLOCK			= PCI_MESSAGE_BASE + 9,
> +	PCI_WRITE_BLOCK			= PCI_MESSAGE_BASE + 0xA,
> +	PCI_EJECT			= PCI_MESSAGE_BASE + 0xB,
> +	PCI_QUERY_STOP			= PCI_MESSAGE_BASE + 0xC,
> +	PCI_REENABLE			= PCI_MESSAGE_BASE + 0xD,
> +	PCI_QUERY_STOP_FAILED		= PCI_MESSAGE_BASE + 0xE,
> +	PCI_EJECTION_COMPLETE		= PCI_MESSAGE_BASE + 0xF,
> +	PCI_RESOURCES_ASSIGNED		= PCI_MESSAGE_BASE + 0x10,
> +	PCI_RESOURCES_RELEASED		= PCI_MESSAGE_BASE + 0x11,
> +	PCI_INVALIDATE_BLOCK		= PCI_MESSAGE_BASE + 0x12,
> +	PCI_QUERY_PROTOCOL_VERSION	= PCI_MESSAGE_BASE + 0x13,
> +	PCI_CREATE_INTERRUPT_MESSAGE	= PCI_MESSAGE_BASE + 0x14,
> +	PCI_DELETE_INTERRUPT_MESSAGE	= PCI_MESSAGE_BASE + 0x15,
>  	PCI_RESOURCES_ASSIGNED2		= PCI_MESSAGE_BASE + 0x16,
>  	PCI_CREATE_INTERRUPT_MESSAGE2	= PCI_MESSAGE_BASE + 0x17,
>  	PCI_DELETE_INTERRUPT_MESSAGE2	= PCI_MESSAGE_BASE + 0x18, /* unused */
>  	PCI_BUS_RELATIONS2		= PCI_MESSAGE_BASE + 0x19,
> -	PCI_RESOURCES_ASSIGNED3         = PCI_MESSAGE_BASE + 0x1A,
> -	PCI_CREATE_INTERRUPT_MESSAGE3   = PCI_MESSAGE_BASE + 0x1B,
> +	PCI_RESOURCES_ASSIGNED3		= PCI_MESSAGE_BASE + 0x1A,
> +	PCI_CREATE_INTERRUPT_MESSAGE3	= PCI_MESSAGE_BASE + 0x1B,
>  	PCI_MESSAGE_MAXIMUM
>  };
>  
> @@ -1775,20 +1775,21 @@ static u32 hv_compose_msi_req_v1(
>   * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
>   * not irrelevant because Hyper-V chooses the physical CPU to handle the
>   * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> - * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> - * but assigning too many vPCI device interrupts to the same pCPU can cause a
> - * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> - * to spread out the pCPUs that it selects.
> + * hv_vmbus_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the
> + * guest, but assigning too many vPCI device interrupts to the same pCPU can
> + * cause a performance bottleneck. So we spread out the dummy vCPUs to influence
> + * Hyper-V to spread out the pCPUs that it selects.
>   *
>   * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
>   * to always return the same dummy vCPU, because a second call to
> - * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> - * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> - * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
> - * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> - * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> - * the same pCPU, even though the vCPUs will be spread out by later calls
> - * to hv_irq_unmask(), but that is the best we can do now.
> + * hv_vmbus_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to
> + * choose a new pCPU for the interrupt. But for the multi-MSI case, the second
> + * call to hv_vmbus_compose_msi_msg() exits without sending a message to the
> + * vPCI VSP, so the original dummy vCPU is used. This dummy vCPU must be
> + * round-robin'ed so that the pCPUs are spread out. All interrupts for a
> + * multi-MSI device end up using the same pCPU, even though the vCPUs will be
> + * spread out by later calls to hv_irq_unmask(), but that is the best we can do
> + * now.
>   *
>   * With Hyper-V in Nov 2022, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
>   * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
> @@ -1863,7 +1864,7 @@ static u32 hv_compose_msi_req_v3(
>  }
>  
>  /**
> - * hv_compose_msi_msg() - Supplies a valid MSI address/data
> + * hv_vmbus_compose_msi_msg() - Supplies a valid MSI address/data
>   * @data:	Everything about this MSI
>   * @msg:	Buffer that is filled in by this function
>   *
> @@ -1873,7 +1874,7 @@ static u32 hv_compose_msi_req_v3(
>   * response supplies a data value and address to which that data
>   * should be written to trigger that interrupt.
>   */
> -static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> +static void hv_vmbus_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  {
>  	struct hv_pcibus_device *hbus;
>  	struct vmbus_channel *channel;
> @@ -1955,7 +1956,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  			return;
>  		}
>  		/*
> -		 * The vector we select here is a dummy value.  The correct
> +		 * The vector we select here is a dummy value.	The correct
>  		 * value gets sent to the hypervisor in unmask().  This needs
>  		 * to be aligned with the count, and also not zero.  Multi-msi
>  		 * is powers of 2 up to 32, so 32 will always work here.
> @@ -2047,7 +2048,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>  
>  		/*
>  		 * Make sure that the ring buffer data structure doesn't get
> -		 * freed while we dereference the ring buffer pointer.  Test
> +		 * freed while we dereference the ring buffer pointer.	Test
>  		 * for the channel's onchannel_callback being NULL within a
>  		 * sched_lock critical section.  See also the inline comments
>  		 * in vmbus_reset_channel_cb().
> @@ -2147,7 +2148,7 @@ static const struct msi_parent_ops hv_pcie_msi_parent_ops = {
>  /* HW Interrupt Chip Descriptor */
>  static struct irq_chip hv_msi_irq_chip = {
>  	.name			= "Hyper-V PCIe MSI",
> -	.irq_compose_msi_msg	= hv_compose_msi_msg,
> +	.irq_compose_msi_msg	= hv_vmbus_compose_msi_msg,
>  	.irq_set_affinity	= irq_chip_set_affinity_parent,
>  	.irq_ack		= irq_chip_ack_parent,
>  	.irq_eoi		= irq_chip_eoi_parent,
> @@ -2159,8 +2160,8 @@ static int hv_pcie_domain_alloc(struct irq_domain *d, unsigned int virq, unsigne
>  			       void *arg)
>  {
>  	/*
> -	 * TODO: Allocating and populating struct tran_int_desc in hv_compose_msi_msg()
> -	 * should be moved here.
> +	 * TODO: Allocating and populating struct tran_int_desc in
> +	 *	 hv_vmbus_compose_msi_msg() should be moved here.
>  	 */
>  	int ret;
>  
> @@ -2227,7 +2228,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
>  /**
>   * get_bar_size() - Get the address space consumed by a BAR
>   * @bar_val:	Value that a BAR returned after -1 was written
> - *              to it.
> + *		to it.
>   *
>   * This function returns the size of the BAR, rounded up to 1
>   * page.  It has to be rounded up because the hypervisor's page
> @@ -2573,7 +2574,7 @@ static void q_resource_requirements(void *context, struct pci_response *resp,
>   * new_pcichild_device() - Create a new child device
>   * @hbus:	The internal struct tracking this root PCI bus.
>   * @desc:	The information supplied so far from the host
> - *              about the device.
> + *		about the device.
>   *
>   * This function creates the tracking structure for a new child
>   * device and kicks off the process of figuring out what it is.
> @@ -3100,7 +3101,7 @@ static void hv_pci_onchannelcallback(void *context)
>  			 * sure that the packet pointer is still valid during the call:
>  			 * here 'valid' means that there's a task still waiting for the
>  			 * completion, and that the packet data is still on the waiting
> -			 * task's stack.  Cf. hv_compose_msi_msg().
> +			 * task's stack.  Cf. hv_vmbus_compose_msi_msg().
>  			 */
>  			comp_packet->completion_func(comp_packet->compl_ctxt,
>  						     response,
> @@ -3417,7 +3418,7 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
>  	 * vmbus_allocate_mmio() gets used for allocating both device endpoint
>  	 * resource claims (those which cannot be overlapped) and the ranges
>  	 * which are valid for the children of this bus, which are intended
> -	 * to be overlapped by those children.  Set the flag on this claim
> +	 * to be overlapped by those children.	Set the flag on this claim
>  	 * meaning that this region can't be overlapped.
>  	 */
>  
> @@ -4066,7 +4067,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
>  		irq_data = irq_get_irq_data(entry->irq);
>  		if (WARN_ON_ONCE(!irq_data))
>  			return -EINVAL;
> -		hv_compose_msi_msg(irq_data, &entry->msg);
> +		hv_vmbus_compose_msi_msg(irq_data, &entry->msg);
>  	}
>  	return 0;
>  }
> @@ -4074,7 +4075,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
>  /*
>   * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
>   * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> - * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * doesn't trap and emulate the MMIO accesses, here hv_vmbus_compose_msi_msg()
>   * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
>   * Table entries.
>   */
> -- 
> 2.51.2.vfs.0.1
> 

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply

* Re: [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
From: Manivannan Sadhasivam @ 2026-01-28 14:36 UTC (permalink / raw)
  To: Mukesh R
  Cc: Stanislav Kinsburskii, linux-kernel, linux-hyperv,
	linux-arm-kernel, iommu, linux-pci, linux-arch, kys, haiyangz,
	wei.liu, decui, longli, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, hpa, joro, lpieralisi, kwilczynski, robh, bhelgaas,
	arnd, nunodasneves, mhklinux, romank
In-Reply-To: <a2e54fff-3cbb-e332-c35e-7520c36eceed@linux.microsoft.com>

On Fri, Jan 23, 2026 at 04:42:54PM -0800, Mukesh R wrote:
> On 1/20/26 14:22, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:25PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > 
> > > On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> > > interrupts, etc need a device id as a parameter. This device id refers
> > > to that specific device during the lifetime of passthru.
> > > 
> > > An L1VH VM only contains VMBus based devices. A device id for a VMBus
> > > device is slightly different in that it uses the hv_pcibus_device info
> > > for building it to make sure it matches exactly what the hypervisor
> > > expects. This VMBus based device id is needed when attaching devices in
> > > an L1VH based guest VM. Before building it, a check is done to make sure
> > > the device is a valid VMBus device.
> > > 
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > >   arch/x86/include/asm/mshyperv.h     |  2 ++
> > >   drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
> > >   2 files changed, 31 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > > index eef4c3a5ba28..0d7fdfb25e76 100644
> > > --- a/arch/x86/include/asm/mshyperv.h
> > > +++ b/arch/x86/include/asm/mshyperv.h
> > > @@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
> > >   static inline void hv_apic_init(void) {}
> > >   #endif
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > > +
> > >   struct irq_domain *hv_create_pci_msi_domain(void);
> > >   int hv_map_msi_interrupt(struct irq_data *data,
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index 8bc6a38c9b5a..40f0b06bb966 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
> > >   #define DELIVERY_MODE		APIC_DELIVERY_MODE_FIXED
> > >   #define HV_MSI_CHIP_FLAGS	MSI_CHIP_FLAG_SET_ACK
> > > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> > > +
> > 
> > Why not move this static function definition above the caller instead of
> > declaring the prototype?
> 
> Did you see the function implementation? It has other dependencies that
> are later, it would need code reorg.
> 
> Thanks,
> -Mukesh
> 
> 
> > >   static int hv_pci_irqchip_init(void)
> > >   {
> > >   	return 0;
> > > @@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> > >   #define hv_msi_prepare		pci_msi_prepare
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > > +{
> > > +	u64 u64val;
> > 
> > This variable is redundant.
> 
> Not really. It helps with debug by putting a quick print, and is
> harmless.
> 

Such a debug print does not exist now, so there is no need for the variable. Drop it.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply

* RE: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Michael Kelley @ 2026-01-28 15:53 UTC (permalink / raw)
  To: Mukesh R, Stanislav Kinsburskii
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <4bcd7b66-6e3b-8f53-b688-ce0272123839@linux.microsoft.com>

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, January 27, 2026 11:56 AM
> To: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Cc: kys@microsoft.com; haiyangz@microsoft.com; wei.liu@kernel.org;
> decui@microsoft.com; longli@microsoft.com; linux-hyperv@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
> 
> On 1/27/26 09:47, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 05:39:49PM -0800, Mukesh R wrote:
> >> On 1/26/26 16:21, Stanislav Kinsburskii wrote:
> >>> On Mon, Jan 26, 2026 at 03:07:18PM -0800, Mukesh R wrote:
> >>>> On 1/26/26 12:43, Stanislav Kinsburskii wrote:
> >>>>> On Mon, Jan 26, 2026 at 12:20:09PM -0800, Mukesh R wrote:
> >>>>>> On 1/25/26 14:39, Stanislav Kinsburskii wrote:
> >>>>>>> On Fri, Jan 23, 2026 at 04:16:33PM -0800, Mukesh R wrote:
> >>>>>>>> On 1/23/26 14:20, Stanislav Kinsburskii wrote:
> >>>>>>>>> The MSHV driver deposits kernel-allocated pages to the hypervisor during
> >>>>>>>>> runtime and never withdraws them. This creates a fundamental incompatibility
> >>>>>>>>> with KEXEC, as these deposited pages remain unavailable to the new kernel
> >>>>>>>>> loaded via KEXEC, leading to potential system crashes upon kernel accessing
> >>>>>>>>> hypervisor deposited pages.
> >>>>>>>>>
> >>>>>>>>> Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> >>>>>>>>> management is implemented.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> >>>>>>>>> ---
> >>>>>>>>>       drivers/hv/Kconfig |    1 +
> >>>>>>>>>       1 file changed, 1 insertion(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> >>>>>>>>> index 7937ac0cbd0f..cfd4501db0fa 100644
> >>>>>>>>> --- a/drivers/hv/Kconfig
> >>>>>>>>> +++ b/drivers/hv/Kconfig
> >>>>>>>>> @@ -74,6 +74,7 @@ config MSHV_ROOT
> >>>>>>>>>       	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> >>>>>>>>>       	# no particular order, making it impossible to reassemble larger pages
> >>>>>>>>>       	depends on PAGE_SIZE_4KB
> >>>>>>>>> +	depends on !KEXEC
> >>>>>>>>>       	select EVENTFD
> >>>>>>>>>       	select VIRT_XFER_TO_GUEST_WORK
> >>>>>>>>>       	select HMM_MIRROR
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Will this affect CRASH kexec? I see few CONFIG_CRASH_DUMP in kexec.c
> >>>>>>>> implying that crash dump might be involved. Or did you test kdump
> >>>>>>>> and it was fine?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yes, it will. Crash kexec depends on normal kexec functionality, so it
> >>>>>>> will be affected as well.
> >>>>>>
> >>>>>> So not sure I understand the reason for this patch. We can just block
> >>>>>> kexec if there are any VMs running, right? Doing this would mean any
> >>>>>> further development would be without a very important and major feature,
> >>>>>> right?
> >>>>>
> >>>>> This is an option. But until it's implemented and merged, a user mshv
> >>>>> driver gets into a situation where kexec is broken in a non-obvious way.
> >>>>> The system may crash at any time after kexec, depending on whether the
> >>>>> new kernel touches the pages deposited to hypervisor or not. This is a
> >>>>> bad user experience.
> >>>>
> >>>> I understand that. But with this we cannot collect core and debug any
> >>>> crashes. I was thinking there would be a quick way to prohibit kexec
> >>>> for update via notifier or some other quick hack. Did you already
> >>>> explore that and didn't find anything, hence this?
> >>>>
> >>>
> >>> This quick hack you mention isn't quick in the upstream kernel as there
> >>> is no hook to interrupt kexec process except the live update one.
> >>
> >> That's the one we want to interrupt and block right? crash kexec
> >> is ok and should be allowed. We can document we don't support kexec
> >> for update for now.
> >>
> >>> I sent an RFC for that one but given today's conversation details it
> >>> won't be accepted as is.
> >>
> >> Are you talking about this?
> >>
> >>          "mshv: Add kexec safety for deposited pages"
> >>
> >
> > Yes.
> >
> >>> Making mshv mutually exclusive with kexec is the only viable option for
> >>> now given time constraints.
> >>> It is intended to be replaced with proper page lifecycle management in
> >>> the future.
> >>
> >> Yeah, that could take a long time and imo we cannot just disable KEXEC
> >> completely. What we want is just block kexec for updates from some
> >> mshv file for now, we can print during boot that kexec for updates is
> >> not supported on mshv. Hope that makes sense.
> >>
> >
> > The trade-off here is between disabling kexec support and having the
> > kernel crash after kexec in a non-obvious way. This affects both regular
> > kexec and crash kexec.
> 
> crash kexec on baremetal is not affected, hence disabling that
> doesn't make sense as we can't debug crashes then on bm.
> 
> Let me think and explore a bit, and if I come up with something, I'll
> send a patch here. If nothing, then we can do this as last resort.
> 
> Thanks,
> -Mukesh

Maybe you've already looked at this, but there's a sysctl parameter
kernel.kexec_load_limit_reboot that prevents loading a kexec
kernel for reboot if the value is zero. Separately, there is
kernel.kexec_load_limit_panic that controls whether a kexec
kernel can be loaded for kdump purposes.

kernel.kexec_load_limit_reboot defaults to -1, which allows a kexec kernel
to be loaded for reboot an unlimited number of times. But the value
can be set to zero with this kernel boot line parameter:

sysctl.kernel.kexec_load_limit_reboot=0

Alternatively, the mshv driver initialization could add code along
the lines of process_sysctl_arg() to open
/proc/sys/kernel/kexec_load_limit_reboot and write a value of zero.
Then there's no dependency on setting the kernel boot line.

The downside to either method is that after Linux in the root partition
is up-and-running, it is possible to change the sysctl to a non-zero value,
and then load a kexec kernel for reboot. So this approach isn't absolute
protection against doing a kexec for reboot. But it makes it harder, and 
until there's a mechanism to reclaim the deposited pages, it might be
a viable compromise to allow kdump to still be used.

Just a thought ....

Michael

> 
> 
> > It's a pity we can't apply a quick hack to disable only regular kexec.
> > However, since crash kexec would hit the same issues, until we have a
> > proper state transition for deposited pages, the best workaround for now
> > is to reset the hypervisor state on every kexec, which needs design,
> > work, and testing.
> >
> > Disabling kexec is the only consistent way to handle this in the
> > upstream kernel at the moment.
> >
> > Thanks, Stanislav

^ permalink raw reply

* [PATCH 0/2] ARM64 support for doorbell and intercept SINTs
From: Anirudh Rayabharam @ 2026-01-28 16:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh

From: "Anirudh Rayabharam (Microsoft)" <anirudh@anirudhrb.com>

On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.

On arm64, the INTID for SINTs should be in the SGI or PPI range. The
hypervisor exposes a virtual device in the ACPI that reserves a
PPI for this use. Introduce a platform_driver that binds to this ACPI
device and obtains the interrupt vector that can be used for SINTs.
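
As a rough illustration of the range constraint, the sketch below
classifies an INTID using the basic GIC ranges (SGIs 0-15, PPIs 16-31).
The names classify_intid() and intid_usable_for_sint() are hypothetical
and not part of this series, and the extended PPI range of newer GIC
versions is deliberately ignored.

```c
#include <stdbool.h>

/* Basic GIC INTID classes; anything from 32 up is not per-CPU here. */
enum intid_kind { INTID_SGI, INTID_PPI, INTID_OTHER };

/* Classify an INTID using the basic ranges only (assumption). */
enum intid_kind classify_intid(unsigned int intid)
{
	if (intid <= 15)
		return INTID_SGI;	/* software generated interrupt */
	if (intid <= 31)
		return INTID_PPI;	/* private peripheral interrupt */
	return INTID_OTHER;		/* SPI and everything else */
}

/* A SINT needs a per-CPU interrupt, i.e. an SGI or a PPI. */
bool intid_usable_for_sint(unsigned int intid)
{
	return classify_intid(intid) != INTID_OTHER;
}
```

A driver could run such a check on the INTID read from ACPI before
requesting the per-CPU IRQ.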

Anirudh Rayabharam (Microsoft) (2):
  mshv: rename synic per-cpu init/cleanup functions
  mshv: add arm64 support for doorbell & intercept SINTs

 drivers/hv/mshv_root.h      |   6 +-
 drivers/hv/mshv_root_main.c |  15 +++-
 drivers/hv/mshv_synic.c     | 156 ++++++++++++++++++++++++++++++++++--
 3 files changed, 164 insertions(+), 13 deletions(-)

-- 
2.34.1


^ permalink raw reply

* [PATCH 1/2] mshv: rename synic per-cpu init/cleanup functions
From: Anirudh Rayabharam @ 2026-01-28 16:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh
In-Reply-To: <20260128160437.3342167-1-anirudh@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

Rename mshv_synic_init() to mshv_synic_cpu_init() and
mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
these functions handle per-cpu synic setup and teardown.

This prepares for a future patch that will introduce mshv_synic_init()
and mshv_synic_cleanup() for common, non per-cpu initialization.

No functional change.

Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_root.h      | 4 ++--
 drivers/hv/mshv_root_main.c | 4 ++--
 drivers/hv/mshv_synic.c     | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..c02513f75429 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -242,8 +242,8 @@ int mshv_register_doorbell(u64 partition_id, doorbell_cb_t doorbell_cb,
 void mshv_unregister_doorbell(u64 partition_id, int doorbell_portid);
 
 void mshv_isr(void);
-int mshv_synic_init(unsigned int cpu);
-int mshv_synic_cleanup(unsigned int cpu);
+int mshv_synic_cpu_init(unsigned int cpu);
+int mshv_synic_cpu_exit(unsigned int cpu);
 
 static inline bool mshv_partition_encrypted(struct mshv_partition *partition)
 {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 681b58154d5e..abb34b37d552 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2284,8 +2284,8 @@ static int __init mshv_parent_partition_init(void)
 	}
 
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
-				mshv_synic_init,
-				mshv_synic_cleanup);
+				mshv_synic_cpu_init,
+				mshv_synic_cpu_exit);
 	if (ret < 0) {
 		dev_err(dev, "Failed to setup cpu hotplug state: %i\n", ret);
 		goto free_synic_pages;
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f8b0337cdc82..ba89655b0910 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -446,7 +446,7 @@ void mshv_isr(void)
 	}
 }
 
-int mshv_synic_init(unsigned int cpu)
+int mshv_synic_cpu_init(unsigned int cpu)
 {
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
@@ -542,7 +542,7 @@ int mshv_synic_init(unsigned int cpu)
 	return -EFAULT;
 }
 
-int mshv_synic_cleanup(unsigned int cpu)
+int mshv_synic_cpu_exit(unsigned int cpu)
 {
 	union hv_synic_sint sint;
 	union hv_synic_simp simp;
-- 
2.34.1


^ permalink raw reply related

* [PATCH 2/2] mshv: add arm64 support for doorbell & intercept SINTs
From: Anirudh Rayabharam @ 2026-01-28 16:04 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel; +Cc: anirudh
In-Reply-To: <20260128160437.3342167-1-anirudh@anirudhrb.com>

From: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>

On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.

On arm64, the INTID for SINTs should be in the SGI or PPI range. The
hypervisor exposes a virtual device in the ACPI that reserves a
PPI for this use. Introduce a platform_driver that binds to this ACPI
device and obtains the interrupt vector that can be used for SINTs.

To better unify x86 and arm64 paths, introduce mshv_sint_irq_init() that
either registers the platform_driver and obtains the INTID (arm64) or
just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).

Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
 drivers/hv/mshv_root.h      |   2 +
 drivers/hv/mshv_root_main.c |  11 ++-
 drivers/hv/mshv_synic.c     | 152 ++++++++++++++++++++++++++++++++++--
 3 files changed, 158 insertions(+), 7 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index c02513f75429..c2d1e8d7452c 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -332,5 +332,7 @@ int mshv_region_get(struct mshv_mem_region *region);
 bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
+int mshv_synic_init(void);
+void mshv_synic_cleanup(void);
 
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index abb34b37d552..6c2d4a80dbe3 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2276,11 +2276,17 @@ static int __init mshv_parent_partition_init(void)
 			MSHV_HV_MAX_VERSION);
 	}
 
+	ret = mshv_synic_init();
+	if (ret) {
+		dev_err(dev, "Failed to initialize synic: %i\n", ret);
+		goto device_deregister;
+	}
+
 	mshv_root.synic_pages = alloc_percpu(struct hv_synic_pages);
 	if (!mshv_root.synic_pages) {
 		dev_err(dev, "Failed to allocate percpu synic page\n");
 		ret = -ENOMEM;
-		goto device_deregister;
+		goto synic_cleanup;
 	}
 
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mshv_synic",
@@ -2322,6 +2328,8 @@ static int __init mshv_parent_partition_init(void)
 	cpuhp_remove_state(mshv_cpuhp_online);
 free_synic_pages:
 	free_percpu(mshv_root.synic_pages);
+synic_cleanup:
+	mshv_synic_cleanup();
 device_deregister:
 	misc_deregister(&mshv_dev);
 	return ret;
@@ -2337,6 +2345,7 @@ static void __exit mshv_parent_partition_exit(void)
 		mshv_root_partition_exit();
 	cpuhp_remove_state(mshv_cpuhp_online);
 	free_percpu(mshv_root.synic_pages);
+	mshv_synic_cleanup();
 }
 
 module_init(mshv_parent_partition_init);
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index ba89655b0910..b7860a75b97e 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -10,13 +10,19 @@
 #include <linux/kernel.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
+#include <linux/interrupt.h>
 #include <linux/io.h>
 #include <linux/random.h>
 #include <asm/mshyperv.h>
+#include <linux/platform_device.h>
+#include <linux/acpi.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
 
+static int mshv_interrupt = -1;
+static int mshv_irq = -1;
+
 static u32 synic_event_ring_get_queued_port(u32 sint_index)
 {
 	struct hv_synic_event_ring_page **event_ring_page;
@@ -446,14 +452,144 @@ void mshv_isr(void)
 	}
 }
 
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+#ifdef CONFIG_ACPI
+static long __percpu *mshv_evt;
+
+static acpi_status mshv_walk_resources(struct acpi_resource *res, void *ctx)
+{
+	struct resource r;
+
+	switch (res->type) {
+	case ACPI_RESOURCE_TYPE_EXTENDED_IRQ:
+		if (!acpi_dev_resource_interrupt(res, 0, &r)) {
+			pr_err("Unable to parse MSHV ACPI interrupt\n");
+			return AE_ERROR;
+		}
+		/* ARM64 INTID */
+		mshv_interrupt = res->data.extended_irq.interrupts[0];
+		/* Linux IRQ number */
+		mshv_irq = r.start;
+		pr_info("MSHV SINT INTID %d, IRQ %d\n",
+			mshv_interrupt, mshv_irq);
+		return AE_OK;
+	default:
+		/* Unused resource type */
+		return AE_OK;
+	}
+
+	return AE_OK;
+}
+
+static irqreturn_t mshv_percpu_isr(int irq, void *dev_id)
+{
+	mshv_isr();
+	add_interrupt_randomness(irq);
+	return IRQ_HANDLED;
+}
+
+static int mshv_sint_probe(struct platform_device *pdev)
+{
+	acpi_status result;
+	int ret = 0;
+	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
+
+	result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
+					mshv_walk_resources, NULL);
+
+	if (ACPI_FAILURE(result)) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	mshv_evt = alloc_percpu(long);
+	if (!mshv_evt) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = request_percpu_irq(mshv_irq, mshv_percpu_isr, "MSHV", mshv_evt);
+out:
+	return ret;
+}
+
+static void mshv_sint_remove(struct platform_device *pdev)
+{
+	free_percpu_irq(mshv_irq, mshv_evt);
+	free_percpu(mshv_evt);
+}
+#else
+static int mshv_sint_probe(struct platform_device *pdev)
+{
+	return -ENODEV;
+}
+
+static void mshv_sint_remove(struct platform_device *pdev)
+{
+	return;
+}
+#endif
+
+
+static const __maybe_unused struct acpi_device_id mshv_sint_device_ids[] = {
+	{"MSFT1003", 0},
+	{"", 0},
+};
+
+static struct platform_driver mshv_sint_drv = {
+	.probe = mshv_sint_probe,
+	.remove = mshv_sint_remove,
+	.driver = {
+		.name = "mshv_sint",
+		.acpi_match_table = ACPI_PTR(mshv_sint_device_ids),
+		.probe_type = PROBE_FORCE_SYNCHRONOUS,
+	},
+};
+#endif /* HYPERVISOR_CALLBACK_VECTOR */
+
+int mshv_synic_init(void)
+{
+#ifdef HYPERVISOR_CALLBACK_VECTOR
+	mshv_interrupt = HYPERVISOR_CALLBACK_VECTOR;
+	mshv_irq = -1;
+	return 0;
+#else
+	int ret;
+
+	if (acpi_disabled)
+		return -ENODEV;
+
+	ret = platform_driver_register(&mshv_sint_drv);
+	if (ret)
+		return ret;
+
+	if (mshv_interrupt == -1 || mshv_irq == -1) {
+		ret = -ENODEV;
+		goto out_unregister;
+	}
+
+	return 0;
+
+out_unregister:
+	platform_driver_unregister(&mshv_sint_drv);
+	return ret;
+#endif
+}
+
+void mshv_synic_cleanup(void)
+{
+#ifndef HYPERVISOR_CALLBACK_VECTOR
+	if (!acpi_disabled)
+		platform_driver_unregister(&mshv_sint_drv);
+#endif
+}
+
 int mshv_synic_cpu_init(unsigned int cpu)
 {
 	union hv_synic_simp simp;
 	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
-#ifdef HYPERVISOR_CALLBACK_VECTOR
 	union hv_synic_sint sint;
-#endif
 	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
@@ -496,10 +632,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
 
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
-#ifdef HYPERVISOR_CALLBACK_VECTOR
+	if (mshv_irq != -1)
+		enable_percpu_irq(mshv_irq, 0);
+
 	/* Enable intercepts */
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = mshv_interrupt;
 	sint.masked = false;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX,
@@ -507,13 +645,12 @@ int mshv_synic_cpu_init(unsigned int cpu)
 
 	/* Doorbell SINT */
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = mshv_interrupt;
 	sint.masked = false;
 	sint.as_intercept = 1;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
-#endif
 
 	/* Enable global synic bit */
 	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
@@ -568,6 +705,9 @@ int mshv_synic_cpu_exit(unsigned int cpu)
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
 
+	if (mshv_irq != -1)
+		disable_percpu_irq(mshv_irq);
+
 	/* Disable Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH] mshv: Make MSHV mutually exclusive with KEXEC
From: Anirudh Rayabharam @ 2026-01-28 16:16 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <aXfStKqKiSSHEmXj@skinsburskii.localdomain>

On Mon, Jan 26, 2026 at 12:46:44PM -0800, Stanislav Kinsburskii wrote:
> On Tue, Jan 27, 2026 at 12:19:24AM +0530, Anirudh Rayabharam wrote:
> > On Fri, Jan 23, 2026 at 10:20:53PM +0000, Stanislav Kinsburskii wrote:
> > > The MSHV driver deposits kernel-allocated pages to the hypervisor during
> > > runtime and never withdraws them. This creates a fundamental incompatibility
> > > with KEXEC, as these deposited pages remain unavailable to the new kernel
> > > loaded via KEXEC, leading to potential system crashes upon kernel accessing
> > > hypervisor deposited pages.
> > > 
> > > Make MSHV mutually exclusive with KEXEC until proper page lifecycle
> > > management is implemented.
> > 
> > Someone might want to stop all guest VMs and do a kexec. Which is valid
> > and would work without any issue for L1VH.
> > 
> 
> No, it won't work and hypervisor deposited pages won't be withdrawn.

All pages that were deposited in the context of a guest partition (i.e.
with the guest partition ID), would be withdrawn when you kill the VMs,
right? What other deposited pages would be left?

Thanks,
Anirudh.

> Also, kernel consistency must not depend on user space behavior.
> 
> > Also, I don't think it is reasonable at all that someone needs to
> > disable basic kernel functionality such as kexec in order to use our
> > driver.
> > 
> 
> It's a temporary measure until proper page lifecycle management is
> supported in the driver.
> Mutual exclusion of the driver and kexec is a given and thus should be
> explicitly stated in the Kconfig.
> 
> Thanks,
> Stanislav
> 
> > Thanks,
> > Anirudh.
> > 
> > > 
> > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > ---
> > >  drivers/hv/Kconfig |    1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > > index 7937ac0cbd0f..cfd4501db0fa 100644
> > > --- a/drivers/hv/Kconfig
> > > +++ b/drivers/hv/Kconfig
> > > @@ -74,6 +74,7 @@ config MSHV_ROOT
> > >  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > >  	# no particular order, making it impossible to reassemble larger pages
> > >  	depends on PAGE_SIZE_4KB
> > > +	depends on !KEXEC
> > >  	select EVENTFD
> > >  	select VIRT_XFER_TO_GUEST_WORK
> > >  	select HMM_MIRROR
> > > 
> > > 

^ permalink raw reply

* [PATCH 0/2] kexec: Refuse kernel-unsafe Microsoft Hypervisor transitions
From: Stanislav Kinsburskii @ 2026-01-28 17:41 UTC (permalink / raw)
  To: rppt, akpm, bhe, kys, haiyangz, wei.liu, decui, longli
  Cc: kexec, linux-hyperv, linux-kernel

When Microsoft Hypervisor is active, the kernel may have memory “deposited”
to the hypervisor. Those pages are no longer safe for the kernel to touch,
and attempting to access them can trigger a GPF. The problem becomes acute
with kexec: the “deposited pages” state does not survive the transition,
and the next kernel has no reliable way to know which pages are still
owned/managed by the hypervisor.

Until there is a proper handoff mechanism to preserve that state across
kexec, the only safe behavior is to refuse kexec whenever there is shared
hypervisor state that cannot survive the transition—most notably deposited
pages, and also cases where VMs are still running.

This series adds the missing kexec integration point needed by MSHV: a
callback at the kexec “freeze” stage so the driver can make the transition
safe (or block it). With this hook, MSHV can refuse kexec while VMs are
running, attempt to withdraw deposited pages when possible (e.g. L1VH
host), and fail the transition if any pages remain deposited.

---

Stanislav Kinsburskii (2):
      kexec: Add permission notifier chain for kexec operations
      mshv: Add kexec blocking support


 drivers/hv/Makefile            |    1 +
 drivers/hv/hv_proc.c           |    4 ++
 drivers/hv/mshv_kexec.c        |   66 ++++++++++++++++++++++++++++++++++++++++
 drivers/hv/mshv_root.h         |   14 ++++++++
 drivers/hv/mshv_root_hv_call.c |    2 +
 drivers/hv/mshv_root_main.c    |    7 ++++
 include/linux/kexec.h          |    6 ++++
 kernel/kexec_core.c            |   24 +++++++++++++++
 8 files changed, 124 insertions(+)
 create mode 100644 drivers/hv/mshv_kexec.c


^ permalink raw reply

* [PATCH 1/2] kexec: Add permission notifier chain for kexec operations
From: Stanislav Kinsburskii @ 2026-01-28 17:42 UTC (permalink / raw)
  To: rppt, akpm, bhe, kys, haiyangz, wei.liu, decui, longli
  Cc: kexec, linux-hyperv, linux-kernel
In-Reply-To: <176962149772.85424.9395505307198316093.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Add a blocking notifier chain to allow subsystems to be notified
before kexec execution. This enables modules to perform necessary
cleanup or validation before the system transitions to a new kernel, or
to block kexec if the transition is not possible under current conditions.
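
The veto idea can be sketched in plain user-space C as follows. This is
only a model under simplifying assumptions: all names are hypothetical,
callbacks return plain negative errno values, and the kernel's own
notifier return-code conventions and locking are not reproduced.

```c
#include <errno.h>
#include <stddef.h>

/* A blocker returns 0 to allow kexec or a negative errno to veto it. */
typedef int (*kexec_blocker_fn)(void *data);

struct kexec_blocker {
	kexec_blocker_fn fn;
	void *data;
	struct kexec_blocker *next;
};

static struct kexec_blocker *blocker_list;

/* Add a blocker to the head of the (unlocked, model-only) list. */
void blocker_register(struct kexec_blocker *b)
{
	b->next = blocker_list;
	blocker_list = b;
}

/* Walk the list and stop at the first blocker that objects. */
int check_blockers(void)
{
	struct kexec_blocker *b;

	for (b = blocker_list; b; b = b->next) {
		int err = b->fn(b->data);

		if (err)
			return err;
	}
	return 0;
}

/* Example blocker that never objects. */
int blocker_allow(void *data)
{
	(void)data;
	return 0;
}

/* Example blocker that vetoes while the pointed-to counter is nonzero. */
int blocker_busy_if_nonzero(void *data)
{
	return *(int *)data ? -EBUSY : 0;
}
```

In this model, the kexec path would call check_blockers() once and
abort the transition on any nonzero result.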

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 include/linux/kexec.h |    6 ++++++
 kernel/kexec_core.c   |   24 ++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ff7e231b0485..311037d30f9e 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -35,6 +35,7 @@ extern note_buf_t __percpu *crash_notes;
 #include <linux/ioport.h>
 #include <linux/module.h>
 #include <linux/highmem.h>
+#include <linux/notifier.h>
 #include <asm/kexec.h>
 #include <linux/crash_core.h>
 
@@ -532,10 +533,13 @@ extern bool kexec_file_dbg_print;
 
 extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size);
 extern void kimage_unmap_segment(void *buffer);
+extern int kexec_block_notifier_register(struct notifier_block *nb);
+extern int kexec_block_notifier_unregister(struct notifier_block *nb);
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
 struct kimage;
+struct notifier_block;
 static inline void __crash_kexec(struct pt_regs *regs) { }
 static inline void crash_kexec(struct pt_regs *regs) { }
 static inline int kexec_should_crash(struct task_struct *p) { return 0; }
@@ -543,6 +547,8 @@ static inline int kexec_crash_loaded(void) { return 0; }
 static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size)
 { return NULL; }
 static inline void kimage_unmap_segment(void *buffer) { }
+static inline int kexec_block_notifier_register(struct notifier_block *nb) { return 0; }
+static inline int kexec_block_notifier_unregister(struct notifier_block *nb) { return 0; }
 #define kexec_in_progress false
 #endif /* CONFIG_KEXEC_CORE */
 
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 0f92acdd354d..1e86a6f175f0 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -57,6 +57,20 @@ bool kexec_in_progress = false;
 
 bool kexec_file_dbg_print;
 
+static BLOCKING_NOTIFIER_HEAD(kexec_block_list);
+
+int kexec_block_notifier_register(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&kexec_block_list, nb);
+}
+EXPORT_SYMBOL_GPL(kexec_block_notifier_register);
+
+int kexec_block_notifier_unregister(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&kexec_block_list, nb);
+}
+EXPORT_SYMBOL_GPL(kexec_block_notifier_unregister);
+
 /*
  * When kexec transitions to the new kernel there is a one-to-one
  * mapping between physical and virtual addresses.  On processors
@@ -1124,6 +1138,12 @@ bool kexec_load_permitted(int kexec_image_type)
 	return true;
 }
 
+static int kexec_check_blockers(void)
+{
+	/* Notify subsystems of impending kexec */
+	return blocking_notifier_call_chain(&kexec_block_list, 0, NULL);
+}
+
 /*
  * Move into place and start executing a preloaded standalone
  * executable.  If nothing was preloaded return an error.
@@ -1139,6 +1159,10 @@ int kernel_kexec(void)
 		goto Unlock;
 	}
 
+	error = kexec_check_blockers();
+	if (error)
+		goto Unlock;
+
 	error = liveupdate_reboot();
 	if (error)
 		goto Unlock;



^ permalink raw reply related

* [PATCH 2/2] mshv: Add kexec blocking support
From: Stanislav Kinsburskii @ 2026-01-28 17:42 UTC (permalink / raw)
  To: rppt, akpm, bhe, kys, haiyangz, wei.liu, decui, longli
  Cc: kexec, linux-hyperv, linux-kernel
In-Reply-To: <176962149772.85424.9395505307198316093.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

Add kexec notifier to prevent kexec when VMs are active or memory
is deposited. The notifier blocks kexec operations if:
- Active VMs exist in the partition table
- Pages are still deposited to the hypervisor

The kernel cannot access hypervisor deposited pages: any access
triggers a GPF. Until the deposited page state can be handed over
to the next kernel, kexec must be blocked if there is any shared
state between kernel and hypervisor.

For L1 host virtualization, attempt to withdraw all deposited memory before
allowing kexec to proceed. If withdrawal fails or pages remain deposited,
block the kexec operation.
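
The decision order described above can be modeled in user space roughly
as below. The struct and function names are hypothetical, and the
withdrawal is simulated with simple counters instead of a hypercall, so
this is a sketch of the logic rather than the driver code.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified model of the state the notifier looks at (assumption). */
struct mshv_model {
	unsigned int active_vms;	/* entries in the partition table */
	long pages_deposited;		/* pages currently deposited */
	bool l1vh;			/* running as an L1VH parent */
	long withdrawable;		/* pages the hypervisor would return */
	bool withdraw_fails;		/* simulate a failing withdrawal */
};

/* Simulated withdrawal: give back as many pages as the model allows. */
int model_withdraw_all(struct mshv_model *m)
{
	long n;

	if (m->withdraw_fails)
		return -EIO;

	n = m->withdrawable < m->pages_deposited ?
	    m->withdrawable : m->pages_deposited;
	m->pages_deposited -= n;
	m->withdrawable -= n;
	return 0;
}

/* Returns 0 if kexec may proceed, a negative errno otherwise. */
int model_kexec_check(struct mshv_model *m)
{
	int err;

	if (m->active_vms)
		return -EBUSY;		/* VMs are still active */

	if (m->l1vh) {
		err = model_withdraw_all(m);
		if (err)
			return err;	/* withdrawal itself failed */
	}

	if (m->pages_deposited)
		return -EBUSY;		/* pages remain deposited */

	return 0;
}
```

Note that in this model a non-L1VH host with any deposited pages is
always refused, since no withdrawal is attempted for it.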

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Makefile            |    1 +
 drivers/hv/hv_proc.c           |    4 ++
 drivers/hv/mshv_kexec.c        |   66 ++++++++++++++++++++++++++++++++++++++++
 drivers/hv/mshv_root.h         |   14 ++++++++
 drivers/hv/mshv_root_hv_call.c |    2 +
 drivers/hv/mshv_root_main.c    |    7 ++++
 6 files changed, 94 insertions(+)
 create mode 100644 drivers/hv/mshv_kexec.c

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index a49f93c2d245..bb72be5cc525 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -15,6 +15,7 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
 	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+mshv_root-$(CONFIG_KEXEC) += mshv_kexec.o
 mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 89870c1b0087..39bbbedb0340 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -15,6 +15,8 @@
  */
 #define HV_DEPOSIT_MAX (HV_HYP_PAGE_SIZE / sizeof(u64) - 1)
 
+atomic_t hv_pages_deposited;
+
 /* Deposits exact number of pages. Must be called with interrupts enabled.  */
 int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 {
@@ -93,6 +95,8 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 		goto err_free_allocations;
 	}
 
+	atomic_add(page_count, &hv_pages_deposited);
+
 	ret = 0;
 	goto free_buf;
 
diff --git a/drivers/hv/mshv_kexec.c b/drivers/hv/mshv_kexec.c
new file mode 100644
index 000000000000..5222b2e4ff97
--- /dev/null
+++ b/drivers/hv/mshv_kexec.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Live update orchestration management for mshv_root module.
+ *
+ * Author: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
+ */
+
+#include <linux/kexec.h>
+#include <linux/notifier.h>
+#include <asm/mshyperv.h>
+#include "mshv_root.h"
+
+static BLOCKING_NOTIFIER_HEAD(overlay_notify_chain);
+
+static int mshv_block_kexec_notify(struct notifier_block *nb,
+				   unsigned long action, void *arg)
+{
+	if (!hash_empty(mshv_root.pt_htable)) {
+		pr_warn("mshv: Cannot perform kexec while VMs are active\n");
+		return -EBUSY;
+	}
+
+	if (hv_l1vh_partition()) {
+		int err;
+
+		/* Attempt to withdraw all the deposited pages */
+		err = hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE,
+					      hv_current_partition_id);
+		if (err) {
+			pr_err("mshv: Failed to withdraw memory from L1 virtualization: %d\n",
+			       err);
+			return err;
+		}
+	}
+
+	if (atomic_read(&hv_pages_deposited)) {
+		pr_warn("mshv: Cannot perform kexec while pages are deposited\n");
+		return -EBUSY;
+	}
+	return 0;
+}
+
+static struct notifier_block mshv_kexec_notifier = {
+	.notifier_call = mshv_block_kexec_notify,
+};
+
+int __init mshv_kexec_init(void)
+{
+	int err;
+
+	err = kexec_block_notifier_register(&mshv_kexec_notifier);
+	if (err) {
+		pr_err("mshv: Could not register kexec notifier: %pe\n",
+		       ERR_PTR(err));
+		return err;
+	}
+
+	return 0;
+}
+
+void __exit mshv_kexec_exit(void)
+{
+	(void)kexec_block_notifier_unregister(&mshv_kexec_notifier);
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..311f76262d10 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -17,6 +17,7 @@
 #include <linux/build_bug.h>
 #include <linux/mmu_notifier.h>
 #include <uapi/linux/mshv.h>
+#include <hyperv/hvhdk.h>
 
 /*
  * Hypervisor must be between these version numbers (inclusive)
@@ -319,6 +320,7 @@ int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 a
 extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
+extern atomic_t hv_pages_deposited;
 
 struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
 					   u64 uaddr, u32 flags);
@@ -333,4 +335,16 @@ bool mshv_region_handle_gfn_fault(struct mshv_mem_region *region, u64 gfn);
 void mshv_region_movable_fini(struct mshv_mem_region *region);
 bool mshv_region_movable_init(struct mshv_mem_region *region);
 
+#if IS_ENABLED(CONFIG_KEXEC)
+int mshv_kexec_init(void);
+void mshv_kexec_exit(void);
+#else
+static inline int mshv_kexec_init(void)
+{
+	return 0;
+}
+
+static inline void mshv_kexec_exit(void) { }
+#endif
+
 #endif /* _MSHV_ROOT_H_ */
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 06f2bac8039d..4203af5190ee 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -73,6 +73,8 @@ int hv_call_withdraw_memory(u64 count, int node, u64 partition_id)
 		for (i = 0; i < completed; i++)
 			__free_page(pfn_to_page(output_page->gpa_page_list[i]));
 
+		atomic_sub(completed, &hv_pages_deposited);
+
 		if (!hv_result_success(status)) {
 			if (hv_result(status) == HV_STATUS_NO_RESOURCES)
 				status = HV_STATUS_SUCCESS;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 5fc572e31cd7..d55aa69d130c 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2330,6 +2330,10 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto deinit_root_scheduler;
 
+	ret = mshv_kexec_init();
+	if (ret)
+		goto deinit_irqfd_wq;
+
 	spin_lock_init(&mshv_root.pt_ht_lock);
 	hash_init(mshv_root.pt_htable);
 
@@ -2337,6 +2341,8 @@ static int __init mshv_parent_partition_init(void)
 
 	return 0;
 
+deinit_irqfd_wq:
+	mshv_irqfd_wq_cleanup();
 deinit_root_scheduler:
 	root_scheduler_deinit();
 exit_partition:
@@ -2356,6 +2362,7 @@ static void __exit mshv_parent_partition_exit(void)
 	hv_setup_mshv_handler(NULL);
 	mshv_port_table_fini();
 	misc_deregister(&mshv_dev);
+	mshv_kexec_exit();
 	mshv_irqfd_wq_cleanup();
 	root_scheduler_deinit();
 	if (hv_root_partition())



^ permalink raw reply related

* [PATCH v6 0/7] mshv: Debugfs interface for mshv_root
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves

Expose hypervisor, logical processor, partition, and virtual processor
statistics via debugfs. These are provided by mapping 'stats' pages via
hypercall.

Patch #1: Update hv_call_map_stats_page() to return success when
          HV_STATS_AREA_PARENT is unavailable, which is the case on some
          hypervisor versions, where it can fall back to HV_STATS_AREA_SELF
Patch #2: Use struct hv_stats_page pointers instead of void *
Patch #3: Make mshv_vp_stats_map/unmap() more flexible to use with debugfs
          code
Patch #4: Always map vp stats page regardless of scheduler, to reuse in
          debugfs
Patch #5: Change to hv_stats_page definition and
          VpRootDispatchThreadBlocked
Patch #6: Introduce the definitions needed for the various stats pages
Patch #7: Add mshv_debugfs.c, and integrate it with the mshv_root driver to
          expose the partition and VP stats.

---
Changes in v6:
- Fix whitespace and other checkpatch issues [Michael]

Changes in v5:
- Rename hv_counters.c to mshv_debugfs_counters.c [Michael]
- Clarify unusual inclusion of mshv_debugfs_counters.c with comment. After
  discussion it is still included directly to keep things simple. Including
  arrays with unspecified size via a header means sizeof() cannot be used on
  the array.
- Error if mshv_debugfs_counters.c is included elsewhere than mshv_debugfs.c
- Use array index as stats page index to save space [Stanislav]
- Enforce HV_STATS_AREA_PARENT and SELF fit in NUM_STATS_AREAS with
  static_assert and clarify with comment [Michael]
- Return to using lp count from hv stats page for mshv_lps_count [Michael]
- Use nr_cpu_ids instead of num_possible_cpus() [Michael]
- Set mshv_lps_stats[idx] and the array itself to NULL on unmap and cleanup
  [Michael]
- Rename HvLogicalProcessors and VpRootDispatchThreadBlocked to Linux style
- Translate Linux cpu index to vp index via hv_vp_index on partition destroy
  [Michael]
- Minor formatting cleanups [Michael]

Changes in v4:
- Put the counters definitions in static arrays in hv_counters.c, instead of
  as enums in hvhdk.h [Michael]
- Due to the above, add an additional patch (#5) to simplify hv_stats_page,
  and retain the enum definition at the top of mshv_root_main.c for use with
  VpRootDispatchThreadBlocked. That is the only remaining use of the counter
  enum.
- Due to the above, use num_present_cpus() as the number of LPs to map stats
  pages for - this number shouldn't change at runtime because the hypervisor
  doesn't support hotplug for root partition.

Changes in v3:
- Add 3 small refactor/cleanup patches (patches 2,3,4) from Stanislav. These
  simplify some of the debugfs code, and fix issues with mapping VP stats on
  L1VH.
- Fix cleanup of parent stats dentries on module removal (via squashing some
  internal patches into patch #6) [Praveen]
- Remove unused goto label [Stanislav, kernel bot]
- Use struct hv_stats_page * instead of void * in mshv_debugfs.c [Stanislav]
- Remove some redundant variables [Stanislav]
- Rename debugfs dentry fields for brevity [Stanislav]
- Use ERR_CAST() for the dentry error pointer returned from
  lp_debugfs_stats_create() [Stanislav]
- Fix leak of pages allocated for lp stats mappings by storing them in an array
  [Michael]
- Add comments to clarify PARENT vs SELF usage and edge cases [Michael]
- Add VpLoadAvg for x86 and print the stat [Michael]
- Add NUM_STATS_AREAS for array sizing in mshv_debugfs.c [Michael]

Changes in v2:
- Remove unnecessary pr_debug_once() in patch 1 [Stanislav Kinsburskii]
- CONFIG_X86 -> CONFIG_X86_64 in patch 2 [Stanislav Kinsburskii]

---
Nuno Das Neves (3):
  mshv: Update hv_stats_page definitions
  mshv: Add data for printing stats page counters
  mshv: Add debugfs to view hypervisor statistics

Purna Pavan Chandra Aekkaladevi (1):
  mshv: Ignore second stats page map result failure

Stanislav Kinsburskii (3):
  mshv: Use typed hv_stats_page pointers
  mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
  mshv: Always map child vp stats pages regardless of scheduler type

 drivers/hv/Makefile                |   1 +
 drivers/hv/mshv_debugfs.c          | 726 +++++++++++++++++++++++++++++
 drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++
 drivers/hv/mshv_root.h             |  49 +-
 drivers/hv/mshv_root_hv_call.c     |  64 ++-
 drivers/hv/mshv_root_main.c        | 140 +++---
 include/hyperv/hvhdk.h             |   7 +
 7 files changed, 1412 insertions(+), 65 deletions(-)
 create mode 100644 drivers/hv/mshv_debugfs.c
 create mode 100644 drivers/hv/mshv_debugfs_counters.c

-- 
2.34.1



* [PATCH v6 1/7] mshv: Ignore second stats page map result failure
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

From: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>

Older versions of the hypervisor do not have a concept of separate SELF
and PARENT stats areas. In this case, mapping the HV_STATS_AREA_SELF page
is sufficient - it's the only page and it contains all available stats.

Mapping HV_STATS_AREA_PARENT returns HV_STATUS_INVALID_PARAMETER which
currently causes module init to fail on older hypervisor versions.

Detect this case and gracefully fall back to populating
stats_pages[HV_STATS_AREA_PARENT] with the already-mapped SELF page.

Add comments to clarify the behavior, including a clarification of why
this isn't needed for hv_call_map_stats_page2() which always supports
PARENT and SELF areas.
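
The fallback described above can be sketched in user-space C. This is only
an illustration under stated assumptions, not the kernel code:
fake_map_stats_page(), map_both_areas(), and the hv_has_parent_area flag are
hypothetical stand-ins for the hypercall wrapper and for the hypervisor's
capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { AREA_SELF = 0, AREA_PARENT = 1, NUM_AREAS = 2 };

struct stats_page { unsigned long long data[512]; };

/* Hypothetical stand-ins for pages the hypervisor would hand back. */
static struct stats_page self_page, parent_page;

/*
 * Mock of the map call: on an "old" hypervisor a PARENT request is
 * rejected, which is reported as success with a NULL address.
 */
static int fake_map_stats_page(int area, bool hv_has_parent_area,
			       struct stats_page **addr)
{
	assert(area == AREA_SELF || area == AREA_PARENT);

	if (area == AREA_PARENT && !hv_has_parent_area) {
		*addr = NULL;
		return 0;
	}

	*addr = (area == AREA_SELF) ? &self_page : &parent_page;
	return 0;
}

/* Caller side: alias PARENT to SELF when PARENT came back NULL. */
static int map_both_areas(bool hv_has_parent_area,
			  struct stats_page *pages[NUM_AREAS])
{
	int err;

	err = fake_map_stats_page(AREA_SELF, hv_has_parent_area,
				  &pages[AREA_SELF]);
	if (err)
		return err;

	err = fake_map_stats_page(AREA_PARENT, hv_has_parent_area,
				  &pages[AREA_PARENT]);
	if (err)
		return err;

	if (!pages[AREA_PARENT])
		pages[AREA_PARENT] = pages[AREA_SELF];

	return 0;
}
```

With this shape, readers of the stats can always dereference both array
slots without checking which hypervisor version they are running on.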

Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/mshv_root_hv_call.c | 52 +++++++++++++++++++++++++++++++---
 drivers/hv/mshv_root_main.c    |  3 ++
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 598eaff4ff29..1f93b94d7580 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -813,6 +813,13 @@ hv_call_notify_port_ring_empty(u32 sint_index)
 	return hv_result_to_errno(status);
 }
 
+/*
+ * Equivalent of hv_call_map_stats_page() for cases when the caller provides
+ * the map location.
+ *
+ * NOTE: This is a newer hypercall that always supports SELF and PARENT stats
+ * areas, unlike hv_call_map_stats_page().
+ */
 static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 				   const union hv_stats_object_identity *identity,
 				   u64 map_location)
@@ -855,6 +862,34 @@ static int hv_call_map_stats_page2(enum hv_stats_object_type type,
 	return ret;
 }
 
+static int
+hv_stats_get_area_type(enum hv_stats_object_type type,
+		       const union hv_stats_object_identity *identity)
+{
+	switch (type) {
+	case HV_STATS_OBJECT_HYPERVISOR:
+		return identity->hv.stats_area_type;
+	case HV_STATS_OBJECT_LOGICAL_PROCESSOR:
+		return identity->lp.stats_area_type;
+	case HV_STATS_OBJECT_PARTITION:
+		return identity->partition.stats_area_type;
+	case HV_STATS_OBJECT_VP:
+		return identity->vp.stats_area_type;
+	}
+
+	return -EINVAL;
+}
+
+/*
+ * Map a stats page, where the page location is provided by the hypervisor.
+ *
+ * NOTE: The concept of separate SELF and PARENT stats areas does not exist on
+ * older hypervisor versions. All the available stats information can be found
+ * on the SELF page. When attempting to map the PARENT area on a hypervisor
+ * that doesn't support it, return "success" but with a NULL address. The
+ * caller should check for this case and instead fallback to the SELF area
+ * alone.
+ */
 static int hv_call_map_stats_page(enum hv_stats_object_type type,
 				  const union hv_stats_object_identity *identity,
 				  void **addr)
@@ -863,7 +898,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 	struct hv_input_map_stats_page *input;
 	struct hv_output_map_stats_page *output;
 	u64 status, pfn;
-	int ret = 0;
+	int hv_status, ret = 0;
 
 	do {
 		local_irq_save(flags);
@@ -878,11 +913,20 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 		pfn = output->map_location;
 
 		local_irq_restore(flags);
-		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY) {
-			ret = hv_result_to_errno(status);
+
+		hv_status = hv_result(status);
+		if (hv_status != HV_STATUS_INSUFFICIENT_MEMORY) {
 			if (hv_result_success(status))
 				break;
-			return ret;
+
+			if (hv_stats_get_area_type(type, identity) == HV_STATS_AREA_PARENT &&
+			    hv_status == HV_STATUS_INVALID_PARAMETER) {
+				*addr = NULL;
+				return 0;
+			}
+
+			hv_status_debug(status, "\n");
+			return hv_result_to_errno(status);
 		}
 
 		ret = hv_call_deposit_pages(NUMA_NO_NODE,
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..1777778f84b8 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -992,6 +992,9 @@ static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
 	if (err)
 		goto unmap_self;
 
+	if (!stats_pages[HV_STATS_AREA_PARENT])
+		stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+
 	return 0;
 
 unmap_self:
-- 
2.34.1



* [PATCH v6 2/7] mshv: Use typed hv_stats_page pointers
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Refactor all relevant functions to use struct hv_stats_page pointers
instead of void pointers for stats page mapping and unmapping, improving
type safety and code clarity across the Hyper-V stats mapping APIs.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root.h         |  5 +++--
 drivers/hv/mshv_root_hv_call.c | 12 +++++++-----
 drivers/hv/mshv_root_main.c    |  8 ++++----
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..05ba1f716f9e 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -307,8 +307,9 @@ int hv_call_disconnect_port(u64 connection_partition_id,
 int hv_call_notify_port_ring_empty(u32 sint_index);
 int hv_map_stats_page(enum hv_stats_object_type type,
 		      const union hv_stats_object_identity *identity,
-		      void **addr);
-int hv_unmap_stats_page(enum hv_stats_object_type type, void *page_addr,
+		      struct hv_stats_page **addr);
+int hv_unmap_stats_page(enum hv_stats_object_type type,
+			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity);
 int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 				   u64 page_struct_count, u32 host_access,
diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
index 1f93b94d7580..daee036e48bc 100644
--- a/drivers/hv/mshv_root_hv_call.c
+++ b/drivers/hv/mshv_root_hv_call.c
@@ -890,9 +890,10 @@ hv_stats_get_area_type(enum hv_stats_object_type type,
  * caller should check for this case and instead fallback to the SELF area
  * alone.
  */
-static int hv_call_map_stats_page(enum hv_stats_object_type type,
-				  const union hv_stats_object_identity *identity,
-				  void **addr)
+static int
+hv_call_map_stats_page(enum hv_stats_object_type type,
+		       const union hv_stats_object_identity *identity,
+		       struct hv_stats_page **addr)
 {
 	unsigned long flags;
 	struct hv_input_map_stats_page *input;
@@ -942,7 +943,7 @@ static int hv_call_map_stats_page(enum hv_stats_object_type type,
 
 int hv_map_stats_page(enum hv_stats_object_type type,
 		      const union hv_stats_object_identity *identity,
-		      void **addr)
+		      struct hv_stats_page **addr)
 {
 	int ret;
 	struct page *allocated_page = NULL;
@@ -990,7 +991,8 @@ static int hv_call_unmap_stats_page(enum hv_stats_object_type type,
 	return hv_result_to_errno(status);
 }
 
-int hv_unmap_stats_page(enum hv_stats_object_type type, void *page_addr,
+int hv_unmap_stats_page(enum hv_stats_object_type type,
+			struct hv_stats_page *page_addr,
 			const union hv_stats_object_identity *identity)
 {
 	int ret;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1777778f84b8..be5ad0fbfbee 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -957,7 +957,7 @@ mshv_vp_release(struct inode *inode, struct file *filp)
 }
 
 static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
-				void *stats_pages[])
+				struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
@@ -972,7 +972,7 @@ static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
 }
 
 static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
-			     void *stats_pages[])
+			     struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
@@ -1010,7 +1010,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	struct mshv_create_vp args;
 	struct mshv_vp *vp;
 	struct page *intercept_msg_page, *register_page, *ghcb_page;
-	void *stats_pages[2];
+	struct hv_stats_page *stats_pages[2];
 	long ret;
 
 	if (copy_from_user(&args, arg, sizeof(args)))
@@ -1729,7 +1729,7 @@ static void destroy_partition(struct mshv_partition *partition)
 
 			if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
 				mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
-						    (void **)vp->vp_stats_pages);
+						    vp->vp_stats_pages);
 
 			if (vp->vp_register_page) {
 				(void)hv_unmap_vp_state_page(partition->pt_id,
-- 
2.34.1



* [PATCH v6 3/7] mshv: Improve mshv_vp_stats_map/unmap(), add them to mshv_root.h
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

These functions are currently used only on the root partition, to map
child partition VP stats. They will soon also be used on L1VH and for
mapping the host's own VP stats.

Introduce a helper is_l1vh_parent() to determine whether we are mapping
our own VP stats. In this case, do not attempt to map the PARENT area.
Note that this differs from mapping PARENT on an older hypervisor, where
the area is not available at all, so the two cases must be handled
separately.

On unmap, pass the stats pages since on L1VH the kernel allocates them
and they must be freed in hv_unmap_stats_page().
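
A simplified sketch of the unmap side, assuming PARENT may alias SELF after
a successful map. The names fake_unmap_stats_page(), unmap_both_areas(), and
unmap_calls are hypothetical and exist only for this illustration; the real
hypercall wrapper is hv_unmap_stats_page().

```c
#include <assert.h>
#include <stddef.h>

enum { AREA_SELF = 0, AREA_PARENT = 1, NUM_AREAS = 2 };

struct stats_page { unsigned long long data[512]; };

/* Counts calls so the effect of the aliasing check is observable. */
static int unmap_calls;

/* Hypothetical stand-in for hv_unmap_stats_page(). */
static int fake_unmap_stats_page(int area, struct stats_page *page)
{
	assert(page != NULL);
	(void)area;
	unmap_calls++;
	return 0;
}

static void unmap_both_areas(struct stats_page *pages[NUM_AREAS])
{
	fake_unmap_stats_page(AREA_SELF, pages[AREA_SELF]);

	/* PARENT aliases SELF when it was never mapped separately. */
	if (pages[AREA_PARENT] != pages[AREA_SELF])
		fake_unmap_stats_page(AREA_PARENT, pages[AREA_PARENT]);
}
```

Comparing the two pointers avoids a second unmap (and, on L1VH, a second
free) of the same page, without needing a separate flag to record whether
PARENT was really mapped.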

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root.h      | 10 ++++++
 drivers/hv/mshv_root_main.c | 61 ++++++++++++++++++++++++++-----------
 2 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 05ba1f716f9e..e4912b0618fa 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -254,6 +254,16 @@ struct mshv_partition *mshv_partition_get(struct mshv_partition *partition);
 void mshv_partition_put(struct mshv_partition *partition);
 struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
 
+static inline bool is_l1vh_parent(u64 partition_id)
+{
+	return hv_l1vh_partition() && (partition_id == HV_PARTITION_ID_SELF);
+}
+
+int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
+		      struct hv_stats_page **stats_pages);
+void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
+			 struct hv_stats_page **stats_pages);
+
 /* hypercalls */
 
 int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index be5ad0fbfbee..faca3cc63e79 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -956,23 +956,36 @@ mshv_vp_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
-				struct hv_stats_page *stats_pages[])
+void mshv_vp_stats_unmap(u64 partition_id, u32 vp_index,
+			 struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
 		.vp.vp_index = vp_index,
 	};
+	int err;
 
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
-
-	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
-	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_VP,
+				  stats_pages[HV_STATS_AREA_SELF],
+				  &identity);
+	if (err)
+		pr_err("%s: failed to unmap partition %llu vp %u self stats, err: %d\n",
+		       __func__, partition_id, vp_index, err);
+
+	if (stats_pages[HV_STATS_AREA_PARENT] != stats_pages[HV_STATS_AREA_SELF]) {
+		identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
+		err = hv_unmap_stats_page(HV_STATS_OBJECT_VP,
+					  stats_pages[HV_STATS_AREA_PARENT],
+					  &identity);
+		if (err)
+			pr_err("%s: failed to unmap partition %llu vp %u parent stats, err: %d\n",
+			       __func__, partition_id, vp_index, err);
+	}
 }
 
-static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
-			     struct hv_stats_page *stats_pages[])
+int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
+		      struct hv_stats_page *stats_pages[])
 {
 	union hv_stats_object_identity identity = {
 		.vp.partition_id = partition_id,
@@ -983,23 +996,37 @@ static int mshv_vp_stats_map(u64 partition_id, u32 vp_index,
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
 	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
 				&stats_pages[HV_STATS_AREA_SELF]);
-	if (err)
+	if (err) {
+		pr_err("%s: failed to map partition %llu vp %u self stats, err: %d\n",
+		       __func__, partition_id, vp_index, err);
 		return err;
+	}
 
-	identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
-	err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
-				&stats_pages[HV_STATS_AREA_PARENT]);
-	if (err)
-		goto unmap_self;
-
-	if (!stats_pages[HV_STATS_AREA_PARENT])
+	/*
+	 * L1VH partition cannot access its vp stats in parent area.
+	 */
+	if (is_l1vh_parent(partition_id)) {
 		stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+	} else {
+		identity.vp.stats_area_type = HV_STATS_AREA_PARENT;
+		err = hv_map_stats_page(HV_STATS_OBJECT_VP, &identity,
+					&stats_pages[HV_STATS_AREA_PARENT]);
+		if (err) {
+			pr_err("%s: failed to map partition %llu vp %u parent stats, err: %d\n",
+			       __func__, partition_id, vp_index, err);
+			goto unmap_self;
+		}
+		if (!stats_pages[HV_STATS_AREA_PARENT])
+			stats_pages[HV_STATS_AREA_PARENT] = stats_pages[HV_STATS_AREA_SELF];
+	}
 
 	return 0;
 
 unmap_self:
 	identity.vp.stats_area_type = HV_STATS_AREA_SELF;
-	hv_unmap_stats_page(HV_STATS_OBJECT_VP, NULL, &identity);
+	hv_unmap_stats_page(HV_STATS_OBJECT_VP,
+			    stats_pages[HV_STATS_AREA_SELF],
+			    &identity);
 	return err;
 }
 
-- 
2.34.1



* [PATCH v6 4/7] mshv: Always map child vp stats pages regardless of scheduler type
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>

Currently vp->vp_stats_pages is only used by the root scheduler for fast
interrupt injection.

Soon, vp_stats_pages will also be needed for exposing child VP stats to
userspace via debugfs. Mapping the pages a second time to a different
address causes an error on L1VH.

Remove the scheduler requirement and always map the vp stats pages.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index faca3cc63e79..fbfc9e7d9fa4 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1077,16 +1077,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 			goto unmap_register_page;
 	}
 
-	/*
-	 * This mapping of the stats page is for detecting if dispatch thread
-	 * is blocked - only relevant for root scheduler
-	 */
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT) {
-		ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
-					stats_pages);
-		if (ret)
-			goto unmap_ghcb_page;
-	}
+	ret = mshv_vp_stats_map(partition->pt_id, args.vp_index,
+				stats_pages);
+	if (ret)
+		goto unmap_ghcb_page;
 
 	vp = kzalloc(sizeof(*vp), GFP_KERNEL);
 	if (!vp)
@@ -1110,8 +1104,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
 		vp->vp_ghcb_page = page_to_virt(ghcb_page);
 
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
+	memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
 
 	/*
 	 * Keep anon_inode_getfd last: it installs fd in the file struct and
@@ -1133,8 +1126,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 free_vp:
 	kfree(vp);
 unmap_stats_pages:
-	if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-		mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages);
+	mshv_vp_stats_unmap(partition->pt_id, args.vp_index, stats_pages);
 unmap_ghcb_page:
 	if (mshv_partition_encrypted(partition) && is_ghcb_mapping_available())
 		hv_unmap_vp_state_page(partition->pt_id, args.vp_index,
@@ -1754,9 +1746,8 @@ static void destroy_partition(struct mshv_partition *partition)
 			if (!vp)
 				continue;
 
-			if (hv_scheduler_type == HV_SCHEDULER_TYPE_ROOT)
-				mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
-						    vp->vp_stats_pages);
+			mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
+					    vp->vp_stats_pages);
 
 			if (vp->vp_register_page) {
 				(void)hv_unmap_vp_state_page(partition->pt_id,
-- 
2.34.1



* [PATCH v6 5/7] mshv: Update hv_stats_page definitions
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

hv_stats_page belongs in hvhdk.h, so move it there.

It does not require a union to access the data for different counters.
Use a single u64 array instead, for simplicity and to match the Windows
definitions.

While at it, correct the ARM64 value for VpRootDispatchThreadBlocked.
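
As a rough user-space illustration of the flat layout, the sketch below
indexes a page-sized u64 array directly by counter number. PAGE_BYTES is an
assumed stand-in for HV_HYP_PAGE_SIZE (taken to be 4096 here), and the
shortened names are hypothetical; only the x86_64 index 202 comes from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES 4096	/* assumed value of HV_HYP_PAGE_SIZE */
#define COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 202	/* x86_64 index */

/* One flat array covering the whole page; no per-object union needed. */
struct stats_page {
	uint64_t data[PAGE_BYTES / sizeof(uint64_t)];
};

/* The counter may be reported in either area, so check both. */
static bool dispatch_thread_blocked(const struct stats_page *self,
				    const struct stats_page *parent)
{
	assert(self && parent);
	return parent->data[COUNTER_ROOT_DISPATCH_THREAD_BLOCKED] ||
	       self->data[COUNTER_ROOT_DISPATCH_THREAD_BLOCKED];
}
```

Because every stats object type shares the same array, a counter index is
all a reader needs, whichever object the page belongs to.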

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c | 27 ++++++++-------------------
 include/hyperv/hvhdk.h      |  7 +++++++
 2 files changed, 15 insertions(+), 19 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index fbfc9e7d9fa4..414d9cee5252 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -39,22 +39,12 @@ MODULE_AUTHOR("Microsoft");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
 
-/* TODO move this to another file when debugfs code is added */
-enum hv_stats_vp_counters {			/* HV_THREAD_COUNTER */
-#if defined(CONFIG_X86)
-	VpRootDispatchThreadBlocked			= 202,
+/* HV_THREAD_COUNTER */
+#if defined(CONFIG_X86_64)
+#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 202
 #elif defined(CONFIG_ARM64)
-	VpRootDispatchThreadBlocked			= 94,
+#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 95
 #endif
-	VpStatsMaxCounter
-};
-
-struct hv_stats_page {
-	union {
-		u64 vp_cntrs[VpStatsMaxCounter];		/* VP counters */
-		u8 data[HV_HYP_PAGE_SIZE];
-	};
-} __packed;
 
 struct mshv_root mshv_root;
 
@@ -485,12 +475,11 @@ static u64 mshv_vp_interrupt_pending(struct mshv_vp *vp)
 static bool mshv_vp_dispatch_thread_blocked(struct mshv_vp *vp)
 {
 	struct hv_stats_page **stats = vp->vp_stats_pages;
-	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->vp_cntrs;
-	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->vp_cntrs;
+	u64 *self_vp_cntrs = stats[HV_STATS_AREA_SELF]->data;
+	u64 *parent_vp_cntrs = stats[HV_STATS_AREA_PARENT]->data;
 
-	if (self_vp_cntrs[VpRootDispatchThreadBlocked])
-		return self_vp_cntrs[VpRootDispatchThreadBlocked];
-	return parent_vp_cntrs[VpRootDispatchThreadBlocked];
+	return parent_vp_cntrs[HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED] ||
+	       self_vp_cntrs[HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED];
 }
 
 static int
diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 469186df7826..d87cfdb7d360 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -10,6 +10,13 @@
 #include "hvhdk_mini.h"
 #include "hvgdk.h"
 
+/*
+ * Hypervisor statistics page format
+ */
+struct hv_stats_page {
+	u64 data[HV_HYP_PAGE_SIZE / sizeof(u64)];
+} __packed;
+
 /* Bits for dirty mask of hv_vp_register_page */
 #define HV_X64_REGISTER_CLASS_GENERAL	0
 #define HV_X64_REGISTER_CLASS_IP	1
-- 
2.34.1



* [PATCH v6 6/7] mshv: Add data for printing stats page counters
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

Introduce mshv_debugfs_counters.c, containing static data
corresponding to HV_*_COUNTER enums in the hypervisor source.
Defining the enum members as an array instead makes more sense,
since it will be iterated over to print counter information to
debugfs.

Include hypervisor, logical processor, partition, and virtual
processor counters.
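
To illustrate why an array suits iteration, here is a small user-space
sketch of walking a sparse name table built with designated initializers.
The demo_counters table, ARRAY_LEN(), and print_counters() are hypothetical;
how mshv_debugfs.c actually walks the real tables is not shown in this
patch, so treat the loop as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Index 0 and index 3 are deliberately left unnamed (NULL). */
static const char *const demo_counters[] = {
	[1] = "DemoFirst",
	[2] = "DemoSecond",
	[4] = "DemoFourth",
};

/* Print "name: value" for every named slot; return how many were printed. */
static size_t print_counters(FILE *out, const char *const *names, size_t n,
			     const unsigned long long *values)
{
	size_t i, printed = 0;

	assert(out && names && values);
	for (i = 0; i < n; i++) {
		if (!names[i])
			continue;	/* gap in the numbering: nothing to print */
		fprintf(out, "%-16s: %llu\n", names[i], values[i]);
		printed++;
	}
	return printed;
}
```

Slots that are not explicitly initialized are NULL, so gaps in the counter
numbering cost one pointer each and are skipped by a single check in the
loop.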

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
---
 drivers/hv/mshv_debugfs_counters.c | 490 +++++++++++++++++++++++++++++
 1 file changed, 490 insertions(+)
 create mode 100644 drivers/hv/mshv_debugfs_counters.c

diff --git a/drivers/hv/mshv_debugfs_counters.c b/drivers/hv/mshv_debugfs_counters.c
new file mode 100644
index 000000000000..978536ba691f
--- /dev/null
+++ b/drivers/hv/mshv_debugfs_counters.c
@@ -0,0 +1,490 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * Data for printing stats page counters via debugfs.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+/*
+ * For simplicity, this file is included directly in mshv_debugfs.c.
+ * If these are ever needed elsewhere they should be compiled separately.
+ * Ensure this file is not used twice by accident.
+ */
+#ifndef MSHV_DEBUGFS_C
+#error "This file should only be included in mshv_debugfs.c"
+#endif
+
+/* HV_HYPERVISOR_COUNTER */
+static char *hv_hypervisor_counters[] = {
+	[1] = "HvLogicalProcessors",
+	[2] = "HvPartitions",
+	[3] = "HvTotalPages",
+	[4] = "HvVirtualProcessors",
+	[5] = "HvMonitoredNotifications",
+	[6] = "HvModernStandbyEntries",
+	[7] = "HvPlatformIdleTransitions",
+	[8] = "HvHypervisorStartupCost",
+
+	[10] = "HvIOSpacePages",
+	[11] = "HvNonEssentialPagesForDump",
+	[12] = "HvSubsumedPages",
+};
+
+/* HV_CPU_COUNTER */
+static char *hv_lp_counters[] = {
+	[1] = "LpGlobalTime",
+	[2] = "LpTotalRunTime",
+	[3] = "LpHypervisorRunTime",
+	[4] = "LpHardwareInterrupts",
+	[5] = "LpContextSwitches",
+	[6] = "LpInterProcessorInterrupts",
+	[7] = "LpSchedulerInterrupts",
+	[8] = "LpTimerInterrupts",
+	[9] = "LpInterProcessorInterruptsSent",
+	[10] = "LpProcessorHalts",
+	[11] = "LpMonitorTransitionCost",
+	[12] = "LpContextSwitchTime",
+	[13] = "LpC1TransitionsCount",
+	[14] = "LpC1RunTime",
+	[15] = "LpC2TransitionsCount",
+	[16] = "LpC2RunTime",
+	[17] = "LpC3TransitionsCount",
+	[18] = "LpC3RunTime",
+	[19] = "LpRootVpIndex",
+	[20] = "LpIdleSequenceNumber",
+	[21] = "LpGlobalTscCount",
+	[22] = "LpActiveTscCount",
+	[23] = "LpIdleAccumulation",
+	[24] = "LpReferenceCycleCount0",
+	[25] = "LpActualCycleCount0",
+	[26] = "LpReferenceCycleCount1",
+	[27] = "LpActualCycleCount1",
+	[28] = "LpProximityDomainId",
+	[29] = "LpPostedInterruptNotifications",
+	[30] = "LpBranchPredictorFlushes",
+#if IS_ENABLED(CONFIG_X86_64)
+	[31] = "LpL1DataCacheFlushes",
+	[32] = "LpImmediateL1DataCacheFlushes",
+	[33] = "LpMbFlushes",
+	[34] = "LpCounterRefreshSequenceNumber",
+	[35] = "LpCounterRefreshReferenceTime",
+	[36] = "LpIdleAccumulationSnapshot",
+	[37] = "LpActiveTscCountSnapshot",
+	[38] = "LpHwpRequestContextSwitches",
+	[39] = "LpPlaceholder1",
+	[40] = "LpPlaceholder2",
+	[41] = "LpPlaceholder3",
+	[42] = "LpPlaceholder4",
+	[43] = "LpPlaceholder5",
+	[44] = "LpPlaceholder6",
+	[45] = "LpPlaceholder7",
+	[46] = "LpPlaceholder8",
+	[47] = "LpPlaceholder9",
+	[48] = "LpSchLocalRunListSize",
+	[49] = "LpReserveGroupId",
+	[50] = "LpRunningPriority",
+	[51] = "LpPerfmonInterruptCount",
+#elif IS_ENABLED(CONFIG_ARM64)
+	[31] = "LpCounterRefreshSequenceNumber",
+	[32] = "LpCounterRefreshReferenceTime",
+	[33] = "LpIdleAccumulationSnapshot",
+	[34] = "LpActiveTscCountSnapshot",
+	[35] = "LpHwpRequestContextSwitches",
+	[36] = "LpPlaceholder2",
+	[37] = "LpPlaceholder3",
+	[38] = "LpPlaceholder4",
+	[39] = "LpPlaceholder5",
+	[40] = "LpPlaceholder6",
+	[41] = "LpPlaceholder7",
+	[42] = "LpPlaceholder8",
+	[43] = "LpPlaceholder9",
+	[44] = "LpSchLocalRunListSize",
+	[45] = "LpReserveGroupId",
+	[46] = "LpRunningPriority",
+#endif
+};
+
+/* HV_PROCESS_COUNTER */
+static char *hv_partition_counters[] = {
+	[1] = "PtVirtualProcessors",
+
+	[3] = "PtTlbSize",
+	[4] = "PtAddressSpaces",
+	[5] = "PtDepositedPages",
+	[6] = "PtGpaPages",
+	[7] = "PtGpaSpaceModifications",
+	[8] = "PtVirtualTlbFlushEntires",
+	[9] = "PtRecommendedTlbSize",
+	[10] = "PtGpaPages4K",
+	[11] = "PtGpaPages2M",
+	[12] = "PtGpaPages1G",
+	[13] = "PtGpaPages512G",
+	[14] = "PtDevicePages4K",
+	[15] = "PtDevicePages2M",
+	[16] = "PtDevicePages1G",
+	[17] = "PtDevicePages512G",
+	[18] = "PtAttachedDevices",
+	[19] = "PtDeviceInterruptMappings",
+	[20] = "PtIoTlbFlushes",
+	[21] = "PtIoTlbFlushCost",
+	[22] = "PtDeviceInterruptErrors",
+	[23] = "PtDeviceDmaErrors",
+	[24] = "PtDeviceInterruptThrottleEvents",
+	[25] = "PtSkippedTimerTicks",
+	[26] = "PtPartitionId",
+#if IS_ENABLED(CONFIG_X86_64)
+	[27] = "PtNestedTlbSize",
+	[28] = "PtRecommendedNestedTlbSize",
+	[29] = "PtNestedTlbFreeListSize",
+	[30] = "PtNestedTlbTrimmedPages",
+	[31] = "PtPagesShattered",
+	[32] = "PtPagesRecombined",
+	[33] = "PtHwpRequestValue",
+	[34] = "PtAutoSuspendEnableTime",
+	[35] = "PtAutoSuspendTriggerTime",
+	[36] = "PtAutoSuspendDisableTime",
+	[37] = "PtPlaceholder1",
+	[38] = "PtPlaceholder2",
+	[39] = "PtPlaceholder3",
+	[40] = "PtPlaceholder4",
+	[41] = "PtPlaceholder5",
+	[42] = "PtPlaceholder6",
+	[43] = "PtPlaceholder7",
+	[44] = "PtPlaceholder8",
+	[45] = "PtHypervisorStateTransferGeneration",
+	[46] = "PtNumberofActiveChildPartitions",
+#elif IS_ENABLED(CONFIG_ARM64)
+	[27] = "PtHwpRequestValue",
+	[28] = "PtAutoSuspendEnableTime",
+	[29] = "PtAutoSuspendTriggerTime",
+	[30] = "PtAutoSuspendDisableTime",
+	[31] = "PtPlaceholder1",
+	[32] = "PtPlaceholder2",
+	[33] = "PtPlaceholder3",
+	[34] = "PtPlaceholder4",
+	[35] = "PtPlaceholder5",
+	[36] = "PtPlaceholder6",
+	[37] = "PtPlaceholder7",
+	[38] = "PtPlaceholder8",
+	[39] = "PtHypervisorStateTransferGeneration",
+	[40] = "PtNumberofActiveChildPartitions",
+#endif
+};
+
+/* HV_THREAD_COUNTER */
+static char *hv_vp_counters[] = {
+	[1] = "VpTotalRunTime",
+	[2] = "VpHypervisorRunTime",
+	[3] = "VpRemoteNodeRunTime",
+	[4] = "VpNormalizedRunTime",
+	[5] = "VpIdealCpu",
+
+	[7] = "VpHypercallsCount",
+	[8] = "VpHypercallsTime",
+#if IS_ENABLED(CONFIG_X86_64)
+	[9] = "VpPageInvalidationsCount",
+	[10] = "VpPageInvalidationsTime",
+	[11] = "VpControlRegisterAccessesCount",
+	[12] = "VpControlRegisterAccessesTime",
+	[13] = "VpIoInstructionsCount",
+	[14] = "VpIoInstructionsTime",
+	[15] = "VpHltInstructionsCount",
+	[16] = "VpHltInstructionsTime",
+	[17] = "VpMwaitInstructionsCount",
+	[18] = "VpMwaitInstructionsTime",
+	[19] = "VpCpuidInstructionsCount",
+	[20] = "VpCpuidInstructionsTime",
+	[21] = "VpMsrAccessesCount",
+	[22] = "VpMsrAccessesTime",
+	[23] = "VpOtherInterceptsCount",
+	[24] = "VpOtherInterceptsTime",
+	[25] = "VpExternalInterruptsCount",
+	[26] = "VpExternalInterruptsTime",
+	[27] = "VpPendingInterruptsCount",
+	[28] = "VpPendingInterruptsTime",
+	[29] = "VpEmulatedInstructionsCount",
+	[30] = "VpEmulatedInstructionsTime",
+	[31] = "VpDebugRegisterAccessesCount",
+	[32] = "VpDebugRegisterAccessesTime",
+	[33] = "VpPageFaultInterceptsCount",
+	[34] = "VpPageFaultInterceptsTime",
+	[35] = "VpGuestPageTableMaps",
+	[36] = "VpLargePageTlbFills",
+	[37] = "VpSmallPageTlbFills",
+	[38] = "VpReflectedGuestPageFaults",
+	[39] = "VpApicMmioAccesses",
+	[40] = "VpIoInterceptMessages",
+	[41] = "VpMemoryInterceptMessages",
+	[42] = "VpApicEoiAccesses",
+	[43] = "VpOtherMessages",
+	[44] = "VpPageTableAllocations",
+	[45] = "VpLogicalProcessorMigrations",
+	[46] = "VpAddressSpaceEvictions",
+	[47] = "VpAddressSpaceSwitches",
+	[48] = "VpAddressDomainFlushes",
+	[49] = "VpAddressSpaceFlushes",
+	[50] = "VpGlobalGvaRangeFlushes",
+	[51] = "VpLocalGvaRangeFlushes",
+	[52] = "VpPageTableEvictions",
+	[53] = "VpPageTableReclamations",
+	[54] = "VpPageTableResets",
+	[55] = "VpPageTableValidations",
+	[56] = "VpApicTprAccesses",
+	[57] = "VpPageTableWriteIntercepts",
+	[58] = "VpSyntheticInterrupts",
+	[59] = "VpVirtualInterrupts",
+	[60] = "VpApicIpisSent",
+	[61] = "VpApicSelfIpisSent",
+	[62] = "VpGpaSpaceHypercalls",
+	[63] = "VpLogicalProcessorHypercalls",
+	[64] = "VpLongSpinWaitHypercalls",
+	[65] = "VpOtherHypercalls",
+	[66] = "VpSyntheticInterruptHypercalls",
+	[67] = "VpVirtualInterruptHypercalls",
+	[68] = "VpVirtualMmuHypercalls",
+	[69] = "VpVirtualProcessorHypercalls",
+	[70] = "VpHardwareInterrupts",
+	[71] = "VpNestedPageFaultInterceptsCount",
+	[72] = "VpNestedPageFaultInterceptsTime",
+	[73] = "VpPageScans",
+	[74] = "VpLogicalProcessorDispatches",
+	[75] = "VpWaitingForCpuTime",
+	[76] = "VpExtendedHypercalls",
+	[77] = "VpExtendedHypercallInterceptMessages",
+	[78] = "VpMbecNestedPageTableSwitches",
+	[79] = "VpOtherReflectedGuestExceptions",
+	[80] = "VpGlobalIoTlbFlushes",
+	[81] = "VpGlobalIoTlbFlushCost",
+	[82] = "VpLocalIoTlbFlushes",
+	[83] = "VpLocalIoTlbFlushCost",
+	[84] = "VpHypercallsForwardedCount",
+	[85] = "VpHypercallsForwardingTime",
+	[86] = "VpPageInvalidationsForwardedCount",
+	[87] = "VpPageInvalidationsForwardingTime",
+	[88] = "VpControlRegisterAccessesForwardedCount",
+	[89] = "VpControlRegisterAccessesForwardingTime",
+	[90] = "VpIoInstructionsForwardedCount",
+	[91] = "VpIoInstructionsForwardingTime",
+	[92] = "VpHltInstructionsForwardedCount",
+	[93] = "VpHltInstructionsForwardingTime",
+	[94] = "VpMwaitInstructionsForwardedCount",
+	[95] = "VpMwaitInstructionsForwardingTime",
+	[96] = "VpCpuidInstructionsForwardedCount",
+	[97] = "VpCpuidInstructionsForwardingTime",
+	[98] = "VpMsrAccessesForwardedCount",
+	[99] = "VpMsrAccessesForwardingTime",
+	[100] = "VpOtherInterceptsForwardedCount",
+	[101] = "VpOtherInterceptsForwardingTime",
+	[102] = "VpExternalInterruptsForwardedCount",
+	[103] = "VpExternalInterruptsForwardingTime",
+	[104] = "VpPendingInterruptsForwardedCount",
+	[105] = "VpPendingInterruptsForwardingTime",
+	[106] = "VpEmulatedInstructionsForwardedCount",
+	[107] = "VpEmulatedInstructionsForwardingTime",
+	[108] = "VpDebugRegisterAccessesForwardedCount",
+	[109] = "VpDebugRegisterAccessesForwardingTime",
+	[110] = "VpPageFaultInterceptsForwardedCount",
+	[111] = "VpPageFaultInterceptsForwardingTime",
+	[112] = "VpVmclearEmulationCount",
+	[113] = "VpVmclearEmulationTime",
+	[114] = "VpVmptrldEmulationCount",
+	[115] = "VpVmptrldEmulationTime",
+	[116] = "VpVmptrstEmulationCount",
+	[117] = "VpVmptrstEmulationTime",
+	[118] = "VpVmreadEmulationCount",
+	[119] = "VpVmreadEmulationTime",
+	[120] = "VpVmwriteEmulationCount",
+	[121] = "VpVmwriteEmulationTime",
+	[122] = "VpVmxoffEmulationCount",
+	[123] = "VpVmxoffEmulationTime",
+	[124] = "VpVmxonEmulationCount",
+	[125] = "VpVmxonEmulationTime",
+	[126] = "VpNestedVMEntriesCount",
+	[127] = "VpNestedVMEntriesTime",
+	[128] = "VpNestedSLATSoftPageFaultsCount",
+	[129] = "VpNestedSLATSoftPageFaultsTime",
+	[130] = "VpNestedSLATHardPageFaultsCount",
+	[131] = "VpNestedSLATHardPageFaultsTime",
+	[132] = "VpInvEptAllContextEmulationCount",
+	[133] = "VpInvEptAllContextEmulationTime",
+	[134] = "VpInvEptSingleContextEmulationCount",
+	[135] = "VpInvEptSingleContextEmulationTime",
+	[136] = "VpInvVpidAllContextEmulationCount",
+	[137] = "VpInvVpidAllContextEmulationTime",
+	[138] = "VpInvVpidSingleContextEmulationCount",
+	[139] = "VpInvVpidSingleContextEmulationTime",
+	[140] = "VpInvVpidSingleAddressEmulationCount",
+	[141] = "VpInvVpidSingleAddressEmulationTime",
+	[142] = "VpNestedTlbPageTableReclamations",
+	[143] = "VpNestedTlbPageTableEvictions",
+	[144] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
+	[145] = "VpFlushGuestPhysicalAddressListHypercalls",
+	[146] = "VpPostedInterruptNotifications",
+	[147] = "VpPostedInterruptScans",
+	[148] = "VpTotalCoreRunTime",
+	[149] = "VpMaximumRunTime",
+	[150] = "VpHwpRequestContextSwitches",
+	[151] = "VpWaitingForCpuTimeBucket0",
+	[152] = "VpWaitingForCpuTimeBucket1",
+	[153] = "VpWaitingForCpuTimeBucket2",
+	[154] = "VpWaitingForCpuTimeBucket3",
+	[155] = "VpWaitingForCpuTimeBucket4",
+	[156] = "VpWaitingForCpuTimeBucket5",
+	[157] = "VpWaitingForCpuTimeBucket6",
+	[158] = "VpVmloadEmulationCount",
+	[159] = "VpVmloadEmulationTime",
+	[160] = "VpVmsaveEmulationCount",
+	[161] = "VpVmsaveEmulationTime",
+	[162] = "VpGifInstructionEmulationCount",
+	[163] = "VpGifInstructionEmulationTime",
+	[164] = "VpEmulatedErrataSvmInstructions",
+	[165] = "VpPlaceholder1",
+	[166] = "VpPlaceholder2",
+	[167] = "VpPlaceholder3",
+	[168] = "VpPlaceholder4",
+	[169] = "VpPlaceholder5",
+	[170] = "VpPlaceholder6",
+	[171] = "VpPlaceholder7",
+	[172] = "VpPlaceholder8",
+	[173] = "VpContentionTime",
+	[174] = "VpWakeUpTime",
+	[175] = "VpSchedulingPriority",
+	[176] = "VpRdpmcInstructionsCount",
+	[177] = "VpRdpmcInstructionsTime",
+	[178] = "VpPerfmonPmuMsrAccessesCount",
+	[179] = "VpPerfmonLbrMsrAccessesCount",
+	[180] = "VpPerfmonIptMsrAccessesCount",
+	[181] = "VpPerfmonInterruptCount",
+	[182] = "VpVtl1DispatchCount",
+	[183] = "VpVtl2DispatchCount",
+	[184] = "VpVtl2DispatchBucket0",
+	[185] = "VpVtl2DispatchBucket1",
+	[186] = "VpVtl2DispatchBucket2",
+	[187] = "VpVtl2DispatchBucket3",
+	[188] = "VpVtl2DispatchBucket4",
+	[189] = "VpVtl2DispatchBucket5",
+	[190] = "VpVtl2DispatchBucket6",
+	[191] = "VpVtl1RunTime",
+	[192] = "VpVtl2RunTime",
+	[193] = "VpIommuHypercalls",
+	[194] = "VpCpuGroupHypercalls",
+	[195] = "VpVsmHypercalls",
+	[196] = "VpEventLogHypercalls",
+	[197] = "VpDeviceDomainHypercalls",
+	[198] = "VpDepositHypercalls",
+	[199] = "VpSvmHypercalls",
+	[200] = "VpBusLockAcquisitionCount",
+	[201] = "VpLoadAvg",
+	[202] = "VpRootDispatchThreadBlocked",
+	[203] = "VpIdleCpuTime",
+	[204] = "VpWaitingForCpuTimeBucket7",
+	[205] = "VpWaitingForCpuTimeBucket8",
+	[206] = "VpWaitingForCpuTimeBucket9",
+	[207] = "VpWaitingForCpuTimeBucket10",
+	[208] = "VpWaitingForCpuTimeBucket11",
+	[209] = "VpWaitingForCpuTimeBucket12",
+	[210] = "VpHierarchicalSuspendTime",
+	[211] = "VpExpressSchedulingAttempts",
+	[212] = "VpExpressSchedulingCount",
+#elif IS_ENABLED(CONFIG_ARM64)
+	[9] = "VpSysRegAccessesCount",
+	[10] = "VpSysRegAccessesTime",
+	[11] = "VpSmcInstructionsCount",
+	[12] = "VpSmcInstructionsTime",
+	[13] = "VpOtherInterceptsCount",
+	[14] = "VpOtherInterceptsTime",
+	[15] = "VpExternalInterruptsCount",
+	[16] = "VpExternalInterruptsTime",
+	[17] = "VpPendingInterruptsCount",
+	[18] = "VpPendingInterruptsTime",
+	[19] = "VpGuestPageTableMaps",
+	[20] = "VpLargePageTlbFills",
+	[21] = "VpSmallPageTlbFills",
+	[22] = "VpReflectedGuestPageFaults",
+	[23] = "VpMemoryInterceptMessages",
+	[24] = "VpOtherMessages",
+	[25] = "VpLogicalProcessorMigrations",
+	[26] = "VpAddressDomainFlushes",
+	[27] = "VpAddressSpaceFlushes",
+	[28] = "VpSyntheticInterrupts",
+	[29] = "VpVirtualInterrupts",
+	[30] = "VpApicSelfIpisSent",
+	[31] = "VpGpaSpaceHypercalls",
+	[32] = "VpLogicalProcessorHypercalls",
+	[33] = "VpLongSpinWaitHypercalls",
+	[34] = "VpOtherHypercalls",
+	[35] = "VpSyntheticInterruptHypercalls",
+	[36] = "VpVirtualInterruptHypercalls",
+	[37] = "VpVirtualMmuHypercalls",
+	[38] = "VpVirtualProcessorHypercalls",
+	[39] = "VpHardwareInterrupts",
+	[40] = "VpNestedPageFaultInterceptsCount",
+	[41] = "VpNestedPageFaultInterceptsTime",
+	[42] = "VpLogicalProcessorDispatches",
+	[43] = "VpWaitingForCpuTime",
+	[44] = "VpExtendedHypercalls",
+	[45] = "VpExtendedHypercallInterceptMessages",
+	[46] = "VpMbecNestedPageTableSwitches",
+	[47] = "VpOtherReflectedGuestExceptions",
+	[48] = "VpGlobalIoTlbFlushes",
+	[49] = "VpGlobalIoTlbFlushCost",
+	[50] = "VpLocalIoTlbFlushes",
+	[51] = "VpLocalIoTlbFlushCost",
+	[52] = "VpFlushGuestPhysicalAddressSpaceHypercalls",
+	[53] = "VpFlushGuestPhysicalAddressListHypercalls",
+	[54] = "VpPostedInterruptNotifications",
+	[55] = "VpPostedInterruptScans",
+	[56] = "VpTotalCoreRunTime",
+	[57] = "VpMaximumRunTime",
+	[58] = "VpWaitingForCpuTimeBucket0",
+	[59] = "VpWaitingForCpuTimeBucket1",
+	[60] = "VpWaitingForCpuTimeBucket2",
+	[61] = "VpWaitingForCpuTimeBucket3",
+	[62] = "VpWaitingForCpuTimeBucket4",
+	[63] = "VpWaitingForCpuTimeBucket5",
+	[64] = "VpWaitingForCpuTimeBucket6",
+	[65] = "VpHwpRequestContextSwitches",
+	[66] = "VpPlaceholder2",
+	[67] = "VpPlaceholder3",
+	[68] = "VpPlaceholder4",
+	[69] = "VpPlaceholder5",
+	[70] = "VpPlaceholder6",
+	[71] = "VpPlaceholder7",
+	[72] = "VpPlaceholder8",
+	[73] = "VpContentionTime",
+	[74] = "VpWakeUpTime",
+	[75] = "VpSchedulingPriority",
+	[76] = "VpVtl1DispatchCount",
+	[77] = "VpVtl2DispatchCount",
+	[78] = "VpVtl2DispatchBucket0",
+	[79] = "VpVtl2DispatchBucket1",
+	[80] = "VpVtl2DispatchBucket2",
+	[81] = "VpVtl2DispatchBucket3",
+	[82] = "VpVtl2DispatchBucket4",
+	[83] = "VpVtl2DispatchBucket5",
+	[84] = "VpVtl2DispatchBucket6",
+	[85] = "VpVtl1RunTime",
+	[86] = "VpVtl2RunTime",
+	[87] = "VpIommuHypercalls",
+	[88] = "VpCpuGroupHypercalls",
+	[89] = "VpVsmHypercalls",
+	[90] = "VpEventLogHypercalls",
+	[91] = "VpDeviceDomainHypercalls",
+	[92] = "VpDepositHypercalls",
+	[93] = "VpSvmHypercalls",
+	[94] = "VpLoadAvg",
+	[95] = "VpRootDispatchThreadBlocked",
+	[96] = "VpIdleCpuTime",
+	[97] = "VpWaitingForCpuTimeBucket7",
+	[98] = "VpWaitingForCpuTimeBucket8",
+	[99] = "VpWaitingForCpuTimeBucket9",
+	[100] = "VpWaitingForCpuTimeBucket10",
+	[101] = "VpWaitingForCpuTimeBucket11",
+	[102] = "VpWaitingForCpuTimeBucket12",
+	[103] = "VpHierarchicalSuspendTime",
+	[104] = "VpExpressSchedulingAttempts",
+	[105] = "VpExpressSchedulingCount",
+#endif
+};
-- 
2.34.1



* [PATCH v6 7/7] mshv: Add debugfs to view hypervisor statistics
From: Nuno Das Neves @ 2026-01-28 18:11 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, mhklinux, skinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, prapal, mrathor,
	paekkaladevi, Nuno Das Neves, Jinank Jain
In-Reply-To: <20260128181146.517708-1-nunodasneves@linux.microsoft.com>

Introduce a debugfs interface to expose root and child partition stats
when running with mshv_root.

Create a debugfs directory "mshv" containing "stats" files organized by
type and id. Each stats file contains a set of counters that depends on
its type. For example, an excerpt from a VP stats file:

TotalRunTime                  : 1997602722
HypervisorRunTime             : 649671371
RemoteNodeRunTime             : 0
NormalizedRunTime             : 1997602721
IdealCpu                      : 0
HypercallsCount               : 1708169
HypercallsTime                : 111914774
PageInvalidationsCount        : 0
PageInvalidationsTime         : 0
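
A minimal userspace sketch of reading one such line back, assuming the
"Name : value" layout shown in the excerpt. parse_stats_line() is a
hypothetical helper for illustration and is not part of this patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "Name : value" line from a stats file.
 * Returns 0 on success, -1 if the line does not match or the name
 * does not fit in the caller's buffer.
 * Hypothetical helper: the name and signature are assumptions.
 */
int parse_stats_line(const char *line, char *name, size_t name_sz,
		     uint64_t *value)
{
	char buf[64];
	unsigned long long v;

	assert(line && name && value && name_sz > 0);

	/* Read the name up to the first space, tab or ':', then the value. */
	if (sscanf(line, " %63[^: \t] : %llu", buf, &v) != 2)
		return -1;
	if (strlen(buf) >= name_sz)
		return -1;

	strcpy(name, buf);
	*value = v;
	return 0;
}
```

The scanset stops at the first space or colon, so a name that fills the
whole padded column and runs straight into the ':' should still parse.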

On a root partition with some active child partitions, the entire
directory structure may look like:

mshv/
  stats             # hypervisor stats
  lp/               # logical processors
    0/              # LP id
      stats         # LP 0 stats
    1/
    2/
    3/
  partition/        # partition stats
    1/              # root partition id
      stats         # root partition stats
      vp/           # root virtual processors
        0/          # root VP id
          stats     # root VP 0 stats
        1/
        2/
        3/
    42/             # child partition id
      stats         # child partition stats
      vp/           # child VPs
        0/          # child VP id
          stats     # child VP 0 stats
        1/
    43/
    55/
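
As a rough illustration of the layout above, a userspace tool could
build the path to a VP stats file as sketched below. The mount point
/sys/kernel/debug is an assumption (it is the usual debugfs location),
and mshv_vp_stats_path() is a hypothetical helper, not something this
patch provides:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed location of the "mshv" directory under a standard debugfs mount. */
#define MSHV_DEBUGFS_ROOT "/sys/kernel/debug/mshv"

/*
 * Build "<root>/partition/<id>/vp/<index>/stats" into buf.
 * Returns the snprintf() result, i.e. the length the full path needs
 * (excluding the terminating NUL); a result >= sz means truncation.
 */
int mshv_vp_stats_path(char *buf, size_t sz, uint64_t partition_id,
		       uint32_t vp_index)
{
	assert(buf && sz > 0);

	return snprintf(buf, sz,
			MSHV_DEBUGFS_ROOT "/partition/%llu/vp/%u/stats",
			(unsigned long long)partition_id,
			(unsigned int)vp_index);
}
```

For the child partition 42 in the tree above, VP 1 would map to
mshv/partition/42/vp/1/stats under the debugfs mount.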

On L1VH, some stats are not present because, unlike the root partition,
it does not own the hardware:
- The hypervisor and lp stats are not present.
- L1VH's partition directory is named "self" because it can't get its
  own id.
- Some of L1VH's partition and VP stats fields are not populated, because
  it can't map its own HV_STATS_AREA_PARENT page.
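
The show functions in this patch print the PARENT value for each
counter and fall back to the SELF value when the PARENT value is 0.
A standalone sketch of that selection, using plain arrays in place of
struct hv_stats_page (pick_counter() and the enum names are
illustrative assumptions, not kernel symbols):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index values mirror the patch's static_assert: SELF == 0, PARENT == 1. */
enum { STATS_AREA_SELF = 0, STATS_AREA_PARENT = 1, NUM_STATS_AREAS = 2 };

/*
 * Pick the value to report for counter idx: prefer the PARENT page and
 * fall back to the SELF page when the PARENT value is 0.
 * areas[] holds one pointer per stats area, indexed by the enum above.
 */
uint64_t pick_counter(const uint64_t *const *areas, size_t idx)
{
	uint64_t parent_val, self_val;

	assert(areas && areas[STATS_AREA_SELF] && areas[STATS_AREA_PARENT]);

	parent_val = areas[STATS_AREA_PARENT][idx];
	self_val = areas[STATS_AREA_SELF][idx];

	return parent_val ? parent_val : self_val;
}
```

When the PARENT page cannot be mapped, the patch stores the SELF page
pointer in both slots, so the same expression simply yields the SELF
value.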

Co-developed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Co-developed-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Co-developed-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com>
Co-developed-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/Makefile         |   1 +
 drivers/hv/mshv_debugfs.c   | 726 ++++++++++++++++++++++++++++++++++++
 drivers/hv/mshv_root.h      |  34 ++
 drivers/hv/mshv_root_main.c |  26 +-
 4 files changed, 785 insertions(+), 2 deletions(-)
 create mode 100644 drivers/hv/mshv_debugfs.c

diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index a49f93c2d245..2593711c3628 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -15,6 +15,7 @@ hv_vmbus-$(CONFIG_HYPERV_TESTING)	+= hv_debugfs.o
 hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
 mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
 	       mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
 mshv_vtl-y := mshv_vtl_main.o
 
 # Code that must be built-in
diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
new file mode 100644
index 000000000000..ebf2549eb44d
--- /dev/null
+++ b/drivers/hv/mshv_debugfs.c
@@ -0,0 +1,726 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, Microsoft Corporation.
+ *
+ * The /sys/kernel/debug/mshv directory contents.
+ * Contains various statistics data, provided by the hypervisor.
+ *
+ * Authors: Microsoft Linux virtualization team
+ */
+
+#include <linux/debugfs.h>
+#include <linux/stringify.h>
+#include <asm/mshyperv.h>
+#include <linux/slab.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+/* Ensure mshv_debugfs_counters.c is not included elsewhere by accident */
+#define MSHV_DEBUGFS_C
+#include "mshv_debugfs_counters.c"
+
+#define U32_BUF_SZ 11
+#define U64_BUF_SZ 21
+/* Only support SELF and PARENT areas */
+#define NUM_STATS_AREAS 2
+static_assert(HV_STATS_AREA_SELF == 0 && HV_STATS_AREA_PARENT == 1,
+	      "SELF and PARENT areas must be usable as indices into an array of size NUM_STATS_AREAS");
+/* HV_HYPERVISOR_COUNTER */
+#define HV_HYPERVISOR_COUNTER_LOGICAL_PROCESSORS 1
+
+static struct dentry *mshv_debugfs;
+static struct dentry *mshv_debugfs_partition;
+static struct dentry *mshv_debugfs_lp;
+static struct dentry **parent_vp_stats;
+static struct dentry *parent_partition_stats;
+
+static u64 mshv_lps_count;
+static struct hv_stats_page **mshv_lps_stats;
+
+static int lp_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page *stats = m->private;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_lp_counters); idx++) {
+		char *name = hv_lp_counters[idx];
+
+		if (!name)
+			continue;
+		seq_printf(m, "%-32s: %llu\n", name, stats->data[idx]);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(lp_stats);
+
+static void mshv_lp_stats_unmap(u32 lp_index)
+{
+	union hv_stats_object_identity identity = {
+		.lp.lp_index = lp_index,
+		.lp.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	int err;
+
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR,
+				  mshv_lps_stats[lp_index], &identity);
+	if (err)
+		pr_err("%s: failed to unmap logical processor %u stats, err: %d\n",
+		       __func__, lp_index, err);
+
+	mshv_lps_stats[lp_index] = NULL;
+}
+
+static struct hv_stats_page * __init mshv_lp_stats_map(u32 lp_index)
+{
+	union hv_stats_object_identity identity = {
+		.lp.lp_index = lp_index,
+		.lp.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	struct hv_stats_page *stats;
+	int err;
+
+	err = hv_map_stats_page(HV_STATS_OBJECT_LOGICAL_PROCESSOR, &identity,
+				&stats);
+	if (err) {
+		pr_err("%s: failed to map logical processor %u stats, err: %d\n",
+		       __func__, lp_index, err);
+		return ERR_PTR(err);
+	}
+	mshv_lps_stats[lp_index] = stats;
+
+	return stats;
+}
+
+static struct hv_stats_page * __init lp_debugfs_stats_create(u32 lp_index,
+							     struct dentry *parent)
+{
+	struct dentry *dentry;
+	struct hv_stats_page *stats;
+
+	stats = mshv_lp_stats_map(lp_index);
+	if (IS_ERR(stats))
+		return stats;
+
+	dentry = debugfs_create_file("stats", 0400, parent,
+				     stats, &lp_stats_fops);
+	if (IS_ERR(dentry)) {
+		mshv_lp_stats_unmap(lp_index);
+		return ERR_CAST(dentry);
+	}
+	return stats;
+}
+
+static int __init lp_debugfs_create(u32 lp_index, struct dentry *parent)
+{
+	struct dentry *idx;
+	char lp_idx_str[U32_BUF_SZ];
+	struct hv_stats_page *stats;
+	int err;
+
+	sprintf(lp_idx_str, "%u", lp_index);
+
+	idx = debugfs_create_dir(lp_idx_str, parent);
+	if (IS_ERR(idx))
+		return PTR_ERR(idx);
+
+	stats = lp_debugfs_stats_create(lp_index, idx);
+	if (IS_ERR(stats)) {
+		err = PTR_ERR(stats);
+		goto remove_debugfs_lp_idx;
+	}
+
+	return 0;
+
+remove_debugfs_lp_idx:
+	debugfs_remove_recursive(idx);
+	return err;
+}
+
+static void mshv_debugfs_lp_remove(void)
+{
+	int lp_index;
+
+	debugfs_remove_recursive(mshv_debugfs_lp);
+
+	for (lp_index = 0; lp_index < mshv_lps_count; lp_index++)
+		mshv_lp_stats_unmap(lp_index);
+
+	kfree(mshv_lps_stats);
+	mshv_lps_stats = NULL;
+}
+
+static int __init mshv_debugfs_lp_create(struct dentry *parent)
+{
+	struct dentry *lp_dir;
+	int err, lp_index;
+
+	mshv_lps_stats = kcalloc(mshv_lps_count,
+				 sizeof(*mshv_lps_stats),
+				 GFP_KERNEL_ACCOUNT);
+
+	if (!mshv_lps_stats)
+		return -ENOMEM;
+
+	lp_dir = debugfs_create_dir("lp", parent);
+	if (IS_ERR(lp_dir)) {
+		err = PTR_ERR(lp_dir);
+		goto free_lp_stats;
+	}
+
+	for (lp_index = 0; lp_index < mshv_lps_count; lp_index++) {
+		err = lp_debugfs_create(lp_index, lp_dir);
+		if (err)
+			goto remove_debugfs_lps;
+	}
+
+	mshv_debugfs_lp = lp_dir;
+
+	return 0;
+
+remove_debugfs_lps:
+	for (lp_index -= 1; lp_index >= 0; lp_index--)
+		mshv_lp_stats_unmap(lp_index);
+	debugfs_remove_recursive(lp_dir);
+free_lp_stats:
+	kfree(mshv_lps_stats);
+	mshv_lps_stats = NULL;
+
+	return err;
+}
+
+static int vp_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page **pstats = m->private;
+	u64 parent_val, self_val;
+	int idx;
+
+	/*
+	 * For VP and partition stats, there may be two stats areas mapped,
+	 * SELF and PARENT. These refer to the privilege level of the data in
+	 * each page. Some fields may be 0 in SELF and nonzero in PARENT, or
+	 * vice versa.
+	 *
+	 * Hence, prioritize printing from the PARENT page (more privileged
+	 * data), but use the value from the SELF page if the PARENT value is
+	 * 0.
+	 */
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_vp_counters); idx++) {
+		char *name = hv_vp_counters[idx];
+
+		if (!name)
+			continue;
+
+		parent_val = pstats[HV_STATS_AREA_PARENT]->data[idx];
+		self_val = pstats[HV_STATS_AREA_SELF]->data[idx];
+		seq_printf(m, "%-43s: %llu\n", name,
+			   parent_val ? parent_val : self_val);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(vp_stats);
+
+static void vp_debugfs_remove(struct dentry *vp_stats)
+{
+	debugfs_remove_recursive(vp_stats->d_parent);
+}
+
+static int vp_debugfs_create(u64 partition_id, u32 vp_index,
+			     struct hv_stats_page **pstats,
+			     struct dentry **vp_stats_ptr,
+			     struct dentry *parent)
+{
+	struct dentry *vp_idx_dir, *d;
+	char vp_idx_str[U32_BUF_SZ];
+	int err;
+
+	sprintf(vp_idx_str, "%u", vp_index);
+
+	vp_idx_dir = debugfs_create_dir(vp_idx_str, parent);
+	if (IS_ERR(vp_idx_dir))
+		return PTR_ERR(vp_idx_dir);
+
+	d = debugfs_create_file("stats", 0400, vp_idx_dir,
+				pstats, &vp_stats_fops);
+	if (IS_ERR(d)) {
+		err = PTR_ERR(d);
+		goto remove_debugfs_vp_idx;
+	}
+
+	*vp_stats_ptr = d;
+
+	return 0;
+
+remove_debugfs_vp_idx:
+	debugfs_remove_recursive(vp_idx_dir);
+	return err;
+}
+
+static int partition_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page **pstats = m->private;
+	u64 parent_val, self_val;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_partition_counters); idx++) {
+		char *name = hv_partition_counters[idx];
+
+		if (!name)
+			continue;
+
+		parent_val = pstats[HV_STATS_AREA_PARENT]->data[idx];
+		self_val = pstats[HV_STATS_AREA_SELF]->data[idx];
+		seq_printf(m, "%-37s: %llu\n", name,
+			   parent_val ? parent_val : self_val);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(partition_stats);
+
+static void mshv_partition_stats_unmap(u64 partition_id,
+				       struct hv_stats_page *stats_page,
+				       enum hv_stats_area_type stats_area_type)
+{
+	union hv_stats_object_identity identity = {
+		.partition.partition_id = partition_id,
+		.partition.stats_area_type = stats_area_type,
+	};
+	int err;
+
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_PARTITION, stats_page,
+				  &identity);
+	if (err)
+		pr_err("%s: failed to unmap partition %llu %s stats, err: %d\n",
+		       __func__, partition_id,
+		       (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
+		       err);
+}
+
+static struct hv_stats_page *mshv_partition_stats_map(u64 partition_id,
+						      enum hv_stats_area_type stats_area_type)
+{
+	union hv_stats_object_identity identity = {
+		.partition.partition_id = partition_id,
+		.partition.stats_area_type = stats_area_type,
+	};
+	struct hv_stats_page *stats;
+	int err;
+
+	err = hv_map_stats_page(HV_STATS_OBJECT_PARTITION, &identity, &stats);
+	if (err) {
+		pr_err("%s: failed to map partition %llu %s stats, err: %d\n",
+		       __func__, partition_id,
+		       (stats_area_type == HV_STATS_AREA_SELF) ? "self" : "parent",
+		       err);
+		return ERR_PTR(err);
+	}
+	return stats;
+}
+
+static int mshv_debugfs_partition_stats_create(u64 partition_id,
+					       struct dentry **partition_stats_ptr,
+					       struct dentry *parent)
+{
+	struct dentry *dentry;
+	struct hv_stats_page **pstats;
+	int err;
+
+	pstats = kcalloc(NUM_STATS_AREAS, sizeof(struct hv_stats_page *),
+			 GFP_KERNEL_ACCOUNT);
+	if (!pstats)
+		return -ENOMEM;
+
+	pstats[HV_STATS_AREA_SELF] = mshv_partition_stats_map(partition_id,
+							      HV_STATS_AREA_SELF);
+	if (IS_ERR(pstats[HV_STATS_AREA_SELF])) {
+		err = PTR_ERR(pstats[HV_STATS_AREA_SELF]);
+		goto cleanup;
+	}
+
+	/*
+	 * L1VH partition cannot access its partition stats in parent area.
+	 */
+	if (is_l1vh_parent(partition_id)) {
+		pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
+	} else {
+		pstats[HV_STATS_AREA_PARENT] = mshv_partition_stats_map(partition_id,
+									HV_STATS_AREA_PARENT);
+		if (IS_ERR(pstats[HV_STATS_AREA_PARENT])) {
+			err = PTR_ERR(pstats[HV_STATS_AREA_PARENT]);
+			goto unmap_self;
+		}
+		if (!pstats[HV_STATS_AREA_PARENT])
+			pstats[HV_STATS_AREA_PARENT] = pstats[HV_STATS_AREA_SELF];
+	}
+
+	dentry = debugfs_create_file("stats", 0400, parent,
+				     pstats, &partition_stats_fops);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		goto unmap_partition_stats;
+	}
+
+	*partition_stats_ptr = dentry;
+	return 0;
+
+unmap_partition_stats:
+	if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF])
+		mshv_partition_stats_unmap(partition_id, pstats[HV_STATS_AREA_PARENT],
+					   HV_STATS_AREA_PARENT);
+unmap_self:
+	mshv_partition_stats_unmap(partition_id, pstats[HV_STATS_AREA_SELF],
+				   HV_STATS_AREA_SELF);
+cleanup:
+	kfree(pstats);
+	return err;
+}
+
+static void partition_debugfs_remove(u64 partition_id, struct dentry *dentry)
+{
+	struct hv_stats_page **pstats = NULL;
+
+	pstats = dentry->d_inode->i_private;
+
+	debugfs_remove_recursive(dentry->d_parent);
+
+	if (pstats[HV_STATS_AREA_PARENT] != pstats[HV_STATS_AREA_SELF]) {
+		mshv_partition_stats_unmap(partition_id,
+					   pstats[HV_STATS_AREA_PARENT],
+					   HV_STATS_AREA_PARENT);
+	}
+
+	mshv_partition_stats_unmap(partition_id,
+				   pstats[HV_STATS_AREA_SELF],
+				   HV_STATS_AREA_SELF);
+
+	kfree(pstats);
+}
+
+static int partition_debugfs_create(u64 partition_id,
+				    struct dentry **vp_dir_ptr,
+				    struct dentry **partition_stats_ptr,
+				    struct dentry *parent)
+{
+	char part_id_str[U64_BUF_SZ];
+	struct dentry *part_id_dir, *vp_dir;
+	int err;
+
+	if (is_l1vh_parent(partition_id))
+		sprintf(part_id_str, "self");
+	else
+		sprintf(part_id_str, "%llu", partition_id);
+
+	part_id_dir = debugfs_create_dir(part_id_str, parent);
+	if (IS_ERR(part_id_dir))
+		return PTR_ERR(part_id_dir);
+
+	vp_dir = debugfs_create_dir("vp", part_id_dir);
+	if (IS_ERR(vp_dir)) {
+		err = PTR_ERR(vp_dir);
+		goto remove_debugfs_partition_id;
+	}
+
+	err = mshv_debugfs_partition_stats_create(partition_id,
+						  partition_stats_ptr,
+						  part_id_dir);
+	if (err)
+		goto remove_debugfs_partition_id;
+
+	*vp_dir_ptr = vp_dir;
+
+	return 0;
+
+remove_debugfs_partition_id:
+	debugfs_remove_recursive(part_id_dir);
+	return err;
+}
+
+static void parent_vp_debugfs_remove(u32 vp_index,
+				     struct dentry *vp_stats_ptr)
+{
+	struct hv_stats_page **pstats;
+
+	pstats = vp_stats_ptr->d_inode->i_private;
+	vp_debugfs_remove(vp_stats_ptr);
+	mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
+	kfree(pstats);
+}
+
+static void mshv_debugfs_parent_partition_remove(void)
+{
+	int idx;
+
+	for_each_online_cpu(idx)
+		parent_vp_debugfs_remove(hv_vp_index[idx],
+					 parent_vp_stats[idx]);
+
+	partition_debugfs_remove(hv_current_partition_id,
+				 parent_partition_stats);
+	kfree(parent_vp_stats);
+	parent_vp_stats = NULL;
+	parent_partition_stats = NULL;
+}
+
+static int __init parent_vp_debugfs_create(u32 vp_index,
+					   struct dentry **vp_stats_ptr,
+					   struct dentry *parent)
+{
+	struct hv_stats_page **pstats;
+	int err;
+
+	pstats = kcalloc(NUM_STATS_AREAS, sizeof(struct hv_stats_page *),
+			 GFP_KERNEL_ACCOUNT);
+	if (!pstats)
+		return -ENOMEM;
+
+	err = mshv_vp_stats_map(hv_current_partition_id, vp_index, pstats);
+	if (err)
+		goto cleanup;
+
+	err = vp_debugfs_create(hv_current_partition_id, vp_index, pstats,
+				vp_stats_ptr, parent);
+	if (err)
+		goto unmap_vp_stats;
+
+	return 0;
+
+unmap_vp_stats:
+	mshv_vp_stats_unmap(hv_current_partition_id, vp_index, pstats);
+cleanup:
+	kfree(pstats);
+	return err;
+}
+
+static int __init mshv_debugfs_parent_partition_create(void)
+{
+	struct dentry *vp_dir;
+	int err, idx, i;
+
+	mshv_debugfs_partition = debugfs_create_dir("partition",
+						    mshv_debugfs);
+	if (IS_ERR(mshv_debugfs_partition))
+		return PTR_ERR(mshv_debugfs_partition);
+
+	err = partition_debugfs_create(hv_current_partition_id,
+				       &vp_dir,
+				       &parent_partition_stats,
+				       mshv_debugfs_partition);
+	if (err)
+		goto remove_debugfs_partition;
+
+	parent_vp_stats = kcalloc(nr_cpu_ids, sizeof(*parent_vp_stats),
+				  GFP_KERNEL);
+	if (!parent_vp_stats) {
+		err = -ENOMEM;
+		goto remove_debugfs_partition;
+	}
+
+	for_each_online_cpu(idx) {
+		err = parent_vp_debugfs_create(hv_vp_index[idx],
+					       &parent_vp_stats[idx],
+					       vp_dir);
+		if (err)
+			goto remove_debugfs_partition_vp;
+	}
+
+	return 0;
+
+remove_debugfs_partition_vp:
+	for_each_online_cpu(i) {
+		if (i >= idx)
+			break;
+		parent_vp_debugfs_remove(hv_vp_index[i], parent_vp_stats[i]);
+	}
+	partition_debugfs_remove(hv_current_partition_id,
+				 parent_partition_stats);
+
+	kfree(parent_vp_stats);
+	parent_vp_stats = NULL;
+	parent_partition_stats = NULL;
+
+remove_debugfs_partition:
+	debugfs_remove_recursive(mshv_debugfs_partition);
+	mshv_debugfs_partition = NULL;
+	return err;
+}
+
+static int hv_stats_show(struct seq_file *m, void *v)
+{
+	const struct hv_stats_page *stats = m->private;
+	int idx;
+
+	for (idx = 0; idx < ARRAY_SIZE(hv_hypervisor_counters); idx++) {
+		char *name = hv_hypervisor_counters[idx];
+
+		if (!name)
+			continue;
+		seq_printf(m, "%-27s: %llu\n", name, stats->data[idx]);
+	}
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(hv_stats);
+
+static void mshv_hv_stats_unmap(void)
+{
+	union hv_stats_object_identity identity = {
+		.hv.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	int err;
+
+	err = hv_unmap_stats_page(HV_STATS_OBJECT_HYPERVISOR, NULL, &identity);
+	if (err)
+		pr_err("%s: failed to unmap hypervisor stats: %d\n",
+		       __func__, err);
+}
+
+static void * __init mshv_hv_stats_map(void)
+{
+	union hv_stats_object_identity identity = {
+		.hv.stats_area_type = HV_STATS_AREA_SELF,
+	};
+	struct hv_stats_page *stats;
+	int err;
+
+	err = hv_map_stats_page(HV_STATS_OBJECT_HYPERVISOR, &identity, &stats);
+	if (err) {
+		pr_err("%s: failed to map hypervisor stats: %d\n",
+		       __func__, err);
+		return ERR_PTR(err);
+	}
+	return stats;
+}
+
+static int __init mshv_debugfs_hv_stats_create(struct dentry *parent)
+{
+	struct dentry *dentry;
+	u64 *stats;
+	int err;
+
+	stats = mshv_hv_stats_map();
+	if (IS_ERR(stats))
+		return PTR_ERR(stats);
+
+	dentry = debugfs_create_file("stats", 0400, parent,
+				     stats, &hv_stats_fops);
+	if (IS_ERR(dentry)) {
+		err = PTR_ERR(dentry);
+		pr_err("%s: failed to create hypervisor stats dentry: %d\n",
+		       __func__, err);
+		goto unmap_hv_stats;
+	}
+
+	mshv_lps_count = stats[HV_HYPERVISOR_COUNTER_LOGICAL_PROCESSORS];
+
+	return 0;
+
+unmap_hv_stats:
+	mshv_hv_stats_unmap();
+	return err;
+}
+
+int mshv_debugfs_vp_create(struct mshv_vp *vp)
+{
+	struct mshv_partition *p = vp->vp_partition;
+
+	if (!mshv_debugfs)
+		return 0;
+
+	return vp_debugfs_create(p->pt_id, vp->vp_index,
+				 vp->vp_stats_pages,
+				 &vp->vp_stats_dentry,
+				 p->pt_vp_dentry);
+}
+
+void mshv_debugfs_vp_remove(struct mshv_vp *vp)
+{
+	if (!mshv_debugfs)
+		return;
+
+	vp_debugfs_remove(vp->vp_stats_dentry);
+}
+
+int mshv_debugfs_partition_create(struct mshv_partition *partition)
+{
+	int err;
+
+	if (!mshv_debugfs)
+		return 0;
+
+	err = partition_debugfs_create(partition->pt_id,
+				       &partition->pt_vp_dentry,
+				       &partition->pt_stats_dentry,
+				       mshv_debugfs_partition);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+void mshv_debugfs_partition_remove(struct mshv_partition *partition)
+{
+	if (!mshv_debugfs)
+		return;
+
+	partition_debugfs_remove(partition->pt_id,
+				 partition->pt_stats_dentry);
+}
+
+int __init mshv_debugfs_init(void)
+{
+	int err;
+
+	mshv_debugfs = debugfs_create_dir("mshv", NULL);
+	if (IS_ERR(mshv_debugfs)) {
+		pr_err("%s: failed to create debugfs directory\n", __func__);
+		return PTR_ERR(mshv_debugfs);
+	}
+
+	if (hv_root_partition()) {
+		err = mshv_debugfs_hv_stats_create(mshv_debugfs);
+		if (err)
+			goto remove_mshv_dir;
+
+		err = mshv_debugfs_lp_create(mshv_debugfs);
+		if (err)
+			goto unmap_hv_stats;
+	}
+
+	err = mshv_debugfs_parent_partition_create();
+	if (err)
+		goto unmap_lp_stats;
+
+	return 0;
+
+unmap_lp_stats:
+	if (hv_root_partition()) {
+		mshv_debugfs_lp_remove();
+		mshv_debugfs_lp = NULL;
+	}
+unmap_hv_stats:
+	if (hv_root_partition())
+		mshv_hv_stats_unmap();
+remove_mshv_dir:
+	debugfs_remove_recursive(mshv_debugfs);
+	mshv_debugfs = NULL;
+	return err;
+}
+
+void mshv_debugfs_exit(void)
+{
+	mshv_debugfs_parent_partition_remove();
+
+	if (hv_root_partition()) {
+		mshv_debugfs_lp_remove();
+		mshv_debugfs_lp = NULL;
+		mshv_hv_stats_unmap();
+	}
+
+	debugfs_remove_recursive(mshv_debugfs);
+	mshv_debugfs = NULL;
+	mshv_debugfs_partition = NULL;
+}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index e4912b0618fa..7332d9af8373 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -52,6 +52,9 @@ struct mshv_vp {
 		unsigned int kicked_by_hv;
 		wait_queue_head_t vp_suspend_queue;
 	} run;
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+	struct dentry *vp_stats_dentry;
+#endif
 };
 
 #define vp_fmt(fmt) "p%lluvp%u: " fmt
@@ -136,6 +139,10 @@ struct mshv_partition {
 	u64 isolation_type;
 	bool import_completed;
 	bool pt_initialized;
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+	struct dentry *pt_stats_dentry;
+	struct dentry *pt_vp_dentry;
+#endif
 };
 
 #define pt_fmt(fmt) "p%llu: " fmt
@@ -327,6 +334,33 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
 int hv_call_get_partition_property_ex(u64 partition_id, u64 property_code, u64 arg,
 				      void *property_value, size_t property_value_sz);
 
+#if IS_ENABLED(CONFIG_DEBUG_FS)
+int __init mshv_debugfs_init(void);
+void mshv_debugfs_exit(void);
+
+int mshv_debugfs_partition_create(struct mshv_partition *partition);
+void mshv_debugfs_partition_remove(struct mshv_partition *partition);
+int mshv_debugfs_vp_create(struct mshv_vp *vp);
+void mshv_debugfs_vp_remove(struct mshv_vp *vp);
+#else
+static inline int __init mshv_debugfs_init(void)
+{
+	return 0;
+}
+static inline void mshv_debugfs_exit(void) { }
+
+static inline int mshv_debugfs_partition_create(struct mshv_partition *partition)
+{
+	return 0;
+}
+static inline void mshv_debugfs_partition_remove(struct mshv_partition *partition) { }
+static inline int mshv_debugfs_vp_create(struct mshv_vp *vp)
+{
+	return 0;
+}
+static inline void mshv_debugfs_vp_remove(struct mshv_vp *vp) { }
+#endif
+
 extern struct mshv_root mshv_root;
 extern enum hv_scheduler_type hv_scheduler_type;
 extern u8 * __percpu *hv_synic_eventring_tail;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 414d9cee5252..3a43e41e16a1 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1095,6 +1095,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 
 	memcpy(vp->vp_stats_pages, stats_pages, sizeof(stats_pages));
 
+	ret = mshv_debugfs_vp_create(vp);
+	if (ret)
+		goto put_partition;
+
 	/*
 	 * Keep anon_inode_getfd last: it installs fd in the file struct and
 	 * thus makes the state accessible in user space.
@@ -1102,7 +1106,7 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 	ret = anon_inode_getfd("mshv_vp", &mshv_vp_fops, vp,
 			       O_RDWR | O_CLOEXEC);
 	if (ret < 0)
-		goto put_partition;
+		goto remove_debugfs_vp;
 
 	/* already exclusive with the partition mutex for all ioctls */
 	partition->pt_vp_count++;
@@ -1110,6 +1114,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
 
 	return ret;
 
+remove_debugfs_vp:
+	mshv_debugfs_vp_remove(vp);
 put_partition:
 	mshv_partition_put(partition);
 free_vp:
@@ -1552,10 +1558,16 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
 	if (ret)
 		goto withdraw_mem;
 
+	ret = mshv_debugfs_partition_create(partition);
+	if (ret)
+		goto finalize_partition;
+
 	partition->pt_initialized = true;
 
 	return 0;
 
+finalize_partition:
+	hv_call_finalize_partition(partition->pt_id);
 withdraw_mem:
 	hv_call_withdraw_memory(U64_MAX, NUMA_NO_NODE, partition->pt_id);
 
@@ -1735,6 +1747,7 @@ static void destroy_partition(struct mshv_partition *partition)
 			if (!vp)
 				continue;
 
+			mshv_debugfs_vp_remove(vp);
 			mshv_vp_stats_unmap(partition->pt_id, vp->vp_index,
 					    vp->vp_stats_pages);
 
@@ -1768,6 +1781,8 @@ static void destroy_partition(struct mshv_partition *partition)
 			partition->pt_vp_array[i] = NULL;
 		}
 
+		mshv_debugfs_partition_remove(partition);
+
 		/* Deallocates and unmaps everything including vcpus, GPA mappings etc */
 		hv_call_finalize_partition(partition->pt_id);
 
@@ -2313,10 +2328,14 @@ static int __init mshv_parent_partition_init(void)
 
 	mshv_init_vmm_caps(dev);
 
-	ret = mshv_irqfd_wq_init();
+	ret = mshv_debugfs_init();
 	if (ret)
 		goto exit_partition;
 
+	ret = mshv_irqfd_wq_init();
+	if (ret)
+		goto exit_debugfs;
+
 	spin_lock_init(&mshv_root.pt_ht_lock);
 	hash_init(mshv_root.pt_htable);
 
@@ -2324,6 +2343,8 @@ static int __init mshv_parent_partition_init(void)
 
 	return 0;
 
+exit_debugfs:
+	mshv_debugfs_exit();
 exit_partition:
 	if (hv_root_partition())
 		mshv_root_partition_exit();
@@ -2340,6 +2361,7 @@ static void __exit mshv_parent_partition_exit(void)
 {
 	hv_setup_mshv_handler(NULL);
 	mshv_port_table_fini();
+	mshv_debugfs_exit();
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
 	if (hv_root_partition())
-- 
2.34.1

