* [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests
@ 2025-12-09 5:11 Yu Zhang
2025-12-09 5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang
` (5 more replies)
0 siblings, 6 replies; 28+ messages in thread
From: Yu Zhang @ 2025-12-09 5:11 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, iommu, linux-pci
Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani,
robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan,
jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch
This patch series introduces a para-virtualized IOMMU driver for
Linux guests running on Microsoft Hyper-V. The primary objective
is to enable hardware-assisted DMA isolation and scalable device
assignment for Hyper-V child partitions, while avoiding the performance
overhead and complexity associated with emulating IOMMU hardware.
The driver implements the following core functionality:
* Hypercall-based Enumeration
Unlike traditional ACPI-based discovery (e.g., DMAR/IVRS),
this driver enumerates the Hyper-V IOMMU capabilities directly
via hypercalls (see the sketch after this list). This approach
allows the guest to discover IOMMU presence and features without
requiring specific virtual firmware extensions or modifications.
* Domain Management
The driver manages IOMMU domains through a new set of Hyper-V
hypercall interfaces, handling domain allocation, attachment,
and detachment for endpoint devices.
* IOTLB Invalidation
IOTLB invalidation requests are marshaled and issued to the
hypervisor through the same hypercall mechanism.
* Nested Translation Support
This implementation leverages guest-managed stage-1 I/O page
tables nested with host stage-2 translations. It is built
upon the consolidated IOMMU page table framework designed by
Jason Gunthorpe [1]. This design eliminates the need for complex
emulation during map operations and ensures scalability across
different architectures.
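
For reference, below is a minimal sketch of how the capability query could
be issued on top of the hypercall interfaces added in patch 3/5. The function
name hv_iommu_probe_caps() is hypothetical, and passing HV_PARTITION_ID_SELF
as the partition ID is an assumption; the sketch only reuses helpers that
already exist (hv_do_hypercall(), hv_result_success(), and the per-cpu
hypercall argument pages, with output pages made available to child
partitions by patch 4/5 of this series) plus the structures defined by
patch 3/5:

  /*
   * Illustrative sketch: query the PV IOMMU capabilities of this partition
   * and check the PRESENT bit. Assumes <asm/mshyperv.h> and hvhdk_mini.h.
   */
  static bool hv_iommu_probe_caps(struct hv_output_get_iommu_capabilities *caps)
  {
          struct hv_input_get_iommu_capabilities *input;
          struct hv_output_get_iommu_capabilities *output;
          unsigned long flags;
          u64 status;

          local_irq_save(flags);
          input = *this_cpu_ptr(hyperv_pcpu_input_arg);
          output = *this_cpu_ptr(hyperv_pcpu_output_arg);

          memset(input, 0, sizeof(*input));
          input->partition_id = HV_PARTITION_ID_SELF;     /* assumption */

          status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);
          if (hv_result_success(status))
                  *caps = *output;
          local_irq_restore(flags);

          return hv_result_success(status) &&
                 (caps->iommu_cap & HV_IOMMU_CAP_PRESENT);
  }
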
Implementation Notes:
* Architecture Independence
While the current implementation supports only x86 platforms (Intel
VT-d and AMD IOMMU), the driver design aims to be as architecture-
agnostic as possible. To achieve this, initialization occurs via
`device_initcall` rather than `x86_init.iommu.iommu_init`, and shutdown
is handled via `syscore_ops` instead of `x86_platform.iommu_shutdown`
(see the sketch after these notes).
* MSI Region Handling
In this RFC, the hardware MSI region is hard-coded to the standard
x86 interrupt range (0xfee00000 - 0xfeefffff). Future updates may
allow this configuration to be queried via hypercalls if new hardware
platforms are to be supported.
* Reserved Regions (RMRR)
There is currently no requirement to support assigned devices with
ACPI RMRR limitations. Consequently, this patch series does not specify
or query reserved memory regions.
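
To make the notes above more concrete, here is a minimal sketch of the
initcall/syscore wiring and of how the hard-coded MSI window could be
reported as a reserved region. The function names (hv_iommu_init,
hv_iommu_shutdown, hv_iommu_get_resv_regions), the constants, and the
probing/teardown steps are illustrative only, not the actual implementation;
it relies on the existing device_initcall(), register_syscore_ops() and
iommu_alloc_resv_region() interfaces:

  #define HV_IOMMU_MSI_BASE       0xfee00000ULL
  #define HV_IOMMU_MSI_SIZE       0x100000

  /* Report the fixed x86 MSI window via the standard resv-region hook. */
  static void hv_iommu_get_resv_regions(struct device *dev,
                                        struct list_head *head)
  {
          struct iommu_resv_region *msi;

          msi = iommu_alloc_resv_region(HV_IOMMU_MSI_BASE, HV_IOMMU_MSI_SIZE,
                                        IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO,
                                        IOMMU_RESV_MSI, GFP_KERNEL);
          if (msi)
                  list_add_tail(&msi->list, head);
  }

  static void hv_iommu_shutdown(void)
  {
          /* Detach devices and delete device domains via hypercalls. */
  }

  static struct syscore_ops hv_iommu_syscore_ops = {
          .shutdown = hv_iommu_shutdown,
  };

  /* Arch-agnostic entry point instead of x86_init.iommu.iommu_init. */
  static int __init hv_iommu_init(void)
  {
          if (!hv_is_hyperv_initialized())
                  return -ENODEV;

          /* Query capabilities and register iommu_ops here. */
          register_syscore_ops(&hv_iommu_syscore_ops);
          return 0;
  }
  device_initcall(hv_iommu_init);
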
Testing:
This series has been validated using dmatest with Intel DSA devices
assigned to the child partition. The tests confirmed successful DMA
transactions under the para-virtualized IOMMU.
Future Work:
* Page-selective IOTLB Invalidation
The current implementation relies on full-domain flushes. Support
for page-selective invalidation is planned for a future series.
* Advanced Features
Support for vSVA and virtual PRI will be addressed in subsequent
updates.
* Root Partition Co-existence
Ensure compatibility with the distinct para-virtualized IOMMU driver
used by the Linux root partition on Hyper-V, where DMA remapping is
not performed through stage-1 I/O page tables and a separate set of
iommu ops is provided.
[1] https://github.com/jgunthorpe/linux/tree/iommu_pt_all
Easwar Hariharan (2):
PCI: hv: Create and export hv_build_logical_dev_id()
iommu: Move Hyper-V IOMMU driver to its own subdirectory
Wei Liu (1):
hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
Yu Zhang (2):
hyperv: allow hypercall output pages to be allocated for child
partitions
iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
drivers/hv/hv_common.c | 21 +-
drivers/iommu/Kconfig | 10 +-
drivers/iommu/Makefile | 2 +-
drivers/iommu/hyperv/Kconfig | 24 +
drivers/iommu/hyperv/Makefile | 3 +
drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++
drivers/iommu/hyperv/iommu.h | 53 ++
.../irq_remapping.c} | 2 +-
drivers/pci/controller/pci-hyperv.c | 28 +-
include/asm-generic/mshyperv.h | 2 +
include/hyperv/hvgdk_mini.h | 8 +
include/hyperv/hvhdk_mini.h | 123 ++++
12 files changed, 850 insertions(+), 34 deletions(-)
create mode 100644 drivers/iommu/hyperv/Kconfig
create mode 100644 drivers/iommu/hyperv/Makefile
create mode 100644 drivers/iommu/hyperv/iommu.c
create mode 100644 drivers/iommu/hyperv/iommu.h
rename drivers/iommu/{hyperv-iommu.c => hyperv/irq_remapping.c} (99%)
--
2.49.0
^ permalink raw reply [flat|nested] 28+ messages in thread* [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2025-12-09 5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang @ 2025-12-09 5:11 ` Yu Zhang 2025-12-09 5:21 ` Randy Dunlap ` (2 more replies) 2025-12-09 5:11 ` [RFC v1 2/5] iommu: Move Hyper-V IOMMU driver to its own subdirectory Yu Zhang ` (4 subsequent siblings) 5 siblings, 3 replies; 28+ messages in thread From: Yu Zhang @ 2025-12-09 5:11 UTC (permalink / raw) To: linux-kernel, linux-hyperv, iommu, linux-pci Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Hyper-V uses a logical device ID to identify a PCI endpoint device for child partitions. This ID will also be required for future hypercalls used by the Hyper-V IOMMU driver. Refactor the logic for building this logical device ID into a standalone helper function and export the interface for wider use. Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> --- drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- include/asm-generic/mshyperv.h | 2 ++ 2 files changed, 22 insertions(+), 8 deletions(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 146b43981b27..4b82e06b5d93 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) #define hv_msi_prepare pci_msi_prepare +/** + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the + * function number of the device. + */ +u64 hv_build_logical_dev_id(struct pci_dev *pdev) +{ + struct pci_bus *pbus = pdev->bus; + struct hv_pcibus_device *hbus = container_of(pbus->sysdata, + struct hv_pcibus_device, sysdata); + + return (u64)((hbus->hdev->dev_instance.b[5] << 24) | + (hbus->hdev->dev_instance.b[4] << 16) | + (hbus->hdev->dev_instance.b[7] << 8) | + (hbus->hdev->dev_instance.b[6] & 0xf8) | + PCI_FUNC(pdev->devfn)); +} +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id); + /** * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current * affinity. * @data: Describes the IRQ * * Build new a destination for the MSI and make a hypercall to - * update the Interrupt Redirection Table. "Device Logical ID" - * is built out of this PCI bus's instance GUID and the function - * number of the device. + * update the Interrupt Redirection Table. 
*/ static void hv_irq_retarget_interrupt(struct irq_data *data) { @@ -642,11 +658,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data) params->int_entry.source = HV_INTERRUPT_SOURCE_MSI; params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff; params->int_entry.msi_entry.data.as_uint32 = int_desc->data; - params->device_id = (hbus->hdev->dev_instance.b[5] << 24) | - (hbus->hdev->dev_instance.b[4] << 16) | - (hbus->hdev->dev_instance.b[7] << 8) | - (hbus->hdev->dev_instance.b[6] & 0xf8) | - PCI_FUNC(pdev->devfn); + params->device_id = hv_build_logical_dev_id(pdev); params->int_target.vector = hv_msi_get_int_vector(data); if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) { diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index 64ba6bc807d9..1a205ed69435 100644 --- a/include/asm-generic/mshyperv.h +++ b/include/asm-generic/mshyperv.h @@ -71,6 +71,8 @@ extern enum hv_partition_type hv_curr_partition_type; extern void * __percpu *hyperv_pcpu_input_arg; extern void * __percpu *hyperv_pcpu_output_arg; +extern u64 hv_build_logical_dev_id(struct pci_dev *pdev); + u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr); u64 hv_do_fast_hypercall8(u16 control, u64 input8); u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2); -- 2.49.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2025-12-09 5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang @ 2025-12-09 5:21 ` Randy Dunlap 2025-12-10 17:03 ` Easwar Hariharan 2025-12-10 21:39 ` Bjorn Helgaas 2026-01-08 18:46 ` Michael Kelley 2 siblings, 1 reply; 28+ messages in thread From: Randy Dunlap @ 2025-12-09 5:21 UTC (permalink / raw) To: Yu Zhang, linux-kernel, linux-hyperv, iommu, linux-pci Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch Hi-- On 12/8/25 9:11 PM, Yu Zhang wrote: > From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > Hyper-V uses a logical device ID to identify a PCI endpoint device for > child partitions. This ID will also be required for future hypercalls > used by the Hyper-V IOMMU driver. > > Refactor the logic for building this logical device ID into a standalone > helper function and export the interface for wider use. > > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > --- > drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- > include/asm-generic/mshyperv.h | 2 ++ > 2 files changed, 22 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > index 146b43981b27..4b82e06b5d93 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) > > #define hv_msi_prepare pci_msi_prepare > > +/** > + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the > + * function number of the device. > + */ Don't use kernel-doc notation "/**" unless you are using kernel-doc comments. You could just convert it to a kernel-doc style comment... > +u64 hv_build_logical_dev_id(struct pci_dev *pdev) > +{ thanks. -- ~Randy ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2025-12-09 5:21 ` Randy Dunlap @ 2025-12-10 17:03 ` Easwar Hariharan 0 siblings, 0 replies; 28+ messages in thread From: Easwar Hariharan @ 2025-12-10 17:03 UTC (permalink / raw) To: Randy Dunlap Cc: Yu Zhang, linux-kernel, linux-hyperv, iommu, linux-pci, easwar.hariharan, kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch On 12/8/2025 9:21 PM, Randy Dunlap wrote: > Hi-- > > On 12/8/25 9:11 PM, Yu Zhang wrote: >> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> >> >> Hyper-V uses a logical device ID to identify a PCI endpoint device for >> child partitions. This ID will also be required for future hypercalls >> used by the Hyper-V IOMMU driver. >> >> Refactor the logic for building this logical device ID into a standalone >> helper function and export the interface for wider use. >> >> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> >> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> >> --- >> drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- >> include/asm-generic/mshyperv.h | 2 ++ >> 2 files changed, 22 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c >> index 146b43981b27..4b82e06b5d93 100644 >> --- a/drivers/pci/controller/pci-hyperv.c >> +++ b/drivers/pci/controller/pci-hyperv.c >> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) >> >> #define hv_msi_prepare pci_msi_prepare >> >> +/** >> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the >> + * function number of the device. >> + */ > > Don't use kernel-doc notation "/**" unless you are using kernel-doc comments. > You could just convert it to a kernel-doc style comment... Thank you for the review, I will fix in a future revision. Thanks, Easwar (he/him) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2025-12-09 5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang 2025-12-09 5:21 ` Randy Dunlap @ 2025-12-10 21:39 ` Bjorn Helgaas 2025-12-11 8:31 ` Yu Zhang 2026-01-08 18:46 ` Michael Kelley 2 siblings, 1 reply; 28+ messages in thread From: Bjorn Helgaas @ 2025-12-10 21:39 UTC (permalink / raw) To: Yu Zhang Cc: linux-kernel, linux-hyperv, iommu, linux-pci, kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch On Tue, Dec 09, 2025 at 01:11:24PM +0800, Yu Zhang wrote: > From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > Hyper-V uses a logical device ID to identify a PCI endpoint device for > child partitions. This ID will also be required for future hypercalls > used by the Hyper-V IOMMU driver. > > Refactor the logic for building this logical device ID into a standalone > helper function and export the interface for wider use. > > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> > --- > drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- > include/asm-generic/mshyperv.h | 2 ++ > 2 files changed, 22 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > index 146b43981b27..4b82e06b5d93 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) > > #define hv_msi_prepare pci_msi_prepare > > +/** > + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the > + * function number of the device. > + */ > +u64 hv_build_logical_dev_id(struct pci_dev *pdev) > +{ > + struct pci_bus *pbus = pdev->bus; > + struct hv_pcibus_device *hbus = container_of(pbus->sysdata, > + struct hv_pcibus_device, sysdata); > + > + return (u64)((hbus->hdev->dev_instance.b[5] << 24) | > + (hbus->hdev->dev_instance.b[4] << 16) | > + (hbus->hdev->dev_instance.b[7] << 8) | > + (hbus->hdev->dev_instance.b[6] & 0xf8) | > + PCI_FUNC(pdev->devfn)); > +} > +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id); > + > /** > * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current > * affinity. > * @data: Describes the IRQ > * > * Build new a destination for the MSI and make a hypercall to > - * update the Interrupt Redirection Table. "Device Logical ID" > - * is built out of this PCI bus's instance GUID and the function > - * number of the device. > + * update the Interrupt Redirection Table. 
> */ > static void hv_irq_retarget_interrupt(struct irq_data *data) > { > @@ -642,11 +658,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data) > params->int_entry.source = HV_INTERRUPT_SOURCE_MSI; > params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff; > params->int_entry.msi_entry.data.as_uint32 = int_desc->data; > - params->device_id = (hbus->hdev->dev_instance.b[5] << 24) | > - (hbus->hdev->dev_instance.b[4] << 16) | > - (hbus->hdev->dev_instance.b[7] << 8) | > - (hbus->hdev->dev_instance.b[6] & 0xf8) | > - PCI_FUNC(pdev->devfn); > + params->device_id = hv_build_logical_dev_id(pdev); > params->int_target.vector = hv_msi_get_int_vector(data); > > if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) { > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h > index 64ba6bc807d9..1a205ed69435 100644 > --- a/include/asm-generic/mshyperv.h > +++ b/include/asm-generic/mshyperv.h > @@ -71,6 +71,8 @@ extern enum hv_partition_type hv_curr_partition_type; > extern void * __percpu *hyperv_pcpu_input_arg; > extern void * __percpu *hyperv_pcpu_output_arg; > > +extern u64 hv_build_logical_dev_id(struct pci_dev *pdev); Curious why you would include the "extern" in this declaration? It's not *wrong*, but it's not necessary, and other declarations in this file omit it, e.g., the ones below: > u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr); > u64 hv_do_fast_hypercall8(u16 control, u64 input8); > u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2); > -- > 2.49.0 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2025-12-10 21:39 ` Bjorn Helgaas @ 2025-12-11 8:31 ` Yu Zhang 0 siblings, 0 replies; 28+ messages in thread From: Yu Zhang @ 2025-12-11 8:31 UTC (permalink / raw) To: Bjorn Helgaas Cc: linux-kernel, linux-hyperv, iommu, linux-pci, kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch On Wed, Dec 10, 2025 at 03:39:45PM -0600, Bjorn Helgaas wrote: > On Tue, Dec 09, 2025 at 01:11:24PM +0800, Yu Zhang wrote: > > From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > > > Hyper-V uses a logical device ID to identify a PCI endpoint device for > > child partitions. This ID will also be required for future hypercalls > > used by the Hyper-V IOMMU driver. > > > > Refactor the logic for building this logical device ID into a standalone > > helper function and export the interface for wider use. > > > > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > > Acked-by: Bjorn Helgaas <bhelgaas@google.com> > > > --- > > drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- > > include/asm-generic/mshyperv.h | 2 ++ > > 2 files changed, 22 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > > index 146b43981b27..4b82e06b5d93 100644 > > --- a/drivers/pci/controller/pci-hyperv.c > > +++ b/drivers/pci/controller/pci-hyperv.c > > @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) > > > > #define hv_msi_prepare pci_msi_prepare > > > > +/** > > + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the > > + * function number of the device. > > + */ > > +u64 hv_build_logical_dev_id(struct pci_dev *pdev) > > +{ > > + struct pci_bus *pbus = pdev->bus; > > + struct hv_pcibus_device *hbus = container_of(pbus->sysdata, > > + struct hv_pcibus_device, sysdata); > > + > > + return (u64)((hbus->hdev->dev_instance.b[5] << 24) | > > + (hbus->hdev->dev_instance.b[4] << 16) | > > + (hbus->hdev->dev_instance.b[7] << 8) | > > + (hbus->hdev->dev_instance.b[6] & 0xf8) | > > + PCI_FUNC(pdev->devfn)); > > +} > > +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id); > > + > > /** > > * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current > > * affinity. > > * @data: Describes the IRQ > > * > > * Build new a destination for the MSI and make a hypercall to > > - * update the Interrupt Redirection Table. "Device Logical ID" > > - * is built out of this PCI bus's instance GUID and the function > > - * number of the device. > > + * update the Interrupt Redirection Table. 
> > */ > > static void hv_irq_retarget_interrupt(struct irq_data *data) > > { > > @@ -642,11 +658,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data) > > params->int_entry.source = HV_INTERRUPT_SOURCE_MSI; > > params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff; > > params->int_entry.msi_entry.data.as_uint32 = int_desc->data; > > - params->device_id = (hbus->hdev->dev_instance.b[5] << 24) | > > - (hbus->hdev->dev_instance.b[4] << 16) | > > - (hbus->hdev->dev_instance.b[7] << 8) | > > - (hbus->hdev->dev_instance.b[6] & 0xf8) | > > - PCI_FUNC(pdev->devfn); > > + params->device_id = hv_build_logical_dev_id(pdev); > > params->int_target.vector = hv_msi_get_int_vector(data); > > > > if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) { > > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h > > index 64ba6bc807d9..1a205ed69435 100644 > > --- a/include/asm-generic/mshyperv.h > > +++ b/include/asm-generic/mshyperv.h > > @@ -71,6 +71,8 @@ extern enum hv_partition_type hv_curr_partition_type; > > extern void * __percpu *hyperv_pcpu_input_arg; > > extern void * __percpu *hyperv_pcpu_output_arg; > > > > +extern u64 hv_build_logical_dev_id(struct pci_dev *pdev); > > Curious why you would include the "extern" in this declaration? It's > not *wrong*, but it's not necessary, and other declarations in this > file omit it, e.g., the ones below: > Thank you, Bjorn. Actually I don't think it is necessary. Will remove it in v2. :) B.R. Yu ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2025-12-09 5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang 2025-12-09 5:21 ` Randy Dunlap 2025-12-10 21:39 ` Bjorn Helgaas @ 2026-01-08 18:46 ` Michael Kelley 2026-01-09 18:40 ` Easwar Hariharan 2 siblings, 1 reply; 28+ messages in thread From: Michael Kelley @ 2026-01-08 18:46 UTC (permalink / raw) To: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > Hyper-V uses a logical device ID to identify a PCI endpoint device for > child partitions. This ID will also be required for future hypercalls > used by the Hyper-V IOMMU driver. > > Refactor the logic for building this logical device ID into a standalone > helper function and export the interface for wider use. > > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > --- > drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- > include/asm-generic/mshyperv.h | 2 ++ > 2 files changed, 22 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > index 146b43981b27..4b82e06b5d93 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) > > #define hv_msi_prepare pci_msi_prepare > > +/** > + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the > + * function number of the device. > + */ > +u64 hv_build_logical_dev_id(struct pci_dev *pdev) > +{ > + struct pci_bus *pbus = pdev->bus; > + struct hv_pcibus_device *hbus = container_of(pbus->sysdata, > + struct hv_pcibus_device, sysdata); > + > + return (u64)((hbus->hdev->dev_instance.b[5] << 24) | > + (hbus->hdev->dev_instance.b[4] << 16) | > + (hbus->hdev->dev_instance.b[7] << 8) | > + (hbus->hdev->dev_instance.b[6] & 0xf8) | > + PCI_FUNC(pdev->devfn)); > +} > +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id); This change is fine for hv_irq_retarget_interrupt(), it doesn't help for the new IOMMU driver because pci-hyperv.c can (and often is) built as a module. The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't use this symbol in that case -- you'll get a link error on vmlinux when building the kernel. Requiring pci-hyperv.c to *not* be built as a module would also require that the VMBus driver not be built as a module, so I don't think that's the right solution. This is a messy problem. The new IOMMU driver needs to start with a generic "struct device" for the PCI device, and somehow find the corresponding VMBus PCI pass-thru device from which it can get the VMBus instance ID. 
I'm thinking about ways to do this that don't depend on code and data structures that are private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion. I was wondering if this "logical device id" is actually parsed by the hypervisor, or whether it is just a unique ID that is opaque to the hypervisor. From the usage in the hypercalls in pci-hyperv.c and this new IOMMU driver, it appears to be the former. Evidently the hypervisor is taking this logical device ID and and matching against bytes 4 thru 7 of the instance GUIDs of PCI pass-thru devices offered to the guest, so as to identify a particular PCI pass-thru device. If that's the case, then Linux doesn't have the option of choosing some other unique ID that is easier to generate and access. There's a uniqueness issue with this kind of logical device ID that has been around for years, but I had never thought about before. In hv_pci_probe() instance GUID bytes 4 and 5 are used to generate the PCI domain number for the "fake" PCI bus that the PCI pass-thru device resides on. The issue is the lack of guaranteed uniqueness of bytes 4 and 5, so there's code to deal with a collision. (The full GUID is unique, but not necessarily some subset of the GUID.) It seems like the same kind of uniqueness issue could occur here. Does the Hyper-V host provide any guarantees about the uniqueness of bytes 4 thru 7 as a unit, and if not, what happens if there is a collision? Again, this uniqueness issue has existed for years, so it's not new to this patch set, but with new uses of the logical device ID, it seems relevant to consider. Michael > + > /** > * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current > * affinity. > * @data: Describes the IRQ > * > * Build new a destination for the MSI and make a hypercall to > - * update the Interrupt Redirection Table. "Device Logical ID" > - * is built out of this PCI bus's instance GUID and the function > - * number of the device. > + * update the Interrupt Redirection Table. > */ > static void hv_irq_retarget_interrupt(struct irq_data *data) > { > @@ -642,11 +658,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data) > params->int_entry.source = HV_INTERRUPT_SOURCE_MSI; > params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff; > params->int_entry.msi_entry.data.as_uint32 = int_desc->data; > - params->device_id = (hbus->hdev->dev_instance.b[5] << 24) | > - (hbus->hdev->dev_instance.b[4] << 16) | > - (hbus->hdev->dev_instance.b[7] << 8) | > - (hbus->hdev->dev_instance.b[6] & 0xf8) | > - PCI_FUNC(pdev->devfn); > + params->device_id = hv_build_logical_dev_id(pdev); > params->int_target.vector = hv_msi_get_int_vector(data); > > if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) { > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h > index 64ba6bc807d9..1a205ed69435 100644 > --- a/include/asm-generic/mshyperv.h > +++ b/include/asm-generic/mshyperv.h > @@ -71,6 +71,8 @@ extern enum hv_partition_type hv_curr_partition_type; > extern void * __percpu *hyperv_pcpu_input_arg; > extern void * __percpu *hyperv_pcpu_output_arg; > > +extern u64 hv_build_logical_dev_id(struct pci_dev *pdev); > + > u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr); > u64 hv_do_fast_hypercall8(u16 control, u64 input8); > u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2); > -- > 2.49.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2026-01-08 18:46 ` Michael Kelley @ 2026-01-09 18:40 ` Easwar Hariharan 2026-01-11 17:36 ` Michael Kelley 0 siblings, 1 reply; 28+ messages in thread From: Easwar Hariharan @ 2026-01-09 18:40 UTC (permalink / raw) To: Michael Kelley Cc: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, easwar.hariharan, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org On 1/8/2026 10:46 AM, Michael Kelley wrote: > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM >> >> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> >> >> Hyper-V uses a logical device ID to identify a PCI endpoint device for >> child partitions. This ID will also be required for future hypercalls >> used by the Hyper-V IOMMU driver. >> >> Refactor the logic for building this logical device ID into a standalone >> helper function and export the interface for wider use. >> >> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> >> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> >> --- >> drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- >> include/asm-generic/mshyperv.h | 2 ++ >> 2 files changed, 22 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c >> index 146b43981b27..4b82e06b5d93 100644 >> --- a/drivers/pci/controller/pci-hyperv.c >> +++ b/drivers/pci/controller/pci-hyperv.c >> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) >> >> #define hv_msi_prepare pci_msi_prepare >> >> +/** >> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the >> + * function number of the device. >> + */ >> +u64 hv_build_logical_dev_id(struct pci_dev *pdev) >> +{ >> + struct pci_bus *pbus = pdev->bus; >> + struct hv_pcibus_device *hbus = container_of(pbus->sysdata, >> + struct hv_pcibus_device, sysdata); >> + >> + return (u64)((hbus->hdev->dev_instance.b[5] << 24) | >> + (hbus->hdev->dev_instance.b[4] << 16) | >> + (hbus->hdev->dev_instance.b[7] << 8) | >> + (hbus->hdev->dev_instance.b[6] & 0xf8) | >> + PCI_FUNC(pdev->devfn)); >> +} >> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id); > > This change is fine for hv_irq_retarget_interrupt(), it doesn't help for the > new IOMMU driver because pci-hyperv.c can (and often is) built as a module. > The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't > use this symbol in that case -- you'll get a link error on vmlinux when building > the kernel. Requiring pci-hyperv.c to *not* be built as a module would also > require that the VMBus driver not be built as a module, so I don't think that's > the right solution. > > This is a messy problem. The new IOMMU driver needs to start with a generic > "struct device" for the PCI device, and somehow find the corresponding VMBus > PCI pass-thru device from which it can get the VMBus instance ID. 
I'm thinking > about ways to do this that don't depend on code and data structures that are > private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion. Thank you, Michael. FWIW, I did try to pull out the device ID components out of pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h but it was just too messy as you say. > I was wondering if this "logical device id" is actually parsed by the hypervisor, > or whether it is just a unique ID that is opaque to the hypervisor. From the > usage in the hypercalls in pci-hyperv.c and this new IOMMU driver, it appears > to be the former. Evidently the hypervisor is taking this logical device ID and > and matching against bytes 4 thru 7 of the instance GUIDs of PCI pass-thru > devices offered to the guest, so as to identify a particular PCI pass-thru device. > If that's the case, then Linux doesn't have the option of choosing some other > unique ID that is easier to generate and access. Yes, the device ID is actually used by the hypervisor to find the corresponding PCI pass-thru device and the physical IOMMUs the device is behind and execute the requested operation for those IOMMUs. > There's a uniqueness issue with this kind of logical device ID that has been > around for years, but I had never thought about before. In hv_pci_probe() > instance GUID bytes 4 and 5 are used to generate the PCI domain number for > the "fake" PCI bus that the PCI pass-thru device resides on. The issue is the > lack of guaranteed uniqueness of bytes 4 and 5, so there's code to deal with > a collision. (The full GUID is unique, but not necessarily some subset of the > GUID.) It seems like the same kind of uniqueness issue could occur here. Does > the Hyper-V host provide any guarantees about the uniqueness of bytes 4 thru > 7 as a unit, and if not, what happens if there is a collision? Again, this > uniqueness issue has existed for years, so it's not new to this patch set, but > with new uses of the logical device ID, it seems relevant to consider. Thank you for bringing that up, I was aware of the uniqueness workaround but, like you, I had not considered that the workaround could prevent matching the device ID with the record the hypervisor has of the PCI pass-thru device assigned to us. I will work with the hypervisor folks to resolve this before this patch series is posted for merge. Thanks, Easwar (he/him) ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() 2026-01-09 18:40 ` Easwar Hariharan @ 2026-01-11 17:36 ` Michael Kelley 0 siblings, 0 replies; 28+ messages in thread From: Michael Kelley @ 2026-01-11 17:36 UTC (permalink / raw) To: Easwar Hariharan Cc: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:41 AM > > On 1/8/2026 10:46 AM, Michael Kelley wrote: > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > >> > >> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > >> > >> Hyper-V uses a logical device ID to identify a PCI endpoint device for > >> child partitions. This ID will also be required for future hypercalls > >> used by the Hyper-V IOMMU driver. > >> > >> Refactor the logic for building this logical device ID into a standalone > >> helper function and export the interface for wider use. > >> > >> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > >> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > >> --- > >> drivers/pci/controller/pci-hyperv.c | 28 ++++++++++++++++++++-------- > >> include/asm-generic/mshyperv.h | 2 ++ > >> 2 files changed, 22 insertions(+), 8 deletions(-) > >> > >> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > >> index 146b43981b27..4b82e06b5d93 100644 > >> --- a/drivers/pci/controller/pci-hyperv.c > >> +++ b/drivers/pci/controller/pci-hyperv.c > >> @@ -598,15 +598,31 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data) > >> > >> #define hv_msi_prepare pci_msi_prepare > >> > >> +/** > >> + * Build a "Device Logical ID" out of this PCI bus's instance GUID and the > >> + * function number of the device. > >> + */ > >> +u64 hv_build_logical_dev_id(struct pci_dev *pdev) > >> +{ > >> + struct pci_bus *pbus = pdev->bus; > >> + struct hv_pcibus_device *hbus = container_of(pbus->sysdata, > >> + struct hv_pcibus_device, sysdata); > >> + > >> + return (u64)((hbus->hdev->dev_instance.b[5] << 24) | > >> + (hbus->hdev->dev_instance.b[4] << 16) | > >> + (hbus->hdev->dev_instance.b[7] << 8) | > >> + (hbus->hdev->dev_instance.b[6] & 0xf8) | > >> + PCI_FUNC(pdev->devfn)); > >> +} > >> +EXPORT_SYMBOL_GPL(hv_build_logical_dev_id); > > > > This change is fine for hv_irq_retarget_interrupt(), it doesn't help for the > > new IOMMU driver because pci-hyperv.c can (and often is) built as a module. > > The new Hyper-V IOMMU driver in this patch series is built-in, and so it can't > > use this symbol in that case -- you'll get a link error on vmlinux when building > > the kernel. Requiring pci-hyperv.c to *not* be built as a module would also > > require that the VMBus driver not be built as a module, so I don't think that's > > the right solution. > > > > This is a messy problem. 
The new IOMMU driver needs to start with a generic > > "struct device" for the PCI device, and somehow find the corresponding VMBus > > PCI pass-thru device from which it can get the VMBus instance ID. I'm thinking > > about ways to do this that don't depend on code and data structures that are > > private to the pci-hyperv.c driver, and will follow-up if I have a good suggestion. > > Thank you, Michael. FWIW, I did try to pull out the device ID components out of > pci-hyperv into include/linux/hyperv.h and/or a new include/linux/pci-hyperv.h > but it was just too messy as you say. Yes, the current approach for getting the device ID wanders through struct hv_pcibus_device (which is private to the pci-hyperv driver), and through struct hv_device (which is a VMBus data structure). That makes the linkage between the PV IOMMU driver and the pci-hyperv and VMBus drivers rather substantial, which is not good. But here's an idea for an alternate approach. The PV IOMMU driver doesn't have to generate the logical device ID on-the-fly by going to the dev_instance field of struct hv_device. Instead, the pci-hyperv driver can generate the logical device ID in hv_pci_probe(), and put it somewhere that's easy for the IOMMU driver to access. The logical device ID doesn't change while Linux is running, so stashing another copy somewhere isn't a problem. So have the Hyper-V PV IOMMU driver provide an EXPORTed function to accept a PCI domain ID and the related logical device ID. The PV IOMMU driver is responsible for storing this data in a form that it can later search. hv_pci_probe() calls this new function when it instantiates a new PCI pass-thru device. Then when the IOMMU driver needs to attach a new device, it can get the PCI domain ID from the struct pci_dev (or struct pci_bus), search for the related logical device ID in its own data structure, and use it. The pci-hyperv driver has a dependency on the IOMMU driver, but that's a dependency in the desired direction. The PCI domain ID and logical device ID are just integers, so no data structures are shared. Note that the pci-hyperv must inform the PV IOMMU driver of the logical device ID *before* create_root_hv_pci_bus() calls pci_scan_root_bus_bridge(). The latter function eventually invokes hv_iommu_attach_dev(), which will need the logical device ID. See example stack trace. [1] I don't think the pci-hyperv driver even needs to tell the IOMMU driver to remove the information if a PCI pass-thru device is unbound or removed, as the logical device ID will be the same if the device ever comes back. At worst, the IOMMU driver can simply replace an existing logical device ID if a new one is provided for the same PCI domain ID. An include file must provide a stub for the new function if CONFIG_HYPERV_PVIOMMU is not defined, so that the pci-hyperv driver still builds and works. I haven't coded this up, but it seems like it should be pretty clean. Michael [1] Example stack trace, starting with vmbus_add_channel_work() as a result of Hyper-V offering the PCI pass-thru device to the guest. hv_pci_probe() runs, and ends up in the generic Linux code for adding a PCI device, which in turn sets up the IOMMU. 
[ 1.731786] hv_iommu_attach_dev+0xf0/0x1d0 [ 1.731788] __iommu_attach_device+0x21/0xb0 [ 1.731790] __iommu_device_set_domain+0x65/0xd0 [ 1.731792] __iommu_group_set_domain_internal+0x61/0x120 [ 1.731795] iommu_setup_default_domain+0x3a4/0x530 [ 1.731796] __iommu_probe_device.part.0+0x15d/0x1d0 [ 1.731798] iommu_probe_device+0x81/0xb0 [ 1.731799] iommu_bus_notifier+0x2c/0x80 [ 1.731800] notifier_call_chain+0x66/0xe0 [ 1.731802] blocking_notifier_call_chain+0x47/0x70 [ 1.731804] bus_notify+0x3b/0x50 [ 1.731805] device_add+0x631/0x850 [ 1.731807] pci_device_add+0x2db/0x670 [ 1.731809] pci_scan_single_device+0xc3/0x100 [ 1.731810] pci_scan_slot+0x97/0x230 [ 1.731812] pci_scan_child_bus_extend+0x3b/0x2f0 [ 1.731814] pci_scan_root_bus_bridge+0xc0/0xf0 [ 1.731816] hv_pci_probe+0x398/0x5f0 [ 1.731817] vmbus_probe+0x42/0xa0 [ 1.731819] really_probe+0xe5/0x3e0 [ 1.731822] __driver_probe_device+0x7e/0x170 [ 1.731823] driver_probe_device+0x23/0xa0 [ 1.731824] __device_attach_driver+0x92/0x130 [ 1.731826] bus_for_each_drv+0x8c/0xe0 [ 1.731828] __device_attach+0xc0/0x200 [ 1.731830] device_initial_probe+0x4c/0x50 [ 1.731831] bus_probe_device+0x32/0x90 [ 1.731832] device_add+0x65b/0x850 [ 1.731836] device_register+0x1f/0x30 [ 1.731837] vmbus_device_register+0x87/0x130 [ 1.731840] vmbus_add_channel_work+0x139/0x1a0 [ 1.731841] process_one_work+0x19f/0x3f0 [ 1.731843] worker_thread+0x188/0x2f0 [ 1.731845] kthread+0x119/0x230 [ 1.731852] ret_from_fork+0x1b4/0x1e0 [ 1.731854] ret_from_fork_asm+0x1a/0x30 > > > I was wondering if this "logical device id" is actually parsed by the hypervisor, > > or whether it is just a unique ID that is opaque to the hypervisor. From the > > usage in the hypercalls in pci-hyperv.c and this new IOMMU driver, it appears > > to be the former. Evidently the hypervisor is taking this logical device ID and > > and matching against bytes 4 thru 7 of the instance GUIDs of PCI pass-thru > > devices offered to the guest, so as to identify a particular PCI pass-thru device. > > If that's the case, then Linux doesn't have the option of choosing some other > > unique ID that is easier to generate and access. > > Yes, the device ID is actually used by the hypervisor to find the corresponding PCI > pass-thru device and the physical IOMMUs the device is behind and execute the > requested operation for those IOMMUs. > > > There's a uniqueness issue with this kind of logical device ID that has been > > around for years, but I had never thought about before. In hv_pci_probe() > > instance GUID bytes 4 and 5 are used to generate the PCI domain number for > > the "fake" PCI bus that the PCI pass-thru device resides on. The issue is the > > lack of guaranteed uniqueness of bytes 4 and 5, so there's code to deal with > > a collision. (The full GUID is unique, but not necessarily some subset of the > > GUID.) It seems like the same kind of uniqueness issue could occur here. Does > > the Hyper-V host provide any guarantees about the uniqueness of bytes 4 thru > > 7 as a unit, and if not, what happens if there is a collision? Again, this > > uniqueness issue has existed for years, so it's not new to this patch set, but > > with new uses of the logical device ID, it seems relevant to consider. > > Thank you for bringing that up, I was aware of the uniqueness workaround but, like you, > I had not considered that the workaround could prevent matching the device ID with the > record the hypervisor has of the PCI pass-thru device assigned to us. 
I will work with > the hypervisor folks to resolve this before this patch series is posted for merge. > > Thanks, > Easwar (he/him) ^ permalink raw reply [flat|nested] 28+ messages in thread
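
A minimal sketch of the registration interface proposed in the message above
follows. The names (hv_iommu_register_logical_dev_id() and the lookup helper),
the list-based storage, and the way PCI_FUNC() is folded back into the
bus-level ID are all hypothetical; this only illustrates the direction
described above, not an agreed implementation:

  /* On the PV IOMMU side: PCI domain number -> logical device ID registry. */
  struct hv_iommu_dev_id_entry {
          struct list_head entry;
          int domain_nr;
          u64 logical_dev_id;     /* bus-level ID, low (function) bits zero */
  };

  static LIST_HEAD(hv_iommu_dev_ids);
  static DEFINE_SPINLOCK(hv_iommu_dev_ids_lock);

  /* Called by pci-hyperv from hv_pci_probe(), before the root bus scan. */
  void hv_iommu_register_logical_dev_id(int domain_nr, u64 logical_dev_id)
  {
          struct hv_iommu_dev_id_entry *e;

          e = kzalloc(sizeof(*e), GFP_KERNEL);
          if (!e)
                  return;

          e->domain_nr = domain_nr;
          e->logical_dev_id = logical_dev_id;

          spin_lock(&hv_iommu_dev_ids_lock);
          list_add_tail(&e->entry, &hv_iommu_dev_ids);
          spin_unlock(&hv_iommu_dev_ids_lock);
  }
  EXPORT_SYMBOL_GPL(hv_iommu_register_logical_dev_id);

  /* Used by the PV IOMMU driver when attaching a PCI device. */
  static int hv_iommu_lookup_logical_dev_id(struct pci_dev *pdev, u64 *id)
  {
          struct hv_iommu_dev_id_entry *e;
          int domain_nr = pci_domain_nr(pdev->bus);
          int ret = -ENODEV;

          spin_lock(&hv_iommu_dev_ids_lock);
          list_for_each_entry(e, &hv_iommu_dev_ids, entry) {
                  if (e->domain_nr == domain_nr) {
                          /*
                           * Low 3 bits carry the function number, as in
                           * hv_build_logical_dev_id().
                           */
                          *id = e->logical_dev_id | PCI_FUNC(pdev->devfn);
                          ret = 0;
                          break;
                  }
          }
          spin_unlock(&hv_iommu_dev_ids_lock);

          return ret;
  }

  /* In a header: stub so pci-hyperv builds when the PV IOMMU is not enabled. */
  #ifndef CONFIG_HYPERV_PVIOMMU
  static inline void hv_iommu_register_logical_dev_id(int domain_nr, u64 id) { }
  #endif
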
* [RFC v1 2/5] iommu: Move Hyper-V IOMMU driver to its own subdirectory 2025-12-09 5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang 2025-12-09 5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang @ 2025-12-09 5:11 ` Yu Zhang 2025-12-09 5:11 ` [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang ` (3 subsequent siblings) 5 siblings, 0 replies; 28+ messages in thread From: Yu Zhang @ 2025-12-09 5:11 UTC (permalink / raw) To: linux-kernel, linux-hyperv, iommu, linux-pci Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch From: Easwar Hariharan <eahariha@linux.microsoft.com> The Hyper-V IOMMU driver currently only supports IRQ remapping. As it will be adding DMA remapping support, prepare a directory to contain all the different feature files. This is a simple rename commit and has no functional changes. Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> --- drivers/iommu/Kconfig | 10 +--------- drivers/iommu/Makefile | 2 +- drivers/iommu/hyperv/Kconfig | 10 ++++++++++ drivers/iommu/hyperv/Makefile | 2 ++ .../iommu/{hyperv-iommu.c => hyperv/irq_remapping.c} | 2 +- 5 files changed, 15 insertions(+), 11 deletions(-) create mode 100644 drivers/iommu/hyperv/Kconfig create mode 100644 drivers/iommu/hyperv/Makefile rename drivers/iommu/{hyperv-iommu.c => hyperv/irq_remapping.c} (99%) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index c9ae3221cd6f..661ff4e764cc 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -194,6 +194,7 @@ config MSM_IOMMU source "drivers/iommu/amd/Kconfig" source "drivers/iommu/arm/Kconfig" source "drivers/iommu/intel/Kconfig" +source "drivers/iommu/hyperv/Kconfig" source "drivers/iommu/iommufd/Kconfig" source "drivers/iommu/riscv/Kconfig" @@ -350,15 +351,6 @@ config MTK_IOMMU_V1 if unsure, say N here. -config HYPERV_IOMMU - bool "Hyper-V IRQ Handling" - depends on HYPERV && X86 - select IOMMU_API - default HYPERV - help - Stub IOMMU driver to handle IRQs to support Hyper-V Linux - guest and root partitions. 
- config VIRTIO_IOMMU tristate "Virtio IOMMU driver" depends on VIRTIO diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index b17ef9818759..757dc377cb66 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -4,6 +4,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd/ obj-$(CONFIG_INTEL_IOMMU) += intel/ obj-$(CONFIG_RISCV_IOMMU) += riscv/ obj-$(CONFIG_GENERIC_PT) += generic_pt/fmt/ +obj-$(CONFIG_HYPERV_IOMMU) += hyperv/ obj-$(CONFIG_IOMMU_API) += iommu.o obj-$(CONFIG_IOMMU_SUPPORT) += iommu-pages.o obj-$(CONFIG_IOMMU_API) += iommu-traces.o @@ -29,7 +30,6 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o obj-$(CONFIG_S390_IOMMU) += s390-iommu.o -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig new file mode 100644 index 000000000000..30f40d867036 --- /dev/null +++ b/drivers/iommu/hyperv/Kconfig @@ -0,0 +1,10 @@ +# SPDX-License-Identifier: GPL-2.0-only +# HyperV paravirtualized IOMMU support +config HYPERV_IOMMU + bool "Hyper-V IRQ Handling" + depends on HYPERV && X86 + select IOMMU_API + default HYPERV + help + Stub IOMMU driver to handle IRQs to support Hyper-V Linux + guest and root partitions. diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile new file mode 100644 index 000000000000..9f557bad94ff --- /dev/null +++ b/drivers/iommu/hyperv/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv/irq_remapping.c similarity index 99% rename from drivers/iommu/hyperv-iommu.c rename to drivers/iommu/hyperv/irq_remapping.c index 0961ac805944..f2c4c7d67302 100644 --- a/drivers/iommu/hyperv-iommu.c +++ b/drivers/iommu/hyperv/irq_remapping.c @@ -22,7 +22,7 @@ #include <asm/hypervisor.h> #include <asm/mshyperv.h> -#include "irq_remapping.h" +#include "../irq_remapping.h" #ifdef CONFIG_IRQ_REMAP -- 2.49.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU 2025-12-09 5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang 2025-12-09 5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang 2025-12-09 5:11 ` [RFC v1 2/5] iommu: Move Hyper-V IOMMU driver to its own subdirectory Yu Zhang @ 2025-12-09 5:11 ` Yu Zhang 2026-01-08 18:47 ` Michael Kelley 2025-12-09 5:11 ` [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions Yu Zhang ` (2 subsequent siblings) 5 siblings, 1 reply; 28+ messages in thread From: Yu Zhang @ 2025-12-09 5:11 UTC (permalink / raw) To: linux-kernel, linux-hyperv, iommu, linux-pci Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch From: Wei Liu <wei.liu@kernel.org> Hyper-V guest IOMMU is a para-virtualized IOMMU based on hypercalls. Introduce the hypercalls used by the child partition to interact with this facility. These hypercalls fall into below categories: - Detection and capability: HVCALL_GET_IOMMU_CAPABILITIES is used to detect the existence and capabilities of the guest IOMMU. - Device management: HVCALL_GET_LOGICAL_DEVICE_PROPERTY is used to check whether an endpoint device is managed by the guest IOMMU. - Domain management: A set of hypercalls is provided to handle the creation, configuration, and deletion of guest domains, as well as the attachment/detachment of endpoint devices to/from those domains. - IOTLB flushing: HVCALL_FLUSH_DEVICE_DOMAIN is used to ask Hyper-V for a domain-selective IOTLB flush(which in its handler may flush the device TLB as well). Page-selective IOTLB flushes will be offered by new hypercalls in future patches. 
Signed-off-by: Wei Liu <wei.liu@kernel.org> Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> --- include/hyperv/hvgdk_mini.h | 8 +++ include/hyperv/hvhdk_mini.h | 123 ++++++++++++++++++++++++++++++++++++ 2 files changed, 131 insertions(+) diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h index 77abddfc750e..e5b302bbfe14 100644 --- a/include/hyperv/hvgdk_mini.h +++ b/include/hyperv/hvgdk_mini.h @@ -478,10 +478,16 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */ #define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0 +#define HVCALL_CREATE_DEVICE_DOMAIN 0x00b1 +#define HVCALL_ATTACH_DEVICE_DOMAIN 0x00b2 #define HVCALL_SIGNAL_EVENT_DIRECT 0x00c0 #define HVCALL_POST_MESSAGE_DIRECT 0x00c1 #define HVCALL_DISPATCH_VP 0x00c2 +#define HVCALL_DETACH_DEVICE_DOMAIN 0x00c4 +#define HVCALL_DELETE_DEVICE_DOMAIN 0x00c5 #define HVCALL_GET_GPA_PAGES_ACCESS_STATES 0x00c9 +#define HVCALL_CONFIGURE_DEVICE_DOMAIN 0x00ce +#define HVCALL_FLUSH_DEVICE_DOMAIN 0x00d0 #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d7 #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d8 #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY 0x00db @@ -492,6 +498,8 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */ #define HVCALL_GET_VP_CPUID_VALUES 0x00f4 #define HVCALL_MMIO_READ 0x0106 #define HVCALL_MMIO_WRITE 0x0107 +#define HVCALL_GET_IOMMU_CAPABILITIES 0x0125 +#define HVCALL_GET_LOGICAL_DEVICE_PROPERTY 0x0127 /* HV_HYPERCALL_INPUT */ #define HV_HYPERCALL_RESULT_MASK GENMASK_ULL(15, 0) diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h index 858f6a3925b3..ba6b91746b13 100644 --- a/include/hyperv/hvhdk_mini.h +++ b/include/hyperv/hvhdk_mini.h @@ -400,4 +400,127 @@ union hv_device_id { /* HV_DEVICE_ID */ } acpi; } __packed; +/* Device domain types */ +#define HV_DEVICE_DOMAIN_TYPE_S1 1 /* Stage 1 domain */ + +/* ID for default domain and NULL domain */ +#define HV_DEVICE_DOMAIN_ID_DEFAULT 0 +#define HV_DEVICE_DOMAIN_ID_NULL 0xFFFFFFFFULL + +union hv_device_domain_id { + u64 as_uint64; + struct { + u32 type: 4; + u32 reserved: 28; + u32 id; + } __packed; +}; + +struct hv_input_device_domain { + u64 partition_id; + union hv_input_vtl owner_vtl; + u8 padding[7]; + union hv_device_domain_id domain_id; +} __packed; + +union hv_create_device_domain_flags { + u32 as_uint32; + struct { + u32 forward_progress_required: 1; + u32 inherit_owning_vtl: 1; + u32 reserved: 30; + } __packed; +}; + +struct hv_input_create_device_domain { + struct hv_input_device_domain device_domain; + union hv_create_device_domain_flags create_device_domain_flags; +} __packed; + +struct hv_input_delete_device_domain { + struct hv_input_device_domain device_domain; +} __packed; + +struct hv_input_attach_device_domain { + struct hv_input_device_domain device_domain; + union hv_device_id device_id; +} __packed; + +struct hv_input_detach_device_domain { + u64 partition_id; + union hv_device_id device_id; +} __packed; + +struct hv_device_domain_settings { + struct { + /* + * Enable translations. If not enabled, all transaction bypass + * S1 translations. 
+ */ + u64 translation_enabled: 1; + u64 blocked: 1; + /* + * First stage address translation paging mode: + * 0: 4-level paging (default) + * 1: 5-level paging + */ + u64 first_stage_paging_mode: 1; + u64 reserved: 61; + } flags; + + /* Address of translation table */ + u64 page_table_root; +} __packed; + +struct hv_input_configure_device_domain { + struct hv_input_device_domain device_domain; + struct hv_device_domain_settings settings; +} __packed; + +struct hv_input_get_iommu_capabilities { + u64 partition_id; + u64 reserved; +} __packed; + +struct hv_output_get_iommu_capabilities { + u32 size; + u16 reserved; + u8 max_iova_width; + u8 max_pasid_width; + +#define HV_IOMMU_CAP_PRESENT (1ULL << 0) +#define HV_IOMMU_CAP_S2 (1ULL << 1) +#define HV_IOMMU_CAP_S1 (1ULL << 2) +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3) +#define HV_IOMMU_CAP_PASID (1ULL << 4) +#define HV_IOMMU_CAP_ATS (1ULL << 5) +#define HV_IOMMU_CAP_PRI (1ULL << 6) + + u64 iommu_cap; + u64 pgsize_bitmap; +} __packed; + +enum hv_logical_device_property_code { + HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10, +}; + +struct hv_input_get_logical_device_property { + u64 partition_id; + u64 logical_device_id; + enum hv_logical_device_property_code code; + u32 reserved; +} __packed; + +struct hv_output_get_logical_device_property { +#define HV_DEVICE_IOMMU_ENABLED (1ULL << 0) + u64 device_iommu; + u64 reserved; +} __packed; + +struct hv_input_flush_device_domain { + struct hv_input_device_domain device_domain; + u32 flags; + u32 reserved; +} __packed; + #endif /* _HV_HVHDK_MINI_H */ -- 2.49.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
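
For illustration, a minimal sketch of how these structures might be used to
create a stage-1 device domain and attach one endpoint device. The helper
name, the error handling, and the use of HV_PARTITION_ID_SELF and of the
device_id.as_uint64 accessor are assumptions on top of this patch, not part
of it:

  static u64 hv_iommu_create_and_attach(u32 domain_id, u64 logical_dev_id)
  {
          struct hv_input_create_device_domain *create;
          struct hv_input_attach_device_domain *attach;
          unsigned long flags;
          u64 status;

          /* Create an S1 device domain for this partition. */
          local_irq_save(flags);
          create = *this_cpu_ptr(hyperv_pcpu_input_arg);
          memset(create, 0, sizeof(*create));
          create->device_domain.partition_id = HV_PARTITION_ID_SELF;
          create->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S1;
          create->device_domain.domain_id.id = domain_id;
          status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, create, NULL);
          local_irq_restore(flags);
          if (!hv_result_success(status))
                  return status;

          /* Attach the endpoint identified by its logical device ID. */
          local_irq_save(flags);
          attach = *this_cpu_ptr(hyperv_pcpu_input_arg);
          memset(attach, 0, sizeof(*attach));
          attach->device_domain.partition_id = HV_PARTITION_ID_SELF;
          attach->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S1;
          attach->device_domain.domain_id.id = domain_id;
          attach->device_id.as_uint64 = logical_dev_id;
          status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, attach, NULL);
          local_irq_restore(flags);

          return status;
  }
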
* RE: [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU 2025-12-09 5:11 ` [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang @ 2026-01-08 18:47 ` Michael Kelley 2026-01-09 18:47 ` Easwar Hariharan 0 siblings, 1 reply; 28+ messages in thread From: Michael Kelley @ 2026-01-08 18:47 UTC (permalink / raw) To: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > From: Wei Liu <wei.liu@kernel.org> > > Hyper-V guest IOMMU is a para-virtualized IOMMU based on hypercalls. > Introduce the hypercalls used by the child partition to interact with > this facility. > > These hypercalls fall into below categories: > - Detection and capability: HVCALL_GET_IOMMU_CAPABILITIES is used to > detect the existence and capabilities of the guest IOMMU. > > - Device management: HVCALL_GET_LOGICAL_DEVICE_PROPERTY is used to > check whether an endpoint device is managed by the guest IOMMU. > > - Domain management: A set of hypercalls is provided to handle the > creation, configuration, and deletion of guest domains, as well as > the attachment/detachment of endpoint devices to/from those domains. > > - IOTLB flushing: HVCALL_FLUSH_DEVICE_DOMAIN is used to ask Hyper-V > for a domain-selective IOTLB flush(which in its handler may flush Typo: Add a space after "IOTLB flush" and before the open parenthesis. > the device TLB as well). Page-selective IOTLB flushes will be offered > by new hypercalls in future patches. 
> > Signed-off-by: Wei Liu <wei.liu@kernel.org> > Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com> > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> > Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > --- > include/hyperv/hvgdk_mini.h | 8 +++ > include/hyperv/hvhdk_mini.h | 123 ++++++++++++++++++++++++++++++++++++ > 2 files changed, 131 insertions(+) > > diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h > index 77abddfc750e..e5b302bbfe14 100644 > --- a/include/hyperv/hvgdk_mini.h > +++ b/include/hyperv/hvgdk_mini.h > @@ -478,10 +478,16 @@ union hv_vp_assist_msr_contents { /* > HV_REGISTER_VP_ASSIST_PAGE */ > #define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a > #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af > #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0 > +#define HVCALL_CREATE_DEVICE_DOMAIN 0x00b1 > +#define HVCALL_ATTACH_DEVICE_DOMAIN 0x00b2 > #define HVCALL_SIGNAL_EVENT_DIRECT 0x00c0 > #define HVCALL_POST_MESSAGE_DIRECT 0x00c1 > #define HVCALL_DISPATCH_VP 0x00c2 > +#define HVCALL_DETACH_DEVICE_DOMAIN 0x00c4 > +#define HVCALL_DELETE_DEVICE_DOMAIN 0x00c5 > #define HVCALL_GET_GPA_PAGES_ACCESS_STATES 0x00c9 > +#define HVCALL_CONFIGURE_DEVICE_DOMAIN 0x00ce > +#define HVCALL_FLUSH_DEVICE_DOMAIN 0x00d0 > #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d7 > #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d8 > #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY 0x00db > @@ -492,6 +498,8 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */ > #define HVCALL_GET_VP_CPUID_VALUES 0x00f4 > #define HVCALL_MMIO_READ 0x0106 > #define HVCALL_MMIO_WRITE 0x0107 > +#define HVCALL_GET_IOMMU_CAPABILITIES 0x0125 > +#define HVCALL_GET_LOGICAL_DEVICE_PROPERTY 0x0127 > > /* HV_HYPERCALL_INPUT */ > #define HV_HYPERCALL_RESULT_MASK GENMASK_ULL(15, 0) > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h > index 858f6a3925b3..ba6b91746b13 100644 > --- a/include/hyperv/hvhdk_mini.h > +++ b/include/hyperv/hvhdk_mini.h > @@ -400,4 +400,127 @@ union hv_device_id { /* HV_DEVICE_ID */ > } acpi; > } __packed; > > +/* Device domain types */ > +#define HV_DEVICE_DOMAIN_TYPE_S1 1 /* Stage 1 domain */ > + > +/* ID for default domain and NULL domain */ > +#define HV_DEVICE_DOMAIN_ID_DEFAULT 0 > +#define HV_DEVICE_DOMAIN_ID_NULL 0xFFFFFFFFULL > + > +union hv_device_domain_id { > + u64 as_uint64; > + struct { > + u32 type: 4; > + u32 reserved: 28; > + u32 id; > + } __packed; > +}; > + > +struct hv_input_device_domain { > + u64 partition_id; > + union hv_input_vtl owner_vtl; > + u8 padding[7]; > + union hv_device_domain_id domain_id; > +} __packed; > + > +union hv_create_device_domain_flags { > + u32 as_uint32; > + struct { > + u32 forward_progress_required: 1; > + u32 inherit_owning_vtl: 1; > + u32 reserved: 30; > + } __packed; > +}; > + > +struct hv_input_create_device_domain { > + struct hv_input_device_domain device_domain; > + union hv_create_device_domain_flags create_device_domain_flags; > +} __packed; > + > +struct hv_input_delete_device_domain { > + struct hv_input_device_domain device_domain; > +} __packed; > + > +struct hv_input_attach_device_domain { > + struct hv_input_device_domain device_domain; > + union hv_device_id device_id; > +} __packed; > + > +struct hv_input_detach_device_domain { > + u64 
partition_id; > + union hv_device_id device_id; > +} __packed; > + > +struct hv_device_domain_settings { > + struct { > + /* > + * Enable translations. If not enabled, all transaction bypass > + * S1 translations. > + */ > + u64 translation_enabled: 1; > + u64 blocked: 1; > + /* > + * First stage address translation paging mode: > + * 0: 4-level paging (default) > + * 1: 5-level paging > + */ > + u64 first_stage_paging_mode: 1; > + u64 reserved: 61; > + } flags; > + > + /* Address of translation table */ > + u64 page_table_root; > +} __packed; > + > +struct hv_input_configure_device_domain { > + struct hv_input_device_domain device_domain; > + struct hv_device_domain_settings settings; > +} __packed; > + > +struct hv_input_get_iommu_capabilities { > + u64 partition_id; > + u64 reserved; > +} __packed; > + > +struct hv_output_get_iommu_capabilities { > + u32 size; > + u16 reserved; > + u8 max_iova_width; > + u8 max_pasid_width; > + > +#define HV_IOMMU_CAP_PRESENT (1ULL << 0) > +#define HV_IOMMU_CAP_S2 (1ULL << 1) > +#define HV_IOMMU_CAP_S1 (1ULL << 2) > +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3) > +#define HV_IOMMU_CAP_PASID (1ULL << 4) > +#define HV_IOMMU_CAP_ATS (1ULL << 5) > +#define HV_IOMMU_CAP_PRI (1ULL << 6) > + > + u64 iommu_cap; > + u64 pgsize_bitmap; > +} __packed; > + > +enum hv_logical_device_property_code { > + HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10, > +}; > + > +struct hv_input_get_logical_device_property { > + u64 partition_id; > + u64 logical_device_id; > + enum hv_logical_device_property_code code; Historically we've avoided "enum" types in structures that are part of the hypervisor ABI. Use u32 here? Michael > + u32 reserved; > +} __packed; > + > +struct hv_output_get_logical_device_property { > +#define HV_DEVICE_IOMMU_ENABLED (1ULL << 0) > + u64 device_iommu; > + u64 reserved; > +} __packed; > + > +struct hv_input_flush_device_domain { > + struct hv_input_device_domain device_domain; > + u32 flags; > + u32 reserved; > +} __packed; > + > #endif /* _HV_HVHDK_MINI_H */ > -- > 2.49.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU 2026-01-08 18:47 ` Michael Kelley @ 2026-01-09 18:47 ` Easwar Hariharan 2026-01-09 19:24 ` Michael Kelley 0 siblings, 1 reply; 28+ messages in thread From: Easwar Hariharan @ 2026-01-09 18:47 UTC (permalink / raw) To: Michael Kelley Cc: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, easwar.hariharan, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org On 1/8/2026 10:47 AM, Michael Kelley wrote: > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM >> >> From: Wei Liu <wei.liu@kernel.org> >> <snip> >> +struct hv_input_get_iommu_capabilities { >> + u64 partition_id; >> + u64 reserved; >> +} __packed; >> + >> +struct hv_output_get_iommu_capabilities { >> + u32 size; >> + u16 reserved; >> + u8 max_iova_width; >> + u8 max_pasid_width; >> + >> +#define HV_IOMMU_CAP_PRESENT (1ULL << 0) >> +#define HV_IOMMU_CAP_S2 (1ULL << 1) >> +#define HV_IOMMU_CAP_S1 (1ULL << 2) >> +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3) >> +#define HV_IOMMU_CAP_PASID (1ULL << 4) >> +#define HV_IOMMU_CAP_ATS (1ULL << 5) >> +#define HV_IOMMU_CAP_PRI (1ULL << 6) >> + >> + u64 iommu_cap; >> + u64 pgsize_bitmap; >> +} __packed; >> + >> +enum hv_logical_device_property_code { >> + HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10, >> +}; >> + >> +struct hv_input_get_logical_device_property { >> + u64 partition_id; >> + u64 logical_device_id; >> + enum hv_logical_device_property_code code; > > Historically we've avoided "enum" types in structures that are part of > the hypervisor ABI. Use u32 here? <snip> What has been the reasoning for that practice? Since the introduction of the include/hyperv/ headers, we have generally wanted to import as directly as possible the relevant definitions from the hypervisor code base. If there's a strong reason, we could consider switching the enum for a u32 here since, at least for the moment, there's only a single value being used. Thanks, Easwar (he/him) ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU 2026-01-09 18:47 ` Easwar Hariharan @ 2026-01-09 19:24 ` Michael Kelley 0 siblings, 0 replies; 28+ messages in thread From: Michael Kelley @ 2026-01-09 19:24 UTC (permalink / raw) To: Easwar Hariharan Cc: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Sent: Friday, January 9, 2026 10:47 AM > > On 1/8/2026 10:47 AM, Michael Kelley wrote: > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > >> > >> From: Wei Liu <wei.liu@kernel.org> > >> > > <snip> > > >> +struct hv_input_get_iommu_capabilities { > >> + u64 partition_id; > >> + u64 reserved; > >> +} __packed; > >> + > >> +struct hv_output_get_iommu_capabilities { > >> + u32 size; > >> + u16 reserved; > >> + u8 max_iova_width; > >> + u8 max_pasid_width; > >> + > >> +#define HV_IOMMU_CAP_PRESENT (1ULL << 0) > >> +#define HV_IOMMU_CAP_S2 (1ULL << 1) > >> +#define HV_IOMMU_CAP_S1 (1ULL << 2) > >> +#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3) > >> +#define HV_IOMMU_CAP_PASID (1ULL << 4) > >> +#define HV_IOMMU_CAP_ATS (1ULL << 5) > >> +#define HV_IOMMU_CAP_PRI (1ULL << 6) > >> + > >> + u64 iommu_cap; > >> + u64 pgsize_bitmap; > >> +} __packed; > >> + > >> +enum hv_logical_device_property_code { > >> + HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10, > >> +}; > >> + > >> +struct hv_input_get_logical_device_property { > >> + u64 partition_id; > >> + u64 logical_device_id; > >> + enum hv_logical_device_property_code code; > > > > Historically we've avoided "enum" types in structures that are part of > > the hypervisor ABI. Use u32 here? > > <snip> > What has been the reasoning for that practice? Since the introduction of the > include/hyperv/ headers, we have generally wanted to import as directly as > possible the relevant definitions from the hypervisor code base. If there's > a strong reason, we could consider switching the enum for a u32 here > since, at least for the moment, there's only a single value being used. > In the C language, the size of an enum is implementation defined. Do a Co-Pilot search on "How many bytes is an enum in C", and you'll get a fairly long answer explaining the idiosyncrasies. For gcc, and for MSVC on the hypervisor side, the default is that an "enum" size is the same as an "int", so everything works in current practice. But the compiler is allowed to optimize the size of an enum if a smaller integer type can contain all the values, and that would mess things up in an ABI. Hence the intent to not use "enum" in the hypervisor ABI. Windows/Hyper-V historically didn't have to worry about such things since they controlled both sides of the ABI, but the more Linux uses the ABI, the greater potential for something to go wrong. I wish Windows/Hyper-V would tighten up their ABI specification, but it is what it is. So I'm not sure how best to deal with the issue in light of wanting to take the hypervisor ABI definitions directly from the Windows environment and not modify them. 
I did a quick grep of the hv*.h files in include/hyperv from linux-next, and while there are many enum types defined, none are used as fields in a structure. There are many cases of u32, and a couple u16's, followed by a comment identifying the enum type that should be used to populate the field. Michael ^ permalink raw reply [flat|nested] 28+ messages in thread
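Concretely, the convention described above would make the structure in question look something like this (illustrative only; the size check is an optional extra rather than something the existing headers do):

	struct hv_input_get_logical_device_property {
		u64 partition_id;
		u64 logical_device_id;
		u32 code;	/* enum hv_logical_device_property_code */
		u32 reserved;
	} __packed;

	/* Optional: pin the ABI size so an accidental type change is caught at build time. */
	static_assert(sizeof(struct hv_input_get_logical_device_property) == 24);

The wire layout is unchanged; only the C type of the field moves from the implementation-defined enum width to a fixed u32, with the enum kept purely as a value namespace.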
* [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions 2025-12-09 5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang ` (2 preceding siblings ...) 2025-12-09 5:11 ` [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang @ 2025-12-09 5:11 ` Yu Zhang 2026-01-08 18:47 ` Michael Kelley 2025-12-09 5:11 ` [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang 2026-01-08 18:45 ` [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Michael Kelley 5 siblings, 1 reply; 28+ messages in thread From: Yu Zhang @ 2025-12-09 5:11 UTC (permalink / raw) To: linux-kernel, linux-hyperv, iommu, linux-pci Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch Previously, the allocation of per-CPU output argument pages was restricted to root partitions or those operating in VTL mode. Remove this restriction to support guest IOMMU related hypercalls, which require valid output pages to function correctly. While unconditionally allocating per-CPU output pages scales with the number of vCPUs, and potentially adding overhead for guests that may not utilize the IOMMU, this change anticipates that future hypercalls from child partitions may also require these output pages. Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> --- drivers/hv/hv_common.c | 21 ++++++--------------- 1 file changed, 6 insertions(+), 15 deletions(-) diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c index e109a620c83f..034fb2592884 100644 --- a/drivers/hv/hv_common.c +++ b/drivers/hv/hv_common.c @@ -255,11 +255,6 @@ static void hv_kmsg_dump_register(void) } } -static inline bool hv_output_page_exists(void) -{ - return hv_parent_partition() || IS_ENABLED(CONFIG_HYPERV_VTL_MODE); -} - void __init hv_get_partition_id(void) { struct hv_output_get_partition_id *output; @@ -371,11 +366,9 @@ int __init hv_common_init(void) hyperv_pcpu_input_arg = alloc_percpu(void *); BUG_ON(!hyperv_pcpu_input_arg); - /* Allocate the per-CPU state for output arg for root */ - if (hv_output_page_exists()) { - hyperv_pcpu_output_arg = alloc_percpu(void *); - BUG_ON(!hyperv_pcpu_output_arg); - } + /* Allocate the per-CPU state for output arg*/ + hyperv_pcpu_output_arg = alloc_percpu(void *); + BUG_ON(!hyperv_pcpu_output_arg); if (hv_parent_partition()) { hv_synic_eventring_tail = alloc_percpu(u8 *); @@ -473,7 +466,7 @@ int hv_common_cpu_init(unsigned int cpu) u8 **synic_eventring_tail; u64 msr_vp_index; gfp_t flags; - const int pgcount = hv_output_page_exists() ? 2 : 1; + const int pgcount = 2; void *mem; int ret = 0; @@ -491,10 +484,8 @@ int hv_common_cpu_init(unsigned int cpu) if (!mem) return -ENOMEM; - if (hv_output_page_exists()) { - outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg); - *outputarg = (char *)mem + HV_HYP_PAGE_SIZE; - } + outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg); + *outputarg = (char *)mem + HV_HYP_PAGE_SIZE; if (!ms_hyperv.paravisor_present && (hv_isolation_type_snp() || hv_isolation_type_tdx())) { -- 2.49.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* RE: [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions 2025-12-09 5:11 ` [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions Yu Zhang @ 2026-01-08 18:47 ` Michael Kelley 2026-01-10 5:07 ` Yu Zhang 0 siblings, 1 reply; 28+ messages in thread From: Michael Kelley @ 2026-01-08 18:47 UTC (permalink / raw) To: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > The "Subject:" line prefix for this patch should probably be "Drivers: hv:" to be consistent with most other changes to this source code file. > Previously, the allocation of per-CPU output argument pages was restricted > to root partitions or those operating in VTL mode. > > Remove this restriction to support guest IOMMU related hypercalls, which > require valid output pages to function correctly. The thinking here isn't quite correct. Just because a hypercall produces output doesn't mean that Linux needs to allocate a page for the output that is separate from the input. It's perfectly OK to use the same page for both input and output, as long as the two areas don't overlap. Yes, the page is called "hyperv_pcpu_input_arg", but that's a historical artifact from before the time it was realized that the same page can be used for both input and output. Of course, if there's ever a hypercall that needs lots of input and lots of output such that the combined size doesn't fit in a single page, then separate input and output pages will be needed. But I'm skeptical that will ever happen. Rep hypercalls could have large amounts of input and/or output, but I'd venture that the rep count can always be managed so everything fits in a single page. > > While unconditionally allocating per-CPU output pages scales with the number > of vCPUs, and potentially adding overhead for guests that may not utilize the > IOMMU, this change anticipates that future hypercalls from child partitions > may also require these output pages. I've heard the argument that the amount of overhead is modest relative to the overall amount of memory that is typically in a VM, particularly VMs with high vCPU counts. And I don't disagree. But on the flip side, why tie up memory when there's no need to do so? I'd argue for dropping this patch, and changing the two hypercall call sites in Patch 5 to just use part of the so-called hypercall input page for the output as well. It's only a one-line change in each hypercall call site. If folks really want to always allocate the separate output page, it's not an issue that I'll continue to fight. But at least give a valid reason "why" in the commit message. 
Michael > > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > --- > drivers/hv/hv_common.c | 21 ++++++--------------- > 1 file changed, 6 insertions(+), 15 deletions(-) > > diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c > index e109a620c83f..034fb2592884 100644 > --- a/drivers/hv/hv_common.c > +++ b/drivers/hv/hv_common.c > @@ -255,11 +255,6 @@ static void hv_kmsg_dump_register(void) > } > } > > -static inline bool hv_output_page_exists(void) > -{ > - return hv_parent_partition() || IS_ENABLED(CONFIG_HYPERV_VTL_MODE); > -} > - > void __init hv_get_partition_id(void) > { > struct hv_output_get_partition_id *output; > @@ -371,11 +366,9 @@ int __init hv_common_init(void) > hyperv_pcpu_input_arg = alloc_percpu(void *); > BUG_ON(!hyperv_pcpu_input_arg); > > - /* Allocate the per-CPU state for output arg for root */ > - if (hv_output_page_exists()) { > - hyperv_pcpu_output_arg = alloc_percpu(void *); > - BUG_ON(!hyperv_pcpu_output_arg); > - } > + /* Allocate the per-CPU state for output arg*/ > + hyperv_pcpu_output_arg = alloc_percpu(void *); > + BUG_ON(!hyperv_pcpu_output_arg); > > if (hv_parent_partition()) { > hv_synic_eventring_tail = alloc_percpu(u8 *); > @@ -473,7 +466,7 @@ int hv_common_cpu_init(unsigned int cpu) > u8 **synic_eventring_tail; > u64 msr_vp_index; > gfp_t flags; > - const int pgcount = hv_output_page_exists() ? 2 : 1; > + const int pgcount = 2; > void *mem; > int ret = 0; > > @@ -491,10 +484,8 @@ int hv_common_cpu_init(unsigned int cpu) > if (!mem) > return -ENOMEM; > > - if (hv_output_page_exists()) { > - outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg); > - *outputarg = (char *)mem + HV_HYP_PAGE_SIZE; > - } > + outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg); > + *outputarg = (char *)mem + HV_HYP_PAGE_SIZE; > > if (!ms_hyperv.paravisor_present && > (hv_isolation_type_snp() || hv_isolation_type_tdx())) { > -- > 2.49.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
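To make the alternative concrete, here is a sketch of the suggested one-line change for the HVCALL_GET_IOMMU_CAPABILITIES call site in patch 5, assuming the output area is simply placed after the input within the same per-CPU page (16 + 24 bytes here, so the two areas do not overlap):

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	/* Reuse the tail of the same per-CPU page for the hypercall output. */
	output = (void *)input + sizeof(*input);
	memset(input, 0, sizeof(*input));
	memset(output, 0, sizeof(*output));
	input->partition_id = HV_PARTITION_ID_SELF;
	status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);

Only the assignment of output changes; the rest of hv_iommu_detect() and the equivalent HVCALL_GET_LOGICAL_DEVICE_PROPERTY call site would stay as posted.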
* Re: [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions 2026-01-08 18:47 ` Michael Kelley @ 2026-01-10 5:07 ` Yu Zhang 2026-01-11 22:27 ` Michael Kelley 0 siblings, 1 reply; 28+ messages in thread From: Yu Zhang @ 2026-01-10 5:07 UTC (permalink / raw) To: Michael Kelley Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org On Thu, Jan 08, 2026 at 06:47:44PM +0000, Michael Kelley wrote: > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > > > The "Subject:" line prefix for this patch should probably be "Drivers: hv:" > to be consistent with most other changes to this source code file. > > > Previously, the allocation of per-CPU output argument pages was restricted > > to root partitions or those operating in VTL mode. > > > > Remove this restriction to support guest IOMMU related hypercalls, which > > require valid output pages to function correctly. > > The thinking here isn't quite correct. Just because a hypercall produces output > doesn't mean that Linux needs to allocate a page for the output that is separate > from the input. It's perfectly OK to use the same page for both input and output, > as long as the two areas don't overlap. Yes, the page is called > "hyperv_pcpu_input_arg", but that's a historical artifact from before the time > it was realized that the same page can be used for both input and output. > > Of course, if there's ever a hypercall that needs lots of input and lots of output > such that the combined size doesn't fit in a single page, then separate input > and output pages will be needed. But I'm skeptical that will ever happen. Rep > hypercalls could have large amounts of input and/or output, but I'd venture > that the rep count can always be managed so everything fits in a single page. > Thanks, Michael. Is there an existing hypercall precedent that reuses the input page for output? I believe reusing the input page should be acceptable, at least for pvIOMMU's hypercalls, but I will confirm these interfaces with the Hyper-V team. > > > > While unconditionally allocating per-CPU output pages scales with the number > > of vCPUs, and potentially adding overhead for guests that may not utilize the > > IOMMU, this change anticipates that future hypercalls from child partitions > > may also require these output pages. > > I've heard the argument that the amount of overhead is modest relative to the > overall amount of memory that is typically in a VM, particularly VMs with high > vCPU counts. And I don't disagree. But on the flip side, why tie up memory when > there's no need to do so? I'd argue for dropping this patch, and changing the > two hypercall call sites in Patch 5 to just use part of the so-called hypercall input > page for the output as well. It's only a one-line change in each hypercall call site. > I share your concern about unconditionally allocating a separate output page for each vCPU. 
And if reusing the input page isn't accepted by the Hyper-V team, perhaps we could gate the allocation by checking IS_ENABLED(CONFIG_HYPERV_PVIOMMU) in hv_output_page_exists()? B.R. Yu ^ permalink raw reply [flat|nested] 28+ messages in thread
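For reference, the gating mentioned above would only be a small tweak to the helper this patch removes: a sketch, assuming patch 4 is dropped and hv_output_page_exists() is kept:

	static inline bool hv_output_page_exists(void)
	{
		return hv_parent_partition() || IS_ENABLED(CONFIG_HYPERV_VTL_MODE) ||
		       IS_ENABLED(CONFIG_HYPERV_PVIOMMU);
	}

hv_common_init() and hv_common_cpu_init() already branch on this helper, so they could then stay exactly as they are today.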
* RE: [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions 2026-01-10 5:07 ` Yu Zhang @ 2026-01-11 22:27 ` Michael Kelley 0 siblings, 0 replies; 28+ messages in thread From: Michael Kelley @ 2026-01-11 22:27 UTC (permalink / raw) To: Yu Zhang Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, January 9, 2026 9:07 PM > > On Thu, Jan 08, 2026 at 06:47:44PM +0000, Michael Kelley wrote: > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 > 9:11 PM > > > > > > > The "Subject:" line prefix for this patch should probably be "Drivers: hv:" > > to be consistent with most other changes to this source code file. > > > > > Previously, the allocation of per-CPU output argument pages was restricted > > > to root partitions or those operating in VTL mode. > > > > > > Remove this restriction to support guest IOMMU related hypercalls, which > > > require valid output pages to function correctly. > > > > The thinking here isn't quite correct. Just because a hypercall produces output > > doesn't mean that Linux needs to allocate a page for the output that is separate > > from the input. It's perfectly OK to use the same page for both input and output, > > as long as the two areas don't overlap. Yes, the page is called > > "hyperv_pcpu_input_arg", but that's a historical artifact from before the time > > it was realized that the same page can be used for both input and output. > > > > Of course, if there's ever a hypercall that needs lots of input and lots of output > > such that the combined size doesn't fit in a single page, then separate input > > and output pages will be needed. But I'm skeptical that will ever happen. Rep > > hypercalls could have large amounts of input and/or output, but I'd venture > > that the rep count can always be managed so everything fits in a single page. > > > > Thanks, Michael. > > Is there an existing hypercall precedent that reuses the input page for output? > I believe reusing the input page should be acceptable, at least for pvIOMMU's > hypercalls, but I will confirm these interfaces with the Hyper-V team. See hv_pci_read_mmio() for a precedent in current kernel code. There's also hv_get_partition_id() which uses hyperv_pcpu_input_page for the hypercall output. But in this case, there is no input, so input and output aren't actually sharing the page. In the kernel 6.13 and earlier, get_vtl() used the hyperv_pcpu_input_page for both input and output, but it did it wrong because the input and output areas overlapped. While overlap worked because the hypercall is a simple "one-shot" operation (i.e., read the input, then write the output), it's not legal according to the TLFS. When the illegal overlap was fixed in commit 07412e1f163d, the developer decided to allocate the hyperv_pcpu_output_page for VTL2 images, so the fix uses separate pages for the input and output. There was extensive discussion of the tradeoffs in allocating the output page for VTL2. 
In my view it was an unnecessary use of memory, but the developer preferred to do it for consistency, and I didn't press the argument because it was limited to VTL2. Similarly, I won't press the argument here if folks really want to always allocate the output page. My only request is that the commit message not be misleading about the reason. See https://elixir.bootlin.com/linux/v6.13/source/arch/x86/hyperv/hv_init.c#L416 for the older get_vtl() code that puts the input and output in the same page, but improperly overlaps. > > > > > > > While unconditionally allocating per-CPU output pages scales with the number > > > of vCPUs, and potentially adding overhead for guests that may not utilize the > > > IOMMU, this change anticipates that future hypercalls from child partitions > > > may also require these output pages. > > > > I've heard the argument that the amount of overhead is modest relative to the > > overall amount of memory that is typically in a VM, particularly VMs with high > > vCPU counts. And I don't disagree. But on the flip side, why tie up memory when > > there's no need to do so? I'd argue for dropping this patch, and changing the > > two hypercall call sites in Patch 5 to just use part of the so-called hypercall input > > page for the output as well. It's only a one-line change in each hypercall call site. > > > > I share your concern about unconditionally allocating a separate output page > for each vCPU. And if reusing the input page isn't accepted by the Hyper-V team, > perhaps we could gate the allocation by checking > IS_ENABLED(CONFIG_HYPERV_PVIOMMU) > in hv_output_page_exist()? Yes, that's doable, though I hope it doesn't come to that. At some point the additional complexity starts to favor just allocating the output page. :-) Michael ^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2025-12-09 5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang ` (3 preceding siblings ...) 2025-12-09 5:11 ` [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions Yu Zhang @ 2025-12-09 5:11 ` Yu Zhang 2025-12-10 17:15 ` Easwar Hariharan 2026-01-08 18:48 ` Michael Kelley 2026-01-08 18:45 ` [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Michael Kelley 5 siblings, 2 replies; 28+ messages in thread From: Yu Zhang @ 2025-12-09 5:11 UTC (permalink / raw) To: linux-kernel, linux-hyperv, iommu, linux-pci Cc: kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, easwar.hariharan, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V. This driver implements stage-1 IO translation within the guest OS. It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls for: - Capability discovery - Domain allocation, configuration, and deallocation - Device attachment and detachment - IOTLB invalidation The driver constructs x86-compatible stage-1 IO page tables in the guest memory using consolidated IO page table helpers. This allows the guest to manage stage-1 translations independently of vendor- specific drivers (like Intel VT-d or AMD IOMMU). Hyper-v consumes this stage-1 IO page table, when a device domain is created and configured, and nests it with the host's stage-2 IO page tables, therefore elemenating the VM exits for guest IOMMU mapping operations. For guest IOMMU unmapping operations, VM exits to perform the IOTLB flush(and possibly the device TLB flush) is still unavoidable. For now, HVCALL_FLUSH_DEVICE_DOMAIN is used to implement a domain-selective IOTLB flush. New hypercalls for finer-grained hypercall will be provided in future patches. Co-developed-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Wei Liu <wei.liu@kernel.org> Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> --- drivers/iommu/hyperv/Kconfig | 14 + drivers/iommu/hyperv/Makefile | 1 + drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++++++++++++++++++ drivers/iommu/hyperv/iommu.h | 53 +++ 4 files changed, 676 insertions(+) create mode 100644 drivers/iommu/hyperv/iommu.c create mode 100644 drivers/iommu/hyperv/iommu.h diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig index 30f40d867036..fa3c77752d7b 100644 --- a/drivers/iommu/hyperv/Kconfig +++ b/drivers/iommu/hyperv/Kconfig @@ -8,3 +8,17 @@ config HYPERV_IOMMU help Stub IOMMU driver to handle IRQs to support Hyper-V Linux guest and root partitions. + +if HYPERV_IOMMU +config HYPERV_PVIOMMU + bool "Microsoft Hypervisor para-virtualized IOMMU support" + depends on X86 && HYPERV && PCI_HYPERV + depends on IOMMU_PT + select IOMMU_API + select IOMMU_DMA + select DMA_OPS + select IOMMU_IOVA + default HYPERV + help + A para-virtualized IOMMU for Microsoft Hypervisor guest. 
+endif diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile index 9f557bad94ff..8669741c0a51 100644 --- a/drivers/iommu/hyperv/Makefile +++ b/drivers/iommu/hyperv/Makefile @@ -1,2 +1,3 @@ # SPDX-License-Identifier: GPL-2.0 obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o +obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c new file mode 100644 index 000000000000..3d0aff868e16 --- /dev/null +++ b/drivers/iommu/hyperv/iommu.c @@ -0,0 +1,608 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Hyper-V IOMMU driver. + * + * Copyright (C) 2019, 2024-2025 Microsoft, Inc. + */ + +#include <linux/iommu.h> +#include <linux/pci.h> +#include <linux/dma-map-ops.h> +#include <linux/generic_pt/iommu.h> +#include <linux/syscore_ops.h> +#include <linux/pci-ats.h> + +#include <asm/iommu.h> +#include <asm/hypervisor.h> +#include <asm/mshyperv.h> + +#include "iommu.h" +#include "../dma-iommu.h" +#include "../iommu-pages.h" + +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev); +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain); +struct hv_iommu_dev *hv_iommu_device; +static struct hv_iommu_domain hv_identity_domain; +static struct hv_iommu_domain hv_blocking_domain; +static const struct iommu_domain_ops hv_iommu_identity_domain_ops; +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops; +static struct iommu_ops hv_iommu_ops; + +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT) +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1) +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL) +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS) + +static int hv_create_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_stage) +{ + int ret; + u64 status; + unsigned long flags; + struct hv_input_create_device_domain *input; + + ret = ida_alloc_range(&hv_iommu_device->domain_ids, + hv_iommu_device->first_domain, hv_iommu_device->last_domain, + GFP_KERNEL); + if (ret < 0) + return ret; + + hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF; + hv_domain->device_domain.domain_id.type = domain_stage; + hv_domain->device_domain.domain_id.id = ret; + hv_domain->hv_iommu = hv_iommu_device; + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->device_domain = hv_domain->device_domain; + input->create_device_domain_flags.forward_progress_required = 1; + input->create_device_domain_flags.inherit_owning_vtl = 0; + status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL); + + local_irq_restore(flags); + + if (!hv_result_success(status)) { + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + ida_free(&hv_iommu_device->domain_ids, hv_domain->device_domain.domain_id.id); + } + + return hv_result_to_errno(status); +} + +static void hv_delete_device_domain(struct hv_iommu_domain *hv_domain) +{ + u64 status; + unsigned long flags; + struct hv_input_delete_device_domain *input; + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->device_domain = hv_domain->device_domain; + status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL); + + local_irq_restore(flags); + + if (!hv_result_success(status)) + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + + ida_free(&hv_domain->hv_iommu->domain_ids, 
hv_domain->device_domain.domain_id.id); +} + +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap) +{ + switch (cap) { + case IOMMU_CAP_CACHE_COHERENCY: + return true; + case IOMMU_CAP_DEFERRED_FLUSH: + return true; + default: + return false; + } +} + +static int hv_iommu_attach_dev(struct iommu_domain *domain, struct device *dev) +{ + u64 status; + unsigned long flags; + struct pci_dev *pdev; + struct hv_input_attach_device_domain *input; + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); + + /* Only allow PCI devices for now */ + if (!dev_is_pci(dev)) + return -EINVAL; + + if (vdev->hv_domain == hv_domain) + return 0; + + if (vdev->hv_domain) + hv_iommu_detach_dev(&vdev->hv_domain->domain, dev); + + pdev = to_pci_dev(dev); + dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "", + hv_domain->device_domain.domain_id.id); + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->device_domain = hv_domain->device_domain; + input->device_id.as_uint64 = hv_build_logical_dev_id(pdev); + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL); + + local_irq_restore(flags); + + if (!hv_result_success(status)) { + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + } else { + vdev->hv_domain = hv_domain; + spin_lock_irqsave(&hv_domain->lock, flags); + list_add(&vdev->list, &hv_domain->dev_list); + spin_unlock_irqrestore(&hv_domain->lock, flags); + } + + return hv_result_to_errno(status); +} + +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev) +{ + u64 status; + unsigned long flags; + struct hv_input_detach_device_domain *input; + struct pci_dev *pdev; + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); + + /* See the attach function, only PCI devices for now */ + if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain) + return; + + pdev = to_pci_dev(dev); + + dev_dbg(dev, "Detaching from %d\n", hv_domain->device_domain.domain_id.id); + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->partition_id = HV_PARTITION_ID_SELF; + input->device_id.as_uint64 = hv_build_logical_dev_id(pdev); + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL); + + local_irq_restore(flags); + + if (!hv_result_success(status)) + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + + spin_lock_irqsave(&hv_domain->lock, flags); + hv_flush_device_domain(hv_domain); + list_del(&vdev->list); + spin_unlock_irqrestore(&hv_domain->lock, flags); + + vdev->hv_domain = NULL; +} + +static int hv_iommu_get_logical_device_property(struct device *dev, + enum hv_logical_device_property_code code, + struct hv_output_get_logical_device_property *property) +{ + u64 status; + unsigned long flags; + struct hv_input_get_logical_device_property *input; + struct hv_output_get_logical_device_property *output; + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + output = *this_cpu_ptr(hyperv_pcpu_output_arg); + memset(input, 0, sizeof(*input)); + memset(output, 0, sizeof(*output)); + input->partition_id = HV_PARTITION_ID_SELF; + input->logical_device_id = hv_build_logical_dev_id(to_pci_dev(dev)); + input->code = code; + status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output); + *property = *output; + + 
local_irq_restore(flags); + + if (!hv_result_success(status)) + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + + return hv_result_to_errno(status); +} + +static struct iommu_device *hv_iommu_probe_device(struct device *dev) +{ + struct pci_dev *pdev; + struct hv_iommu_endpoint *vdev; + struct hv_output_get_logical_device_property device_iommu_property = {0}; + + if (!dev_is_pci(dev)) + return ERR_PTR(-ENODEV); + + if (hv_iommu_get_logical_device_property(dev, + HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU, + &device_iommu_property) || + !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED)) + return ERR_PTR(-ENODEV); + + pdev = to_pci_dev(dev); + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); + if (!vdev) + return ERR_PTR(-ENOMEM); + + vdev->dev = dev; + vdev->hv_iommu = hv_iommu_device; + dev_iommu_priv_set(dev, vdev); + + if (hv_iommu_ats_supported(hv_iommu_device->cap) && + pci_ats_supported(pdev)) + pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap)); + + return &vdev->hv_iommu->iommu; +} + +static void hv_iommu_release_device(struct device *dev) +{ + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); + + if (vdev->hv_domain) + hv_iommu_detach_dev(&vdev->hv_domain->domain, dev); + + dev_iommu_priv_set(dev, NULL); + set_dma_ops(dev, NULL); + + kfree(vdev); +} + +static struct iommu_group *hv_iommu_device_group(struct device *dev) +{ + if (dev_is_pci(dev)) + return pci_device_group(dev); + else + return generic_device_group(dev); +} + +static int hv_configure_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_type) +{ + u64 status; + unsigned long flags; + struct pt_iommu_x86_64_hw_info pt_info; + struct hv_input_configure_device_domain *input; + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->device_domain = hv_domain->device_domain; + input->settings.flags.blocked = (domain_type == IOMMU_DOMAIN_BLOCKED); + input->settings.flags.translation_enabled = (domain_type != IOMMU_DOMAIN_IDENTITY); + + if (domain_type & __IOMMU_DOMAIN_PAGING) { + pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64, &pt_info); + input->settings.page_table_root = pt_info.gcr3_pt; + input->settings.flags.first_stage_paging_mode = + pt_info.levels == 5; + } + status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN, input, NULL); + + local_irq_restore(flags); + + if (!hv_result_success(status)) + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + + return hv_result_to_errno(status); +} + +static int __init hv_initialize_static_domains(void) +{ + int ret; + struct hv_iommu_domain *hv_domain; + + /* Default stage-1 identity domain */ + hv_domain = &hv_identity_domain; + memset(hv_domain, 0, sizeof(*hv_domain)); + + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); + if (ret) + return ret; + + ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY); + if (ret) + goto delete_identity_domain; + + hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY; + hv_domain->domain.ops = &hv_iommu_identity_domain_ops; + hv_domain->domain.owner = &hv_iommu_ops; + hv_domain->domain.geometry = hv_iommu_device->geometry; + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap; + INIT_LIST_HEAD(&hv_domain->dev_list); + + /* Default stage-1 blocked domain */ + hv_domain = &hv_blocking_domain; + memset(hv_domain, 0, sizeof(*hv_domain)); + + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); + if (ret) + goto delete_identity_domain; + + ret = 
hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED); + if (ret) + goto delete_blocked_domain; + + hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED; + hv_domain->domain.ops = &hv_iommu_blocking_domain_ops; + hv_domain->domain.owner = &hv_iommu_ops; + hv_domain->domain.geometry = hv_iommu_device->geometry; + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap; + INIT_LIST_HEAD(&hv_domain->dev_list); + + return 0; + +delete_blocked_domain: + hv_delete_device_domain(&hv_blocking_domain); +delete_identity_domain: + hv_delete_device_domain(&hv_identity_domain); + return ret; +} + +#define INTERRUPT_RANGE_START (0xfee00000) +#define INTERRUPT_RANGE_END (0xfeefffff) +static void hv_iommu_get_resv_regions(struct device *dev, + struct list_head *head) +{ + struct iommu_resv_region *region; + + region = iommu_alloc_resv_region(INTERRUPT_RANGE_START, + INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1, + 0, IOMMU_RESV_MSI, GFP_KERNEL); + if (!region) + return; + + list_add_tail(®ion->list, head); +} + +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain) +{ + u64 status; + unsigned long flags; + struct hv_input_flush_device_domain *input; + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->device_domain.partition_id = hv_domain->device_domain.partition_id; + input->device_domain.owner_vtl = hv_domain->device_domain.owner_vtl; + input->device_domain.domain_id.type = hv_domain->device_domain.domain_id.type; + input->device_domain.domain_id.id = hv_domain->device_domain.domain_id.id; + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL); + + local_irq_restore(flags); + + if (!hv_result_success(status)) + pr_err("%s: hypercall failed, status %lld\n", __func__, status); +} + +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain) +{ + hv_flush_device_domain(to_hv_iommu_domain(domain)); +} + +static void hv_iommu_iotlb_sync(struct iommu_domain *domain, + struct iommu_iotlb_gather *iotlb_gather) +{ + hv_flush_device_domain(to_hv_iommu_domain(domain)); + + iommu_put_pages_list(&iotlb_gather->freelist); +} + +static void hv_iommu_paging_domain_free(struct iommu_domain *domain) +{ + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); + + /* Free all remaining mappings */ + pt_iommu_deinit(&hv_domain->pt_iommu); + + hv_delete_device_domain(hv_domain); + + kfree(hv_domain); +} + +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = { + .attach_dev = hv_iommu_attach_dev, +}; + +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = { + .attach_dev = hv_iommu_attach_dev, +}; + +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = { + .attach_dev = hv_iommu_attach_dev, + IOMMU_PT_DOMAIN_OPS(x86_64), + .flush_iotlb_all = hv_iommu_flush_iotlb_all, + .iotlb_sync = hv_iommu_iotlb_sync, + .free = hv_iommu_paging_domain_free, +}; + +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev) +{ + int ret; + struct hv_iommu_domain *hv_domain; + struct pt_iommu_x86_64_cfg cfg = {}; + + hv_domain = kzalloc(sizeof(*hv_domain), GFP_KERNEL); + if (!hv_domain) + return ERR_PTR(-ENOMEM); + + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); + if (ret) { + kfree(hv_domain); + return ERR_PTR(ret); + } + + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap; + hv_domain->domain.geometry = hv_iommu_device->geometry; + hv_domain->pt_iommu.nid = dev_to_node(dev); + INIT_LIST_HEAD(&hv_domain->dev_list); + 
spin_lock_init(&hv_domain->lock); + + cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width; + cfg.common.hw_max_oasz_lg2 = 52; + + ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL); + if (ret) { + hv_delete_device_domain(hv_domain); + return ERR_PTR(ret); + } + + hv_domain->domain.ops = &hv_iommu_paging_domain_ops; + + ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING); + if (ret) { + pt_iommu_deinit(&hv_domain->pt_iommu); + hv_delete_device_domain(hv_domain); + return ERR_PTR(ret); + } + + return &hv_domain->domain; +} + +static struct iommu_ops hv_iommu_ops = { + .capable = hv_iommu_capable, + .domain_alloc_paging = hv_iommu_domain_alloc_paging, + .probe_device = hv_iommu_probe_device, + .release_device = hv_iommu_release_device, + .device_group = hv_iommu_device_group, + .get_resv_regions = hv_iommu_get_resv_regions, + .owner = THIS_MODULE, + .identity_domain = &hv_identity_domain.domain, + .blocked_domain = &hv_blocking_domain.domain, + .release_domain = &hv_blocking_domain.domain, +}; + +static void hv_iommu_shutdown(void) +{ + iommu_device_sysfs_remove(&hv_iommu_device->iommu); + + kfree(hv_iommu_device); +} + +static struct syscore_ops hv_iommu_syscore_ops = { + .shutdown = hv_iommu_shutdown, +}; + +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap) +{ + u64 status; + unsigned long flags; + struct hv_input_get_iommu_capabilities *input; + struct hv_output_get_iommu_capabilities *output; + + local_irq_save(flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + output = *this_cpu_ptr(hyperv_pcpu_output_arg); + memset(input, 0, sizeof(*input)); + memset(output, 0, sizeof(*output)); + input->partition_id = HV_PARTITION_ID_SELF; + status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output); + *hv_iommu_cap = *output; + + local_irq_restore(flags); + + if (!hv_result_success(status)) + pr_err("%s: hypercall failed, status %lld\n", __func__, status); + + return hv_result_to_errno(status); +} + +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu, + struct hv_output_get_iommu_capabilities *hv_iommu_cap) +{ + ida_init(&hv_iommu->domain_ids); + + hv_iommu->cap = hv_iommu_cap->iommu_cap; + hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width; + if (!hv_iommu_5lvl_supported(hv_iommu->cap) && + hv_iommu->max_iova_width > 48) { + pr_err("5-level paging not supported, limiting iova width to 48.\n"); + hv_iommu->max_iova_width = 48; + } + + hv_iommu->geometry = (struct iommu_domain_geometry) { + .aperture_start = 0, + .aperture_end = (((u64)1) << hv_iommu_cap->max_iova_width) - 1, + .force_aperture = true, + }; + + hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1; + hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1; + hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap; + hv_iommu_device = hv_iommu; +} + +static int __init hv_iommu_init(void) +{ + int ret = 0; + struct hv_iommu_dev *hv_iommu = NULL; + struct hv_output_get_iommu_capabilities hv_iommu_cap = {0}; + + if (no_iommu || iommu_detected) + return -ENODEV; + + if (!hv_is_hyperv_initialized()) + return -ENODEV; + + if (hv_iommu_detect(&hv_iommu_cap) || + !hv_iommu_present(hv_iommu_cap.iommu_cap) || + !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) + return -ENODEV; + + iommu_detected = 1; + pci_request_acs(); + + hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL); + if (!hv_iommu) + return -ENOMEM; + + hv_init_iommu_device(hv_iommu, &hv_iommu_cap); + + ret = hv_initialize_static_domains(); + if (ret) { + 
pr_err("hv_initialize_static_domains failed: %d\n", ret); + goto err_sysfs_remove; + } + + ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu"); + if (ret) { + pr_err("iommu_device_sysfs_add failed: %d\n", ret); + goto err_free; + } + + + ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL); + if (ret) { + pr_err("iommu_device_register failed: %d\n", ret); + goto err_sysfs_remove; + } + + register_syscore_ops(&hv_iommu_syscore_ops); + + pr_info("Microsoft Hypervisor IOMMU initialized\n"); + return 0; + +err_sysfs_remove: + iommu_device_sysfs_remove(&hv_iommu->iommu); +err_free: + kfree(hv_iommu); + return ret; +} + +device_initcall(hv_iommu_init); diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h new file mode 100644 index 000000000000..c8657e791a6e --- /dev/null +++ b/drivers/iommu/hyperv/iommu.h @@ -0,0 +1,53 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * Hyper-V IOMMU driver. + * + * Copyright (C) 2024-2025, Microsoft, Inc. + * + */ + +#ifndef _HYPERV_IOMMU_H +#define _HYPERV_IOMMU_H + +struct hv_iommu_dev { + struct iommu_device iommu; + struct ida domain_ids; + + /* Device configuration */ + u8 max_iova_width; + u8 max_pasid_width; + u64 cap; + u64 pgsize_bitmap; + + struct iommu_domain_geometry geometry; + u64 first_domain; + u64 last_domain; +}; + +struct hv_iommu_domain { + union { + struct iommu_domain domain; + struct pt_iommu pt_iommu; + struct pt_iommu_x86_64 pt_iommu_x86_64; + }; + struct hv_iommu_dev *hv_iommu; + struct hv_input_device_domain device_domain; + u64 pgsize_bitmap; + + spinlock_t lock; /* protects dev_list and TLB flushes */ + /* List of devices in this DMA domain */ + struct list_head dev_list; +}; + +struct hv_iommu_endpoint { + struct device *dev; + struct hv_iommu_dev *hv_iommu; + struct hv_iommu_domain *hv_domain; + struct list_head list; /* For domain->dev_list */ +}; + +#define to_hv_iommu_domain(d) \ + container_of(d, struct hv_iommu_domain, domain) + +#endif /* _HYPERV_IOMMU_H */ -- 2.49.0 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2025-12-09 5:11 ` [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang @ 2025-12-10 17:15 ` Easwar Hariharan 2025-12-11 8:41 ` Yu Zhang 2026-01-08 18:48 ` Michael Kelley 1 sibling, 1 reply; 28+ messages in thread From: Easwar Hariharan @ 2025-12-10 17:15 UTC (permalink / raw) To: Yu Zhang Cc: linux-kernel, linux-hyperv, iommu, linux-pci, easwar.hariharan, kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch On 12/8/2025 9:11 PM, Yu Zhang wrote: > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V. > This driver implements stage-1 IO translation within the guest OS. > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls > for: > - Capability discovery > - Domain allocation, configuration, and deallocation > - Device attachment and detachment > - IOTLB invalidation > > The driver constructs x86-compatible stage-1 IO page tables in the > guest memory using consolidated IO page table helpers. This allows > the guest to manage stage-1 translations independently of vendor- > specific drivers (like Intel VT-d or AMD IOMMU). > > Hyper-v consumes this stage-1 IO page table, when a device domain is > created and configured, and nests it with the host's stage-2 IO page > tables, therefore elemenating the VM exits for guest IOMMU mapping > operations. > > For guest IOMMU unmapping operations, VM exits to perform the IOTLB > flush(and possibly the device TLB flush) is still unavoidable. For > now, HVCALL_FLUSH_DEVICE_DOMAIN is used to implement a domain-selective > IOTLB flush. New hypercalls for finer-grained hypercall will be provided > in future patches. 
> > Co-developed-by: Wei Liu <wei.liu@kernel.org> > Signed-off-by: Wei Liu <wei.liu@kernel.org> > Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com> > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> > Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > --- > drivers/iommu/hyperv/Kconfig | 14 + > drivers/iommu/hyperv/Makefile | 1 + > drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++++++++++++++++++ > drivers/iommu/hyperv/iommu.h | 53 +++ > 4 files changed, 676 insertions(+) > create mode 100644 drivers/iommu/hyperv/iommu.c > create mode 100644 drivers/iommu/hyperv/iommu.h > <snip> > + > +static int __init hv_iommu_init(void) > +{ > + int ret = 0; > + struct hv_iommu_dev *hv_iommu = NULL; > + struct hv_output_get_iommu_capabilities hv_iommu_cap = {0}; > + > + if (no_iommu || iommu_detected) > + return -ENODEV; > + > + if (!hv_is_hyperv_initialized()) > + return -ENODEV; > + > + if (hv_iommu_detect(&hv_iommu_cap) || > + !hv_iommu_present(hv_iommu_cap.iommu_cap) || > + !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) > + return -ENODEV; > + > + iommu_detected = 1; > + pci_request_acs(); > + > + hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL); > + if (!hv_iommu) > + return -ENOMEM; > + > + hv_init_iommu_device(hv_iommu, &hv_iommu_cap); > + > + ret = hv_initialize_static_domains(); > + if (ret) { > + pr_err("hv_initialize_static_domains failed: %d\n", ret); > + goto err_sysfs_remove; This should be goto err_free since we haven't done the sysfs_add yet > + } > + > + ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu"); > + if (ret) { > + pr_err("iommu_device_sysfs_add failed: %d\n", ret); > + goto err_free; And this should be probably a goto delete_static_domains that cleans up the allocated static domains... > + } > + > + > + ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL); > + if (ret) { > + pr_err("iommu_device_register failed: %d\n", ret); > + goto err_sysfs_remove; > + } > + > + register_syscore_ops(&hv_iommu_syscore_ops); > + > + pr_info("Microsoft Hypervisor IOMMU initialized\n"); > + return 0; > + > +err_sysfs_remove: > + iommu_device_sysfs_remove(&hv_iommu->iommu); > +err_free: > + kfree(hv_iommu); > + return ret; > +} > + > +device_initcall(hv_iommu_init); <snip> ^ permalink raw reply [flat|nested] 28+ messages in thread
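Taken together, the two comments point at an unwind order along these lines (a sketch only; the err_delete_domains label is a made-up name, and the exact cleanup naming is up to the next revision):

	ret = hv_initialize_static_domains();
	if (ret) {
		pr_err("hv_initialize_static_domains failed: %d\n", ret);
		goto err_free;
	}

	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
	if (ret) {
		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
		goto err_delete_domains;
	}

	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
	if (ret) {
		pr_err("iommu_device_register failed: %d\n", ret);
		goto err_sysfs_remove;
	}

	register_syscore_ops(&hv_iommu_syscore_ops);
	pr_info("Microsoft Hypervisor IOMMU initialized\n");
	return 0;

err_sysfs_remove:
	iommu_device_sysfs_remove(&hv_iommu->iommu);
err_delete_domains:
	hv_delete_device_domain(&hv_blocking_domain);
	hv_delete_device_domain(&hv_identity_domain);
err_free:
	kfree(hv_iommu);
	return ret;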
* Re: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2025-12-10 17:15 ` Easwar Hariharan @ 2025-12-11 8:41 ` Yu Zhang 0 siblings, 0 replies; 28+ messages in thread From: Yu Zhang @ 2025-12-11 8:41 UTC (permalink / raw) To: Easwar Hariharan Cc: linux-kernel, linux-hyperv, iommu, linux-pci, kys, haiyangz, wei.liu, decui, lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, joro, will, robin.murphy, jacob.pan, nunodasneves, mrathor, mhklinux, peterz, linux-arch On Wed, Dec 10, 2025 at 09:15:18AM -0800, Easwar Hariharan wrote: > On 12/8/2025 9:11 PM, Yu Zhang wrote: > > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V. > > This driver implements stage-1 IO translation within the guest OS. > > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls > > for: > > - Capability discovery > > - Domain allocation, configuration, and deallocation > > - Device attachment and detachment > > - IOTLB invalidation > > > > The driver constructs x86-compatible stage-1 IO page tables in the > > guest memory using consolidated IO page table helpers. This allows > > the guest to manage stage-1 translations independently of vendor- > > specific drivers (like Intel VT-d or AMD IOMMU). > > > > Hyper-v consumes this stage-1 IO page table, when a device domain is > > created and configured, and nests it with the host's stage-2 IO page > > tables, therefore elemenating the VM exits for guest IOMMU mapping > > operations. > > > > For guest IOMMU unmapping operations, VM exits to perform the IOTLB > > flush(and possibly the device TLB flush) is still unavoidable. For > > now, HVCALL_FLUSH_DEVICE_DOMAIN is used to implement a domain-selective > > IOTLB flush. New hypercalls for finer-grained hypercall will be provided > > in future patches. 
> > > > Co-developed-by: Wei Liu <wei.liu@kernel.org> > > Signed-off-by: Wei Liu <wei.liu@kernel.org> > > Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com> > > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> > > Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > > --- > > drivers/iommu/hyperv/Kconfig | 14 + > > drivers/iommu/hyperv/Makefile | 1 + > > drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++++++++++++++++++ > > drivers/iommu/hyperv/iommu.h | 53 +++ > > 4 files changed, 676 insertions(+) > > create mode 100644 drivers/iommu/hyperv/iommu.c > > create mode 100644 drivers/iommu/hyperv/iommu.h > > > > <snip> > > > + > > +static int __init hv_iommu_init(void) > > +{ > > + int ret = 0; > > + struct hv_iommu_dev *hv_iommu = NULL; > > + struct hv_output_get_iommu_capabilities hv_iommu_cap = {0}; > > + > > + if (no_iommu || iommu_detected) > > + return -ENODEV; > > + > > + if (!hv_is_hyperv_initialized()) > > + return -ENODEV; > > + > > + if (hv_iommu_detect(&hv_iommu_cap) || > > + !hv_iommu_present(hv_iommu_cap.iommu_cap) || > > + !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) > > + return -ENODEV; > > + > > + iommu_detected = 1; > > + pci_request_acs(); > > + > > + hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL); > > + if (!hv_iommu) > > + return -ENOMEM; > > + > > + hv_init_iommu_device(hv_iommu, &hv_iommu_cap); > > + > > + ret = hv_initialize_static_domains(); > > + if (ret) { > > + pr_err("hv_initialize_static_domains failed: %d\n", ret); > > + goto err_sysfs_remove; > > This should be goto err_free since we haven't done the sysfs_add yet > > > + } > > + > > + ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu"); > > + if (ret) { > > + pr_err("iommu_device_sysfs_add failed: %d\n", ret); > > + goto err_free; > > And this should be probably a goto delete_static_domains that cleans up the allocated static > domains... > Nice catch. And thanks! :) Yu ^ permalink raw reply [flat|nested] 28+ messages in thread
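Putting the two suggestions above together, a minimal sketch of the corrected unwind order in hv_iommu_init() might look like the following. This is illustrative only; it reuses hv_delete_device_domain() from the posted patch to undo hv_initialize_static_domains(), and the label name follows Easwar's suggestion:

        ret = hv_initialize_static_domains();
        if (ret) {
                pr_err("hv_initialize_static_domains failed: %d\n", ret);
                goto err_free;                  /* nothing else is set up yet */
        }

        ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
        if (ret) {
                pr_err("iommu_device_sysfs_add failed: %d\n", ret);
                goto err_delete_static_domains; /* undo the static domains */
        }

        ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
        if (ret) {
                pr_err("iommu_device_register failed: %d\n", ret);
                goto err_sysfs_remove;
        }

        register_syscore_ops(&hv_iommu_syscore_ops);

        pr_info("Microsoft Hypervisor IOMMU initialized\n");
        return 0;

err_sysfs_remove:
        iommu_device_sysfs_remove(&hv_iommu->iommu);
err_delete_static_domains:
        hv_delete_device_domain(&hv_blocking_domain);
        hv_delete_device_domain(&hv_identity_domain);
err_free:
        kfree(hv_iommu);
        return ret;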
* RE: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2025-12-09 5:11 ` [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang 2025-12-10 17:15 ` Easwar Hariharan @ 2026-01-08 18:48 ` Michael Kelley 2026-01-12 16:56 ` Yu Zhang 1 sibling, 1 reply; 28+ messages in thread From: Michael Kelley @ 2026-01-08 18:48 UTC (permalink / raw) To: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V. > This driver implements stage-1 IO translation within the guest OS. > It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls > for: > - Capability discovery > - Domain allocation, configuration, and deallocation > - Device attachment and detachment > - IOTLB invalidation > > The driver constructs x86-compatible stage-1 IO page tables in the > guest memory using consolidated IO page table helpers. This allows > the guest to manage stage-1 translations independently of vendor- > specific drivers (like Intel VT-d or AMD IOMMU). > > Hyper-v consumes this stage-1 IO page table, when a device domain is s/Hyper-v/Hyper-V/ > created and configured, and nests it with the host's stage-2 IO page > tables, therefore elemenating the VM exits for guest IOMMU mapping s/elemenating/eliminating/ > operations. > > For guest IOMMU unmapping operations, VM exits to perform the IOTLB > flush(and possibly the device TLB flush) is still unavoidable. For Typo: Add a space after "flush" and before the open parenthesis. > now, HVCALL_FLUSH_DEVICE_DOMAIN is used to implement a domain-selective Typo: Extra white space after HVCALL_FLUSH_DEVICE_DOMAIN > IOTLB flush. New hypercalls for finer-grained hypercall will be provided > in future patches. > > Co-developed-by: Wei Liu <wei.liu@kernel.org> > Signed-off-by: Wei Liu <wei.liu@kernel.org> > Co-developed-by: Jacob Pan <jacob.pan@linux.microsoft.com> > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com> > Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com> > Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com> > --- > drivers/iommu/hyperv/Kconfig | 14 + > drivers/iommu/hyperv/Makefile | 1 + > drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++++++++++++++++++ > drivers/iommu/hyperv/iommu.h | 53 +++ > 4 files changed, 676 insertions(+) > create mode 100644 drivers/iommu/hyperv/iommu.c > create mode 100644 drivers/iommu/hyperv/iommu.h > > diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig > index 30f40d867036..fa3c77752d7b 100644 > --- a/drivers/iommu/hyperv/Kconfig > +++ b/drivers/iommu/hyperv/Kconfig > @@ -8,3 +8,17 @@ config HYPERV_IOMMU > help > Stub IOMMU driver to handle IRQs to support Hyper-V Linux > guest and root partitions. 
> + > +if HYPERV_IOMMU > +config HYPERV_PVIOMMU > + bool "Microsoft Hypervisor para-virtualized IOMMU support" > + depends on X86 && HYPERV && PCI_HYPERV Depending on PCI_HYPERV is problematic as pointed out in my comments on Patch 1 of this series. > + depends on IOMMU_PT Use "select IOMMU_PT" instead of "depends"? Other IOMMU drivers use "select". > + select IOMMU_API > + select IOMMU_DMA IOMMU_DMA is enabled by default on x86 and arm64 architectures. Other IOMMU drivers don't select it, so maybe this could be dropped. > + select DMA_OPS DMA_OPS doesn't exist. I'm not sure what this is supposed to be. > + select IOMMU_IOVA > + default HYPERV > + help > + A para-virtualized IOMMU for Microsoft Hypervisor guest. > +endif > diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile > index 9f557bad94ff..8669741c0a51 100644 > --- a/drivers/iommu/hyperv/Makefile > +++ b/drivers/iommu/hyperv/Makefile > @@ -1,2 +1,3 @@ > # SPDX-License-Identifier: GPL-2.0 > obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o > +obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o > diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c > new file mode 100644 > index 000000000000..3d0aff868e16 > --- /dev/null > +++ b/drivers/iommu/hyperv/iommu.c > @@ -0,0 +1,608 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Hyper-V IOMMU driver. > + * > + * Copyright (C) 2019, 2024-2025 Microsoft, Inc. > + */ > + > +#include <linux/iommu.h> > +#include <linux/pci.h> > +#include <linux/dma-map-ops.h> > +#include <linux/generic_pt/iommu.h> > +#include <linux/syscore_ops.h> > +#include <linux/pci-ats.h> > + > +#include <asm/iommu.h> > +#include <asm/hypervisor.h> > +#include <asm/mshyperv.h> > + > +#include "iommu.h" > +#include "../dma-iommu.h" > +#include "../iommu-pages.h" > + > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev); > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain); With some fairly simple reordering of code in this source file, these two declarations could go away. Generally, the best practice is to order so such declarations aren't needed, though that's not always possible. > +struct hv_iommu_dev *hv_iommu_device; > +static struct hv_iommu_domain hv_identity_domain; > +static struct hv_iommu_domain hv_blocking_domain; Why is hv_iommu_device allocated dynamically while the two domains are allocated statically? Seems like the approach could be consistent, though maybe there's some reason I'm missing. > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops; > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops; > +static struct iommu_ops hv_iommu_ops; I'm wondering if this declaration could also be eliminated by some reordering, though I didn't take time to figure out the details. Maybe this is one of those cases that can't be avoided. 
> + > +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT) > +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1) > +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL) > +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS) > + > +static int hv_create_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_stage) > +{ > + int ret; > + u64 status; > + unsigned long flags; > + struct hv_input_create_device_domain *input; > + > + ret = ida_alloc_range(&hv_iommu_device->domain_ids, > + hv_iommu_device->first_domain, hv_iommu_device->last_domain, > + GFP_KERNEL); > + if (ret < 0) > + return ret; > + > + hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF; > + hv_domain->device_domain.domain_id.type = domain_stage; > + hv_domain->device_domain.domain_id.id = ret; > + hv_domain->hv_iommu = hv_iommu_device; > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->device_domain = hv_domain->device_domain; > + input->create_device_domain_flags.forward_progress_required = 1; > + input->create_device_domain_flags.inherit_owning_vtl = 0; > + status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL); > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) { > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + ida_free(&hv_iommu_device->domain_ids, hv_domain->device_domain.domain_id.id); > + } > + > + return hv_result_to_errno(status); > +} > + > +static void hv_delete_device_domain(struct hv_iommu_domain *hv_domain) > +{ > + u64 status; > + unsigned long flags; > + struct hv_input_delete_device_domain *input; > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->device_domain = hv_domain->device_domain; > + status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL); > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + > + ida_free(&hv_domain->hv_iommu->domain_ids, hv_domain->device_domain.domain_id.id); > +} > + > +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap) > +{ > + switch (cap) { > + case IOMMU_CAP_CACHE_COHERENCY: > + return true; > + case IOMMU_CAP_DEFERRED_FLUSH: > + return true; > + default: > + return false; > + } > +} > + > +static int hv_iommu_attach_dev(struct iommu_domain *domain, struct device *dev) > +{ > + u64 status; > + unsigned long flags; > + struct pci_dev *pdev; > + struct hv_input_attach_device_domain *input; > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); > + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); > + > + /* Only allow PCI devices for now */ > + if (!dev_is_pci(dev)) > + return -EINVAL; > + > + if (vdev->hv_domain == hv_domain) > + return 0; > + > + if (vdev->hv_domain) > + hv_iommu_detach_dev(&vdev->hv_domain->domain, dev); > + > + pdev = to_pci_dev(dev); > + dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? 
"un" : "", > + hv_domain->device_domain.domain_id.id); > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->device_domain = hv_domain->device_domain; > + input->device_id.as_uint64 = hv_build_logical_dev_id(pdev); > + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL); > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) { > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + } else { > + vdev->hv_domain = hv_domain; > + spin_lock_irqsave(&hv_domain->lock, flags); > + list_add(&vdev->list, &hv_domain->dev_list); > + spin_unlock_irqrestore(&hv_domain->lock, flags); > + } > + > + return hv_result_to_errno(status); > +} > + > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev) > +{ > + u64 status; > + unsigned long flags; > + struct hv_input_detach_device_domain *input; > + struct pci_dev *pdev; > + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); > + > + /* See the attach function, only PCI devices for now */ > + if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain) > + return; > + > + pdev = to_pci_dev(dev); > + > + dev_dbg(dev, "Detaching from %d\n", hv_domain->device_domain.domain_id.id); > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->partition_id = HV_PARTITION_ID_SELF; > + input->device_id.as_uint64 = hv_build_logical_dev_id(pdev); > + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL); > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + > + spin_lock_irqsave(&hv_domain->lock, flags); > + hv_flush_device_domain(hv_domain); > + list_del(&vdev->list); > + spin_unlock_irqrestore(&hv_domain->lock, flags); > + > + vdev->hv_domain = NULL; > +} > + > +static int hv_iommu_get_logical_device_property(struct device *dev, > + enum hv_logical_device_property_code code, > + struct hv_output_get_logical_device_property *property) > +{ > + u64 status; > + unsigned long flags; > + struct hv_input_get_logical_device_property *input; > + struct hv_output_get_logical_device_property *output; > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + output = *this_cpu_ptr(hyperv_pcpu_output_arg); > + memset(input, 0, sizeof(*input)); > + memset(output, 0, sizeof(*output)); General practice is to *not* zero the output area prior to a hypercall. The hypervisor should be correctly setting all the output bits. There are a couple of cases in the new MSHV code where the output is zero'ed, but I'm planning to submit a patch to remove those so that hypercall call sites that have output are consistent across the code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the right thing, and zero'ing the output could be done as a workaround. But such cases should be explicitly known with code comments indicating the reason for the zero'ing. Same applies in hv_iommu_detect(). 
> + input->partition_id = HV_PARTITION_ID_SELF; > + input->logical_device_id = hv_build_logical_dev_id(to_pci_dev(dev)); > + input->code = code; > + status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output); > + *property = *output; > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + > + return hv_result_to_errno(status); > +} > + > +static struct iommu_device *hv_iommu_probe_device(struct device *dev) > +{ > + struct pci_dev *pdev; > + struct hv_iommu_endpoint *vdev; > + struct hv_output_get_logical_device_property device_iommu_property = {0}; > + > + if (!dev_is_pci(dev)) > + return ERR_PTR(-ENODEV); > + > + if (hv_iommu_get_logical_device_property(dev, > + HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU, > + &device_iommu_property) || > + !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED)) > + return ERR_PTR(-ENODEV); > + > + pdev = to_pci_dev(dev); > + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); > + if (!vdev) > + return ERR_PTR(-ENOMEM); > + > + vdev->dev = dev; > + vdev->hv_iommu = hv_iommu_device; > + dev_iommu_priv_set(dev, vdev); > + > + if (hv_iommu_ats_supported(hv_iommu_device->cap) && > + pci_ats_supported(pdev)) > + pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap)); > + > + return &vdev->hv_iommu->iommu; > +} > + > +static void hv_iommu_release_device(struct device *dev) > +{ > + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev); > + > + if (vdev->hv_domain) > + hv_iommu_detach_dev(&vdev->hv_domain->domain, dev); > + > + dev_iommu_priv_set(dev, NULL); > + set_dma_ops(dev, NULL); > + > + kfree(vdev); > +} > + > +static struct iommu_group *hv_iommu_device_group(struct device *dev) > +{ > + if (dev_is_pci(dev)) > + return pci_device_group(dev); > + else > + return generic_device_group(dev); > +} > + > +static int hv_configure_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_type) > +{ > + u64 status; > + unsigned long flags; > + struct pt_iommu_x86_64_hw_info pt_info; > + struct hv_input_configure_device_domain *input; > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->device_domain = hv_domain->device_domain; > + input->settings.flags.blocked = (domain_type == IOMMU_DOMAIN_BLOCKED); > + input->settings.flags.translation_enabled = (domain_type != IOMMU_DOMAIN_IDENTITY); > + > + if (domain_type & __IOMMU_DOMAIN_PAGING) { > + pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64, &pt_info); > + input->settings.page_table_root = pt_info.gcr3_pt; > + input->settings.flags.first_stage_paging_mode = > + pt_info.levels == 5; > + } > + status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN, input, NULL); > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + > + return hv_result_to_errno(status); > +} > + > +static int __init hv_initialize_static_domains(void) > +{ > + int ret; > + struct hv_iommu_domain *hv_domain; > + > + /* Default stage-1 identity domain */ > + hv_domain = &hv_identity_domain; > + memset(hv_domain, 0, sizeof(*hv_domain)); The memset() isn't necessary. hv_identity_domain is a static variable, so it is already initialized to zero. 
> + > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); > + if (ret) > + return ret; > + > + ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY); > + if (ret) > + goto delete_identity_domain; > + > + hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY; > + hv_domain->domain.ops = &hv_iommu_identity_domain_ops; > + hv_domain->domain.owner = &hv_iommu_ops; > + hv_domain->domain.geometry = hv_iommu_device->geometry; > + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap; > + INIT_LIST_HEAD(&hv_domain->dev_list); > + > + /* Default stage-1 blocked domain */ > + hv_domain = &hv_blocking_domain; > + memset(hv_domain, 0, sizeof(*hv_domain)); Same here. > + > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); > + if (ret) > + goto delete_identity_domain; > + > + ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED); > + if (ret) > + goto delete_blocked_domain; > + > + hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED; > + hv_domain->domain.ops = &hv_iommu_blocking_domain_ops; > + hv_domain->domain.owner = &hv_iommu_ops; > + hv_domain->domain.geometry = hv_iommu_device->geometry; > + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap; > + INIT_LIST_HEAD(&hv_domain->dev_list); > + > + return 0; > + > +delete_blocked_domain: > + hv_delete_device_domain(&hv_blocking_domain); > +delete_identity_domain: > + hv_delete_device_domain(&hv_identity_domain); > + return ret; > +} > + > +#define INTERRUPT_RANGE_START (0xfee00000) > +#define INTERRUPT_RANGE_END (0xfeefffff) > +static void hv_iommu_get_resv_regions(struct device *dev, > + struct list_head *head) > +{ > + struct iommu_resv_region *region; > + > + region = iommu_alloc_resv_region(INTERRUPT_RANGE_START, > + INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1, > + 0, IOMMU_RESV_MSI, GFP_KERNEL); > + if (!region) > + return; > + > + list_add_tail(®ion->list, head); > +} > + > +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain) > +{ > + u64 status; > + unsigned long flags; > + struct hv_input_flush_device_domain *input; > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->device_domain.partition_id = hv_domain->device_domain.partition_id; > + input->device_domain.owner_vtl = hv_domain->device_domain.owner_vtl; > + input->device_domain.domain_id.type = hv_domain->device_domain.domain_id.type; > + input->device_domain.domain_id.id = hv_domain->device_domain.domain_id.id; > + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL); > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > +} > + > +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain) > +{ > + hv_flush_device_domain(to_hv_iommu_domain(domain)); > +} > + > +static void hv_iommu_iotlb_sync(struct iommu_domain *domain, > + struct iommu_iotlb_gather *iotlb_gather) > +{ > + hv_flush_device_domain(to_hv_iommu_domain(domain)); > + > + iommu_put_pages_list(&iotlb_gather->freelist); > +} > + > +static void hv_iommu_paging_domain_free(struct iommu_domain *domain) > +{ > + struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain); > + > + /* Free all remaining mappings */ > + pt_iommu_deinit(&hv_domain->pt_iommu); > + > + hv_delete_device_domain(hv_domain); > + > + kfree(hv_domain); > +} > + > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = { > + .attach_dev = hv_iommu_attach_dev, > +}; > 
+ > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = { > + .attach_dev = hv_iommu_attach_dev, > +}; > + > +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = { > + .attach_dev = hv_iommu_attach_dev, > + IOMMU_PT_DOMAIN_OPS(x86_64), > + .flush_iotlb_all = hv_iommu_flush_iotlb_all, > + .iotlb_sync = hv_iommu_iotlb_sync, > + .free = hv_iommu_paging_domain_free, > +}; > + > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev) > +{ > + int ret; > + struct hv_iommu_domain *hv_domain; > + struct pt_iommu_x86_64_cfg cfg = {}; > + > + hv_domain = kzalloc(sizeof(*hv_domain), GFP_KERNEL); > + if (!hv_domain) > + return ERR_PTR(-ENOMEM); > + > + ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1); > + if (ret) { > + kfree(hv_domain); > + return ERR_PTR(ret); > + } > + > + hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap; > + hv_domain->domain.geometry = hv_iommu_device->geometry; > + hv_domain->pt_iommu.nid = dev_to_node(dev); > + INIT_LIST_HEAD(&hv_domain->dev_list); > + spin_lock_init(&hv_domain->lock); > + > + cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width; > + cfg.common.hw_max_oasz_lg2 = 52; FYI, when this code is rebased to the latest linux-next, need to set cfg.top_level as well. > + > + ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL); > + if (ret) { > + hv_delete_device_domain(hv_domain); > + return ERR_PTR(ret); > + } > + > + hv_domain->domain.ops = &hv_iommu_paging_domain_ops; > + > + ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING); > + if (ret) { > + pt_iommu_deinit(&hv_domain->pt_iommu); > + hv_delete_device_domain(hv_domain); > + return ERR_PTR(ret); > + } > + > + return &hv_domain->domain; > +} > + > +static struct iommu_ops hv_iommu_ops = { > + .capable = hv_iommu_capable, > + .domain_alloc_paging = hv_iommu_domain_alloc_paging, > + .probe_device = hv_iommu_probe_device, > + .release_device = hv_iommu_release_device, > + .device_group = hv_iommu_device_group, > + .get_resv_regions = hv_iommu_get_resv_regions, > + .owner = THIS_MODULE, > + .identity_domain = &hv_identity_domain.domain, > + .blocked_domain = &hv_blocking_domain.domain, > + .release_domain = &hv_blocking_domain.domain, > +}; > + > +static void hv_iommu_shutdown(void) > +{ > + iommu_device_sysfs_remove(&hv_iommu_device->iommu); > + > + kfree(hv_iommu_device); > +} > + > +static struct syscore_ops hv_iommu_syscore_ops = { > + .shutdown = hv_iommu_shutdown, > +}; Why is a shutdown needed at all? hv_iommu_shutdown() doesn't do anything that really needed, since sysfs entries are transient, and freeing memory isn't relevant for a shutdown. 
> + > +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap) > +{ > + u64 status; > + unsigned long flags; > + struct hv_input_get_iommu_capabilities *input; > + struct hv_output_get_iommu_capabilities *output; > + > + local_irq_save(flags); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + output = *this_cpu_ptr(hyperv_pcpu_output_arg); > + memset(input, 0, sizeof(*input)); > + memset(output, 0, sizeof(*output)); > + input->partition_id = HV_PARTITION_ID_SELF; > + status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output); > + *hv_iommu_cap = *output; > + > + local_irq_restore(flags); > + > + if (!hv_result_success(status)) > + pr_err("%s: hypercall failed, status %lld\n", __func__, status); > + > + return hv_result_to_errno(status); > +} > + > +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu, > + struct hv_output_get_iommu_capabilities *hv_iommu_cap) > +{ > + ida_init(&hv_iommu->domain_ids); > + > + hv_iommu->cap = hv_iommu_cap->iommu_cap; > + hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width; > + if (!hv_iommu_5lvl_supported(hv_iommu->cap) && > + hv_iommu->max_iova_width > 48) { > + pr_err("5-level paging not supported, limiting iova width to 48.\n"); > + hv_iommu->max_iova_width = 48; > + } > + > + hv_iommu->geometry = (struct iommu_domain_geometry) { > + .aperture_start = 0, > + .aperture_end = (((u64)1) << hv_iommu_cap->max_iova_width) - 1, > + .force_aperture = true, > + }; > + > + hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1; > + hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1; > + hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap; > + hv_iommu_device = hv_iommu; > +} > + > +static int __init hv_iommu_init(void) > +{ > + int ret = 0; > + struct hv_iommu_dev *hv_iommu = NULL; > + struct hv_output_get_iommu_capabilities hv_iommu_cap = {0}; > + > + if (no_iommu || iommu_detected) > + return -ENODEV; > + > + if (!hv_is_hyperv_initialized()) > + return -ENODEV; > + > + if (hv_iommu_detect(&hv_iommu_cap) || > + !hv_iommu_present(hv_iommu_cap.iommu_cap) || > + !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) > + return -ENODEV; > + > + iommu_detected = 1; > + pci_request_acs(); > + > + hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL); > + if (!hv_iommu) > + return -ENOMEM; > + > + hv_init_iommu_device(hv_iommu, &hv_iommu_cap); > + > + ret = hv_initialize_static_domains(); > + if (ret) { > + pr_err("hv_initialize_static_domains failed: %d\n", ret); > + goto err_sysfs_remove; > + } > + > + ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu"); > + if (ret) { > + pr_err("iommu_device_sysfs_add failed: %d\n", ret); > + goto err_free; > + } > + Extra blank line. > + > + ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL); > + if (ret) { > + pr_err("iommu_device_register failed: %d\n", ret); > + goto err_sysfs_remove; > + } > + > + register_syscore_ops(&hv_iommu_syscore_ops); Per above, not sure why this is needed. > + > + pr_info("Microsoft Hypervisor IOMMU initialized\n"); Could this be changed to fit the "standardized" messages that are output about Hyper-V specific code? 
They all start with "Hyper-V: ", such as these: [ 0.000000] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2 [ 0.000000] Hyper-V: Nested features: 0x0 [ 0.000000] Hyper-V: LAPIC Timer Frequency: 0xc3500 [ 0.000000] Hyper-V: Using hypercall for remote TLB flush [ 0.019223] Hyper-V: PV spinlocks enabled [ 0.052575] Hyper-V: Hypervisor Build 10.0.26100.7462-7-0 [ 0.052577] Hyper-V: enabling crash_kexec_post_notifiers [ 0.052633] Hyper-V: Using IPI hypercalls Maybe "Hyper-V: PV IOMMU initialized"? > + return 0; > + > +err_sysfs_remove: > + iommu_device_sysfs_remove(&hv_iommu->iommu); > +err_free: > + kfree(hv_iommu); > + return ret; > +} > + > +device_initcall(hv_iommu_init); I'm concerned about the timing of this initialization. VMBus is initialized with subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6. So VMBus initialization happens quite a bit earlier, and the hypervisor starts offering devices to the guest, including PCI pass-thru devices, before the IOMMU initialization starts. I cobbled together a way to make this IOMMU code run in an Azure VM using the identity domain. The VM has an NVMe OS disk, two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and completed hv_pci_probe() before this IOMMU initialization was started. When IOMMU initialization did run, it went back and found the NVMe devices. But I'm unsure if that's OK because my hacked together environment obviously couldn't do real IOMMU mapping. It appears that the NVMe device driver didn't start its initialization until after the IOMMU driver was setup, which would probably make everything OK. But that might be just timing luck, or maybe there's something that affirmatively prevents the native PCI driver (like NVMe) from getting started until after all the initcalls have finished. I'm planning to look at this further to see if there's a way for a PCI driver to try initializing a pass-thru device *before* this IOMMU driver has initialized. If so, a different way to do the IOMMU initialization will be needed that is linked to VMBus initialization so things can't happen out-of-order. Establishing such a linkage is probably a good idea regardless. FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups, and devices coming and going dynamically all worked correctly. When a device was removed, it was moved to the blocking domain, and then flushed before being finally removed. All good! I wish I had a way to test with an IOMMU paging domain that was doing real translation. > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h > new file mode 100644 > index 000000000000..c8657e791a6e > --- /dev/null > +++ b/drivers/iommu/hyperv/iommu.h > @@ -0,0 +1,53 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > + > +/* > + * Hyper-V IOMMU driver. > + * > + * Copyright (C) 2024-2025, Microsoft, Inc. 
> + * > + */ > + > +#ifndef _HYPERV_IOMMU_H > +#define _HYPERV_IOMMU_H > + > +struct hv_iommu_dev { > + struct iommu_device iommu; > + struct ida domain_ids; > + > + /* Device configuration */ > + u8 max_iova_width; > + u8 max_pasid_width; > + u64 cap; > + u64 pgsize_bitmap; > + > + struct iommu_domain_geometry geometry; > + u64 first_domain; > + u64 last_domain; > +}; > + > +struct hv_iommu_domain { > + union { > + struct iommu_domain domain; > + struct pt_iommu pt_iommu; > + struct pt_iommu_x86_64 pt_iommu_x86_64; > + }; > + struct hv_iommu_dev *hv_iommu; > + struct hv_input_device_domain device_domain; > + u64 pgsize_bitmap; > + > + spinlock_t lock; /* protects dev_list and TLB flushes */ > + /* List of devices in this DMA domain */ It appears that this list is really a list of endpoints (i.e., struct hv_iommu_endpoint), not devices (which I read to be struct hv_iommu_dev). But that said, what is the list used for? I see code to add endpoints to the list, and to remove then, but the list is never walked by any code in this patch set. If there is an anticipated future use, it would be better to add the list as part of the code for that future use. > + struct list_head dev_list; > +}; > + > +struct hv_iommu_endpoint { > + struct device *dev; > + struct hv_iommu_dev *hv_iommu; > + struct hv_iommu_domain *hv_domain; > + struct list_head list; /* For domain->dev_list */ > +}; > + > +#define to_hv_iommu_domain(d) \ > + container_of(d, struct hv_iommu_domain, domain) > + > +#endif /* _HYPERV_IOMMU_H */ > -- > 2.49.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
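For readers less familiar with initcall ordering, the levels behind the timing concern raised above are sketched here (abridged from include/linux/init.h; lower numbers run earlier):

        #define subsys_initcall(fn)     __define_initcall(fn, 4)   /* level 4: VMBus comes up here */
        #define fs_initcall(fn)         __define_initcall(fn, 5)
        #define device_initcall(fn)     __define_initcall(fn, 6)   /* level 6: hv_iommu_init() runs here */

Because the PV IOMMU registers two levels after VMBus, pass-through devices can be offered and complete hv_pci_probe() before the IOMMU driver exists, which is exactly the window described above.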
* Re: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2026-01-08 18:48 ` Michael Kelley @ 2026-01-12 16:56 ` Yu Zhang 2026-01-12 17:48 ` Michael Kelley 0 siblings, 1 reply; 28+ messages in thread From: Yu Zhang @ 2026-01-12 16:56 UTC (permalink / raw) To: Michael Kelley Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote: > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM <snip> Thank you so much, Michael, for the thorough review! I've snipped some comments I fully agree with and will address in next version. Actually, I have to admit I agree with your remaining comments below as well. :) > > +struct hv_iommu_dev *hv_iommu_device; > > +static struct hv_iommu_domain hv_identity_domain; > > +static struct hv_iommu_domain hv_blocking_domain; > > Why is hv_iommu_device allocated dynamically while the two > domains are allocated statically? Seems like the approach could > be consistent, though maybe there's some reason I'm missing. > On second thought, `hv_identity_domain` and `hv_blocking_domain` should likely be allocated dynamically as well, consistent with `hv_iommu_device`. <snip> > > +static int hv_iommu_get_logical_device_property(struct device *dev, > > + enum hv_logical_device_property_code code, > > + struct hv_output_get_logical_device_property *property) > > +{ > > + u64 status; > > + unsigned long flags; > > + struct hv_input_get_logical_device_property *input; > > + struct hv_output_get_logical_device_property *output; > > + > > + local_irq_save(flags); > > + > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > > + output = *this_cpu_ptr(hyperv_pcpu_output_arg); > > + memset(input, 0, sizeof(*input)); > > + memset(output, 0, sizeof(*output)); > > General practice is to *not* zero the output area prior to a hypercall. The hypervisor > should be correctly setting all the output bits. There are a couple of cases in the new > MSHV code where the output is zero'ed, but I'm planning to submit a patch to > remove those so that hypercall call sites that have output are consistent across the > code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the > right thing, and zero'ing the output could be done as a workaround. But such cases > should be explicitly known with code comments indicating the reason for the > zero'ing. > > Same applies in hv_iommu_detect(). > Thanks for the information! Just to clarify: this is only because Hyper-V is supposed to zero the output page, and for input page, memset is still needed. Am I correct? <snip> > > +static void hv_iommu_shutdown(void) > > +{ > > + iommu_device_sysfs_remove(&hv_iommu_device->iommu); > > + > > + kfree(hv_iommu_device); > > +} > > + > > +static struct syscore_ops hv_iommu_syscore_ops = { > > + .shutdown = hv_iommu_shutdown, > > +}; > > Why is a shutdown needed at all? 
hv_iommu_shutdown() doesn't do anything > that really needed, since sysfs entries are transient, and freeing memory isn't > relevant for a shutdown. > For iommu_device_sysfs_remove(), I guess they are not necessary, and I will need to do some homework to better understand the sysfs. :) Originally, we wanted a shutdown routine to trigger some hypercall, so that Hyper-V will disable the DMA translation, e.g., during the VM reboot process. <snip> > > +device_initcall(hv_iommu_init); > > I'm concerned about the timing of this initialization. VMBus is initialized with > subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6. > So VMBus initialization happens quite a bit earlier, and the hypervisor starts > offering devices to the guest, including PCI pass-thru devices, before the > IOMMU initialization starts. I cobbled together a way to make this IOMMU code > run in an Azure VM using the identity domain. The VM has an NVMe OS disk, > two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and > completed hv_pci_probe() before this IOMMU initialization was started. When > IOMMU initialization did run, it went back and found the NVMe devices. But > I'm unsure if that's OK because my hacked together environment obviously > couldn't do real IOMMU mapping. It appears that the NVMe device driver > didn't start its initialization until after the IOMMU driver was setup, which > would probably make everything OK. But that might be just timing luck, or > maybe there's something that affirmatively prevents the native PCI driver > (like NVMe) from getting started until after all the initcalls have finished. > This is yet another immature attempt by me to do the hv_iommu_init() in an arch-independent path. And I do not think using device_initcall() is harmless. This patch set was tested using an assigned Intel DSA device, and the DMA tests succeeded w/o any error. But that is not enough to justify using device_initcall(): I reset the idxd driver as kernel builtin and realized, just like you said, both hv_pci_probe() and idxd_pci_probe() were triggered before hv_iommu_init(), and when pvIOMMU tries to probe the endpoint device, a warning is printed: [ 3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here! > I'm planning to look at this further to see if there's a way for a PCI driver > to try initializing a pass-thru device *before* this IOMMU driver has initialized. > If so, a different way to do the IOMMU initialization will be needed that is > linked to VMBus initialization so things can't happen out-of-order. Establishing > such a linkage is probably a good idea regardless. > > FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with > the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups, > and devices coming and going dynamically all worked correctly. When a device > was removed, it was moved to the blocking domain, and then flushed before > being finally removed. All good! I wish I had a way to test with an IOMMU > paging domain that was doing real translation. > Thank you, Michael! I really appreciate you running these extra experiments! My tests on this DSA device passed (using paging domain) too, with no DMA errors observed (regardless its driver is builtin or as a kernel module). But that doesn't make me confident about using `device_initcall`. I believe your concern is valid. 
E.g., an endpoint device might allocate a DMA address( using a raw GPA, instead of gIOVA) before pvIOMMU is initialized, and then use that address for DMA later, after a paging domain is attached? > > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h > > new file mode 100644 > > index 000000000000..c8657e791a6e > > --- /dev/null > > +++ b/drivers/iommu/hyperv/iommu.h > > @@ -0,0 +1,53 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > + > > +/* > > + * Hyper-V IOMMU driver. > > + * > > + * Copyright (C) 2024-2025, Microsoft, Inc. > > + * > > + */ > > + > > +#ifndef _HYPERV_IOMMU_H > > +#define _HYPERV_IOMMU_H > > + > > +struct hv_iommu_dev { > > + struct iommu_device iommu; > > + struct ida domain_ids; > > + > > + /* Device configuration */ > > + u8 max_iova_width; > > + u8 max_pasid_width; > > + u64 cap; > > + u64 pgsize_bitmap; > > + > > + struct iommu_domain_geometry geometry; > > + u64 first_domain; > > + u64 last_domain; > > +}; > > + > > +struct hv_iommu_domain { > > + union { > > + struct iommu_domain domain; > > + struct pt_iommu pt_iommu; > > + struct pt_iommu_x86_64 pt_iommu_x86_64; > > + }; > > + struct hv_iommu_dev *hv_iommu; > > + struct hv_input_device_domain device_domain; > > + u64 pgsize_bitmap; > > + > > + spinlock_t lock; /* protects dev_list and TLB flushes */ > > + /* List of devices in this DMA domain */ > > It appears that this list is really a list of endpoints (i.e., struct > hv_iommu_endpoint), not devices (which I read to be struct > hv_iommu_dev). > > But that said, what is the list used for? I see code to add > endpoints to the list, and to remove then, but the list is never > walked by any code in this patch set. If there is an anticipated > future use, it would be better to add the list as part of the code > for that future use. > Yes, we do not really need this list for this patch set. Thanks! B.R. Yu ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2026-01-12 16:56 ` Yu Zhang @ 2026-01-12 17:48 ` Michael Kelley 2026-01-13 17:29 ` Jacob Pan 0 siblings, 1 reply; 28+ messages in thread From: Michael Kelley @ 2026-01-12 17:48 UTC (permalink / raw) To: Yu Zhang Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, January 12, 2026 8:56 AM > > On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote: > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > <snip> > Thank you so much, Michael, for the thorough review! > > I've snipped some comments I fully agree with and will address in > next version. Actually, I have to admit I agree with your remaining > comments below as well. :) > > > > +struct hv_iommu_dev *hv_iommu_device; > > > +static struct hv_iommu_domain hv_identity_domain; > > > +static struct hv_iommu_domain hv_blocking_domain; > > > > Why is hv_iommu_device allocated dynamically while the two > > domains are allocated statically? Seems like the approach could > > be consistent, though maybe there's some reason I'm missing. > > > > On second thought, `hv_identity_domain` and `hv_blocking_domain` should > likely be allocated dynamically as well, consistent with `hv_iommu_device`. I don't know if there's a strong rationale either way (static allocation vs. dynamic). If the long-term expectation is that there is never more than one PV IOMMU in a guest, then static would be OK. If future direction allows that there could be multiple PV IOMMUs in a guest, then doing dynamic from the start is justifiable (though the current PV IOMMU hypercalls seem to assume only one PV IOMMU). But either way, being consistent is desirable. > > <snip> > > > +static int hv_iommu_get_logical_device_property(struct device *dev, > > > + enum hv_logical_device_property_code code, > > > + struct hv_output_get_logical_device_property *property) > > > +{ > > > + u64 status; > > > + unsigned long flags; > > > + struct hv_input_get_logical_device_property *input; > > > + struct hv_output_get_logical_device_property *output; > > > + > > > + local_irq_save(flags); > > > + > > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > > > + output = *this_cpu_ptr(hyperv_pcpu_output_arg); > > > + memset(input, 0, sizeof(*input)); > > > + memset(output, 0, sizeof(*output)); > > > > General practice is to *not* zero the output area prior to a hypercall. The hypervisor > > should be correctly setting all the output bits. There are a couple of cases in the new > > MSHV code where the output is zero'ed, but I'm planning to submit a patch to > > remove those so that hypercall call sites that have output are consistent across the > > code base. Of course, it's possible to have a Hyper-V bug where it doesn't do the > > right thing, and zero'ing the output could be done as a workaround. But such cases > > should be explicitly known with code comments indicating the reason for the > > zero'ing. 
> > > > Same applies in hv_iommu_detect(). > > > > Thanks for the information! Just to clarify: this is only because Hyper-V is > supposed to zero the output page, and for input page, memset is still needed. > Am I correct? Yes, you are correct. The general TLFS requirement for hypercall input is that unused fields and bits are set to zero. This requirement ensures forward compatibility if a later version of the hypervisor assigns some meaning to previously unused fields/bits. So best practice for hypercall call sites is to use memset() to zero the entire input area, and then specific field values are set on top of that. Any fields/bits that aren't explicitly set then meet the TLFS requirement. It would be OK if a hypercall call site explicitly set every field/bit instead of using memset(), but it's easy to unintentionally miss a field/bit and create a forward compatibility problem. However, when the hypercall input contains a large array, the code usually does *not* do memset() on the large array because of the perf impact, but instead the code populating the large array must be careful to not leave any bits uninitialized. For hypercall output, the hypervisor essentially has the same requirement. It should make sure that any unused fields/bits in the output area are zero, so that the Linux guest can properly deal with a future hypervisor version that assigns meaning to previously unused fields/bits. > > <snip> > > > > +static void hv_iommu_shutdown(void) > > > +{ > > > + iommu_device_sysfs_remove(&hv_iommu_device->iommu); > > > + > > > + kfree(hv_iommu_device); > > > +} > > > + > > > +static struct syscore_ops hv_iommu_syscore_ops = { > > > + .shutdown = hv_iommu_shutdown, > > > +}; > > > > Why is a shutdown needed at all? hv_iommu_shutdown() doesn't do anything > > that really needed, since sysfs entries are transient, and freeing memory isn't > > relevant for a shutdown. > > > > For iommu_device_sysfs_remove(), I guess they are not necessary, and > I will need to do some homework to better understand the sysfs. :) > Originally, we wanted a shutdown routine to trigger some hypercall, > so that Hyper-V will disable the DMA translation, e.g., during the VM > reboot process. I would presume that if Hyper-V reboots the VM, Hyper-V automatically resets the PV IOMMU and prevents any further DMA operations. But consider kexec(), where a new kernel gets loaded without going through the hypervisor "reboot-this-VM" path. There have been problems in the past with kexec() where parts of Hyper-V state for the guest didn't get reset, and the PV IOMMU is likely something in that category. So there may indeed be a need to tell the hypervisor to reset everything related to the PV IOMMU. There are already functions to do Hyper-V cleanup: see vmbus_initiate_unload() and hyperv_cleanup(). These existing functions may be a better place to do PV IOMMU cleanup/reset if needed. > > <snip> > > > > +device_initcall(hv_iommu_init); > > > > I'm concerned about the timing of this initialization. VMBus is initialized with > > subsys_initcall(), which is initcall level 4 while device_initcall() is initcall level 6. > > So VMBus initialization happens quite a bit earlier, and the hypervisor starts > > offering devices to the guest, including PCI pass-thru devices, before the > > IOMMU initialization starts. I cobbled together a way to make this IOMMU code > > run in an Azure VM using the identity domain. The VM has an NVMe OS disk, > > two NVMe data disks, and a MANA NIC. 
The NVMe devices were offered, and > > completed hv_pci_probe() before this IOMMU initialization was started. When > > IOMMU initialization did run, it went back and found the NVMe devices. But > > I'm unsure if that's OK because my hacked together environment obviously > > couldn't do real IOMMU mapping. It appears that the NVMe device driver > > didn't start its initialization until after the IOMMU driver was setup, which > > would probably make everything OK. But that might be just timing luck, or > > maybe there's something that affirmatively prevents the native PCI driver > > (like NVMe) from getting started until after all the initcalls have finished. > > > > This is yet another immature attempt by me to do the hv_iommu_init() in > an arch-independent path. And I do not think using device_initcall() is > harmless. This patch set was tested using an assigned Intel DSA device, > and the DMA tests succeeded w/o any error. But that is not enough to > justify using device_initcall(): I reset the idxd driver as kernel > builtin and realized, just like you said, both hv_pci_probe() and > idxd_pci_probe() were triggered before hv_iommu_init(), and when pvIOMMU > tries to probe the endpoint device, a warning is printed: > > [ 3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here! > You succeeded in doing what I was going to try! I won't spend time on it now. > > I'm planning to look at this further to see if there's a way for a PCI driver > > to try initializing a pass-thru device *before* this IOMMU driver has initialized. > > If so, a different way to do the IOMMU initialization will be needed that is > > linked to VMBus initialization so things can't happen out-of-order. Establishing > > such a linkage is probably a good idea regardless. > > > > FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with > > the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups, > > and devices coming and going dynamically all worked correctly. When a device > > was removed, it was moved to the blocking domain, and then flushed before > > being finally removed. All good! I wish I had a way to test with an IOMMU > > paging domain that was doing real translation. > > > > Thank you, Michael! I really appreciate you running these extra experiments! > > My tests on this DSA device passed (using paging domain) too, with no DMA > errors observed (regardless its driver is builtin or as a kernel module). > But that doesn't make me confident about using `device_initcall`. I believe > your concern is valid. E.g., an endpoint device might allocate a DMA address( > using a raw GPA, instead of gIOVA) before pvIOMMU is initialized, and then > use that address for DMA later, after a paging domain is attached? Yes, that's exactly my concern. > > > > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h > > > new file mode 100644 > > > index 000000000000..c8657e791a6e > > > --- /dev/null > > > +++ b/drivers/iommu/hyperv/iommu.h > > > @@ -0,0 +1,53 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > + > > > +/* > > > + * Hyper-V IOMMU driver. > > > + * > > > + * Copyright (C) 2024-2025, Microsoft, Inc. 
> > > + * > > > + */ > > > + > > > +#ifndef _HYPERV_IOMMU_H > > > +#define _HYPERV_IOMMU_H > > > + > > > +struct hv_iommu_dev { > > > + struct iommu_device iommu; > > > + struct ida domain_ids; > > > + > > > + /* Device configuration */ > > > + u8 max_iova_width; > > > + u8 max_pasid_width; > > > + u64 cap; > > > + u64 pgsize_bitmap; > > > + > > > + struct iommu_domain_geometry geometry; > > > + u64 first_domain; > > > + u64 last_domain; > > > +}; > > > + > > > +struct hv_iommu_domain { > > > + union { > > > + struct iommu_domain domain; > > > + struct pt_iommu pt_iommu; > > > + struct pt_iommu_x86_64 pt_iommu_x86_64; > > > + }; > > > + struct hv_iommu_dev *hv_iommu; > > > + struct hv_input_device_domain device_domain; > > > + u64 pgsize_bitmap; > > > + > > > + spinlock_t lock; /* protects dev_list and TLB flushes */ > > > + /* List of devices in this DMA domain */ > > > > It appears that this list is really a list of endpoints (i.e., struct > > hv_iommu_endpoint), not devices (which I read to be struct > > hv_iommu_dev). > > > > But that said, what is the list used for? I see code to add > > endpoints to the list, and to remove then, but the list is never > > walked by any code in this patch set. If there is an anticipated > > future use, it would be better to add the list as part of the code > > for that future use. > > > > Yes, we do not really need this list for this patch set. Thanks! > > B.R. > Yu ^ permalink raw reply [flat|nested] 28+ messages in thread
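To make the convention above concrete, the capability query from the posted patch could be reduced to the following sketch: the input page is cleared with memset() so that unused fields and bits satisfy the TLFS requirement, while the output area is left untouched for the hypervisor to populate (illustrative only):

        static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
        {
                struct hv_input_get_iommu_capabilities *input;
                struct hv_output_get_iommu_capabilities *output;
                unsigned long flags;
                u64 status;

                local_irq_save(flags);

                input = *this_cpu_ptr(hyperv_pcpu_input_arg);
                output = *this_cpu_ptr(hyperv_pcpu_output_arg);

                /* TLFS: unused input fields/bits must be zero, so clear everything. */
                memset(input, 0, sizeof(*input));
                input->partition_id = HV_PARTITION_ID_SELF;

                /* No memset() on the output: the hypervisor sets every output bit. */
                status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);
                *hv_iommu_cap = *output;

                local_irq_restore(flags);

                if (!hv_result_success(status))
                        pr_err("%s: hypercall failed, status %lld\n", __func__, status);

                return hv_result_to_errno(status);
        }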
* Re: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2026-01-12 17:48 ` Michael Kelley @ 2026-01-13 17:29 ` Jacob Pan 2026-01-14 15:43 ` Michael Kelley 0 siblings, 1 reply; 28+ messages in thread From: Jacob Pan @ 2026-01-13 17:29 UTC (permalink / raw) To: Michael Kelley Cc: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org Hi Michael, On Mon, 12 Jan 2026 17:48:30 +0000 Michael Kelley <mhklinux@outlook.com> wrote: > From: Michael Kelley <mhklinux@outlook.com> > To: Yu Zhang <zhangyu1@linux.microsoft.com> > CC: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, > "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>, > "iommu@lists.linux.dev" <iommu@lists.linux.dev>, > "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>, > "kys@microsoft.com" <kys@microsoft.com>, "haiyangz@microsoft.com" > <haiyangz@microsoft.com>, "wei.liu@kernel.org" <wei.liu@kernel.org>, > "decui@microsoft.com" <decui@microsoft.com>, "lpieralisi@kernel.org" > <lpieralisi@kernel.org>, "kwilczynski@kernel.org" > <kwilczynski@kernel.org>, "mani@kernel.org" <mani@kernel.org>, > "robh@kernel.org" <robh@kernel.org>, "bhelgaas@google.com" > <bhelgaas@google.com>, "arnd@arndb.de" <arnd@arndb.de>, > "joro@8bytes.org" <joro@8bytes.org>, "will@kernel.org" > <will@kernel.org>, "robin.murphy@arm.com" <robin.murphy@arm.com>, > "easwar.hariharan@linux.microsoft.com" > <easwar.hariharan@linux.microsoft.com>, > "jacob.pan@linux.microsoft.com" <jacob.pan@linux.microsoft.com>, > "nunodasneves@linux.microsoft.com" > <nunodasneves@linux.microsoft.com>, "mrathor@linux.microsoft.com" > <mrathor@linux.microsoft.com>, "peterz@infradead.org" > <peterz@infradead.org>, "linux-arch@vger.kernel.org" > <linux-arch@vger.kernel.org> Subject: RE: [RFC v1 5/5] iommu/hyperv: > Add para-virtualized IOMMU support for Hyper-V guest Date: Mon, 12 > Jan 2026 17:48:30 +0000 > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, January > 12, 2026 8:56 AM > > > > On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote: > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, > > > December 8, 2025 9:11 PM > > > > <snip> > > Thank you so much, Michael, for the thorough review! > > > > I've snipped some comments I fully agree with and will address in > > next version. Actually, I have to admit I agree with your remaining > > comments below as well. :) > > > > > > +struct hv_iommu_dev *hv_iommu_device; > > > > +static struct hv_iommu_domain hv_identity_domain; > > > > +static struct hv_iommu_domain hv_blocking_domain; > > > > > > Why is hv_iommu_device allocated dynamically while the two > > > domains are allocated statically? Seems like the approach could > > > be consistent, though maybe there's some reason I'm missing. > > > > > > > On second thought, `hv_identity_domain` and `hv_blocking_domain` > > should likely be allocated dynamically as well, consistent with > > `hv_iommu_device`. > > I don't know if there's a strong rationale either way (static > allocation vs. dynamic). 
> If the long-term expectation is that there is never more than one PV
> IOMMU in a guest, then static would be OK. If future direction allows
> that there could be multiple PV IOMMUs in a guest, then doing dynamic
> from the start is justifiable (though the current PV IOMMU hypercalls
> seem to assume only one PV IOMMU). But either way, being consistent
> is desirable.
>
I believe we only need a single global static identity domain here,
regardless of how many vIOMMUs there may be. From the guest's
perspective, the hvIOMMU only supports hardware-passthrough identity
domains, which do not maintain any per-IOMMU state, i.e., there is no
S1 IO page table based identity domain.

The expectation of physical IOMMU settings for the guest identity
domain should be as follows:
- Intel VT-d: PASID entry PGTT = 010b (Second-stage Translation only)
- AMD: DTE TV=1; GV=0

> >
> > <snip>
> >
> > > > +static void hv_iommu_shutdown(void)
> > > > +{
> > > > +	iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > > > +
> > > > +	kfree(hv_iommu_device);
> > > > +}
> > > > +
> > > > +static struct syscore_ops hv_iommu_syscore_ops = {
> > > > +	.shutdown = hv_iommu_shutdown,
> > > > +};
> [...]
> >
> > For iommu_device_sysfs_remove(), I guess they are not necessary, and
> > I will need to do some homework to better understand the sysfs. :)
> > Originally, we wanted a shutdown routine to trigger some hypercall,
> > so that Hyper-V will disable the DMA translation, e.g., during the
> > VM reboot process.
>
> I would presume that if Hyper-V reboots the VM, Hyper-V automatically
> resets the PV IOMMU and prevents any further DMA operations. But
> consider kexec(), where a new kernel gets loaded without going through
> the hypervisor "reboot-this-VM" path. There have been problems in the
> past with kexec() where parts of Hyper-V state for the guest didn't
> get reset, and the PV IOMMU is likely something in that category. So
> there may indeed be a need to tell the hypervisor to reset everything
> related to the PV IOMMU. There are already functions to do Hyper-V
> cleanup: see vmbus_initiate_unload() and hyperv_cleanup(). These
> existing functions may be a better place to do PV IOMMU cleanup/reset
> if needed.
That would be my vote also.

^ permalink raw reply	[flat|nested] 28+ messages in thread
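For context on the kexec concern above, the suggestion is to reset PV
IOMMU state from the existing Hyper-V teardown path rather than from a
syscore_ops .shutdown hook. A minimal sketch, under the assumption of a
hypothetical hv_iommu_reset() wrapper standing in for whatever hypercall
sequence the next revision uses to revoke guest-created domains; the
surrounding calls are the existing cleanup functions named above, and the
ordering only roughly mirrors the current kexec handling:

	/* Sketch: tear down PV IOMMU state on the kexec path so a newly
	 * loaded kernel does not inherit stale DMA translations.
	 * hv_iommu_reset() is hypothetical, not part of this series.
	 */
	static void hv_kexec_teardown(void)		/* hypothetical call site */
	{
		vmbus_initiate_unload(false);		/* existing VMBus teardown */
		hv_iommu_reset();			/* new: revoke PV IOMMU domains */
		hyperv_cleanup();			/* existing hypervisor cleanup */
	}

The design point is simply that the hypervisor-visible IOMMU state gets
torn down wherever the other Hyper-V guest state is already being reset,
instead of relying on the reboot path alone.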
* RE: [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest 2026-01-13 17:29 ` Jacob Pan @ 2026-01-14 15:43 ` Michael Kelley 0 siblings, 0 replies; 28+ messages in thread From: Michael Kelley @ 2026-01-14 15:43 UTC (permalink / raw) To: Jacob Pan Cc: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Jacob Pan <jacob.pan@linux.microsoft.com> Sent: Tuesday, January 13, 2026 9:30 AM > > Hi Michael, > > On Mon, 12 Jan 2026 17:48:30 +0000 > Michael Kelley <mhklinux@outlook.com> wrote: > > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, January 12, 2026 8:56 AM > > > > > > On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote: > > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > > > > > <snip> > > > Thank you so much, Michael, for the thorough review! > > > > > > I've snipped some comments I fully agree with and will address in > > > next version. Actually, I have to admit I agree with your remaining > > > comments below as well. :) > > > > > > > > +struct hv_iommu_dev *hv_iommu_device; > > > > > +static struct hv_iommu_domain hv_identity_domain; > > > > > +static struct hv_iommu_domain hv_blocking_domain; > > > > > > > > Why is hv_iommu_device allocated dynamically while the two > > > > domains are allocated statically? Seems like the approach could > > > > be consistent, though maybe there's some reason I'm missing. > > > > > > > > > > On second thought, `hv_identity_domain` and `hv_blocking_domain` > > > should likely be allocated dynamically as well, consistent with > > > `hv_iommu_device`. > > > > I don't know if there's a strong rationale either way (static > > allocation vs. dynamic). If the long-term expectation is that there > > is never more than one PV IOMMU in a guest, then static would be OK. > > If future direction allows that there could be multiple PV IOMMUs in > > a guest, then doing dynamic from the start is justifiable (though the > > current PV IOMMU hypercalls seem to assume only one PV IOMMU). But > > either way, being consistent is desirable. > > > I believe we only need a single global static identity domain here > regardless how many vIOMMUs there may be. From the guest’s perspective, > the hvIOMMU only supports hardware‑passthrough identity domains, which > do not maintain any per‑IOMMU state, i.e., there is no S1 IO page table > based identity domain. Ah yes, that makes sense. With that understanding, keeping the identity domain as a static singleton would be fine. Leave a code comment with a short explanation. 
Michael > > The expectation of physical IOMMU settings for guest identity > domain should be as follows: > - Intel vtd PASID entry PGTT = 010b (Second-stage Translation only) > - AMD DTE TV=1; GV=0 > > > > > > > <snip> > > > > > > > > +static void hv_iommu_shutdown(void) > > > > > +{ > > > > > + iommu_device_sysfs_remove(&hv_iommu_device->iommu); > > > > > + > > > > > + kfree(hv_iommu_device); > > > > > +} > > > > > + > > > > > +static struct syscore_ops hv_iommu_syscore_ops = { > > > > > + .shutdown = hv_iommu_shutdown, > > > > > +}; > > [...] > > > > > > For iommu_device_sysfs_remove(), I guess they are not necessary, and > > > I will need to do some homework to better understand the sysfs. :) > > > Originally, we wanted a shutdown routine to trigger some hypercall, > > > so that Hyper-V will disable the DMA translation, e.g., during the > > > VM reboot process. > > > > I would presume that if Hyper-V reboots the VM, Hyper-V automatically > > resets the PV IOMMU and prevents any further DMA operations. But > > consider kexec(), where a new kernel gets loaded without going through > > the hypervisor "reboot-this-VM" path. There have been problems in the > > past with kexec() where parts of Hyper-V state for the guest didn't > > get reset, and the PV IOMMU is likely something in that category. So > > there may indeed be a need to tell the hypervisor to reset everything > > related to the PV IOMMU. There are already functions to do Hyper-V > > cleanup: see vmbus_initiate_unload() and hyperv_cleanup(). These > > existing functions may be a better place to do PV IOMMU cleanup/reset > > if needed. > That would be my vote also. ^ permalink raw reply [flat|nested] 28+ messages in thread
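The static-singleton conclusion above lends itself to a short
illustration. A minimal sketch of such a singleton carrying the requested
explanatory comment, reusing the RFC's hv_identity_domain name but with
purely illustrative ops wiring:

	/*
	 * A single static identity domain is sufficient regardless of how
	 * many vIOMMUs are exposed to the guest: hardware-passthrough
	 * identity mappings carry no per-IOMMU state and no stage-1 I/O
	 * page table, so every device can share this one instance.
	 */
	static struct hv_iommu_domain hv_identity_domain = {
		.domain = {
			.type = IOMMU_DOMAIN_IDENTITY,
			/* .ops would point at the driver's identity attach handler */
		},
	};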
* RE: [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests 2025-12-09 5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang ` (4 preceding siblings ...) 2025-12-09 5:11 ` [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang @ 2026-01-08 18:45 ` Michael Kelley 2026-01-10 5:39 ` Yu Zhang 5 siblings, 1 reply; 28+ messages in thread From: Michael Kelley @ 2026-01-08 18:45 UTC (permalink / raw) To: Yu Zhang, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, iommu@lists.linux.dev, linux-pci@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org, bhelgaas@google.com, arnd@arndb.de, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com, jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com, mrathor@linux.microsoft.com, peterz@infradead.org, linux-arch@vger.kernel.org From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM > > This patch series introduces a para-virtualized IOMMU driver for > Linux guests running on Microsoft Hyper-V. The primary objective > is to enable hardware-assisted DMA isolation and scalable device Is there any particular meaning for the qualifier "scalable" vs. just "device assignment"? I just want to understand what you are getting at. > assignment for Hyper-V child partitions, bypassing the performance > overhead and complexity associated with emulated IOMMU hardware. > > The driver implements the following core functionality: > * Hypercall-based Enumeration > Unlike traditional ACPI-based discovery (e.g., DMAR/IVRS), > this driver enumerates the Hyper-V IOMMU capabilities directly > via hypercalls. This approach allows the guest to discover > IOMMU presence and features without requiring specific virtual > firmware extensions or modifications. > > * Domain Management > The driver manages IOMMU domains through a new set of Hyper-V > hypercall interfaces, handling domain allocation, attachment, > and detachment for endpoint devices. > > * IOTLB Invalidation > IOTLB invalidation requests are marshaled and issued to the > hypervisor through the same hypercall mechanism. > > * Nested Translation Support > This implementation leverages guest-managed stage-1 I/O page > tables nested with host stage-2 translations. It is built > upon the consolidated IOMMU page table framework designed by > Jason Gunthorpe [1]. This design eliminates the need for complex > emulation during map operations and ensures scalability across > different architectures. > > Implementation Notes: > * Architecture Independence > While the current implementation only supports x86 platforms (Intel > VT-d and AMD IOMMU), the driver design aims to be as architecture- > agnostic as possible. To achieve this, initialization occurs via > `device_initcall` rather than `x86_init.iommu.iommu_init`, and shutdown > is handled via `syscore_ops` instead of `x86_platform.iommu_shutdown`. > > * MSI Region Handling > In this RFC, the hardware MSI region is hard-coded to the standard > x86 interrupt range (0xfee00000 - 0xfeefffff). Future updates may > allow this configuration to be queried via hypercalls if new hardware > platforms are to be supported. > > * Reserved Regions (RMRR) > There is currently no requirement to support assigned devices with > ACPI RMRR limitations. 
Consequently, this patch series does not specify > or query reserved memory regions. > > Testing: > This series has been validated using dmatest with Intel DSA devices > assigned to the child partition. The tests confirmed successful DMA > transactions under the para-virtualized IOMMU. > > Future Work: > * Page-selective IOTLB Invalidation > The current implementation relies on full-domain flushes. Support > for page-selective invalidation is planned for a future series. > > * Advanced Features > Support for vSVA and virtual PRI will be addressed in subsequent > updates. > > * Root Partition Co-existence > Ensure compatibility with the distinct para-virtualized IOMMU driver > used by Hyper-V's Linux root partition, in which the DMA remapping > is not achieved by stage-1 IO page tables and another set of iommu > ops is provided. > > [1] https://github.com/jgunthorpe/linux/tree/iommu_pt_all > > Easwar Hariharan (2): > PCI: hv: Create and export hv_build_logical_dev_id() > iommu: Move Hyper-V IOMMU driver to its own subdirectory > > Wei Liu (1): > hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU > > Yu Zhang (2): > hyperv: allow hypercall output pages to be allocated for child > partitions > iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest > > drivers/hv/hv_common.c | 21 +- > drivers/iommu/Kconfig | 10 +- > drivers/iommu/Makefile | 2 +- > drivers/iommu/hyperv/Kconfig | 24 + > drivers/iommu/hyperv/Makefile | 3 + > drivers/iommu/hyperv/iommu.c | 608 ++++++++++++++++++ > drivers/iommu/hyperv/iommu.h | 53 ++ > .../irq_remapping.c} | 2 +- > drivers/pci/controller/pci-hyperv.c | 28 +- > include/asm-generic/mshyperv.h | 2 + > include/hyperv/hvgdk_mini.h | 8 + > include/hyperv/hvhdk_mini.h | 123 ++++ > 12 files changed, 850 insertions(+), 34 deletions(-) > create mode 100644 drivers/iommu/hyperv/Kconfig > create mode 100644 drivers/iommu/hyperv/Makefile > create mode 100644 drivers/iommu/hyperv/iommu.c > create mode 100644 drivers/iommu/hyperv/iommu.h > rename drivers/iommu/{hyperv-iommu.c => hyperv/irq_remapping.c} (99%) > > -- > 2.49.0 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests
  2026-01-08 18:45 ` [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Michael Kelley
@ 2026-01-10  5:39   ` Yu Zhang
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Zhang @ 2026-01-10  5:39 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, lpieralisi@kernel.org, kwilczynski@kernel.org,
	mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
	arnd@arndb.de, joro@8bytes.org, will@kernel.org,
	robin.murphy@arm.com, easwar.hariharan@linux.microsoft.com,
	jacob.pan@linux.microsoft.com, nunodasneves@linux.microsoft.com,
	mrathor@linux.microsoft.com, peterz@infradead.org,
	linux-arch@vger.kernel.org

On Thu, Jan 08, 2026 at 06:45:52PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, December 8, 2025 9:11 PM
> >
> > This patch series introduces a para-virtualized IOMMU driver for
> > Linux guests running on Microsoft Hyper-V. The primary objective
> > is to enable hardware-assisted DMA isolation and scalable device
>
> Is there any particular meaning for the qualifier "scalable" vs. just
> "device assignment"? I just want to understand what you are getting
> at.
>
Sorry for the ambiguity. I intended to highlight two primary use cases
for the pvIOMMU:
- to enable in-kernel DMA protection within the guest.
- to allow device assignment to guest user space (e.g., via VFIO).

I avoided using the phrase "device assignment" alone, because people may
be confused about whether the main purpose of introducing the pvIOMMU is
device assignment to an L1 guest (which actually does not depend on any
virtual IOMMU) or to an L2 nested guest (although I guess with the
pvIOMMU it should work, but we've never tested that case and are not
aware of any such requirement). And you are right, simply adding
"scalable" didn't help clarify this. I will rephrase the commit message.
Thanks!

B.R.
Yu

^ permalink raw reply	[flat|nested] 28+ messages in thread
Thread overview: 28+ messages (newest: 2026-01-14 15:43 UTC)

2025-12-09  5:11 [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang
2025-12-09  5:11 ` [RFC v1 1/5] PCI: hv: Create and export hv_build_logical_dev_id() Yu Zhang
2025-12-09  5:21 ` Randy Dunlap
2025-12-10 17:03 ` Easwar Hariharan
2025-12-10 21:39 ` Bjorn Helgaas
2025-12-11  8:31 ` Yu Zhang
2026-01-08 18:46 ` Michael Kelley
2026-01-09 18:40 ` Easwar Hariharan
2026-01-11 17:36 ` Michael Kelley
2025-12-09  5:11 ` [RFC v1 2/5] iommu: Move Hyper-V IOMMU driver to its own subdirectory Yu Zhang
2025-12-09  5:11 ` [RFC v1 3/5] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang
2026-01-08 18:47 ` Michael Kelley
2026-01-09 18:47 ` Easwar Hariharan
2026-01-09 19:24 ` Michael Kelley
2025-12-09  5:11 ` [RFC v1 4/5] hyperv: allow hypercall output pages to be allocated for child partitions Yu Zhang
2026-01-08 18:47 ` Michael Kelley
2026-01-10  5:07 ` Yu Zhang
2026-01-11 22:27 ` Michael Kelley
2025-12-09  5:11 ` [RFC v1 5/5] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang
2025-12-10 17:15 ` Easwar Hariharan
2025-12-11  8:41 ` Yu Zhang
2026-01-08 18:48 ` Michael Kelley
2026-01-12 16:56 ` Yu Zhang
2026-01-12 17:48 ` Michael Kelley
2026-01-13 17:29 ` Jacob Pan
2026-01-14 15:43 ` Michael Kelley
2026-01-08 18:45 ` [RFC v1 0/5] Hyper-V: Add para-virtualized IOMMU support for Linux guests Michael Kelley
2026-01-10  5:39 ` Yu Zhang