[PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests

* [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests
@ 2026-07-02 16:05 Yu Zhang
  2026-07-02 16:05 ` [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Yu Zhang @ 2026-07-02 16:05 UTC
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor

This patch series introduces a para-virtualized IOMMU driver for
Linux guests running on Microsoft Hyper-V. The driver enables two
primary use cases:
  1) In-kernel DMA protection for devices assigned to the guest.
  2) Device assignment to guest user space (e.g., via VFIO).

The driver implements the following core functionality:
*   Hypercall-based Enumeration
    Unlike traditional ACPI-based discovery (e.g., DMAR/IVRS),
    this driver enumerates the Hyper-V IOMMU capabilities directly
    via hypercalls. This approach allows the guest to discover
    IOMMU presence and features without requiring specific virtual
    firmware extensions or modifications.

*   Domain Management
    The driver manages IOMMU domains through a new set of Hyper-V
    hypercall interfaces, handling domain allocation and attachment
    for endpoint devices.

*   Nested Translation Support
    This implementation leverages guest-managed stage-1 I/O page
    tables nested with host stage-2 translations. It is built
    upon the consolidated IOMMU page table framework (IOMMU_PT).
    This design eliminates the need for emulating map operations.
    Both Intel VT-d and AMD IOMMU platforms are supported.

*   IOTLB Invalidation
    IOTLB invalidation requests are marshaled and issued to the
    hypervisor through the same hypercall mechanism. Both domain-
    selective and page-selective flushes are supported.
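
As an illustration of the enumeration step, the sketch below decodes
the capability bits returned by HVCALL_GET_IOMMU_CAPABILITIES. The bit
definitions are copied from patch 1. struct pviommu_features and
decode_caps() are hypothetical names, and requiring both PRESENT and S1
is an assumption about what a stage-1 driver would need, not a copy of
the driver's init path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability bits as defined in include/hyperv/hvhdk_mini.h (patch 1). */
#define HV_IOMMU_CAP_PRESENT	(1ULL << 0)
#define HV_IOMMU_CAP_S2		(1ULL << 1)
#define HV_IOMMU_CAP_S1		(1ULL << 2)
#define HV_IOMMU_CAP_S1_5LVL	(1ULL << 3)
#define HV_IOMMU_CAP_PASID	(1ULL << 4)
#define HV_IOMMU_CAP_ATS	(1ULL << 5)
#define HV_IOMMU_CAP_PRI	(1ULL << 6)

/* Hypothetical summary of the features a stage-1 driver cares about. */
struct pviommu_features {
	bool usable;		/* IOMMU present and stage-1 capable */
	bool five_level;	/* 5-level stage-1 paging available */
	bool ats;		/* ATS may be enabled on endpoints */
};

static struct pviommu_features decode_caps(uint64_t cap)
{
	struct pviommu_features f = { false, false, false };

	/* Assumption: without PRESENT and S1 the driver cannot be used. */
	f.usable = (cap & HV_IOMMU_CAP_PRESENT) && (cap & HV_IOMMU_CAP_S1);
	/* Optional features only matter when the IOMMU is usable. */
	f.five_level = f.usable && (cap & HV_IOMMU_CAP_S1_5LVL);
	f.ats = f.usable && (cap & HV_IOMMU_CAP_ATS);
	return f;
}
```

A caller would pass the iommu_cap field of the hypercall output to
decode_caps() and bail out early when usable is false.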

Implementation Notes:
*   Platform Support
    The current implementation targets x86 platforms with Intel
    VT-d and AMD IOMMU hardware.

*   MSI Region Handling
    The hardware MSI region is hard-coded to the standard x86
    interrupt range (0xfee00000 - 0xfeefffff). Future updates may
    allow this configuration to be queried via hypercalls if new
    hardware platforms are to be supported.

*   Reserved Regions (RMRR)
    There is currently no requirement to support assigned devices with
    ACPI RMRR limitations. Consequently, this patch series does not
    specify or query reserved memory regions.
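
To make the MSI region note concrete, here is a minimal sketch of an
overlap check against the hard-coded window. The bounds come from the
text above; the macro names and iova_overlaps_msi() are hypothetical
and only illustrate why an IOVA allocation has to stay out of that
range. How the driver reports the region to the IOMMU core is not
shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard x86 interrupt range quoted in the cover letter. */
#define MSI_WINDOW_START	0xfee00000ULL
#define MSI_WINDOW_LAST		0xfeefffffULL	/* inclusive */

/* Return true if [iova, iova + size) touches the MSI window. */
static bool iova_overlaps_msi(uint64_t iova, uint64_t size)
{
	uint64_t last;

	if (size == 0)
		return false;

	last = iova + size - 1;
	if (last < iova)		/* wrapped: clamp to the top */
		last = UINT64_MAX;

	return iova <= MSI_WINDOW_LAST && last >= MSI_WINDOW_START;
}
```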

Testing:
This series has been validated with the following configurations:
- Intel DSA devices assigned to the guest, tested with dmatest.
- NVMe devices assigned to the guest on AMD platforms, tested
  with fio.
- dma_map_benchmark for DMA mapping performance evaluation.

Changelog:

v1[1] -> v2:
- Dropped the "move to subdirectory" patch; the directory now exists
  upstream.

- hv: logical device ID registry:
  - Moved the registry to hv_common.c so it can be shared, and derived
    the prefix via a shared helper instead of caching it in pci-hyperv's
    private struct.
  - Moved the lookup out of the irq-disabled region (PREEMPT_RT).

- iommu/hyperv: para-virtualized IOMMU:
  - Removed the unused detach_dev op.
  - Rejected a hypervisor not advertising x86 page sizes instead of
    masking and warning.
  - Statically initialized the identity and blocking domains.
  - Gave the blocking domain its own attach op, which returns the hypercall
    status and WARNs on failure.

- iommu/hyperv: page-selective IOTLB flush:
  - Used a single descriptor covering a slightly larger power-of-two
    range, instead of splitting the range into multiple descriptors.
  - Fixed the inclusive-end corner case in the flush range calculation.
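
For readers unfamiliar with the single-descriptor approach, the sketch
below shows one way to compute a naturally aligned power-of-two block
that covers an inclusive page-frame range. It is an illustration under
stated assumptions, not the code in patch 4: struct flush_range,
bits_needed() and cover_range() are hypothetical names, and the natural
alignment requirement is assumed.

```c
#include <stdint.h>

/* Hypothetical descriptor: 2^order pages starting at base_pfn. */
struct flush_range {
	uint64_t base_pfn;
	unsigned int order;
};

/* Number of bits needed to represent v (0 for v == 0). */
static unsigned int bits_needed(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}

/*
 * Cover the inclusive range [first_pfn, last_pfn] with one aligned block.
 * The highest bit in which the two ends differ decides the block size;
 * clearing that bit and everything below it yields the aligned base.
 */
static struct flush_range cover_range(uint64_t first_pfn, uint64_t last_pfn)
{
	struct flush_range r;
	uint64_t mask;

	r.order = bits_needed(first_pfn ^ last_pfn);
	/* Shifting by 64 is undefined, so handle the full range apart. */
	mask = r.order >= 64 ? ~0ULL : (1ULL << r.order) - 1;
	r.base_pfn = first_pfn & ~mask;
	return r;
}
```

Because the end is inclusive, a one-page range yields order 0. The
trade-off is over-invalidation: a range that straddles a large
alignment boundary (for example PFNs 7 to 8) is rounded up to a much
larger block (PFNs 0 to 15).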

RFC v1[2] -> v1[1]:
- Scoped platform support to x86 only (Intel VT-d and AMD IOMMU);
  initialization now uses x86_init.iommu.iommu_init
- Added page-selective IOTLB flush support
- Disabled device ATS in hv_iommu_release_device()
- Addressed review comments from Michael Kelley:
  - Reversed dependency: pvIOMMU exports registration API for
    pci-hyperv to call, instead of pci-hyperv exporting
    hv_build_logical_dev_id()
  - Dropped separate output page allocation patch; hypercall input
    and output now share the same per-CPU page
  - Cleaned up Kconfig (removed PCI_HYPERV dependency, unnecessary
    selects)
  - Removed dev_list, per-domain spinlock, and syscore_ops
  - Removed forward declarations by reordering functions
  - Fixed typos, cleaned up Kconfig selects, improved pr_info
    messages, etc.

[1] https://lore.kernel.org/linux-hyperv/20260511162408.1180069-1-zhangyu1@linux.microsoft.com/
[2] https://lore.kernel.org/linux-hyperv/20251209051128.76913-1-zhangyu1@linux.microsoft.com/


Easwar Hariharan (1):
  Drivers: hv: Add logical device ID registry for vPCI devices

Wei Liu (1):
  hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU

Yu Zhang (2):
  iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
  iommu/hyperv: Add page-selective IOTLB flush support

 arch/x86/hyperv/hv_init.c           |   4 +
 arch/x86/include/asm/mshyperv.h     |   4 +
 drivers/hv/hv_common.c              |  95 ++++
 drivers/iommu/Kconfig               |   1 +
 drivers/iommu/hyperv/Kconfig        |  16 +
 drivers/iommu/hyperv/Makefile       |   1 +
 drivers/iommu/hyperv/iommu.c        | 686 ++++++++++++++++++++++++++++
 drivers/iommu/hyperv/iommu.h        |  51 +++
 drivers/pci/controller/pci-hyperv.c |  21 +-
 include/asm-generic/mshyperv.h      |  13 +
 include/hyperv/hvgdk_mini.h         |   9 +
 include/hyperv/hvhdk_mini.h         | 141 ++++++
 include/linux/hyperv.h              |   8 +
 13 files changed, 1045 insertions(+), 5 deletions(-)
 create mode 100644 drivers/iommu/hyperv/Kconfig
 create mode 100644 drivers/iommu/hyperv/iommu.c
 create mode 100644 drivers/iommu/hyperv/iommu.h

-- 
2.52.0



* [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
  2026-07-02 16:05 [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang
@ 2026-07-02 16:05 ` Yu Zhang
  2026-07-02 16:36   ` sashiko-bot
  2026-07-02 16:05 ` [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices Yu Zhang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Yu Zhang @ 2026-07-02 16:05 UTC
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor

From: Wei Liu <wei.liu@kernel.org>

Hyper-V guest IOMMU is a para-virtualized IOMMU based on hypercalls.
Introduce the hypercalls used by the child partition to interact with
this facility.

These hypercalls fall into the following categories:
- Detection and capability: HVCALL_GET_IOMMU_CAPABILITIES is used to
  detect the existence and capabilities of the guest IOMMU.

- Device management: HVCALL_GET_LOGICAL_DEVICE_PROPERTY is used to
  check whether an endpoint device is managed by the guest IOMMU.

- Domain management: A set of hypercalls is provided to handle the
  creation, configuration, and deletion of guest domains, as well as
  the attachment/detachment of endpoint devices to/from those domains.

- IOTLB flushing: HVCALL_FLUSH_DEVICE_DOMAIN is used to ask Hyper-V
  for a domain-selective IOTLB flush (which in its handler may flush
  the device TLB as well).
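
The domain hypercalls all identify a domain by the 64-bit value of
union hv_device_domain_id. As a rough illustration, the sketch below
packs and unpacks that value with explicit shifts. It assumes the
little-endian x86 bit-field layout produced by GCC and Clang, where
type occupies the low four bits and id the upper 32 bits; the helper
names are hypothetical.

```c
#include <stdint.h>

#define HV_DEVICE_DOMAIN_TYPE_S1	1u	/* Stage 1 domain, from patch 1 */

/* Pack type (low 4 bits) and id (upper 32 bits) into one 64-bit value. */
static uint64_t pack_domain_id(uint32_t type, uint32_t id)
{
	return ((uint64_t)id << 32) | (type & 0xfu);
}

/* Extract the numeric domain ID from the packed value. */
static uint32_t domain_id_of(uint64_t packed)
{
	return (uint32_t)(packed >> 32);
}

/* Extract the domain type from the packed value. */
static uint32_t domain_type_of(uint64_t packed)
{
	return (uint32_t)(packed & 0xfu);
}
```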

Signed-off-by: Wei Liu <wei.liu@kernel.org>
Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
---
 include/hyperv/hvgdk_mini.h |   8 +++
 include/hyperv/hvhdk_mini.h | 124 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+)

diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 6a4e8b9d570f..5bdbb44da112 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -486,10 +486,16 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_VP_INDEX_FROM_APIC_ID		0x009a
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE	0x00af
 #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST	0x00b0
+#define HVCALL_CREATE_DEVICE_DOMAIN			0x00b1
+#define HVCALL_ATTACH_DEVICE_DOMAIN			0x00b2
 #define HVCALL_SIGNAL_EVENT_DIRECT			0x00c0
 #define HVCALL_POST_MESSAGE_DIRECT			0x00c1
 #define HVCALL_DISPATCH_VP				0x00c2
+#define HVCALL_DETACH_DEVICE_DOMAIN			0x00c4
+#define HVCALL_DELETE_DEVICE_DOMAIN			0x00c5
 #define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
+#define HVCALL_CONFIGURE_DEVICE_DOMAIN			0x00ce
+#define HVCALL_FLUSH_DEVICE_DOMAIN			0x00d0
 #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
 #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
 #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
@@ -502,6 +508,8 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_MMIO_READ				0x0106
 #define HVCALL_MMIO_WRITE				0x0107
 #define HVCALL_DISABLE_HYP_EX                           0x010f
+#define HVCALL_GET_IOMMU_CAPABILITIES			0x0125
+#define HVCALL_GET_LOGICAL_DEVICE_PROPERTY		0x0127
 #define HVCALL_MAP_STATS_PAGE2				0x0131
 
 /* HV_HYPERCALL_INPUT */
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index b4cb2fa26e9b..493608e791b4 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -547,4 +547,128 @@ union hv_device_id {		/* HV_DEVICE_ID */
 	} acpi;
 } __packed;
 
+/* Device domain types */
+#define HV_DEVICE_DOMAIN_TYPE_S1	1 /* Stage 1 domain */
+
+/* ID for default domain and NULL domain */
+#define HV_DEVICE_DOMAIN_ID_DEFAULT 0
+#define HV_DEVICE_DOMAIN_ID_NULL    0xFFFFFFFFULL
+
+union hv_device_domain_id {
+	u64 as_uint64;
+	struct {
+		u32 type: 4;
+		u32 reserved: 28;
+		u32 id;
+	} __packed;
+};
+
+struct hv_input_device_domain {
+	u64 partition_id;
+	union hv_input_vtl owner_vtl;
+	u8 padding[7];
+	union hv_device_domain_id domain_id;
+} __packed;
+
+union hv_create_device_domain_flags {
+	u32 as_uint32;
+	struct {
+		u32 forward_progress_required: 1;
+		u32 inherit_owning_vtl: 1;
+		u32 reserved: 30;
+	} __packed;
+};
+
+struct hv_input_create_device_domain {
+	struct hv_input_device_domain device_domain;
+	union hv_create_device_domain_flags create_device_domain_flags;
+} __packed;
+
+struct hv_input_delete_device_domain {
+	struct hv_input_device_domain device_domain;
+} __packed;
+
+struct hv_input_attach_device_domain {
+	struct hv_input_device_domain device_domain;
+	union hv_device_id device_id;
+} __packed;
+
+struct hv_input_detach_device_domain {
+	u64 partition_id;
+	union hv_device_id device_id;
+} __packed;
+
+struct hv_device_domain_settings {
+	struct {
+		/*
+		 * Enable translations. If not enabled, all transactions bypass
+		 * S1 translations.
+		 */
+		u64 translation_enabled: 1;
+		u64 blocked: 1;
+		/*
+		 * First stage address translation paging mode:
+		 * 0: 4-level paging (default)
+		 * 1: 5-level paging
+		 */
+		u64 first_stage_paging_mode: 1;
+		u64 reserved: 61;
+	} flags;
+
+	/* Address of translation table */
+	u64 page_table_root;
+} __packed;
+
+struct hv_input_configure_device_domain {
+	struct hv_input_device_domain device_domain;
+	struct hv_device_domain_settings settings;
+} __packed;
+
+struct hv_input_get_iommu_capabilities {
+	u64 partition_id;
+	u64 reserved;
+} __packed;
+
+struct hv_output_get_iommu_capabilities {
+	u32 size;
+	u16 reserved;
+	u8  max_iova_width;
+	u8  max_pasid_width;
+
+#define HV_IOMMU_CAP_PRESENT (1ULL << 0)
+#define HV_IOMMU_CAP_S2 (1ULL << 1)
+#define HV_IOMMU_CAP_S1 (1ULL << 2)
+#define HV_IOMMU_CAP_S1_5LVL (1ULL << 3)
+#define HV_IOMMU_CAP_PASID (1ULL << 4)
+#define HV_IOMMU_CAP_ATS (1ULL << 5)
+#define HV_IOMMU_CAP_PRI (1ULL << 6)
+
+	u64 iommu_cap;
+	u64 pgsize_bitmap;
+} __packed;
+
+enum hv_logical_device_property_code {
+	HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU = 10,
+};
+
+struct hv_input_get_logical_device_property {
+	u64 partition_id;
+	u64 logical_device_id;
+	/* Takes values from enum hv_logical_device_property_code. */
+	u32 code;
+	u32 reserved;
+} __packed;
+
+struct hv_output_get_logical_device_property {
+#define HV_DEVICE_IOMMU_ENABLED (1ULL << 0)
+	u64 device_iommu;
+	u64 reserved;
+} __packed;
+
+struct hv_input_flush_device_domain {
+	struct hv_input_device_domain device_domain;
+	u32 flags;
+	u32 reserved;
+} __packed;
+
 #endif /* _HV_HVHDK_MINI_H */
-- 
2.52.0



* [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices
  2026-07-02 16:05 [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang
  2026-07-02 16:05 ` [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang
@ 2026-07-02 16:05 ` Yu Zhang
  2026-07-02 16:42   ` sashiko-bot
  2026-07-02 16:05 ` [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang
  2026-07-02 16:05 ` [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support Yu Zhang
  3 siblings, 1 reply; 11+ messages in thread
From: Yu Zhang @ 2026-07-02 16:05 UTC
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor

From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>

Hyper-V identifies each PCI pass-thru device by a logical device ID in
its hypercall interface. This ID consists of a per-bus prefix, derived
from the VMBus device instance GUID, combined with the PCI function
number of the endpoint device.

Add a small registry in hv_common.c that maps a PCI domain number to its
logical device ID prefix. The vPCI bus driver (pci-hyperv) registers the
prefix when a bus is probed and unregisters it when the bus is removed.
Consumers such as the para-virtualized IOMMU driver look up the prefix
by PCI domain number and combine it with the function number to form the
complete logical device ID for hypercalls.

The prefix construction is shared via hv_build_logical_dev_id_prefix() so
that pci-hyperv's interrupt retargeting path and the registry use exactly
the same byte layout. It is derived on demand from the constant hv_device
instance GUID rather than cached in struct hv_pcibus_device, which is
private to the pci-hyperv module; this keeps the interface narrow and
avoids depending on pci-hyperv internals.
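
The byte layout described above can be tried outside the kernel with
the standalone sketch below. The shifts and the 0xf8 mask mirror
hv_build_logical_dev_id_prefix() from this patch; struct guid_bytes and
build_logical_dev_id() are hypothetical stand-ins for the kernel's
guid_t and for the OR with PCI_FUNC() done by the callers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 16-byte VMBus device instance GUID. */
struct guid_bytes {
	uint8_t b[16];
};

/* Same byte selection as hv_build_logical_dev_id_prefix(). */
static uint32_t build_prefix(const struct guid_bytes *g)
{
	return ((uint32_t)g->b[5] << 24) |
	       ((uint32_t)g->b[4] << 16) |
	       ((uint32_t)g->b[7] << 8) |
	       (g->b[6] & 0xf8u);	/* low 3 bits left for the function */
}

/* Combine the per-bus prefix with a PCI function number (0..7). */
static uint64_t build_logical_dev_id(uint32_t prefix, unsigned int func)
{
	assert(func < 8);
	return (uint64_t)prefix | func;
}
```

Masking b[6] with 0xf8 clears the low three bits of the prefix, which
is what lets the function number be OR-ed in without colliding with
prefix bits.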

Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
---
 drivers/hv/hv_common.c              | 95 +++++++++++++++++++++++++++++
 drivers/pci/controller/pci-hyperv.c | 21 +++++--
 include/asm-generic/mshyperv.h      | 13 ++++
 include/linux/hyperv.h              |  8 +++
 4 files changed, 132 insertions(+), 5 deletions(-)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6b67ac616789..53493f8d14dc 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -26,6 +26,8 @@
 #include <linux/kmsg_dump.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
 #include <linux/dma-map-ops.h>
 #include <linux/set_memory.h>
 #include <hyperv/hvhdk.h>
@@ -863,3 +865,96 @@ const char *hv_result_to_string(u64 status)
 	return "Unknown";
 }
 EXPORT_SYMBOL_GPL(hv_result_to_string);
+
+#ifdef CONFIG_HYPERV_PVIOMMU
+/*
+ * Logical device ID registry shared between the vPCI bus driver
+ * (pci-hyperv) and the para-virtualized IOMMU driver. The vPCI driver
+ * registers the per-bus logical device ID prefix at bus probe time, and
+ * the pvIOMMU driver looks it up to build the full logical device ID used
+ * in IOMMU hypercalls.
+ */
+struct hv_pci_busdata {
+	int		 pci_domain_nr;
+	u32		 logical_dev_id_prefix;
+	struct list_head list;
+};
+
+static LIST_HEAD(hv_pci_bus_list);
+static DEFINE_SPINLOCK(hv_pci_bus_lock);
+
+int hv_iommu_register_pci_bus(int pci_domain_nr, u32 logical_dev_id_prefix)
+{
+	struct hv_pci_busdata *bus, *new;
+	int ret = 0;
+
+	new = kzalloc_obj(*new, GFP_KERNEL);
+	if (!new)
+		return -ENOMEM;
+
+	spin_lock(&hv_pci_bus_lock);
+	list_for_each_entry(bus, &hv_pci_bus_list, list) {
+		if (bus->pci_domain_nr != pci_domain_nr)
+			continue;
+
+		if (bus->logical_dev_id_prefix != logical_dev_id_prefix) {
+			pr_err("stale registration for PCI domain %d (old prefix 0x%08x, new 0x%08x)\n",
+			       pci_domain_nr, bus->logical_dev_id_prefix,
+			       logical_dev_id_prefix);
+			ret = -EEXIST;
+		}
+
+		goto out_free;
+	}
+
+	new->pci_domain_nr = pci_domain_nr;
+	new->logical_dev_id_prefix = logical_dev_id_prefix;
+	list_add(&new->list, &hv_pci_bus_list);
+	spin_unlock(&hv_pci_bus_lock);
+	return 0;
+
+out_free:
+	spin_unlock(&hv_pci_bus_lock);
+	kfree(new);
+	return ret;
+}
+EXPORT_SYMBOL_FOR_MODULES(hv_iommu_register_pci_bus, "pci-hyperv");
+
+void hv_iommu_unregister_pci_bus(int pci_domain_nr)
+{
+	struct hv_pci_busdata *bus, *tmp;
+
+	spin_lock(&hv_pci_bus_lock);
+	list_for_each_entry_safe(bus, tmp, &hv_pci_bus_list, list) {
+		if (bus->pci_domain_nr == pci_domain_nr) {
+			list_del(&bus->list);
+			kfree(bus);
+			break;
+		}
+	}
+	spin_unlock(&hv_pci_bus_lock);
+}
+EXPORT_SYMBOL_FOR_MODULES(hv_iommu_unregister_pci_bus, "pci-hyperv");
+
+/*
+ * Look up the logical device ID prefix registered for @pci_domain_nr.
+ * Returns 0 on success with *prefix filled in; -ENODEV if no entry is
+ * registered for that PCI domain.
+ */
+int hv_iommu_lookup_logical_dev_id(int pci_domain_nr, u32 *prefix)
+{
+	struct hv_pci_busdata *bus;
+	int ret = -ENODEV;
+
+	spin_lock(&hv_pci_bus_lock);
+	list_for_each_entry(bus, &hv_pci_bus_list, list) {
+		if (bus->pci_domain_nr == pci_domain_nr) {
+			*prefix = bus->logical_dev_id_prefix;
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock(&hv_pci_bus_lock);
+	return ret;
+}
+#endif /* CONFIG_HYPERV_PVIOMMU */
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index cfc8fa403dad..58ca2c95bd10 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -641,10 +641,7 @@ static void hv_irq_retarget_interrupt(struct irq_data *data)
 	params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
 	params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff;
 	params->int_entry.msi_entry.data.as_uint32 = int_desc->data;
-	params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
-			   (hbus->hdev->dev_instance.b[4] << 16) |
-			   (hbus->hdev->dev_instance.b[7] << 8) |
-			   (hbus->hdev->dev_instance.b[6] & 0xf8) |
+	params->device_id = hv_build_logical_dev_id_prefix(hbus->hdev) |
 			   PCI_FUNC(pdev->devfn);
 	params->int_target.vector = hv_msi_get_int_vector(data);
 
@@ -3715,6 +3712,7 @@ static int hv_pci_probe(struct hv_device *hdev,
 	struct hv_pcibus_device *hbus;
 	int ret, dom;
 	u16 dom_req;
+	u32 prefix;
 	char *name;
 
 	bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
@@ -3857,13 +3855,22 @@ static int hv_pci_probe(struct hv_device *hdev,
 
 	hbus->state = hv_pcibus_probed;
 
-	ret = create_root_hv_pci_bus(hbus);
+	/* Notify pvIOMMU before any device on the bus is scanned. */
+	prefix = hv_build_logical_dev_id_prefix(hdev);
+
+	ret = hv_iommu_register_pci_bus(dom, prefix);
 	if (ret)
 		goto free_windows;
 
+	ret = create_root_hv_pci_bus(hbus);
+	if (ret)
+		goto unregister_pviommu;
+
 	mutex_unlock(&hbus->state_lock);
 	return 0;
 
+unregister_pviommu:
+	hv_iommu_unregister_pci_bus(dom);
 free_windows:
 	hv_pci_free_bridge_windows(hbus);
 exit_d0:
@@ -3977,6 +3984,8 @@ static void hv_pci_remove(struct hv_device *hdev)
 
 	hbus = hv_get_drvdata(hdev);
 	if (hbus->state == hv_pcibus_installed) {
+		int dom = hbus->bridge->domain_nr;
+
 		tasklet_disable(&hdev->channel->callback_event);
 		hbus->state = hv_pcibus_removing;
 		tasklet_enable(&hdev->channel->callback_event);
@@ -3994,6 +4003,8 @@ static void hv_pci_remove(struct hv_device *hdev)
 		hv_pci_remove_slots(hbus);
 		pci_remove_root_bus(hbus->bridge->bus);
 		pci_unlock_rescan_remove();
+
+		hv_iommu_unregister_pci_bus(dom);
 	}
 
 	hv_pci_bus_exit(hdev, false);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb..f65344f2bb81 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -73,6 +73,19 @@ extern enum hv_partition_type hv_curr_partition_type;
 extern void * __percpu *hyperv_pcpu_input_arg;
 extern void * __percpu *hyperv_pcpu_output_arg;
 
+#ifdef CONFIG_HYPERV_PVIOMMU
+int  hv_iommu_register_pci_bus(int pci_domain_nr, u32 logical_dev_id_prefix);
+void hv_iommu_unregister_pci_bus(int pci_domain_nr);
+int  hv_iommu_lookup_logical_dev_id(int pci_domain_nr, u32 *prefix);
+#else
+static inline int hv_iommu_register_pci_bus(int pci_domain_nr,
+					    u32 logical_dev_id_prefix)
+{
+	return 0;
+}
+static inline void hv_iommu_unregister_pci_bus(int pci_domain_nr) { }
+#endif
+
 u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
 u64 hv_do_fast_hypercall8(u16 control, u64 input8);
 u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 9de2c8d6037a..10ee2c462d7c 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1287,6 +1287,14 @@ struct hv_device {
 #define device_to_hv_device(d)	container_of_const(d, struct hv_device, device)
 #define drv_to_hv_drv(d)	container_of_const(d, struct hv_driver, driver)
 
+static inline u32 hv_build_logical_dev_id_prefix(struct hv_device *hdev)
+{
+	return ((u32)hdev->dev_instance.b[5] << 24) |
+	       ((u32)hdev->dev_instance.b[4] << 16) |
+	       ((u32)hdev->dev_instance.b[7] << 8) |
+	       (hdev->dev_instance.b[6] & 0xf8u);
+}
+
 static inline void hv_set_drvdata(struct hv_device *dev, void *data)
 {
 	dev_set_drvdata(&dev->device, data);
-- 
2.52.0



* [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
  2026-07-02 16:05 [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang
  2026-07-02 16:05 ` [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang
  2026-07-02 16:05 ` [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices Yu Zhang
@ 2026-07-02 16:05 ` Yu Zhang
  2026-07-02 17:08   ` sashiko-bot
  2026-07-03 17:32   ` Jason Gunthorpe
  2026-07-02 16:05 ` [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support Yu Zhang
  3 siblings, 2 replies; 11+ messages in thread
From: Yu Zhang @ 2026-07-02 16:05 UTC
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor

Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
This driver implements stage-1 IO translation within the guest OS.
It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
for:
 - Capability discovery
 - Domain allocation, configuration, and deallocation
 - Device attachment and detachment
 - IOTLB invalidation

The driver constructs x86-compatible stage-1 IO page tables in the
guest memory using consolidated IO page table helpers. This allows
the guest to manage stage-1 translations independently of vendor-
specific drivers (like Intel VT-d or AMD IOMMU).

Hyper-V consumes this stage-1 IO page table when a device domain is
created and configured, and nests it with the host's stage-2 IO page
tables, thereby eliminating VM exits for guest IOMMU mapping
operations. Unmapping operations still require a VM exit to perform
the IOTLB flush.

To identify a device in its hypercall interface, the driver looks up the
logical device ID prefix registered for the device's PCI domain (see the
logical device ID registry in hv_common.c) and combines it with the PCI
function number of the endpoint device.

Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Co-developed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
---
 arch/x86/hyperv/hv_init.c       |   4 +
 arch/x86/include/asm/mshyperv.h |   4 +
 drivers/iommu/Kconfig           |   1 +
 drivers/iommu/hyperv/Kconfig    |  16 +
 drivers/iommu/hyperv/Makefile   |   1 +
 drivers/iommu/hyperv/iommu.c    | 620 ++++++++++++++++++++++++++++++++
 drivers/iommu/hyperv/iommu.h    |  51 +++
 7 files changed, 697 insertions(+)
 create mode 100644 drivers/iommu/hyperv/Kconfig
 create mode 100644 drivers/iommu/hyperv/iommu.c
 create mode 100644 drivers/iommu/hyperv/iommu.h

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 55a8b6de2865..094f9f7ddb72 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -578,6 +578,10 @@ void __init hyperv_init(void)
 	old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
 	x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
 
+#ifdef CONFIG_HYPERV_PVIOMMU
+	x86_init.iommu.iommu_init = hv_iommu_init;
+#endif
+
 	hv_apic_init();
 
 	x86_init.pci.arch_init = hv_pci_init;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f64393e853ee..20d947c2c758 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -313,6 +313,10 @@ static inline void mshv_vtl_return_hypercall(void) {}
 static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
 #endif
 
+#ifdef CONFIG_HYPERV_PVIOMMU
+int __init hv_iommu_init(void);
+#endif
+
 #include <asm-generic/mshyperv.h>
 
 #endif
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 6e07bd69467a..0d128f377929 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -195,6 +195,7 @@ config MSM_IOMMU
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/arm/Kconfig"
 source "drivers/iommu/intel/Kconfig"
+source "drivers/iommu/hyperv/Kconfig"
 source "drivers/iommu/iommufd/Kconfig"
 source "drivers/iommu/riscv/Kconfig"
 
diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig
new file mode 100644
index 000000000000..8b6abbaaf9b8
--- /dev/null
+++ b/drivers/iommu/hyperv/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# HyperV paravirtualized IOMMU support
+config HYPERV_PVIOMMU
+	bool "Microsoft Hypervisor para-virtualized IOMMU support"
+	depends on X86_64 && HYPERV
+	select IOMMU_API
+	select GENERIC_PT
+	select IOMMU_PT
+	select IOMMU_PT_X86_64
+	select IOMMU_IOVA
+	default HYPERV
+	help
+	  Para-virtualized IOMMU driver for Linux guests running on
+	  Microsoft Hyper-V. Provides DMA remapping and IOTLB
+	  flush support to enable DMA isolation for devices
+	  assigned to the guest.
diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile
index 6ef0ef97f3dd..fefb409d976b 100644
--- a/drivers/iommu/hyperv/Makefile
+++ b/drivers/iommu/hyperv/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_IRQ_REMAP) += hv-irq-remap-x86.o
+obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o
diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
new file mode 100644
index 000000000000..254136946404
--- /dev/null
+++ b/drivers/iommu/hyperv/iommu.c
@@ -0,0 +1,620 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Hyper-V IOMMU driver.
+ *
+ * Copyright (C) 2019, 2024-2026 Microsoft, Inc.
+ */
+
+#define pr_fmt(fmt) "Hyper-V pvIOMMU: " fmt
+#define dev_fmt(fmt) pr_fmt(fmt)
+
+#include <linux/iommu.h>
+#include <linux/pci.h>
+#include <linux/dma-map-ops.h>
+#include <linux/generic_pt/iommu.h>
+#include <linux/pci-ats.h>
+
+#include <asm/iommu.h>
+#include <asm/hypervisor.h>
+#include <asm/mshyperv.h>
+
+#include "iommu.h"
+#include "../iommu-pages.h"
+
+struct hv_iommu_dev *hv_iommu_device;
+
+/*
+ * Identity and blocking domains are static singletons: identity is a 1:1
+ * passthrough with no page table, blocking rejects all DMA. Neither holds
+ * per-IOMMU state, so one instance suffices even with multiple vIOMMUs.
+ */
+static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
+static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
+static struct iommu_ops hv_iommu_ops;
+
+static struct hv_iommu_domain hv_identity_domain = {
+	.domain = {
+		.type	= IOMMU_DOMAIN_IDENTITY,
+		.ops	= &hv_iommu_identity_domain_ops,
+		.owner	= &hv_iommu_ops,
+	},
+};
+static struct hv_iommu_domain hv_blocking_domain = {
+	.domain = {
+		.type	= IOMMU_DOMAIN_BLOCKED,
+		.ops	= &hv_iommu_blocking_domain_ops,
+		.owner	= &hv_iommu_ops,
+	},
+};
+
+static inline bool hv_iommu_present(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_PRESENT;
+}
+
+static inline bool hv_iommu_s1_domain_supported(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_S1;
+}
+
+static inline bool hv_iommu_5lvl_supported(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_S1_5LVL;
+}
+
+static inline bool hv_iommu_ats_supported(u64 cap)
+{
+	return cap & HV_IOMMU_CAP_ATS;
+}
+
+static int hv_create_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_stage)
+{
+	int ret;
+	u64 status;
+	unsigned long flags;
+	struct hv_input_create_device_domain *input;
+
+	ret = ida_alloc_range(&hv_iommu_device->domain_ids,
+			hv_iommu_device->first_domain, hv_iommu_device->last_domain,
+			GFP_KERNEL);
+	if (ret < 0)
+		return ret;
+
+	hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
+	hv_domain->device_domain.domain_id.type = domain_stage;
+	hv_domain->device_domain.domain_id.id = ret;
+	hv_domain->hv_iommu = hv_iommu_device;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	input->create_device_domain_flags.forward_progress_required = 1;
+	input->create_device_domain_flags.inherit_owning_vtl = 0;
+	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("HVCALL_CREATE_DEVICE_DOMAIN failed, status %lld\n", status);
+		ida_free(&hv_iommu_device->domain_ids, hv_domain->device_domain.domain_id.id);
+	}
+
+	return hv_result_to_errno(status);
+}
+
+static void hv_delete_device_domain(struct hv_iommu_domain *hv_domain)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_delete_device_domain *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_DELETE_DEVICE_DOMAIN failed, status %lld\n", status);
+
+	ida_free(&hv_domain->hv_iommu->domain_ids, hv_domain->device_domain.domain_id.id);
+}
+
+static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	case IOMMU_CAP_DEFERRED_FLUSH:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_flush_device_domain *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_FLUSH_DEVICE_DOMAIN failed, status %lld\n", status);
+}
+
+static int hv_iommu_attach_dev(struct iommu_domain *domain, struct device *dev,
+			       struct iommu_domain *old)
+{
+	u64 status;
+	u32 prefix;
+	unsigned long flags;
+	struct pci_dev *pdev;
+	struct hv_input_attach_device_domain *input;
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
+	int ret;
+
+	if (vdev->hv_domain == hv_domain)
+		return 0;
+
+	pdev = to_pci_dev(dev);
+	dev_dbg(dev, "attaching to domain %d\n",
+		hv_domain->device_domain.domain_id.id);
+
+	ret = hv_iommu_lookup_logical_dev_id(pci_domain_nr(pdev->bus), &prefix);
+	if (ret) {
+		dev_err(&pdev->dev, "no IOMMU registration for vPCI bus\n");
+		return ret;
+	}
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	input->device_id.as_uint64 = (u64)prefix | PCI_FUNC(pdev->devfn);
+	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_ATTACH_DEVICE_DOMAIN failed, status %lld\n", status);
+	else
+		vdev->hv_domain = hv_domain;
+
+	return hv_result_to_errno(status);
+}
+
+static int hv_iommu_blocking_attach_dev(struct iommu_domain *domain,
+					struct device *dev,
+					struct iommu_domain *old)
+{
+	int ret = hv_iommu_attach_dev(domain, dev, old);
+
+	/*
+	 * Attaching to the blocking domain only asks the hypervisor to
+	 * disable translation and IOPF for the device, so it cannot fail
+	 * unless there is a driver or hypervisor bug. Return the hypercall
+	 * status rather than 0 so that a failure on the DMA ownership claim
+	 * path (VFIO/iommufd) fails the claim instead of leaving the device
+	 * unblocked. WARN since such a failure indicates a bug.
+	 */
+	WARN_ON(ret);
+	return ret;
+}
+
+static int hv_iommu_get_logical_device_property(struct device *dev,
+					u32 code,
+					struct hv_output_get_logical_device_property *property)
+{
+	u64 status;
+	u32 prefix;
+	unsigned long flags;
+	int ret;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct hv_input_get_logical_device_property *input;
+	struct hv_output_get_logical_device_property *output;
+
+	ret = hv_iommu_lookup_logical_dev_id(pci_domain_nr(pdev->bus), &prefix);
+	if (ret)
+		return ret;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = (struct hv_output_get_logical_device_property *)(input + 1);
+	memset(input, 0, sizeof(*input));
+	input->partition_id = HV_PARTITION_ID_SELF;
+	input->logical_device_id = (u64)prefix | PCI_FUNC(pdev->devfn);
+	input->code = code;
+	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
+	*property = *output;
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_GET_LOGICAL_DEVICE_PROPERTY failed, status %lld\n", status);
+
+	return hv_result_to_errno(status);
+}
+
+static struct iommu_device *hv_iommu_probe_device(struct device *dev)
+{
+	struct pci_dev *pdev;
+	struct hv_iommu_endpoint *vdev;
+	struct hv_output_get_logical_device_property device_iommu_property = {0};
+
+	if (!dev_is_pci(dev))
+		return ERR_PTR(-ENODEV);
+
+	pdev = to_pci_dev(dev);
+
+	if (hv_iommu_get_logical_device_property(dev,
+						 HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
+						 &device_iommu_property) ||
+	    !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED))
+		return ERR_PTR(-ENODEV);
+
+	vdev = kzalloc_obj(*vdev, GFP_KERNEL);
+	if (!vdev)
+		return ERR_PTR(-ENOMEM);
+
+	vdev->dev = dev;
+	vdev->hv_iommu = hv_iommu_device;
+	dev_iommu_priv_set(dev, vdev);
+
+	if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
+	    pci_ats_supported(pdev))
+		pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap));
+
+	return &vdev->hv_iommu->iommu;
+}
+
+static void hv_iommu_release_device(struct device *dev)
+{
+	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (pdev->ats_enabled)
+		pci_disable_ats(pdev);
+
+	dev_iommu_priv_set(dev, NULL);
+
+	kfree(vdev);
+}
+
+static struct iommu_group *hv_iommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+
+	WARN_ON_ONCE(1);
+	return generic_device_group(dev);
+}
+
+static int hv_configure_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_type)
+{
+	u64 status;
+	unsigned long flags;
+	struct pt_iommu_x86_64_hw_info pt_info;
+	struct hv_input_configure_device_domain *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->device_domain = hv_domain->device_domain;
+	input->settings.flags.blocked = (domain_type == IOMMU_DOMAIN_BLOCKED);
+	/*
+	 * Clearing translation_enabled bypasses translation (DMA uses the GPA
+	 * directly), which only suits identity. The hypervisor requires paging
+	 * and blocked domains to keep it set.
+	 */
+	input->settings.flags.translation_enabled = (domain_type != IOMMU_DOMAIN_IDENTITY);
+
+	if (domain_type & __IOMMU_DOMAIN_PAGING) {
+		pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64, &pt_info);
+		input->settings.page_table_root = pt_info.gcr3_pt;
+		input->settings.flags.first_stage_paging_mode =
+			pt_info.levels == 5;
+	}
+	status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN, input, NULL);
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_CONFIGURE_DEVICE_DOMAIN failed, status %lld\n", status);
+
+	return hv_result_to_errno(status);
+}
+
+static int __init hv_initialize_static_domains(void)
+{
+	int ret;
+	struct hv_iommu_domain *hv_domain;
+
+	/* Default stage-1 identity domain */
+	hv_domain = &hv_identity_domain;
+
+	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
+	if (ret)
+		return ret;
+
+	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
+	if (ret)
+		goto delete_identity_domain;
+
+	/* Default stage-1 blocked domain */
+	hv_domain = &hv_blocking_domain;
+
+	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
+	if (ret)
+		goto delete_identity_domain;
+
+	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
+	if (ret)
+		goto delete_blocked_domain;
+
+	return 0;
+
+delete_blocked_domain:
+	hv_delete_device_domain(&hv_blocking_domain);
+delete_identity_domain:
+	hv_delete_device_domain(&hv_identity_domain);
+	return ret;
+}
+
+/* x86 architectural MSI address range */
+#define INTERRUPT_RANGE_START	(0xfee00000)
+#define INTERRUPT_RANGE_END	(0xfeefffff)
+static void hv_iommu_get_resv_regions(struct device *dev,
+		struct list_head *head)
+{
+	struct iommu_resv_region *region;
+
+	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
+				      INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
+				      0, IOMMU_RESV_MSI, GFP_KERNEL);
+	if (!region)
+		return;
+
+	list_add_tail(&region->list, head);
+}
+
+static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
+{
+	hv_flush_device_domain(to_hv_iommu_domain(domain));
+}
+
+static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
+				struct iommu_iotlb_gather *iotlb_gather)
+{
+	hv_flush_device_domain(to_hv_iommu_domain(domain));
+
+	iommu_put_pages_list(&iotlb_gather->freelist);
+}
+
+static void hv_iommu_paging_domain_free(struct iommu_domain *domain)
+{
+	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
+
+	/* Free all remaining mappings */
+	pt_iommu_deinit(&hv_domain->pt_iommu);
+
+	hv_delete_device_domain(hv_domain);
+
+	kfree(hv_domain);
+}
+
+static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
+	.attach_dev	= hv_iommu_attach_dev,
+};
+
+static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
+	.attach_dev	= hv_iommu_blocking_attach_dev,
+};
+
+static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
+	.attach_dev	= hv_iommu_attach_dev,
+	IOMMU_PT_DOMAIN_OPS(x86_64),
+	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
+	.iotlb_sync = hv_iommu_iotlb_sync,
+	.free = hv_iommu_paging_domain_free,
+};
+
+static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
+{
+	int ret;
+	struct hv_iommu_domain *hv_domain;
+	struct pt_iommu_x86_64_cfg cfg = {};
+
+	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
+	if (!hv_domain)
+		return ERR_PTR(-ENOMEM);
+
+	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
+	if (ret)
+		goto err_free;
+
+	hv_domain->pt_iommu.nid = dev_to_node(dev);
+
+	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
+	cfg.common.hw_max_oasz_lg2 = 52;
+	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
+
+	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
+	if (ret)
+		goto err_delete_domain;
+
+	/* Constrain to page sizes the hypervisor supports */
+	hv_domain->domain.pgsize_bitmap &= hv_iommu_device->pgsize_bitmap;
+
+	hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
+
+	ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING);
+	if (ret)
+		goto err_pt_deinit;
+
+	return &hv_domain->domain;
+
+err_pt_deinit:
+	pt_iommu_deinit(&hv_domain->pt_iommu);
+err_delete_domain:
+	hv_delete_device_domain(hv_domain);
+err_free:
+	kfree(hv_domain);
+	return ERR_PTR(ret);
+}
+
+static struct iommu_ops hv_iommu_ops = {
+	.capable		  = hv_iommu_capable,
+	.domain_alloc_paging	  = hv_iommu_domain_alloc_paging,
+	.probe_device		  = hv_iommu_probe_device,
+	.release_device		  = hv_iommu_release_device,
+	.device_group		  = hv_iommu_device_group,
+	.get_resv_regions	  = hv_iommu_get_resv_regions,
+	.owner			  = THIS_MODULE,
+	.identity_domain	  = &hv_identity_domain.domain,
+	.blocked_domain		  = &hv_blocking_domain.domain,
+	.release_domain		  = &hv_blocking_domain.domain,
+};
+
+static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_get_iommu_capabilities *input;
+	struct hv_output_get_iommu_capabilities *output;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = (struct hv_output_get_iommu_capabilities *)(input + 1);
+	memset(input, 0, sizeof(*input));
+	input->partition_id = HV_PARTITION_ID_SELF;
+	status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);
+	*hv_iommu_cap = *output;
+
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status))
+		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed, status %lld\n", status);
+
+	return hv_result_to_errno(status);
+}
+
+static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
+			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
+{
+	ida_init(&hv_iommu->domain_ids);
+
+	hv_iommu->cap = hv_iommu_cap->iommu_cap;
+	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
+	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
+	    hv_iommu->max_iova_width > 48) {
+		pr_info("5-level paging not supported, limiting iova width to 48.\n");
+		hv_iommu->max_iova_width = 48;
+	}
+
+	hv_iommu->geometry = (struct iommu_domain_geometry) {
+		.aperture_start = 0,
+		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
+		.force_aperture = true,
+	};
+
+	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
+	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
+	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap;
+	hv_iommu_device = hv_iommu;
+}
+
+int __init hv_iommu_init(void)
+{
+	int ret = 0;
+	struct hv_iommu_dev *hv_iommu = NULL;
+	struct hv_output_get_iommu_capabilities hv_iommu_cap = {0};
+
+	if (no_iommu || iommu_detected)
+		return -ENODEV;
+
+	if (!hv_is_hyperv_initialized())
+		return -ENODEV;
+
+	ret = hv_iommu_detect(&hv_iommu_cap);
+	if (ret) {
+		pr_err("HVCALL_GET_IOMMU_CAPABILITIES failed: %d\n", ret);
+		return -ENODEV;
+	}
+
+	if (!hv_iommu_present(hv_iommu_cap.iommu_cap) ||
+	    !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap)) {
+		pr_err("IOMMU capabilities not sufficient: cap=0x%llx\n",
+		       hv_iommu_cap.iommu_cap);
+		return -ENODEV;
+	}
+
+	/*
+	 * The page table code only maps x86 page sizes (4K/2M/1G); require the
+	 * hypervisor to advertise a non-empty subset of exactly those.
+	 */
+	if (!hv_iommu_cap.pgsize_bitmap ||
+	    (hv_iommu_cap.pgsize_bitmap & ~(u64)(SZ_4K | SZ_2M | SZ_1G))) {
+		pr_err("unsupported page sizes: pgsize_bitmap=0x%llx\n",
+		       hv_iommu_cap.pgsize_bitmap);
+		return -ENODEV;
+	}
+
+	iommu_detected = 1;
+	pci_request_acs();
+
+	hv_iommu = kzalloc_obj(*hv_iommu, GFP_KERNEL);
+	if (!hv_iommu)
+		return -ENOMEM;
+
+	hv_init_iommu_device(hv_iommu, &hv_iommu_cap);
+
+	ret = hv_initialize_static_domains();
+	if (ret) {
+		pr_err("static domains init failed: %d\n", ret);
+		goto err_free;
+	}
+
+	ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
+	if (ret) {
+		pr_err("iommu_device_sysfs_add failed: %d\n", ret);
+		goto err_delete_static_domains;
+	}
+
+	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
+	if (ret) {
+		pr_err("iommu_device_register failed: %d\n", ret);
+		goto err_sysfs_remove;
+	}
+
+	pr_info("successfully initialized\n");
+	return 0;
+
+err_sysfs_remove:
+	iommu_device_sysfs_remove(&hv_iommu->iommu);
+err_delete_static_domains:
+	hv_delete_device_domain(&hv_blocking_domain);
+	hv_delete_device_domain(&hv_identity_domain);
+err_free:
+	kfree(hv_iommu);
+	return ret;
+}
diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
new file mode 100644
index 000000000000..3a9f40fa2403
--- /dev/null
+++ b/drivers/iommu/hyperv/iommu.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Hyper-V IOMMU driver.
+ *
+ * Copyright (C) 2024-2025, Microsoft, Inc.
+ *
+ */
+
+#ifndef _HYPERV_IOMMU_H
+#define _HYPERV_IOMMU_H
+
+struct hv_iommu_dev {
+	struct iommu_device iommu;
+	struct ida domain_ids;
+
+	/* Device configuration */
+	u8  max_iova_width;
+	u8  max_pasid_width;
+	u64 cap;
+	u64 pgsize_bitmap;
+
+	struct iommu_domain_geometry geometry;
+	u64 first_domain;
+	u64 last_domain;
+};
+
+struct hv_iommu_domain {
+	union {
+		struct iommu_domain    domain;
+		struct pt_iommu        pt_iommu;
+		struct pt_iommu_x86_64 pt_iommu_x86_64;
+	};
+	struct hv_iommu_dev *hv_iommu;
+	struct hv_input_device_domain device_domain;
+	u64		pgsize_bitmap;
+};
+
+PT_IOMMU_CHECK_DOMAIN(struct hv_iommu_domain, pt_iommu, domain);
+PT_IOMMU_CHECK_DOMAIN(struct hv_iommu_domain, pt_iommu_x86_64.iommu, domain);
+
+struct hv_iommu_endpoint {
+	struct device *dev;
+	struct hv_iommu_dev *hv_iommu;
+	struct hv_iommu_domain *hv_domain;
+};
+
+#define to_hv_iommu_domain(d) \
+	container_of(d, struct hv_iommu_domain, domain)
+
+#endif /* _HYPERV_IOMMU_H */
-- 
2.52.0



* [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support
  2026-07-02 16:05 [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang
                   ` (2 preceding siblings ...)
  2026-07-02 16:05 ` [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang
@ 2026-07-02 16:05 ` Yu Zhang
  2026-07-02 17:20   ` sashiko-bot
  2026-07-03 17:10   ` Jason Gunthorpe
  3 siblings, 2 replies; 11+ messages in thread
From: Yu Zhang @ 2026-07-02 16:05 UTC (permalink / raw)
  To: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch
  Cc: wei.liu, kys, haiyangz, decui, longli, joro, will, robin.murphy,
	bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd, jgg,
	mhklinux, jacob.pan, tgopinath, easwar.hariharan, mrathor

Add page-selective IOTLB flush using HVCALL_FLUSH_DEVICE_DOMAIN_LIST.
This hypercall accepts a list of (page_number, page_mask_shift) entries,
enabling finer-grained IOTLB invalidation compared to the domain-wide
HVCALL_FLUSH_DEVICE_DOMAIN used by hv_iommu_flush_iotlb_all().

hv_iommu_calc_flush_range() computes the smallest power-of-two aligned
range that covers the target IOVA region, producing a single flush
descriptor. This may over-flush when the range is not naturally aligned,
matching the approach used by Intel VT-d PSI. If the page-selective
flush fails, the code falls back to a full domain flush.

Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
---
 drivers/iommu/hyperv/iommu.c | 68 +++++++++++++++++++++++++++++++++++-
 include/hyperv/hvgdk_mini.h  |  1 +
 include/hyperv/hvhdk_mini.h  | 17 +++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
index 254136946404..e9b104a322fd 100644
--- a/drivers/iommu/hyperv/iommu.c
+++ b/drivers/iommu/hyperv/iommu.c
@@ -9,6 +9,7 @@
 #define pr_fmt(fmt) "Hyper-V pvIOMMU: " fmt
 #define dev_fmt(fmt) pr_fmt(fmt)
 
+#include <linux/hyperv.h>
 #include <linux/iommu.h>
 #include <linux/pci.h>
 #include <linux/dma-map-ops.h>
@@ -401,10 +402,74 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
 	hv_flush_device_domain(to_hv_iommu_domain(domain));
 }
 
+/*
+ * Calculate the minimal power-of-two aligned range that covers [start, end]
+ * (end is inclusive). Returns a single (page_number, page_mask_shift)
+ * descriptor that may over-flush when the range is not naturally aligned.
+ */
+static void hv_iommu_calc_flush_range(unsigned long start, unsigned long end,
+				       union hv_iommu_flush_va *va)
+{
+	unsigned long start_pfn = HVPFN_DOWN(start);
+	unsigned long last_pfn = HVPFN_UP(end + 1) - 1;
+	unsigned long mask_shift, aligned_pfn;
+
+	if (start_pfn == last_pfn) {
+		mask_shift = 0;
+	} else {
+		/*
+		 * Find the highest bit position where start_pfn and last_pfn
+		 * differ.  A range aligned to one above that bit is the
+		 * smallest power-of-two region that covers both endpoints.
+		 */
+		mask_shift = __fls(start_pfn ^ last_pfn) + 1;
+	}
+
+	aligned_pfn = ALIGN_DOWN(start_pfn, 1UL << mask_shift);
+	va->page_number = aligned_pfn;
+	va->page_mask_shift = mask_shift;
+}
+
+static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
+					struct iommu_iotlb_gather *iotlb_gather)
+{
+	u64 status;
+	unsigned long flags;
+	struct hv_input_flush_device_domain_list *input;
+
+	local_irq_save(flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+
+	input->device_domain = hv_domain->device_domain;
+	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
+	hv_iommu_calc_flush_range(iotlb_gather->start, iotlb_gather->end,
+				  &input->iova_list[0]);
+
+	status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
+				     1, 0, input, NULL);
+
+	if (!hv_result_success(status)) {
+		/* Page-selective flush failed, fall back to full flush. */
+		struct hv_input_flush_device_domain *flush_all = (void *)input;
+
+		memset(flush_all, 0, sizeof(*flush_all));
+		flush_all->device_domain = hv_domain->device_domain;
+		status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
+					flush_all, NULL);
+		WARN(!hv_result_success(status),
+		     "HVCALL_FLUSH_DEVICE_DOMAIN fallback also failed: %lld\n",
+		     status);
+	}
+
+	local_irq_restore(flags);
+}
+
 static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
 				struct iommu_iotlb_gather *iotlb_gather)
 {
-	hv_flush_device_domain(to_hv_iommu_domain(domain));
+	hv_flush_device_domain_list(to_hv_iommu_domain(domain), iotlb_gather);
 
 	iommu_put_pages_list(&iotlb_gather->freelist);
 }
@@ -455,6 +520,7 @@ static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
 
 	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
 	cfg.common.hw_max_oasz_lg2 = 52;
+	cfg.common.features |= BIT(PT_FEAT_FLUSH_RANGE);
 	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
 
 	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 5bdbb44da112..eaaf87171478 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -496,6 +496,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_GPA_PAGES_ACCESS_STATES		0x00c9
 #define HVCALL_CONFIGURE_DEVICE_DOMAIN			0x00ce
 #define HVCALL_FLUSH_DEVICE_DOMAIN			0x00d0
+#define HVCALL_FLUSH_DEVICE_DOMAIN_LIST			0x00d1
 #define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d7
 #define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS	0x00d8
 #define HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY	0x00db
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 493608e791b4..f51d5d9467f1 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -671,4 +671,21 @@ struct hv_input_flush_device_domain {
 	u32 reserved;
 } __packed;
 
+union hv_iommu_flush_va {
+	u64 iova;
+	struct {
+		u64 page_mask_shift : 12;
+		u64 page_number : 52;
+	};
+} __packed;
+
+
+struct hv_input_flush_device_domain_list {
+	struct hv_input_device_domain device_domain;
+#define HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT (1 << 0)
+	u32 flags;
+	u32 reserved;
+	union hv_iommu_flush_va iova_list[];
+} __packed;
+
 #endif /* _HV_HVHDK_MINI_H */
-- 
2.52.0



* Re: [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU
  2026-07-02 16:05 ` [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang
@ 2026-07-02 16:36   ` sashiko-bot
  0 siblings, 0 replies; 11+ messages in thread
From: sashiko-bot @ 2026-07-02 16:36 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-pci, linux-hyperv

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] `struct hv_input_create_device_domain` is missing a 4-byte padding field, resulting in a non-standard 28-byte size.
--

commit 267ecd2e71d0d7f097fc4d970780fef12801efc5
Author: Wei Liu <wei.liu@kernel.org>

hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU

This commit introduces new hypercall definitions to allow a child partition
to interact with the para-virtualized Hyper-V guest IOMMU. It includes
interfaces for capability detection, device/domain management, and IOTLB
flushing.

[ ... ]

> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index b4cb2fa26e9bc..493608e791b45 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h

[ ... ]

> +union hv_create_device_domain_flags {
> +	u32 as_uint32;
> +	struct {
> +		u32 forward_progress_required: 1;
> +		u32 inherit_owning_vtl: 1;
> +		u32 reserved: 30;
> +	} __packed;
> +};
> +
> +struct hv_input_create_device_domain {
> +	struct hv_input_device_domain device_domain;
> +	union hv_create_device_domain_flags create_device_domain_flags;
> +} __packed;

[Severity: Medium]
Does this structure need an explicit 4-byte padding field at the end to
ensure its size is a multiple of 8 bytes?

Because the struct is packed and lacks tail padding, its size is exactly
28 bytes. These structures are often initialized on a reused per-CPU page
like this:

hv_create_device_domain() {
        ...
        memset(input, 0, sizeof(*input));
        ...
}

If the hypervisor expects an 8-byte aligned 32-byte input for this hypercall,
the memset will only clear the first 28 bytes. This leaves 4 bytes of
uninitialized garbage from previous hypercalls in the tail end of the buffer,
which might cause domain creation to fail if the hypervisor strictly validates
padding fields to be zero.
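
To make the size argument concrete, here is a minimal userspace sketch. The
demo_* names are hypothetical, and the 24-byte opaque array only stands in for
struct hv_input_device_domain (its size is inferred from the 28-byte total
above, not taken from the header). It assumes GCC or Clang for the packed
attribute.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: only the assumed 24-byte size matters here. */
struct demo_device_domain {
	uint8_t opaque[24];
} __attribute__((packed));

/* Layout as posted: 24 + 4 = 28 bytes, no tail padding. */
struct demo_create_unpadded {
	struct demo_device_domain device_domain;
	uint32_t flags;
} __attribute__((packed));

/* One possible fix: an explicit reserved word rounds the size up to 32. */
struct demo_create_padded {
	struct demo_device_domain device_domain;
	uint32_t flags;
	uint32_t reserved;
} __attribute__((packed));

static_assert(sizeof(struct demo_create_unpadded) == 28, "unpadded layout");
static_assert(sizeof(struct demo_create_padded) == 32, "padded layout");
```

With the reserved word in place, memset(input, 0, sizeof(*input)) would also
clear the last 4 bytes of the 32-byte slot.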

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=1


* Re: [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices
  2026-07-02 16:05 ` [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices Yu Zhang
@ 2026-07-02 16:42   ` sashiko-bot
  0 siblings, 0 replies; 11+ messages in thread
From: sashiko-bot @ 2026-07-02 16:42 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-hyperv, linux-pci

> From: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>
> 
> Hyper-V identifies each PCI pass-thru device by a logical device ID in
> its hypercall interface. This ID consists of a per-bus prefix, derived
> from the VMBus device instance GUID, combined with the PCI function
> number of the endpoint device.
> 
> Add a small registry in hv_common.c that maps a PCI domain number to its
> logical device ID prefix. The vPCI bus driver (pci-hyperv) registers the
> prefix when a bus is probed and unregisters it when the bus is removed.
> Consumers such as the para-virtualized IOMMU driver look up the prefix
> by PCI domain number and combine it with the function number to form the
> complete logical device ID for hypercalls.
> 
> The prefix construction is shared via hv_build_logical_dev_id_prefix() so
> that pci-hyperv's interrupt retargeting path and the registry use exactly
> the same byte layout. It is derived on demand from the constant hv_device
> instance GUID rather than cached in struct hv_pcibus_device, which is
> private to the pci-hyperv module; this keeps the interface narrow and
> avoids depending on pci-hyperv internals.
> 
> Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
> Signed-off-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>

Sashiko has reviewed this patch and found no issues. It looks great!

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=2



* Re: [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
  2026-07-02 16:05 ` [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang
@ 2026-07-02 17:08   ` sashiko-bot
  2026-07-03 17:32   ` Jason Gunthorpe
  1 sibling, 0 replies; 11+ messages in thread
From: sashiko-bot @ 2026-07-02 17:08 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-hyperv, linux-pci

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Undefined behavior when calculating `aperture_end` if the hypervisor returns `max_iova_width >= 64`.
- [High] Unconditional overwrite of `x86_init.iommu.iommu_init` causes hardware IOMMUs (VT-d/AMD) to fail initialization.
--

commit fdcb0264dfd6bd1cb7c3c4676b8a796ad8a85e4a
Author: Yu Zhang <zhangyu1@linux.microsoft.com>

iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest

Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
This driver implements stage-1 IO translation within the guest OS and
integrates with the Linux IOMMU core using Hyper-V hypercalls.

> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 55a8b6de2865a..094f9f7ddb728 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -578,6 +578,10 @@ void __init hyperv_init(void)
>  	old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
>  	x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
>  
> +#ifdef CONFIG_HYPERV_PVIOMMU
> +	x86_init.iommu.iommu_init = hv_iommu_init;

[Severity: High]
Will this unconditional overwrite break hardware IOMMUs?

If pci_iommu_alloc() previously detected a hardware IOMMU (like VT-d/AMD) and
set iommu_detected=1, overwriting the pointer with hv_iommu_init() ensures
failure, since hv_iommu_init() returns -ENODEV as soon as it sees no_iommu or
iommu_detected set.

Should the assignment be made conditional, so that an init callback installed
by an earlier detection path is left in place?

> +#endif
> +
>  	hv_apic_init();

[ ... ]

> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> new file mode 100644
> index 0000000000000..2541369464040
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.c

[ ... ]

> @@ -509,6 +509,16 @@ static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,

[Severity: High]
Does this code overflow the 64-bit shift limit?

If an untrusted hypervisor responds to HVCALL_GET_IOMMU_CAPABILITIES with
max_iova_width >= 64, shifting ((u64)1) directly by that amount results in
undefined behavior and unpredictable IOVA aperture sizing. Should
max_iova_width be bounds-checked before the shift?
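
One way to avoid the undefined shift is to saturate the result, sketched below
as a userspace helper. demo_aperture_end() is a hypothetical name, and
rejecting out-of-range widths at detection time would be an equally valid
choice.

```c
#include <stdint.h>

/*
 * Hypothetical helper: inclusive end of an IOVA aperture that is `width`
 * bits wide. Widths of 64 or more saturate instead of shifting out of range.
 */
static uint64_t demo_aperture_end(unsigned int width)
{
	if (width >= 64)
		return UINT64_MAX;
	return (UINT64_C(1) << width) - 1;
}
```

For example, demo_aperture_end(48) yields 0xffffffffffff, and any width of 64
or more yields UINT64_MAX.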

> +		.force_aperture = true,
> +	};

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=3


* Re: [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support
  2026-07-02 16:05 ` [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support Yu Zhang
@ 2026-07-02 17:20   ` sashiko-bot
  2026-07-03 17:10   ` Jason Gunthorpe
  1 sibling, 0 replies; 11+ messages in thread
From: sashiko-bot @ 2026-07-02 17:20 UTC (permalink / raw)
  To: Yu Zhang; +Cc: linux-hyperv, linux-pci

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Bitfield assignment on an uninitialized hypercall buffer element causes a read of uninitialized memory, triggering a KMSAN uninit-value warning.
- [High] Missing check for an empty `iotlb_gather` causes spurious flushes of the entire IOTLB domain.
- [Medium] The calculation of `last_pfn` uses an unnecessarily complex formula that theoretically overflows for IOVAs near the end of the 64-bit address space.
--

commit 306d408fd1d04144672964e71dd7149937537189
Author: Yu Zhang <zhangyu1@linux.microsoft.com>

iommu/hyperv: Add page-selective IOTLB flush support

Adds page-selective IOTLB flush using HVCALL_FLUSH_DEVICE_DOMAIN_LIST.
Calculates the minimal power-of-two aligned range for the target IOVA region,
falling back to a full domain flush if the page-selective flush fails.
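
For reference, the range calculation can be modeled in userspace as follows.
This is only a sketch: the demo_* names are hypothetical, __builtin_clzll()
(GCC/Clang) stands in for the kernel's __fls(), and both PFNs are assumed to
be below 2^63 so that the shift stays under 64.

```c
#include <stdint.h>

/* Index of the highest set bit; x must be nonzero (models __fls()). */
static unsigned int demo_fls64(uint64_t x)
{
	return 63 - (unsigned int)__builtin_clzll(x);
}

/* Smallest power-of-two aligned block of pages covering [start_pfn, last_pfn]. */
static void demo_calc_flush_range(uint64_t start_pfn, uint64_t last_pfn,
				  uint64_t *aligned_pfn,
				  unsigned int *mask_shift)
{
	*mask_shift = 0;
	if (start_pfn != last_pfn)
		*mask_shift = demo_fls64(start_pfn ^ last_pfn) + 1;

	/* Round start_pfn down to a multiple of 2^mask_shift pages. */
	*aligned_pfn = start_pfn & ~((UINT64_C(1) << *mask_shift) - 1);
}
```

For PFNs 5..6 the highest differing bit is bit 1, so mask_shift is 2 and the
descriptor covers PFNs 4..7: two pages more than requested, which is the
over-flush the commit message mentions.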

> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> index 2541369464040..e9b104a322fd8 100644
> --- a/drivers/iommu/hyperv/iommu.c
> +++ b/drivers/iommu/hyperv/iommu.c
> @@ -401,10 +402,74 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
>  	hv_flush_device_domain(to_hv_iommu_domain(domain));
>  }
>  
> +/*
> + * Calculate the minimal power-of-two aligned range that covers [start, end]
> + * (end is inclusive). Returns a single (page_number, page_mask_shift)
> + * descriptor that may over-flush when the range is not naturally aligned.
> + */
> +static void hv_iommu_calc_flush_range(unsigned long start, unsigned long end,
> +				       union hv_iommu_flush_va *va)
> +{
> +	unsigned long start_pfn = HVPFN_DOWN(start);
> +	unsigned long last_pfn = HVPFN_UP(end + 1) - 1;

[Severity: Medium]
Can this calculation overflow?

If end is within the last 4095 bytes of the 64-bit address space, end + 1
would wrap around. HVPFN_UP() expands to (((x) + 4095) >> 12), which would
evaluate to 0 in that case.

Subtracting 1 would then result in ULONG_MAX, causing __fls() to compute a
mask_shift of 64 and triggering an undefined left shift by 64 in the later
ALIGN_DOWN macro.

Would it be safer to use HVPFN_DOWN(end) instead, which avoids the overflow
entirely?
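
A small userspace comparison of the two formulas, assuming 4 KiB Hyper-V pages
and 64-bit arithmetic (the demo_* names are hypothetical):

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12

/* Frame holding the inclusive last byte `end`; nothing is added, so no wrap. */
static uint64_t demo_last_pfn(uint64_t end)
{
	return end >> DEMO_PAGE_SHIFT;
}

/* The posted formula, HVPFN_UP(end + 1) - 1, kept for comparison. */
static uint64_t demo_last_pfn_posted(uint64_t end)
{
	return ((end + 1 + 4095) >> DEMO_PAGE_SHIFT) - 1;
}
```

Both agree for ordinary values such as end = 0x1fff (PFN 1). For
end = UINT64_MAX the posted form wraps to UINT64_MAX, while the shift-only
form gives the expected last frame, 2^52 - 1.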

> +	unsigned long mask_shift, aligned_pfn;

[ ... ]

> +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> +					struct iommu_iotlb_gather *iotlb_gather)
> +{
> +	u64 status;
> +	unsigned long flags;
> +	struct hv_input_flush_device_domain_list *input;
> +
> +	local_irq_save(flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain = hv_domain->device_domain;
> +	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> +	hv_iommu_calc_flush_range(iotlb_gather->start, iotlb_gather->end,
> +				  &input->iova_list[0]);

[Severity: High]
Does this code trigger an uninitialized memory read KMSAN warning?

Since iova_list is a flexible array member, sizeof(*input) evaluates to just
the header size (24 bytes). This means memset() clears the header, but leaves
input->iova_list[0] uninitialized.

When hv_iommu_calc_flush_range() later assigns the page_number and
page_mask_shift bitfields of the union, the compiler typically generates a
read-modify-write operation on the underlying 64-bit value. This will read
the uninitialized memory from the reused per-CPU page.
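
A user-space sketch of the sizing issue, assuming a hypothetical 24-byte header
layout (struct flush_input and its fields are invented for illustration and are
not the real hv_input_flush_device_domain_list). It shows that sizeof() stops
at the flexible array member, and one way to clear the entries that will be
written:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a fixed header followed by a flexible array. */
struct flush_input {
	uint64_t device_domain[2];
	uint32_t flags;
	uint32_t reserved;
	uint64_t iova_list[];	/* flexible array member, not counted by sizeof */
};

/* Bytes needed for the header plus n list entries. */
static size_t flush_input_size(size_t n)
{
	return offsetof(struct flush_input, iova_list) + n * sizeof(uint64_t);
}

/* Zero the header and the first n entries before any field is written. */
static void flush_input_clear(struct flush_input *in, size_t n)
{
	assert(in != NULL);
	memset(in, 0, flush_input_size(n));
}
```

In kernel code the same size is usually obtained with
struct_size(input, iova_list, 1) from <linux/overflow.h>.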

> +
> +	status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,

[ ... ]

>  static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
>  				struct iommu_iotlb_gather *iotlb_gather)
>  {
> -	hv_flush_device_domain(to_hv_iommu_domain(domain));
> +	hv_flush_device_domain_list(to_hv_iommu_domain(domain), iotlb_gather);

[Severity: High]
Does this unconditionally flush the entire IOTLB domain if the gather is
empty?

If iommu_unmap() returns without unmapping anything, iotlb_gather retains its
initialized values of start = ULONG_MAX and end = 0.

Because there is no check like if (iotlb_gather->start > iotlb_gather->end)
before issuing the flush, hv_iommu_calc_flush_range() will calculate a
mask_shift of 52. This causes a spurious flush of the full 52-bit page-frame
space (the entire 64-bit IOVA range) instead of returning early.
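
A minimal sketch of the emptiness test, using a stand-in struct that models
only the start/end fields of struct iommu_iotlb_gather. struct gather and the
gather_*() helpers are hypothetical names for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the start/end fields of struct iommu_iotlb_gather. */
struct gather {
	uint64_t start;	/* lowest IOVA gathered so far */
	uint64_t end;	/* highest IOVA gathered so far, inclusive */
};

/* Mirrors the initial state described above: start = max, end = 0. */
static void gather_init(struct gather *g)
{
	g->start = UINT64_MAX;
	g->end = 0;
}

/* Grow the gathered range to cover [iova, iova + size - 1]. */
static void gather_add(struct gather *g, uint64_t iova, uint64_t size)
{
	uint64_t last = iova + size - 1;

	assert(size != 0 && last >= iova);
	if (iova < g->start)
		g->start = iova;
	if (last > g->end)
		g->end = last;
}

/* An untouched gather has start > end, so there is nothing to flush. */
static bool gather_is_empty(const struct gather *g)
{
	return g->start > g->end;
}
```

A sync path could test gather_is_empty() first and skip the flush hypercall
when it returns true, while still releasing the freelist.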

>  
>  	iommu_put_pages_list(&iotlb_gather->freelist);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260702160518.311234-1-zhangyu1@linux.microsoft.com?part=4


* Re: [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support
  2026-07-02 16:05 ` [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support Yu Zhang
  2026-07-02 17:20   ` sashiko-bot
@ 2026-07-03 17:10   ` Jason Gunthorpe
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2026-07-03 17:10 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan, mrathor

On Fri, Jul 03, 2026 at 12:05:18AM +0800, Yu Zhang wrote:

> @@ -401,10 +402,74 @@ static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
>  	hv_flush_device_domain(to_hv_iommu_domain(domain));
>  }
>  
> +/*
> + * Calculate the minimal power-of-two aligned range that covers [start, end]
> + * (end is inclusive). Returns a single (page_number, page_mask_shift)
> + * descriptor that may over-flush when the range is not naturally aligned.
> + */
> +static void hv_iommu_calc_flush_range(unsigned long start, unsigned long end,
> +				       union hv_iommu_flush_va *va)
> +{
> +	unsigned long start_pfn = HVPFN_DOWN(start);
> +	unsigned long last_pfn = HVPFN_UP(end + 1) - 1;

Pedantically end can be ULONG_MAX, you shouldn't be adding to it since
it will overflow.

> +	unsigned long mask_shift, aligned_pfn;
> +
> +	if (start_pfn == last_pfn) {
> +		mask_shift = 0;
> +	} else {
> +		/*
> +		 * Find the highest bit position where start_pfn and last_pfn
> +		 * differ.  A range aligned to one above that bit is the
> +		 * smallest power-of-two region that covers both endpoints.
> +		 */
> +		mask_shift = __fls(start_pfn ^ last_pfn) + 1;
> +	}
> +
> +	aligned_pfn = ALIGN_DOWN(start_pfn, 1UL << mask_shift);

I think the whole thing is simpler if it stays using bytes until the end:

	sz_lg2 = __fls(gather->start ^ gather->end);
	if (sz_lg2 < HV_PAGE_SHIFT)
		sz_lg2 = HV_PAGE_SHIFT;

	page_number = (gather->start & ~(1UL << sz_lg2)) >> HV_PAGE_SHIFT;
	page_mask_shift = sz_lg2 - HV_PAGE_SHIFT;

No overflows that way either
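
One possible reading of the byte-based approach, as a user-space sketch. It
assumes the region size is one above the highest differing bit (as in the
patch's own comment) and that the mask clears all bits below that size.
struct flush_desc, fls_index() and calc_flush_range() are hypothetical names,
and fls_index() stands in for the kernel's __fls():

```c
#include <assert.h>
#include <stdint.h>

#define HV_PAGE_SHIFT 12

/* Hypothetical descriptor holding the two values the hypercall needs. */
struct flush_desc {
	uint64_t page_number;
	unsigned int page_mask_shift;
};

/* Index of the highest set bit; x must be non-zero. */
static unsigned int fls_index(uint64_t x)
{
	unsigned int i = 0;

	while (x >>= 1)
		i++;
	return i;
}

/* Smallest naturally aligned power-of-two region covering [start, end]. */
static struct flush_desc calc_flush_range(uint64_t start, uint64_t end)
{
	struct flush_desc d;
	uint64_t diff = start ^ end;
	unsigned int sz_lg2;

	assert(start <= end);

	/* One above the highest differing bit, and never less than a page. */
	sz_lg2 = diff ? fls_index(diff) + 1 : 0;
	if (sz_lg2 < HV_PAGE_SHIFT)
		sz_lg2 = HV_PAGE_SHIFT;

	/* A shift by 64 is undefined in C, so handle the whole-space case. */
	if (sz_lg2 >= 64)
		d.page_number = 0;
	else
		d.page_number = (start & ~((1ULL << sz_lg2) - 1)) >> HV_PAGE_SHIFT;
	d.page_mask_shift = sz_lg2 - HV_PAGE_SHIFT;
	return d;
}
```

Nothing is ever added to start or end here, so end = UINT64_MAX is handled
without wrap-around.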

Jason


* Re: [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
  2026-07-02 16:05 ` [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang
  2026-07-02 17:08   ` sashiko-bot
@ 2026-07-03 17:32   ` Jason Gunthorpe
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2026-07-03 17:32 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan, mrathor

On Fri, Jul 03, 2026 at 12:05:17AM +0800, Yu Zhang wrote:

> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +	switch (cap) {
> +	case IOMMU_CAP_CACHE_COHERENCY:
> +		return true;
> +	case IOMMU_CAP_DEFERRED_FLUSH:
> +		return true;

This CAP isn't necessary anymore

> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	struct pci_dev *pdev;
> +	struct hv_iommu_endpoint *vdev;
> +	struct hv_output_get_logical_device_property device_iommu_property = {0};
> +
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	pdev = to_pci_dev(dev);
> +
> +	if (hv_iommu_get_logical_device_property(dev,
> +						 HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
> +						 &device_iommu_property) ||
> +	    !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED))
> +		return ERR_PTR(-ENODEV);
> +
> +	vdev = kzalloc_obj(*vdev, GFP_KERNEL);
> +	if (!vdev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	vdev->dev = dev;
> +	vdev->hv_iommu = hv_iommu_device;
> +	dev_iommu_priv_set(dev, vdev);
> +
> +	if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
> +	    pci_ats_supported(pdev))
> +		pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap));

This can probably just be PAGE_SHIFT

Also ATS shouldn't be enabled until a translation is installed,
otherwise the driver cannot participate in the ATS error handling
Nicolin is working on.

> +static void hv_iommu_release_device(struct device *dev)
> +{
> +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (pdev->ats_enabled)
> +		pci_disable_ats(pdev);
> +
> +	dev_iommu_priv_set(dev, NULL);

Not necessary, the caller does it

> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +
> +	WARN_ON_ONCE(1);
> +	return generic_device_group(dev);

I think you can just return failure here instead of WARN_ON?

> +static int __init hv_initialize_static_domains(void)
> +{
> +	int ret;
> +	struct hv_iommu_domain *hv_domain;
> +
> +	/* Default stage-1 identity domain */
> +	hv_domain = &hv_identity_domain;
> +
> +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +	if (ret)
> +		return ret;
> +
> +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> +	if (ret)
> +		goto delete_identity_domain;

IMHO I would change this around to have a single function that accepts
a struct hv_input_configure_device_domain as input and does both of
the hypercalls inside. Then here it is easy to directly construct the
hv_input_configure_device_domain for blocking and identity.

I'd be happy if this never touched domain_type, drivers shouldn't be
touching that.

> +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> +			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> +{
> +	ida_init(&hv_iommu->domain_ids);
> +
> +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> +	    hv_iommu->max_iova_width > 48) {
> +		pr_info("5-level paging not supported, limiting iova width to 48.\n");
> +		hv_iommu->max_iova_width = 48;
> +	}
> +
> +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> +		.aperture_start = 0,
> +		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
> +		.force_aperture = true,
> +	};

I don't see anything reading this, I don't expect this to be used?

The max_iova_width has to be passed into the iommupt creation, which
it does:

 +	cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
 +	cfg.common.hw_max_oasz_lg2 = 52;
 +	cfg.top_level = (hv_iommu_device->max_iova_width > 48) ? 4 : 3;
 +	ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
 +	if (ret)

So just delete hv_iommu->geometry.

Also, VT-d has weirdness where the HW can require a 4 level table but
only a 3 level worth of IOVA width is being used. This was a
real-world bug we hit when converting to iommupt. This interaction
with the HV doesn't seem able to represent that.

> +	/*
> +	 * The page table code only maps x86 page sizes (4K/2M/1G); require the
> +	 * hypervisor to advertise a non-empty subset of exactly those.
> +	 */
> +	if (!hv_iommu_cap.pgsize_bitmap ||
> +	    (hv_iommu_cap.pgsize_bitmap & ~(u64)(SZ_4K | SZ_2M | SZ_1G))) {
> +		pr_err("unsupported page sizes: pgsize_bitmap=0x%llx\n",
> +		       hv_iommu_cap.pgsize_bitmap);
> +		return -ENODEV;
> +	}

This can just be

	if (!(hv_iommu_cap.pgsize_bitmap & PAGE_SIZE)) {
		pr_err("unsupported page sizes: pgsize_bitmap=0x%llx\n",
		       hv_iommu_cap.pgsize_bitmap);
		return -ENODEV;
	}

Which is all you really need. If the HV doesn't support 1G it is
perfectly fine, the iommupt page bitmap is already masked by this.
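
A small sketch of why the base-page test is enough, assuming each set bit in
pgsize_bitmap is a supported page size in bytes. base_page_supported() and
usable_page_sizes() are hypothetical helpers, and the SZ_* values are defined
locally to match the kernel constants:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K 0x00001000ULL
#define SZ_2M 0x00200000ULL
#define SZ_1G 0x40000000ULL

/* The only hard requirement: the base 4K page size is advertised. */
static bool base_page_supported(uint64_t pgsize_bitmap)
{
	return (pgsize_bitmap & SZ_4K) != 0;
}

/* Sizes the page-table code may use: its own set masked by the advertised one. */
static uint64_t usable_page_sizes(uint64_t pgsize_bitmap)
{
	assert(base_page_supported(pgsize_bitmap));
	return pgsize_bitmap & (SZ_4K | SZ_2M | SZ_1G);
}
```

A hypervisor advertising only 4K and 2M still passes the check, and any size
outside the x86 set simply drops out of the mask.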

> +	ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
> +	if (ret) {
> +		pr_err("iommu_device_register failed: %d\n", ret);
> +		goto err_sysfs_remove;
> +	}
> +
> +	pr_info("successfully initialized\n");

Don't log something so vague?

Jason

Thread overview: 11+ messages
2026-07-02 16:05 [PATCH v2 0/4] Hyper-V: Add para-virtualized IOMMU support for Linux guests Yu Zhang
2026-07-02 16:05 ` [PATCH v2 1/4] hyperv: Introduce new hypercall interfaces used by Hyper-V guest IOMMU Yu Zhang
2026-07-02 16:36   ` sashiko-bot
2026-07-02 16:05 ` [PATCH v2 2/4] Drivers: hv: Add logical device ID registry for vPCI devices Yu Zhang
2026-07-02 16:42   ` sashiko-bot
2026-07-02 16:05 ` [PATCH v2 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest Yu Zhang
2026-07-02 17:08   ` sashiko-bot
2026-07-03 17:32   ` Jason Gunthorpe
2026-07-02 16:05 ` [PATCH v2 4/4] iommu/hyperv: Add page-selective IOTLB flush support Yu Zhang
2026-07-02 17:20   ` sashiko-bot
2026-07-03 17:10   ` Jason Gunthorpe
