* [PATCH v0 00/15] PCI passthru on Hyper-V (Part I)
@ 2026-01-20 6:42 Mukesh R
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
` (15 more replies)
0 siblings, 16 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Implement passthru of PCI devices to unprivileged virtual machines
(VMs) when Linux is running as a privileged VM on the Microsoft Hyper-V
hypervisor. This support is designed to fit within the VFIO
framework, and any VMM needing to use it must go through the VFIO subsystem.
Both full device passthru and SR-IOV based VFs are supported.
There are three cases where Linux can run as a privileged VM (aka MSHV):
bare-metal root (meaning Hyper-V+Linux), L1VH, and nested.
At a high level, the hypervisor supports traditional mapped iommu domains,
which use explicit map and unmap hypercalls to map and unmap guest
RAM in the iommu subsystem. Hyper-V also has a concept of direct attach
devices, whereby the iommu subsystem simply uses the guest HW page table
(ept/npt/..). This series adds support for both, and both are made to
work with the VFIO type1 subsystem.
While this Part I focuses on memory mappings, upcoming Part II
will focus on irq bypass along with some minor irq remapping
updates.
This patch series was tested using Cloud Hypervisor version 48. Qemu
support for MSHV is in the works, and that will be extended to include
PCI passthru and SR-IOV support in the near future.
Based on: 8f0b4cce4481 (origin/hyperv-next)
Thanks,
-Mukesh
Mukesh Rathor (15):
iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
x86/hyperv: cosmetic changes in irqdomain.c for readability
x86/hyperv: add insufficient memory support in irqdomain.c
mshv: Provide a way to get partition id if running in a VMM process
mshv: Declarations and definitions for VFIO-MSHV bridge device
mshv: Implement mshv bridge device for VFIO
mshv: Add ioctl support for MSHV-VFIO bridge device
PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
mshv: Import data structs around device domains and irq remapping
PCI: hv: Build device id for a VMBus device
x86/hyperv: Build logical device ids for PCI passthru hcalls
x86/hyperv: Implement hyperv virtual iommu
x86/hyperv: Basic interrupt support for direct attached devices
mshv: Remove mapping of mmio space during map user ioctl
mshv: Populate mmio mappings for PCI passthru
MAINTAINERS | 1 +
arch/arm64/include/asm/mshyperv.h | 15 +
arch/x86/hyperv/irqdomain.c | 314 ++++++---
arch/x86/include/asm/mshyperv.h | 21 +
arch/x86/kernel/pci-dma.c | 2 +
drivers/hv/Makefile | 3 +-
drivers/hv/mshv_root.h | 24 +
drivers/hv/mshv_root_main.c | 296 +++++++-
drivers/hv/mshv_vfio.c | 210 ++++++
drivers/iommu/Kconfig | 1 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/hyperv-iommu.c | 1004 +++++++++++++++++++++------
drivers/iommu/hyperv-irq.c | 330 +++++++++
drivers/pci/controller/pci-hyperv.c | 207 ++++--
include/asm-generic/mshyperv.h | 1 +
include/hyperv/hvgdk_mini.h | 11 +
include/hyperv/hvhdk_mini.h | 112 +++
include/linux/hyperv.h | 6 +
include/uapi/linux/mshv.h | 31 +
19 files changed, 2182 insertions(+), 409 deletions(-)
create mode 100644 drivers/hv/mshv_vfio.c
create mode 100644 drivers/iommu/hyperv-irq.c
--
2.51.2.vfs.0.1
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-20 19:08 ` kernel test robot
` (2 more replies)
2026-01-20 6:42 ` [PATCH v0 02/15] x86/hyperv: cosmetic changes in irqdomain.c for readability Mukesh R
` (14 subsequent siblings)
15 siblings, 3 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
This file actually implements irq remapping, so rename it to the more
appropriate hyperv-irq.c. A new file named hyperv-iommu.c will be introduced
later.
Also, drop the CONFIG_IRQ_REMAP guard from the file and select IRQ_REMAP
from Kconfig instead.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
MAINTAINERS | 2 +-
drivers/iommu/Kconfig | 1 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 4 ----
4 files changed, 3 insertions(+), 6 deletions(-)
rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
diff --git a/MAINTAINERS b/MAINTAINERS
index 5b11839cba9d..381a0e086382 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11741,7 +11741,7 @@ F: drivers/hid/hid-hyperv.c
F: drivers/hv/
F: drivers/infiniband/hw/mana/
F: drivers/input/serio/hyperv-keyboard.c
-F: drivers/iommu/hyperv-iommu.c
+F: drivers/iommu/hyperv-irq.c
F: drivers/net/ethernet/microsoft/
F: drivers/net/hyperv/
F: drivers/pci/controller/pci-hyperv-intf.c
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 99095645134f..b4cc2b42b338 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -355,6 +355,7 @@ config HYPERV_IOMMU
bool "Hyper-V IRQ Handling"
depends on HYPERV && X86
select IOMMU_API
+ select IRQ_REMAP
default HYPERV
help
Stub IOMMU driver to handle IRQs to support Hyper-V Linux
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 8e8843316c4b..598c39558e7d 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
-obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
+obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
similarity index 99%
rename from drivers/iommu/hyperv-iommu.c
rename to drivers/iommu/hyperv-irq.c
index 0961ac805944..1944440a5004 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-irq.c
@@ -24,8 +24,6 @@
#include "irq_remapping.h"
-#ifdef CONFIG_IRQ_REMAP
-
/*
* According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
* Redirection Table. Hyper-V exposes one single IO-APIC and so define
@@ -330,5 +328,3 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
.alloc = hyperv_root_irq_remapping_alloc,
.free = hyperv_root_irq_remapping_free,
};
-
-#endif
--
2.51.2.vfs.0.1
* [PATCH v0 02/15] x86/hyperv: cosmetic changes in irqdomain.c for readability
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-02-05 18:47 ` Anirudh Rayabharam
2026-01-20 6:42 ` [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c Mukesh R
` (13 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Make cosmetic changes:
o Rename struct pci_dev *dev to *pdev, since struct device *dev is
used elsewhere in the file and throughout the kernel
o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
of building other types of device ids
o Fix checkpatch.pl issues with return and an extraneous printk
o Replace spaces with tabs
o Rename union hv_device_id variables to hv_devid, since the code
paths involve many types of device ids
o Fix indentation in a large if block by using a goto
There are no functional changes.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
arch/x86/hyperv/irqdomain.c | 197 +++++++++++++++++++-----------------
1 file changed, 103 insertions(+), 94 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index c3ba12b1bc07..f6b61483b3b8 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -1,5 +1,4 @@
// SPDX-License-Identifier: GPL-2.0
-
/*
* Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
*
@@ -14,8 +13,8 @@
#include <linux/irqchip/irq-msi-lib.h>
#include <asm/mshyperv.h>
-static int hv_map_interrupt(union hv_device_id device_id, bool level,
- int cpu, int vector, struct hv_interrupt_entry *entry)
+static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
+ int cpu, int vector, struct hv_interrupt_entry *ret_entry)
{
struct hv_input_map_device_interrupt *input;
struct hv_output_map_device_interrupt *output;
@@ -32,7 +31,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
intr_desc = &input->interrupt_descriptor;
memset(input, 0, sizeof(*input));
input->partition_id = hv_current_partition_id;
- input->device_id = device_id.as_uint64;
+ input->device_id = hv_devid.as_uint64;
intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
intr_desc->vector_count = 1;
intr_desc->target.vector = vector;
@@ -44,7 +43,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
intr_desc->target.vp_set.valid_bank_mask = 0;
intr_desc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
- nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
+ nr_bank = cpumask_to_vpset(&intr_desc->target.vp_set, cpumask_of(cpu));
if (nr_bank < 0) {
local_irq_restore(flags);
pr_err("%s: unable to generate VP set\n", __func__);
@@ -61,7 +60,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
input, output);
- *entry = output->interrupt_entry;
+ *ret_entry = output->interrupt_entry;
local_irq_restore(flags);
@@ -71,21 +70,19 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
return hv_result_to_errno(status);
}
-static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
{
unsigned long flags;
struct hv_input_unmap_device_interrupt *input;
- struct hv_interrupt_entry *intr_entry;
u64 status;
local_irq_save(flags);
input = *this_cpu_ptr(hyperv_pcpu_input_arg);
memset(input, 0, sizeof(*input));
- intr_entry = &input->interrupt_entry;
input->partition_id = hv_current_partition_id;
input->device_id = id;
- *intr_entry = *old_entry;
+ input->interrupt_entry = *irq_entry;
status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
local_irq_restore(flags);
@@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
return 0;
}
-static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
+static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
{
- union hv_device_id dev_id;
+ int pos;
+ union hv_device_id hv_devid;
struct rid_data data = {
.bridge = NULL,
- .rid = PCI_DEVID(dev->bus->number, dev->devfn)
+ .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
};
- pci_for_each_dma_alias(dev, get_rid_cb, &data);
+ pci_for_each_dma_alias(pdev, get_rid_cb, &data);
- dev_id.as_uint64 = 0;
- dev_id.device_type = HV_DEVICE_TYPE_PCI;
- dev_id.pci.segment = pci_domain_nr(dev->bus);
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_PCI;
+ hv_devid.pci.segment = pci_domain_nr(pdev->bus);
- dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
- dev_id.pci.bdf.device = PCI_SLOT(data.rid);
- dev_id.pci.bdf.function = PCI_FUNC(data.rid);
- dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
+ hv_devid.pci.bdf.bus = PCI_BUS_NUM(data.rid);
+ hv_devid.pci.bdf.device = PCI_SLOT(data.rid);
+ hv_devid.pci.bdf.function = PCI_FUNC(data.rid);
+ hv_devid.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
- if (data.bridge) {
- int pos;
+ if (data.bridge == NULL)
+ goto out;
- /*
- * Microsoft Hypervisor requires a bus range when the bridge is
- * running in PCI-X mode.
- *
- * To distinguish conventional vs PCI-X bridge, we can check
- * the bridge's PCI-X Secondary Status Register, Secondary Bus
- * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
- * Specification Revision 1.0 5.2.2.1.3.
- *
- * Value zero means it is in conventional mode, otherwise it is
- * in PCI-X mode.
- */
+ /*
+ * Microsoft Hypervisor requires a bus range when the bridge is
+ * running in PCI-X mode.
+ *
+ * To distinguish conventional vs PCI-X bridge, we can check
+ * the bridge's PCI-X Secondary Status Register, Secondary Bus
+ * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
+ * Specification Revision 1.0 5.2.2.1.3.
+ *
+ * Value zero means it is in conventional mode, otherwise it is
+ * in PCI-X mode.
+ */
- pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
- if (pos) {
- u16 status;
+ pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
+ if (pos) {
+ u16 status;
- pci_read_config_word(data.bridge, pos +
- PCI_X_BRIDGE_SSTATUS, &status);
+ pci_read_config_word(data.bridge, pos + PCI_X_BRIDGE_SSTATUS,
+ &status);
- if (status & PCI_X_SSTATUS_FREQ) {
- /* Non-zero, PCI-X mode */
- u8 sec_bus, sub_bus;
+ if (status & PCI_X_SSTATUS_FREQ) {
+ /* Non-zero, PCI-X mode */
+ u8 sec_bus, sub_bus;
- dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
+ hv_devid.pci.source_shadow =
+ HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
- pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
- dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
- pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
- dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
- }
+ pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS,
+ &sec_bus);
+ hv_devid.pci.shadow_bus_range.secondary_bus = sec_bus;
+ pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS,
+ &sub_bus);
+ hv_devid.pci.shadow_bus_range.subordinate_bus = sub_bus;
}
}
- return dev_id;
+out:
+ return hv_devid;
}
-/**
- * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+/*
+ * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
* @data: Describes the IRQ
* @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
*
@@ -188,22 +189,23 @@ int hv_map_msi_interrupt(struct irq_data *data,
{
struct irq_cfg *cfg = irqd_cfg(data);
struct hv_interrupt_entry dummy;
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
struct msi_desc *msidesc;
- struct pci_dev *dev;
+ struct pci_dev *pdev;
int cpu;
msidesc = irq_data_get_msi_desc(data);
- dev = msi_desc_to_pci_dev(msidesc);
- device_id = hv_build_pci_dev_id(dev);
+ pdev = msi_desc_to_pci_dev(msidesc);
+ hv_devid = hv_build_devid_type_pci(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
- return hv_map_interrupt(device_id, false, cpu, cfg->vector,
+ return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
out_entry ? out_entry : &dummy);
}
EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
-static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
+static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
+ struct msi_msg *msg)
{
/* High address is always 0 */
msg->address_hi = 0;
@@ -211,17 +213,19 @@ static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi
msg->data = entry->msi_entry.data.as_uint32;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry);
+
static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct hv_interrupt_entry *stored_entry;
struct irq_cfg *cfg = irqd_cfg(data);
struct msi_desc *msidesc;
- struct pci_dev *dev;
+ struct pci_dev *pdev;
int ret;
msidesc = irq_data_get_msi_desc(data);
- dev = msi_desc_to_pci_dev(msidesc);
+ pdev = msi_desc_to_pci_dev(msidesc);
if (!cfg) {
pr_debug("%s: cfg is NULL", __func__);
@@ -240,7 +244,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
stored_entry = data->chip_data;
data->chip_data = NULL;
- ret = hv_unmap_msi_interrupt(dev, stored_entry);
+ ret = hv_unmap_msi_interrupt(pdev, stored_entry);
kfree(stored_entry);
@@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
}
stored_entry = kzalloc(sizeof(*stored_entry), GFP_ATOMIC);
- if (!stored_entry) {
- pr_debug("%s: failed to allocate chip data\n", __func__);
+ if (!stored_entry)
return;
- }
ret = hv_map_msi_interrupt(data, stored_entry);
if (ret) {
@@ -262,18 +264,21 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
data->chip_data = stored_entry;
entry_to_msi_msg(data->chip_data, msg);
-
- return;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry)
{
- return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry);
+ union hv_device_id hv_devid;
+
+ hv_devid = hv_build_devid_type_pci(pdev);
+ return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
}
-static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
+/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
+static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
{
- struct hv_interrupt_entry old_entry;
+ struct hv_interrupt_entry irq_entry;
struct msi_msg msg;
if (!irqd->chip_data) {
@@ -281,13 +286,13 @@ static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
return;
}
- old_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
- entry_to_msi_msg(&old_entry, &msg);
+ irq_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
+ entry_to_msi_msg(&irq_entry, &msg);
kfree(irqd->chip_data);
irqd->chip_data = NULL;
- (void)hv_unmap_msi_interrupt(dev, &old_entry);
+ (void)hv_unmap_msi_interrupt(pdev, &irq_entry);
}
/*
@@ -302,7 +307,8 @@ static struct irq_chip hv_pci_msi_controller = {
};
static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
- struct irq_domain *real_parent, struct msi_domain_info *info)
+ struct irq_domain *real_parent,
+ struct msi_domain_info *info)
{
struct irq_chip *chip = info->chip;
@@ -317,7 +323,8 @@ static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
}
#define HV_MSI_FLAGS_SUPPORTED (MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
-#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
+#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | \
+ MSI_FLAG_USE_DEF_CHIP_OPS)
static struct msi_parent_ops hv_msi_parent_ops = {
.supported_flags = HV_MSI_FLAGS_SUPPORTED,
@@ -329,14 +336,13 @@ static struct msi_parent_ops hv_msi_parent_ops = {
.init_dev_msi_info = hv_init_dev_msi_info,
};
-static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs,
- void *arg)
+static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
{
/*
- * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e. everything except
- * entry_to_msi_msg() should be in here.
+ * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e.
+ * everything except entry_to_msi_msg() should be in here.
*/
-
int ret;
ret = irq_domain_alloc_irqs_parent(d, virq, nr_irqs, arg);
@@ -344,13 +350,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
return ret;
for (int i = 0; i < nr_irqs; ++i) {
- irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
- handle_edge_irq, NULL, "edge");
+ irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
+ NULL, handle_edge_irq, NULL, "edge");
}
+
return 0;
}
-static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
+static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
+ unsigned int nr_irqs)
{
for (int i = 0; i < nr_irqs; ++i) {
struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
@@ -362,6 +370,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
}
+
irq_domain_free_irqs_top(d, virq, nr_irqs);
}
@@ -394,25 +403,25 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
{
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
- device_id.as_uint64 = 0;
- device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
- device_id.ioapic.ioapic_id = (u8)ioapic_id;
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+ hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_unmap_interrupt(device_id.as_uint64, entry);
+ return hv_unmap_interrupt(hv_devid.as_uint64, entry);
}
EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
struct hv_interrupt_entry *entry)
{
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
- device_id.as_uint64 = 0;
- device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
- device_id.ioapic.ioapic_id = (u8)ioapic_id;
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+ hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_map_interrupt(device_id, level, cpu, vector, entry);
+ return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
}
EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
--
2.51.2.vfs.0.1
* [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
2026-01-20 6:42 ` [PATCH v0 02/15] x86/hyperv: cosmetic changes in irqdomain.c for readability Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-21 0:53 ` kernel test robot
2026-01-20 6:42 ` [PATCH v0 04/15] mshv: Provide a way to get partition id if running in a VMM process Mukesh R
` (12 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Passthru exposes an insufficient-memory failure in the current map
device interrupt hypercall. On such a failure, we must deposit
more memory with the hypervisor and redo the hypercall. Add support for
that. Depositing memory requires the partition id, so make it a parameter
of the map interrupt function.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
arch/x86/hyperv/irqdomain.c | 38 +++++++++++++++++++++++++++++++------
1 file changed, 32 insertions(+), 6 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index f6b61483b3b8..ccbe5848a28f 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -13,8 +13,9 @@
#include <linux/irqchip/irq-msi-lib.h>
#include <asm/mshyperv.h>
-static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
- int cpu, int vector, struct hv_interrupt_entry *ret_entry)
+static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
+ bool level, int cpu, int vector,
+ struct hv_interrupt_entry *ret_entry)
{
struct hv_input_map_device_interrupt *input;
struct hv_output_map_device_interrupt *output;
@@ -30,8 +31,10 @@ static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
intr_desc = &input->interrupt_descriptor;
memset(input, 0, sizeof(*input));
- input->partition_id = hv_current_partition_id;
+
+ input->partition_id = ptid;
input->device_id = hv_devid.as_uint64;
+
intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
intr_desc->vector_count = 1;
intr_desc->target.vector = vector;
@@ -64,6 +67,28 @@ static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
local_irq_restore(flags);
+ return status;
+}
+
+static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
+ int cpu, int vector,
+ struct hv_interrupt_entry *ret_entry)
+{
+ u64 status;
+ int rc, deposit_pgs = 16; /* don't loop forever */
+
+ while (deposit_pgs--) {
+ status = hv_map_interrupt_hcall(ptid, device_id, level, cpu,
+ vector, ret_entry);
+
+ if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
+ break;
+
+ rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
+ if (rc)
+ break;
+ };
+
if (!hv_result_success(status))
hv_status_err(status, "\n");
@@ -199,8 +224,8 @@ int hv_map_msi_interrupt(struct irq_data *data,
hv_devid = hv_build_devid_type_pci(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
- return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
- out_entry ? out_entry : &dummy);
+ return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
+ cfg->vector, out_entry ? out_entry : &dummy);
}
EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
@@ -422,6 +447,7 @@ int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
+ return hv_map_interrupt(hv_current_partition_id, hv_devid, level, cpu,
+ vector, entry);
}
EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
--
2.51.2.vfs.0.1
* [PATCH v0 04/15] mshv: Provide a way to get partition id if running in a VMM process
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (2 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-23 18:23 ` Nuno Das Neves
2026-01-20 6:42 ` [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device Mukesh R
` (11 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Many PCI passthru related hypercalls require the partition id of the target
guest. Guests are managed by the MSHV driver, and the partition id
is maintained only there. Add a field to the partition struct in the MSHV
driver to save the tgid of the VMM process creating the partition,
and add a function there to look up the partition id for a valid VMM tgid.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_root.h | 1 +
drivers/hv/mshv_root_main.c | 35 +++++++++++++++++++++++++++-------
include/asm-generic/mshyperv.h | 1 +
3 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..c3753b009fd8 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -134,6 +134,7 @@ struct mshv_partition {
struct mshv_girq_routing_table __rcu *pt_girq_tbl;
u64 isolation_type;
+ pid_t pt_vmm_tgid;
bool import_completed;
bool pt_initialized;
};
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..83c7bad269a0 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1823,6 +1823,20 @@ mshv_partition_release(struct inode *inode, struct file *filp)
return 0;
}
+/* Given a process tgid, return partition id if it is a VMM process */
+u64 mshv_pid_to_partid(pid_t tgid)
+{
+ struct mshv_partition *pt;
+ int i;
+
+ hash_for_each_rcu(mshv_root.pt_htable, i, pt, pt_hnode)
+ if (pt->pt_vmm_tgid == tgid)
+ return pt->pt_id;
+
+ return HV_PARTITION_ID_INVALID;
+}
+EXPORT_SYMBOL_GPL(mshv_pid_to_partid);
+
static int
add_partition(struct mshv_partition *partition)
{
@@ -1987,13 +2001,20 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
goto delete_partition;
ret = mshv_init_async_handler(partition);
- if (!ret) {
- ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
- &mshv_partition_fops,
- partition, O_RDWR));
- if (ret >= 0)
- return ret;
- }
+ if (ret)
+ goto rem_partition;
+
+ ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
+ &mshv_partition_fops,
+ partition, O_RDWR));
+ if (ret < 0)
+ goto rem_partition;
+
+ partition->pt_vmm_tgid = current->tgid;
+
+ return ret;
+
+rem_partition:
remove_partition(partition);
delete_partition:
hv_call_delete_partition(partition->pt_id);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index ecedab554c80..e46a38916e76 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -211,6 +211,7 @@ void __init ms_hyperv_late_init(void);
int hv_common_cpu_init(unsigned int cpu);
int hv_common_cpu_die(unsigned int cpu);
void hv_identify_partition_type(void);
+u64 mshv_pid_to_partid(pid_t tgid);
/**
* hv_cpu_number_to_vp_number() - Map CPU to VP.
--
2.51.2.vfs.0.1
* [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (3 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 04/15] mshv: Provide a way to get partition id if running in a VMM process Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-23 18:25 ` Nuno Das Neves
2026-01-20 6:42 ` [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO Mukesh R
` (10 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Add the data structs needed by the subsequent patch, which introduces a new
module implementing the VFIO-MSHV pseudo device.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_root.h | 23 +++++++++++++++++++++++
include/uapi/linux/mshv.h | 31 +++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index c3753b009fd8..42e1da1d545b 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -220,6 +220,29 @@ struct port_table_info {
};
};
+struct mshv_device {
+ const struct mshv_device_ops *device_ops;
+ struct mshv_partition *device_pt;
+ void *device_private;
+ struct hlist_node device_ptnode;
+};
+
+struct mshv_device_ops {
+ const char *device_name;
+ long (*device_create)(struct mshv_device *dev, u32 type);
+ void (*device_release)(struct mshv_device *dev);
+ long (*device_set_attr)(struct mshv_device *dev,
+ struct mshv_device_attr *attr);
+ long (*device_has_attr)(struct mshv_device *dev,
+ struct mshv_device_attr *attr);
+};
+
+extern struct mshv_device_ops mshv_vfio_device_ops;
+int mshv_vfio_ops_init(void);
+void mshv_vfio_ops_exit(void);
+long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
+ void __user *user_args);
+
int mshv_update_routing_table(struct mshv_partition *partition,
const struct mshv_user_irq_entry *entries,
unsigned int numents);
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index dee3ece28ce5..b7b10f9e2896 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -252,6 +252,7 @@ struct mshv_root_hvcall {
#define MSHV_GET_GPAP_ACCESS_BITMAP _IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
/* Generic hypercall */
#define MSHV_ROOT_HVCALL _IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
+#define MSHV_CREATE_DEVICE _IOWR(MSHV_IOCTL, 0x08, struct mshv_create_device)
/*
********************************
@@ -402,4 +403,34 @@ struct mshv_sint_mask {
/* hv_hvcall device */
#define MSHV_HVCALL_SETUP _IOW(MSHV_IOCTL, 0x1E, struct mshv_vtl_hvcall_setup)
#define MSHV_HVCALL _IOWR(MSHV_IOCTL, 0x1F, struct mshv_vtl_hvcall)
+
+/* device passthru */
+#define MSHV_CREATE_DEVICE_TEST 1
+
+enum {
+ MSHV_DEV_TYPE_VFIO,
+ MSHV_DEV_TYPE_MAX,
+};
+
+struct mshv_create_device {
+ __u32 type; /* in: MSHV_DEV_TYPE_xxx */
+ __u32 fd; /* out: device handle */
+ __u32 flags; /* in: MSHV_CREATE_DEVICE_xxx */
+};
+
+#define MSHV_DEV_VFIO_FILE 1
+#define MSHV_DEV_VFIO_FILE_ADD 1
+#define MSHV_DEV_VFIO_FILE_DEL 2
+
+struct mshv_device_attr {
+ __u32 flags; /* no flags currently defined */
+ __u32 group; /* device-defined */
+ __u64 attr; /* group-defined */
+ __u64 addr; /* userspace address of attr data */
+};
+
+/* Device fds created with MSHV_CREATE_DEVICE */
+#define MSHV_SET_DEVICE_ATTR _IOW(MSHV_IOCTL, 0x00, struct mshv_device_attr)
+#define MSHV_HAS_DEVICE_ATTR _IOW(MSHV_IOCTL, 0x01, struct mshv_device_attr)
+
#endif
--
2.51.2.vfs.0.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (4 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-20 16:09 ` Stanislav Kinsburskii
2026-01-23 18:32 ` Nuno Das Neves
2026-01-20 6:42 ` [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device Mukesh R
` (9 subsequent siblings)
15 siblings, 2 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Add a new file implementing the VFIO-MSHV bridge pseudo device. These
functions are invoked via the VFIO framework; credit to virt/kvm/vfio.c,
from which this file was adapted.
Original author: Wei Liu <wei.liu@kernel.org>
(Slightly modified from the original version).
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/Makefile | 3 +-
drivers/hv/mshv_vfio.c | 210 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 212 insertions(+), 1 deletion(-)
create mode 100644 drivers/hv/mshv_vfio.c
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index a49f93c2d245..eae003c4cb8f 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
- mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+ mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
+ mshv_vfio.o
mshv_vtl-y := mshv_vtl_main.o
# Code that must be built-in
diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
new file mode 100644
index 000000000000..6ea4d99a3bd2
--- /dev/null
+++ b/drivers/hv/mshv_vfio.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VFIO-MSHV bridge pseudo device
+ *
+ * Heavily inspired by the VFIO-KVM bridge pseudo device.
+ */
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+struct mshv_vfio_file {
+ struct list_head node;
+ struct file *file;
+};
+
+struct mshv_vfio {
+ struct list_head file_list; /* list of struct mshv_vfio_file */
+ struct mutex lock;
+};
+
+static bool mshv_vfio_file_is_valid(struct file *file)
+{
+ bool (*fn)(struct file *file);
+ bool ret;
+
+ fn = symbol_get(vfio_file_is_valid);
+ if (!fn)
+ return false;
+
+ ret = fn(file);
+
+ symbol_put(vfio_file_is_valid);
+
+ return ret;
+}
+
+static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
+{
+ struct mshv_vfio *mshv_vfio = mshvdev->device_private;
+ struct mshv_vfio_file *mvf;
+ struct file *filp;
+ long ret = 0;
+
+ filp = fget(fd);
+ if (!filp)
+ return -EBADF;
+
+ /* Ensure the FD is a vfio FD. */
+ if (!mshv_vfio_file_is_valid(filp)) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ mutex_lock(&mshv_vfio->lock);
+
+ list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
+ if (mvf->file == filp) {
+ ret = -EEXIST;
+ goto out_unlock;
+ }
+ }
+
+ mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
+ if (!mvf) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ mvf->file = get_file(filp);
+ list_add_tail(&mvf->node, &mshv_vfio->file_list);
+
+out_unlock:
+ mutex_unlock(&mshv_vfio->lock);
+out_fput:
+ fput(filp);
+ return ret;
+}
+
+static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
+{
+ struct mshv_vfio *mshv_vfio = mshvdev->device_private;
+ struct mshv_vfio_file *mvf;
+ long ret;
+
+ CLASS(fd, f)(fd);
+
+ if (fd_empty(f))
+ return -EBADF;
+
+ ret = -ENOENT;
+ mutex_lock(&mshv_vfio->lock);
+
+ list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
+ if (mvf->file != fd_file(f))
+ continue;
+
+ list_del(&mvf->node);
+ fput(mvf->file);
+ kfree(mvf);
+ ret = 0;
+ break;
+ }
+
+ mutex_unlock(&mshv_vfio->lock);
+ return ret;
+}
+
+static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
+ void __user *arg)
+{
+ int32_t __user *argp = arg;
+ int32_t fd;
+
+ switch (attr) {
+ case MSHV_DEV_VFIO_FILE_ADD:
+ if (get_user(fd, argp))
+ return -EFAULT;
+ return mshv_vfio_file_add(mshvdev, fd);
+
+ case MSHV_DEV_VFIO_FILE_DEL:
+ if (get_user(fd, argp))
+ return -EFAULT;
+ return mshv_vfio_file_del(mshvdev, fd);
+ }
+
+ return -ENXIO;
+}
+
+static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
+ struct mshv_device_attr *attr)
+{
+ switch (attr->group) {
+ case MSHV_DEV_VFIO_FILE:
+ return mshv_vfio_set_file(mshvdev, attr->attr,
+ u64_to_user_ptr(attr->addr));
+ }
+
+ return -ENXIO;
+}
+
+static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
+ struct mshv_device_attr *attr)
+{
+ switch (attr->group) {
+ case MSHV_DEV_VFIO_FILE:
+ switch (attr->attr) {
+ case MSHV_DEV_VFIO_FILE_ADD:
+ case MSHV_DEV_VFIO_FILE_DEL:
+ return 0;
+ }
+
+ break;
+ }
+
+ return -ENXIO;
+}
+
+static long mshv_vfio_create_device(struct mshv_device *mshvdev, u32 type)
+{
+ struct mshv_device *tmp;
+ struct mshv_vfio *mshv_vfio;
+
+ /* Only one VFIO "device" per VM */
+ hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
+ device_ptnode)
+ if (tmp->device_ops == &mshv_vfio_device_ops)
+ return -EBUSY;
+
+ mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
+ if (mshv_vfio == NULL)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&mshv_vfio->file_list);
+ mutex_init(&mshv_vfio->lock);
+
+ mshvdev->device_private = mshv_vfio;
+
+ return 0;
+}
+
+/* This is called from mshv_device_fop_release() */
+static void mshv_vfio_release_device(struct mshv_device *mshvdev)
+{
+ struct mshv_vfio *mv = mshvdev->device_private;
+ struct mshv_vfio_file *mvf, *tmp;
+
+ list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
+ fput(mvf->file);
+ list_del(&mvf->node);
+ kfree(mvf);
+ }
+
+ kfree(mv);
+ kfree(mshvdev);
+}
+
+struct mshv_device_ops mshv_vfio_device_ops = {
+ .device_name = "mshv-vfio",
+ .device_create = mshv_vfio_create_device,
+ .device_release = mshv_vfio_release_device,
+ .device_set_attr = mshv_vfio_set_attr,
+ .device_has_attr = mshv_vfio_has_attr,
+};
--
2.51.2.vfs.0.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (5 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-20 16:13 ` Stanislav Kinsburskii
2026-01-20 6:42 ` [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg Mukesh R
` (8 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Add ioctl support for creating MSHV devices for a partition. At
present only the VFIO device type is supported, but more could be
added. At a high level, the partition ioctl to create a device verifies
the type is VFIO and performs setup for the bridge code in mshv_vfio.c.
Adapted from KVM device ioctls.
Credits: Original author: Wei Liu <wei.liu@kernel.org>
NB: Slightly modified from the original version.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 126 ++++++++++++++++++++++++++++++++++++
1 file changed, 126 insertions(+)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 83c7bad269a0..27313419828d 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1551,6 +1551,129 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
return ret;
}
+static long mshv_device_attr_ioctl(struct mshv_device *mshv_dev, int cmd,
+ ulong uarg)
+{
+ struct mshv_device_attr attr;
+ const struct mshv_device_ops *devops = mshv_dev->device_ops;
+
+ if (copy_from_user(&attr, (void __user *)uarg, sizeof(attr)))
+ return -EFAULT;
+
+ switch (cmd) {
+ case MSHV_SET_DEVICE_ATTR:
+ if (devops->device_set_attr)
+ return devops->device_set_attr(mshv_dev, &attr);
+ break;
+ case MSHV_HAS_DEVICE_ATTR:
+ if (devops->device_has_attr)
+ return devops->device_has_attr(mshv_dev, &attr);
+ break;
+ }
+
+ return -EPERM;
+}
+
+static long mshv_device_fop_ioctl(struct file *filp, unsigned int cmd,
+ ulong uarg)
+{
+ struct mshv_device *mshv_dev = filp->private_data;
+
+ switch (cmd) {
+ case MSHV_SET_DEVICE_ATTR:
+ case MSHV_HAS_DEVICE_ATTR:
+ return mshv_device_attr_ioctl(mshv_dev, cmd, uarg);
+ }
+
+ return -ENOTTY;
+}
+
+static int mshv_device_fop_release(struct inode *inode, struct file *filp)
+{
+ struct mshv_device *mshv_dev = filp->private_data;
+ struct mshv_partition *partition = mshv_dev->device_pt;
+
+ if (mshv_dev->device_ops->device_release) {
+ mutex_lock(&partition->pt_mutex);
+ hlist_del(&mshv_dev->device_ptnode);
+ mshv_dev->device_ops->device_release(mshv_dev);
+ mutex_unlock(&partition->pt_mutex);
+ }
+
+ mshv_partition_put(partition);
+ return 0;
+}
+
+static const struct file_operations mshv_device_fops = {
+ .owner = THIS_MODULE,
+ .unlocked_ioctl = mshv_device_fop_ioctl,
+ .release = mshv_device_fop_release,
+};
+
+long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
+ void __user *uarg)
+{
+ long rc;
+ struct mshv_create_device devargk;
+ struct mshv_device *mshv_dev;
+ const struct mshv_device_ops *vfio_ops;
+ int type;
+
+ if (copy_from_user(&devargk, uarg, sizeof(devargk))) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ /* At present, only VFIO is supported */
+ if (devargk.type != MSHV_DEV_TYPE_VFIO) {
+ rc = -ENODEV;
+ goto out;
+ }
+
+ if (devargk.flags & MSHV_CREATE_DEVICE_TEST) {
+ rc = 0;
+ goto out;
+ }
+
+ mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
+ if (mshv_dev == NULL) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ vfio_ops = &mshv_vfio_device_ops;
+ mshv_dev->device_ops = vfio_ops;
+ mshv_dev->device_pt = partition;
+ type = devargk.type;
+ rc = vfio_ops->device_create(mshv_dev, type);
+ if (rc < 0) {
+ kfree(mshv_dev);
+ goto out;
+ }
+
+ hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
+
+ mshv_partition_get(partition);
+ rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
+ mshv_dev, O_RDWR | O_CLOEXEC);
+ if (rc < 0) {
+ mshv_partition_put(partition);
+ hlist_del(&mshv_dev->device_ptnode);
+ vfio_ops->device_release(mshv_dev);
+ goto out;
+ }
+
+ devargk.fd = rc;
+ rc = 0;
+
+ if (copy_to_user(uarg, &devargk, sizeof(devargk))) {
+ rc = -EFAULT;
+ goto out;
+ }
+out:
+ return rc;
+}
+
static long
mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
@@ -1587,6 +1710,9 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
case MSHV_ROOT_HVCALL:
ret = mshv_ioctl_passthru_hvcall(partition, true, uarg);
break;
+ case MSHV_CREATE_DEVICE:
+ ret = mshv_partition_ioctl_create_device(partition, uarg);
+ break;
default:
ret = -ENOTTY;
}
--
2.51.2.vfs.0.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (6 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-28 14:03 ` Manivannan Sadhasivam
2026-01-20 6:42 ` [PATCH v0 09/15] mshv: Import data structs around device domains and irq remapping Mukesh R
` (7 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
The main change here is to rename hv_compose_msi_msg to
hv_vmbus_compose_msi_msg, because upcoming patches introduce a new
hv_compose_msi_msg that builds MSI messages for both the VMBus and
non-VMBus cases (VMBus is not used on the baremetal root partition, for
example). While at it, replace spaces with tabs and fix some formatting
involving excessive line wraps. There is no functional change.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/pci/controller/pci-hyperv.c | 95 +++++++++++++++--------------
1 file changed, 48 insertions(+), 47 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 1e237d3538f9..8bc6a38c9b5a 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -30,7 +30,7 @@
* function's configuration space is zero.
*
* The rest of this driver mostly maps PCI concepts onto underlying Hyper-V
- * facilities. For instance, the configuration space of a function exposed
+ * facilities. For instance, the configuration space of a function exposed
* by Hyper-V is mapped into a single page of memory space, and the
* read and write handlers for config space must be aware of this mechanism.
* Similarly, device setup and teardown involves messages sent to and from
@@ -109,33 +109,33 @@ enum pci_message_type {
/*
* Version 1.1
*/
- PCI_MESSAGE_BASE = 0x42490000,
- PCI_BUS_RELATIONS = PCI_MESSAGE_BASE + 0,
- PCI_QUERY_BUS_RELATIONS = PCI_MESSAGE_BASE + 1,
- PCI_POWER_STATE_CHANGE = PCI_MESSAGE_BASE + 4,
+ PCI_MESSAGE_BASE = 0x42490000,
+ PCI_BUS_RELATIONS = PCI_MESSAGE_BASE + 0,
+ PCI_QUERY_BUS_RELATIONS = PCI_MESSAGE_BASE + 1,
+ PCI_POWER_STATE_CHANGE = PCI_MESSAGE_BASE + 4,
PCI_QUERY_RESOURCE_REQUIREMENTS = PCI_MESSAGE_BASE + 5,
- PCI_QUERY_RESOURCE_RESOURCES = PCI_MESSAGE_BASE + 6,
- PCI_BUS_D0ENTRY = PCI_MESSAGE_BASE + 7,
- PCI_BUS_D0EXIT = PCI_MESSAGE_BASE + 8,
- PCI_READ_BLOCK = PCI_MESSAGE_BASE + 9,
- PCI_WRITE_BLOCK = PCI_MESSAGE_BASE + 0xA,
- PCI_EJECT = PCI_MESSAGE_BASE + 0xB,
- PCI_QUERY_STOP = PCI_MESSAGE_BASE + 0xC,
- PCI_REENABLE = PCI_MESSAGE_BASE + 0xD,
- PCI_QUERY_STOP_FAILED = PCI_MESSAGE_BASE + 0xE,
- PCI_EJECTION_COMPLETE = PCI_MESSAGE_BASE + 0xF,
- PCI_RESOURCES_ASSIGNED = PCI_MESSAGE_BASE + 0x10,
- PCI_RESOURCES_RELEASED = PCI_MESSAGE_BASE + 0x11,
- PCI_INVALIDATE_BLOCK = PCI_MESSAGE_BASE + 0x12,
- PCI_QUERY_PROTOCOL_VERSION = PCI_MESSAGE_BASE + 0x13,
- PCI_CREATE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x14,
- PCI_DELETE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x15,
+ PCI_QUERY_RESOURCE_RESOURCES = PCI_MESSAGE_BASE + 6,
+ PCI_BUS_D0ENTRY = PCI_MESSAGE_BASE + 7,
+ PCI_BUS_D0EXIT = PCI_MESSAGE_BASE + 8,
+ PCI_READ_BLOCK = PCI_MESSAGE_BASE + 9,
+ PCI_WRITE_BLOCK = PCI_MESSAGE_BASE + 0xA,
+ PCI_EJECT = PCI_MESSAGE_BASE + 0xB,
+ PCI_QUERY_STOP = PCI_MESSAGE_BASE + 0xC,
+ PCI_REENABLE = PCI_MESSAGE_BASE + 0xD,
+ PCI_QUERY_STOP_FAILED = PCI_MESSAGE_BASE + 0xE,
+ PCI_EJECTION_COMPLETE = PCI_MESSAGE_BASE + 0xF,
+ PCI_RESOURCES_ASSIGNED = PCI_MESSAGE_BASE + 0x10,
+ PCI_RESOURCES_RELEASED = PCI_MESSAGE_BASE + 0x11,
+ PCI_INVALIDATE_BLOCK = PCI_MESSAGE_BASE + 0x12,
+ PCI_QUERY_PROTOCOL_VERSION = PCI_MESSAGE_BASE + 0x13,
+ PCI_CREATE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x14,
+ PCI_DELETE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x15,
PCI_RESOURCES_ASSIGNED2 = PCI_MESSAGE_BASE + 0x16,
PCI_CREATE_INTERRUPT_MESSAGE2 = PCI_MESSAGE_BASE + 0x17,
PCI_DELETE_INTERRUPT_MESSAGE2 = PCI_MESSAGE_BASE + 0x18, /* unused */
PCI_BUS_RELATIONS2 = PCI_MESSAGE_BASE + 0x19,
- PCI_RESOURCES_ASSIGNED3 = PCI_MESSAGE_BASE + 0x1A,
- PCI_CREATE_INTERRUPT_MESSAGE3 = PCI_MESSAGE_BASE + 0x1B,
+ PCI_RESOURCES_ASSIGNED3 = PCI_MESSAGE_BASE + 0x1A,
+ PCI_CREATE_INTERRUPT_MESSAGE3 = PCI_MESSAGE_BASE + 0x1B,
PCI_MESSAGE_MAXIMUM
};
@@ -1775,20 +1775,21 @@ static u32 hv_compose_msi_req_v1(
* via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
* not irrelevant because Hyper-V chooses the physical CPU to handle the
* interrupts based on the vCPU specified in message sent to the vPCI VSP in
- * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
- * but assigning too many vPCI device interrupts to the same pCPU can cause a
- * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
- * to spread out the pCPUs that it selects.
+ * hv_vmbus_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the
+ * guest, but assigning too many vPCI device interrupts to the same pCPU can
+ * cause a performance bottleneck. So we spread out the dummy vCPUs to influence
+ * Hyper-V to spread out the pCPUs that it selects.
*
* For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
* to always return the same dummy vCPU, because a second call to
- * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
- * new pCPU for the interrupt. But for the multi-MSI case, the second call to
- * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
- * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
- * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
- * the same pCPU, even though the vCPUs will be spread out by later calls
- * to hv_irq_unmask(), but that is the best we can do now.
+ * hv_vmbus_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to
+ * choose a new pCPU for the interrupt. But for the multi-MSI case, the second
+ * call to hv_vmbus_compose_msi_msg() exits without sending a message to the
+ * vPCI VSP, so the original dummy vCPU is used. This dummy vCPU must be
+ * round-robin'ed so that the pCPUs are spread out. All interrupts for a
+ * multi-MSI device end up using the same pCPU, even though the vCPUs will be
+ * spread out by later calls to hv_irq_unmask(), but that is the best we can do
+ * now.
*
* With Hyper-V in Nov 2022, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
* cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
@@ -1863,7 +1864,7 @@ static u32 hv_compose_msi_req_v3(
}
/**
- * hv_compose_msi_msg() - Supplies a valid MSI address/data
+ * hv_vmbus_compose_msi_msg() - Supplies a valid MSI address/data
* @data: Everything about this MSI
* @msg: Buffer that is filled in by this function
*
@@ -1873,7 +1874,7 @@ static u32 hv_compose_msi_req_v3(
* response supplies a data value and address to which that data
* should be written to trigger that interrupt.
*/
-static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
+static void hv_vmbus_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct hv_pcibus_device *hbus;
struct vmbus_channel *channel;
@@ -1955,7 +1956,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
return;
}
/*
- * The vector we select here is a dummy value. The correct
+ * The vector we select here is a dummy value. The correct
* value gets sent to the hypervisor in unmask(). This needs
* to be aligned with the count, and also not zero. Multi-msi
* is powers of 2 up to 32, so 32 will always work here.
@@ -2047,7 +2048,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
/*
* Make sure that the ring buffer data structure doesn't get
- * freed while we dereference the ring buffer pointer. Test
+ * freed while we dereference the ring buffer pointer. Test
* for the channel's onchannel_callback being NULL within a
* sched_lock critical section. See also the inline comments
* in vmbus_reset_channel_cb().
@@ -2147,7 +2148,7 @@ static const struct msi_parent_ops hv_pcie_msi_parent_ops = {
/* HW Interrupt Chip Descriptor */
static struct irq_chip hv_msi_irq_chip = {
.name = "Hyper-V PCIe MSI",
- .irq_compose_msi_msg = hv_compose_msi_msg,
+ .irq_compose_msi_msg = hv_vmbus_compose_msi_msg,
.irq_set_affinity = irq_chip_set_affinity_parent,
.irq_ack = irq_chip_ack_parent,
.irq_eoi = irq_chip_eoi_parent,
@@ -2159,8 +2160,8 @@ static int hv_pcie_domain_alloc(struct irq_domain *d, unsigned int virq, unsigne
void *arg)
{
/*
- * TODO: Allocating and populating struct tran_int_desc in hv_compose_msi_msg()
- * should be moved here.
+ * TODO: Allocating and populating struct tran_int_desc in
+ * hv_vmbus_compose_msi_msg() should be moved here.
*/
int ret;
@@ -2227,7 +2228,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
/**
* get_bar_size() - Get the address space consumed by a BAR
* @bar_val: Value that a BAR returned after -1 was written
- * to it.
+ * to it.
*
* This function returns the size of the BAR, rounded up to 1
* page. It has to be rounded up because the hypervisor's page
@@ -2573,7 +2574,7 @@ static void q_resource_requirements(void *context, struct pci_response *resp,
* new_pcichild_device() - Create a new child device
* @hbus: The internal struct tracking this root PCI bus.
* @desc: The information supplied so far from the host
- * about the device.
+ * about the device.
*
* This function creates the tracking structure for a new child
* device and kicks off the process of figuring out what it is.
@@ -3100,7 +3101,7 @@ static void hv_pci_onchannelcallback(void *context)
* sure that the packet pointer is still valid during the call:
* here 'valid' means that there's a task still waiting for the
* completion, and that the packet data is still on the waiting
- * task's stack. Cf. hv_compose_msi_msg().
+ * task's stack. Cf. hv_vmbus_compose_msi_msg().
*/
comp_packet->completion_func(comp_packet->compl_ctxt,
response,
@@ -3417,7 +3418,7 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
* vmbus_allocate_mmio() gets used for allocating both device endpoint
* resource claims (those which cannot be overlapped) and the ranges
* which are valid for the children of this bus, which are intended
- * to be overlapped by those children. Set the flag on this claim
+ * to be overlapped by those children. Set the flag on this claim
* meaning that this region can't be overlapped.
*/
@@ -4066,7 +4067,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
irq_data = irq_get_irq_data(entry->irq);
if (WARN_ON_ONCE(!irq_data))
return -EINVAL;
- hv_compose_msi_msg(irq_data, &entry->msg);
+ hv_vmbus_compose_msi_msg(irq_data, &entry->msg);
}
return 0;
}
@@ -4074,7 +4075,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
/*
* Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
* directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
- * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
+ * doesn't trap and emulate the MMIO accesses, here hv_vmbus_compose_msi_msg()
* must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
* Table entries.
*/
--
2.51.2.vfs.0.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH v0 09/15] mshv: Import data structs around device domains and irq remapping
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (7 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-20 22:17 ` Stanislav Kinsburskii
2026-01-20 6:42 ` [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device Mukesh R
` (6 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Import/copy from the Hyper-V public headers the definitions and
declarations related to attaching and detaching device domains,
interrupt remapping, and building device ids for those purposes.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
include/hyperv/hvgdk_mini.h | 11 ++++
include/hyperv/hvhdk_mini.h | 112 ++++++++++++++++++++++++++++++++++++
2 files changed, 123 insertions(+)
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 04b18d0e37af..bda9fae5b1ef 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -323,6 +323,9 @@ union hv_hypervisor_version_info {
/* stimer Direct Mode is available */
#define HV_STIMER_DIRECT_MODE_AVAILABLE BIT(19)
+#define HV_DEVICE_DOMAIN_AVAILABLE BIT(24)
+#define HV_S1_DEVICE_DOMAIN_AVAILABLE BIT(25)
+
/*
* Implementation recommendations. Indicates which behaviors the hypervisor
* recommends the OS implement for optimal performance.
@@ -471,6 +474,8 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
#define HVCALL_MAP_DEVICE_INTERRUPT 0x007c
#define HVCALL_UNMAP_DEVICE_INTERRUPT 0x007d
#define HVCALL_RETARGET_INTERRUPT 0x007e
+#define HVCALL_ATTACH_DEVICE 0x0082
+#define HVCALL_DETACH_DEVICE 0x0083
#define HVCALL_NOTIFY_PARTITION_EVENT 0x0087
#define HVCALL_ENTER_SLEEP_STATE 0x0084
#define HVCALL_NOTIFY_PORT_RING_EMPTY 0x008b
@@ -482,9 +487,15 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
#define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
+#define HVCALL_CREATE_DEVICE_DOMAIN 0x00b1
+#define HVCALL_ATTACH_DEVICE_DOMAIN 0x00b2
+#define HVCALL_MAP_DEVICE_GPA_PAGES 0x00b3
+#define HVCALL_UNMAP_DEVICE_GPA_PAGES 0x00b4
#define HVCALL_SIGNAL_EVENT_DIRECT 0x00c0
#define HVCALL_POST_MESSAGE_DIRECT 0x00c1
#define HVCALL_DISPATCH_VP 0x00c2
+#define HVCALL_DETACH_DEVICE_DOMAIN 0x00c4
+#define HVCALL_DELETE_DEVICE_DOMAIN 0x00c5
#define HVCALL_GET_GPA_PAGES_ACCESS_STATES 0x00c9
#define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d7
#define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d8
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 41a29bf8ec14..57821d6ddb61 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -449,6 +449,32 @@ struct hv_send_ipi_ex { /* HV_INPUT_SEND_SYNTHETIC_CLUSTER_IPI_EX */
struct hv_vpset vp_set;
} __packed;
+union hv_attdev_flags { /* HV_ATTACH_DEVICE_FLAGS */
+ struct {
+ u32 logical_id : 1;
+ u32 resvd0 : 1;
+ u32 ats_enabled : 1;
+ u32 virt_func : 1;
+ u32 shared_irq_child : 1;
+ u32 virt_dev : 1;
+ u32 ats_supported : 1;
+ u32 small_irt : 1;
+ u32 resvd : 24;
+ } __packed;
+ u32 as_uint32;
+};
+
+union hv_dev_pci_caps { /* HV_DEVICE_PCI_CAPABILITIES */
+ struct {
+ u32 max_pasid_width : 5;
+ u32 invalidate_qdepth : 5;
+ u32 global_inval : 1;
+ u32 prg_response_req : 1;
+ u32 resvd : 20;
+ } __packed;
+ u32 as_uint32;
+};
+
typedef u16 hv_pci_rid; /* HV_PCI_RID */
typedef u16 hv_pci_segment; /* HV_PCI_SEGMENT */
typedef u64 hv_logical_device_id;
@@ -528,4 +554,90 @@ union hv_device_id { /* HV_DEVICE_ID */
} acpi;
} __packed;
+struct hv_input_attach_device { /* HV_INPUT_ATTACH_DEVICE */
+ u64 partition_id;
+ union hv_device_id device_id;
+ union hv_attdev_flags attdev_flags;
+ u8 attdev_vtl;
+ u8 rsvd0;
+ u16 rsvd1;
+ u64 logical_devid;
+ union hv_dev_pci_caps dev_pcicaps;
+ u16 pf_pci_rid;
+ u16 resvd2;
+} __packed;
+
+struct hv_input_detach_device { /* HV_INPUT_DETACH_DEVICE */
+ u64 partition_id;
+ u64 logical_devid;
+} __packed;
+
+
+/* 3 domain types: stage 1, stage 2, and SOC */
+#define HV_DEVICE_DOMAIN_TYPE_S2 0 /* HV_DEVICE_DOMAIN_ID_TYPE_S2 */
+#define HV_DEVICE_DOMAIN_TYPE_S1 1 /* HV_DEVICE_DOMAIN_ID_TYPE_S1 */
+#define HV_DEVICE_DOMAIN_TYPE_SOC 2 /* HV_DEVICE_DOMAIN_ID_TYPE_SOC */
+
+/* ID for stage 2 default domain and NULL domain */
+#define HV_DEVICE_DOMAIN_ID_S2_DEFAULT 0
+#define HV_DEVICE_DOMAIN_ID_S2_NULL 0xFFFFFFFFULL
+
+union hv_device_domain_id {
+ u64 as_uint64;
+ struct {
+ u32 type : 4;
+ u32 reserved : 28;
+ u32 id;
+ };
+} __packed;
+
+struct hv_input_device_domain { /* HV_INPUT_DEVICE_DOMAIN */
+ u64 partition_id;
+ union hv_input_vtl owner_vtl;
+ u8 padding[7];
+ union hv_device_domain_id domain_id;
+} __packed;
+
+union hv_create_device_domain_flags { /* HV_CREATE_DEVICE_DOMAIN_FLAGS */
+ u32 as_uint32;
+ struct {
+ u32 forward_progress_required : 1;
+ u32 inherit_owning_vtl : 1;
+ u32 reserved : 30;
+ } __packed;
+} __packed;
+
+struct hv_input_create_device_domain { /* HV_INPUT_CREATE_DEVICE_DOMAIN */
+ struct hv_input_device_domain device_domain;
+ union hv_create_device_domain_flags create_device_domain_flags;
+} __packed;
+
+struct hv_input_delete_device_domain { /* HV_INPUT_DELETE_DEVICE_DOMAIN */
+ struct hv_input_device_domain device_domain;
+} __packed;
+
+struct hv_input_attach_device_domain { /* HV_INPUT_ATTACH_DEVICE_DOMAIN */
+ struct hv_input_device_domain device_domain;
+ union hv_device_id device_id;
+} __packed;
+
+struct hv_input_detach_device_domain { /* HV_INPUT_DETACH_DEVICE_DOMAIN */
+ u64 partition_id;
+ union hv_device_id device_id;
+} __packed;
+
+struct hv_input_map_device_gpa_pages { /* HV_INPUT_MAP_DEVICE_GPA_PAGES */
+ struct hv_input_device_domain device_domain;
+ union hv_input_vtl target_vtl;
+ u8 padding[3];
+ u32 map_flags;
+ u64 target_device_va_base;
+ u64 gpa_page_list[];
+} __packed;
+
+struct hv_input_unmap_device_gpa_pages { /* HV_INPUT_UNMAP_DEVICE_GPA_PAGES */
+ struct hv_input_device_domain device_domain;
+ u64 target_device_va_base;
+} __packed;
+
#endif /* _HV_HVHDK_MINI_H */
--
2.51.2.vfs.0.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
On Hyper-V, most hypercalls related to PCI passthru, such as those that
map/unmap regions or interrupts, need a device id as a parameter. This
device id refers to that specific device for the lifetime of the passthru.
An L1VH VM only contains VMBus based devices. A device id for a VMBus
device is slightly different in that it is built from the hv_pcibus_device
info to make sure it matches exactly what the hypervisor expects. This
VMBus based device id is needed when attaching devices in an L1VH based
guest VM. Before building it, a check is done to make sure the device is
a valid VMBus device.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
arch/x86/include/asm/mshyperv.h | 2 ++
drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index eef4c3a5ba28..0d7fdfb25e76 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
static inline void hv_apic_init(void) {}
#endif
+u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
+
struct irq_domain *hv_create_pci_msi_domain(void);
int hv_map_msi_interrupt(struct irq_data *data,
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 8bc6a38c9b5a..40f0b06bb966 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
#define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
#define HV_MSI_CHIP_FLAGS MSI_CHIP_FLAG_SET_ACK
+static bool hv_vmbus_pci_device(struct pci_bus *pbus);
+
static int hv_pci_irqchip_init(void)
{
return 0;
@@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
#define hv_msi_prepare pci_msi_prepare
+u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
+{
+ u64 u64val;
+ struct hv_pcibus_device *hbus;
+ struct pci_bus *pbus = pdev->bus;
+
+ if (!hv_vmbus_pci_device(pbus))
+ return 0;
+
+ hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
+ u64val = (hbus->hdev->dev_instance.b[5] << 24) |
+ (hbus->hdev->dev_instance.b[4] << 16) |
+ (hbus->hdev->dev_instance.b[7] << 8) |
+ (hbus->hdev->dev_instance.b[6] & 0xf8) |
+ PCI_FUNC(pdev->devfn);
+
+ return u64val;
+}
+EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
+
/**
* hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
* affinity.
@@ -1404,6 +1426,13 @@ static struct pci_ops hv_pcifront_ops = {
.write = hv_pcifront_write_config,
};
+#ifdef CONFIG_X86
+static bool hv_vmbus_pci_device(struct pci_bus *pbus)
+{
+ return pbus->ops == &hv_pcifront_ops;
+}
+#endif /* CONFIG_X86 */
+
/*
* Paravirtual backchannel
*
--
2.51.2.vfs.0.1
* [PATCH v0 11/15] x86/hyperv: Build logical device ids for PCI passthru hcalls
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
On Hyper-V, most hypercalls related to PCI passthru, such as those that
map/unmap regions or interrupts, need a device id as a parameter. A device
id refers to a specific device and is of two types:
o Logical: used for direct attach (see below) hypercalls. A logical
device id is a unique 62bit value that is created and
sent during the initial device attach. Then all further
communications (for interrupt remaps etc) must use this
logical id.
o PCI: used for device domain hypercalls such as map, unmap, etc.
This is built using actual device BDF info.
PS: Since an L1VH only supports direct attaches, a logical device id
on an L1VH VM is always a VMBus device id. For non-L1VH cases,
we just use the PCI BDF info, although not strictly needed, to build
the logical device id.
At a high level, Hyper-V supports two ways to do PCI passthru:
1. Device Domain: root must create a device domain in the hypervisor,
and do map/unmap hypercalls for mapping and unmapping guest RAM.
All hypervisor communications use device id of type PCI for
identifying and referencing the device.
2. Direct Attach: the hypervisor will simply use the guest's HW
page table for mappings, thus the host need not do map/unmap
hypercalls. A direct attached device must be referenced
via logical device id and never via the PCI device id. For an
L1VH root/parent, Hyper-V only supports direct attaches.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
arch/x86/hyperv/irqdomain.c | 60 ++++++++++++++++++++++++++++++---
arch/x86/include/asm/mshyperv.h | 14 ++++++++
2 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index ccbe5848a28f..33017aa0caa4 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -137,7 +137,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
return 0;
}
-static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
+static u64 hv_build_devid_type_pci(struct pci_dev *pdev)
{
int pos;
union hv_device_id hv_devid;
@@ -197,7 +197,58 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
}
out:
- return hv_devid;
+ return hv_devid.as_uint64;
+}
+
+/* Build device id for direct attached devices */
+static u64 hv_build_devid_type_logical(struct pci_dev *pdev)
+{
+ hv_pci_segment segment;
+ union hv_device_id hv_devid;
+ union hv_pci_bdf bdf = {.as_uint16 = 0};
+ struct rid_data data = {
+ .bridge = NULL,
+ .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
+ };
+
+ segment = pci_domain_nr(pdev->bus);
+ bdf.bus = PCI_BUS_NUM(data.rid);
+ bdf.device = PCI_SLOT(data.rid);
+ bdf.function = PCI_FUNC(data.rid);
+
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_LOGICAL;
+ hv_devid.logical.id = (u64)segment << 16 | bdf.as_uint16;
+
+ return hv_devid.as_uint64;
+}
+
+/* Build device id after the device has been attached */
+u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type)
+{
+ if (type == HV_DEVICE_TYPE_LOGICAL) {
+ if (hv_l1vh_partition())
+ return hv_pci_vmbus_device_id(pdev);
+ else
+ return hv_build_devid_type_logical(pdev);
+ } else if (type == HV_DEVICE_TYPE_PCI)
+ return hv_build_devid_type_pci(pdev);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(hv_build_devid_oftype);
+
+/* Build device id for the interrupt path */
+static u64 hv_build_irq_devid(struct pci_dev *pdev)
+{
+ enum hv_device_type dev_type;
+
+ if (hv_pcidev_is_attached_dev(pdev) || hv_l1vh_partition())
+ dev_type = HV_DEVICE_TYPE_LOGICAL;
+ else
+ dev_type = HV_DEVICE_TYPE_PCI;
+
+ return hv_build_devid_oftype(pdev, dev_type);
}
/*
@@ -221,7 +272,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
msidesc = irq_data_get_msi_desc(data);
pdev = msi_desc_to_pci_dev(msidesc);
- hv_devid = hv_build_devid_type_pci(pdev);
+ hv_devid.as_uint64 = hv_build_irq_devid(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
@@ -296,7 +347,8 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
{
union hv_device_id hv_devid;
- hv_devid = hv_build_devid_type_pci(pdev);
+ hv_devid.as_uint64 = hv_build_irq_devid(pdev);
+
return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 0d7fdfb25e76..97477c5a8487 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -188,6 +188,20 @@ bool hv_vcpu_is_preempted(int vcpu);
static inline void hv_apic_init(void) {}
#endif
+#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
+static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
+{ return false; } /* temporary */
+u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
+#else /* CONFIG_HYPERV_IOMMU */
+static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
+{ return false; }
+
+static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
+ enum hv_device_type type)
+{ return 0; }
+
+#endif /* CONFIG_HYPERV_IOMMU */
+
u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
struct irq_domain *hv_create_pci_msi_domain(void);
--
2.51.2.vfs.0.1
* [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
Add a new file implementing management of device domains, mapping and
unmapping of iommu memory, and other iommu_ops, to fit within the VFIO
framework for PCI passthru on Hyper-V when running Linux as root or as an
L1VH parent. It also implements the direct attach mechanism for PCI
passthru, which likewise works within the VFIO framework.
At a high level, during boot the hypervisor creates a default identity
domain and attaches all devices to it. This maps nicely to the Linux iommu
subsystem's IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
need to explicitly ask Hyper-V to attach devices and do maps/unmaps
during boot. As mentioned previously, Hyper-V supports two ways to do
PCI passthru:
1. Device Domain: root must create a device domain in the hypervisor,
and do map/unmap hypercalls for mapping and unmapping guest RAM.
All hypervisor communications use device id of type PCI for
identifying and referencing the device.
2. Direct Attach: the hypervisor will simply use the guest's HW
page table for mappings, thus the host need not do map/unmap
device memory hypercalls. As such, direct attach passthru setup
during guest boot is extremely fast. A direct attached device
must be referenced via logical device id and not via the PCI
device id.
At present, an L1VH root/parent only supports direct attaches. Direct
attach is also the default in non-L1VH cases because the current device
domain implementation has significant performance issues for guests with
larger RAM (say, more than 8GB), and that unfortunately cannot be
addressed in the short term.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
MAINTAINERS | 1 +
arch/x86/include/asm/mshyperv.h | 7 +-
arch/x86/kernel/pci-dma.c | 2 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/hyperv-iommu.c | 876 ++++++++++++++++++++++++++++++++
include/linux/hyperv.h | 6 +
6 files changed, 890 insertions(+), 4 deletions(-)
create mode 100644 drivers/iommu/hyperv-iommu.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 381a0e086382..63160cee942c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11741,6 +11741,7 @@ F: drivers/hid/hid-hyperv.c
F: drivers/hv/
F: drivers/infiniband/hw/mana/
F: drivers/input/serio/hyperv-keyboard.c
+F: drivers/iommu/hyperv-iommu.c
F: drivers/iommu/hyperv-irq.c
F: drivers/net/ethernet/microsoft/
F: drivers/net/hyperv/
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 97477c5a8487..e4ccdbbf1d12 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
#endif
#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
-static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
-{ return false; } /* temporary */
+bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
+u64 hv_iommu_get_curr_partid(void);
#else /* CONFIG_HYPERV_IOMMU */
static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
{ return false; }
-
static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
enum hv_device_type type)
{ return 0; }
+static inline u64 hv_iommu_get_curr_partid(void)
+{ return HV_PARTITION_ID_INVALID; }
#endif /* CONFIG_HYPERV_IOMMU */
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 6267363e0189..cfeee6505e17 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -8,6 +8,7 @@
#include <linux/gfp.h>
#include <linux/pci.h>
#include <linux/amd-iommu.h>
+#include <linux/hyperv.h>
#include <asm/proto.h>
#include <asm/dma.h>
@@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
gart_iommu_hole_init();
amd_iommu_detect();
detect_intel_iommu();
+ hv_iommu_detect();
swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
}
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 598c39558e7d..cc9774864b00 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
-obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
+obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
new file mode 100644
index 000000000000..548483fec6b1
--- /dev/null
+++ b/drivers/iommu/hyperv-iommu.c
@@ -0,0 +1,876 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hyper-V root vIOMMU driver.
+ * Copyright (C) 2026, Microsoft, Inc.
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/dma-map-ops.h>
+#include <linux/interval_tree.h>
+#include <linux/hyperv.h>
+#include "dma-iommu.h"
+#include <asm/iommu.h>
+#include <asm/mshyperv.h>
+
+/* We will not claim these PCI devices, e.g. the hypervisor needs one for a debugger */
+static char *pci_devs_to_skip;
+static int __init hv_iommu_setup_skip(char *str)
+{
+ pci_devs_to_skip = str;
+
+ return 0;
+}
+/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
+__setup("hv_iommu_skip=", hv_iommu_setup_skip);
+
+bool hv_no_attdev; /* disable direct device attach for passthru */
+EXPORT_SYMBOL_GPL(hv_no_attdev);
+static int __init setup_hv_no_attdev(char *str)
+{
+ hv_no_attdev = true;
+ return 0;
+}
+__setup("hv_no_attdev", setup_hv_no_attdev);
+
+/* Iommu device that we export to the world. Hyper-V supports a max of one */
+static struct iommu_device hv_virt_iommu;
+
+struct hv_domain {
+ struct iommu_domain iommu_dom;
+ u32 domid_num; /* as opposed to domain_id.type */
+ u32 num_attchd; /* number of currently attached devices */
+ bool attached_dom; /* is this direct attached dom? */
+ spinlock_t mappings_lock; /* protects mappings_tree */
+ struct rb_root_cached mappings_tree; /* iova to pa lookup tree */
+};
+
+#define to_hv_domain(d) container_of(d, struct hv_domain, iommu_dom)
+
+struct hv_iommu_mapping {
+ phys_addr_t paddr;
+ struct interval_tree_node iova;
+ u32 flags;
+};
+
+/*
+ * By default, during boot the hypervisor creates one Stage 2 (S2) default
+ * domain. Stage 2 means that the page table is controlled by the hypervisor.
+ * S2 default: access to entire root partition memory. This maps naturally
+ * to IOMMU_DOMAIN_IDENTITY in the iommu subsystem, and
+ * is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the hypervisor.
+ *
+ * Device Management:
+ * There are two ways to manage device attaches to domains:
+ * 1. Domain Attach: A device domain is created in the hypervisor, the
+ * device is attached to this domain, and then memory
+ * ranges are mapped in the map callbacks.
+ * 2. Direct Attach: No need to create a domain in the hypervisor for direct
+ * attached devices. A hypercall is made to tell the
+ * hypervisor to attach the device to a guest. There is
+ * no need for explicit memory mappings because the
+ * hypervisor will just use the guest HW page table.
+ *
+ * Since a direct attach is much faster, it is the default. This can be
+ * changed via hv_no_attdev.
+ *
+ * L1VH: hypervisor only supports direct attach.
+ */
+
+/*
+ * Create a dummy domain corresponding to the hypervisor's prebuilt default
+ * identity domain (dummy because we do not make a hypercall to create it).
+ */
+static struct hv_domain hv_def_identity_dom;
+
+static bool hv_special_domain(struct hv_domain *hvdom)
+{
+ return hvdom == &hv_def_identity_dom;
+}
+
+struct iommu_domain_geometry default_geometry = (struct iommu_domain_geometry) {
+ .aperture_start = 0,
+ .aperture_end = -1UL,
+ .force_aperture = true,
+};
+
+/*
+ * Since the relevant hypercalls can only fit less than 512 PFNs in the pfn
+ * array, report 1M max.
+ */
+#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
+
+static u32 unique_id; /* unique numeric id of a new domain */
+
+static void hv_iommu_detach_dev(struct iommu_domain *immdom,
+ struct device *dev);
+static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather);
+
+/*
+ * If the current thread is a VMM thread, return the partition id of the VM it
+ * is managing, else return HV_PARTITION_ID_INVALID.
+ */
+u64 hv_iommu_get_curr_partid(void)
+{
+ u64 (*fn)(pid_t pid);
+ u64 partid;
+
+ fn = symbol_get(mshv_pid_to_partid);
+ if (!fn)
+ return HV_PARTITION_ID_INVALID;
+
+ partid = fn(current->tgid);
+ symbol_put(mshv_pid_to_partid);
+
+ return partid;
+}
+
+/* If this is a VMM thread, then this domain is for a guest VM */
+static bool hv_curr_thread_is_vmm(void)
+{
+ return hv_iommu_get_curr_partid() != HV_PARTITION_ID_INVALID;
+}
+
+static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+ switch (cap) {
+ case IOMMU_CAP_CACHE_COHERENCY:
+ return true;
+ default:
+ return false;
+ }
+ return false;
+}
+
+/*
+ * Check if given pci device is a direct attached device. Caller must have
+ * verified pdev is a valid pci device.
+ */
+bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
+{
+ struct iommu_domain *iommu_domain;
+ struct hv_domain *hvdom;
+ struct device *dev = &pdev->dev;
+
+ iommu_domain = iommu_get_domain_for_dev(dev);
+ if (iommu_domain) {
+ hvdom = to_hv_domain(iommu_domain);
+ return hvdom->attached_dom;
+ }
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
+
+/* Create a new device domain in the hypervisor */
+static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
+{
+ u64 status;
+ unsigned long flags;
+ struct hv_input_device_domain *ddp;
+ struct hv_input_create_device_domain *input;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ ddp = &input->device_domain;
+ ddp->partition_id = HV_PARTITION_ID_SELF;
+ ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ ddp->domain_id.id = hvdom->domid_num;
+
+ input->create_device_domain_flags.forward_progress_required = 1;
+ input->create_device_domain_flags.inherit_owning_vtl = 0;
+
+ status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
+
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return hv_result_to_errno(status);
+}
+
+/* During boot, all devices are attached to this */
+static struct iommu_domain *hv_iommu_domain_alloc_identity(struct device *dev)
+{
+ return &hv_def_identity_dom.iommu_dom;
+}
+
+static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
+{
+ struct hv_domain *hvdom;
+ int rc;
+
+ if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() && !hv_no_attdev) {
+ pr_err("Hyper-V: l1vh iommu does not support host devices\n");
+ return NULL;
+ }
+
+ hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
+ if (hvdom == NULL)
+ goto out;
+
+ spin_lock_init(&hvdom->mappings_lock);
+ hvdom->mappings_tree = RB_ROOT_CACHED;
+
+ if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /* ie, 0 */
+ goto out_free;
+
+ hvdom->domid_num = unique_id;
+ hvdom->iommu_dom.geometry = default_geometry;
+ hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
+
+ /* For guests, by default we do direct attaches, so no domain in hyp */
+ if (hv_curr_thread_is_vmm() && !hv_no_attdev)
+ hvdom->attached_dom = true;
+ else {
+ rc = hv_iommu_create_hyp_devdom(hvdom);
+ if (rc)
+ goto out_free_id;
+ }
+
+ return &hvdom->iommu_dom;
+
+out_free_id:
+ unique_id--;
+out_free:
+ kfree(hvdom);
+out:
+ return NULL;
+}
+
+static void hv_iommu_domain_free(struct iommu_domain *immdom)
+{
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+ unsigned long flags;
+ u64 status;
+ struct hv_input_delete_device_domain *input;
+
+ if (hv_special_domain(hvdom))
+ return;
+
+ if (hvdom->num_attchd) {
+ pr_err("Hyper-V: can't free busy iommu domain (%p)\n", immdom);
+ return;
+ }
+
+ if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
+ struct hv_input_device_domain *ddp;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ ddp = &input->device_domain;
+ memset(input, 0, sizeof(*input));
+
+ ddp->partition_id = HV_PARTITION_ID_SELF;
+ ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ ddp->domain_id.id = hvdom->domid_num;
+
+ status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
+ NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+ }
+
+ kfree(hvdom);
+}
+
+/* Attach a device to a domain previously created in the hypervisor */
+static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
+{
+ unsigned long flags;
+ u64 status;
+ enum hv_device_type dev_type;
+ struct hv_input_attach_device_domain *input;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->device_domain.partition_id = HV_PARTITION_ID_SELF;
+ input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ input->device_domain.domain_id.id = hvdom->domid_num;
+
+ /* NB: Upon guest shutdown, device is re-attached to the default domain
+ * without explicit detach.
+ */
+ if (hv_l1vh_partition())
+ dev_type = HV_DEVICE_TYPE_LOGICAL;
+ else
+ dev_type = HV_DEVICE_TYPE_PCI;
+
+ input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
+
+ status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return hv_result_to_errno(status);
+}
+
+/* Caller must have validated that dev is a valid pci dev */
+static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
+{
+ struct hv_input_attach_device *input;
+ u64 status;
+ int rc;
+ unsigned long flags;
+ union hv_device_id host_devid;
+ enum hv_device_type dev_type;
+ u64 ptid = hv_iommu_get_curr_partid();
+
+ if (ptid == HV_PARTITION_ID_INVALID) {
+ pr_err("Hyper-V: Invalid partition id in direct attach\n");
+ return -EINVAL;
+ }
+
+ if (hv_l1vh_partition())
+ dev_type = HV_DEVICE_TYPE_LOGICAL;
+ else
+ dev_type = HV_DEVICE_TYPE_PCI;
+
+ host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
+
+ do {
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+ input->partition_id = ptid;
+ input->device_id = host_devid;
+
+ /* Hypervisor associates logical_id with this device, and in
+ * some hypercalls like retarget interrupts, logical_id must be
+ * used instead of the BDF. It is a required parameter.
+ */
+ input->attdev_flags.logical_id = 1;
+ input->logical_devid =
+ hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
+
+ status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
+ local_irq_restore(flags);
+
+ if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+ rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
+ if (rc)
+ break;
+ }
+ } while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return hv_result_to_errno(status);
+}
+
+/* This attaches a device either to a host app (like DPDK) or to a guest VM */
+static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
+ struct iommu_domain *old)
+{
+ struct pci_dev *pdev;
+ int rc;
+ struct hv_domain *hvdom_new = to_hv_domain(immdom);
+ struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
+
+ /* Only allow PCI devices for now */
+ if (!dev_is_pci(dev))
+ return -EINVAL;
+
+ pdev = to_pci_dev(dev);
+
+ /* l1vh does not support host device (eg DPDK) passthru */
+ if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
+ !hvdom_new->attached_dom)
+ return -EINVAL;
+
+ /*
+ * VFIO does not do explicit detach calls, hence check whether we need
+ * to detach first. Also, in case of guest shutdown, it's the VMM
+ * thread that attaches it back to the hv_def_identity_dom, and
+ * hvdom_prev will not be null then. It is null during boot.
+ */
+ if (hvdom_prev)
+ if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
+ hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
+
+ if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
+ dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
+ return 0;
+ }
+
+ if (hvdom_new->attached_dom)
+ rc = hv_iommu_direct_attach_device(pdev);
+ else
+ rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
+
+ if (rc && hvdom_prev) {
+ int rc1;
+
+ if (hvdom_prev->attached_dom)
+ rc1 = hv_iommu_direct_attach_device(pdev);
+ else
+ rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
+
+ if (rc1)
+ pr_err("Hyper-V: iommu could not restore original device state, dev:%s\n",
+ dev_name(dev));
+ }
+
+ if (rc == 0) {
+ dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
+ hvdom_new->num_attchd++;
+ }
+
+ return rc;
+}
+
+static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
+ struct pci_dev *pdev)
+{
+ struct hv_input_detach_device *input;
+ u64 status, log_devid;
+ unsigned long flags;
+
+ log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->partition_id = hv_iommu_get_curr_partid();
+ input->logical_devid = log_devid;
+ status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+}
+
+static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
+ struct pci_dev *pdev)
+{
+ u64 status, devid;
+ unsigned long flags;
+ struct hv_input_detach_device_domain *input;
+
+ devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->partition_id = HV_PARTITION_ID_SELF;
+ input->device_id.as_uint64 = devid;
+ status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+}
+
+static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
+{
+ struct pci_dev *pdev;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+
+ /* See the attach function, only PCI devices for now */
+ if (!dev_is_pci(dev))
+ return;
+
+ if (hvdom->num_attchd == 0)
+ pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
+
+ pdev = to_pci_dev(dev);
+
+ if (hvdom->attached_dom) {
+ hv_iommu_det_dev_from_guest(hvdom, pdev);
+
+ /* Do not reset attached_dom, hv_iommu_unmap_pages happens
+ * next.
+ */
+ } else {
+ hv_iommu_det_dev_from_dom(hvdom, pdev);
+ }
+
+ hvdom->num_attchd--;
+}
+
+static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
+ unsigned long iova, phys_addr_t paddr,
+ size_t size, u32 flags)
+{
+ unsigned long irqflags;
+ struct hv_iommu_mapping *mapping;
+
+ mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+ if (!mapping)
+ return -ENOMEM;
+
+ mapping->paddr = paddr;
+ mapping->iova.start = iova;
+ mapping->iova.last = iova + size - 1;
+ mapping->flags = flags;
+
+ spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
+ interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
+ spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
+
+ return 0;
+}
+
+static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
+ unsigned long iova, size_t size)
+{
+ unsigned long flags;
+ size_t unmapped = 0;
+ unsigned long last = iova + size - 1;
+ struct hv_iommu_mapping *mapping = NULL;
+ struct interval_tree_node *node, *next;
+
+ spin_lock_irqsave(&hvdom->mappings_lock, flags);
+ next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
+ while (next) {
+ node = next;
+ mapping = container_of(node, struct hv_iommu_mapping, iova);
+ next = interval_tree_iter_next(node, iova, last);
+
+ /* Trying to split a mapping? Not supported for now. */
+ if (mapping->iova.start < iova)
+ break;
+
+ unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+ interval_tree_remove(node, &hvdom->mappings_tree);
+ kfree(mapping);
+ }
+ spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
+
+ return unmapped;
+}
+
+/* Return: must return exact status from the hypercall without changes */
+static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
+ unsigned long iova, phys_addr_t paddr,
+ unsigned long npages, u32 map_flags)
+{
+ u64 status;
+ int i;
+ struct hv_input_map_device_gpa_pages *input;
+ unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->device_domain.partition_id = HV_PARTITION_ID_SELF;
+ input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ input->device_domain.domain_id.id = hvdom->domid_num;
+ input->map_flags = map_flags;
+ input->target_device_va_base = iova;
+
+ pfn = paddr >> HV_HYP_PAGE_SHIFT;
+ for (i = 0; i < npages; i++, pfn++)
+ input->gpa_page_list[i] = pfn;
+
+ status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
+ input, NULL);
+
+ local_irq_restore(flags);
+ return status;
+}
+
+/*
+ * The core VFIO code loops over memory ranges calling this function with
+ * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
+ */
+static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount,
+ int prot, gfp_t gfp, size_t *mapped)
+{
+ u32 map_flags;
+ int ret;
+ u64 status;
+ unsigned long npages, done = 0;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+ size_t size = pgsize * pgcount;
+
+ map_flags = HV_MAP_GPA_READABLE; /* required */
+ map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
+
+ ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
+ if (ret)
+ return ret;
+
+ if (hvdom->attached_dom) {
+ *mapped = size;
+ return 0;
+ }
+
+ npages = size >> HV_HYP_PAGE_SHIFT;
+ while (done < npages) {
+ ulong completed, remain = npages - done;
+
+ status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
+ map_flags);
+
+ completed = hv_repcomp(status);
+ done = done + completed;
+ iova = iova + (completed << HV_HYP_PAGE_SHIFT);
+ paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
+
+ if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+ ret = hv_call_deposit_pages(NUMA_NO_NODE,
+ hv_current_partition_id,
+ 256);
+ if (ret)
+ break;
+ continue; /* retry the remaining pages */
+ }
+ if (!hv_result_success(status))
+ break;
+ }
+
+ if (!hv_result_success(status)) {
+ size_t done_size = done << HV_HYP_PAGE_SHIFT;
+
+ hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
+ done, npages, iova);
+ /*
+ * The lookup tree holds the whole range [0, size). The unmap
+ * below only removes the first 'done' pages from it, so drop
+ * the remaining [done, size) tail from the tree here.
+ */
+ hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
+ hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
+ done, NULL);
+ if (mapped)
+ *mapped = 0;
+ } else if (mapped) {
+ *mapped = size;
+ }
+
+ return hv_result_to_errno(status);
+}
+
+static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather)
+{
+ unsigned long flags, npages;
+ struct hv_input_unmap_device_gpa_pages *input;
+ u64 status;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+ size_t unmapped, size = pgsize * pgcount;
+
+ unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
+ if (unmapped < size)
+ pr_err("%s: could not delete all mappings (%lx:%zx/%zx)\n",
+ __func__, iova, unmapped, size);
+
+ if (hvdom->attached_dom)
+ return size;
+
+ npages = size >> HV_HYP_PAGE_SHIFT;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->device_domain.partition_id = HV_PARTITION_ID_SELF;
+ input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ input->device_domain.domain_id.id = hvdom->domid_num;
+ input->target_device_va_base = iova;
+
+ status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
+ 0, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return unmapped;
+}
+
+static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
+ dma_addr_t iova)
+{
+ u64 paddr = 0;
+ unsigned long flags;
+ struct hv_iommu_mapping *mapping;
+ struct interval_tree_node *node;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+
+ spin_lock_irqsave(&hvdom->mappings_lock, flags);
+ node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
+ if (node) {
+ mapping = container_of(node, struct hv_iommu_mapping, iova);
+ paddr = mapping->paddr + (iova - mapping->iova.start);
+ }
+ spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
+
+ return paddr;
+}
+
+/*
+ * The hypervisor does not currently report which devices it is using
+ * itself, so allow users to manually specify devices that should be
+ * skipped (e.g. a network device claimed by the hypervisor debugger).
+ */
+static struct iommu_device *hv_iommu_probe_device(struct device *dev)
+{
+ if (!dev_is_pci(dev))
+ return ERR_PTR(-ENODEV);
+
+ if (pci_devs_to_skip && *pci_devs_to_skip) {
+ int rc, pos = 0;
+ int parsed;
+ int segment, bus, slot, func;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ do {
+ parsed = 0;
+
+ rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
+ &segment, &bus, &slot, &func, &parsed);
+ if (rc != 4 || parsed <= 0)
+ break;
+
+ if (pci_domain_nr(pdev->bus) == segment &&
+ pdev->bus->number == bus &&
+ PCI_SLOT(pdev->devfn) == slot &&
+ PCI_FUNC(pdev->devfn) == func) {
+
+ dev_info(dev, "skipped by Hyper-V IOMMU\n");
+ return ERR_PTR(-ENODEV);
+ }
+ pos += parsed;
+
+ } while (pci_devs_to_skip[pos]);
+ }
+
+ /*
+ * The device will be explicitly attached to the default domain, so
+ * there is no need to do dev_iommu_priv_set() here.
+ */
+
+ return &hv_virt_iommu;
+}
+
+static void hv_iommu_probe_finalize(struct device *dev)
+{
+ struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
+
+ if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
+ iommu_setup_dma_ops(dev);
+ else
+ set_dma_ops(dev, NULL);
+}
+
+static void hv_iommu_release_device(struct device *dev)
+{
+ struct hv_domain *hvdom = dev_iommu_priv_get(dev);
+
+ /* Need to detach device from device domain if necessary. */
+ if (hvdom)
+ hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
+
+ dev_iommu_priv_set(dev, NULL);
+ set_dma_ops(dev, NULL);
+}
+
+static struct iommu_group *hv_iommu_device_group(struct device *dev)
+{
+ if (dev_is_pci(dev))
+ return pci_device_group(dev);
+ else
+ return generic_device_group(dev);
+}
+
+static int hv_iommu_def_domain_type(struct device *dev)
+{
+ /* The hypervisor always creates this by default during boot */
+ return IOMMU_DOMAIN_IDENTITY;
+}
+
+static struct iommu_ops hv_iommu_ops = {
+ .capable = hv_iommu_capable,
+ .domain_alloc_identity = hv_iommu_domain_alloc_identity,
+ .domain_alloc_paging = hv_iommu_domain_alloc_paging,
+ .probe_device = hv_iommu_probe_device,
+ .probe_finalize = hv_iommu_probe_finalize,
+ .release_device = hv_iommu_release_device,
+ .def_domain_type = hv_iommu_def_domain_type,
+ .device_group = hv_iommu_device_group,
+ .default_domain_ops = &(const struct iommu_domain_ops) {
+ .attach_dev = hv_iommu_attach_dev,
+ .map_pages = hv_iommu_map_pages,
+ .unmap_pages = hv_iommu_unmap_pages,
+ .iova_to_phys = hv_iommu_iova_to_phys,
+ .free = hv_iommu_domain_free,
+ },
+ .owner = THIS_MODULE,
+};
+
+static void __init hv_initialize_special_domains(void)
+{
+ hv_def_identity_dom.iommu_dom.geometry = default_geometry;
+ hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
+}
+
+static int __init hv_iommu_init(void)
+{
+ int ret;
+ struct iommu_device *iommup = &hv_virt_iommu;
+
+ if (!hv_is_hyperv_initialized())
+ return -ENODEV;
+
+ ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
+ if (ret) {
+ pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
+ return ret;
+ }
+
+ /*
+ * This must come before iommu_device_register() because the latter
+ * calls into the hooks.
+ */
+ hv_initialize_special_domains();
+
+ ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
+ if (ret) {
+ pr_err("Hyper-V: iommu_device_register failed: %d\n", ret);
+ goto err_sysfs_remove;
+ }
+
+ pr_info("Hyper-V IOMMU initialized\n");
+
+ return 0;
+
+err_sysfs_remove:
+ iommu_device_sysfs_remove(iommup);
+ return ret;
+}
+
+void __init hv_iommu_detect(void)
+{
+ if (no_iommu || iommu_detected)
+ return;
+
+ /* For L1VH, always expose an iommu unit */
+ if (!hv_l1vh_partition() &&
+ !(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
+ return;
+
+ iommu_detected = 1;
+ x86_init.iommu.iommu_init = hv_iommu_init;
+
+ pci_request_acs();
+}
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..2ad111727e82 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1767,4 +1767,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
#define HVPFN_DOWN(x) ((x) >> HV_HYP_PAGE_SHIFT)
#define page_to_hvpfn(page) (page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
+#ifdef CONFIG_HYPERV_IOMMU
+void __init hv_iommu_detect(void);
+#else
+static inline void hv_iommu_detect(void) { }
+#endif /* CONFIG_HYPERV_IOMMU */
+
#endif /* _HYPERV_H */
--
2.51.2.vfs.0.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH v0 13/15] x86/hyperv: Basic interrupt support for direct attached devices
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (11 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-21 0:47 ` Stanislav Kinsburskii
2026-01-20 6:42 ` [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl Mukesh R
` (2 subsequent siblings)
15 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
As mentioned previously, a direct attached device must be referenced
via a logical device id, which is formed in the initial attach hypercall.
The interrupt mapping paths for direct attached devices are almost the
same, except that logical device ids must be used instead of PCI device
ids.
L1VH only supports direct attach for passing devices through to its
guests, and devices on L1VH are VMBus based. The interrupts, however,
are mapped via the map interrupt hypercall and not the traditional
method of VMBus messages.
The partition id for the relevant hypercalls is tricky. This is because
a device could be moving from the root to a guest and then back to the
root. In the L1VH case, it could be moving from the system host to the
L1VH root to a guest, and then back to the L1VH root. So the id is
carefully chosen by keeping track of whether the call is on behalf of a
VMM process, whether the device is direct attached (as opposed to
mapped), and whether we are in an L1VH root/parent. For a VMM process,
we assume the call is on behalf of a guest; otherwise the device is
being attached or detached during boot or shutdown of the privileged
partition.
Lastly, a dummy cpu and vector are used to map the interrupt for a
direct attached device. This is because, once a device is marked for
direct attach, the hypervisor will not let any interrupts be mapped to
the host. The interrupt is therefore mapped to a dummy guest cpu and
vector, and then remapped correctly during guest boot via the retarget
paths.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
arch/arm64/include/asm/mshyperv.h | 15 +++++
arch/x86/hyperv/irqdomain.c | 57 +++++++++++++-----
arch/x86/include/asm/mshyperv.h | 4 ++
drivers/pci/controller/pci-hyperv.c | 91 +++++++++++++++++++++++++----
4 files changed, 142 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index b721d3134ab6..27da480f94f6 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -53,6 +53,21 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
return hv_get_msr(reg);
}
+struct irq_data;
+struct msi_msg;
+struct pci_dev;
+static inline void hv_irq_compose_msi_msg(struct irq_data *data,
+ struct msi_msg *msg) { }
+static inline int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *hvirqe)
+{
+ return -EOPNOTSUPP;
+}
+static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
+{
+ return false;
+}
+
/* SMCCC hypercall parameters */
#define HV_SMCCC_FUNC_NUMBER 1
#define HV_FUNC_ID ARM_SMCCC_CALL_VAL( \
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 33017aa0caa4..e6eb457f791e 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -13,6 +13,16 @@
#include <linux/irqchip/irq-msi-lib.h>
#include <asm/mshyperv.h>
+/*
+ * For direct attached devices (which use logical device ids), the hypervisor
+ * will not allow mappings to the host. But VFIO needs to bind the interrupt
+ * at the very start, before the guest cpu/vector is known. So we bind with a
+ * dummy cpu and vector in that case; later, when the guest starts, retarget
+ * moves the interrupt to the correct guest cpu and vector.
+ */
+#define HV_DDA_DUMMY_CPU 0
+#define HV_DDA_DUMMY_VECTOR 32
+
static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
bool level, int cpu, int vector,
struct hv_interrupt_entry *ret_entry)
@@ -24,6 +34,11 @@ static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
u64 status;
int nr_bank, var_size;
+ if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL) {
+ cpu = HV_DDA_DUMMY_CPU;
+ vector = HV_DDA_DUMMY_VECTOR;
+ }
+
local_irq_save(flags);
input = *this_cpu_ptr(hyperv_pcpu_input_arg);
@@ -95,7 +110,8 @@ static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
return hv_result_to_errno(status);
}
-static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
+static int hv_unmap_interrupt(union hv_device_id hv_devid,
+ struct hv_interrupt_entry *irq_entry)
{
unsigned long flags;
struct hv_input_unmap_device_interrupt *input;
@@ -103,10 +119,14 @@ static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
local_irq_save(flags);
input = *this_cpu_ptr(hyperv_pcpu_input_arg);
-
memset(input, 0, sizeof(*input));
- input->partition_id = hv_current_partition_id;
- input->device_id = id;
+
+ if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL)
+ input->partition_id = hv_iommu_get_curr_partid();
+ else
+ input->partition_id = hv_current_partition_id;
+
+ input->device_id = hv_devid.as_uint64;
input->interrupt_entry = *irq_entry;
status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
@@ -263,6 +283,7 @@ static u64 hv_build_irq_devid(struct pci_dev *pdev)
int hv_map_msi_interrupt(struct irq_data *data,
struct hv_interrupt_entry *out_entry)
{
+ u64 ptid;
struct irq_cfg *cfg = irqd_cfg(data);
struct hv_interrupt_entry dummy;
union hv_device_id hv_devid;
@@ -275,8 +296,17 @@ int hv_map_msi_interrupt(struct irq_data *data,
hv_devid.as_uint64 = hv_build_irq_devid(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
- return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
- cfg->vector, out_entry ? out_entry : &dummy);
+ if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL &&
+     hv_pcidev_is_attached_dev(pdev)) {
+ ptid = hv_iommu_get_curr_partid();
+ } else {
+ /* Device on the l1vh root itself, or not direct attached */
+ ptid = hv_current_partition_id;
+ }
+
+ return hv_map_interrupt(ptid, hv_devid, false, cpu, cfg->vector,
+ out_entry ? out_entry : &dummy);
}
EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
@@ -289,10 +319,7 @@ static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
msg->data = entry->msi_entry.data.as_uint32;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
- struct hv_interrupt_entry *irq_entry);
-
-static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
+void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct hv_interrupt_entry *stored_entry;
struct irq_cfg *cfg = irqd_cfg(data);
@@ -341,16 +368,18 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
data->chip_data = stored_entry;
entry_to_msi_msg(data->chip_data, msg);
}
+EXPORT_SYMBOL_GPL(hv_irq_compose_msi_msg);
-static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
- struct hv_interrupt_entry *irq_entry)
+int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry)
{
union hv_device_id hv_devid;
hv_devid.as_uint64 = hv_build_irq_devid(pdev);
- return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
+ return hv_unmap_interrupt(hv_devid, irq_entry);
}
+EXPORT_SYMBOL_GPL(hv_unmap_msi_interrupt);
/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
@@ -486,7 +515,7 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_unmap_interrupt(hv_devid.as_uint64, entry);
+ return hv_unmap_interrupt(hv_devid, entry);
}
EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index e4ccdbbf1d12..b6facd3a0f5e 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -204,11 +204,15 @@ static inline u64 hv_iommu_get_curr_partid(void)
#endif /* CONFIG_HYPERV_IOMMU */
u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
+void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg);
+extern bool hv_no_attdev;
struct irq_domain *hv_create_pci_msi_domain(void);
int hv_map_msi_interrupt(struct irq_data *data,
struct hv_interrupt_entry *out_entry);
+int hv_unmap_msi_interrupt(struct pci_dev *dev,
+ struct hv_interrupt_entry *hvirqe);
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
struct hv_interrupt_entry *entry);
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 40f0b06bb966..71d1599dc4a8 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -660,15 +660,17 @@ static void hv_irq_retarget_interrupt(struct irq_data *data)
params = *this_cpu_ptr(hyperv_pcpu_input_arg);
memset(params, 0, sizeof(*params));
- params->partition_id = HV_PARTITION_ID_SELF;
+
+ if (hv_pcidev_is_attached_dev(pdev))
+ params->partition_id = hv_iommu_get_curr_partid();
+ else
+ params->partition_id = HV_PARTITION_ID_SELF;
+
params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
- params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff;
+ params->int_entry.msi_entry.address.as_uint32 =
+ int_desc->address & 0xffffffff;
params->int_entry.msi_entry.data.as_uint32 = int_desc->data;
- params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
- (hbus->hdev->dev_instance.b[4] << 16) |
- (hbus->hdev->dev_instance.b[7] << 8) |
- (hbus->hdev->dev_instance.b[6] & 0xf8) |
- PCI_FUNC(pdev->devfn);
+ params->device_id = hv_pci_vmbus_device_id(pdev);
params->int_target.vector = hv_msi_get_int_vector(data);
if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) {
@@ -1263,6 +1265,15 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
mb();
}
spin_unlock_irqrestore(&hbus->config_lock, flags);
+ /*
+ * Make sure PCI_INTERRUPT_PIN is hard-wired to 0 since it may
+ * be read using a 32bit read which is skipped by the above
+ * emulation.
+ */
+ if (PCI_INTERRUPT_PIN >= where &&
+ PCI_INTERRUPT_PIN < (where + size)) {
+ *((char *)val + PCI_INTERRUPT_PIN - where) = 0;
+ }
} else {
dev_err(dev, "Attempt to read beyond a function's config space.\n");
}
@@ -1731,14 +1742,22 @@ static void hv_msi_free(struct irq_domain *domain, unsigned int irq)
if (!int_desc)
return;
- irq_data->chip_data = NULL;
hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
if (!hpdev) {
+ irq_data->chip_data = NULL;
kfree(int_desc);
return;
}
- hv_int_desc_free(hpdev, int_desc);
+ if (hv_pcidev_is_attached_dev(pdev)) {
+ hv_unmap_msi_interrupt(pdev, irq_data->chip_data);
+ kfree(irq_data->chip_data);
+ irq_data->chip_data = NULL;
+ } else {
+ irq_data->chip_data = NULL;
+ hv_int_desc_free(hpdev, int_desc);
+ }
+
put_pcichild(hpdev);
}
@@ -2139,6 +2158,56 @@ static void hv_vmbus_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
msg->data = 0;
}
+/* Compose an msi message for a directly attached device */
+static void hv_dda_compose_msi_msg(struct irq_data *irq_data,
+ struct msi_desc *msi_desc,
+ struct msi_msg *msg)
+{
+ bool multi_msi;
+ struct hv_pcibus_device *hbus;
+ struct hv_pci_dev *hpdev;
+ struct pci_dev *pdev = msi_desc_to_pci_dev(msi_desc);
+
+ hbus = container_of(pdev->bus->sysdata, struct hv_pcibus_device,
+ sysdata);
+
+ multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
+ msi_desc->nvec_used > 1;
+
+ if (multi_msi) {
+ dev_err(&hbus->hdev->device,
+ "Passthru direct attach does not support multi msi\n");
+ goto outerr;
+ }
+
+ hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
+ if (!hpdev)
+ goto outerr;
+
+ /* will unmap if needed and also update irq_data->chip_data */
+ hv_irq_compose_msi_msg(irq_data, msg);
+
+ put_pcichild(hpdev);
+ return;
+
+outerr:
+ memset(msg, 0, sizeof(*msg));
+}
+
+static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+ struct pci_dev *pdev;
+ struct msi_desc *msi_desc;
+
+ msi_desc = irq_data_get_msi_desc(data);
+ pdev = msi_desc_to_pci_dev(msi_desc);
+
+ if (hv_pcidev_is_attached_dev(pdev))
+ hv_dda_compose_msi_msg(data, msi_desc, msg);
+ else
+ hv_vmbus_compose_msi_msg(data, msg);
+}
+
static bool hv_pcie_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
struct irq_domain *real_parent, struct msi_domain_info *info)
{
@@ -2177,7 +2246,7 @@ static const struct msi_parent_ops hv_pcie_msi_parent_ops = {
/* HW Interrupt Chip Descriptor */
static struct irq_chip hv_msi_irq_chip = {
.name = "Hyper-V PCIe MSI",
- .irq_compose_msi_msg = hv_vmbus_compose_msi_msg,
+ .irq_compose_msi_msg = hv_compose_msi_msg,
.irq_set_affinity = irq_chip_set_affinity_parent,
.irq_ack = irq_chip_ack_parent,
.irq_eoi = irq_chip_eoi_parent,
@@ -4096,7 +4165,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
irq_data = irq_get_irq_data(entry->irq);
if (WARN_ON_ONCE(!irq_data))
return -EINVAL;
- hv_vmbus_compose_msi_msg(irq_data, &entry->msg);
+ hv_compose_msi_msg(irq_data, &entry->msg);
}
return 0;
}
--
2.51.2.vfs.0.1
* [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (12 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 13/15] x86/hyperv: Basic interrupt support for direct attached devices Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-21 1:41 ` Stanislav Kinsburskii
2026-01-23 18:34 ` Nuno Das Neves
2026-01-20 6:42 ` [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru Mukesh R
2026-01-20 21:50 ` [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Jacob Pan
15 siblings, 2 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
VFIO no longer puts the mmio pfn in vma->vm_pgoff, so remove the code
that uses it to map mmio space. That code is broken and will cause a
panic.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 20 ++++----------------
1 file changed, 4 insertions(+), 16 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 27313419828d..03f3aa9f5541 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1258,16 +1258,8 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
}
/*
- * This maps two things: guest RAM and for pci passthru mmio space.
- *
- * mmio:
- * - vfio overloads vm_pgoff to store the mmio start pfn/spa.
- * - Two things need to happen for mapping mmio range:
- * 1. mapped in the uaddr so VMM can access it.
- * 2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
- *
- * This function takes care of the second. The first one is managed by vfio,
- * and hence is taken care of via vfio_pci_mmap_fault().
+ * This is called for both user ram and mmio space. The mmio space is not
+ * mapped here, but later during intercept.
*/
static long
mshv_map_user_memory(struct mshv_partition *partition,
@@ -1276,7 +1268,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
struct mshv_mem_region *region;
struct vm_area_struct *vma;
bool is_mmio;
- ulong mmio_pfn;
long ret;
if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
@@ -1286,7 +1277,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
mmap_read_lock(current->mm);
vma = vma_lookup(current->mm, mem.userspace_addr);
is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
- mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
mmap_read_unlock(current->mm);
if (!vma)
@@ -1313,10 +1303,8 @@ mshv_map_user_memory(struct mshv_partition *partition,
HV_MAP_GPA_NO_ACCESS, NULL);
break;
case MSHV_REGION_TYPE_MMIO:
- ret = hv_call_map_mmio_pages(partition->pt_id,
- region->start_gfn,
- mmio_pfn,
- region->nr_pages);
+ /* mmio mappings are handled later during intercepts */
+ ret = 0;
break;
}
--
2.51.2.vfs.0.1
* [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (13 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl Mukesh R
@ 2026-01-20 6:42 ` Mukesh R
2026-01-20 19:52 ` kernel test robot
2026-01-21 1:53 ` Stanislav Kinsburskii
2026-01-20 21:50 ` [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Jacob Pan
15 siblings, 2 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-20 6:42 UTC (permalink / raw)
To: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, nunodasneves, mhklinux, romank
From: Mukesh Rathor <mrathor@linux.microsoft.com>
When a guest access hits a missing mmio mapping, the hypervisor
generates an unmapped gpa intercept. In this path, look up the PCI
resource pfn for the guest gpa and ask the hypervisor to map it via a
hypercall. The PCI resource pfn is maintained by the VFIO driver and
obtained via a fixup_user_fault() call (similar to KVM).
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
1 file changed, 115 insertions(+)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 03f3aa9f5541..4c8bc7cd0888 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -56,6 +56,14 @@ struct hv_stats_page {
};
} __packed;
+bool hv_nofull_mmio; /* don't map the entire mmio region upon fault */
+static int __init setup_hv_nofull_mmio(char *str)
+{
+ hv_nofull_mmio = true;
+ return 0;
+}
+__setup("hv_nofull_mmio", setup_hv_nofull_mmio);
+
struct mshv_root mshv_root;
enum hv_scheduler_type hv_scheduler_type;
@@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
}
#ifdef CONFIG_X86_64
+
+/*
+ * Check whether the uaddr backing gfn is in an mmio range. If so, return 0
+ * with *mmio_pfnp filled in; otherwise return -errno.
+ */
+static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
+ u64 *mmio_pfnp)
+{
+ struct vm_area_struct *vma;
+ bool is_mmio;
+ u64 uaddr;
+ struct mshv_mem_region *mreg;
+ struct follow_pfnmap_args pfnmap_args;
+ int rc = -EINVAL;
+
+ /*
+ * Do not allow mem region to be deleted beneath us. VFIO uses
+ * useraddr vma to lookup pci bar pfn.
+ */
+ spin_lock(&pt->pt_mem_regions_lock);
+
+ /* Get the region again under the lock */
+ mreg = mshv_partition_region_by_gfn(pt, gfn);
+ if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
+ goto unlock_pt_out;
+
+ uaddr = mreg->start_uaddr +
+ ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
+
+ mmap_read_lock(current->mm);
+ vma = vma_lookup(current->mm, uaddr);
+ is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
+ if (!is_mmio)
+ goto unlock_mmap_out;
+
+ pfnmap_args.vma = vma;
+ pfnmap_args.address = uaddr;
+
+ rc = follow_pfnmap_start(&pfnmap_args);
+ if (rc) {
+ rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
+ NULL);
+ if (rc)
+ goto unlock_mmap_out;
+
+ rc = follow_pfnmap_start(&pfnmap_args);
+ if (rc)
+ goto unlock_mmap_out;
+ }
+
+ *mmio_pfnp = pfnmap_args.pfn;
+ follow_pfnmap_end(&pfnmap_args);
+
+unlock_mmap_out:
+ mmap_read_unlock(current->mm);
+unlock_pt_out:
+ spin_unlock(&pt->pt_mem_regions_lock);
+ return rc;
+}
+
+/*
+ * At present, the only unmapped gpa is mmio space. Verify that it is mmio
+ * and resolve it if possible.
+ * Returns: true if this was a valid mmio intercept and it was handled,
+ * else false
+ */
+static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
+{
+ struct hv_message *hvmsg = vp->vp_intercept_msg_page;
+ struct hv_x64_memory_intercept_message *msg;
+ union hv_x64_memory_access_info accinfo;
+ u64 gfn, mmio_spa, numpgs;
+ struct mshv_mem_region *mreg;
+ int rc;
+ struct mshv_partition *pt = vp->vp_partition;
+
+ msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
+ accinfo = msg->memory_access_info;
+
+ if (!accinfo.gva_gpa_valid)
+ return false;
+
+ /* Do a fast check and bail if this is a non-mmio intercept */
+ gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
+ mreg = mshv_partition_region_by_gfn(pt, gfn);
+ if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
+ return false;
+
+ rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
+ if (rc)
+ return false;
+
+ if (!hv_nofull_mmio) { /* default case: map the whole region */
+ mmio_spa -= gfn - mreg->start_gfn;
+ gfn = mreg->start_gfn;
+ numpgs = mreg->nr_pages;
+ } else {
+ numpgs = 1;
+ }
+
+ rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
+
+ return rc == 0;
+}
+
static struct mshv_mem_region *
mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
{
@@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
return ret;
}
+
#else /* CONFIG_X86_64 */
+static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
#endif /* CONFIG_X86_64 */
static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
{
switch (vp->vp_intercept_msg_page->header.message_type) {
+ case HVMSG_UNMAPPED_GPA:
+ return mshv_handle_unmapped_gpa(vp);
case HVMSG_GPA_INTERCEPT:
return mshv_handle_gpa_intercept(vp);
}
--
2.51.2.vfs.0.1
* Re: [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO
2026-01-20 6:42 ` [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO Mukesh R
@ 2026-01-20 16:09 ` Stanislav Kinsburskii
2026-01-23 18:32 ` Nuno Das Neves
1 sibling, 0 replies; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-20 16:09 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:21PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Add a new file to implement VFIO-MSHV bridge pseudo device. These
> functions are called in the VFIO framework, and credits to kvm/vfio.c
> as this file was adapted from it.
>
> Original author: Wei Liu <wei.liu@kernel.org>
> (Slightly modified from the original version).
>
There is a Linux kernel convention for giving credit when code is adapted
from elsewhere. This doesn't follow that convention. Please fix.
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/Makefile | 3 +-
> drivers/hv/mshv_vfio.c | 210 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 212 insertions(+), 1 deletion(-)
> create mode 100644 drivers/hv/mshv_vfio.c
>
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index a49f93c2d245..eae003c4cb8f 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
> hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
> hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
> mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
> - mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
> + mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
> + mshv_vfio.o
> mshv_vtl-y := mshv_vtl_main.o
>
> # Code that must be built-in
> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
> new file mode 100644
> index 000000000000..6ea4d99a3bd2
> --- /dev/null
> +++ b/drivers/hv/mshv_vfio.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VFIO-MSHV bridge pseudo device
> + *
> + * Heavily inspired by the VFIO-KVM bridge pseudo device.
> + */
> +#include <linux/errno.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/vfio.h>
> +
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +struct mshv_vfio_file {
> + struct list_head node;
> + struct file *file; /* list of struct mshv_vfio_file */
> +};
> +
> +struct mshv_vfio {
> + struct list_head file_list;
> + struct mutex lock;
> +};
> +
> +static bool mshv_vfio_file_is_valid(struct file *file)
> +{
> + bool (*fn)(struct file *file);
> + bool ret;
> +
> + fn = symbol_get(vfio_file_is_valid);
> + if (!fn)
> + return false;
> +
> + ret = fn(file);
> +
> + symbol_put(vfio_file_is_valid);
> +
> + return ret;
> +}
> +
> +static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
> +{
> + struct mshv_vfio *mshv_vfio = mshvdev->device_private;
> + struct mshv_vfio_file *mvf;
> + struct file *filp;
> + long ret = 0;
> +
> + filp = fget(fd);
> + if (!filp)
> + return -EBADF;
> +
> + /* Ensure the FD is a vfio FD. */
> + if (!mshv_vfio_file_is_valid(filp)) {
> + ret = -EINVAL;
> + goto out_fput;
> + }
> +
> + mutex_lock(&mshv_vfio->lock);
> +
> + list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
> + if (mvf->file == filp) {
> + ret = -EEXIST;
> + goto out_unlock;
> + }
> + }
> +
> + mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
> + if (!mvf) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + mvf->file = get_file(filp);
> + list_add_tail(&mvf->node, &mshv_vfio->file_list);
> +
> +out_unlock:
> + mutex_unlock(&mshv_vfio->lock);
> +out_fput:
> + fput(filp);
> + return ret;
> +}
> +
> +static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
> +{
> + struct mshv_vfio *mshv_vfio = mshvdev->device_private;
> + struct mshv_vfio_file *mvf;
> + long ret;
> +
> + CLASS(fd, f)(fd);
> +
> + if (fd_empty(f))
> + return -EBADF;
> +
> + ret = -ENOENT;
> + mutex_lock(&mshv_vfio->lock);
> +
> + list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
> + if (mvf->file != fd_file(f))
> + continue;
> +
> + list_del(&mvf->node);
> + fput(mvf->file);
> + kfree(mvf);
> + ret = 0;
> + break;
> + }
> +
> + mutex_unlock(&mshv_vfio->lock);
> + return ret;
> +}
> +
> +static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
> + void __user *arg)
> +{
> + int32_t __user *argp = arg;
> + int32_t fd;
> +
> + switch (attr) {
> + case MSHV_DEV_VFIO_FILE_ADD:
> + if (get_user(fd, argp))
> + return -EFAULT;
> + return mshv_vfio_file_add(mshvdev, fd);
> +
> + case MSHV_DEV_VFIO_FILE_DEL:
> + if (get_user(fd, argp))
> + return -EFAULT;
> + return mshv_vfio_file_del(mshvdev, fd);
> + }
> +
> + return -ENXIO;
> +}
> +
> +static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
> + struct mshv_device_attr *attr)
> +{
> + switch (attr->group) {
> + case MSHV_DEV_VFIO_FILE:
> + return mshv_vfio_set_file(mshvdev, attr->attr,
> + u64_to_user_ptr(attr->addr));
> + }
> +
> + return -ENXIO;
> +}
> +
> +static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
> + struct mshv_device_attr *attr)
> +{
> + switch (attr->group) {
> + case MSHV_DEV_VFIO_FILE:
> + switch (attr->attr) {
> + case MSHV_DEV_VFIO_FILE_ADD:
> + case MSHV_DEV_VFIO_FILE_DEL:
> + return 0;
> + }
> +
> + break;
> + }
> +
> + return -ENXIO;
> +}
> +
> +static long mshv_vfio_create_device(struct mshv_device *mshvdev, u32 type)
> +{
> + struct mshv_device *tmp;
> + struct mshv_vfio *mshv_vfio;
> +
> + /* Only one VFIO "device" per VM */
> + hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
> + device_ptnode)
> + if (tmp->device_ops == &mshv_vfio_device_ops)
> + return -EBUSY;
> +
> + mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
> + if (mshv_vfio == NULL)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&mshv_vfio->file_list);
> + mutex_init(&mshv_vfio->lock);
> +
> + mshvdev->device_private = mshv_vfio;
> +
> + return 0;
> +}
> +
> +/* This is called from mshv_device_fop_release() */
> +static void mshv_vfio_release_device(struct mshv_device *mshvdev)
> +{
> + struct mshv_vfio *mv = mshvdev->device_private;
> + struct mshv_vfio_file *mvf, *tmp;
> +
> + list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
> + fput(mvf->file);
This put must be synchronous: the device must be detached from the IOMMU
domain before partition destruction is attempted.
This was explicitly mentioned in the patch that originated this code.
Please fix, and add a comment plus credits to the commit message.
Thanks,
Stanislav
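One possible shape for the fix, as a sketch only (it assumes __fput_sync() is acceptable in this context, which the maintainers would need to confirm):

```c
/* Sketch: drop file references synchronously so the VFIO release path
 * (and hence device detach from the IOMMU domain) completes before the
 * partition is torn down.
 */
static void mshv_vfio_release_device(struct mshv_device *mshvdev)
{
	struct mshv_vfio *mv = mshvdev->device_private;
	struct mshv_vfio_file *mvf, *tmp;

	list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
		list_del(&mvf->node);
		/*
		 * __fput_sync() releases the file immediately instead of
		 * deferring the put to task work, guaranteeing the detach
		 * has happened by the time we return.
		 */
		__fput_sync(mvf->file);
		kfree(mvf);
	}

	kfree(mv);
	kfree(mshvdev);
}
```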
> + list_del(&mvf->node);
> + kfree(mvf);
> + }
> +
> + kfree(mv);
> + kfree(mshvdev);
> +}
> +
> +struct mshv_device_ops mshv_vfio_device_ops = {
> + .device_name = "mshv-vfio",
> + .device_create = mshv_vfio_create_device,
> + .device_release = mshv_vfio_release_device,
> + .device_set_attr = mshv_vfio_set_attr,
> + .device_has_attr = mshv_vfio_has_attr,
> +};
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device
2026-01-20 6:42 ` [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device Mukesh R
@ 2026-01-20 16:13 ` Stanislav Kinsburskii
0 siblings, 0 replies; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-20 16:13 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:22PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Add ioctl support for creating MSHV devices for a partition. At
> present only VFIO device types are supported, but more could be
> added. At a high level, a partition ioctl to create device verifies
> it is of type VFIO and does some setup for bridge code in mshv_vfio.c.
> Adapted from KVM device ioctls.
>
> Credits: Original author: Wei Liu <wei.liu@kernel.org>
> NB: Slightly modified from the original version.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 126 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 126 insertions(+)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 83c7bad269a0..27313419828d 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1551,6 +1551,129 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
> return ret;
> }
>
> +static long mshv_device_attr_ioctl(struct mshv_device *mshv_dev, int cmd,
> + ulong uarg)
> +{
> + struct mshv_device_attr attr;
> + const struct mshv_device_ops *devops = mshv_dev->device_ops;
> +
> + if (copy_from_user(&attr, (void __user *)uarg, sizeof(attr)))
> + return -EFAULT;
> +
> + switch (cmd) {
> + case MSHV_SET_DEVICE_ATTR:
> + if (devops->device_set_attr)
> + return devops->device_set_attr(mshv_dev, &attr);
> + break;
> + case MSHV_HAS_DEVICE_ATTR:
> + if (devops->device_has_attr)
> + return devops->device_has_attr(mshv_dev, &attr);
> + break;
> + }
> +
> + return -EPERM;
> +}
> +
> +static long mshv_device_fop_ioctl(struct file *filp, unsigned int cmd,
> + ulong uarg)
> +{
> + struct mshv_device *mshv_dev = filp->private_data;
> +
> + switch (cmd) {
> + case MSHV_SET_DEVICE_ATTR:
> + case MSHV_HAS_DEVICE_ATTR:
> + return mshv_device_attr_ioctl(mshv_dev, cmd, uarg);
> + }
> +
> + return -ENOTTY;
> +}
> +
> +static int mshv_device_fop_release(struct inode *inode, struct file *filp)
> +{
> + struct mshv_device *mshv_dev = filp->private_data;
> + struct mshv_partition *partition = mshv_dev->device_pt;
> +
> + if (mshv_dev->device_ops->device_release) {
> + mutex_lock(&partition->pt_mutex);
> + hlist_del(&mshv_dev->device_ptnode);
> + mshv_dev->device_ops->device_release(mshv_dev);
> + mutex_unlock(&partition->pt_mutex);
> + }
> +
> + mshv_partition_put(partition);
> + return 0;
> +}
> +
> +static const struct file_operations mshv_device_fops = {
> + .owner = THIS_MODULE,
> + .unlocked_ioctl = mshv_device_fop_ioctl,
> + .release = mshv_device_fop_release,
> +};
> +
> +long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> + void __user *uarg)
> +{
> + long rc;
> + struct mshv_create_device devargk;
> + struct mshv_device *mshv_dev;
> + const struct mshv_device_ops *vfio_ops;
> + int type;
> +
> + if (copy_from_user(&devargk, uarg, sizeof(devargk))) {
> + rc = -EFAULT;
> + goto out;
> + }
> +
> + /* At present, only VFIO is supported */
> + if (devargk.type != MSHV_DEV_TYPE_VFIO) {
> + rc = -ENODEV;
> + goto out;
> + }
> +
> + if (devargk.flags & MSHV_CREATE_DEVICE_TEST) {
> + rc = 0;
> + goto out;
> + }
> +
> + mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
> + if (mshv_dev == NULL) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
> + vfio_ops = &mshv_vfio_device_ops;
> + mshv_dev->device_ops = vfio_ops;
> + mshv_dev->device_pt = partition;
> +
> + rc = vfio_ops->device_create(mshv_dev, type);
> + if (rc < 0) {
> + kfree(mshv_dev);
> + goto out;
> + }
> +
> + hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
> +
> + mshv_partition_get(partition);
> + rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
> + mshv_dev, O_RDWR | O_CLOEXEC);
> + if (rc < 0) {
> + mshv_partition_put(partition);
> + hlist_del(&mshv_dev->device_ptnode);
> + vfio_ops->device_release(mshv_dev);
> + goto out;
> + }
> +
> + devargk.fd = rc;
> + rc = 0;
> +
> + if (copy_to_user(uarg, &devargk, sizeof(devargk))) {
Shouldn't the partition be put here?
Thanks,
Stanislav
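One way to sidestep this unwind question is the reserve-then-install pattern KVM uses for device creation: reserve the fd, copy the result to userspace, and only install the file once nothing can fail. A rough sketch (hypothetical; variable names follow the patch, and the err_release label is an assumption):

```c
	rc = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
	if (rc < 0)
		goto err_release;	/* hypothetical unwind label */

	filp = anon_inode_getfile(vfio_ops->device_name, &mshv_device_fops,
				  mshv_dev, O_RDWR | O_CLOEXEC);
	if (IS_ERR(filp)) {
		put_unused_fd(rc);
		rc = PTR_ERR(filp);
		goto err_release;
	}

	devargk.fd = rc;
	if (copy_to_user(uarg, &devargk, sizeof(devargk))) {
		/* Nothing is visible to userspace yet, so unwinding is safe. */
		put_unused_fd(rc);
		fput(filp);	/* release path drops the partition ref */
		return -EFAULT;
	}

	fd_install(rc, filp);	/* point of no return: fd now visible */
	return 0;
```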
> + rc = -EFAULT;
> + goto out;
> + }
> +out:
> + return rc;
> +}
> +
> static long
> mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> {
> @@ -1587,6 +1710,9 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> case MSHV_ROOT_HVCALL:
> ret = mshv_ioctl_passthru_hvcall(partition, true, uarg);
> break;
> + case MSHV_CREATE_DEVICE:
> + ret = mshv_partition_ioctl_create_device(partition, uarg);
> + break;
> default:
> ret = -ENOTTY;
> }
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
@ 2026-01-20 19:08 ` kernel test robot
2026-01-20 21:09 ` kernel test robot
2026-02-05 18:48 ` Anirudh Rayabharam
2 siblings, 0 replies; 69+ messages in thread
From: kernel test robot @ 2026-01-20 19:08 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
Hi Mukesh,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/x86/core]
[also build test ERROR on pci/next pci/for-linus arm64/for-next/core clk/clk-next soc/for-next linus/master arnd-asm-generic/master v6.19-rc6 next-20260119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mukesh-R/iommu-hyperv-rename-hyperv-iommu-c-to-hyperv-irq-c/20260120-145832
base: tip/x86/core
patch link: https://lore.kernel.org/r/20260120064230.3602565-2-mrathor%40linux.microsoft.com
patch subject: [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
config: i386-randconfig-001-20260120 (https://download.01.org/0day-ci/archive/20260121/202601210208.mg3YUkif-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260121/202601210208.mg3YUkif-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601210208.mg3YUkif-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from drivers/acpi/pci_root.c:20:
>> include/linux/dmar.h:269:17: error: unknown type name '__u128'; did you mean '__u32'?
269 | __u128 irte;
| ^~~~~~
| __u32
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for IRQ_REMAP
Depends on [n]: IOMMU_SUPPORT [=y] && X86_64 [=n] && X86_IO_APIC [=y] && PCI_MSI [=n] && ACPI [=y]
Selected by [y]:
- HYPERV_IOMMU [=y] && IOMMU_SUPPORT [=y] && HYPERV [=y] && X86 [=y]
vim +269 include/linux/dmar.h
2ae21010694e56 Suresh Siddha 2008-07-10 200
2ae21010694e56 Suresh Siddha 2008-07-10 201 struct irte {
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 202 union {
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 203 struct {
2ae21010694e56 Suresh Siddha 2008-07-10 204 union {
3bf17472226b00 Thomas Gleixner 2015-06-09 205 /* Shared between remapped and posted mode*/
2ae21010694e56 Suresh Siddha 2008-07-10 206 struct {
3bf17472226b00 Thomas Gleixner 2015-06-09 207 __u64 present : 1, /* 0 */
3bf17472226b00 Thomas Gleixner 2015-06-09 208 fpd : 1, /* 1 */
3bf17472226b00 Thomas Gleixner 2015-06-09 209 __res0 : 6, /* 2 - 6 */
3bf17472226b00 Thomas Gleixner 2015-06-09 210 avail : 4, /* 8 - 11 */
3bf17472226b00 Thomas Gleixner 2015-06-09 211 __res1 : 3, /* 12 - 14 */
3bf17472226b00 Thomas Gleixner 2015-06-09 212 pst : 1, /* 15 */
3bf17472226b00 Thomas Gleixner 2015-06-09 213 vector : 8, /* 16 - 23 */
3bf17472226b00 Thomas Gleixner 2015-06-09 214 __res2 : 40; /* 24 - 63 */
3bf17472226b00 Thomas Gleixner 2015-06-09 215 };
3bf17472226b00 Thomas Gleixner 2015-06-09 216
3bf17472226b00 Thomas Gleixner 2015-06-09 217 /* Remapped mode */
3bf17472226b00 Thomas Gleixner 2015-06-09 218 struct {
3bf17472226b00 Thomas Gleixner 2015-06-09 219 __u64 r_present : 1, /* 0 */
3bf17472226b00 Thomas Gleixner 2015-06-09 220 r_fpd : 1, /* 1 */
3bf17472226b00 Thomas Gleixner 2015-06-09 221 dst_mode : 1, /* 2 */
3bf17472226b00 Thomas Gleixner 2015-06-09 222 redir_hint : 1, /* 3 */
3bf17472226b00 Thomas Gleixner 2015-06-09 223 trigger_mode : 1, /* 4 */
3bf17472226b00 Thomas Gleixner 2015-06-09 224 dlvry_mode : 3, /* 5 - 7 */
3bf17472226b00 Thomas Gleixner 2015-06-09 225 r_avail : 4, /* 8 - 11 */
3bf17472226b00 Thomas Gleixner 2015-06-09 226 r_res0 : 4, /* 12 - 15 */
3bf17472226b00 Thomas Gleixner 2015-06-09 227 r_vector : 8, /* 16 - 23 */
3bf17472226b00 Thomas Gleixner 2015-06-09 228 r_res1 : 8, /* 24 - 31 */
3bf17472226b00 Thomas Gleixner 2015-06-09 229 dest_id : 32; /* 32 - 63 */
3bf17472226b00 Thomas Gleixner 2015-06-09 230 };
3bf17472226b00 Thomas Gleixner 2015-06-09 231
3bf17472226b00 Thomas Gleixner 2015-06-09 232 /* Posted mode */
3bf17472226b00 Thomas Gleixner 2015-06-09 233 struct {
3bf17472226b00 Thomas Gleixner 2015-06-09 234 __u64 p_present : 1, /* 0 */
3bf17472226b00 Thomas Gleixner 2015-06-09 235 p_fpd : 1, /* 1 */
3bf17472226b00 Thomas Gleixner 2015-06-09 236 p_res0 : 6, /* 2 - 7 */
3bf17472226b00 Thomas Gleixner 2015-06-09 237 p_avail : 4, /* 8 - 11 */
3bf17472226b00 Thomas Gleixner 2015-06-09 238 p_res1 : 2, /* 12 - 13 */
3bf17472226b00 Thomas Gleixner 2015-06-09 239 p_urgent : 1, /* 14 */
3bf17472226b00 Thomas Gleixner 2015-06-09 240 p_pst : 1, /* 15 */
3bf17472226b00 Thomas Gleixner 2015-06-09 241 p_vector : 8, /* 16 - 23 */
3bf17472226b00 Thomas Gleixner 2015-06-09 242 p_res2 : 14, /* 24 - 37 */
3bf17472226b00 Thomas Gleixner 2015-06-09 243 pda_l : 26; /* 38 - 63 */
2ae21010694e56 Suresh Siddha 2008-07-10 244 };
2ae21010694e56 Suresh Siddha 2008-07-10 245 __u64 low;
2ae21010694e56 Suresh Siddha 2008-07-10 246 };
2ae21010694e56 Suresh Siddha 2008-07-10 247
2ae21010694e56 Suresh Siddha 2008-07-10 248 union {
3bf17472226b00 Thomas Gleixner 2015-06-09 249 /* Shared between remapped and posted mode*/
2ae21010694e56 Suresh Siddha 2008-07-10 250 struct {
3bf17472226b00 Thomas Gleixner 2015-06-09 251 __u64 sid : 16, /* 64 - 79 */
3bf17472226b00 Thomas Gleixner 2015-06-09 252 sq : 2, /* 80 - 81 */
3bf17472226b00 Thomas Gleixner 2015-06-09 253 svt : 2, /* 82 - 83 */
3bf17472226b00 Thomas Gleixner 2015-06-09 254 __res3 : 44; /* 84 - 127 */
3bf17472226b00 Thomas Gleixner 2015-06-09 255 };
3bf17472226b00 Thomas Gleixner 2015-06-09 256
3bf17472226b00 Thomas Gleixner 2015-06-09 257 /* Posted mode*/
3bf17472226b00 Thomas Gleixner 2015-06-09 258 struct {
3bf17472226b00 Thomas Gleixner 2015-06-09 259 __u64 p_sid : 16, /* 64 - 79 */
3bf17472226b00 Thomas Gleixner 2015-06-09 260 p_sq : 2, /* 80 - 81 */
3bf17472226b00 Thomas Gleixner 2015-06-09 261 p_svt : 2, /* 82 - 83 */
3bf17472226b00 Thomas Gleixner 2015-06-09 262 p_res3 : 12, /* 84 - 95 */
3bf17472226b00 Thomas Gleixner 2015-06-09 263 pda_h : 32; /* 96 - 127 */
2ae21010694e56 Suresh Siddha 2008-07-10 264 };
2ae21010694e56 Suresh Siddha 2008-07-10 265 __u64 high;
2ae21010694e56 Suresh Siddha 2008-07-10 266 };
2ae21010694e56 Suresh Siddha 2008-07-10 267 };
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 268 #ifdef CONFIG_IRQ_REMAP
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 @269 __u128 irte;
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 270 #endif
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 271 };
b1fe7f2cda2a00 Peter Zijlstra 2023-05-31 272 };
423f085952fd72 Thomas Gleixner 2010-10-10 273
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-20 6:42 ` [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru Mukesh R
@ 2026-01-20 19:52 ` kernel test robot
2026-01-21 1:53 ` Stanislav Kinsburskii
1 sibling, 0 replies; 69+ messages in thread
From: kernel test robot @ 2026-01-20 19:52 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
Hi Mukesh,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on pci/next pci/for-linus arm64/for-next/core soc/for-next linus/master v6.19-rc6]
[cannot apply to clk/clk-next arnd-asm-generic/master next-20260119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mukesh-R/iommu-hyperv-rename-hyperv-iommu-c-to-hyperv-irq-c/20260120-145832
base: tip/x86/core
patch link: https://lore.kernel.org/r/20260120064230.3602565-16-mrathor%40linux.microsoft.com
patch subject: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
config: x86_64-randconfig-003-20260120 (https://download.01.org/0day-ci/archive/20260121/202601210255.2ZZOLtMV-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260121/202601210255.2ZZOLtMV-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601210255.2ZZOLtMV-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/hv/mshv_root_main.c:60:19: warning: 'setup_hv_full_mmio' defined but not used [-Wunused-function]
60 | static int __init setup_hv_full_mmio(char *str)
| ^~~~~~~~~~~~~~~~~~
vim +/setup_hv_full_mmio +60 drivers/hv/mshv_root_main.c
58
59 bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> 60 static int __init setup_hv_full_mmio(char *str)
61 {
62 hv_nofull_mmio = true;
63 return 0;
64 }
65 __setup("hv_nofull_mmio", setup_hv_full_mmio);
66
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
2026-01-20 19:08 ` kernel test robot
@ 2026-01-20 21:09 ` kernel test robot
2026-02-05 18:48 ` Anirudh Rayabharam
2 siblings, 0 replies; 69+ messages in thread
From: kernel test robot @ 2026-01-20 21:09 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
Hi Mukesh,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on pci/next pci/for-linus arm64/for-next/core clk/clk-next soc/for-next linus/master arnd-asm-generic/master v6.19-rc6 next-20260120]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mukesh-R/iommu-hyperv-rename-hyperv-iommu-c-to-hyperv-irq-c/20260120-145832
base: tip/x86/core
patch link: https://lore.kernel.org/r/20260120064230.3602565-2-mrathor%40linux.microsoft.com
patch subject: [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
config: i386-allmodconfig (https://download.01.org/0day-ci/archive/20260121/202601210423.wwOrf2K8-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260121/202601210423.wwOrf2K8-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601210423.wwOrf2K8-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from drivers/iommu/intel/irq_remapping.c:6:
include/linux/dmar.h:269:17: error: unknown type name '__u128'; did you mean '__u32'?
269 | __u128 irte;
| ^~~~~~
| __u32
drivers/iommu/intel/irq_remapping.c: In function 'modify_irte':
drivers/iommu/intel/irq_remapping.c:181:17: error: unknown type name 'u128'
181 | u128 old = irte->irte;
| ^~~~
In file included from arch/x86/include/asm/bug.h:193,
from arch/x86/include/asm/alternative.h:9,
from arch/x86/include/asm/barrier.h:5,
from include/asm-generic/bitops/generic-non-atomic.h:7,
from include/linux/bitops.h:28,
from include/linux/kernel.h:23,
from include/linux/interrupt.h:6,
from drivers/iommu/intel/irq_remapping.c:5:
include/linux/atomic/atomic-arch-fallback.h:326:14: error: void value not ignored as it ought to be
326 | ___r = raw_cmpxchg128((_ptr), ___o, (_new)); \
| ^
include/asm-generic/bug.h:110:32: note: in definition of macro 'WARN_ON'
110 | int __ret_warn_on = !!(condition); \
| ^~~~~~~~~
include/linux/atomic/atomic-instrumented.h:4956:9: note: in expansion of macro 'raw_try_cmpxchg128'
4956 | raw_try_cmpxchg128(__ai_ptr, __ai_oldp, __VA_ARGS__); \
| ^~~~~~~~~~~~~~~~~~
drivers/iommu/intel/irq_remapping.c:182:26: note: in expansion of macro 'try_cmpxchg128'
182 | WARN_ON(!try_cmpxchg128(&irte->irte, &old, irte_modified->irte));
| ^~~~~~~~~~~~~~
drivers/iommu/intel/irq_remapping.c: In function 'intel_ir_set_vcpu_affinity':
>> drivers/iommu/intel/irq_remapping.c:1270:40: warning: left shift count >= width of type [-Wshift-count-overflow]
1270 | ~(-1UL << PDA_HIGH_BIT);
| ^~
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for IRQ_REMAP
Depends on [n]: IOMMU_SUPPORT [=y] && X86_64 [=n] && X86_IO_APIC [=y] && PCI_MSI [=y] && ACPI [=y]
Selected by [y]:
- HYPERV_IOMMU [=y] && IOMMU_SUPPORT [=y] && HYPERV [=y] && X86 [=y]
vim +1270 drivers/iommu/intel/irq_remapping.c
b106ee63abccbba drivers/iommu/intel_irq_remapping.c Jiang Liu 2015-04-13 1241
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1242 static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *info)
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1243 {
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1244 struct intel_ir_data *ir_data = data->chip_data;
53527ea1b70224d drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-06-11 1245 struct intel_iommu_pi_data *pi_data = info;
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1246
ed1e48ea4370300 drivers/iommu/intel/irq_remapping.c Jacob Pan 2024-04-23 1247 /* stop posting interrupts, back to the default mode */
53527ea1b70224d drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-06-11 1248 if (!pi_data) {
2454823e97a63d8 drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-03-19 1249 __intel_ir_reconfigure_irte(data, true);
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1250 } else {
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1251 struct irte irte_pi;
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1252
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1253 /*
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1254 * We are not caching the posted interrupt entry. We
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1255 * copy the data from the remapped entry and modify
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1256 * the fields which are relevant for posted mode. The
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1257 * cached remapped entry is used for switching back to
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1258 * remapped mode.
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1259 */
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1260 memset(&irte_pi, 0, sizeof(irte_pi));
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1261 dmar_copy_shared_irte(&irte_pi, &ir_data->irte_entry);
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1262
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1263 /* Update the posted mode fields */
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1264 irte_pi.p_pst = 1;
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1265 irte_pi.p_urgent = 0;
53527ea1b70224d drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-06-11 1266 irte_pi.p_vector = pi_data->vector;
53527ea1b70224d drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-06-11 1267 irte_pi.pda_l = (pi_data->pi_desc_addr >>
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1268 (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
53527ea1b70224d drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-06-11 1269 irte_pi.pda_h = (pi_data->pi_desc_addr >> 32) &
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 @1270 ~(-1UL << PDA_HIGH_BIT);
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1271
688124cc541f60d drivers/iommu/intel/irq_remapping.c Sean Christopherson 2025-03-19 1272 ir_data->irq_2_iommu.posted_vcpu = true;
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1273 modify_irte(&ir_data->irq_2_iommu, &irte_pi);
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1274 }
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1275
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1276 return 0;
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1277 }
8541186faf3b596 drivers/iommu/intel_irq_remapping.c Feng Wu 2015-06-09 1278
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 00/15] PCI passthru on Hyper-V (Part I)
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
` (14 preceding siblings ...)
2026-01-20 6:42 ` [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru Mukesh R
@ 2026-01-20 21:50 ` Jacob Pan
2026-01-24 2:27 ` Mukesh R
15 siblings, 1 reply; 69+ messages in thread
From: Jacob Pan @ 2026-01-20 21:50 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
Hi Mukesh,
On Mon, 19 Jan 2026 22:42:15 -0800
Mukesh R <mrathor@linux.microsoft.com> wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Implement passthru of PCI devices to unprivileged virtual machines
> (VMs) when Linux is running as a privileged VM on Microsoft Hyper-V
> hypervisor. This support is made to fit within the workings of VFIO
> framework, and any VMM needing to use it must use the VFIO subsystem.
> This supports both full device passthru and SR-IOV based VFs.
>
> There are 3 cases where Linux can run as a privileged VM (aka MSHV):
> Baremetal root (meaning Hyper-V+Linux), L1VH, and Nested.
>
I think some introduction/background to L1VH would help.
> At a high level, the hypervisor supports traditional mapped iommu
> domains that use explicit map and unmap hypercalls for mapping and
> unmapping guest RAM into the iommu subsystem.
It may be clearer to state that the hypervisor supports Linux IOMMU
paging domains through map/unmap hypercalls, mapping GPAs to HPAs using
stage-2 I/O page tables.
> Hyper-V also has a
> concept of direct attach devices whereby the iommu subsystem simply
> uses the guest HW page table (ept/npt/..). This series adds support
> for both, and both are made to work in VFIO type1 subsystem.
>
This may warrant introducing a new IOMMU domain feature flag, as it
performs mappings but does not support map/unmap semantics in the same
way as a paging domain.
> While this Part I focuses on memory mappings, upcoming Part II
> will focus on irq bypass along with some minor irq remapping
> updates.
>
> This patch series was tested using Cloud Hypervisor version 48. Qemu
> support of MSHV is in the works, and that will be extended to include
> PCI passthru and SR-IOV support in the near future.
>
> Based on: 8f0b4cce4481 (origin/hyperv-next)
>
> Thanks,
> -Mukesh
>
> Mukesh Rathor (15):
> iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
> x86/hyperv: cosmetic changes in irqdomain.c for readability
> x86/hyperv: add insufficient memory support in irqdomain.c
> mshv: Provide a way to get partition id if running in a VMM process
> mshv: Declarations and definitions for VFIO-MSHV bridge device
> mshv: Implement mshv bridge device for VFIO
> mshv: Add ioctl support for MSHV-VFIO bridge device
> PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
> mshv: Import data structs around device domains and irq remapping
> PCI: hv: Build device id for a VMBus device
> x86/hyperv: Build logical device ids for PCI passthru hcalls
> x86/hyperv: Implement hyperv virtual iommu
> x86/hyperv: Basic interrupt support for direct attached devices
> mshv: Remove mapping of mmio space during map user ioctl
> mshv: Populate mmio mappings for PCI passthru
>
> MAINTAINERS | 1 +
> arch/arm64/include/asm/mshyperv.h | 15 +
> arch/x86/hyperv/irqdomain.c | 314 ++++++---
> arch/x86/include/asm/mshyperv.h | 21 +
> arch/x86/kernel/pci-dma.c | 2 +
> drivers/hv/Makefile | 3 +-
> drivers/hv/mshv_root.h | 24 +
> drivers/hv/mshv_root_main.c | 296 +++++++-
> drivers/hv/mshv_vfio.c | 210 ++++++
> drivers/iommu/Kconfig | 1 +
> drivers/iommu/Makefile | 2 +-
> drivers/iommu/hyperv-iommu.c | 1004 +++++++++++++++++++++------
> drivers/iommu/hyperv-irq.c | 330 +++++++++
> drivers/pci/controller/pci-hyperv.c | 207 ++++--
> include/asm-generic/mshyperv.h | 1 +
> include/hyperv/hvgdk_mini.h | 11 +
> include/hyperv/hvhdk_mini.h | 112 +++
> include/linux/hyperv.h | 6 +
> include/uapi/linux/mshv.h | 31 +
> 19 files changed, 2182 insertions(+), 409 deletions(-)
> create mode 100644 drivers/hv/mshv_vfio.c
> create mode 100644 drivers/iommu/hyperv-irq.c
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 09/15] mshv: Import data structs around device domains and irq remapping
2026-01-20 6:42 ` [PATCH v0 09/15] mshv: Import data structs around device domains and irq remapping Mukesh R
@ 2026-01-20 22:17 ` Stanislav Kinsburskii
2026-01-24 0:38 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-20 22:17 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:24PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Import/copy from Hyper-V public headers, definitions and declarations that
> are related to attaching and detaching of device domains and interrupt
> remapping, and building device ids for those purposes.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> include/hyperv/hvgdk_mini.h | 11 ++++
> include/hyperv/hvhdk_mini.h | 112 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 123 insertions(+)
>
<snip>
> +/* ID for stage 2 default domain and NULL domain */
> +#define HV_DEVICE_DOMAIN_ID_S2_DEFAULT 0
> +#define HV_DEVICE_DOMAIN_ID_S2_NULL 0xFFFFFFFFULL
> +
> +union hv_device_domain_id {
> + u64 as_uint64;
> + struct {
> + u32 type : 4;
> + u32 reserved : 28;
> + u32 id;
> + };
> +} __packed;
Shouldn't the inner struct be packed instead?
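For what it's worth, here is a quick userspace model (GCC attribute
syntax, C11 anonymous structs; the type and field names are stand-ins
for the ones in the patch) of what __packed actually changes here. The
inner struct is already padding-free, so packing the union only drops
its alignment to 1:

```c
#include <stdint.h>

/* Variant A: __packed on the union, as in the patch. */
union id_packed_union {
	uint64_t as_uint64;
	struct {
		uint32_t type : 4;
		uint32_t reserved : 28;
		uint32_t id;
	};
} __attribute__((packed));

/* Variant B: __packed on the inner struct only, as suggested above. */
union id_packed_inner {
	uint64_t as_uint64;
	struct {
		uint32_t type : 4;
		uint32_t reserved : 28;
		uint32_t id;
	} __attribute__((packed));
};

/* Both variants are 8 bytes with identical member offsets; only the
 * union's own alignment differs (1 vs 8). That distinction matters
 * once the union is embedded in another packed hypercall structure
 * or placed at an unaligned offset. */
```

So the byte layout is the same either way; the question is which
alignment guarantee the embedding structures rely on.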
> +
> +struct hv_input_device_domain { /* HV_INPUT_DEVICE_DOMAIN */
> + u64 partition_id;
> + union hv_input_vtl owner_vtl;
> + u8 padding[7];
> + union hv_device_domain_id domain_id;
> +} __packed;
> +
> +union hv_create_device_domain_flags { /* HV_CREATE_DEVICE_DOMAIN_FLAGS */
> + u32 as_uint32;
> + struct {
> + u32 forward_progress_required : 1;
> + u32 inherit_owning_vtl : 1;
> + u32 reserved : 30;
> + } __packed;
> +} __packed;
Why should the union be packed?
Thanks,
Stanislav
> +
> +struct hv_input_create_device_domain { /* HV_INPUT_CREATE_DEVICE_DOMAIN */
> + struct hv_input_device_domain device_domain;
> + union hv_create_device_domain_flags create_device_domain_flags;
> +} __packed;
> +
> +struct hv_input_delete_device_domain { /* HV_INPUT_DELETE_DEVICE_DOMAIN */
> + struct hv_input_device_domain device_domain;
> +} __packed;
> +
> +struct hv_input_attach_device_domain { /* HV_INPUT_ATTACH_DEVICE_DOMAIN */
> + struct hv_input_device_domain device_domain;
> + union hv_device_id device_id;
> +} __packed;
> +
> +struct hv_input_detach_device_domain { /* HV_INPUT_DETACH_DEVICE_DOMAIN */
> + u64 partition_id;
> + union hv_device_id device_id;
> +} __packed;
> +
> +struct hv_input_map_device_gpa_pages { /* HV_INPUT_MAP_DEVICE_GPA_PAGES */
> + struct hv_input_device_domain device_domain;
> + union hv_input_vtl target_vtl;
> + u8 padding[3];
> + u32 map_flags;
> + u64 target_device_va_base;
> + u64 gpa_page_list[];
> +} __packed;
> +
> +struct hv_input_unmap_device_gpa_pages { /* HV_INPUT_UNMAP_DEVICE_GPA_PAGES */
> + struct hv_input_device_domain device_domain;
> + u64 target_device_va_base;
> +} __packed;
> +
> #endif /* _HV_HVHDK_MINI_H */
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
2026-01-20 6:42 ` [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device Mukesh R
@ 2026-01-20 22:22 ` Stanislav Kinsburskii
2026-01-24 0:42 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-20 22:22 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:25PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device id as a parameter. This device id refers
> to that specific device during the lifetime of passthru.
>
> An L1VH VM only contains VMBus based devices. A device id for a VMBus
> device is slightly different in that it uses the hv_pcibus_device info
> for building it to make sure it matches exactly what the hypervisor
> expects. This VMBus based device id is needed when attaching devices in
> an L1VH based guest VM. Before building it, a check is done to make sure
> the device is a valid VMBus device.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> arch/x86/include/asm/mshyperv.h | 2 ++
> drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index eef4c3a5ba28..0d7fdfb25e76 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
> static inline void hv_apic_init(void) {}
> #endif
>
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> +
> struct irq_domain *hv_create_pci_msi_domain(void);
>
> int hv_map_msi_interrupt(struct irq_data *data,
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 8bc6a38c9b5a..40f0b06bb966 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
> #define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
> #define HV_MSI_CHIP_FLAGS MSI_CHIP_FLAG_SET_ACK
>
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> +
Why not move the static function definition above its caller instead of
adding a forward declaration?
> static int hv_pci_irqchip_init(void)
> {
> return 0;
> @@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>
> #define hv_msi_prepare pci_msi_prepare
>
> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> +{
> + u64 u64val;
This variable is redundant.
> + struct hv_pcibus_device *hbus;
> + struct pci_bus *pbus = pdev->bus;
> +
> + if (!hv_vmbus_pci_device(pbus))
> + return 0;
> +
> + hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> + u64val = (hbus->hdev->dev_instance.b[5] << 24) |
> + (hbus->hdev->dev_instance.b[4] << 16) |
> + (hbus->hdev->dev_instance.b[7] << 8) |
> + (hbus->hdev->dev_instance.b[6] & 0xf8) |
> + PCI_FUNC(pdev->devfn);
> +
It looks like this value always fits into 32 bits, so what is the point
of returning a 64-bit value?
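To illustrate (a userspace sketch; the byte composition is taken from
the patch, but the helper name compose_vmbus_devid is mine): even with
every input byte at 0xff the composed id tops out at 0xffffffff.
Casting each byte to a 32-bit type before shifting also keeps the
b[5] << 24 shift out of the sign bit of int after integer promotion:

```c
#include <stdint.h>

/* Sketch of the id composition in hv_pci_vmbus_device_id(); the
 * function name is hypothetical. The (uint32_t) casts avoid shifting
 * a promoted int into its sign bit when b[5] >= 0x80. */
static uint32_t compose_vmbus_devid(const uint8_t b[8], uint8_t devfn)
{
	return ((uint32_t)b[5] << 24) |
	       ((uint32_t)b[4] << 16) |
	       ((uint32_t)b[7] << 8)  |
	       (uint32_t)(b[6] & 0xf8) |
	       (devfn & 0x07);		/* PCI_FUNC(devfn) */
}
```

Since the maximum is 0xffffffff, a u32 return type would describe the
value exactly.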
Thanks,
Stanislav
> + return u64val;
> +}
> +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> +
> /**
> * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
> * affinity.
> @@ -1404,6 +1426,13 @@ static struct pci_ops hv_pcifront_ops = {
> .write = hv_pcifront_write_config,
> };
>
> +#ifdef CONFIG_X86
> +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> +{
> + return pbus->ops == &hv_pcifront_ops;
> +}
> +#endif /* CONFIG_X86 */
> +
> /*
> * Paravirtual backchannel
> *
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 11/15] x86/hyperv: Build logical device ids for PCI passthru hcalls
2026-01-20 6:42 ` [PATCH v0 11/15] x86/hyperv: Build logical device ids for PCI passthru hcalls Mukesh R
@ 2026-01-20 22:27 ` Stanislav Kinsburskii
2026-01-24 0:44 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-20 22:27 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:26PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> interrupts, etc need a device id as a parameter. A device id refers
> to a specific device. A device id is of two types:
> o Logical: used for direct attach (see below) hypercalls. A logical
> device id is a unique 62bit value that is created and
> sent during the initial device attach. Then all further
> communications (for interrupt remaps etc) must use this
> logical id.
> o PCI: used for device domain hypercalls such as map, unmap, etc.
> This is built using actual device BDF info.
>
> PS: Since an L1VH only supports direct attaches, a logical device id
> on an L1VH VM is always a VMBus device id. For non-L1VH cases,
> we just use PCI BDF info, although not strictly needed, to build the
> logical device id.
>
> At a high level, Hyper-V supports two ways to do PCI passthru:
> 1. Device Domain: root must create a device domain in the hypervisor,
> and do map/unmap hypercalls for mapping and unmapping guest RAM.
> All hypervisor communications use device id of type PCI for
> identifying and referencing the device.
>
> 2. Direct Attach: the hypervisor will simply use the guest's HW
> page table for mappings, thus the host need not do map/unmap
> hypercalls. A direct attached device must be referenced
> via logical device id and never via the PCI device id. For an
> L1VH root/parent, Hyper-V only supports direct attaches.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> arch/x86/hyperv/irqdomain.c | 60 ++++++++++++++++++++++++++++++---
> arch/x86/include/asm/mshyperv.h | 14 ++++++++
> 2 files changed, 70 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index ccbe5848a28f..33017aa0caa4 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -137,7 +137,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> return 0;
> }
>
> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> +static u64 hv_build_devid_type_pci(struct pci_dev *pdev)
> {
> int pos;
> union hv_device_id hv_devid;
> @@ -197,7 +197,58 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> }
>
> out:
> - return hv_devid;
> + return hv_devid.as_uint64;
> +}
> +
> +/* Build device id for direct attached devices */
> +static u64 hv_build_devid_type_logical(struct pci_dev *pdev)
> +{
> + hv_pci_segment segment;
> + union hv_device_id hv_devid;
> + union hv_pci_bdf bdf = {.as_uint16 = 0};
> + struct rid_data data = {
> + .bridge = NULL,
> + .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
> + };
> +
> + segment = pci_domain_nr(pdev->bus);
> + bdf.bus = PCI_BUS_NUM(data.rid);
> + bdf.device = PCI_SLOT(data.rid);
> + bdf.function = PCI_FUNC(data.rid);
> +
> + hv_devid.as_uint64 = 0;
> + hv_devid.device_type = HV_DEVICE_TYPE_LOGICAL;
> + hv_devid.logical.id = (u64)segment << 16 | bdf.as_uint16;
> +
> + return hv_devid.as_uint64;
> +}
> +
> +/* Build device id after the device has been attached */
> +u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type)
> +{
> + if (type == HV_DEVICE_TYPE_LOGICAL) {
> + if (hv_l1vh_partition())
> + return hv_pci_vmbus_device_id(pdev);
Should this one be renamed into hv_build_devid_type_vmbus() to align
with the other two function names?
Thanks,
Stanislav
> + else
> + return hv_build_devid_type_logical(pdev);
> + } else if (type == HV_DEVICE_TYPE_PCI)
> + return hv_build_devid_type_pci(pdev);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(hv_build_devid_oftype);
> +
> +/* Build device id for the interrupt path */
> +static u64 hv_build_irq_devid(struct pci_dev *pdev)
> +{
> + enum hv_device_type dev_type;
> +
> + if (hv_pcidev_is_attached_dev(pdev) || hv_l1vh_partition())
> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> + else
> + dev_type = HV_DEVICE_TYPE_PCI;
> +
> + return hv_build_devid_oftype(pdev, dev_type);
> }
>
> /*
> @@ -221,7 +272,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
>
> msidesc = irq_data_get_msi_desc(data);
> pdev = msi_desc_to_pci_dev(msidesc);
> - hv_devid = hv_build_devid_type_pci(pdev);
> + hv_devid.as_uint64 = hv_build_irq_devid(pdev);
> cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>
> return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
> @@ -296,7 +347,8 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> {
> union hv_device_id hv_devid;
>
> - hv_devid = hv_build_devid_type_pci(pdev);
> + hv_devid.as_uint64 = hv_build_irq_devid(pdev);
> +
> return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> }
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 0d7fdfb25e76..97477c5a8487 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -188,6 +188,20 @@ bool hv_vcpu_is_preempted(int vcpu);
> static inline void hv_apic_init(void) {}
> #endif
>
> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{ return false; } /* temporary */
> +u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
> +#else /* CONFIG_HYPERV_IOMMU */
> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{ return false; }
> +
> +static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
> + enum hv_device_type type)
> +{ return 0; }
> +
> +#endif /* CONFIG_HYPERV_IOMMU */
> +
> u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
>
> struct irq_domain *hv_create_pci_msi_domain(void);
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-20 6:42 ` [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu Mukesh R
@ 2026-01-21 0:12 ` Stanislav Kinsburskii
2026-01-24 1:26 ` Mukesh R
2026-01-22 5:18 ` Jacob Pan
1 sibling, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-21 0:12 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Add a new file to implement management of device domains, mapping and
> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> framework for PCI passthru on Hyper-V running Linux as root or L1VH
> parent. This also implements direct attach mechanism for PCI passthru,
> and it is also made to work within the VFIO framework.
>
> At a high level, during boot the hypervisor creates a default identity
> domain and attaches all devices to it. This nicely maps to Linux iommu
> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> during boot. As mentioned previously, Hyper-V supports two ways to do
> PCI passthru:
>
> 1. Device Domain: root must create a device domain in the hypervisor,
> and do map/unmap hypercalls for mapping and unmapping guest RAM.
> All hypervisor communications use device id of type PCI for
> identifying and referencing the device.
>
> 2. Direct Attach: the hypervisor will simply use the guest's HW
> page table for mappings, thus the host need not do map/unmap
> device memory hypercalls. As such, direct attach passthru setup
> during guest boot is extremely fast. A direct attached device
> must be referenced via logical device id and not via the PCI
> device id.
>
> At present, L1VH root/parent only supports direct attaches. Also direct
> attach is default in non-L1VH cases because there are some significant
> performance issues with device domain implementation currently for guests
> with higher RAM (say more than 8GB), and that unfortunately cannot be
> addressed in the short term.
>
<snip>
> +/*
> + * If the current thread is a VMM thread, return the partition id of the VM it
> + * is managing, else return HV_PARTITION_ID_INVALID.
> + */
> +u64 hv_iommu_get_curr_partid(void)
> +{
> + u64 (*fn)(pid_t pid);
> + u64 partid;
> +
> + fn = symbol_get(mshv_pid_to_partid);
> + if (!fn)
> + return HV_PARTITION_ID_INVALID;
> +
> + partid = fn(current->tgid);
> + symbol_put(mshv_pid_to_partid);
> +
> + return partid;
> +}
> +
> +/* If this is a VMM thread, then this domain is for a guest VM */
> +static bool hv_curr_thread_is_vmm(void)
> +{
> + return hv_iommu_get_curr_partid() != HV_PARTITION_ID_INVALID;
> +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> + switch (cap) {
> + case IOMMU_CAP_CACHE_COHERENCY:
> + return true;
> + default:
> + return false;
> + }
> + return false;
The return above is never reached.
> +}
> +
> +/*
> + * Check if given pci device is a direct attached device. Caller must have
> + * verified pdev is a valid pci device.
> + */
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{
> + struct iommu_domain *iommu_domain;
> + struct hv_domain *hvdom;
> + struct device *dev = &pdev->dev;
> +
> + iommu_domain = iommu_get_domain_for_dev(dev);
> + if (iommu_domain) {
> + hvdom = to_hv_domain(iommu_domain);
The hvdom variable is redundant.
> + return hvdom->attached_dom;
> + }
> +
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
> +
> +/* Create a new device domain in the hypervisor */
> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
> +{
> + u64 status;
> + unsigned long flags;
> + struct hv_input_device_domain *ddp;
> + struct hv_input_create_device_domain *input;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + ddp = &input->device_domain;
> + ddp->partition_id = HV_PARTITION_ID_SELF;
> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + ddp->domain_id.id = hvdom->domid_num;
> +
> + input->create_device_domain_flags.forward_progress_required = 1;
> + input->create_device_domain_flags.inherit_owning_vtl = 0;
> +
> + status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
> +
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return hv_result_to_errno(status);
> +}
> +
> +/* During boot, all devices are attached to this */
> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct device *dev)
> +{
> + return &hv_def_identity_dom.iommu_dom;
> +}
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> + struct hv_domain *hvdom;
> + int rc;
> +
> + if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() && !hv_no_attdev) {
> + pr_err("Hyper-V: l1vh iommu does not support host devices\n");
> + return NULL;
> + }
> +
> + hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> + if (hvdom == NULL)
> + goto out;
Why goto here and not return NULL like above?
> +
> + spin_lock_init(&hvdom->mappings_lock);
> + hvdom->mappings_tree = RB_ROOT_CACHED;
> +
> + if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /* ie, 0 */
> + goto out_free;
> +
> + hvdom->domid_num = unique_id;
> + hvdom->iommu_dom.geometry = default_geometry;
> + hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> +
> + /* For guests, by default we do direct attaches, so no domain in hyp */
> + if (hv_curr_thread_is_vmm() && !hv_no_attdev)
> + hvdom->attached_dom = true;
> + else {
> + rc = hv_iommu_create_hyp_devdom(hvdom);
> + if (rc)
> + goto out_free_id;
> + }
> +
> + return &hvdom->iommu_dom;
> +
> +out_free_id:
> + unique_id--;
> +out_free:
> + kfree(hvdom);
> +out:
> + return NULL;
> +}
> +
> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + unsigned long flags;
> + u64 status;
> + struct hv_input_delete_device_domain *input;
> +
> + if (hv_special_domain(hvdom))
> + return;
> +
> + if (hvdom->num_attchd) {
> + pr_err("Hyper-V: can't free busy iommu domain (%p)\n", immdom);
> + return;
> + }
> +
> + if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
> + struct hv_input_device_domain *ddp;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + ddp = &input->device_domain;
> + memset(input, 0, sizeof(*input));
> +
> + ddp->partition_id = HV_PARTITION_ID_SELF;
> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + ddp->domain_id.id = hvdom->domid_num;
> +
> + status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> + NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> + }
> +
> + kfree(hvdom);
> +}
> +
> +/* Attach a device to a domain previously created in the hypervisor */
> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
> +{
> + unsigned long flags;
> + u64 status;
> + enum hv_device_type dev_type;
> + struct hv_input_attach_device_domain *input;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> +
> + /* NB: Upon guest shutdown, device is re-attached to the default domain
> + * without explicit detach.
> + */
> + if (hv_l1vh_partition())
> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> + else
> + dev_type = HV_DEVICE_TYPE_PCI;
> +
> + input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return hv_result_to_errno(status);
> +}
> +
> +/* Caller must have validated that dev is a valid pci dev */
> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
> +{
> + struct hv_input_attach_device *input;
> + u64 status;
> + int rc;
> + unsigned long flags;
> + union hv_device_id host_devid;
> + enum hv_device_type dev_type;
> + u64 ptid = hv_iommu_get_curr_partid();
> +
> + if (ptid == HV_PARTITION_ID_INVALID) {
> + pr_err("Hyper-V: Invalid partition id in direct attach\n");
> + return -EINVAL;
> + }
> +
> + if (hv_l1vh_partition())
> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> + else
> + dev_type = HV_DEVICE_TYPE_PCI;
> +
> + host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> + do {
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> + input->partition_id = ptid;
> + input->device_id = host_devid;
> +
> + /* Hypervisor associates logical_id with this device, and in
> + * some hypercalls like retarget interrupts, logical_id must be
> + * used instead of the BDF. It is a required parameter.
> + */
> + input->attdev_flags.logical_id = 1;
> + input->logical_devid =
> + hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
> + local_irq_restore(flags);
> +
> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> + rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
> + if (rc)
> + break;
> + }
> + } while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return hv_result_to_errno(status);
> +}
> +
> +/* This to attach a device to both host app (like DPDK) and a guest VM */
> +static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
> + struct iommu_domain *old)
> +{
> + struct pci_dev *pdev;
> + int rc;
> + struct hv_domain *hvdom_new = to_hv_domain(immdom);
> + struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
> +
> + /* Only allow PCI devices for now */
> + if (!dev_is_pci(dev))
> + return -EINVAL;
> +
> + pdev = to_pci_dev(dev);
> +
> + /* l1vh does not support host device (eg DPDK) passthru */
> + if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
> + !hvdom_new->attached_dom)
> + return -EINVAL;
> +
> + /*
> + * VFIO does not do explicit detach calls, hence check if we need to
> + * detach first. Also, in case of guest shutdown, it's the VMM
> + * thread that attaches it back to the hv_def_identity_dom, and
> + * hvdom_prev will not be null then. It is null during boot.
> + */
> + if (hvdom_prev)
> + if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
> + hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
> +
> + if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
> + dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
> + return 0;
> + }
> +
> + if (hvdom_new->attached_dom)
> + rc = hv_iommu_direct_attach_device(pdev);
> + else
> + rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
> +
> + if (rc && hvdom_prev) {
> + int rc1;
> +
> + if (hvdom_prev->attached_dom)
> + rc1 = hv_iommu_direct_attach_device(pdev);
> + else
> + rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
> +
> + if (rc1)
> + pr_err("Hyper-V: iommu could not restore orig device state.. dev:%s\n",
> + dev_name(dev));
> + }
> +
> + if (rc == 0) {
> + dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
> + hvdom_new->num_attchd++;
> + }
> +
> + return rc;
> +}
> +
> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
> + struct pci_dev *pdev)
> +{
> + struct hv_input_detach_device *input;
> + u64 status, log_devid;
> + unsigned long flags;
> +
> + log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->partition_id = hv_iommu_get_curr_partid();
> + input->logical_devid = log_devid;
> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
> + struct pci_dev *pdev)
> +{
> + u64 status, devid;
> + unsigned long flags;
> + struct hv_input_detach_device_domain *input;
> +
> + devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->partition_id = HV_PARTITION_ID_SELF;
> + input->device_id.as_uint64 = devid;
> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> +{
> + struct pci_dev *pdev;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> + /* See the attach function, only PCI devices for now */
> + if (!dev_is_pci(dev))
> + return;
> +
> + if (hvdom->num_attchd == 0)
> + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> +
> + pdev = to_pci_dev(dev);
> +
> + if (hvdom->attached_dom) {
> + hv_iommu_det_dev_from_guest(hvdom, pdev);
> +
> + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
> + * next.
> + */
> + } else {
> + hv_iommu_det_dev_from_dom(hvdom, pdev);
> + }
> +
> + hvdom->num_attchd--;
Shouldn't this be modified iff the detach succeeded?
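Something along these lines (a minimal userspace model of the control
flow only; all names are stand-ins, not the patch's API) would keep the
counter honest:

```c
#include <stdint.h>

/* Model of the suggested fix: have the detach path report the
 * hypercall status so the attach count is decremented only when the
 * device was really detached. fake_detach_hcall() stands in for
 * hv_iommu_det_dev_from_guest()/hv_iommu_det_dev_from_dom()
 * returning an errno instead of void. */
struct dom {
	unsigned int num_attchd;
};

static int fake_detach_hcall(int fail)
{
	return fail ? -5 /* -EIO */ : 0;
}

static void detach_dev(struct dom *d, int hcall_fails)
{
	if (fake_detach_hcall(hcall_fails))
		return;		/* detach failed: device still attached */
	d->num_attchd--;	/* decrement only on confirmed success */
}
```

That would require the two det_dev helpers to propagate the hypercall
status rather than swallowing it.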
> +}
> +
> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> + unsigned long iova, phys_addr_t paddr,
> + size_t size, u32 flags)
> +{
> + unsigned long irqflags;
> + struct hv_iommu_mapping *mapping;
> +
> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> + if (!mapping)
> + return -ENOMEM;
> +
> + mapping->paddr = paddr;
> + mapping->iova.start = iova;
> + mapping->iova.last = iova + size - 1;
> + mapping->flags = flags;
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> +
> + return 0;
> +}
> +
> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> + unsigned long iova, size_t size)
> +{
> + unsigned long flags;
> + size_t unmapped = 0;
> + unsigned long last = iova + size - 1;
> + struct hv_iommu_mapping *mapping = NULL;
> + struct interval_tree_node *node, *next;
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> + while (next) {
> + node = next;
> + mapping = container_of(node, struct hv_iommu_mapping, iova);
> + next = interval_tree_iter_next(node, iova, last);
> +
> + /* Trying to split a mapping? Not supported for now. */
> + if (mapping->iova.start < iova)
> + break;
> +
> + unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> + interval_tree_remove(node, &hvdom->mappings_tree);
> + kfree(mapping);
> + }
> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> + return unmapped;
> +}
> +
> +/* Return: must return exact status from the hypercall without changes */
> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> + unsigned long iova, phys_addr_t paddr,
> + unsigned long npages, u32 map_flags)
> +{
> + u64 status;
> + int i;
> + struct hv_input_map_device_gpa_pages *input;
> + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> + input->map_flags = map_flags;
> + input->target_device_va_base = iova;
> +
> + pfn = paddr >> HV_HYP_PAGE_SHIFT;
> + for (i = 0; i < npages; i++, pfn++)
> + input->gpa_page_list[i] = pfn;
> +
> + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> + input, NULL);
> +
> + local_irq_restore(flags);
> + return status;
> +}
> +
> +/*
> + * The core VFIO code loops over memory ranges calling this function with
> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
> + */
> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> + phys_addr_t paddr, size_t pgsize, size_t pgcount,
> + int prot, gfp_t gfp, size_t *mapped)
> +{
> + u32 map_flags;
> + int ret;
> + u64 status;
> + unsigned long npages, done = 0;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + size_t size = pgsize * pgcount;
> +
> + map_flags = HV_MAP_GPA_READABLE; /* required */
> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> + if (ret)
> + return ret;
> +
> + if (hvdom->attached_dom) {
> + *mapped = size;
> + return 0;
> + }
> +
> + npages = size >> HV_HYP_PAGE_SHIFT;
> + while (done < npages) {
> + ulong completed, remain = npages - done;
> +
> + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> + map_flags);
> +
> + completed = hv_repcomp(status);
> + done = done + completed;
> + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> +
> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> + ret = hv_call_deposit_pages(NUMA_NO_NODE,
> + hv_current_partition_id,
> + 256);
> + if (ret)
> + break;
> + }
> + if (!hv_result_success(status))
> + break;
> + }
> +
> + if (!hv_result_success(status)) {
> + size_t done_size = done << HV_HYP_PAGE_SHIFT;
> +
> + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> + done, npages, iova);
> + /*
> + * lookup tree has all mappings [0 - size-1]. Below unmap will
> + * only remove from [0 - done], we need to remove second chunk
> + * [done+1 - size-1].
> + */
> + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
> + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
> + done, NULL);
> + if (mapped)
> + *mapped = 0;
> + } else
> + if (mapped)
> + *mapped = size;
> +
> + return hv_result_to_errno(status);
> +}
> +
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> + size_t pgsize, size_t pgcount,
> + struct iommu_iotlb_gather *gather)
> +{
> + unsigned long flags, npages;
> + struct hv_input_unmap_device_gpa_pages *input;
> + u64 status;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + size_t unmapped, size = pgsize * pgcount;
> +
> + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> + if (unmapped < size)
> + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
> + __func__, iova, unmapped, size);
> +
> + if (hvdom->attached_dom)
> + return size;
> +
> + npages = size >> HV_HYP_PAGE_SHIFT;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> + input->target_device_va_base = iova;
> +
> + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> + 0, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
There are some inconsistencies in the naming and behaviour of these paired
functions:
1. The counterpart of hv_iommu_unmap_pages is called hv_iommu_map_pgs.
2. hv_iommu_map_pgs doesn't print the status on error.
It would be much better to keep this code consistent.
> + return unmapped;
> +}
> +
> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> + dma_addr_t iova)
> +{
> + u64 paddr = 0;
> + unsigned long flags;
> + struct hv_iommu_mapping *mapping;
> + struct interval_tree_node *node;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
> + if (node) {
> + mapping = container_of(node, struct hv_iommu_mapping, iova);
> + paddr = mapping->paddr + (iova - mapping->iova.start);
> + }
> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> + return paddr;
> +}
> +
> +/*
> + * Currently, hypervisor does not provide list of devices it is using
> + * dynamically. So use this to allow users to manually specify devices that
> + * should be skipped. (eg. hypervisor debugger using some network device).
> + */
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> + if (!dev_is_pci(dev))
> + return ERR_PTR(-ENODEV);
> +
> + if (pci_devs_to_skip && *pci_devs_to_skip) {
> + int rc, pos = 0;
> + int parsed;
> + int segment, bus, slot, func;
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + do {
> + parsed = 0;
> +
> + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> + &segment, &bus, &slot, &func, &parsed);
> + if (rc)
> + break;
> + if (parsed <= 0)
> + break;
> +
> + if (pci_domain_nr(pdev->bus) == segment &&
> + pdev->bus->number == bus &&
> + PCI_SLOT(pdev->devfn) == slot &&
> + PCI_FUNC(pdev->devfn) == func) {
> +
> + dev_info(dev, "skipped by Hyper-V IOMMU\n");
> + return ERR_PTR(-ENODEV);
> + }
> + pos += parsed;
> +
> + } while (pci_devs_to_skip[pos]);
> + }
> +
> + /* Device will be explicitly attached to the default domain, so no need
> + * to do dev_iommu_priv_set() here.
> + */
> +
> + return &hv_virt_iommu;
> +}
> +
> +static void hv_iommu_probe_finalize(struct device *dev)
> +{
> + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> +
> + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> + iommu_setup_dma_ops(dev);
> + else
> + set_dma_ops(dev, NULL);
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> +
> + /* Need to detach device from device domain if necessary. */
> + if (hvdom)
> + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> +
> + dev_iommu_priv_set(dev, NULL);
> + set_dma_ops(dev, NULL);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> + if (dev_is_pci(dev))
> + return pci_device_group(dev);
> + else
> + return generic_device_group(dev);
> +}
> +
> +static int hv_iommu_def_domain_type(struct device *dev)
> +{
> + /* The hypervisor always creates this by default during boot */
> + return IOMMU_DOMAIN_IDENTITY;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> + .capable = hv_iommu_capable,
> + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
> + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
> + .probe_device = hv_iommu_probe_device,
> + .probe_finalize = hv_iommu_probe_finalize,
> + .release_device = hv_iommu_release_device,
> + .def_domain_type = hv_iommu_def_domain_type,
> + .device_group = hv_iommu_device_group,
> + .default_domain_ops = &(const struct iommu_domain_ops) {
> + .attach_dev = hv_iommu_attach_dev,
> + .map_pages = hv_iommu_map_pages,
> + .unmap_pages = hv_iommu_unmap_pages,
> + .iova_to_phys = hv_iommu_iova_to_phys,
> + .free = hv_iommu_domain_free,
> + },
> + .owner = THIS_MODULE,
> +};
> +
> +static void __init hv_initialize_special_domains(void)
> +{
> + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
hv_def_identity_dom is a static global variable.
Why not initialize hv_def_identity_dom at its definition instead of
introducing a new function?
> +}
> +
> +static int __init hv_iommu_init(void)
> +{
> + int ret;
> + struct iommu_device *iommup = &hv_virt_iommu;
> +
> + if (!hv_is_hyperv_initialized())
> + return -ENODEV;
> +
> + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
> + if (ret) {
> + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
> + return ret;
> + }
> +
> + /* This must come before iommu_device_register because the latter calls
> + * into the hooks.
> + */
> + hv_initialize_special_domains();
> +
> + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
It looks weird to initialize an object after creating sysfs entries for
it.
It should be the other way around.
Thanks,
Stanislav
> + if (ret) {
> + pr_err("Hyper-V: iommu_device_register failed: %d\n", ret);
> + goto err_sysfs_remove;
> + }
> +
> + pr_info("Hyper-V IOMMU initialized\n");
> +
> + return 0;
> +
> +err_sysfs_remove:
> + iommu_device_sysfs_remove(iommup);
> + return ret;
> +}
> +
> +void __init hv_iommu_detect(void)
> +{
> + if (no_iommu || iommu_detected)
> + return;
> +
> + /* For l1vh, always expose an iommu unit */
> + if (!hv_l1vh_partition())
> + if (!(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
> + return;
> +
> + iommu_detected = 1;
> + x86_init.iommu.iommu_init = hv_iommu_init;
> +
> + pci_request_acs();
> +}
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index dfc516c1c719..2ad111727e82 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1767,4 +1767,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
> #define HVPFN_DOWN(x) ((x) >> HV_HYP_PAGE_SHIFT)
> #define page_to_hvpfn(page) (page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
>
> +#ifdef CONFIG_HYPERV_IOMMU
> +void __init hv_iommu_detect(void);
> +#else
> +static inline void hv_iommu_detect(void) { }
> +#endif /* CONFIG_HYPERV_IOMMU */
> +
> #endif /* _HYPERV_H */
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 13/15] x86/hyperv: Basic interrupt support for direct attached devices
2026-01-20 6:42 ` [PATCH v0 13/15] x86/hyperv: Basic interrupt support for direct attached devices Mukesh R
@ 2026-01-21 0:47 ` Stanislav Kinsburskii
2026-01-24 2:08 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-21 0:47 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:28PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> As mentioned previously, a direct attached device must be referenced
> via logical device id which is formed in the initial attach hypercall.
> Interrupt mapping paths for direct attached devices are almost the same,
> except that we must use logical device ids instead of PCI device ids.
>
> L1VH only supports direct attaches for passing thru devices to its guests,
> and devices on L1VH are VMBus based. However, the interrupts are mapped
> via the map interrupt hypercall and not the traditional method of VMBus
> messages.
>
> The partition id for the relevant hypercalls is tricky. This is because
> a device could be moving from root to guest and then back to the root.
> In case of L1VH, it could be moving from the system host to the L1VH root
> to a guest, then back to the L1VH root. So, it is carefully crafted by
> keeping track of whether the call is on behalf of a VMM process, whether
> the device is an attached device (as opposed to mapped), and whether we
> are in an L1VH root/parent. If it is a VMM process, we assume it is on
> behalf of a guest. Otherwise, the device is being attached or detached
> during boot or shutdown of the privileged partition.
>
> Lastly, a dummy cpu and vector are used to map the interrupt for a direct
> attached device. This is because, once a device is marked for direct
> attach, the hypervisor will not let any interrupts be mapped to the host.
> So it is mapped to a dummy guest cpu and dummy vector, and then correctly
> remapped during guest boot via the retarget paths.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> arch/arm64/include/asm/mshyperv.h | 15 +++++
> arch/x86/hyperv/irqdomain.c | 57 +++++++++++++-----
> arch/x86/include/asm/mshyperv.h | 4 ++
> drivers/pci/controller/pci-hyperv.c | 91 +++++++++++++++++++++++++----
> 4 files changed, 142 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
> index b721d3134ab6..27da480f94f6 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -53,6 +53,21 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
> return hv_get_msr(reg);
> }
>
> +struct irq_data;
> +struct msi_msg;
> +struct pci_dev;
> +static inline void hv_irq_compose_msi_msg(struct irq_data *data,
> + struct msi_msg *msg) {};
> +static inline int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> + struct hv_interrupt_entry *hvirqe)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{
> + return false;
> +}
> +
> /* SMCCC hypercall parameters */
> #define HV_SMCCC_FUNC_NUMBER 1
> #define HV_FUNC_ID ARM_SMCCC_CALL_VAL( \
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 33017aa0caa4..e6eb457f791e 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -13,6 +13,16 @@
> #include <linux/irqchip/irq-msi-lib.h>
> #include <asm/mshyperv.h>
>
> +/*
> + * For direct attached devices (which use logical device ids), hypervisor will
> + * not allow mappings to host. But VFIO needs to bind the interrupt at the very
> + * start before the guest cpu/vector is known. So we use dummy cpu and vector
> + * to bind in such case, and later when the guest starts, retarget will move it
> + * to correct guest cpu and vector.
> + */
> +#define HV_DDA_DUMMY_CPU 0
> +#define HV_DDA_DUMMY_VECTOR 32
> +
> static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
> bool level, int cpu, int vector,
> struct hv_interrupt_entry *ret_entry)
> @@ -24,6 +34,11 @@ static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
> u64 status;
> int nr_bank, var_size;
>
> + if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL) {
> + cpu = HV_DDA_DUMMY_CPU;
> + vector = HV_DDA_DUMMY_VECTOR;
> + }
> +
> local_irq_save(flags);
>
> input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -95,7 +110,8 @@ static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
> return hv_result_to_errno(status);
> }
>
> -static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
> +static int hv_unmap_interrupt(union hv_device_id hv_devid,
> + struct hv_interrupt_entry *irq_entry)
> {
> unsigned long flags;
> struct hv_input_unmap_device_interrupt *input;
> @@ -103,10 +119,14 @@ static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
>
> local_irq_save(flags);
> input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> -
> memset(input, 0, sizeof(*input));
> - input->partition_id = hv_current_partition_id;
> - input->device_id = id;
> +
> + if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL)
> + input->partition_id = hv_iommu_get_curr_partid();
> + else
> + input->partition_id = hv_current_partition_id;
> +
> + input->device_id = hv_devid.as_uint64;
> input->interrupt_entry = *irq_entry;
>
> status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
> @@ -263,6 +283,7 @@ static u64 hv_build_irq_devid(struct pci_dev *pdev)
> int hv_map_msi_interrupt(struct irq_data *data,
> struct hv_interrupt_entry *out_entry)
> {
> + u64 ptid;
> struct irq_cfg *cfg = irqd_cfg(data);
> struct hv_interrupt_entry dummy;
> union hv_device_id hv_devid;
> @@ -275,8 +296,17 @@ int hv_map_msi_interrupt(struct irq_data *data,
> hv_devid.as_uint64 = hv_build_irq_devid(pdev);
> cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>
> - return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
> - cfg->vector, out_entry ? out_entry : &dummy);
> + if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL)
> + if (hv_pcidev_is_attached_dev(pdev))
> + ptid = hv_iommu_get_curr_partid();
> + else
> + /* Device actually on l1vh root, not passthru'd to vm */
l1vh and root are mutually exclusive partitions.
If you wanted to highlight that it's l1vh itself and not its child guest,
then the term "l1vh parent" would do.
> + ptid = hv_current_partition_id;
> + else
> + ptid = hv_current_partition_id;
It looks like the only special case is for attached logical devices;
otherwise hv_current_partition_id is used.
Can the logic be simplified here?
Thanks,
Stanislav
> +
> + return hv_map_interrupt(ptid, hv_devid, false, cpu, cfg->vector,
> + out_entry ? out_entry : &dummy);
> }
> EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
>
> @@ -289,10 +319,7 @@ static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
> msg->data = entry->msi_entry.data.as_uint32;
> }
>
> -static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> - struct hv_interrupt_entry *irq_entry);
> -
> -static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> +void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> {
> struct hv_interrupt_entry *stored_entry;
> struct irq_cfg *cfg = irqd_cfg(data);
> @@ -341,16 +368,18 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> data->chip_data = stored_entry;
> entry_to_msi_msg(data->chip_data, msg);
> }
> +EXPORT_SYMBOL_GPL(hv_irq_compose_msi_msg);
>
> -static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> - struct hv_interrupt_entry *irq_entry)
> +int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> + struct hv_interrupt_entry *irq_entry)
> {
> union hv_device_id hv_devid;
>
> hv_devid.as_uint64 = hv_build_irq_devid(pdev);
>
> - return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> + return hv_unmap_interrupt(hv_devid, irq_entry);
> }
> +EXPORT_SYMBOL_GPL(hv_unmap_msi_interrupt);
>
> /* NB: during map, hv_interrupt_entry is saved via data->chip_data */
> static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
> @@ -486,7 +515,7 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
> hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
> hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
>
> - return hv_unmap_interrupt(hv_devid.as_uint64, entry);
> + return hv_unmap_interrupt(hv_devid, entry);
> }
> EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index e4ccdbbf1d12..b6facd3a0f5e 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -204,11 +204,15 @@ static inline u64 hv_iommu_get_curr_partid(void)
> #endif /* CONFIG_HYPERV_IOMMU */
>
> u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> +void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg);
> +extern bool hv_no_attdev;
>
> struct irq_domain *hv_create_pci_msi_domain(void);
>
> int hv_map_msi_interrupt(struct irq_data *data,
> struct hv_interrupt_entry *out_entry);
> +int hv_unmap_msi_interrupt(struct pci_dev *dev,
> + struct hv_interrupt_entry *hvirqe);
> int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
> struct hv_interrupt_entry *entry);
> int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 40f0b06bb966..71d1599dc4a8 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -660,15 +660,17 @@ static void hv_irq_retarget_interrupt(struct irq_data *data)
>
> params = *this_cpu_ptr(hyperv_pcpu_input_arg);
> memset(params, 0, sizeof(*params));
> - params->partition_id = HV_PARTITION_ID_SELF;
> +
> + if (hv_pcidev_is_attached_dev(pdev))
> + params->partition_id = hv_iommu_get_curr_partid();
> + else
> + params->partition_id = HV_PARTITION_ID_SELF;
> +
> params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
> - params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff;
> + params->int_entry.msi_entry.address.as_uint32 =
> + int_desc->address & 0xffffffff;
> params->int_entry.msi_entry.data.as_uint32 = int_desc->data;
> - params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
> - (hbus->hdev->dev_instance.b[4] << 16) |
> - (hbus->hdev->dev_instance.b[7] << 8) |
> - (hbus->hdev->dev_instance.b[6] & 0xf8) |
> - PCI_FUNC(pdev->devfn);
> + params->device_id = hv_pci_vmbus_device_id(pdev);
> params->int_target.vector = hv_msi_get_int_vector(data);
>
> if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) {
> @@ -1263,6 +1265,15 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
> mb();
> }
> spin_unlock_irqrestore(&hbus->config_lock, flags);
> + /*
> + * Make sure PCI_INTERRUPT_PIN is hard-wired to 0 since it may
> + * be read using a 32bit read which is skipped by the above
> + * emulation.
> + */
> + if (PCI_INTERRUPT_PIN >= where &&
> + PCI_INTERRUPT_PIN <= (where + size)) {
> + *((char *)val + PCI_INTERRUPT_PIN - where) = 0;
> + }
> } else {
> dev_err(dev, "Attempt to read beyond a function's config space.\n");
> }
> @@ -1731,14 +1742,22 @@ static void hv_msi_free(struct irq_domain *domain, unsigned int irq)
> if (!int_desc)
> return;
>
> - irq_data->chip_data = NULL;
> hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
> if (!hpdev) {
> + irq_data->chip_data = NULL;
> kfree(int_desc);
> return;
> }
>
> - hv_int_desc_free(hpdev, int_desc);
> + if (hv_pcidev_is_attached_dev(pdev)) {
> + hv_unmap_msi_interrupt(pdev, irq_data->chip_data);
> + kfree(irq_data->chip_data);
> + irq_data->chip_data = NULL;
> + } else {
> + irq_data->chip_data = NULL;
> + hv_int_desc_free(hpdev, int_desc);
> + }
> +
> put_pcichild(hpdev);
> }
>
> @@ -2139,6 +2158,56 @@ static void hv_vmbus_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> msg->data = 0;
> }
>
> +/* Compose an msi message for a directly attached device */
> +static void hv_dda_compose_msi_msg(struct irq_data *irq_data,
> + struct msi_desc *msi_desc,
> + struct msi_msg *msg)
> +{
> + bool multi_msi;
> + struct hv_pcibus_device *hbus;
> + struct hv_pci_dev *hpdev;
> + struct pci_dev *pdev = msi_desc_to_pci_dev(msi_desc);
> +
> + multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> + msi_desc->nvec_used > 1;
> +
> + if (multi_msi) {
> + dev_err(&hbus->hdev->device,
> + "Passthru direct attach does not support multi msi\n");
> + goto outerr;
> + }
> +
> + hbus = container_of(pdev->bus->sysdata, struct hv_pcibus_device,
> + sysdata);
> +
> + hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
> + if (!hpdev)
> + goto outerr;
> +
> + /* will unmap if needed and also update irq_data->chip_data */
> + hv_irq_compose_msi_msg(irq_data, msg);
> +
> + put_pcichild(hpdev);
> + return;
> +
> +outerr:
> + memset(msg, 0, sizeof(*msg));
> +}
> +
> +static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> +{
> + struct pci_dev *pdev;
> + struct msi_desc *msi_desc;
> +
> + msi_desc = irq_data_get_msi_desc(data);
> + pdev = msi_desc_to_pci_dev(msi_desc);
> +
> + if (hv_pcidev_is_attached_dev(pdev))
> + hv_dda_compose_msi_msg(data, msi_desc, msg);
> + else
> + hv_vmbus_compose_msi_msg(data, msg);
> +}
> +
> static bool hv_pcie_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> struct irq_domain *real_parent, struct msi_domain_info *info)
> {
> @@ -2177,7 +2246,7 @@ static const struct msi_parent_ops hv_pcie_msi_parent_ops = {
> /* HW Interrupt Chip Descriptor */
> static struct irq_chip hv_msi_irq_chip = {
> .name = "Hyper-V PCIe MSI",
> - .irq_compose_msi_msg = hv_vmbus_compose_msi_msg,
> + .irq_compose_msi_msg = hv_compose_msi_msg,
> .irq_set_affinity = irq_chip_set_affinity_parent,
> .irq_ack = irq_chip_ack_parent,
> .irq_eoi = irq_chip_eoi_parent,
> @@ -4096,7 +4165,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> irq_data = irq_get_irq_data(entry->irq);
> if (WARN_ON_ONCE(!irq_data))
> return -EINVAL;
> - hv_vmbus_compose_msi_msg(irq_data, &entry->msg);
> + hv_compose_msi_msg(irq_data, &entry->msg);
> }
> return 0;
> }
> --
> 2.51.2.vfs.0.1
>
* Re: [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c
2026-01-20 6:42 ` [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c Mukesh R
@ 2026-01-21 0:53 ` kernel test robot
0 siblings, 0 replies; 69+ messages in thread
From: kernel test robot @ 2026-01-21 0:53 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
Hi Mukesh,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on pci/next pci/for-linus arm64/for-next/core clk/clk-next soc/for-next linus/master v6.19-rc6 next-20260120]
[cannot apply to arnd-asm-generic/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Mukesh-R/iommu-hyperv-rename-hyperv-iommu-c-to-hyperv-irq-c/20260120-145832
base: tip/x86/core
patch link: https://lore.kernel.org/r/20260120064230.3602565-4-mrathor%40linux.microsoft.com
patch subject: [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c
config: i386-randconfig-053-20260120 (https://download.01.org/0day-ci/archive/20260121/202601210731.f1WLdgcO-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202601210731.f1WLdgcO-lkp@intel.com/
cocci warnings: (new ones prefixed by >>)
>> arch/x86/hyperv/irqdomain.c:90:2-3: Unneeded semicolon
vim +90 arch/x86/hyperv/irqdomain.c
72
73 static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
74 int cpu, int vector,
75 struct hv_interrupt_entry *ret_entry)
76 {
77 u64 status;
78 int rc, deposit_pgs = 16; /* don't loop forever */
79
80 while (deposit_pgs--) {
81 status = hv_map_interrupt_hcall(ptid, device_id, level, cpu,
82 vector, ret_entry);
83
84 if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
85 break;
86
87 rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
88 if (rc)
89 break;
> 90 };
91
92 if (!hv_result_success(status))
93 hv_status_err(status, "\n");
94
95 return hv_result_to_errno(status);
96 }
97
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl
2026-01-20 6:42 ` [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl Mukesh R
@ 2026-01-21 1:41 ` Stanislav Kinsburskii
2026-01-23 18:34 ` Nuno Das Neves
1 sibling, 0 replies; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-21 1:41 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:29PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> VFIO no longer puts the mmio pfn in vma->vm_pgoff, so remove the code
> that uses it to map mmio space. It is broken and will cause a panic.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 20 ++++----------------
> 1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 27313419828d..03f3aa9f5541 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1258,16 +1258,8 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> }
>
> /*
> - * This maps two things: guest RAM and for pci passthru mmio space.
> - *
> - * mmio:
> - * - vfio overloads vm_pgoff to store the mmio start pfn/spa.
> - * - Two things need to happen for mapping mmio range:
> - * 1. mapped in the uaddr so VMM can access it.
> - * 2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
> - *
> - * This function takes care of the second. The first one is managed by vfio,
> - * and hence is taken care of via vfio_pci_mmap_fault().
> + * This is called for both user ram and mmio space. The mmio space is not
> + * mapped here, but later during intercept.
> */
> static long
> mshv_map_user_memory(struct mshv_partition *partition,
> @@ -1276,7 +1268,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
> struct mshv_mem_region *region;
> struct vm_area_struct *vma;
> bool is_mmio;
> - ulong mmio_pfn;
> long ret;
>
> if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
> @@ -1286,7 +1277,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
> mmap_read_lock(current->mm);
> vma = vma_lookup(current->mm, mem.userspace_addr);
> is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> - mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
> mmap_read_unlock(current->mm);
>
> if (!vma)
> @@ -1313,10 +1303,8 @@ mshv_map_user_memory(struct mshv_partition *partition,
> HV_MAP_GPA_NO_ACCESS, NULL);
> break;
> case MSHV_REGION_TYPE_MMIO:
> - ret = hv_call_map_mmio_pages(partition->pt_id,
> - region->start_gfn,
> - mmio_pfn,
> - region->nr_pages);
> + /* mmio mappings are handled later during intercepts */
> + ret = 0;
No need to update ret here: it is already 0 from the previous call.
Thanks,
Stanislav
> break;
> }
>
> --
> 2.51.2.vfs.0.1
>
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-20 6:42 ` [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru Mukesh R
2026-01-20 19:52 ` kernel test robot
@ 2026-01-21 1:53 ` Stanislav Kinsburskii
2026-01-24 2:19 ` Mukesh R
1 sibling, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-21 1:53 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Upon guest access, in case of missing mmio mapping, the hypervisor
> generates an unmapped gpa intercept. In this path, look up the PCI
> resource pfn for the guest gpa and ask the hypervisor to map it via a
> hypercall. The PCI resource pfn is maintained by the VFIO driver and
> obtained via a fixup_user_fault() call (similar to KVM).
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 115 insertions(+)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 03f3aa9f5541..4c8bc7cd0888 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -56,6 +56,14 @@ struct hv_stats_page {
> };
> } __packed;
>
> +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> +static int __init setup_hv_full_mmio(char *str)
> +{
> + hv_nofull_mmio = true;
> + return 0;
> +}
> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> +
> struct mshv_root mshv_root;
>
> enum hv_scheduler_type hv_scheduler_type;
> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> }
>
> #ifdef CONFIG_X86_64
> +
> +/*
> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> + * else just return -errno.
> + */
> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> + u64 *mmio_pfnp)
> +{
> + struct vm_area_struct *vma;
> + bool is_mmio;
> + u64 uaddr;
> + struct mshv_mem_region *mreg;
> + struct follow_pfnmap_args pfnmap_args;
> + int rc = -EINVAL;
> +
> + /*
> + * Do not allow mem region to be deleted beneath us. VFIO uses
> + * useraddr vma to lookup pci bar pfn.
> + */
> + spin_lock(&pt->pt_mem_regions_lock);
> +
> + /* Get the region again under the lock */
> + mreg = mshv_partition_region_by_gfn(pt, gfn);
> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> + goto unlock_pt_out;
> +
> + uaddr = mreg->start_uaddr +
> + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> +
> + mmap_read_lock(current->mm);
The mmap semaphore can't be taken under a spinlock.
Acquire it before the spinlock instead.
> + vma = vma_lookup(current->mm, uaddr);
> + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
Why is this check needed again?
The region type is stored in the region itself, and it is already
checked on the caller side.
> + if (!is_mmio)
> + goto unlock_mmap_out;
> +
> + pfnmap_args.vma = vma;
> + pfnmap_args.address = uaddr;
> +
> + rc = follow_pfnmap_start(&pfnmap_args);
> + if (rc) {
> + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> + NULL);
> + if (rc)
> + goto unlock_mmap_out;
> +
> + rc = follow_pfnmap_start(&pfnmap_args);
> + if (rc)
> + goto unlock_mmap_out;
> + }
> +
> + *mmio_pfnp = pfnmap_args.pfn;
> + follow_pfnmap_end(&pfnmap_args);
> +
> +unlock_mmap_out:
> + mmap_read_unlock(current->mm);
> +unlock_pt_out:
> + spin_unlock(&pt->pt_mem_regions_lock);
> + return rc;
> +}
> +
> +/*
> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> + * and resolve if possible.
> + * Returns: True if valid mmio intercept and it was handled, else false
> + */
> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> +{
> + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> + struct hv_x64_memory_intercept_message *msg;
> + union hv_x64_memory_access_info accinfo;
> + u64 gfn, mmio_spa, numpgs;
> + struct mshv_mem_region *mreg;
> + int rc;
> + struct mshv_partition *pt = vp->vp_partition;
> +
> + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> + accinfo = msg->memory_access_info;
> +
> + if (!accinfo.gva_gpa_valid)
> + return false;
> +
> + /* Do a fast check and bail if non mmio intercept */
> + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> + mreg = mshv_partition_region_by_gfn(pt, gfn);
This call needs to be protected by the spinlock.
Thanks,
Stanislav
> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> + return false;
> +
> + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> + if (rc)
> + return false;
> +
> + if (!hv_nofull_mmio) { /* default case */
> + gfn = mreg->start_gfn;
> + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> + numpgs = mreg->nr_pages;
> + } else
> + numpgs = 1;
> +
> + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> +
> + return rc == 0;
> +}
> +
> static struct mshv_mem_region *
> mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> {
> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>
> return ret;
> }
> +
> #else /* CONFIG_X86_64 */
> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> #endif /* CONFIG_X86_64 */
>
> static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> {
> switch (vp->vp_intercept_msg_page->header.message_type) {
> + case HVMSG_UNMAPPED_GPA:
> + return mshv_handle_unmapped_gpa(vp);
> case HVMSG_GPA_INTERCEPT:
> return mshv_handle_gpa_intercept(vp);
> }
> --
> 2.51.2.vfs.0.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-20 6:42 ` [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu Mukesh R
2026-01-21 0:12 ` Stanislav Kinsburskii
@ 2026-01-22 5:18 ` Jacob Pan
2026-01-24 2:01 ` Mukesh R
1 sibling, 1 reply; 69+ messages in thread
From: Jacob Pan @ 2026-01-22 5:18 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, Jacob Pan
Hi Mukesh,
On Mon, 19 Jan 2026 22:42:27 -0800
Mukesh R <mrathor@linux.microsoft.com> wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Add a new file to implement management of device domains, mapping and
> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> framework for PCI passthru on Hyper-V running Linux as root or L1VH
> parent. This also implements direct attach mechanism for PCI passthru,
> and it is also made to work within the VFIO framework.
>
> At a high level, during boot the hypervisor creates a default identity
> domain and attaches all devices to it. This nicely maps to Linux iommu
> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> during boot. As mentioned previously, Hyper-V supports two ways to do
> PCI passthru:
>
> 1. Device Domain: root must create a device domain in the
> hypervisor, and do map/unmap hypercalls for mapping and unmapping
> guest RAM. All hypervisor communications use device id of type PCI for
> identifying and referencing the device.
>
> 2. Direct Attach: the hypervisor will simply use the guest's HW
> page table for mappings, thus the host need not do map/unmap
> device memory hypercalls. As such, direct attach passthru setup
> during guest boot is extremely fast. A direct attached device
> must be referenced via logical device id and not via the PCI
> device id.
>
> At present, L1VH root/parent only supports direct attaches. Also
> direct attach is default in non-L1VH cases because there are some
> significant performance issues with device domain implementation
> currently for guests with higher RAM (say more than 8GB), and that
> unfortunately cannot be addressed in the short term.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> MAINTAINERS | 1 +
> arch/x86/include/asm/mshyperv.h | 7 +-
> arch/x86/kernel/pci-dma.c | 2 +
> drivers/iommu/Makefile | 2 +-
> drivers/iommu/hyperv-iommu.c | 876 ++++++++++++++++++++++++++++++++
> include/linux/hyperv.h | 6 +
> 6 files changed, 890 insertions(+), 4 deletions(-)
> create mode 100644 drivers/iommu/hyperv-iommu.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 381a0e086382..63160cee942c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11741,6 +11741,7 @@ F: drivers/hid/hid-hyperv.c
> F: drivers/hv/
> F: drivers/infiniband/hw/mana/
> F: drivers/input/serio/hyperv-keyboard.c
> +F: drivers/iommu/hyperv-iommu.c
Given that we are also developing a guest iommu driver on Hyper-V, I think
it would be clearer to name them accordingly. Perhaps hyperv-iommu-root.c?
> F: drivers/iommu/hyperv-irq.c
> F: drivers/net/ethernet/microsoft/
> F: drivers/net/hyperv/
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 97477c5a8487..e4ccdbbf1d12 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
> #endif
>
> #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> -{ return false; } /* temporary */
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
> u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
> +u64 hv_iommu_get_curr_partid(void);
> #else /* CONFIG_HYPERV_IOMMU */
> static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> { return false; }
> -
> static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
> enum hv_device_type type)
> { return 0; }
> +static inline u64 hv_iommu_get_curr_partid(void)
> +{ return HV_PARTITION_ID_INVALID; }
>
> #endif /* CONFIG_HYPERV_IOMMU */
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 6267363e0189..cfeee6505e17 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -8,6 +8,7 @@
> #include <linux/gfp.h>
> #include <linux/pci.h>
> #include <linux/amd-iommu.h>
> +#include <linux/hyperv.h>
>
> #include <asm/proto.h>
> #include <asm/dma.h>
> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
> gart_iommu_hole_init();
> amd_iommu_detect();
> detect_intel_iommu();
> + hv_iommu_detect();
Will this driver be x86 only?
> swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
> }
>
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 598c39558e7d..cc9774864b00 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
> obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
DMA and IRQ remapping should be separate
> obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> new file mode 100644
> index 000000000000..548483fec6b1
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -0,0 +1,876 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Hyper-V root vIOMMU driver.
> + * Copyright (C) 2026, Microsoft, Inc.
> + */
> +
> +#include <linux/module.h>
I don't think this is needed since this driver cannot be a module
> +#include <linux/pci.h>
> +#include <linux/dmar.h>
This should not depend on Intel's DMAR.
> +#include <linux/dma-map-ops.h>
> +#include <linux/interval_tree.h>
> +#include <linux/hyperv.h>
> +#include "dma-iommu.h"
> +#include <asm/iommu.h>
> +#include <asm/mshyperv.h>
> +
> +/* We will not claim these PCI devices, eg hypervisor needs it for debugger */
> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> + pci_devs_to_skip = str;
> +
> + return 0;
> +}
> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev;	/* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> + hv_no_attdev = true;
> + return 0;
> +}
> +__setup("hv_no_attdev", setup_hv_no_attdev);
> +
> +/* Iommu device that we export to the world. HyperV supports max of one */
> +static struct iommu_device hv_virt_iommu;
> +
> +struct hv_domain {
> + struct iommu_domain iommu_dom;
> +	u32 domid_num;			/* as opposed to domain_id.type */
> +	u32 num_attchd;			/* number of currently attached devices */
rename to num_dev_attached?
> +	bool attached_dom;		/* is this direct attached dom? */
> +	spinlock_t mappings_lock;	/* protects mappings_tree */
> +	struct rb_root_cached mappings_tree; /* iova to pa lookup tree */
> +};
> +
> +#define to_hv_domain(d) container_of(d, struct hv_domain, iommu_dom)
> +
> +struct hv_iommu_mapping {
> + phys_addr_t paddr;
> + struct interval_tree_node iova;
> + u32 flags;
> +};
> +
> +/*
> + * By default, during boot the hypervisor creates one Stage 2 (S2) default
> + * domain. Stage 2 means that the page table is controlled by the hypervisor.
> + * S2 default: access to entire root partition memory. This for us easily
> + *             maps to IOMMU_DOMAIN_IDENTITY in the iommu subsystem, and
> + *             is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the hypervisor.
> + *
> + * Device Management:
> + * There are two ways to manage device attaches to domains:
> + *   1. Domain Attach: A device domain is created in the hypervisor, the
> + *                     device is attached to this domain, and then memory
> + *                     ranges are mapped in the map callbacks.
> + *   2. Direct Attach: No need to create a domain in the hypervisor for
> + *                     direct attached devices. A hypercall is made to tell
> + *                     the hypervisor to attach the device to a guest. There
> + *                     is no need for explicit memory mappings because the
> + *                     hypervisor will just use the guest HW page table.
> + *
> + * Since a direct attach is much faster, it is the default. This can be
> + * changed via hv_no_attdev.
> + *
> + * L1VH: hypervisor only supports direct attach.
> + */
> +
> +/*
> + * Create dummy domain to correspond to hypervisor prebuilt default
> identity
> + * domain (dummy because we do not make hypercall to create them).
> + */
> +static struct hv_domain hv_def_identity_dom;
> +
> +static bool hv_special_domain(struct hv_domain *hvdom)
> +{
> + return hvdom == &hv_def_identity_dom;
> +}
> +
> +struct iommu_domain_geometry default_geometry = (struct iommu_domain_geometry) {
> + .aperture_start = 0,
> + .aperture_end = -1UL,
> + .force_aperture = true,
> +};
> +
> +/*
> + * Since the relevant hypercalls can only fit less than 512 PFNs in the pfn
> + * array, report 1M max.
> + */
> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
> +
> +static u32 unique_id;	/* unique numeric id of a new domain */
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
> +				struct device *dev);
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather);
> +
> +/*
> + * If the current thread is a VMM thread, return the partition id of the VM it
> + * is managing, else return HV_PARTITION_ID_INVALID.
> + */
> +u64 hv_iommu_get_curr_partid(void)
> +{
> + u64 (*fn)(pid_t pid);
> + u64 partid;
> +
> + fn = symbol_get(mshv_pid_to_partid);
> + if (!fn)
> + return HV_PARTITION_ID_INVALID;
> +
> + partid = fn(current->tgid);
> + symbol_put(mshv_pid_to_partid);
> +
> + return partid;
> +}
This function is not iommu specific. Maybe move it to mshv code?
> +
> +/* If this is a VMM thread, then this domain is for a guest VM */
> +static bool hv_curr_thread_is_vmm(void)
> +{
> + return hv_iommu_get_curr_partid() != HV_PARTITION_ID_INVALID;
> +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> + switch (cap) {
> + case IOMMU_CAP_CACHE_COHERENCY:
> + return true;
> + default:
> + return false;
> + }
> + return false;
> +}
> +
> +/*
> + * Check if given pci device is a direct attached device. Caller must have
> + * verified pdev is a valid pci device.
> + */
> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> +{
> + struct iommu_domain *iommu_domain;
> + struct hv_domain *hvdom;
> + struct device *dev = &pdev->dev;
> +
> + iommu_domain = iommu_get_domain_for_dev(dev);
> + if (iommu_domain) {
> + hvdom = to_hv_domain(iommu_domain);
> + return hvdom->attached_dom;
> + }
> +
> + return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
The attached domain can change at any time; what guarantee does the caller have?
> +
> +/* Create a new device domain in the hypervisor */
> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
> +{
> + u64 status;
> + unsigned long flags;
> + struct hv_input_device_domain *ddp;
> + struct hv_input_create_device_domain *input;
nit: use consistent coding style, inverse Christmas tree.
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + ddp = &input->device_domain;
> + ddp->partition_id = HV_PARTITION_ID_SELF;
> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + ddp->domain_id.id = hvdom->domid_num;
> +
> +	input->create_device_domain_flags.forward_progress_required = 1;
> + input->create_device_domain_flags.inherit_owning_vtl = 0;
> +
> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
> +
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return hv_result_to_errno(status);
> +}
> +
> +/* During boot, all devices are attached to this */
> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct device *dev)
> +{
> + return &hv_def_identity_dom.iommu_dom;
> +}
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> + struct hv_domain *hvdom;
> + int rc;
> +
> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() && !hv_no_attdev) {
> +		pr_err("Hyper-V: l1vh iommu does not support host devices\n");
Why is this an error if the user chose not to do direct attach?
> + return NULL;
> + }
> +
> + hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> + if (hvdom == NULL)
> + goto out;
> +
> + spin_lock_init(&hvdom->mappings_lock);
> + hvdom->mappings_tree = RB_ROOT_CACHED;
> +
> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /* ie, 0 */
This is true only when unique_id wraps around, right? Then this driver
stops working? Can you use an IDR for the unique_id and free it on detach
instead of doing this cyclic allocation?
> + goto out_free;
> +
> + hvdom->domid_num = unique_id;
> + hvdom->iommu_dom.geometry = default_geometry;
> + hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> +
> +	/* For guests, by default we do direct attaches, so no domain in hyp */
> + if (hv_curr_thread_is_vmm() && !hv_no_attdev)
> + hvdom->attached_dom = true;
> + else {
> + rc = hv_iommu_create_hyp_devdom(hvdom);
> + if (rc)
> + goto out_free_id;
> + }
> +
> + return &hvdom->iommu_dom;
> +
> +out_free_id:
> + unique_id--;
> +out_free:
> + kfree(hvdom);
> +out:
> + return NULL;
> +}
> +
> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + unsigned long flags;
> + u64 status;
> + struct hv_input_delete_device_domain *input;
> +
> + if (hv_special_domain(hvdom))
> + return;
> +
> + if (hvdom->num_attchd) {
> +		pr_err("Hyper-V: can't free busy iommu domain (%p)\n", immdom);
> + return;
> + }
> +
> + if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
> + struct hv_input_device_domain *ddp;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + ddp = &input->device_domain;
> + memset(input, 0, sizeof(*input));
> +
> + ddp->partition_id = HV_PARTITION_ID_SELF;
> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + ddp->domain_id.id = hvdom->domid_num;
> +
> +		status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> +					 NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> + }
you could free the domid here, no?
> +
> + kfree(hvdom);
> +}
> +
> +/* Attach a device to a domain previously created in the hypervisor */
> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
> +{
> + unsigned long flags;
> + u64 status;
> + enum hv_device_type dev_type;
> + struct hv_input_attach_device_domain *input;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> +
> +	/* NB: Upon guest shutdown, device is re-attached to the default domain
> + * without explicit detach.
> + */
> + if (hv_l1vh_partition())
> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> + else
> + dev_type = HV_DEVICE_TYPE_PCI;
> +
> +	input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return hv_result_to_errno(status);
> +}
> +
> +/* Caller must have validated that dev is a valid pci dev */
> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
> +{
> + struct hv_input_attach_device *input;
> + u64 status;
> + int rc;
> + unsigned long flags;
> + union hv_device_id host_devid;
> + enum hv_device_type dev_type;
> + u64 ptid = hv_iommu_get_curr_partid();
> +
> + if (ptid == HV_PARTITION_ID_INVALID) {
> +		pr_err("Hyper-V: Invalid partition id in direct attach\n");
> + return -EINVAL;
> + }
> +
> + if (hv_l1vh_partition())
> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> + else
> + dev_type = HV_DEVICE_TYPE_PCI;
> +
> + host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
> +
> + do {
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> + input->partition_id = ptid;
> + input->device_id = host_devid;
> +
> +		/* Hypervisor associates logical_id with this device, and in
> +		 * some hypercalls like retarget interrupts, logical_id must be
> +		 * used instead of the BDF. It is a required parameter.
> +		 */
> +		input->attdev_flags.logical_id = 1;
> +		input->logical_devid =
> +			hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> +		status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
> + local_irq_restore(flags);
> +
> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
> + if (rc)
> + break;
> + }
> + } while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return hv_result_to_errno(status);
> +}
> +
> +/* This to attach a device to both host app (like DPDK) and a guest VM */
The IOMMU driver should be agnostic to the type of consumer, whether a
userspace driver or a VM. This comment is not necessary.
> +static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
> + struct iommu_domain *old)
This does not match the upstream kernel prototype; which kernel version is
this based on? I will stop here for now.
struct iommu_domain_ops {
	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
> +{
> + struct pci_dev *pdev;
> + int rc;
> + struct hv_domain *hvdom_new = to_hv_domain(immdom);
> + struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
> +
> + /* Only allow PCI devices for now */
> + if (!dev_is_pci(dev))
> + return -EINVAL;
> +
> + pdev = to_pci_dev(dev);
> +
> + /* l1vh does not support host device (eg DPDK) passthru */
> + if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
> + !hvdom_new->attached_dom)
> + return -EINVAL;
> +
> + /*
> +	 * VFIO does not do explicit detach calls, hence check first if we need
> +	 * to detach first. Also, in case of guest shutdown, it's the VMM
> +	 * thread that attaches it back to the hv_def_identity_dom, and
> +	 * hvdom_prev will not be null then. It is null during boot.
> +	 */
> +	if (hvdom_prev)
> +		if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
> +			hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
> +
> + if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
> +		dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
> + return 0;
> + }
> +
> + if (hvdom_new->attached_dom)
> + rc = hv_iommu_direct_attach_device(pdev);
> + else
> + rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
> +
> + if (rc && hvdom_prev) {
> + int rc1;
> +
> + if (hvdom_prev->attached_dom)
> + rc1 = hv_iommu_direct_attach_device(pdev);
> + else
> + rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
> +
> + if (rc1)
> +			pr_err("Hyper-V: iommu could not restore orig device state.. dev:%s\n",
> +			       dev_name(dev));
> + }
> +
> + if (rc == 0) {
> +		dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
> + hvdom_new->num_attchd++;
> + }
> +
> + return rc;
> +}
> +
> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
> + struct pci_dev *pdev)
> +{
> + struct hv_input_detach_device *input;
> + u64 status, log_devid;
> + unsigned long flags;
> +
> +	log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->partition_id = hv_iommu_get_curr_partid();
> + input->logical_devid = log_devid;
> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
> + struct pci_dev *pdev)
> +{
> + u64 status, devid;
> + unsigned long flags;
> + struct hv_input_detach_device_domain *input;
> +
> + devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->partition_id = HV_PARTITION_ID_SELF;
> + input->device_id.as_uint64 = devid;
> +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> +{
> + struct pci_dev *pdev;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> + /* See the attach function, only PCI devices for now */
> + if (!dev_is_pci(dev))
> + return;
> +
> + if (hvdom->num_attchd == 0)
> +		pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> +
> + pdev = to_pci_dev(dev);
> +
> + if (hvdom->attached_dom) {
> + hv_iommu_det_dev_from_guest(hvdom, pdev);
> +
> +		/* Do not reset attached_dom, hv_iommu_unmap_pages happens
> + * next.
> + */
> + } else {
> + hv_iommu_det_dev_from_dom(hvdom, pdev);
> + }
> +
> + hvdom->num_attchd--;
> +}
> +
> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> +				     unsigned long iova, phys_addr_t paddr,
> +				     size_t size, u32 flags)
> +{
> + unsigned long irqflags;
> + struct hv_iommu_mapping *mapping;
> +
> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> + if (!mapping)
> + return -ENOMEM;
> +
> + mapping->paddr = paddr;
> + mapping->iova.start = iova;
> + mapping->iova.last = iova + size - 1;
> + mapping->flags = flags;
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> +
> + return 0;
> +}
> +
> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> +					 unsigned long iova, size_t size)
> +{
> + unsigned long flags;
> + size_t unmapped = 0;
> + unsigned long last = iova + size - 1;
> + struct hv_iommu_mapping *mapping = NULL;
> + struct interval_tree_node *node, *next;
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> + while (next) {
> + node = next;
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> + next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now. */
> + if (mapping->iova.start < iova)
> + break;
> +
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> + interval_tree_remove(node, &hvdom->mappings_tree);
> + kfree(mapping);
> + }
> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> + return unmapped;
> +}
> +
> +/* Return: must return exact status from the hypercall without changes */
> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> + unsigned long iova, phys_addr_t paddr,
> + unsigned long npages, u32 map_flags)
> +{
> + u64 status;
> + int i;
> + struct hv_input_map_device_gpa_pages *input;
> + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> + input->map_flags = map_flags;
> + input->target_device_va_base = iova;
> +
> + pfn = paddr >> HV_HYP_PAGE_SHIFT;
> + for (i = 0; i < npages; i++, pfn++)
> + input->gpa_page_list[i] = pfn;
> +
> +	status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> +				     input, NULL);
> +
> + local_irq_restore(flags);
> + return status;
> +}
> +
> +/*
> + * The core VFIO code loops over memory ranges calling this function with
> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in
> + * vfio_iommu_map.
> + */
> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> +			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
> +			      int prot, gfp_t gfp, size_t *mapped)
> +{
> + u32 map_flags;
> + int ret;
> + u64 status;
> + unsigned long npages, done = 0;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + size_t size = pgsize * pgcount;
> +
> + map_flags = HV_MAP_GPA_READABLE; /* required */
> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> + if (ret)
> + return ret;
> +
> + if (hvdom->attached_dom) {
> + *mapped = size;
> + return 0;
> + }
> +
> + npages = size >> HV_HYP_PAGE_SHIFT;
> + while (done < npages) {
> + ulong completed, remain = npages - done;
> +
> +		status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> +					  map_flags);
> +
> + completed = hv_repcomp(status);
> + done = done + completed;
> + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> +
> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    hv_current_partition_id,
> +						    256);
> + if (ret)
> + break;
> + }
> + if (!hv_result_success(status))
> + break;
> + }
> +
> + if (!hv_result_success(status)) {
> + size_t done_size = done << HV_HYP_PAGE_SHIFT;
> +
> + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> + done, npages, iova);
> + /*
> +		 * lookup tree has all mappings [0 - size-1]. Below unmap will
> +		 * only remove from [0 - done], we need to remove second chunk
> +		 * [done+1 - size-1].
> +		 */
> +		hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
> +		hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
> +				     done, NULL);
> + if (mapped)
> + *mapped = 0;
> + } else
> + if (mapped)
> + *mapped = size;
> +
> + return hv_result_to_errno(status);
> +}
> +
> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> + size_t pgsize, size_t pgcount,
> + struct iommu_iotlb_gather *gather)
> +{
> + unsigned long flags, npages;
> + struct hv_input_unmap_device_gpa_pages *input;
> + u64 status;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> + size_t unmapped, size = pgsize * pgcount;
> +
> + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> + if (unmapped < size)
> +		pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
> + __func__, iova, unmapped, size);
> +
> + if (hvdom->attached_dom)
> + return size;
> +
> + npages = size >> HV_HYP_PAGE_SHIFT;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> + input->device_domain.domain_id.id = hvdom->domid_num;
> + input->target_device_va_base = iova;
> +
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> +				     0, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + hv_status_err(status, "\n");
> +
> + return unmapped;
> +}
> +
> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> + dma_addr_t iova)
> +{
> + u64 paddr = 0;
> + unsigned long flags;
> + struct hv_iommu_mapping *mapping;
> + struct interval_tree_node *node;
> + struct hv_domain *hvdom = to_hv_domain(immdom);
> +
> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
> +	if (node) {
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		paddr = mapping->paddr + (iova - mapping->iova.start);
> + }
> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> +
> + return paddr;
> +}
> +
> +/*
> + * Currently, hypervisor does not provide list of devices it is using
> + * dynamically. So use this to allow users to manually specify devices that
> + * should be skipped. (eg. hypervisor debugger using some network device).
> + */
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> + if (!dev_is_pci(dev))
> + return ERR_PTR(-ENODEV);
> +
> + if (pci_devs_to_skip && *pci_devs_to_skip) {
> + int rc, pos = 0;
> + int parsed;
> + int segment, bus, slot, func;
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + do {
> + parsed = 0;
> +
> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> +				    &segment, &bus, &slot, &func, &parsed);
> + if (rc)
> + break;
> + if (parsed <= 0)
> + break;
> +
> + if (pci_domain_nr(pdev->bus) == segment &&
> + pdev->bus->number == bus &&
> + PCI_SLOT(pdev->devfn) == slot &&
> + PCI_FUNC(pdev->devfn) == func) {
> +
> +				dev_info(dev, "skipped by Hyper-V IOMMU\n");
> + return ERR_PTR(-ENODEV);
> + }
> + pos += parsed;
> +
> + } while (pci_devs_to_skip[pos]);
> + }
> +
> + /* Device will be explicitly attached to the default domain, so no need
> + * to do dev_iommu_priv_set() here.
> + */
> +
> + return &hv_virt_iommu;
> +}
> +
> +static void hv_iommu_probe_finalize(struct device *dev)
> +{
> + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> +
> + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> + iommu_setup_dma_ops(dev);
> + else
> + set_dma_ops(dev, NULL);
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> +
> + /* Need to detach device from device domain if necessary. */
> + if (hvdom)
> + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> +
> + dev_iommu_priv_set(dev, NULL);
> + set_dma_ops(dev, NULL);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> + if (dev_is_pci(dev))
> + return pci_device_group(dev);
> + else
> + return generic_device_group(dev);
> +}
> +
> +static int hv_iommu_def_domain_type(struct device *dev)
> +{
> + /* The hypervisor always creates this by default during boot */
> + return IOMMU_DOMAIN_IDENTITY;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> + .capable = hv_iommu_capable,
> + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
> + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
> + .probe_device = hv_iommu_probe_device,
> + .probe_finalize = hv_iommu_probe_finalize,
> + .release_device = hv_iommu_release_device,
> + .def_domain_type = hv_iommu_def_domain_type,
> + .device_group = hv_iommu_device_group,
> + .default_domain_ops = &(const struct iommu_domain_ops) {
> + .attach_dev = hv_iommu_attach_dev,
> + .map_pages = hv_iommu_map_pages,
> + .unmap_pages = hv_iommu_unmap_pages,
> + .iova_to_phys = hv_iommu_iova_to_phys,
> + .free = hv_iommu_domain_free,
> + },
> + .owner = THIS_MODULE,
> +};
> +
> +static void __init hv_initialize_special_domains(void)
> +{
> + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
> +}
This could be initialized statically.
> +
> +static int __init hv_iommu_init(void)
> +{
> + int ret;
> + struct iommu_device *iommup = &hv_virt_iommu;
> +
> + if (!hv_is_hyperv_initialized())
> + return -ENODEV;
> +
> + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
> + if (ret) {
> + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
> + return ret;
> + }
> +
> + /* This must come before iommu_device_register because the latter calls
> + * into the hooks.
> + */
> + hv_initialize_special_domains();
> +
> + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> + if (ret) {
> + pr_err("Hyper-V: iommu_device_register failed: %d\n", ret);
> + goto err_sysfs_remove;
> + }
> +
> + pr_info("Hyper-V IOMMU initialized\n");
> +
> + return 0;
> +
> +err_sysfs_remove:
> + iommu_device_sysfs_remove(iommup);
> + return ret;
> +}
> +
> +void __init hv_iommu_detect(void)
> +{
> + if (no_iommu || iommu_detected)
> + return;
> +
> + /* For l1vh, always expose an iommu unit */
> + if (!hv_l1vh_partition())
> + if (!(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
> + return;
> +
> + iommu_detected = 1;
> + x86_init.iommu.iommu_init = hv_iommu_init;
> +
> + pci_request_acs();
> +}
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index dfc516c1c719..2ad111727e82 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1767,4 +1767,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
> #define HVPFN_DOWN(x) ((x) >> HV_HYP_PAGE_SHIFT)
> #define page_to_hvpfn(page) (page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
> +#ifdef CONFIG_HYPERV_IOMMU
> +void __init hv_iommu_detect(void);
> +#else
> +static inline void hv_iommu_detect(void) { }
> +#endif /* CONFIG_HYPERV_IOMMU */
> +
> #endif /* _HYPERV_H */
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 04/15] mshv: Provide a way to get partition id if running in a VMM process
2026-01-20 6:42 ` [PATCH v0 04/15] mshv: Provide a way to get partition id if running in a VMM process Mukesh R
@ 2026-01-23 18:23 ` Nuno Das Neves
0 siblings, 0 replies; 69+ messages in thread
From: Nuno Das Neves @ 2026-01-23 18:23 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/19/2026 10:42 PM, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Many PCI passthru related hypercalls require the partition id of the
> target guest. Guests are managed by the MSHV driver and the partition id
> is only maintained there. Add a field in the partition struct in the MSHV
> driver to save the tgid of the VMM process creating the partition, and
> add a function there to retrieve the partition id given a valid VMM tgid.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root.h | 1 +
> drivers/hv/mshv_root_main.c | 35 +++++++++++++++++++++++++++-------
> include/asm-generic/mshyperv.h | 1 +
> 3 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index 3c1d88b36741..c3753b009fd8 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -134,6 +134,7 @@ struct mshv_partition {
>
> struct mshv_girq_routing_table __rcu *pt_girq_tbl;
> u64 isolation_type;
> + pid_t pt_vmm_tgid;
> bool import_completed;
> bool pt_initialized;
> };
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 1134a82c7881..83c7bad269a0 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1823,6 +1823,20 @@ mshv_partition_release(struct inode *inode, struct file *filp)
> return 0;
> }
>
> +/* Given a process tgid, return partition id if it is a VMM process */
> +u64 mshv_pid_to_partid(pid_t tgid)
> +{
> + struct mshv_partition *pt;
> + int i;
> +
> + hash_for_each_rcu(mshv_root.pt_htable, i, pt, pt_hnode)
> + if (pt->pt_vmm_tgid == tgid)
> + return pt->pt_id;
> +
> + return HV_PARTITION_ID_INVALID;
> +}
> +EXPORT_SYMBOL_GPL(mshv_pid_to_partid);
> +
> static int
> add_partition(struct mshv_partition *partition)
> {
> @@ -1987,13 +2001,20 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
> goto delete_partition;
>
> ret = mshv_init_async_handler(partition);
> - if (!ret) {
> - ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
> - &mshv_partition_fops,
> - partition, O_RDWR));
> - if (ret >= 0)
> - return ret;
> - }
> + if (ret)
> + goto rem_partition;
> +
> + ret = FD_ADD(O_CLOEXEC, anon_inode_getfile("mshv_partition",
> + &mshv_partition_fops,
> + partition, O_RDWR));
> + if (ret < 0)
> + goto rem_partition;
> +
> + partition->pt_vmm_tgid = current->tgid;
> +
> + return ret;
> +
> +rem_partition:
> remove_partition(partition);
> delete_partition:
> hv_call_delete_partition(partition->pt_id);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index ecedab554c80..e46a38916e76 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -211,6 +211,7 @@ void __init ms_hyperv_late_init(void);
> int hv_common_cpu_init(unsigned int cpu);
> int hv_common_cpu_die(unsigned int cpu);
> void hv_identify_partition_type(void);
> +u64 mshv_pid_to_partid(pid_t tgid);
This should go inside the #if IS_ENABLED(CONFIG_MSHV_ROOT) section.
>
> /**
> * hv_cpu_number_to_vp_number() - Map CPU to VP.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device
2026-01-20 6:42 ` [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device Mukesh R
@ 2026-01-23 18:25 ` Nuno Das Neves
2026-01-24 0:36 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Nuno Das Neves @ 2026-01-23 18:25 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/19/2026 10:42 PM, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Add data structs needed by the subsequent patch that introduces a new
> module to implement VFIO-MSHV pseudo device.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root.h | 23 +++++++++++++++++++++++
> include/uapi/linux/mshv.h | 31 +++++++++++++++++++++++++++++++
> 2 files changed, 54 insertions(+)
>
> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> index c3753b009fd8..42e1da1d545b 100644
> --- a/drivers/hv/mshv_root.h
> +++ b/drivers/hv/mshv_root.h
> @@ -220,6 +220,29 @@ struct port_table_info {
> };
> };
>
> +struct mshv_device {
> + const struct mshv_device_ops *device_ops;
> + struct mshv_partition *device_pt;
> + void *device_private;
> + struct hlist_node device_ptnode;
> +};
> +
> +struct mshv_device_ops {
> + const char *device_name;
> + long (*device_create)(struct mshv_device *dev, u32 type);
> + void (*device_release)(struct mshv_device *dev);
> + long (*device_set_attr)(struct mshv_device *dev,
> + struct mshv_device_attr *attr);
> + long (*device_has_attr)(struct mshv_device *dev,
> + struct mshv_device_attr *attr);
> +};
> +
> +extern struct mshv_device_ops mshv_vfio_device_ops;
> +int mshv_vfio_ops_init(void);
> +void mshv_vfio_ops_exit(void);
> +long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
> + void __user *user_args);
> +
> int mshv_update_routing_table(struct mshv_partition *partition,
> const struct mshv_user_irq_entry *entries,
> unsigned int numents);
> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
> index dee3ece28ce5..b7b10f9e2896 100644
> --- a/include/uapi/linux/mshv.h
> +++ b/include/uapi/linux/mshv.h
> @@ -252,6 +252,7 @@ struct mshv_root_hvcall {
> #define MSHV_GET_GPAP_ACCESS_BITMAP _IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
> /* Generic hypercall */
> #define MSHV_ROOT_HVCALL _IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
> +#define MSHV_CREATE_DEVICE _IOWR(MSHV_IOCTL, 0x08, struct mshv_create_device)
>
With this commit, the IOCTL number is exposed to userspace but it doesn't work.
Ideally the IOCTL number should be added in the commit where it becomes usable.
> /*
> ********************************
> @@ -402,4 +403,34 @@ struct mshv_sint_mask {
> /* hv_hvcall device */
> #define MSHV_HVCALL_SETUP _IOW(MSHV_IOCTL, 0x1E, struct mshv_vtl_hvcall_setup)
> #define MSHV_HVCALL _IOWR(MSHV_IOCTL, 0x1F, struct mshv_vtl_hvcall)
> +
> +/* device passthru */
> +#define MSHV_CREATE_DEVICE_TEST 1
> +
> +enum {
> + MSHV_DEV_TYPE_VFIO,
> + MSHV_DEV_TYPE_MAX,
> +};
> +
> +struct mshv_create_device {
> + __u32 type; /* in: MSHV_DEV_TYPE_xxx */
> + __u32 fd; /* out: device handle */
> + __u32 flags; /* in: MSHV_CREATE_DEVICE_xxx */
> +};
> +
> +#define MSHV_DEV_VFIO_FILE 1
> +#define MSHV_DEV_VFIO_FILE_ADD 1
> +#define MSHV_DEV_VFIO_FILE_DEL 2
> +
> +struct mshv_device_attr {
> + __u32 flags; /* no flags currently defined */
> + __u32 group; /* device-defined */
> + __u64 attr; /* group-defined */
> + __u64 addr; /* userspace address of attr data */
> +};
> +
> +/* Device fds created with MSHV_CREATE_DEVICE */
> +#define MSHV_SET_DEVICE_ATTR _IOW(MSHV_IOCTL, 0x00, struct mshv_device_attr)
> +#define MSHV_HAS_DEVICE_ATTR _IOW(MSHV_IOCTL, 0x01, struct mshv_device_attr)
> +
> #endif
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO
2026-01-20 6:42 ` [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO Mukesh R
2026-01-20 16:09 ` Stanislav Kinsburskii
@ 2026-01-23 18:32 ` Nuno Das Neves
2026-01-24 0:37 ` Mukesh R
1 sibling, 1 reply; 69+ messages in thread
From: Nuno Das Neves @ 2026-01-23 18:32 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/19/2026 10:42 PM, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Add a new file to implement the VFIO-MSHV bridge pseudo device. These
> functions are called from the VFIO framework; credits to kvm/vfio.c,
> from which this file was adapted.
>
> Original author: Wei Liu <wei.liu@kernel.org>
> (Slightly modified from the original version).
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Since the code is very similar to Wei's original commit, the way I'd
recommend to do it is:
1. Change the commit author to Wei, using git commit --amend --author=
and
2. Put his signed-off line before yours:
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
This shows he is the author of the commit but you ported it.
If you feel you changed it enough that it should be considered
co-authored, you can instead keep your authorship of the commit and
put:
Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/Makefile | 3 +-
> drivers/hv/mshv_vfio.c | 210 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 212 insertions(+), 1 deletion(-)
> create mode 100644 drivers/hv/mshv_vfio.c
>
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index a49f93c2d245..eae003c4cb8f 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
> hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
> hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
> mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
> - mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
> + mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
> + mshv_vfio.o
> mshv_vtl-y := mshv_vtl_main.o
>
> # Code that must be built-in
> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
> new file mode 100644
> index 000000000000..6ea4d99a3bd2
> --- /dev/null
> +++ b/drivers/hv/mshv_vfio.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VFIO-MSHV bridge pseudo device
> + *
> + * Heavily inspired by the VFIO-KVM bridge pseudo device.
> + */
> +#include <linux/errno.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/vfio.h>
> +
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +struct mshv_vfio_file {
> + struct list_head node;
> + struct file *file; /* list of struct mshv_vfio_file */
> +};
> +
> +struct mshv_vfio {
> + struct list_head file_list;
> + struct mutex lock;
> +};
> +
> +static bool mshv_vfio_file_is_valid(struct file *file)
> +{
> + bool (*fn)(struct file *file);
> + bool ret;
> +
> + fn = symbol_get(vfio_file_is_valid);
> + if (!fn)
> + return false;
> +
> + ret = fn(file);
> +
> + symbol_put(vfio_file_is_valid);
> +
> + return ret;
> +}
> +
> +static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
> +{
> + struct mshv_vfio *mshv_vfio = mshvdev->device_private;
> + struct mshv_vfio_file *mvf;
> + struct file *filp;
> + long ret = 0;
> +
> + filp = fget(fd);
> + if (!filp)
> + return -EBADF;
> +
> + /* Ensure the FD is a vfio FD. */
> + if (!mshv_vfio_file_is_valid(filp)) {
> + ret = -EINVAL;
> + goto out_fput;
> + }
> +
> + mutex_lock(&mshv_vfio->lock);
> +
> + list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
> + if (mvf->file == filp) {
> + ret = -EEXIST;
> + goto out_unlock;
> + }
> + }
> +
> + mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
> + if (!mvf) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + mvf->file = get_file(filp);
> + list_add_tail(&mvf->node, &mshv_vfio->file_list);
> +
> +out_unlock:
> + mutex_unlock(&mshv_vfio->lock);
> +out_fput:
> + fput(filp);
> + return ret;
> +}
> +
> +static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
> +{
> + struct mshv_vfio *mshv_vfio = mshvdev->device_private;
> + struct mshv_vfio_file *mvf;
> + long ret;
> +
> + CLASS(fd, f)(fd);
> +
> + if (fd_empty(f))
> + return -EBADF;
> +
> + ret = -ENOENT;
> + mutex_lock(&mshv_vfio->lock);
> +
> + list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
> + if (mvf->file != fd_file(f))
> + continue;
> +
> + list_del(&mvf->node);
> + fput(mvf->file);
> + kfree(mvf);
> + ret = 0;
> + break;
> + }
> +
> + mutex_unlock(&mshv_vfio->lock);
> + return ret;
> +}
> +
> +static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
> + void __user *arg)
> +{
> + int32_t __user *argp = arg;
> + int32_t fd;
> +
> + switch (attr) {
> + case MSHV_DEV_VFIO_FILE_ADD:
> + if (get_user(fd, argp))
> + return -EFAULT;
> + return mshv_vfio_file_add(mshvdev, fd);
> +
> + case MSHV_DEV_VFIO_FILE_DEL:
> + if (get_user(fd, argp))
> + return -EFAULT;
> + return mshv_vfio_file_del(mshvdev, fd);
> + }
> +
> + return -ENXIO;
> +}
> +
> +static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
> + struct mshv_device_attr *attr)
> +{
> + switch (attr->group) {
> + case MSHV_DEV_VFIO_FILE:
> + return mshv_vfio_set_file(mshvdev, attr->attr,
> + u64_to_user_ptr(attr->addr));
> + }
> +
> + return -ENXIO;
> +}
> +
> +static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
> + struct mshv_device_attr *attr)
> +{
> + switch (attr->group) {
> + case MSHV_DEV_VFIO_FILE:
> + switch (attr->attr) {
> + case MSHV_DEV_VFIO_FILE_ADD:
> + case MSHV_DEV_VFIO_FILE_DEL:
> + return 0;
> + }
> +
> + break;
> + }
> +
> + return -ENXIO;
> +}
> +
> +static long mshv_vfio_create_device(struct mshv_device *mshvdev, u32 type)
> +{
> + struct mshv_device *tmp;
> + struct mshv_vfio *mshv_vfio;
> +
> + /* Only one VFIO "device" per VM */
> + hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
> + device_ptnode)
> + if (tmp->device_ops == &mshv_vfio_device_ops)
> + return -EBUSY;
> +
> + mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
> + if (mshv_vfio == NULL)
> + return -ENOMEM;
> +
> + INIT_LIST_HEAD(&mshv_vfio->file_list);
> + mutex_init(&mshv_vfio->lock);
> +
> + mshvdev->device_private = mshv_vfio;
> +
> + return 0;
> +}
> +
> +/* This is called from mshv_device_fop_release() */
> +static void mshv_vfio_release_device(struct mshv_device *mshvdev)
> +{
> + struct mshv_vfio *mv = mshvdev->device_private;
> + struct mshv_vfio_file *mvf, *tmp;
> +
> + list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
> + fput(mvf->file);
> + list_del(&mvf->node);
> + kfree(mvf);
> + }
> +
> + kfree(mv);
> + kfree(mshvdev);
> +}
> +
> +struct mshv_device_ops mshv_vfio_device_ops = {
> + .device_name = "mshv-vfio",
> + .device_create = mshv_vfio_create_device,
> + .device_release = mshv_vfio_release_device,
> + .device_set_attr = mshv_vfio_set_attr,
> + .device_has_attr = mshv_vfio_has_attr,
> +};
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl
2026-01-20 6:42 ` [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl Mukesh R
2026-01-21 1:41 ` Stanislav Kinsburskii
@ 2026-01-23 18:34 ` Nuno Das Neves
2026-01-24 2:12 ` Mukesh R
1 sibling, 1 reply; 69+ messages in thread
From: Nuno Das Neves @ 2026-01-23 18:34 UTC (permalink / raw)
To: Mukesh R, linux-kernel, linux-hyperv, linux-arm-kernel, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/19/2026 10:42 PM, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> VFIO no longer puts the mmio pfn in vma->vm_pgoff. So, remove the code
> that uses it to map mmio space. It is broken and will cause a panic.
What is the reason for having this as a separate commit from patch 15?
It seems like removing this code and adding the mmio intercept
handling could be done in one patch.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 20 ++++----------------
> 1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 27313419828d..03f3aa9f5541 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1258,16 +1258,8 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
> }
>
> /*
> - * This maps two things: guest RAM and for pci passthru mmio space.
> - *
> - * mmio:
> - * - vfio overloads vm_pgoff to store the mmio start pfn/spa.
> - * - Two things need to happen for mapping mmio range:
> - * 1. mapped in the uaddr so VMM can access it.
> - * 2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
> - *
> - * This function takes care of the second. The first one is managed by vfio,
> - * and hence is taken care of via vfio_pci_mmap_fault().
> + * This is called for both user ram and mmio space. The mmio space is not
> + * mapped here, but later during intercept.
> */
> static long
> mshv_map_user_memory(struct mshv_partition *partition,
> @@ -1276,7 +1268,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
> struct mshv_mem_region *region;
> struct vm_area_struct *vma;
> bool is_mmio;
> - ulong mmio_pfn;
> long ret;
>
> if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
> @@ -1286,7 +1277,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
> mmap_read_lock(current->mm);
> vma = vma_lookup(current->mm, mem.userspace_addr);
> is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> - mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
> mmap_read_unlock(current->mm);
>
> if (!vma)
> @@ -1313,10 +1303,8 @@ mshv_map_user_memory(struct mshv_partition *partition,
> HV_MAP_GPA_NO_ACCESS, NULL);
> break;
> case MSHV_REGION_TYPE_MMIO:
> - ret = hv_call_map_mmio_pages(partition->pt_id,
> - region->start_gfn,
> - mmio_pfn,
> - region->nr_pages);
> + /* mmio mappings are handled later during intercepts */
> + ret = 0;
> break;
> }
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device
2026-01-23 18:25 ` Nuno Das Neves
@ 2026-01-24 0:36 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 0:36 UTC (permalink / raw)
To: Nuno Das Neves, linux-kernel, linux-hyperv, linux-arm-kernel,
iommu, linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/23/26 10:25, Nuno Das Neves wrote:
> On 1/19/2026 10:42 PM, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Add data structs needed by the subsequent patch that introduces a new
>> module to implement VFIO-MSHV pseudo device.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_root.h | 23 +++++++++++++++++++++++
>> include/uapi/linux/mshv.h | 31 +++++++++++++++++++++++++++++++
>> 2 files changed, 54 insertions(+)
>>
>> diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
>> index c3753b009fd8..42e1da1d545b 100644
>> --- a/drivers/hv/mshv_root.h
>> +++ b/drivers/hv/mshv_root.h
>> @@ -220,6 +220,29 @@ struct port_table_info {
>> };
>> };
>>
>> +struct mshv_device {
>> + const struct mshv_device_ops *device_ops;
>> + struct mshv_partition *device_pt;
>> + void *device_private;
>> + struct hlist_node device_ptnode;
>> +};
>> +
>> +struct mshv_device_ops {
>> + const char *device_name;
>> + long (*device_create)(struct mshv_device *dev, u32 type);
>> + void (*device_release)(struct mshv_device *dev);
>> + long (*device_set_attr)(struct mshv_device *dev,
>> + struct mshv_device_attr *attr);
>> + long (*device_has_attr)(struct mshv_device *dev,
>> + struct mshv_device_attr *attr);
>> +};
>> +
>> +extern struct mshv_device_ops mshv_vfio_device_ops;
>> +int mshv_vfio_ops_init(void);
>> +void mshv_vfio_ops_exit(void);
>> +long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
>> + void __user *user_args);
>> +
>> int mshv_update_routing_table(struct mshv_partition *partition,
>> const struct mshv_user_irq_entry *entries,
>> unsigned int numents);
>> diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
>> index dee3ece28ce5..b7b10f9e2896 100644
>> --- a/include/uapi/linux/mshv.h
>> +++ b/include/uapi/linux/mshv.h
>> @@ -252,6 +252,7 @@ struct mshv_root_hvcall {
>> #define MSHV_GET_GPAP_ACCESS_BITMAP _IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
>> /* Generic hypercall */
>> #define MSHV_ROOT_HVCALL _IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
>> +#define MSHV_CREATE_DEVICE _IOWR(MSHV_IOCTL, 0x08, struct mshv_create_device)
>>
>
> With this commit, the IOCTL number is exposed to userspace but it doesn't work.
> Ideally the IOCTL number should be added in the commit where it becomes usable.
>
Correct, I switched it because the next patch won't compile without it as
it needs the declarations here. It could be combined into one big patch,
but I think normally one would not expect full functionality until the
release is certified to be that feature compliant anyways. Hope that
makes sense.
Thanks,
-Mukesh
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO
2026-01-23 18:32 ` Nuno Das Neves
@ 2026-01-24 0:37 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 0:37 UTC (permalink / raw)
To: Nuno Das Neves, linux-kernel, linux-hyperv, linux-arm-kernel,
iommu, linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/23/26 10:32, Nuno Das Neves wrote:
> On 1/19/2026 10:42 PM, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Add a new file to implement VFIO-MSHV bridge pseudo device. These
>> functions are called in the VFIO framework, and credits to kvm/vfio.c
>> as this file was adapted from it.
>>
>> Original author: Wei Liu <wei.liu@kernel.org>
>> (Slightly modified from the original version).
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Since the code is very similar to Wei's original commit, the way I'd
> recommend to do it is:
> 1. Change the commit author to Wei, using git commit --amend --author=
> and
> 2. Put his signed-off line before yours:
>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> This shows he is the author of the commit but you ported it.
>
> If you feel you changed it enough that it should be considered
> co-authored, you can instead keep your authorship of the commit and
> put:
>
> Co-developed-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Perfect! Thank you, that is exactly the information I was trying to
seek... makes sense.
Thanks,
-Mukesh
>> ---
>> drivers/hv/Makefile | 3 +-
>> drivers/hv/mshv_vfio.c | 210 +++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 212 insertions(+), 1 deletion(-)
>> create mode 100644 drivers/hv/mshv_vfio.c
>>
>> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
>> index a49f93c2d245..eae003c4cb8f 100644
>> --- a/drivers/hv/Makefile
>> +++ b/drivers/hv/Makefile
>> @@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
>> hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
>> hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
>> mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
>> - mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
>> + mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
>> + mshv_vfio.o
>> mshv_vtl-y := mshv_vtl_main.o
>>
>> # Code that must be built-in
>> diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
>> new file mode 100644
>> index 000000000000..6ea4d99a3bd2
>> --- /dev/null
>> +++ b/drivers/hv/mshv_vfio.c
>> @@ -0,0 +1,210 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * VFIO-MSHV bridge pseudo device
>> + *
>> + * Heavily inspired by the VFIO-KVM bridge pseudo device.
>> + */
>> +#include <linux/errno.h>
>> +#include <linux/file.h>
>> +#include <linux/list.h>
>> +#include <linux/module.h>
>> +#include <linux/mutex.h>
>> +#include <linux/slab.h>
>> +#include <linux/vfio.h>
>> +
>> +#include "mshv.h"
>> +#include "mshv_root.h"
>> +
>> +struct mshv_vfio_file {
>> + struct list_head node;
>> + struct file *file; /* list of struct mshv_vfio_file */
>> +};
>> +
>> +struct mshv_vfio {
>> + struct list_head file_list;
>> + struct mutex lock;
>> +};
>> +
>> +static bool mshv_vfio_file_is_valid(struct file *file)
>> +{
>> + bool (*fn)(struct file *file);
>> + bool ret;
>> +
>> + fn = symbol_get(vfio_file_is_valid);
>> + if (!fn)
>> + return false;
>> +
>> + ret = fn(file);
>> +
>> + symbol_put(vfio_file_is_valid);
>> +
>> + return ret;
>> +}
>> +
>> +static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
>> +{
>> + struct mshv_vfio *mshv_vfio = mshvdev->device_private;
>> + struct mshv_vfio_file *mvf;
>> + struct file *filp;
>> + long ret = 0;
>> +
>> + filp = fget(fd);
>> + if (!filp)
>> + return -EBADF;
>> +
>> + /* Ensure the FD is a vfio FD. */
>> + if (!mshv_vfio_file_is_valid(filp)) {
>> + ret = -EINVAL;
>> + goto out_fput;
>> + }
>> +
>> + mutex_lock(&mshv_vfio->lock);
>> +
>> + list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
>> + if (mvf->file == filp) {
>> + ret = -EEXIST;
>> + goto out_unlock;
>> + }
>> + }
>> +
>> + mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
>> + if (!mvf) {
>> + ret = -ENOMEM;
>> + goto out_unlock;
>> + }
>> +
>> + mvf->file = get_file(filp);
>> + list_add_tail(&mvf->node, &mshv_vfio->file_list);
>> +
>> +out_unlock:
>> + mutex_unlock(&mshv_vfio->lock);
>> +out_fput:
>> + fput(filp);
>> + return ret;
>> +}
>> +
>> +static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
>> +{
>> + struct mshv_vfio *mshv_vfio = mshvdev->device_private;
>> + struct mshv_vfio_file *mvf;
>> + long ret;
>> +
>> + CLASS(fd, f)(fd);
>> +
>> + if (fd_empty(f))
>> + return -EBADF;
>> +
>> + ret = -ENOENT;
>> + mutex_lock(&mshv_vfio->lock);
>> +
>> + list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
>> + if (mvf->file != fd_file(f))
>> + continue;
>> +
>> + list_del(&mvf->node);
>> + fput(mvf->file);
>> + kfree(mvf);
>> + ret = 0;
>> + break;
>> + }
>> +
>> + mutex_unlock(&mshv_vfio->lock);
>> + return ret;
>> +}
>> +
>> +static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
>> + void __user *arg)
>> +{
>> + int32_t __user *argp = arg;
>> + int32_t fd;
>> +
>> + switch (attr) {
>> + case MSHV_DEV_VFIO_FILE_ADD:
>> + if (get_user(fd, argp))
>> + return -EFAULT;
>> + return mshv_vfio_file_add(mshvdev, fd);
>> +
>> + case MSHV_DEV_VFIO_FILE_DEL:
>> + if (get_user(fd, argp))
>> + return -EFAULT;
>> + return mshv_vfio_file_del(mshvdev, fd);
>> + }
>> +
>> + return -ENXIO;
>> +}
>> +
>> +static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
>> + struct mshv_device_attr *attr)
>> +{
>> + switch (attr->group) {
>> + case MSHV_DEV_VFIO_FILE:
>> + return mshv_vfio_set_file(mshvdev, attr->attr,
>> + u64_to_user_ptr(attr->addr));
>> + }
>> +
>> + return -ENXIO;
>> +}
>> +
>> +static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
>> + struct mshv_device_attr *attr)
>> +{
>> + switch (attr->group) {
>> + case MSHV_DEV_VFIO_FILE:
>> + switch (attr->attr) {
>> + case MSHV_DEV_VFIO_FILE_ADD:
>> + case MSHV_DEV_VFIO_FILE_DEL:
>> + return 0;
>> + }
>> +
>> + break;
>> + }
>> +
>> + return -ENXIO;
>> +}
>> +
>> +static long mshv_vfio_create_device(struct mshv_device *mshvdev, u32 type)
>> +{
>> + struct mshv_device *tmp;
>> + struct mshv_vfio *mshv_vfio;
>> +
>> + /* Only one VFIO "device" per VM */
>> + hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
>> + device_ptnode)
>> + if (tmp->device_ops == &mshv_vfio_device_ops)
>> + return -EBUSY;
>> +
>> + mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
>> + if (mshv_vfio == NULL)
>> + return -ENOMEM;
>> +
>> + INIT_LIST_HEAD(&mshv_vfio->file_list);
>> + mutex_init(&mshv_vfio->lock);
>> +
>> + mshvdev->device_private = mshv_vfio;
>> +
>> + return 0;
>> +}
>> +
>> +/* This is called from mshv_device_fop_release() */
>> +static void mshv_vfio_release_device(struct mshv_device *mshvdev)
>> +{
>> + struct mshv_vfio *mv = mshvdev->device_private;
>> + struct mshv_vfio_file *mvf, *tmp;
>> +
>> + list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
>> + fput(mvf->file);
>> + list_del(&mvf->node);
>> + kfree(mvf);
>> + }
>> +
>> + kfree(mv);
>> + kfree(mshvdev);
>> +}
>> +
>> +struct mshv_device_ops mshv_vfio_device_ops = {
>> + .device_name = "mshv-vfio",
>> + .device_create = mshv_vfio_create_device,
>> + .device_release = mshv_vfio_release_device,
>> + .device_set_attr = mshv_vfio_set_attr,
>> + .device_has_attr = mshv_vfio_has_attr,
>> +};
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 09/15] mshv: Import data structs around device domains and irq remapping
2026-01-20 22:17 ` Stanislav Kinsburskii
@ 2026-01-24 0:38 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 0:38 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/20/26 14:17, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:24PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Import/copy from the Hyper-V public headers the definitions and
>> declarations related to attaching and detaching device domains and
>> interrupt remapping, and to building device ids for those purposes.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> include/hyperv/hvgdk_mini.h | 11 ++++
>> include/hyperv/hvhdk_mini.h | 112 ++++++++++++++++++++++++++++++++++++
>> 2 files changed, 123 insertions(+)
>>
>
> <snip>
>
>> +/* ID for stage 2 default domain and NULL domain */
>> +#define HV_DEVICE_DOMAIN_ID_S2_DEFAULT 0
>> +#define HV_DEVICE_DOMAIN_ID_S2_NULL 0xFFFFFFFFULL
>> +
>> +union hv_device_domain_id {
>> + u64 as_uint64;
>> + struct {
>> + u32 type : 4;
>> + u32 reserved : 28;
>> + u32 id;
>> + };
>> +} __packed;
>
> Shouldn't the inner struct be packed instead?
>
>> +
>> +struct hv_input_device_domain { /* HV_INPUT_DEVICE_DOMAIN */
>> + u64 partition_id;
>> + union hv_input_vtl owner_vtl;
>> + u8 padding[7];
>> + union hv_device_domain_id domain_id;
>> +} __packed;
>> +
>> +union hv_create_device_domain_flags { /* HV_CREATE_DEVICE_DOMAIN_FLAGS */
>> + u32 as_uint32;
>> + struct {
>> + u32 forward_progress_required : 1;
>> + u32 inherit_owning_vtl : 1;
>> + u32 reserved : 30;
>> + } __packed;
>> +} __packed;
>
> Why should the union be packed?
From GCC docs:
Specifying this attribute for struct and union types is equivalent to
specifying the packed attribute on each of the structure or union members.
Thanks,
-Mukesh
> Thanks,
> Stanislav
>
>> +
>> +struct hv_input_create_device_domain { /* HV_INPUT_CREATE_DEVICE_DOMAIN */
>> + struct hv_input_device_domain device_domain;
>> + union hv_create_device_domain_flags create_device_domain_flags;
>> +} __packed;
>> +
>> +struct hv_input_delete_device_domain { /* HV_INPUT_DELETE_DEVICE_DOMAIN */
>> + struct hv_input_device_domain device_domain;
>> +} __packed;
>> +
>> +struct hv_input_attach_device_domain { /* HV_INPUT_ATTACH_DEVICE_DOMAIN */
>> + struct hv_input_device_domain device_domain;
>> + union hv_device_id device_id;
>> +} __packed;
>> +
>> +struct hv_input_detach_device_domain { /* HV_INPUT_DETACH_DEVICE_DOMAIN */
>> + u64 partition_id;
>> + union hv_device_id device_id;
>> +} __packed;
>> +
>> +struct hv_input_map_device_gpa_pages { /* HV_INPUT_MAP_DEVICE_GPA_PAGES */
>> + struct hv_input_device_domain device_domain;
>> + union hv_input_vtl target_vtl;
>> + u8 padding[3];
>> + u32 map_flags;
>> + u64 target_device_va_base;
>> + u64 gpa_page_list[];
>> +} __packed;
>> +
>> +struct hv_input_unmap_device_gpa_pages { /* HV_INPUT_UNMAP_DEVICE_GPA_PAGES */
>> + struct hv_input_device_domain device_domain;
>> + u64 target_device_va_base;
>> +} __packed;
>> +
>> #endif /* _HV_HVHDK_MINI_H */
>> --
>> 2.51.2.vfs.0.1
>>
* Re: [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
2026-01-20 22:22 ` Stanislav Kinsburskii
@ 2026-01-24 0:42 ` Mukesh R
2026-01-26 20:50 ` Stanislav Kinsburskii
2026-01-28 14:36 ` Manivannan Sadhasivam
0 siblings, 2 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 0:42 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/20/26 14:22, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:25PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> On Hyper-V, most hypercalls related to PCI passthru (to map/unmap regions,
>> interrupts, etc.) need a device id as a parameter. This device id refers
>> to that specific device for the lifetime of the passthru.
>>
>> An L1VH VM only contains VMBus based devices. A device id for a VMBus
>> device is slightly different in that it is built from the hv_pcibus_device
>> info to make sure it matches exactly what the hypervisor expects. This
>> VMBus based device id is needed when attaching devices in an L1VH based
>> guest VM. Before building it, a check is done to make sure the device is
>> a valid VMBus device.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> arch/x86/include/asm/mshyperv.h | 2 ++
>> drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
>> 2 files changed, 31 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index eef4c3a5ba28..0d7fdfb25e76 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
>> static inline void hv_apic_init(void) {}
>> #endif
>>
>> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
>> +
>> struct irq_domain *hv_create_pci_msi_domain(void);
>>
>> int hv_map_msi_interrupt(struct irq_data *data,
>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>> index 8bc6a38c9b5a..40f0b06bb966 100644
>> --- a/drivers/pci/controller/pci-hyperv.c
>> +++ b/drivers/pci/controller/pci-hyperv.c
>> @@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
>> #define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
>> #define HV_MSI_CHIP_FLAGS MSI_CHIP_FLAG_SET_ACK
>>
>> +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
>> +
>
> Why not move this static function definition above the caller instead of
> defining the prototype?
Did you see the function implementation? It has other dependencies that
appear later in the file, so moving it would require reorganizing the code.
Thanks,
-Mukesh
>> static int hv_pci_irqchip_init(void)
>> {
>> return 0;
>> @@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
>>
>> #define hv_msi_prepare pci_msi_prepare
>>
>> +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
>> +{
>> + u64 u64val;
>
> This variable is redundant.
Not really. It makes it easy to add a quick debug print, and is
harmless.
>> + struct hv_pcibus_device *hbus;
>> + struct pci_bus *pbus = pdev->bus;
>> +
>> + if (!hv_vmbus_pci_device(pbus))
>> + return 0;
>> +
>> + hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
>> + u64val = (hbus->hdev->dev_instance.b[5] << 24) |
>> + (hbus->hdev->dev_instance.b[4] << 16) |
>> + (hbus->hdev->dev_instance.b[7] << 8) |
>> + (hbus->hdev->dev_instance.b[6] & 0xf8) |
>> + PCI_FUNC(pdev->devfn);
>> +
>
> It looks like this value always fits into 32 bit, so what is the value
> in returning 64 bit?
The ABI defines the device id as 64 bits, and this value is assigned to it.
Thanks,
-Mukesh
> Thanks,
> Stanislav
>
>> + return u64val;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
>> +
>> /**
>> * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
>> * affinity.
>> @@ -1404,6 +1426,13 @@ static struct pci_ops hv_pcifront_ops = {
>> .write = hv_pcifront_write_config,
>> };
>>
>> +#ifdef CONFIG_X86
>> +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
>> +{
>> + return pbus->ops == &hv_pcifront_ops;
>> +}
>> +#endif /* CONFIG_X86 */
>> +
>> /*
>> * Paravirtual backchannel
>> *
>> --
>> 2.51.2.vfs.0.1
>>
* Re: [PATCH v0 11/15] x86/hyperv: Build logical device ids for PCI passthru hcalls
2026-01-20 22:27 ` Stanislav Kinsburskii
@ 2026-01-24 0:44 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 0:44 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/20/26 14:27, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:26PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> On Hyper-V, most hypercalls related to PCI passthru (to map/unmap regions,
>> interrupts, etc.) need a device id as a parameter. A device id refers
>> to a specific device and is of one of two types:
>> o Logical: used for direct attach (see below) hypercalls. A logical
>> device id is a unique 62bit value that is created and
>> sent during the initial device attach. Then all further
>> communications (for interrupt remaps etc) must use this
>> logical id.
>> o PCI: used for device domain hypercalls such as map, unmap, etc.
>> This is built using actual device BDF info.
>>
>> PS: Since an L1VH only supports direct attaches, a logical device id
>> on an L1VH VM is always a VMBus device id. For non-L1VH cases,
>> we just use PCI BDF info, although not strictly needed, to build the
>> logical device id.
>>
>> At a high level, Hyper-V supports two ways to do PCI passthru:
>> 1. Device Domain: root must create a device domain in the hypervisor,
>> and do map/unmap hypercalls for mapping and unmapping guest RAM.
>> All hypervisor communications use device id of type PCI for
>> identifying and referencing the device.
>>
>> 2. Direct Attach: the hypervisor will simply use the guest's HW
>> page table for mappings, thus the host need not do map/unmap
>> hypercalls. A direct attached device must be referenced
>> via logical device id and never via the PCI device id. For an
>> L1VH root/parent, Hyper-V only supports direct attaches.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> arch/x86/hyperv/irqdomain.c | 60 ++++++++++++++++++++++++++++++---
>> arch/x86/include/asm/mshyperv.h | 14 ++++++++
>> 2 files changed, 70 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
>> index ccbe5848a28f..33017aa0caa4 100644
>> --- a/arch/x86/hyperv/irqdomain.c
>> +++ b/arch/x86/hyperv/irqdomain.c
>> @@ -137,7 +137,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
>> return 0;
>> }
>>
>> -static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>> +static u64 hv_build_devid_type_pci(struct pci_dev *pdev)
>> {
>> int pos;
>> union hv_device_id hv_devid;
>> @@ -197,7 +197,58 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
>> }
>>
>> out:
>> - return hv_devid;
>> + return hv_devid.as_uint64;
>> +}
>> +
>> +/* Build device id for direct attached devices */
>> +static u64 hv_build_devid_type_logical(struct pci_dev *pdev)
>> +{
>> + hv_pci_segment segment;
>> + union hv_device_id hv_devid;
>> + union hv_pci_bdf bdf = {.as_uint16 = 0};
>> + struct rid_data data = {
>> + .bridge = NULL,
>> + .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
>> + };
>> +
>> + segment = pci_domain_nr(pdev->bus);
>> + bdf.bus = PCI_BUS_NUM(data.rid);
>> + bdf.device = PCI_SLOT(data.rid);
>> + bdf.function = PCI_FUNC(data.rid);
>> +
>> + hv_devid.as_uint64 = 0;
>> + hv_devid.device_type = HV_DEVICE_TYPE_LOGICAL;
>> + hv_devid.logical.id = (u64)segment << 16 | bdf.as_uint16;
>> +
>> + return hv_devid.as_uint64;
>> +}
>> +
>> +/* Build device id after the device has been attached */
>> +u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type)
>> +{
>> + if (type == HV_DEVICE_TYPE_LOGICAL) {
>> + if (hv_l1vh_partition())
>> + return hv_pci_vmbus_device_id(pdev);
>
> Should this one be renamed into hv_build_devid_type_vmbus() to align
> with the other two function names?
No, because Hyper-V only defines two types of device ids, and renaming
it would add unnecessary confusion. VMBus uses one of the two types of
device ids.
> Thanks,
> Stanislav
>
>> + else
>> + return hv_build_devid_type_logical(pdev);
>> + } else if (type == HV_DEVICE_TYPE_PCI)
>> + return hv_build_devid_type_pci(pdev);
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_build_devid_oftype);
>> +
>> +/* Build device id for the interrupt path */
>> +static u64 hv_build_irq_devid(struct pci_dev *pdev)
>> +{
>> + enum hv_device_type dev_type;
>> +
>> + if (hv_pcidev_is_attached_dev(pdev) || hv_l1vh_partition())
>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>> + else
>> + dev_type = HV_DEVICE_TYPE_PCI;
>> +
>> + return hv_build_devid_oftype(pdev, dev_type);
>> }
>>
>> /*
>> @@ -221,7 +272,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
>>
>> msidesc = irq_data_get_msi_desc(data);
>> pdev = msi_desc_to_pci_dev(msidesc);
>> - hv_devid = hv_build_devid_type_pci(pdev);
>> + hv_devid.as_uint64 = hv_build_irq_devid(pdev);
>> cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>>
>> return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
>> @@ -296,7 +347,8 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>> {
>> union hv_device_id hv_devid;
>>
>> - hv_devid = hv_build_devid_type_pci(pdev);
>> + hv_devid.as_uint64 = hv_build_irq_devid(pdev);
>> +
>> return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
>> }
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 0d7fdfb25e76..97477c5a8487 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -188,6 +188,20 @@ bool hv_vcpu_is_preempted(int vcpu);
>> static inline void hv_apic_init(void) {}
>> #endif
>>
>> +#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
>> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> +{ return false; } /* temporary */
>> +u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
>> +#else /* CONFIG_HYPERV_IOMMU */
>> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> +{ return false; }
>> +
>> +static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
>> + enum hv_device_type type)
>> +{ return 0; }
>> +
>> +#endif /* CONFIG_HYPERV_IOMMU */
>> +
>> u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
>>
>> struct irq_domain *hv_create_pci_msi_domain(void);
>> --
>> 2.51.2.vfs.0.1
>>
>
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-21 0:12 ` Stanislav Kinsburskii
@ 2026-01-24 1:26 ` Mukesh R
2026-01-26 15:57 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-24 1:26 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Add a new file to implement management of device domains, mapping and
>> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
>> framework for PCI passthru on Hyper-V running Linux as root or L1VH
>> parent. This also implements direct attach mechanism for PCI passthru,
>> and it is also made to work within the VFIO framework.
>>
>> At a high level, during boot the hypervisor creates a default identity
>> domain and attaches all devices to it. This maps nicely to the Linux iommu
>> subsystem's IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
>> need to explicitly ask Hyper-V to attach devices or do maps/unmaps
>> during boot. As mentioned previously, Hyper-V supports two ways to do
>> PCI passthru:
>>
>> 1. Device Domain: root must create a device domain in the hypervisor,
>> and do map/unmap hypercalls for mapping and unmapping guest RAM.
>> All hypervisor communications use device id of type PCI for
>> identifying and referencing the device.
>>
>> 2. Direct Attach: the hypervisor will simply use the guest's HW
>> page table for mappings, thus the host need not do map/unmap
>> device memory hypercalls. As such, direct attach passthru setup
>> during guest boot is extremely fast. A direct attached device
>> must be referenced via logical device id and not via the PCI
>> device id.
>>
>> At present, an L1VH root/parent only supports direct attaches. Direct
>> attach is also the default in non-L1VH cases because the device domain
>> implementation currently has significant performance issues for guests
>> with larger RAM (say, more than 8GB), and that unfortunately cannot be
>> addressed in the short term.
>>
>
> <snip>
>
>> +/*
>> + * If the current thread is a VMM thread, return the partition id of the VM it
>> + * is managing, else return HV_PARTITION_ID_INVALID.
>> + */
>> +u64 hv_iommu_get_curr_partid(void)
>> +{
>> + u64 (*fn)(pid_t pid);
>> + u64 partid;
>> +
>> + fn = symbol_get(mshv_pid_to_partid);
>> + if (!fn)
>> + return HV_PARTITION_ID_INVALID;
>> +
>> + partid = fn(current->tgid);
>> + symbol_put(mshv_pid_to_partid);
>> +
>> + return partid;
>> +}
>> +
>> +/* If this is a VMM thread, then this domain is for a guest VM */
>> +static bool hv_curr_thread_is_vmm(void)
>> +{
>> + return hv_iommu_get_curr_partid() != HV_PARTITION_ID_INVALID;
>> +}
>> +
>> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
>> +{
>> + switch (cap) {
>> + case IOMMU_CAP_CACHE_COHERENCY:
>> + return true;
>> + default:
>> + return false;
>> + }
>> + return false;
>
> The return above is never reached.
>
>> +}
>> +
>> +/*
>> + * Check if given pci device is a direct attached device. Caller must have
>> + * verified pdev is a valid pci device.
>> + */
>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> +{
>> + struct iommu_domain *iommu_domain;
>> + struct hv_domain *hvdom;
>> + struct device *dev = &pdev->dev;
>> +
>> + iommu_domain = iommu_get_domain_for_dev(dev);
>> + if (iommu_domain) {
>> + hvdom = to_hv_domain(iommu_domain);
>
> hvdom varaible is redundant.
>
>> + return hvdom->attached_dom;
>> + }
>> +
>> + return false;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
>> +
>> +/* Create a new device domain in the hypervisor */
>> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
>> +{
>> + u64 status;
>> + unsigned long flags;
>> + struct hv_input_device_domain *ddp;
>> + struct hv_input_create_device_domain *input;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + ddp = &input->device_domain;
>> + ddp->partition_id = HV_PARTITION_ID_SELF;
>> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + ddp->domain_id.id = hvdom->domid_num;
>> +
>> + input->create_device_domain_flags.forward_progress_required = 1;
>> + input->create_device_domain_flags.inherit_owning_vtl = 0;
>> +
>> + status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
>> +
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +/* During boot, all devices are attached to this */
>> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct device *dev)
>> +{
>> + return &hv_def_identity_dom.iommu_dom;
>> +}
>> +
>> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
>> +{
>> + struct hv_domain *hvdom;
>> + int rc;
>> +
>> + if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() && !hv_no_attdev) {
>> + pr_err("Hyper-V: l1vh iommu does not support host devices\n");
>> + return NULL;
>> + }
>> +
>> + hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
>> + if (hvdom == NULL)
>> + goto out;
>
> Why goto here and not return NULL like above?
Some debug code there got removed. Will fix in next version.
>> +
>> + spin_lock_init(&hvdom->mappings_lock);
>> + hvdom->mappings_tree = RB_ROOT_CACHED;
>> +
>> + if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /* ie, 0 */
>> + goto out_free;
>> +
>> + hvdom->domid_num = unique_id;
>> + hvdom->iommu_dom.geometry = default_geometry;
>> + hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
>> +
>> + /* For guests, by default we do direct attaches, so no domain in hyp */
>> + if (hv_curr_thread_is_vmm() && !hv_no_attdev)
>> + hvdom->attached_dom = true;
>> + else {
>> + rc = hv_iommu_create_hyp_devdom(hvdom);
>> + if (rc)
>> + goto out_free_id;
>> + }
>> +
>> + return &hvdom->iommu_dom;
>> +
>> +out_free_id:
>> + unique_id--;
>> +out_free:
>> + kfree(hvdom);
>> +out:
>> + return NULL;
>> +}
>> +
>> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
>> +{
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> + unsigned long flags;
>> + u64 status;
>> + struct hv_input_delete_device_domain *input;
>> +
>> + if (hv_special_domain(hvdom))
>> + return;
>> +
>> + if (hvdom->num_attchd) {
>> + pr_err("Hyper-V: can't free busy iommu domain (%p)\n", immdom);
>> + return;
>> + }
>> +
>> + if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
>> + struct hv_input_device_domain *ddp;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + ddp = &input->device_domain;
>> + memset(input, 0, sizeof(*input));
>> +
>> + ddp->partition_id = HV_PARTITION_ID_SELF;
>> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + ddp->domain_id.id = hvdom->domid_num;
>> +
>> + status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
>> + NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> + }
>> +
>> + kfree(hvdom);
>> +}
>> +
>> +/* Attach a device to a domain previously created in the hypervisor */
>> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
>> +{
>> + unsigned long flags;
>> + u64 status;
>> + enum hv_device_type dev_type;
>> + struct hv_input_attach_device_domain *input;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + input->device_domain.domain_id.id = hvdom->domid_num;
>> +
>> + /* NB: Upon guest shutdown, device is re-attached to the default domain
>> + * without explicit detach.
>> + */
>> + if (hv_l1vh_partition())
>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>> + else
>> + dev_type = HV_DEVICE_TYPE_PCI;
>> +
>> + input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
>> +
>> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +/* Caller must have validated that dev is a valid pci dev */
>> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
>> +{
>> + struct hv_input_attach_device *input;
>> + u64 status;
>> + int rc;
>> + unsigned long flags;
>> + union hv_device_id host_devid;
>> + enum hv_device_type dev_type;
>> + u64 ptid = hv_iommu_get_curr_partid();
>> +
>> + if (ptid == HV_PARTITION_ID_INVALID) {
>> + pr_err("Hyper-V: Invalid partition id in direct attach\n");
>> + return -EINVAL;
>> + }
>> +
>> + if (hv_l1vh_partition())
>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>> + else
>> + dev_type = HV_DEVICE_TYPE_PCI;
>> +
>> + host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
>> +
>> + do {
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> + input->partition_id = ptid;
>> + input->device_id = host_devid;
>> +
>> + /* Hypervisor associates logical_id with this device, and in
>> + * some hypercalls like retarget interrupts, logical_id must be
>> + * used instead of the BDF. It is a required parameter.
>> + */
>> + input->attdev_flags.logical_id = 1;
>> + input->logical_devid =
>> + hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
>> +
>> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
>> + rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
>> + if (rc)
>> + break;
>> + }
>> + } while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +/* This to attach a device to both host app (like DPDK) and a guest VM */
>> +static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
>> + struct iommu_domain *old)
>> +{
>> + struct pci_dev *pdev;
>> + int rc;
>> + struct hv_domain *hvdom_new = to_hv_domain(immdom);
>> + struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
>> +
>> + /* Only allow PCI devices for now */
>> + if (!dev_is_pci(dev))
>> + return -EINVAL;
>> +
>> + pdev = to_pci_dev(dev);
>> +
>> + /* l1vh does not support host device (eg DPDK) passthru */
>> + if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
>> + !hvdom_new->attached_dom)
>> + return -EINVAL;
>> +
>> + /*
>> + * VFIO does not do explicit detach calls, hence check first if we need
>> + * to detach first. Also, in case of guest shutdown, it's the VMM
>> + * thread that attaches it back to the hv_def_identity_dom, and
>> + * hvdom_prev will not be null then. It is null during boot.
>> + */
>> + if (hvdom_prev)
>> + if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
>> + hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
>> +
>> + if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
>> + dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
>> + return 0;
>> + }
>> +
>> + if (hvdom_new->attached_dom)
>> + rc = hv_iommu_direct_attach_device(pdev);
>> + else
>> + rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
>> +
>> + if (rc && hvdom_prev) {
>> + int rc1;
>> +
>> + if (hvdom_prev->attached_dom)
>> + rc1 = hv_iommu_direct_attach_device(pdev);
>> + else
>> + rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
>> +
>> + if (rc1)
>> + pr_err("Hyper-V: iommu could not restore orig device state.. dev:%s\n",
>> + dev_name(dev));
>> + }
>> +
>> + if (rc == 0) {
>> + dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
>> + hvdom_new->num_attchd++;
>> + }
>> +
>> + return rc;
>> +}
>> +
>> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
>> + struct pci_dev *pdev)
>> +{
>> + struct hv_input_detach_device *input;
>> + u64 status, log_devid;
>> + unsigned long flags;
>> +
>> + log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->partition_id = hv_iommu_get_curr_partid();
>> + input->logical_devid = log_devid;
>> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +}
>> +
>> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
>> + struct pci_dev *pdev)
>> +{
>> + u64 status, devid;
>> + unsigned long flags;
>> + struct hv_input_detach_device_domain *input;
>> +
>> + devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->partition_id = HV_PARTITION_ID_SELF;
>> + input->device_id.as_uint64 = devid;
>> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +}
>> +
>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
>> +{
>> + struct pci_dev *pdev;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> +
>> + /* See the attach function, only PCI devices for now */
>> + if (!dev_is_pci(dev))
>> + return;
>> +
>> + if (hvdom->num_attchd == 0)
>> + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
>> +
>> + pdev = to_pci_dev(dev);
>> +
>> + if (hvdom->attached_dom) {
>> + hv_iommu_det_dev_from_guest(hvdom, pdev);
>> +
>> + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
>> + * next.
>> + */
>> + } else {
>> + hv_iommu_det_dev_from_dom(hvdom, pdev);
>> + }
>> +
>> + hvdom->num_attchd--;
>
> Shouldn't this be modified iff the detach succeeded?
We still want to free the domain and not let it get stuck. The purpose
is mainly to make sure detach was called before the domain is freed.
>> +}
>> +
>> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
>> + unsigned long iova, phys_addr_t paddr,
>> + size_t size, u32 flags)
>> +{
>> + unsigned long irqflags;
>> + struct hv_iommu_mapping *mapping;
>> +
>> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
>> + if (!mapping)
>> + return -ENOMEM;
>> +
>> + mapping->paddr = paddr;
>> + mapping->iova.start = iova;
>> + mapping->iova.last = iova + size - 1;
>> + mapping->flags = flags;
>> +
>> + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
>> + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
>> + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
>> +
>> + return 0;
>> +}
>> +
>> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
>> + unsigned long iova, size_t size)
>> +{
>> + unsigned long flags;
>> + size_t unmapped = 0;
>> + unsigned long last = iova + size - 1;
>> + struct hv_iommu_mapping *mapping = NULL;
>> + struct interval_tree_node *node, *next;
>> +
>> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
>> + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
>> + while (next) {
>> + node = next;
>> + mapping = container_of(node, struct hv_iommu_mapping, iova);
>> + next = interval_tree_iter_next(node, iova, last);
>> +
>> + /* Trying to split a mapping? Not supported for now. */
>> + if (mapping->iova.start < iova)
>> + break;
>> +
>> + unmapped += mapping->iova.last - mapping->iova.start + 1;
>> +
>> + interval_tree_remove(node, &hvdom->mappings_tree);
>> + kfree(mapping);
>> + }
>> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>> +
>> + return unmapped;
>> +}
>> +
>> +/* Return: must return exact status from the hypercall without changes */
>> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
>> + unsigned long iova, phys_addr_t paddr,
>> + unsigned long npages, u32 map_flags)
>> +{
>> + u64 status;
>> + int i;
>> + struct hv_input_map_device_gpa_pages *input;
>> + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + input->device_domain.domain_id.id = hvdom->domid_num;
>> + input->map_flags = map_flags;
>> + input->target_device_va_base = iova;
>> +
>> + pfn = paddr >> HV_HYP_PAGE_SHIFT;
>> + for (i = 0; i < npages; i++, pfn++)
>> + input->gpa_page_list[i] = pfn;
>> +
>> + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
>> + input, NULL);
>> +
>> + local_irq_restore(flags);
>> + return status;
>> +}
>> +
>> +/*
>> + * The core VFIO code loops over memory ranges calling this function with
>> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
>> + */
>> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
>> + phys_addr_t paddr, size_t pgsize, size_t pgcount,
>> + int prot, gfp_t gfp, size_t *mapped)
>> +{
>> + u32 map_flags;
>> + int ret;
>> + u64 status;
>> + unsigned long npages, done = 0;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> + size_t size = pgsize * pgcount;
>> +
>> + map_flags = HV_MAP_GPA_READABLE; /* required */
>> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
>> +
>> + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
>> + if (ret)
>> + return ret;
>> +
>> + if (hvdom->attached_dom) {
>> + *mapped = size;
>> + return 0;
>> + }
>> +
>> + npages = size >> HV_HYP_PAGE_SHIFT;
>> + while (done < npages) {
>> + ulong completed, remain = npages - done;
>> +
>> + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
>> + map_flags);
>> +
>> + completed = hv_repcomp(status);
>> + done = done + completed;
>> + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
>> + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
>> +
>> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
>> + ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> + hv_current_partition_id,
>> + 256);
>> + if (ret)
>> + break;
>> + }
>> + if (!hv_result_success(status))
>> + break;
>> + }
>> +
>> + if (!hv_result_success(status)) {
>> + size_t done_size = done << HV_HYP_PAGE_SHIFT;
>> +
>> + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
>> + done, npages, iova);
>> + /*
>> + * lookup tree has all mappings [0 - size-1]. Below unmap will
>> + * only remove from [0 - done], we need to remove second chunk
>> + * [done+1 - size-1].
>> + */
>> + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
>> + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
>> + done, NULL);
>> + if (mapped)
>> + *mapped = 0;
>> + } else
>> + if (mapped)
>> + *mapped = size;
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
>> + size_t pgsize, size_t pgcount,
>> + struct iommu_iotlb_gather *gather)
>> +{
>> + unsigned long flags, npages;
>> + struct hv_input_unmap_device_gpa_pages *input;
>> + u64 status;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> + size_t unmapped, size = pgsize * pgcount;
>> +
>> + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
>> + if (unmapped < size)
>> + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
>> + __func__, iova, unmapped, size);
>> +
>> + if (hvdom->attached_dom)
>> + return size;
>> +
>> + npages = size >> HV_HYP_PAGE_SHIFT;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + input->device_domain.domain_id.id = hvdom->domid_num;
>> + input->target_device_va_base = iova;
>> +
>> + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
>> + 0, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>
> There is some inconsistency in namings and behaviour of paired
> functions:
> 1. The pair of hv_iommu_unmap_pages is called hv_iommu_map_pgs
The pair of hv_iommu_unmap_pages is hv_iommu_map_pages, right above.
hv_iommu_map_pgs could be renamed to hv_iommu_map_pgs_hcall, I suppose.
> 2. hv_iommu_map_pgs doesn't print status in case of error.
It does:
hv_status_err(status, "\n"); <==============
> It would be much better to keep this code consistent.
>
>> + return unmapped;
>> +}
>> +
>> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
>> + dma_addr_t iova)
>> +{
>> + u64 paddr = 0;
>> + unsigned long flags;
>> + struct hv_iommu_mapping *mapping;
>> + struct interval_tree_node *node;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> +
>> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
>> + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
>> + if (node) {
>> + mapping = container_of(node, struct hv_iommu_mapping, iova);
>> + paddr = mapping->paddr + (iova - mapping->iova.start);
>> + }
>> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>> +
>> + return paddr;
>> +}
>> +
>> +/*
>> + * Currently, hypervisor does not provide list of devices it is using
>> + * dynamically. So use this to allow users to manually specify devices that
>> + * should be skipped. (eg. hypervisor debugger using some network device).
>> + */
>> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>> +{
>> + if (!dev_is_pci(dev))
>> + return ERR_PTR(-ENODEV);
>> +
>> + if (pci_devs_to_skip && *pci_devs_to_skip) {
>> + int rc, pos = 0;
>> + int parsed;
>> + int segment, bus, slot, func;
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> + do {
>> + parsed = 0;
>> +
>> + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
>> + &segment, &bus, &slot, &func, &parsed);
>> + if (rc)
>> + break;
>> + if (parsed <= 0)
>> + break;
>> +
>> + if (pci_domain_nr(pdev->bus) == segment &&
>> + pdev->bus->number == bus &&
>> + PCI_SLOT(pdev->devfn) == slot &&
>> + PCI_FUNC(pdev->devfn) == func) {
>> +
>> + dev_info(dev, "skipped by Hyper-V IOMMU\n");
>> + return ERR_PTR(-ENODEV);
>> + }
>> + pos += parsed;
>> +
>> + } while (pci_devs_to_skip[pos]);
>> + }
>> +
>> + /* Device will be explicitly attached to the default domain, so no need
>> + * to do dev_iommu_priv_set() here.
>> + */
>> +
>> + return &hv_virt_iommu;
>> +}
>> +
>> +static void hv_iommu_probe_finalize(struct device *dev)
>> +{
>> + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
>> +
>> + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
>> + iommu_setup_dma_ops(dev);
>> + else
>> + set_dma_ops(dev, NULL);
>> +}
>> +
>> +static void hv_iommu_release_device(struct device *dev)
>> +{
>> + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
>> +
>> + /* Need to detach device from device domain if necessary. */
>> + if (hvdom)
>> + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
>> +
>> + dev_iommu_priv_set(dev, NULL);
>> + set_dma_ops(dev, NULL);
>> +}
>> +
>> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
>> +{
>> + if (dev_is_pci(dev))
>> + return pci_device_group(dev);
>> + else
>> + return generic_device_group(dev);
>> +}
>> +
>> +static int hv_iommu_def_domain_type(struct device *dev)
>> +{
>> + /* The hypervisor always creates this by default during boot */
>> + return IOMMU_DOMAIN_IDENTITY;
>> +}
>> +
>> +static struct iommu_ops hv_iommu_ops = {
>> + .capable = hv_iommu_capable,
>> + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
>> + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
>> + .probe_device = hv_iommu_probe_device,
>> + .probe_finalize = hv_iommu_probe_finalize,
>> + .release_device = hv_iommu_release_device,
>> + .def_domain_type = hv_iommu_def_domain_type,
>> + .device_group = hv_iommu_device_group,
>> + .default_domain_ops = &(const struct iommu_domain_ops) {
>> + .attach_dev = hv_iommu_attach_dev,
>> + .map_pages = hv_iommu_map_pages,
>> + .unmap_pages = hv_iommu_unmap_pages,
>> + .iova_to_phys = hv_iommu_iova_to_phys,
>> + .free = hv_iommu_domain_free,
>> + },
>> + .owner = THIS_MODULE,
>> +};
>> +
>> +static void __init hv_initialize_special_domains(void)
>> +{
>> + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
>> + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
>
> hv_def_identity_dom is a static global variable.
> Why not initialize hv_def_identity_dom upon definition instead of
> introducing a new function?
Originally, it was a function. I changed it to a static initializer, but
during the 6.6 review I changed it back to a function. I can't remember
why, but it is pretty harmless. We may add more domains, for example a
null domain, to the initialization in the future.
>> +}
>> +
>> +static int __init hv_iommu_init(void)
>> +{
>> + int ret;
>> + struct iommu_device *iommup = &hv_virt_iommu;
>> +
>> + if (!hv_is_hyperv_initialized())
>> + return -ENODEV;
>> +
>> + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
>> + if (ret) {
>> + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
>> + return ret;
>> + }
>> +
>> + /* This must come before iommu_device_register because the latter calls
>> + * into the hooks.
>> + */
>> + hv_initialize_special_domains();
>> +
>> + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
>
> It looks weird to initialize an object after creating sysfs entries for
> it.
> It should be the other way around.
Not sure it should be; it is much easier to remove the sysfs entry than
to do the other cleanup, even though iommu_device_unregister is there. I
am sure we'll add more code here, which is probably why it was originally
done this way.
Thanks,
-Mukesh
... snip........
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-22 5:18 ` Jacob Pan
@ 2026-01-24 2:01 ` Mukesh R
2026-01-27 19:21 ` Jacob Pan
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-24 2:01 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/21/26 21:18, Jacob Pan wrote:
> Hi Mukesh,
>
> On Mon, 19 Jan 2026 22:42:27 -0800
> Mukesh R <mrathor@linux.microsoft.com> wrote:
>
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Add a new file to implement management of device domains, mapping and
>> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
>> framework for PCI passthru on Hyper-V running Linux as root or L1VH
>> parent. This also implements direct attach mechanism for PCI passthru,
>> and it is also made to work within the VFIO framework.
>>
>> At a high level, during boot the hypervisor creates a default identity
>> domain and attaches all devices to it. This nicely maps to Linux iommu
>> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
>> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
>> during boot. As mentioned previously, Hyper-V supports two ways to do
>> PCI passthru:
>>
>> 1. Device Domain: root must create a device domain in the
>> hypervisor, and do map/unmap hypercalls for mapping and unmapping
>> guest RAM. All hypervisor communications use device id of type PCI for
>> identifying and referencing the device.
>>
>> 2. Direct Attach: the hypervisor will simply use the guest's HW
>> page table for mappings, thus the host need not do map/unmap
>> device memory hypercalls. As such, direct attach passthru setup
>> during guest boot is extremely fast. A direct attached device
>> must be referenced via logical device id and not via the PCI
>> device id.
>>
>> At present, L1VH root/parent only supports direct attaches. Also
>> direct attach is default in non-L1VH cases because there are some
>> significant performance issues with device domain implementation
>> currently for guests with higher RAM (say more than 8GB), and that
>> unfortunately cannot be addressed in the short term.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> MAINTAINERS | 1 +
>> arch/x86/include/asm/mshyperv.h | 7 +-
>> arch/x86/kernel/pci-dma.c | 2 +
>> drivers/iommu/Makefile | 2 +-
>> drivers/iommu/hyperv-iommu.c | 876 ++++++++++++++++++++++++++++++++
>> include/linux/hyperv.h | 6 +
>> 6 files changed, 890 insertions(+), 4 deletions(-)
>> create mode 100644 drivers/iommu/hyperv-iommu.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 381a0e086382..63160cee942c 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -11741,6 +11741,7 @@ F: drivers/hid/hid-hyperv.c
>> F: drivers/hv/
>> F: drivers/infiniband/hw/mana/
>> F: drivers/input/serio/hyperv-keyboard.c
>> +F: drivers/iommu/hyperv-iommu.c
> Given we are also developing a guest iommu driver on hyperv, I think it
> is more clear to name them accordingly. Perhaps, hyperv-iommu-root.c?
Well, L1VH is not quite root, more like a parent. But we've been using
"L1VH root" loosely to mean L1VH parent, so it is probably OK to rename
it to hyperv-iommu-root.c. I prefer not calling it parent or something
like that.
>> F: drivers/iommu/hyperv-irq.c
>> F: drivers/net/ethernet/microsoft/
>> F: drivers/net/hyperv/
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index 97477c5a8487..e4ccdbbf1d12 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
>> #endif
>>
>> #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
>> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> -{ return false; } /* temporary */
>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
>> u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
>> +u64 hv_iommu_get_curr_partid(void);
>> #else /* CONFIG_HYPERV_IOMMU */
>> static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> { return false; }
>> -
>> static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
>> enum hv_device_type type)
>> { return 0; }
>> +static inline u64 hv_iommu_get_curr_partid(void)
>> +{ return HV_PARTITION_ID_INVALID; }
>>
>> #endif /* CONFIG_HYPERV_IOMMU */
>>
>> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
>> index 6267363e0189..cfeee6505e17 100644
>> --- a/arch/x86/kernel/pci-dma.c
>> +++ b/arch/x86/kernel/pci-dma.c
>> @@ -8,6 +8,7 @@
>> #include <linux/gfp.h>
>> #include <linux/pci.h>
>> #include <linux/amd-iommu.h>
>> +#include <linux/hyperv.h>
>>
>> #include <asm/proto.h>
>> #include <asm/dma.h>
>> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
>> gart_iommu_hole_init();
>> amd_iommu_detect();
>> detect_intel_iommu();
>> + hv_iommu_detect();
> Will this driver be x86 only?
Yes, for now.
>> swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
>> }
>>
>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>> index 598c39558e7d..cc9774864b00 100644
>> --- a/drivers/iommu/Makefile
>> +++ b/drivers/iommu/Makefile
>> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>> obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>> obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>> obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
>> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
> DMA and IRQ remapping should be separate
Not sure I follow.
>> obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>> obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>> obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
>> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
>> new file mode 100644
>> index 000000000000..548483fec6b1
>> --- /dev/null
>> +++ b/drivers/iommu/hyperv-iommu.c
>> @@ -0,0 +1,876 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Hyper-V root vIOMMU driver.
>> + * Copyright (C) 2026, Microsoft, Inc.
>> + */
>> +
>> +#include <linux/module.h>
> I don't think this is needed since this driver cannot be a module
>
>> +#include <linux/pci.h>
>> +#include <linux/dmar.h>
> should not depend on Intel's DMAR
>
>> +#include <linux/dma-map-ops.h>
>> +#include <linux/interval_tree.h>
>> +#include <linux/hyperv.h>
>> +#include "dma-iommu.h"
>> +#include <asm/iommu.h>
>> +#include <asm/mshyperv.h>
>> +
>> +/* We will not claim these PCI devices, eg hypervisor needs it for debugger */
>> +static char *pci_devs_to_skip;
>> +static int __init hv_iommu_setup_skip(char *str)
>> +{
>> + pci_devs_to_skip = str;
>> +
>> + return 0;
>> +}
>> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
>> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
>> +
>> +bool hv_no_attdev;	/* disable direct device attach for passthru */
>> +EXPORT_SYMBOL_GPL(hv_no_attdev);
>> +static int __init setup_hv_no_attdev(char *str)
>> +{
>> + hv_no_attdev = true;
>> + return 0;
>> +}
>> +__setup("hv_no_attdev", setup_hv_no_attdev);
>> +
>> +/* Iommu device that we export to the world. HyperV supports max of one */
>> +static struct iommu_device hv_virt_iommu;
>> +
>> +struct hv_domain {
>> + struct iommu_domain iommu_dom;
>> +	u32 domid_num;			/* as opposed to domain_id.type */
>> +	u32 num_attchd;			/* number of currently attached devices */
> rename to num_dev_attached?
>
>> +	bool attached_dom;		/* is this direct attached dom? */
>> +	spinlock_t mappings_lock;	/* protects mappings_tree */
>> +	struct rb_root_cached mappings_tree; /* iova to pa lookup tree */
>> +};
>> +
>> +#define to_hv_domain(d) container_of(d, struct hv_domain, iommu_dom)
>> +
>> +struct hv_iommu_mapping {
>> + phys_addr_t paddr;
>> + struct interval_tree_node iova;
>> + u32 flags;
>> +};
>> +
>> +/*
>> + * By default, during boot the hypervisor creates one Stage 2 (S2) default
>> + * domain. Stage 2 means that the page table is controlled by the hypervisor.
>> + * S2 default: access to entire root partition memory. This for us easily
>> + *             maps to IOMMU_DOMAIN_IDENTITY in the iommu subsystem, and
>> + *             is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the hypervisor.
>> + *
>> + * Device Management:
>> + * There are two ways to manage device attaches to domains:
>> + *   1. Domain Attach: A device domain is created in the hypervisor, the
>> + *                     device is attached to this domain, and then memory
>> + *                     ranges are mapped in the map callbacks.
>> + *   2. Direct Attach: No need to create a domain in the hypervisor for direct
>> + *                     attached devices. A hypercall is made to tell the
>> + *                     hypervisor to attach the device to a guest. There is
>> + *                     no need for explicit memory mappings because the
>> + *                     hypervisor will just use the guest HW page table.
>> + *
>> + * Since a direct attach is much faster, it is the default. This can be
>> + * changed via hv_no_attdev.
>> + *
>> + * L1VH: hypervisor only supports direct attach.
>> + */
>> +
>> +/*
>> + * Create dummy domain to correspond to hypervisor prebuilt default identity
>> + * domain (dummy because we do not make hypercall to create them).
>> + */
>> +static struct hv_domain hv_def_identity_dom;
>> +
>> +static bool hv_special_domain(struct hv_domain *hvdom)
>> +{
>> + return hvdom == &hv_def_identity_dom;
>> +}
>> +
>> +struct iommu_domain_geometry default_geometry = (struct iommu_domain_geometry) {
>> + .aperture_start = 0,
>> + .aperture_end = -1UL,
>> + .force_aperture = true,
>> +};
>> +
>> +/*
>> + * Since the relevant hypercalls can only fit less than 512 PFNs in the pfn
>> + * array, report 1M max.
>> + */
>> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
>> +
>> +static u32 unique_id;	/* unique numeric id of a new domain */
>> +
>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
>> + struct device *dev);
>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
>> +				   size_t pgsize, size_t pgcount,
>> +				   struct iommu_iotlb_gather *gather);
>> +
>> +/*
>> + * If the current thread is a VMM thread, return the partition id of the VM it
>> + * is managing, else return HV_PARTITION_ID_INVALID.
>> + */
>> +u64 hv_iommu_get_curr_partid(void)
>> +{
>> + u64 (*fn)(pid_t pid);
>> + u64 partid;
>> +
>> + fn = symbol_get(mshv_pid_to_partid);
>> + if (!fn)
>> + return HV_PARTITION_ID_INVALID;
>> +
>> + partid = fn(current->tgid);
>> + symbol_put(mshv_pid_to_partid);
>> +
>> + return partid;
>> +}
> This function is not iommu specific. Maybe move it to mshv code?
Well, it gets the information from mshv by calling a function there on
behalf of the iommu, and it is not needed without CONFIG_HYPERV_IOMMU. So
this is probably the best place for it.
>> +
>> +/* If this is a VMM thread, then this domain is for a guest VM */
>> +static bool hv_curr_thread_is_vmm(void)
>> +{
>> + return hv_iommu_get_curr_partid() != HV_PARTITION_ID_INVALID;
>> +}
>> +
>> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
>> +{
>> + switch (cap) {
>> + case IOMMU_CAP_CACHE_COHERENCY:
>> + return true;
>> + default:
>> + return false;
>> + }
>> + return false;
>> +}
>> +
>> +/*
>> + * Check if given pci device is a direct attached device. Caller must have
>> + * verified pdev is a valid pci device.
>> + */
>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> +{
>> + struct iommu_domain *iommu_domain;
>> + struct hv_domain *hvdom;
>> + struct device *dev = &pdev->dev;
>> +
>> + iommu_domain = iommu_get_domain_for_dev(dev);
>> + if (iommu_domain) {
>> + hvdom = to_hv_domain(iommu_domain);
>> + return hvdom->attached_dom;
>> + }
>> +
>> + return false;
>> +}
>> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
> Attached domain can change anytime, what guarantee does the caller have?
Not sure I understand what can change: the device moving from attached
to non-attached, or the domain getting deleted? In any case, this is
called from leaf functions, so that should not happen... and it
will return false if the device somehow did get removed.
>> +
>> +/* Create a new device domain in the hypervisor */
>> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
>> +{
>> + u64 status;
>> + unsigned long flags;
>> + struct hv_input_device_domain *ddp;
>> + struct hv_input_create_device_domain *input;
> nit: use consistent coding style, inverse Christmas tree.
>
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + ddp = &input->device_domain;
>> + ddp->partition_id = HV_PARTITION_ID_SELF;
>> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + ddp->domain_id.id = hvdom->domid_num;
>> +
>> +	input->create_device_domain_flags.forward_progress_required = 1;
>> + input->create_device_domain_flags.inherit_owning_vtl = 0;
>> +
>> +	status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
>> +
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +/* During boot, all devices are attached to this */
>> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct device *dev)
>> +{
>> + return &hv_def_identity_dom.iommu_dom;
>> +}
>> +
>> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
>> +{
>> + struct hv_domain *hvdom;
>> + int rc;
>> +
>> +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() && !hv_no_attdev) {
>> +		pr_err("Hyper-V: l1vh iommu does not support host devices\n");
> why is this an error if user input choose not to do direct attach?
Like the error message says: on L1VH, attaching host devices (e.g. DPDK)
is not supported, and L1VH only does direct attaches. IOW, no host
devices on L1VH.
>> + return NULL;
>> + }
>> +
>> + hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
>> + if (hvdom == NULL)
>> + goto out;
>> +
>> + spin_lock_init(&hvdom->mappings_lock);
>> + hvdom->mappings_tree = RB_ROOT_CACHED;
>> +
>> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /* ie, 0 */
> This is true only when unique_id wraps around, right? Then this driver
> stops working?
Correct. It's a u32, so if my math is right and a device is attached
every second, it will take 136 years to wrap! Did I get that right?
> can you use an IDR for the unique_id and free it as you detach instead
> of doing this cyclic allocation?
>
>> + goto out_free;
>> +
>> + hvdom->domid_num = unique_id;
>> + hvdom->iommu_dom.geometry = default_geometry;
>> + hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
>> +
>> +	/* For guests, by default we do direct attaches, so no domain in hyp */
>> + if (hv_curr_thread_is_vmm() && !hv_no_attdev)
>> + hvdom->attached_dom = true;
>> + else {
>> + rc = hv_iommu_create_hyp_devdom(hvdom);
>> + if (rc)
>> + goto out_free_id;
>> + }
>> +
>> + return &hvdom->iommu_dom;
>> +
>> +out_free_id:
>> + unique_id--;
>> +out_free:
>> + kfree(hvdom);
>> +out:
>> + return NULL;
>> +}
>> +
>> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
>> +{
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> + unsigned long flags;
>> + u64 status;
>> + struct hv_input_delete_device_domain *input;
>> +
>> + if (hv_special_domain(hvdom))
>> + return;
>> +
>> + if (hvdom->num_attchd) {
>> +		pr_err("Hyper-V: can't free busy iommu domain (%p)\n", immdom);
>> + return;
>> + }
>> +
>> + if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
>> + struct hv_input_device_domain *ddp;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + ddp = &input->device_domain;
>> + memset(input, 0, sizeof(*input));
>> +
>> + ddp->partition_id = HV_PARTITION_ID_SELF;
>> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + ddp->domain_id.id = hvdom->domid_num;
>> +
>> +		status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
>> +					 NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> + }
> you could free the domid here, no?
Sorry, I don't follow what you mean by domid. Do you mean unique_id?
>> +
>> + kfree(hvdom);
>> +}
>> +
>> +/* Attach a device to a domain previously created in the hypervisor */
>> +static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
>> +{
>> + unsigned long flags;
>> + u64 status;
>> + enum hv_device_type dev_type;
>> + struct hv_input_attach_device_domain *input;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + input->device_domain.domain_id.id = hvdom->domid_num;
>> +
>> +	/* NB: Upon guest shutdown, device is re-attached to the default domain
>> + * without explicit detach.
>> + */
>> + if (hv_l1vh_partition())
>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>> + else
>> + dev_type = HV_DEVICE_TYPE_PCI;
>> +
>> +	input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
>> +
>> +	status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +/* Caller must have validated that dev is a valid pci dev */
>> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
>> +{
>> + struct hv_input_attach_device *input;
>> + u64 status;
>> + int rc;
>> + unsigned long flags;
>> + union hv_device_id host_devid;
>> + enum hv_device_type dev_type;
>> + u64 ptid = hv_iommu_get_curr_partid();
>> +
>> + if (ptid == HV_PARTITION_ID_INVALID) {
>> +		pr_err("Hyper-V: Invalid partition id in direct attach\n");
>> + return -EINVAL;
>> + }
>> +
>> + if (hv_l1vh_partition())
>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>> + else
>> + dev_type = HV_DEVICE_TYPE_PCI;
>> +
>> + host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
>> +
>> + do {
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> + input->partition_id = ptid;
>> + input->device_id = host_devid;
>> +
>> +		/* Hypervisor associates logical_id with this device, and in
>> +		 * some hypercalls like retarget interrupts, logical_id must be
>> +		 * used instead of the BDF. It is a required parameter.
>> +		 */
>> +		input->attdev_flags.logical_id = 1;
>> +		input->logical_devid =
>> +			hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
>> +
>> +		status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
>> + local_irq_restore(flags);
>> +
>> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
>> +			rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
>> + if (rc)
>> + break;
>> + }
>> + } while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +/* This to attach a device to both host app (like DPDK) and a guest VM */
> The IOMMU driver should be agnostic to the type of consumer, whether a
> userspace driver or a VM. This comment is not necessary.
>
>> +static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
>> +			       struct iommu_domain *old)
> This does not match upstream kernel prototype, which kernel version is
> this based on? I will stop here for now.
As I mentioned in the cover letter:
Based on: 8f0b4cce4481 (origin/hyperv-next)
which is now 6.19 based.
> struct iommu_domain_ops {
> int (*attach_dev)(struct iommu_domain *domain, struct device
> *dev);
I think you got it backwards: 6.6 has this; 6.19 has the extra parameter.
Thanks,
-Mukesh
>> +{
>> + struct pci_dev *pdev;
>> + int rc;
>> + struct hv_domain *hvdom_new = to_hv_domain(immdom);
>> + struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
>> +
>> + /* Only allow PCI devices for now */
>> + if (!dev_is_pci(dev))
>> + return -EINVAL;
>> +
>> + pdev = to_pci_dev(dev);
>> +
>> + /* l1vh does not support host device (eg DPDK) passthru */
>> + if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
>> + !hvdom_new->attached_dom)
>> + return -EINVAL;
>> +
>> + /*
>> +	 * VFIO does not do explicit detach calls, hence check first if we need
>> +	 * to detach first. Also, in case of guest shutdown, it's the VMM
>> +	 * thread that attaches it back to the hv_def_identity_dom, and
>> +	 * hvdom_prev will not be null then. It is null during boot.
>> + */
>> + if (hvdom_prev)
>> + if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
>> + hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev);
>> +
>> + if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
>> + dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
>> + return 0;
>> + }
>> +
>> + if (hvdom_new->attached_dom)
>> + rc = hv_iommu_direct_attach_device(pdev);
>> + else
>> + rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
>> +
>> + if (rc && hvdom_prev) {
>> + int rc1;
>> +
>> + if (hvdom_prev->attached_dom)
>> + rc1 = hv_iommu_direct_attach_device(pdev);
>> + else
>> + rc1 = hv_iommu_att_dev2dom(hvdom_prev, pdev);
>> +
>> + if (rc1)
>> + pr_err("Hyper-V: iommu could not restore orig device state.. dev:%s\n",
>> + dev_name(dev));
>> + }
>> +
>> + if (rc == 0) {
>> + dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
>> + hvdom_new->num_attchd++;
>> + }
>> +
>> + return rc;
>> +}
>> +
>> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
>> + struct pci_dev *pdev)
>> +{
>> + struct hv_input_detach_device *input;
>> + u64 status, log_devid;
>> + unsigned long flags;
>> +
>> + log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->partition_id = hv_iommu_get_curr_partid();
>> + input->logical_devid = log_devid;
>> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +}
>> +
>> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
>> + struct pci_dev *pdev)
>> +{
>> + u64 status, devid;
>> + unsigned long flags;
>> + struct hv_input_detach_device_domain *input;
>> +
>> + devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->partition_id = HV_PARTITION_ID_SELF;
>> + input->device_id.as_uint64 = devid;
>> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +}
>> +
>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
>> +{
>> + struct pci_dev *pdev;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> +
>> + /* See the attach function, only PCI devices for now */
>> + if (!dev_is_pci(dev))
>> + return;
>> +
>> + if (hvdom->num_attchd == 0)
>> + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
>> +
>> + pdev = to_pci_dev(dev);
>> +
>> + if (hvdom->attached_dom) {
>> + hv_iommu_det_dev_from_guest(hvdom, pdev);
>> +
>> + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
>> + * next.
>> + */
>> + } else {
>> + hv_iommu_det_dev_from_dom(hvdom, pdev);
>> + }
>> +
>> + hvdom->num_attchd--;
>> +}
>> +
>> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
>> + unsigned long iova, phys_addr_t paddr,
>> + size_t size, u32 flags)
>> +{
>> + unsigned long irqflags;
>> + struct hv_iommu_mapping *mapping;
>> +
>> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
>> + if (!mapping)
>> + return -ENOMEM;
>> +
>> + mapping->paddr = paddr;
>> + mapping->iova.start = iova;
>> + mapping->iova.last = iova + size - 1;
>> + mapping->flags = flags;
>> +
>> + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
>> + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
>> + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
>> +
>> + return 0;
>> +}
>> +
>> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
>> + unsigned long iova, size_t size)
>> +{
>> + unsigned long flags;
>> + size_t unmapped = 0;
>> + unsigned long last = iova + size - 1;
>> + struct hv_iommu_mapping *mapping = NULL;
>> + struct interval_tree_node *node, *next;
>> +
>> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
>> + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
>> + while (next) {
>> + node = next;
>> + mapping = container_of(node, struct hv_iommu_mapping, iova);
>> + next = interval_tree_iter_next(node, iova, last);
>> +
>> + /* Trying to split a mapping? Not supported for now. */
>> + if (mapping->iova.start < iova)
>> + break;
>> +
>> + unmapped += mapping->iova.last - mapping->iova.start + 1;
>> +
>> + interval_tree_remove(node, &hvdom->mappings_tree);
>> + kfree(mapping);
>> + }
>> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>> +
>> + return unmapped;
>> +}
>> +
>> +/* Return: must return exact status from the hypercall without changes */
>> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
>> + unsigned long iova, phys_addr_t paddr,
>> + unsigned long npages, u32 map_flags)
>> +{
>> + u64 status;
>> + int i;
>> + struct hv_input_map_device_gpa_pages *input;
>> + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + input->device_domain.domain_id.id = hvdom->domid_num;
>> + input->map_flags = map_flags;
>> + input->target_device_va_base = iova;
>> +
>> + pfn = paddr >> HV_HYP_PAGE_SHIFT;
>> + for (i = 0; i < npages; i++, pfn++)
>> + input->gpa_page_list[i] = pfn;
>> +
>> + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
>> + input, NULL);
>> +
>> + local_irq_restore(flags);
>> + return status;
>> +}
>> +
>> +/*
>> + * The core VFIO code loops over memory ranges calling this function with
>> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
>> + */
>> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
>> + phys_addr_t paddr, size_t pgsize, size_t pgcount,
>> + int prot, gfp_t gfp, size_t *mapped)
>> +{
>> + u32 map_flags;
>> + int ret;
>> + u64 status;
>> + unsigned long npages, done = 0;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> + size_t size = pgsize * pgcount;
>> +
>> + map_flags = HV_MAP_GPA_READABLE; /* required */
>> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
>> +
>> + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
>> + if (ret)
>> + return ret;
>> +
>> + if (hvdom->attached_dom) {
>> + *mapped = size;
>> + return 0;
>> + }
>> +
>> + npages = size >> HV_HYP_PAGE_SHIFT;
>> + while (done < npages) {
>> + ulong completed, remain = npages - done;
>> +
>> + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
>> + map_flags);
>> +
>> + completed = hv_repcomp(status);
>> + done = done + completed;
>> + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
>> + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
>> +
>> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
>> + ret = hv_call_deposit_pages(NUMA_NO_NODE,
>> + hv_current_partition_id,
>> + 256);
>> + if (ret)
>> + break;
>> + }
>> + if (!hv_result_success(status))
>> + break;
>> + }
>> +
>> + if (!hv_result_success(status)) {
>> + size_t done_size = done << HV_HYP_PAGE_SHIFT;
>> +
>> + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
>> + done, npages, iova);
>> + /*
>> + * lookup tree has all mappings [0 - size-1]. Below unmap will
>> + * only remove from [0 - done], we need to remove second chunk
>> + * [done+1 - size-1].
>> + */
>> + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
>> + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
>> + done, NULL);
>> + if (mapped)
>> + *mapped = 0;
>> + } else
>> + if (mapped)
>> + *mapped = size;
>> +
>> + return hv_result_to_errno(status);
>> +}
>> +
>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
>> + size_t pgsize, size_t pgcount,
>> + struct iommu_iotlb_gather *gather)
>> +{
>> + unsigned long flags, npages;
>> + struct hv_input_unmap_device_gpa_pages *input;
>> + u64 status;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> + size_t unmapped, size = pgsize * pgcount;
>> +
>> + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
>> + if (unmapped < size)
>> + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
>> + __func__, iova, unmapped, size);
>> +
>> + if (hvdom->attached_dom)
>> + return size;
>> +
>> + npages = size >> HV_HYP_PAGE_SHIFT;
>> +
>> + local_irq_save(flags);
>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> + memset(input, 0, sizeof(*input));
>> +
>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>> + input->device_domain.domain_id.id = hvdom->domid_num;
>> + input->target_device_va_base = iova;
>> +
>> + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
>> + 0, input, NULL);
>> + local_irq_restore(flags);
>> +
>> + if (!hv_result_success(status))
>> + hv_status_err(status, "\n");
>> +
>> + return unmapped;
>> +}
>> +
>> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
>> + dma_addr_t iova)
>> +{
>> + u64 paddr = 0;
>> + unsigned long flags;
>> + struct hv_iommu_mapping *mapping;
>> + struct interval_tree_node *node;
>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>> +
>> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
>> + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
>> + if (node) {
>> + mapping = container_of(node, struct hv_iommu_mapping, iova);
>> + paddr = mapping->paddr + (iova - mapping->iova.start);
>> + }
>> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>> +
>> + return paddr;
>> +}
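The lookup above is plain offset arithmetic within the covering interval:
paddr = base + (iova - start), with 0 returned on a miss. A tiny userspace
model (hypothetical names, a linear scan standing in for the kernel
interval tree) to illustrate:

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace model of one mapping node: [start, last] -> paddr base. */
struct model_mapping {
	uint64_t start;	/* first IOVA covered (inclusive) */
	uint64_t last;	/* last IOVA covered (inclusive) */
	uint64_t paddr;	/* physical base corresponding to 'start' */
};

/* Linear-scan stand-in for interval_tree_iter_first(); returns 0 on a
 * miss, mirroring hv_iommu_iova_to_phys() leaving paddr at zero. */
static uint64_t model_iova_to_phys(const struct model_mapping *maps,
				   size_t n, uint64_t iova)
{
	for (size_t i = 0; i < n; i++)
		if (iova >= maps[i].start && iova <= maps[i].last)
			return maps[i].paddr + (iova - maps[i].start);
	return 0;
}
```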
>> +
>> +/*
>> + * Currently, hypervisor does not provide list of devices it is using
>> + * dynamically. So use this to allow users to manually specify devices that
>> + * should be skipped. (eg. hypervisor debugger using some network device).
>> + */
>> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>> +{
>> + if (!dev_is_pci(dev))
>> + return ERR_PTR(-ENODEV);
>> +
>> + if (pci_devs_to_skip && *pci_devs_to_skip) {
>> + int rc, pos = 0;
>> + int parsed;
>> + int segment, bus, slot, func;
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> + do {
>> + parsed = 0;
>> +
>> + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
>> + &segment, &bus, &slot, &func, &parsed);
>> + if (rc)
>> + break;
>> + if (parsed <= 0)
>> + break;
>> +
>> + if (pci_domain_nr(pdev->bus) == segment &&
>> + pdev->bus->number == bus &&
>> + PCI_SLOT(pdev->devfn) == slot &&
>> + PCI_FUNC(pdev->devfn) == func) {
>> +
>> + dev_info(dev, "skipped by Hyper-V IOMMU\n");
>> + return ERR_PTR(-ENODEV);
>> + }
>> + pos += parsed;
>> +
>> + } while (pci_devs_to_skip[pos]);
>> + }
>> +
>> + /* Device will be explicitly attached to the default domain, so no need
>> + * to do dev_iommu_priv_set() here.
>> + */
>> +
>> + return &hv_virt_iommu;
>> +}
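The skip-list loop above leans on sscanf's `%n` to advance through repeated
"(seg:bus:slot.func)" entries. A userspace sketch of one iteration
(hypothetical helper name; note it uses the conventional `rc != 4` success
check rather than the `if (rc)` test in the quoted hunk):

```c
#include <stdio.h>

/* Parse one "(seg:bus:slot.func)" entry at s + *pos and advance *pos
 * past it, as the probe loop above walks pci_devs_to_skip.
 * Returns 1 on success, 0 when no further entry can be parsed. */
static int parse_skip_entry(const char *s, int *pos, int *segment,
			    int *bus, int *slot, int *func)
{
	int parsed = 0;
	int rc = sscanf(s + *pos, " (%x:%x:%x.%x) %n",
			segment, bus, slot, func, &parsed);

	if (rc != 4 || parsed <= 0)
		return 0;
	*pos += parsed;	/* %n reports how many chars were consumed */
	return 1;
}
```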
>> +
>> +static void hv_iommu_probe_finalize(struct device *dev)
>> +{
>> + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
>> +
>> + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
>> + iommu_setup_dma_ops(dev);
>> + else
>> + set_dma_ops(dev, NULL);
>> +}
>> +
>> +static void hv_iommu_release_device(struct device *dev)
>> +{
>> + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
>> +
>> + /* Need to detach device from device domain if necessary. */
>> + if (hvdom)
>> + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
>> +
>> + dev_iommu_priv_set(dev, NULL);
>> + set_dma_ops(dev, NULL);
>> +}
>> +
>> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
>> +{
>> + if (dev_is_pci(dev))
>> + return pci_device_group(dev);
>> + else
>> + return generic_device_group(dev);
>> +}
>> +
>> +static int hv_iommu_def_domain_type(struct device *dev)
>> +{
>> + /* The hypervisor always creates this by default during boot */
>> + return IOMMU_DOMAIN_IDENTITY;
>> +}
>> +
>> +static struct iommu_ops hv_iommu_ops = {
>> + .capable = hv_iommu_capable,
>> + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
>> + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
>> + .probe_device = hv_iommu_probe_device,
>> + .probe_finalize = hv_iommu_probe_finalize,
>> + .release_device = hv_iommu_release_device,
>> + .def_domain_type = hv_iommu_def_domain_type,
>> + .device_group = hv_iommu_device_group,
>> + .default_domain_ops = &(const struct iommu_domain_ops) {
>> + .attach_dev = hv_iommu_attach_dev,
>> + .map_pages = hv_iommu_map_pages,
>> + .unmap_pages = hv_iommu_unmap_pages,
>> + .iova_to_phys = hv_iommu_iova_to_phys,
>> + .free = hv_iommu_domain_free,
>> + },
>> + .owner = THIS_MODULE,
>> +};
>> +
>> +static void __init hv_initialize_special_domains(void)
>> +{
>> + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
>> + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
>> +}
> This could be initialized statically.
>
>> +
>> +static int __init hv_iommu_init(void)
>> +{
>> + int ret;
>> + struct iommu_device *iommup = &hv_virt_iommu;
>> +
>> + if (!hv_is_hyperv_initialized())
>> + return -ENODEV;
>> +
>> + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
>> + if (ret) {
>> + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
>> + return ret;
>> + }
>> +
>> + /* This must come before iommu_device_register because the latter calls
>> + * into the hooks.
>> + */
>> + hv_initialize_special_domains();
>> +
>> + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
>> + if (ret) {
>> + pr_err("Hyper-V: iommu_device_register failed: %d\n", ret);
>> + goto err_sysfs_remove;
>> + }
>> +
>> + pr_info("Hyper-V IOMMU initialized\n");
>> +
>> + return 0;
>> +
>> +err_sysfs_remove:
>> + iommu_device_sysfs_remove(iommup);
>> + return ret;
>> +}
>> +
>> +void __init hv_iommu_detect(void)
>> +{
>> + if (no_iommu || iommu_detected)
>> + return;
>> +
>> + /* For l1vh, always expose an iommu unit */
>> + if (!hv_l1vh_partition())
>> + if (!(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
>> + return;
>> +
>> + iommu_detected = 1;
>> + x86_init.iommu.iommu_init = hv_iommu_init;
>> +
>> + pci_request_acs();
>> +}
>> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
>> index dfc516c1c719..2ad111727e82 100644
>> --- a/include/linux/hyperv.h
>> +++ b/include/linux/hyperv.h
>> @@ -1767,4 +1767,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
>> #define HVPFN_DOWN(x) ((x) >> HV_HYP_PAGE_SHIFT)
>> #define page_to_hvpfn(page) (page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
>> +#ifdef CONFIG_HYPERV_IOMMU
>> +void __init hv_iommu_detect(void);
>> +#else
>> +static inline void hv_iommu_detect(void) { }
>> +#endif /* CONFIG_HYPERV_IOMMU */
>> +
>> #endif /* _HYPERV_H */
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 13/15] x86/hyperv: Basic interrupt support for direct attached devices
2026-01-21 0:47 ` Stanislav Kinsburskii
@ 2026-01-24 2:08 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 2:08 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/20/26 16:47, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:28PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> As mentioned previously, a direct attached device must be referenced
>> via logical device id which is formed in the initial attach hypercall.
>> Interrupt mapping paths for direct attached devices are almost same,
>> except we must use logical device ids instead of the PCI device ids.
>>
>> L1VH only supports direct attaches for passing thru devices to its guests,
>> and devices on L1VH are VMBus based. However, the interrupts are mapped
>> via the map interrupt hypercall and not the traditional method of VMBus
>> messages.
>>
>> Partition id for the relevant hypercalls is tricky. This because a device
>> could be moving from root to guest and then back to the root. In case
>> of L1VH, it could be moving from system host to L1VH root to a guest,
>> then back to the L1VH root. So, it is carefully crafted by keeping
>> track of whether the call is on behalf of a VMM process, whether the
>> device is attached device (as opposed to mapped), and whether we are in
>> an L1VH root/parent. If VMM process, we assume it is on behalf of a
>> guest. Otherwise, the device is being attached or detached during boot
>> or shutdown of the privileged partition.
>>
>> Lastly, a dummy cpu and vector is used to map interrupt for a direct
>> attached device. This because, once a device is marked for direct attach,
>> hypervisor will not let any interrupts be mapped to host. So it is mapped
>> to guest dummy cpu and dummy vector. This is then correctly mapped during
>> guest boot via the retarget paths.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> arch/arm64/include/asm/mshyperv.h | 15 +++++
>> arch/x86/hyperv/irqdomain.c | 57 +++++++++++++-----
>> arch/x86/include/asm/mshyperv.h | 4 ++
>> drivers/pci/controller/pci-hyperv.c | 91 +++++++++++++++++++++++++----
>> 4 files changed, 142 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
>> index b721d3134ab6..27da480f94f6 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -53,6 +53,21 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>> return hv_get_msr(reg);
>> }
>>
>> +struct irq_data;
>> +struct msi_msg;
>> +struct pci_dev;
>> +static inline void hv_irq_compose_msi_msg(struct irq_data *data,
>> + struct msi_msg *msg) {};
>> +static inline int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>> + struct hv_interrupt_entry *hvirqe)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>> +{
>> + return false;
>> +}
>> +
>> /* SMCCC hypercall parameters */
>> #define HV_SMCCC_FUNC_NUMBER 1
>> #define HV_FUNC_ID ARM_SMCCC_CALL_VAL( \
>> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
>> index 33017aa0caa4..e6eb457f791e 100644
>> --- a/arch/x86/hyperv/irqdomain.c
>> +++ b/arch/x86/hyperv/irqdomain.c
>> @@ -13,6 +13,16 @@
>> #include <linux/irqchip/irq-msi-lib.h>
>> #include <asm/mshyperv.h>
>>
>> +/*
>> + * For direct attached devices (which use logical device ids), hypervisor will
>> + * not allow mappings to host. But VFIO needs to bind the interrupt at the very
>> + * start before the guest cpu/vector is known. So we use dummy cpu and vector
>> + * to bind in such case, and later when the guest starts, retarget will move it
>> + * to correct guest cpu and vector.
>> + */
>> +#define HV_DDA_DUMMY_CPU 0
>> +#define HV_DDA_DUMMY_VECTOR 32
>> +
>> static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
>> bool level, int cpu, int vector,
>> struct hv_interrupt_entry *ret_entry)
>> @@ -24,6 +34,11 @@ static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
>> u64 status;
>> int nr_bank, var_size;
>>
>> + if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL) {
>> + cpu = HV_DDA_DUMMY_CPU;
>> + vector = HV_DDA_DUMMY_VECTOR;
>> + }
>> +
>> local_irq_save(flags);
>>
>> input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> @@ -95,7 +110,8 @@ static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
>> return hv_result_to_errno(status);
>> }
>>
>> -static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
>> +static int hv_unmap_interrupt(union hv_device_id hv_devid,
>> + struct hv_interrupt_entry *irq_entry)
>> {
>> unsigned long flags;
>> struct hv_input_unmap_device_interrupt *input;
>> @@ -103,10 +119,14 @@ static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
>>
>> local_irq_save(flags);
>> input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> -
>> memset(input, 0, sizeof(*input));
>> - input->partition_id = hv_current_partition_id;
>> - input->device_id = id;
>> +
>> + if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL)
>> + input->partition_id = hv_iommu_get_curr_partid();
>> + else
>> + input->partition_id = hv_current_partition_id;
>> +
>> + input->device_id = hv_devid.as_uint64;
>> input->interrupt_entry = *irq_entry;
>>
>> status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
>> @@ -263,6 +283,7 @@ static u64 hv_build_irq_devid(struct pci_dev *pdev)
>> int hv_map_msi_interrupt(struct irq_data *data,
>> struct hv_interrupt_entry *out_entry)
>> {
>> + u64 ptid;
>> struct irq_cfg *cfg = irqd_cfg(data);
>> struct hv_interrupt_entry dummy;
>> union hv_device_id hv_devid;
>> @@ -275,8 +296,17 @@ int hv_map_msi_interrupt(struct irq_data *data,
>> hv_devid.as_uint64 = hv_build_irq_devid(pdev);
>> cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>>
>> - return hv_map_interrupt(hv_current_partition_id, hv_devid, false, cpu,
>> - cfg->vector, out_entry ? out_entry : &dummy);
>> + if (hv_devid.device_type == HV_DEVICE_TYPE_LOGICAL)
>> + if (hv_pcidev_is_attached_dev(pdev))
>> + ptid = hv_iommu_get_curr_partid();
>> + else
>> + /* Device actually on l1vh root, not passthru'd to vm */
>
> l1vh and root are mutually exclusive partitions.
> If you wanted to highlight that it's l1vh itself and not its child guest, then
> "l1vh parent" term would do.
We've been loosely using "l1vh root" to mean "privileged l1vh" as opposed
to l1vh guests. I think that is fine. "l1vh parent" is confusing, as it may
also refer to the parent of l1vh, which would be the host. So as long as the
context is clear, we are ok.
>> + ptid = hv_current_partition_id;
>> + else
>> + ptid = hv_current_partition_id;
>
> Looks like the only special case is for attached logical devices,
> otherwise hv_current_partition_id is used.
> Can the logic simplified here?
Could be, but at the cost of upfront clarity. This nicely tells
the reader that a logical ID has different cases, whereas a PCI ID
does not. The end instructions are the same.
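For what it's worth, the two shapes can be checked for equivalence with a
small userspace model (all names and partition-id values below are
stand-ins for illustration, not the real hypervisor ABI):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the partition-id choice in hv_map_msi_interrupt(). */
enum { DEV_TYPE_PCI, DEV_TYPE_LOGICAL };
static const uint64_t CURR_PARTID = 1;	/* hv_current_partition_id stand-in */
static const uint64_t GUEST_PARTID = 7;	/* hv_iommu_get_curr_partid() stand-in */

/* The explicit form used in the patch: each logical-device case spelled out. */
static uint64_t ptid_explicit(int dev_type, bool attached)
{
	uint64_t ptid;

	if (dev_type == DEV_TYPE_LOGICAL) {
		if (attached)
			ptid = GUEST_PARTID;
		else	/* device on the l1vh root itself, not passthru'd */
			ptid = CURR_PARTID;
	} else {
		ptid = CURR_PARTID;
	}
	return ptid;
}

/* The collapsed form suggested in review: one special case, same result. */
static uint64_t ptid_collapsed(int dev_type, bool attached)
{
	if (dev_type == DEV_TYPE_LOGICAL && attached)
		return GUEST_PARTID;
	return CURR_PARTID;
}
```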
Thanks,
-Mukesh
> Thanks,
> Stanislav
>
>> +
>> + return hv_map_interrupt(ptid, hv_devid, false, cpu, cfg->vector,
>> + out_entry ? out_entry : &dummy);
>> }
>> EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
>>
>> @@ -289,10 +319,7 @@ static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
>> msg->data = entry->msi_entry.data.as_uint32;
>> }
>>
>> -static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>> - struct hv_interrupt_entry *irq_entry);
>> -
>> -static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>> +void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>> {
>> struct hv_interrupt_entry *stored_entry;
>> struct irq_cfg *cfg = irqd_cfg(data);
>> @@ -341,16 +368,18 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>> data->chip_data = stored_entry;
>> entry_to_msi_msg(data->chip_data, msg);
>> }
>> +EXPORT_SYMBOL_GPL(hv_irq_compose_msi_msg);
>>
>> -static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>> - struct hv_interrupt_entry *irq_entry)
>> +int hv_unmap_msi_interrupt(struct pci_dev *pdev,
>> + struct hv_interrupt_entry *irq_entry)
>> {
>> union hv_device_id hv_devid;
>>
>> hv_devid.as_uint64 = hv_build_irq_devid(pdev);
>>
>> - return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
>> + return hv_unmap_interrupt(hv_devid, irq_entry);
>> }
>> +EXPORT_SYMBOL_GPL(hv_unmap_msi_interrupt);
>>
>> /* NB: during map, hv_interrupt_entry is saved via data->chip_data */
>> static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
>> @@ -486,7 +515,7 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
>> hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
>> hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
>>
>> - return hv_unmap_interrupt(hv_devid.as_uint64, entry);
>> + return hv_unmap_interrupt(hv_devid, entry);
>> }
>> EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
>>
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index e4ccdbbf1d12..b6facd3a0f5e 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -204,11 +204,15 @@ static inline u64 hv_iommu_get_curr_partid(void)
>> #endif /* CONFIG_HYPERV_IOMMU */
>>
>> u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
>> +void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg);
>> +extern bool hv_no_attdev;
>>
>> struct irq_domain *hv_create_pci_msi_domain(void);
>>
>> int hv_map_msi_interrupt(struct irq_data *data,
>> struct hv_interrupt_entry *out_entry);
>> +int hv_unmap_msi_interrupt(struct pci_dev *dev,
>> + struct hv_interrupt_entry *hvirqe);
>> int hv_map_ioapic_interrupt(int ioapic_id, bool level, int vcpu, int vector,
>> struct hv_interrupt_entry *entry);
>> int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry);
>> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
>> index 40f0b06bb966..71d1599dc4a8 100644
>> --- a/drivers/pci/controller/pci-hyperv.c
>> +++ b/drivers/pci/controller/pci-hyperv.c
>> @@ -660,15 +660,17 @@ static void hv_irq_retarget_interrupt(struct irq_data *data)
>>
>> params = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> memset(params, 0, sizeof(*params));
>> - params->partition_id = HV_PARTITION_ID_SELF;
>> +
>> + if (hv_pcidev_is_attached_dev(pdev))
>> + params->partition_id = hv_iommu_get_curr_partid();
>> + else
>> + params->partition_id = HV_PARTITION_ID_SELF;
>> +
>> params->int_entry.source = HV_INTERRUPT_SOURCE_MSI;
>> - params->int_entry.msi_entry.address.as_uint32 = int_desc->address & 0xffffffff;
>> + params->int_entry.msi_entry.address.as_uint32 =
>> + int_desc->address & 0xffffffff;
>> params->int_entry.msi_entry.data.as_uint32 = int_desc->data;
>> - params->device_id = (hbus->hdev->dev_instance.b[5] << 24) |
>> - (hbus->hdev->dev_instance.b[4] << 16) |
>> - (hbus->hdev->dev_instance.b[7] << 8) |
>> - (hbus->hdev->dev_instance.b[6] & 0xf8) |
>> - PCI_FUNC(pdev->devfn);
>> + params->device_id = hv_pci_vmbus_device_id(pdev);
>> params->int_target.vector = hv_msi_get_int_vector(data);
>>
>> if (hbus->protocol_version >= PCI_PROTOCOL_VERSION_1_2) {
>> @@ -1263,6 +1265,15 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
>> mb();
>> }
>> spin_unlock_irqrestore(&hbus->config_lock, flags);
>> + /*
>> + * Make sure PCI_INTERRUPT_PIN is hard-wired to 0 since it may
>> + * be read using a 32bit read which is skipped by the above
>> + * emulation.
>> + */
>> + if (PCI_INTERRUPT_PIN >= where &&
>> + PCI_INTERRUPT_PIN <= (where + size)) {
>> + *((char *)val + PCI_INTERRUPT_PIN - where) = 0;
>> + }
>> } else {
>> dev_err(dev, "Attempt to read beyond a function's config space.\n");
>> }
>> @@ -1731,14 +1742,22 @@ static void hv_msi_free(struct irq_domain *domain, unsigned int irq)
>> if (!int_desc)
>> return;
>>
>> - irq_data->chip_data = NULL;
>> hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
>> if (!hpdev) {
>> + irq_data->chip_data = NULL;
>> kfree(int_desc);
>> return;
>> }
>>
>> - hv_int_desc_free(hpdev, int_desc);
>> + if (hv_pcidev_is_attached_dev(pdev)) {
>> + hv_unmap_msi_interrupt(pdev, irq_data->chip_data);
>> + kfree(irq_data->chip_data);
>> + irq_data->chip_data = NULL;
>> + } else {
>> + irq_data->chip_data = NULL;
>> + hv_int_desc_free(hpdev, int_desc);
>> + }
>> +
>> put_pcichild(hpdev);
>> }
>>
>> @@ -2139,6 +2158,56 @@ static void hv_vmbus_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>> msg->data = 0;
>> }
>>
>> +/* Compose an msi message for a directly attached device */
>> +static void hv_dda_compose_msi_msg(struct irq_data *irq_data,
>> + struct msi_desc *msi_desc,
>> + struct msi_msg *msg)
>> +{
>> + bool multi_msi;
>> + struct hv_pcibus_device *hbus;
>> + struct hv_pci_dev *hpdev;
>> + struct pci_dev *pdev = msi_desc_to_pci_dev(msi_desc);
>> +
>> + multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
>> + msi_desc->nvec_used > 1;
>> +
>> + if (multi_msi) {
>> + dev_err(&hbus->hdev->device,
>> + "Passthru direct attach does not support multi msi\n");
>> + goto outerr;
>> + }
>> +
>> + hbus = container_of(pdev->bus->sysdata, struct hv_pcibus_device,
>> + sysdata);
>> +
>> + hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
>> + if (!hpdev)
>> + goto outerr;
>> +
>> + /* will unmap if needed and also update irq_data->chip_data */
>> + hv_irq_compose_msi_msg(irq_data, msg);
>> +
>> + put_pcichild(hpdev);
>> + return;
>> +
>> +outerr:
>> + memset(msg, 0, sizeof(*msg));
>> +}
>> +
>> +static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>> +{
>> + struct pci_dev *pdev;
>> + struct msi_desc *msi_desc;
>> +
>> + msi_desc = irq_data_get_msi_desc(data);
>> + pdev = msi_desc_to_pci_dev(msi_desc);
>> +
>> + if (hv_pcidev_is_attached_dev(pdev))
>> + hv_dda_compose_msi_msg(data, msi_desc, msg);
>> + else
>> + hv_vmbus_compose_msi_msg(data, msg);
>> +}
>> +
>> static bool hv_pcie_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
>> struct irq_domain *real_parent, struct msi_domain_info *info)
>> {
>> @@ -2177,7 +2246,7 @@ static const struct msi_parent_ops hv_pcie_msi_parent_ops = {
>> /* HW Interrupt Chip Descriptor */
>> static struct irq_chip hv_msi_irq_chip = {
>> .name = "Hyper-V PCIe MSI",
>> - .irq_compose_msi_msg = hv_vmbus_compose_msi_msg,
>> + .irq_compose_msi_msg = hv_compose_msi_msg,
>> .irq_set_affinity = irq_chip_set_affinity_parent,
>> .irq_ack = irq_chip_ack_parent,
>> .irq_eoi = irq_chip_eoi_parent,
>> @@ -4096,7 +4165,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
>> irq_data = irq_get_irq_data(entry->irq);
>> if (WARN_ON_ONCE(!irq_data))
>> return -EINVAL;
>> - hv_vmbus_compose_msi_msg(irq_data, &entry->msg);
>> + hv_compose_msi_msg(irq_data, &entry->msg);
>> }
>> return 0;
>> }
>> --
>> 2.51.2.vfs.0.1
>>
* Re: [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl
2026-01-23 18:34 ` Nuno Das Neves
@ 2026-01-24 2:12 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 2:12 UTC (permalink / raw)
To: Nuno Das Neves, linux-kernel, linux-hyperv, linux-arm-kernel,
iommu, linux-pci, linux-arch
Cc: kys, haiyangz, wei.liu, decui, longli, catalin.marinas, will,
tglx, mingo, bp, dave.hansen, hpa, joro, lpieralisi, kwilczynski,
mani, robh, bhelgaas, arnd, mhklinux, romank
On 1/23/26 10:34, Nuno Das Neves wrote:
> On 1/19/2026 10:42 PM, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> VFIO no longer puts the mmio pfn in vma->vm_pgoff. So, remove code
>> that is using it to map mmio space. It is broken and will cause
>> panic.
>
> What is the reason for having this as a separate commit from patch 15?
> It seems like removing this code and adding the mmio intercept
> handling could be done in one patch.
Just ease of review, and of porting patches from this branch to that
branch and from that release to this release... I am sure someone would
have asked for this to be a separate patch :).
Thanks,
-Mukesh
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_root_main.c | 20 ++++----------------
>> 1 file changed, 4 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index 27313419828d..03f3aa9f5541 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -1258,16 +1258,8 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
>> }
>>
>> /*
>> - * This maps two things: guest RAM and for pci passthru mmio space.
>> - *
>> - * mmio:
>> - * - vfio overloads vm_pgoff to store the mmio start pfn/spa.
>> - * - Two things need to happen for mapping mmio range:
>> - * 1. mapped in the uaddr so VMM can access it.
>> - * 2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
>> - *
>> - * This function takes care of the second. The first one is managed by vfio,
>> - * and hence is taken care of via vfio_pci_mmap_fault().
>> + * This is called for both user ram and mmio space. The mmio space is not
>> + * mapped here, but later during intercept.
>> */
>> static long
>> mshv_map_user_memory(struct mshv_partition *partition,
>> @@ -1276,7 +1268,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
>> struct mshv_mem_region *region;
>> struct vm_area_struct *vma;
>> bool is_mmio;
>> - ulong mmio_pfn;
>> long ret;
>>
>> if (mem.flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
>> @@ -1286,7 +1277,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
>> mmap_read_lock(current->mm);
>> vma = vma_lookup(current->mm, mem.userspace_addr);
>> is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>> - mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
>> mmap_read_unlock(current->mm);
>>
>> if (!vma)
>> @@ -1313,10 +1303,8 @@ mshv_map_user_memory(struct mshv_partition *partition,
>> HV_MAP_GPA_NO_ACCESS, NULL);
>> break;
>> case MSHV_REGION_TYPE_MMIO:
>> - ret = hv_call_map_mmio_pages(partition->pt_id,
>> - region->start_gfn,
>> - mmio_pfn,
>> - region->nr_pages);
>> + /* mmio mappings are handled later during intercepts */
>> + ret = 0;
>> break;
>> }
>>
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-21 1:53 ` Stanislav Kinsburskii
@ 2026-01-24 2:19 ` Mukesh R
2026-01-26 18:15 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-24 2:19 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Upon guest access, in case of missing mmio mapping, the hypervisor
>> generates an unmapped gpa intercept. In this path, lookup the PCI
>> resource pfn for the guest gpa, and ask the hypervisor to map it
>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>> and obtained via fixup_user_fault call (similar to KVM).
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>> drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 115 insertions(+)
>>
>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>> index 03f3aa9f5541..4c8bc7cd0888 100644
>> --- a/drivers/hv/mshv_root_main.c
>> +++ b/drivers/hv/mshv_root_main.c
>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>> };
>> } __packed;
>>
>> +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
>> +static int __init setup_hv_full_mmio(char *str)
>> +{
>> + hv_nofull_mmio = true;
>> + return 0;
>> +}
>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>> +
>> struct mshv_root mshv_root;
>>
>> enum hv_scheduler_type hv_scheduler_type;
>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>> }
>>
>> #ifdef CONFIG_X86_64
>> +
>> +/*
>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>> + * else just return -errno.
>> + */
>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>> + u64 *mmio_pfnp)
>> +{
>> + struct vm_area_struct *vma;
>> + bool is_mmio;
>> + u64 uaddr;
>> + struct mshv_mem_region *mreg;
>> + struct follow_pfnmap_args pfnmap_args;
>> + int rc = -EINVAL;
>> +
>> + /*
>> + * Do not allow mem region to be deleted beneath us. VFIO uses
>> + * useraddr vma to lookup pci bar pfn.
>> + */
>> + spin_lock(&pt->pt_mem_regions_lock);
>> +
>> + /* Get the region again under the lock */
>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>> + goto unlock_pt_out;
>> +
>> + uaddr = mreg->start_uaddr +
>> + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>> +
>> + mmap_read_lock(current->mm);
>
> Semaphore can't be taken under spinlock.
> Get it instead.
Yeah, something didn't feel right here and I meant to recheck; now I regret
rushing to submit the patch.
Rethinking it, I think the pt_mem_regions_lock is not needed to protect
the uaddr, because unmap will properly serialize via the mm lock.
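The fault-and-retry sequence at issue here (follow_pfnmap_start() failing
until fixup_user_fault() populates the PTE, then one retry) can be modeled
in user space. This is a hypothetical sketch, not the kernel code: the
sim_* names and the identity pfn mapping are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool page_present;	/* stands in for the PTE being populated */

static int sim_follow_pfnmap(uint64_t uaddr, uint64_t *pfn)
{
	if (!page_present)
		return -1;		/* no PTE yet */
	*pfn = uaddr >> 12;		/* identity "mapping" for the model */
	return 0;
}

static int sim_fixup_user_fault(void)
{
	page_present = true;		/* fault handler populates the PTE */
	return 0;
}

/* Mirrors the retry structure of mshv_chk_get_mmio_start_pfn(): try the
 * lookup, fault the page in on failure, then try exactly once more. */
int sim_get_mmio_pfn(uint64_t uaddr, uint64_t *pfn)
{
	int rc = sim_follow_pfnmap(uaddr, pfn);

	if (rc) {
		rc = sim_fixup_user_fault();
		if (rc)
			return rc;
		rc = sim_follow_pfnmap(uaddr, pfn);
	}
	return rc;
}
```

The single retry is deliberate: after a successful fixup the PTE must be
present, so a second failure indicates a real error rather than a race.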
>> + vma = vma_lookup(current->mm, uaddr);
>> + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>
> Why this check is needed again?
To make sure the region did not change. This check is under the lock.
> The region type is stored on the region itself.
> And the type is checked on the caller side.
>
>> + if (!is_mmio)
>> + goto unlock_mmap_out;
>> +
>> + pfnmap_args.vma = vma;
>> + pfnmap_args.address = uaddr;
>> +
>> + rc = follow_pfnmap_start(&pfnmap_args);
>> + if (rc) {
>> + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>> + NULL);
>> + if (rc)
>> + goto unlock_mmap_out;
>> +
>> + rc = follow_pfnmap_start(&pfnmap_args);
>> + if (rc)
>> + goto unlock_mmap_out;
>> + }
>> +
>> + *mmio_pfnp = pfnmap_args.pfn;
>> + follow_pfnmap_end(&pfnmap_args);
>> +
>> +unlock_mmap_out:
>> + mmap_read_unlock(current->mm);
>> +unlock_pt_out:
>> + spin_unlock(&pt->pt_mem_regions_lock);
>> + return rc;
>> +}
>> +
>> +/*
>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>> + * and resolve if possible.
>> + * Returns: True if valid mmio intercept and it was handled, else false
>> + */
>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>> +{
>> + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>> + struct hv_x64_memory_intercept_message *msg;
>> + union hv_x64_memory_access_info accinfo;
>> + u64 gfn, mmio_spa, numpgs;
>> + struct mshv_mem_region *mreg;
>> + int rc;
>> + struct mshv_partition *pt = vp->vp_partition;
>> +
>> + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>> + accinfo = msg->memory_access_info;
>> +
>> + if (!accinfo.gva_gpa_valid)
>> + return false;
>> +
>> + /* Do a fast check and bail if non mmio intercept */
>> + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>
> This call needs to be protected by the spinlock.
This is a fast path to bail early on non-mmio intercepts. We recheck under the partition lock above.
Thanks,
-Mukesh
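The pattern described here, an unlocked fast-path check followed by an
authoritative recheck under the lock, can be sketched as a small user-space
simulation. All names below are hypothetical; the toy table and counter
lock stand in for the real region list and spinlock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum region_type { REGION_RAM, REGION_MMIO };

static enum region_type table[2] = { REGION_RAM, REGION_MMIO };
static int lock_depth;			/* stand-in for the spinlock */

static void lock(void)   { lock_depth++; }
static void unlock(void) { lock_depth--; }

static enum region_type *lookup(uint64_t gfn)
{
	return gfn < 2 ? &table[gfn] : NULL;
}

/* The unlocked fast path bails cheaply for non-MMIO gfns; a hit is then
 * re-validated under the lock before the expensive mapping work runs. */
int handle_unmapped(uint64_t gfn)
{
	enum region_type *r = lookup(gfn);	/* unlocked, may race */
	int handled = 0;

	if (!r || *r != REGION_MMIO)
		return 0;			/* fast bail-out */

	lock();
	r = lookup(gfn);			/* authoritative recheck */
	if (r && *r == REGION_MMIO)
		handled = 1;			/* map the MMIO range here */
	unlock();
	return handled;
}
```

The design choice being debated is exactly this: the unlocked lookup may
return stale data, which is acceptable only because every positive result
is rechecked under the lock before being acted on.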
> Thanks,
> Stanislav
>
>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>> + return false;
>> +
>> + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>> + if (rc)
>> + return false;
>> +
>> + if (!hv_nofull_mmio) { /* default case */
>> + gfn = mreg->start_gfn;
>> + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>> + numpgs = mreg->nr_pages;
>> + } else
>> + numpgs = 1;
>> +
>> + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>> +
>> + return rc == 0;
>> +}
>> +
>> static struct mshv_mem_region *
>> mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>> {
>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>
>> return ret;
>> }
>> +
>> #else /* CONFIG_X86_64 */
>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>> static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>> #endif /* CONFIG_X86_64 */
>>
>> static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>> {
>> switch (vp->vp_intercept_msg_page->header.message_type) {
>> + case HVMSG_UNMAPPED_GPA:
>> + return mshv_handle_unmapped_gpa(vp);
>> case HVMSG_GPA_INTERCEPT:
>> return mshv_handle_gpa_intercept(vp);
>> }
>> --
>> 2.51.2.vfs.0.1
>>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 00/15] PCI passthru on Hyper-V (Part I)
2026-01-20 21:50 ` [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Jacob Pan
@ 2026-01-24 2:27 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-24 2:27 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/20/26 13:50, Jacob Pan wrote:
> Hi Mukesh,
>
> On Mon, 19 Jan 2026 22:42:15 -0800
> Mukesh R <mrathor@linux.microsoft.com> wrote:
>
>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>
>> Implement passthru of PCI devices to unprivileged virtual machines
>> (VMs) when Linux is running as a privileged VM on Microsoft Hyper-V
>> hypervisor. This support is made to fit within the workings of VFIO
>> framework, and any VMM needing to use it must use the VFIO subsystem.
>> This supports both full device passthru and SR-IOV based VFs.
>>
>> There are 3 cases where Linux can run as a privileged VM (aka MSHV):
>> Baremetal root (meaning Hyper-V+Linux), L1VH, and Nested.
>>
> I think some introduction/background to L1VH would help.
Ok, I can add something, but L1VH was very well introduced already; search
the mshv commits for "l1vh".
>> At a high level, the hypervisor supports traditional mapped iommu
>> domains that use explicit map and unmap hypercalls for mapping and
>> unmapping guest RAM into the iommu subsystem.
> It may be clearer to state that the hypervisor supports Linux IOMMU
> paging domains through map/unmap hypercalls, mapping GPAs to HPAs using
> stage?2 I/O page tables.
sure.
>> Hyper-V also has a
>> concept of direct attach devices whereby the iommu subsystem simply
>> uses the guest HW page table (ept/npt/..). This series adds support
>> for both, and both are made to work in VFIO type1 subsystem.
>>
> This may warrant introducing a new IOMMU domain feature flag, as it
> performs mappings but does not support map/unmap semantics in the same
> way as a paging domain.
Yeah, I was hoping we could get by without it for now. At least in the case
of Cloud Hypervisor, the entire guest RAM is mapped anyway. We can document
it and work on enhancements, which are much easier once we have a baseline.
For now, it's a paging domain with all pages pinned.. :).
>> While this Part I focuses on memory mappings, upcoming Part II
>> will focus on irq bypass along with some minor irq remapping
>> updates.
>>
>> This patch series was tested using Cloud Hypervisor version 48. Qemu
>> support of MSHV is in the works, and that will be extended to include
>> PCI passthru and SR-IOV support in the near future.
>>
>> Based on: 8f0b4cce4481 (origin/hyperv-next)
>>
>> Thanks,
>> -Mukesh
>>
>> Mukesh Rathor (15):
>> iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
>> x86/hyperv: cosmetic changes in irqdomain.c for readability
>> x86/hyperv: add insufficient memory support in irqdomain.c
>> mshv: Provide a way to get partition id if running in a VMM process
>> mshv: Declarations and definitions for VFIO-MSHV bridge device
>> mshv: Implement mshv bridge device for VFIO
>> mshv: Add ioctl support for MSHV-VFIO bridge device
>> PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
>> mshv: Import data structs around device domains and irq remapping
>> PCI: hv: Build device id for a VMBus device
>> x86/hyperv: Build logical device ids for PCI passthru hcalls
>> x86/hyperv: Implement hyperv virtual iommu
>> x86/hyperv: Basic interrupt support for direct attached devices
>> mshv: Remove mapping of mmio space during map user ioctl
>> mshv: Populate mmio mappings for PCI passthru
>>
>> MAINTAINERS | 1 +
>> arch/arm64/include/asm/mshyperv.h | 15 +
>> arch/x86/hyperv/irqdomain.c | 314 ++++++---
>> arch/x86/include/asm/mshyperv.h | 21 +
>> arch/x86/kernel/pci-dma.c | 2 +
>> drivers/hv/Makefile | 3 +-
>> drivers/hv/mshv_root.h | 24 +
>> drivers/hv/mshv_root_main.c | 296 +++++++-
>> drivers/hv/mshv_vfio.c | 210 ++++++
>> drivers/iommu/Kconfig | 1 +
>> drivers/iommu/Makefile | 2 +-
>> drivers/iommu/hyperv-iommu.c | 1004
>> +++++++++++++++++++++------ drivers/iommu/hyperv-irq.c |
>> 330 +++++++++ drivers/pci/controller/pci-hyperv.c | 207 ++++--
>> include/asm-generic/mshyperv.h | 1 +
>> include/hyperv/hvgdk_mini.h | 11 +
>> include/hyperv/hvhdk_mini.h | 112 +++
>> include/linux/hyperv.h | 6 +
>> include/uapi/linux/mshv.h | 31 +
>> 19 files changed, 2182 insertions(+), 409 deletions(-)
>> create mode 100644 drivers/hv/mshv_vfio.c
>> create mode 100644 drivers/iommu/hyperv-irq.c
>>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-24 1:26 ` Mukesh R
@ 2026-01-26 15:57 ` Stanislav Kinsburskii
2026-01-27 3:02 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-26 15:57 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > >
> > > Add a new file to implement management of device domains, mapping and
> > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > parent. This also implements direct attach mechanism for PCI passthru,
> > > and it is also made to work within the VFIO framework.
> > >
> > > At a high level, during boot the hypervisor creates a default identity
> > > domain and attaches all devices to it. This nicely maps to Linux iommu
> > > subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > PCI passthru:
> > >
> > > 1. Device Domain: root must create a device domain in the hypervisor,
> > > and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > All hypervisor communications use device id of type PCI for
> > > identifying and referencing the device.
> > >
> > > 2. Direct Attach: the hypervisor will simply use the guest's HW
> > > page table for mappings, thus the host need not do map/unmap
> > > device memory hypercalls. As such, direct attach passthru setup
> > > during guest boot is extremely fast. A direct attached device
> > > must be referenced via logical device id and not via the PCI
> > > device id.
> > >
> > > At present, L1VH root/parent only supports direct attaches. Also direct
> > > attach is default in non-L1VH cases because there are some significant
> > > performance issues with device domain implementation currently for guests
> > > with higher RAM (say more than 8GB), and that unfortunately cannot be
> > > addressed in the short term.
> > >
> >
> > <snip>
> >
<snip>
> > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > +{
> > > + struct pci_dev *pdev;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > +
> > > + /* See the attach function, only PCI devices for now */
> > > + if (!dev_is_pci(dev))
> > > + return;
> > > +
> > > + if (hvdom->num_attchd == 0)
> > > + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > +
> > > + pdev = to_pci_dev(dev);
> > > +
> > > + if (hvdom->attached_dom) {
> > > + hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > +
> > > + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > + * next.
> > > + */
> > > + } else {
> > > + hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > + }
> > > +
> > > + hvdom->num_attchd--;
> >
> > Shouldn't this be modified iff the detach succeeded?
>
> We want to still free the domain and not let it get stuck. The purpose
> is more to make sure detach was called before domain free.
>
How can one debug subsequent errors if num_attchd is decremented
unconditionally? In reality the device is left attached, but the related
kernel metadata is gone.
> > > +}
> > > +
> > > +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> > > + unsigned long iova, phys_addr_t paddr,
> > > + size_t size, u32 flags)
> > > +{
> > > + unsigned long irqflags;
> > > + struct hv_iommu_mapping *mapping;
> > > +
> > > + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> > > + if (!mapping)
> > > + return -ENOMEM;
> > > +
> > > + mapping->paddr = paddr;
> > > + mapping->iova.start = iova;
> > > + mapping->iova.last = iova + size - 1;
> > > + mapping->flags = flags;
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> > > + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> > > + unsigned long iova, size_t size)
> > > +{
> > > + unsigned long flags;
> > > + size_t unmapped = 0;
> > > + unsigned long last = iova + size - 1;
> > > + struct hv_iommu_mapping *mapping = NULL;
> > > + struct interval_tree_node *node, *next;
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> > > + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> > > + while (next) {
> > > + node = next;
> > > + mapping = container_of(node, struct hv_iommu_mapping, iova);
> > > + next = interval_tree_iter_next(node, iova, last);
> > > +
> > > + /* Trying to split a mapping? Not supported for now. */
> > > + if (mapping->iova.start < iova)
> > > + break;
> > > +
> > > + unmapped += mapping->iova.last - mapping->iova.start + 1;
> > > +
> > > + interval_tree_remove(node, &hvdom->mappings_tree);
> > > + kfree(mapping);
> > > + }
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> > > +
> > > + return unmapped;
> > > +}
> > > +
> > > +/* Return: must return exact status from the hypercall without changes */
> > > +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> > > + unsigned long iova, phys_addr_t paddr,
> > > + unsigned long npages, u32 map_flags)
> > > +{
> > > + u64 status;
> > > + int i;
> > > + struct hv_input_map_device_gpa_pages *input;
> > > + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> > > +
> > > + local_irq_save(flags);
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> > > + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> > > + input->device_domain.domain_id.id = hvdom->domid_num;
> > > + input->map_flags = map_flags;
> > > + input->target_device_va_base = iova;
> > > +
> > > + pfn = paddr >> HV_HYP_PAGE_SHIFT;
> > > + for (i = 0; i < npages; i++, pfn++)
> > > + input->gpa_page_list[i] = pfn;
> > > +
> > > + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
> > > + input, NULL);
> > > +
> > > + local_irq_restore(flags);
> > > + return status;
> > > +}
> > > +
> > > +/*
> > > + * The core VFIO code loops over memory ranges calling this function with
> > > + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
> > > + */
> > > +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> > > + phys_addr_t paddr, size_t pgsize, size_t pgcount,
> > > + int prot, gfp_t gfp, size_t *mapped)
> > > +{
> > > + u32 map_flags;
> > > + int ret;
> > > + u64 status;
> > > + unsigned long npages, done = 0;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > + size_t size = pgsize * pgcount;
> > > +
> > > + map_flags = HV_MAP_GPA_READABLE; /* required */
> > > + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> > > +
> > > + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (hvdom->attached_dom) {
> > > + *mapped = size;
> > > + return 0;
> > > + }
> > > +
> > > + npages = size >> HV_HYP_PAGE_SHIFT;
> > > + while (done < npages) {
> > > + ulong completed, remain = npages - done;
> > > +
> > > + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
> > > + map_flags);
> > > +
> > > + completed = hv_repcomp(status);
> > > + done = done + completed;
> > > + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> > > + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> > > +
> > > + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> > > + ret = hv_call_deposit_pages(NUMA_NO_NODE,
> > > + hv_current_partition_id,
> > > + 256);
> > > + if (ret)
> > > + break;
> > > + }
> > > + if (!hv_result_success(status))
> > > + break;
> > > + }
> > > +
> > > + if (!hv_result_success(status)) {
> > > + size_t done_size = done << HV_HYP_PAGE_SHIFT;
> > > +
> > > + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> > > + done, npages, iova);
> > > + /*
> > > + * lookup tree has all mappings [0 - size-1]. Below unmap will
> > > + * only remove from [0 - done], we need to remove second chunk
> > > + * [done+1 - size-1].
> > > + */
> > > + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
> > > + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
> > > + done, NULL);
> > > + if (mapped)
> > > + *mapped = 0;
> > > + } else
> > > + if (mapped)
> > > + *mapped = size;
> > > +
> > > + return hv_result_to_errno(status);
> > > +}
> > > +
> > > +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> > > + size_t pgsize, size_t pgcount,
> > > + struct iommu_iotlb_gather *gather)
> > > +{
> > > + unsigned long flags, npages;
> > > + struct hv_input_unmap_device_gpa_pages *input;
> > > + u64 status;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > + size_t unmapped, size = pgsize * pgcount;
> > > +
> > > + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> > > + if (unmapped < size)
> > > + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
> > > + __func__, iova, unmapped, size);
> > > +
> > > + if (hvdom->attached_dom)
> > > + return size;
> > > +
> > > + npages = size >> HV_HYP_PAGE_SHIFT;
> > > +
> > > + local_irq_save(flags);
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + memset(input, 0, sizeof(*input));
> > > +
> > > + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> > > + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> > > + input->device_domain.domain_id.id = hvdom->domid_num;
> > > + input->target_device_va_base = iova;
> > > +
> > > + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> > > + 0, input, NULL);
> > > + local_irq_restore(flags);
> > > +
> > > + if (!hv_result_success(status))
> > > + hv_status_err(status, "\n");
> > > +
> >
> > There is some inconsistency in namings and behaviour of paired
> > functions:
> > 1. The pair of hv_iommu_unmap_pages is called hv_iommu_map_pgs
>
> The pair of hv_iommu_unmap_pages is hv_iommu_map_pages right above.
> hv_iommu_map_pgs could be renamed to hv_iommu_map_pgs_hcall I suppose.
>
hv_iommu_map_pages is a wrapper around hv_iommu_map_pgs, while
hv_iommu_unmap_pages is a wrapper around the corresponding hypercall.
That's the inconsistency I meant.
> > 2. hv_iommu_map_pgs doesn't print status in case of error.
>
> it does:
> hv_status_err(status, "\n"); <==============
It does not. I guess you are confusing it with some other function.
Here is the function:
>
>
> > It would be much better to keep this code consistent.
> >
> > > + return unmapped;
> > > +}
> > > +
> > > +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
> > > + dma_addr_t iova)
> > > +{
> > > + u64 paddr = 0;
> > > + unsigned long flags;
> > > + struct hv_iommu_mapping *mapping;
> > > + struct interval_tree_node *node;
> > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > +
> > > + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> > > + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
> > > + if (node) {
> > > + mapping = container_of(node, struct hv_iommu_mapping, iova);
> > > + paddr = mapping->paddr + (iova - mapping->iova.start);
> > > + }
> > > + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> > > +
> > > + return paddr;
> > > +}
> > > +
> > > +/*
> > > + * Currently, hypervisor does not provide list of devices it is using
> > > + * dynamically. So use this to allow users to manually specify devices that
> > > + * should be skipped. (eg. hypervisor debugger using some network device).
> > > + */
> > > +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> > > +{
> > > + if (!dev_is_pci(dev))
> > > + return ERR_PTR(-ENODEV);
> > > +
> > > + if (pci_devs_to_skip && *pci_devs_to_skip) {
> > > + int rc, pos = 0;
> > > + int parsed;
> > > + int segment, bus, slot, func;
> > > + struct pci_dev *pdev = to_pci_dev(dev);
> > > +
> > > + do {
> > > + parsed = 0;
> > > +
> > > + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> > > + &segment, &bus, &slot, &func, &parsed);
> > > + if (rc)
> > > + break;
> > > + if (parsed <= 0)
> > > + break;
> > > +
> > > + if (pci_domain_nr(pdev->bus) == segment &&
> > > + pdev->bus->number == bus &&
> > > + PCI_SLOT(pdev->devfn) == slot &&
> > > + PCI_FUNC(pdev->devfn) == func) {
> > > +
> > > + dev_info(dev, "skipped by Hyper-V IOMMU\n");
> > > + return ERR_PTR(-ENODEV);
> > > + }
> > > + pos += parsed;
> > > +
> > > + } while (pci_devs_to_skip[pos]);
> > > + }
> > > +
> > > + /* Device will be explicitly attached to the default domain, so no need
> > > + * to do dev_iommu_priv_set() here.
> > > + */
> > > +
> > > + return &hv_virt_iommu;
> > > +}
> > > +
> > > +static void hv_iommu_probe_finalize(struct device *dev)
> > > +{
> > > + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
> > > +
> > > + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> > > + iommu_setup_dma_ops(dev);
> > > + else
> > > + set_dma_ops(dev, NULL);
> > > +}
> > > +
> > > +static void hv_iommu_release_device(struct device *dev)
> > > +{
> > > + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> > > +
> > > + /* Need to detach device from device domain if necessary. */
> > > + if (hvdom)
> > > + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> > > +
> > > + dev_iommu_priv_set(dev, NULL);
> > > + set_dma_ops(dev, NULL);
> > > +}
> > > +
> > > +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> > > +{
> > > + if (dev_is_pci(dev))
> > > + return pci_device_group(dev);
> > > + else
> > > + return generic_device_group(dev);
> > > +}
> > > +
> > > +static int hv_iommu_def_domain_type(struct device *dev)
> > > +{
> > > + /* The hypervisor always creates this by default during boot */
> > > + return IOMMU_DOMAIN_IDENTITY;
> > > +}
> > > +
> > > +static struct iommu_ops hv_iommu_ops = {
> > > + .capable = hv_iommu_capable,
> > > + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
> > > + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
> > > + .probe_device = hv_iommu_probe_device,
> > > + .probe_finalize = hv_iommu_probe_finalize,
> > > + .release_device = hv_iommu_release_device,
> > > + .def_domain_type = hv_iommu_def_domain_type,
> > > + .device_group = hv_iommu_device_group,
> > > + .default_domain_ops = &(const struct iommu_domain_ops) {
> > > + .attach_dev = hv_iommu_attach_dev,
> > > + .map_pages = hv_iommu_map_pages,
> > > + .unmap_pages = hv_iommu_unmap_pages,
> > > + .iova_to_phys = hv_iommu_iova_to_phys,
> > > + .free = hv_iommu_domain_free,
> > > + },
> > > + .owner = THIS_MODULE,
> > > +};
> > > +
> > > +static void __init hv_initialize_special_domains(void)
> > > +{
> > > + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> > > + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
> >
> > hv_def_identity_dom is a static global variable.
> > Why not initialize hv_def_identity_dom upon definition instead of
> > introducing a new function?
>
> Originally, it was a function. I changed it to static initialization, but
> during the 6.6 review I changed it back to a function. I can't remember
> why, but it is pretty harmless. We may add more domains to the
> initialization in the future, for example a null domain.
>
> > > +}
> > > +
> > > +static int __init hv_iommu_init(void)
> > > +{
> > > + int ret;
> > > + struct iommu_device *iommup = &hv_virt_iommu;
> > > +
> > > + if (!hv_is_hyperv_initialized())
> > > + return -ENODEV;
> > > +
> > > + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
> > > + if (ret) {
> > > + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
> > > + return ret;
> > > + }
> > > +
> > > + /* This must come before iommu_device_register because the latter calls
> > > + * into the hooks.
> > > + */
> > > + hv_initialize_special_domains();
> > > +
> > > + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> >
> > It looks weird to initialize an object after creating sysfs entries for
> > it.
> > It should be the other way around.
>
> Not sure it should be; it is much easier to remove the sysfs entry than to
> do other cleanup, even though iommu_device_unregister is there. I am sure
> we'll add more code here, which is probably why it was originally done
> this way.
>
Sysfs provides user space access to kernel objects. If the object is not
initialized, it's not only a useless sysfs entry but also a potential cause
of a kernel panic if user space tries to access the entry before the object
is initialized.
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> ... snip........
^ permalink raw reply [flat|nested] 69+ messages in thread
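The partial-completion retry loop in hv_iommu_map_pages discussed in this
thread (advance by the rep-completed count, deposit pages on
HV_STATUS_INSUFFICIENT_MEMORY, retry the remainder) can be modeled in user
space. The sketch below is a simulation under assumed behavior; the sim_*
names, status codes, and page budgets are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define HV_OK			0
#define HV_INSUFFICIENT_MEMORY	1

static unsigned hv_mem = 3;	/* pages the "hypervisor" can map right now */

/* Toy rep hypercall: consumes one unit of hypervisor memory per mapped
 * page and reports how many reps completed before it ran out. */
static int sim_map_pages(unsigned npages, unsigned *completed)
{
	unsigned done = npages < hv_mem ? npages : hv_mem;

	hv_mem -= done;
	*completed = done;
	return done == npages ? HV_OK : HV_INSUFFICIENT_MEMORY;
}

static void sim_deposit_pages(unsigned n)
{
	hv_mem += n;		/* hv_call_deposit_pages() analogue */
}

/* Mirrors the loop shape in hv_iommu_map_pages: always advance by the
 * completed count, replenish on insufficient memory, retry the rest. */
unsigned sim_map_all(unsigned npages)
{
	unsigned done = 0;

	while (done < npages) {
		unsigned completed;
		int status = sim_map_pages(npages - done, &completed);

		done += completed;
		if (status == HV_INSUFFICIENT_MEMORY)
			sim_deposit_pages(4);
		else if (status != HV_OK)
			break;
	}
	return done;
}
```

The key property the loop preserves is that iova/paddr bookkeeping advances
only by the rep-completed count, so a partially successful hypercall never
causes pages to be mapped twice or skipped.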
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-24 2:19 ` Mukesh R
@ 2026-01-26 18:15 ` Stanislav Kinsburskii
2026-01-27 3:07 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-26 18:15 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > >
> > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > generates an unmapped gpa intercept. In this path, lookup the PCI
> > > resource pfn for the guest gpa, and ask the hypervisor to map it
> > > via hypercall. The PCI resource pfn is maintained by the VFIO driver,
> > > and obtained via fixup_user_fault call (similar to KVM).
> > >
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > > drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 115 insertions(+)
> > >
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > };
> > > } __packed;
> > > +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> > > +static int __init setup_hv_full_mmio(char *str)
> > > +{
> > > + hv_nofull_mmio = true;
> > > + return 0;
> > > +}
> > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > +
> > > struct mshv_root mshv_root;
> > > enum hv_scheduler_type hv_scheduler_type;
> > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > }
> > > #ifdef CONFIG_X86_64
> > > +
> > > +/*
> > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > + * else just return -errno.
> > > + */
> > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > + u64 *mmio_pfnp)
> > > +{
> > > + struct vm_area_struct *vma;
> > > + bool is_mmio;
> > > + u64 uaddr;
> > > + struct mshv_mem_region *mreg;
> > > + struct follow_pfnmap_args pfnmap_args;
> > > + int rc = -EINVAL;
> > > +
> > > + /*
> > > + * Do not allow mem region to be deleted beneath us. VFIO uses
> > > + * useraddr vma to lookup pci bar pfn.
> > > + */
> > > + spin_lock(&pt->pt_mem_regions_lock);
> > > +
> > > + /* Get the region again under the lock */
> > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > + goto unlock_pt_out;
> > > +
> > > + uaddr = mreg->start_uaddr +
> > > + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > +
> > > + mmap_read_lock(current->mm);
> >
> > Semaphore can't be taken under spinlock.
>
> Yeah, something didn't feel right here and I meant to recheck; now I regret
> rushing to submit the patch.
>
> Rethinking, I think the pt_mem_regions_lock is not needed to protect
> the uaddr because unmap will properly serialize via the mm lock.
>
>
> > > + vma = vma_lookup(current->mm, uaddr);
> > > + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> >
> > Why this check is needed again?
>
> To make sure region did not change. This check is under lock.
>
How can this happen? One can't change VMA type without unmapping it
first. And unmapping it leads to a kernel MMIO region state dangling
around without corresponding user space mapping.
This is similar to dangling pinned regions and should likely be
addressed the same way by utilizing MMU notifiers to destroy memory
regions if the VMA is detached.
> > The region type is stored on the region itself.
> > And the type is checked on the caller side.
> >
> > > + if (!is_mmio)
> > > + goto unlock_mmap_out;
> > > +
> > > + pfnmap_args.vma = vma;
> > > + pfnmap_args.address = uaddr;
> > > +
> > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > + if (rc) {
> > > + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > + NULL);
> > > + if (rc)
> > > + goto unlock_mmap_out;
> > > +
> > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > + if (rc)
> > > + goto unlock_mmap_out;
> > > + }
> > > +
> > > + *mmio_pfnp = pfnmap_args.pfn;
> > > + follow_pfnmap_end(&pfnmap_args);
> > > +
> > > +unlock_mmap_out:
> > > + mmap_read_unlock(current->mm);
> > > +unlock_pt_out:
> > > + spin_unlock(&pt->pt_mem_regions_lock);
> > > + return rc;
> > > +}
> > > +
> > > +/*
> > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > + * and resolve if possible.
> > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > + */
> > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > +{
> > > + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > + struct hv_x64_memory_intercept_message *msg;
> > > + union hv_x64_memory_access_info accinfo;
> > > + u64 gfn, mmio_spa, numpgs;
> > > + struct mshv_mem_region *mreg;
> > > + int rc;
> > > + struct mshv_partition *pt = vp->vp_partition;
> > > +
> > > + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > + accinfo = msg->memory_access_info;
> > > +
> > > + if (!accinfo.gva_gpa_valid)
> > > + return false;
> > > +
> > > + /* Do a fast check and bail if non mmio intercept */
> > > + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> >
> > This call needs to be protected by the spinlock.
>
> This is sorta fast path to bail. We recheck under partition lock above.
>
Accessing the list of regions without lock is unsafe.
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> > Thanks,
> > Stanislav
> >
> > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > + return false;
> > > +
> > > + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > + if (rc)
> > > + return false;
> > > +
> > > + if (!hv_nofull_mmio) { /* default case */
> > > + gfn = mreg->start_gfn;
> > > + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > + numpgs = mreg->nr_pages;
> > > + } else
> > > + numpgs = 1;
> > > +
> > > + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > +
> > > + return rc == 0;
> > > +}
> > > +
> > > static struct mshv_mem_region *
> > > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > {
> > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > return ret;
> > > }
> > > +
> > > #else /* CONFIG_X86_64 */
> > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > #endif /* CONFIG_X86_64 */
> > > static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > {
> > > switch (vp->vp_intercept_msg_page->header.message_type) {
> > > + case HVMSG_UNMAPPED_GPA:
> > > + return mshv_handle_unmapped_gpa(vp);
> > > case HVMSG_GPA_INTERCEPT:
> > > return mshv_handle_gpa_intercept(vp);
> > > }
> > > --
> > > 2.51.2.vfs.0.1
> > >
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
2026-01-24 0:42 ` Mukesh R
@ 2026-01-26 20:50 ` Stanislav Kinsburskii
2026-01-28 14:36 ` Manivannan Sadhasivam
1 sibling, 0 replies; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-26 20:50 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Fri, Jan 23, 2026 at 04:42:54PM -0800, Mukesh R wrote:
> On 1/20/26 14:22, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:25PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > >
> > > On Hyper-V, most hypercalls related to PCI passthru to map/unmap regions,
> > > interrupts, etc need a device id as a parameter. This device id refers
> > > to that specific device during the lifetime of passthru.
> > >
> > > An L1VH VM only contains VMBus based devices. A device id for a VMBus
> > > device is slightly different in that it uses the hv_pcibus_device info
> > > for building it to make sure it matches exactly what the hypervisor
> > > expects. This VMBus based device id is needed when attaching devices in
> > > an L1VH based guest VM. Before building it, a check is done to make sure
> > > the device is a valid VMBus device.
> > >
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > > arch/x86/include/asm/mshyperv.h | 2 ++
> > > drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
> > > 2 files changed, 31 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > > index eef4c3a5ba28..0d7fdfb25e76 100644
> > > --- a/arch/x86/include/asm/mshyperv.h
> > > +++ b/arch/x86/include/asm/mshyperv.h
> > > @@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
> > > static inline void hv_apic_init(void) {}
> > > #endif
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > > +
> > > struct irq_domain *hv_create_pci_msi_domain(void);
> > > int hv_map_msi_interrupt(struct irq_data *data,
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index 8bc6a38c9b5a..40f0b06bb966 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
> > > #define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
> > > #define HV_MSI_CHIP_FLAGS MSI_CHIP_FLAG_SET_ACK
> > > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> > > +
> >
> > Why not move this static function definition above the caller instead of
> > defining the prototype?
>
> Did you see the function implementation? It has other dependencies that
> come later; it would need a code reorg.
>
Why not placing the caller side after the function definition then?
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> > > static int hv_pci_irqchip_init(void)
> > > {
> > > return 0;
> > > @@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> > > #define hv_msi_prepare pci_msi_prepare
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > > +{
> > > + u64 u64val;
> >
> > This variable is redundant.
>
> Not really. It helps with debugging by making a quick print easy, and is
> harmless.
>
> > > + struct hv_pcibus_device *hbus;
> > > + struct pci_bus *pbus = pdev->bus;
> > > +
> > > + if (!hv_vmbus_pci_device(pbus))
> > > + return 0;
> > > +
> > > + hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > > + u64val = (hbus->hdev->dev_instance.b[5] << 24) |
> > > + (hbus->hdev->dev_instance.b[4] << 16) |
> > > + (hbus->hdev->dev_instance.b[7] << 8) |
> > > + (hbus->hdev->dev_instance.b[6] & 0xf8) |
> > > + PCI_FUNC(pdev->devfn);
> > > +
> >
> > It looks like this value always fits into 32 bit, so what is the value
> > in returning 64 bit?
>
> The ABI has the device id defined as 64 bits where this is assigned.
>
> Thanks,
> -Mukesh
>
>
>
>
> > Thanks,
> > Stanislav
> >
> > > + return u64val;
> > > +}
> > > +EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
> > > +
> > > /**
> > > * hv_irq_retarget_interrupt() - "Unmask" the IRQ by setting its current
> > > * affinity.
> > > @@ -1404,6 +1426,13 @@ static struct pci_ops hv_pcifront_ops = {
> > > .write = hv_pcifront_write_config,
> > > };
> > > +#ifdef CONFIG_X86
> > > +static bool hv_vmbus_pci_device(struct pci_bus *pbus)
> > > +{
> > > + return pbus->ops == &hv_pcifront_ops;
> > > +}
> > > +#endif /* CONFIG_X86 */
> > > +
> > > /*
> > > * Paravirtual backchannel
> > > *
> > > --
> > > 2.51.2.vfs.0.1
> > >
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-26 15:57 ` Stanislav Kinsburskii
@ 2026-01-27 3:02 ` Mukesh R
2026-01-27 18:46 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-27 3:02 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
>> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>
>>>> Add a new file to implement management of device domains, mapping and
>>>> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
>>>> framework for PCI passthru on Hyper-V running Linux as root or L1VH
>>>> parent. This also implements direct attach mechanism for PCI passthru,
>>>> and it is also made to work within the VFIO framework.
>>>>
>>>> At a high level, during boot the hypervisor creates a default identity
>>>> domain and attaches all devices to it. This nicely maps to Linux iommu
>>>> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
>>>> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
>>>> during boot. As mentioned previously, Hyper-V supports two ways to do
>>>> PCI passthru:
>>>>
>>>> 1. Device Domain: root must create a device domain in the hypervisor,
>>>> and do map/unmap hypercalls for mapping and unmapping guest RAM.
>>>> All hypervisor communications use device id of type PCI for
>>>> identifying and referencing the device.
>>>>
>>>> 2. Direct Attach: the hypervisor will simply use the guest's HW
>>>> page table for mappings, thus the host need not do map/unmap
>>>> device memory hypercalls. As such, direct attach passthru setup
>>>> during guest boot is extremely fast. A direct attached device
>>>> must be referenced via logical device id and not via the PCI
>>>> device id.
>>>>
>>>> At present, L1VH root/parent only supports direct attaches. Also direct
>>>> attach is default in non-L1VH cases because there are some significant
>>>> performance issues with device domain implementation currently for guests
>>>> with higher RAM (say more than 8GB), and that unfortunately cannot be
>>>> addressed in the short term.
>>>>
>>>
>>> <snip>
>>>
>
> <snip>
>
>>>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
>>>> +{
>>>> + struct pci_dev *pdev;
>>>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +
>>>> + /* See the attach function, only PCI devices for now */
>>>> + if (!dev_is_pci(dev))
>>>> + return;
>>>> +
>>>> + if (hvdom->num_attchd == 0)
>>>> + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
>>>> +
>>>> + pdev = to_pci_dev(dev);
>>>> +
>>>> + if (hvdom->attached_dom) {
>>>> + hv_iommu_det_dev_from_guest(hvdom, pdev);
>>>> +
>>>> + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
>>>> + * next.
>>>> + */
>>>> + } else {
>>>> + hv_iommu_det_dev_from_dom(hvdom, pdev);
>>>> + }
>>>> +
>>>> + hvdom->num_attchd--;
>>>
>>> Shouldn't this be modified iff the detach succeeded?
>>
>> We want to still free the domain and not let it get stuck. The purpose
>> is more to make sure detach was called before domain free.
>>
>
> How can one debug subsequent errors if num_attchd is decremented
> unconditionally? In reality the device is left attached, but the related
> kernel metadata is gone.
An error is printed in case of a failed detach. If there is a panic, at least
you can get some info about the device. The metadata in the hypervisor is
still around if the detach failed.
>>>> +}
>>>> +
>>>> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
>>>> + unsigned long iova, phys_addr_t paddr,
>>>> + size_t size, u32 flags)
>>>> +{
>>>> + unsigned long irqflags;
>>>> + struct hv_iommu_mapping *mapping;
>>>> +
>>>> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
>>>> + if (!mapping)
>>>> + return -ENOMEM;
>>>> +
>>>> + mapping->paddr = paddr;
>>>> + mapping->iova.start = iova;
>>>> + mapping->iova.last = iova + size - 1;
>>>> + mapping->flags = flags;
>>>> +
>>>> + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
>>>> + interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
>>>> + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
>>>> + unsigned long iova, size_t size)
>>>> +{
>>>> + unsigned long flags;
>>>> + size_t unmapped = 0;
>>>> + unsigned long last = iova + size - 1;
>>>> + struct hv_iommu_mapping *mapping = NULL;
>>>> + struct interval_tree_node *node, *next;
>>>> +
>>>> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
>>>> + next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
>>>> + while (next) {
>>>> + node = next;
>>>> + mapping = container_of(node, struct hv_iommu_mapping, iova);
>>>> + next = interval_tree_iter_next(node, iova, last);
>>>> +
>>>> + /* Trying to split a mapping? Not supported for now. */
>>>> + if (mapping->iova.start < iova)
>>>> + break;
>>>> +
>>>> + unmapped += mapping->iova.last - mapping->iova.start + 1;
>>>> +
>>>> + interval_tree_remove(node, &hvdom->mappings_tree);
>>>> + kfree(mapping);
>>>> + }
>>>> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>>>> +
>>>> + return unmapped;
>>>> +}
>>>> +
>>>> +/* Return: must return exact status from the hypercall without changes */
>>>> +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
>>>> + unsigned long iova, phys_addr_t paddr,
>>>> + unsigned long npages, u32 map_flags)
>>>> +{
>>>> + u64 status;
>>>> + int i;
>>>> + struct hv_input_map_device_gpa_pages *input;
>>>> + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
>>>> +
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + memset(input, 0, sizeof(*input));
>>>> +
>>>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>>>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> + input->device_domain.domain_id.id = hvdom->domid_num;
>>>> + input->map_flags = map_flags;
>>>> + input->target_device_va_base = iova;
>>>> +
>>>> + pfn = paddr >> HV_HYP_PAGE_SHIFT;
>>>> + for (i = 0; i < npages; i++, pfn++)
>>>> + input->gpa_page_list[i] = pfn;
>>>> +
>>>> + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
>>>> + input, NULL);
>>>> +
>>>> + local_irq_restore(flags);
>>>> + return status;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The core VFIO code loops over memory ranges calling this function with
>>>> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
>>>> + */
>>>> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
>>>> + phys_addr_t paddr, size_t pgsize, size_t pgcount,
>>>> + int prot, gfp_t gfp, size_t *mapped)
>>>> +{
>>>> + u32 map_flags;
>>>> + int ret;
>>>> + u64 status;
>>>> + unsigned long npages, done = 0;
>>>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> + size_t size = pgsize * pgcount;
>>>> +
>>>> + map_flags = HV_MAP_GPA_READABLE; /* required */
>>>> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
>>>> +
>>>> + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + if (hvdom->attached_dom) {
>>>> + *mapped = size;
>>>> + return 0;
>>>> + }
>>>> +
>>>> + npages = size >> HV_HYP_PAGE_SHIFT;
>>>> + while (done < npages) {
>>>> + ulong completed, remain = npages - done;
>>>> +
>>>> + status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
>>>> + map_flags);
>>>> +
>>>> + completed = hv_repcomp(status);
>>>> + done = done + completed;
>>>> + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
>>>> + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
>>>> +
>>>> + if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
>>>> + ret = hv_call_deposit_pages(NUMA_NO_NODE,
>>>> + hv_current_partition_id,
>>>> + 256);
>>>> + if (ret)
>>>> + break;
>>>> + }
>>>> + if (!hv_result_success(status))
>>>> + break;
>>>> + }
>>>> +
>>>> + if (!hv_result_success(status)) {
>>>> + size_t done_size = done << HV_HYP_PAGE_SHIFT;
>>>> +
>>>> + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
>>>> + done, npages, iova);
>>>> + /*
>>>> + * lookup tree has all mappings [0 - size-1]. Below unmap will
>>>> + * only remove from [0 - done], we need to remove second chunk
>>>> + * [done+1 - size-1].
>>>> + */
>>>> + hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
>>>> + hv_iommu_unmap_pages(immdom, iova - done_size, pgsize,
>>>> + done, NULL);
>>>> + if (mapped)
>>>> + *mapped = 0;
>>>> + } else
>>>> + if (mapped)
>>>> + *mapped = size;
>>>> +
>>>> + return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
>>>> + size_t pgsize, size_t pgcount,
>>>> + struct iommu_iotlb_gather *gather)
>>>> +{
>>>> + unsigned long flags, npages;
>>>> + struct hv_input_unmap_device_gpa_pages *input;
>>>> + u64 status;
>>>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> + size_t unmapped, size = pgsize * pgcount;
>>>> +
>>>> + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
>>>> + if (unmapped < size)
>>>> + pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
>>>> + __func__, iova, unmapped, size);
>>>> +
>>>> + if (hvdom->attached_dom)
>>>> + return size;
>>>> +
>>>> + npages = size >> HV_HYP_PAGE_SHIFT;
>>>> +
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + memset(input, 0, sizeof(*input));
>>>> +
>>>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>>>> + input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> + input->device_domain.domain_id.id = hvdom->domid_num;
>>>> + input->target_device_va_base = iova;
>>>> +
>>>> + status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
>>>> + 0, input, NULL);
>>>> + local_irq_restore(flags);
>>>> +
>>>> + if (!hv_result_success(status))
>>>> + hv_status_err(status, "\n");
>>>> +
>>>
>>> There is some inconsistency in namings and behaviour of paired
>>> functions:
>>> 1. The pair of hv_iommu_unmap_pages is called hv_iommu_map_pgs
>>
>> The pair of hv_iommu_unmap_pages is hv_iommu_map_pages right above.
>> hv_iommu_map_pgs could be renamed to hv_iommu_map_pgs_hcall I suppose.
>>
>
> Hv_iommu_map_pages is a wrapper around hv_iommu_map_pgs while
> hv_iommu_unmap_pages is a wrapper around the corresponding hypercall.
> That's the inconsistency I meant.
Unmap does not need an intermediate function.
>>> 2. hv_iommu_map_pgs doesn't print status in case of error.
We print the error upon its failure in hv_iommu_map_pages():
if (!hv_result_success(status)) {
size_t done_size = done << HV_HYP_PAGE_SHIFT;
hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
done, npages, iova);
>> it does:
>> hv_status_err(status, "\n"); <==============
>
> It does not. I guess you are confusing it with some other function.
> Here is the function:
>
>>
>>
>>> It would be much better to keep this code consistent.
>>>
>>>> + return unmapped;
>>>> +}
>>>> +
>>>> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
>>>> + dma_addr_t iova)
>>>> +{
>>>> + u64 paddr = 0;
>>>> + unsigned long flags;
>>>> + struct hv_iommu_mapping *mapping;
>>>> + struct interval_tree_node *node;
>>>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> +
>>>> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
>>>> + node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
>>>> + if (node) {
>>>> + mapping = container_of(node, struct hv_iommu_mapping, iova);
>>>> + paddr = mapping->paddr + (iova - mapping->iova.start);
>>>> + }
>>>> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
>>>> +
>>>> + return paddr;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Currently, hypervisor does not provide list of devices it is using
>>>> + * dynamically. So use this to allow users to manually specify devices that
>>>> + * should be skipped. (eg. hypervisor debugger using some network device).
>>>> + */
>>>> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>>>> +{
>>>> + if (!dev_is_pci(dev))
>>>> + return ERR_PTR(-ENODEV);
>>>> +
>>>> + if (pci_devs_to_skip && *pci_devs_to_skip) {
>>>> + int rc, pos = 0;
>>>> + int parsed;
>>>> + int segment, bus, slot, func;
>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>> +
>>>> + do {
>>>> + parsed = 0;
>>>> +
>>>> + rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
>>>> + &segment, &bus, &slot, &func, &parsed);
>>>> + if (rc)
>>>> + break;
>>>> + if (parsed <= 0)
>>>> + break;
>>>> +
>>>> + if (pci_domain_nr(pdev->bus) == segment &&
>>>> + pdev->bus->number == bus &&
>>>> + PCI_SLOT(pdev->devfn) == slot &&
>>>> + PCI_FUNC(pdev->devfn) == func) {
>>>> +
>>>> + dev_info(dev, "skipped by Hyper-V IOMMU\n");
>>>> + return ERR_PTR(-ENODEV);
>>>> + }
>>>> + pos += parsed;
>>>> +
>>>> + } while (pci_devs_to_skip[pos]);
>>>> + }
>>>> +
>>>> + /* Device will be explicitly attached to the default domain, so no need
>>>> + * to do dev_iommu_priv_set() here.
>>>> + */
>>>> +
>>>> + return &hv_virt_iommu;
>>>> +}
>>>> +
>>>> +static void hv_iommu_probe_finalize(struct device *dev)
>>>> +{
>>>> + struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
>>>> +
>>>> + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
>>>> + iommu_setup_dma_ops(dev);
>>>> + else
>>>> + set_dma_ops(dev, NULL);
>>>> +}
>>>> +
>>>> +static void hv_iommu_release_device(struct device *dev)
>>>> +{
>>>> + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
>>>> +
>>>> + /* Need to detach device from device domain if necessary. */
>>>> + if (hvdom)
>>>> + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
>>>> +
>>>> + dev_iommu_priv_set(dev, NULL);
>>>> + set_dma_ops(dev, NULL);
>>>> +}
>>>> +
>>>> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
>>>> +{
>>>> + if (dev_is_pci(dev))
>>>> + return pci_device_group(dev);
>>>> + else
>>>> + return generic_device_group(dev);
>>>> +}
>>>> +
>>>> +static int hv_iommu_def_domain_type(struct device *dev)
>>>> +{
>>>> + /* The hypervisor always creates this by default during boot */
>>>> + return IOMMU_DOMAIN_IDENTITY;
>>>> +}
>>>> +
>>>> +static struct iommu_ops hv_iommu_ops = {
>>>> + .capable = hv_iommu_capable,
>>>> + .domain_alloc_identity = hv_iommu_domain_alloc_identity,
>>>> + .domain_alloc_paging = hv_iommu_domain_alloc_paging,
>>>> + .probe_device = hv_iommu_probe_device,
>>>> + .probe_finalize = hv_iommu_probe_finalize,
>>>> + .release_device = hv_iommu_release_device,
>>>> + .def_domain_type = hv_iommu_def_domain_type,
>>>> + .device_group = hv_iommu_device_group,
>>>> + .default_domain_ops = &(const struct iommu_domain_ops) {
>>>> + .attach_dev = hv_iommu_attach_dev,
>>>> + .map_pages = hv_iommu_map_pages,
>>>> + .unmap_pages = hv_iommu_unmap_pages,
>>>> + .iova_to_phys = hv_iommu_iova_to_phys,
>>>> + .free = hv_iommu_domain_free,
>>>> + },
>>>> + .owner = THIS_MODULE,
>>>> +};
>>>> +
>>>> +static void __init hv_initialize_special_domains(void)
>>>> +{
>>>> + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
>>>> + hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
>>>
>>> hv_def_identity_dom is a static global variable.
>>> Why not initialize hv_def_identity_dom upon definition instead of
>>> introducing a new function?
>>
>> Originally, it was a function. I changed it to static, but during the 6.6
>> review I changed it back to a function. I can't remember why, but it is
>> pretty harmless. We may add more domains, for example a null domain, to the
>> initialization in future.
>>
>>>> +}
>>>> +
>>>> +static int __init hv_iommu_init(void)
>>>> +{
>>>> + int ret;
>>>> + struct iommu_device *iommup = &hv_virt_iommu;
>>>> +
>>>> + if (!hv_is_hyperv_initialized())
>>>> + return -ENODEV;
>>>> +
>>>> + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
>>>> + if (ret) {
>>>> + pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
>>>> + return ret;
>>>> + }
>>>> +
>>>> + /* This must come before iommu_device_register because the latter calls
>>>> + * into the hooks.
>>>> + */
>>>> + hv_initialize_special_domains();
>>>> +
>>>> + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
>>>
>>> It looks weird to initialize an object after creating sysfs entries for
>>> it.
>>> It should be the other way around.
>>
>> Not sure it should be; it is much easier to remove a sysfs entry than do
>> other cleanup, even though iommu_device_unregister is there. I am sure we'll
>> add more code here, which is probably why it was originally done this way.
>>
>
> Sysfs provides user space access to kernel objects. If the object is not
> initialized, it's not only a useless sysfs entry, but also a potential
> cause for kernel panic if user space will try to access this entry
> before the object is initialized.
I hear you... but,
o there is nothing under sysfs to be accessed when created
o it is during boot
o it should almost never fail...
o iommu_device_sysfs_remove is much more lightweight than
iommu_device_unregister
o I expect more to be added there as we enhance it
Thanks,
-Mukesh
> Thanks,
> Stanislav
>
>
>> Thanks,
>> -Mukesh
>>
>>
>> ... snip........
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-26 18:15 ` Stanislav Kinsburskii
@ 2026-01-27 3:07 ` Mukesh R
2026-01-27 18:57 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-27 3:07 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
>> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>
>>>> Upon guest access, in case of missing mmio mapping, the hypervisor
>>>> generates an unmapped gpa intercept. In this path, lookup the PCI
>>>> resource pfn for the guest gpa, and ask the hypervisor to map it
>>>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>>>> and obtained via fixup_user_fault call (similar to KVM).
>>>>
>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>> ---
>>>> drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 115 insertions(+)
>>>>
>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>> index 03f3aa9f5541..4c8bc7cd0888 100644
>>>> --- a/drivers/hv/mshv_root_main.c
>>>> +++ b/drivers/hv/mshv_root_main.c
>>>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>>>> };
>>>> } __packed;
>>>> +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
>>>> +static int __init setup_hv_full_mmio(char *str)
>>>> +{
>>>> + hv_nofull_mmio = true;
>>>> + return 0;
>>>> +}
>>>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>>>> +
>>>> struct mshv_root mshv_root;
>>>> enum hv_scheduler_type hv_scheduler_type;
>>>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>>>> }
>>>> #ifdef CONFIG_X86_64
>>>> +
>>>> +/*
>>>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>>>> + * else just return -errno.
>>>> + */
>>>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>>>> + u64 *mmio_pfnp)
>>>> +{
>>>> + struct vm_area_struct *vma;
>>>> + bool is_mmio;
>>>> + u64 uaddr;
>>>> + struct mshv_mem_region *mreg;
>>>> + struct follow_pfnmap_args pfnmap_args;
>>>> + int rc = -EINVAL;
>>>> +
>>>> + /*
>>>> + * Do not allow mem region to be deleted beneath us. VFIO uses
>>>> + * useraddr vma to lookup pci bar pfn.
>>>> + */
>>>> + spin_lock(&pt->pt_mem_regions_lock);
>>>> +
>>>> + /* Get the region again under the lock */
>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>> + goto unlock_pt_out;
>>>> +
>>>> + uaddr = mreg->start_uaddr +
>>>> + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>>>> +
>>>> + mmap_read_lock(current->mm);
>>>
>>> Semaphore can't be taken under spinlock.
>
>>
>> Yeah, something didn't feel right here and I meant to recheck; now I regret
>> rushing to submit the patch.
>>
>> Rethinking, I think the pt_mem_regions_lock is not needed to protect
>> the uaddr because unmap will properly serialize via the mm lock.
>>
>>
>>>> + vma = vma_lookup(current->mm, uaddr);
>>>> + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>>>
>>> Why this check is needed again?
>>
>> To make sure region did not change. This check is under lock.
>>
>
> How can this happen? One can't change VMA type without unmapping it
> first. And unmapping it leads to a kernel MMIO region state dangling
> around without corresponding user space mapping.
Right, and vm_flags would not have the expected mmio flags then.
> This is similar to dangling pinned regions and should likely be
> addressed the same way by utilizing MMU notifiers to destroy memory
> regions if the VMA is detached.
I don't think we need that. Either it succeeds if the region did not
change at all, or just fails.
>>> The region type is stored on the region itself.
>>> And the type is checked on the caller side.
>>>
>>>> + if (!is_mmio)
>>>> + goto unlock_mmap_out;
>>>> +
>>>> + pfnmap_args.vma = vma;
>>>> + pfnmap_args.address = uaddr;
>>>> +
>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>> + if (rc) {
>>>> + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>>>> + NULL);
>>>> + if (rc)
>>>> + goto unlock_mmap_out;
>>>> +
>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>> + if (rc)
>>>> + goto unlock_mmap_out;
>>>> + }
>>>> +
>>>> + *mmio_pfnp = pfnmap_args.pfn;
>>>> + follow_pfnmap_end(&pfnmap_args);
>>>> +
>>>> +unlock_mmap_out:
>>>> + mmap_read_unlock(current->mm);
>>>> +unlock_pt_out:
>>>> + spin_unlock(&pt->pt_mem_regions_lock);
>>>> + return rc;
>>>> +}
>>>> +
>>>> +/*
>>>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>>>> + * and resolve if possible.
>>>> + * Returns: True if valid mmio intercept and it was handled, else false
>>>> + */
>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>>>> +{
>>>> + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>>>> + struct hv_x64_memory_intercept_message *msg;
>>>> + union hv_x64_memory_access_info accinfo;
>>>> + u64 gfn, mmio_spa, numpgs;
>>>> + struct mshv_mem_region *mreg;
>>>> + int rc;
>>>> + struct mshv_partition *pt = vp->vp_partition;
>>>> +
>>>> + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>>>> + accinfo = msg->memory_access_info;
>>>> +
>>>> + if (!accinfo.gva_gpa_valid)
>>>> + return false;
>>>> +
>>>> + /* Do a fast check and bail if non mmio intercept */
>>>> + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>
>>> This call needs to be protected by the spinlock.
>>
>> This is sorta fast path to bail. We recheck under partition lock above.
>>
>
> Accessing the list of regions without lock is unsafe.
I am not sure why? This check is done by a vcpu thread, so regions
will not have just gone away.
Thanks,
-Mukesh
> Thanks,
> Stanislav
>
>> Thanks,
>> -Mukesh
>>
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>> + return false;
>>>> +
>>>> + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>>>> + if (rc)
>>>> + return false;
>>>> +
>>>> + if (!hv_nofull_mmio) { /* default case */
>>>> + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>>>> + gfn = mreg->start_gfn;
>>>> + numpgs = mreg->nr_pages;
>>>> + } else
>>>> + numpgs = 1;
>>>> +
>>>> + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>>>> +
>>>> + return rc == 0;
>>>> +}
>>>> +
>>>> static struct mshv_mem_region *
>>>> mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>>>> {
>>>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>>> return ret;
>>>> }
>>>> +
>>>> #else /* CONFIG_X86_64 */
>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>>>> static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>>>> #endif /* CONFIG_X86_64 */
>>>> static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>>>> {
>>>> switch (vp->vp_intercept_msg_page->header.message_type) {
>>>> + case HVMSG_UNMAPPED_GPA:
>>>> + return mshv_handle_unmapped_gpa(vp);
>>>> case HVMSG_GPA_INTERCEPT:
>>>> return mshv_handle_gpa_intercept(vp);
>>>> }
>>>> --
>>>> 2.51.2.vfs.0.1
>>>>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-27 3:02 ` Mukesh R
@ 2026-01-27 18:46 ` Stanislav Kinsburskii
2026-01-30 22:51 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-27 18:46 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
> On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> > > On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > >
> > > > > Add a new file to implement management of device domains, mapping and
> > > > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > > > parent. This also implements direct attach mechanism for PCI passthru,
> > > > > and it is also made to work within the VFIO framework.
> > > > >
> > > > > At a high level, during boot the hypervisor creates a default identity
> > > > > domain and attaches all devices to it. This nicely maps to Linux iommu
> > > > > subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > > > PCI passthru:
> > > > >
> > > > > 1. Device Domain: root must create a device domain in the hypervisor,
> > > > > and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > > > All hypervisor communications use device id of type PCI for
> > > > > identifying and referencing the device.
> > > > >
> > > > > 2. Direct Attach: the hypervisor will simply use the guest's HW
> > > > > page table for mappings, thus the host need not do map/unmap
> > > > > device memory hypercalls. As such, direct attach passthru setup
> > > > > during guest boot is extremely fast. A direct attached device
> > > > > must be referenced via logical device id and not via the PCI
> > > > > device id.
> > > > >
> > > > > At present, L1VH root/parent only supports direct attaches. Also direct
> > > > > attach is default in non-L1VH cases because there are some significant
> > > > > performance issues with device domain implementation currently for guests
> > > > > with higher RAM (say more than 8GB), and that unfortunately cannot be
> > > > > addressed in the short term.
> > > > >
> > > >
> > > > <snip>
> > > >
> >
> > <snip>
> >
> > > > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > > > +{
> > > > > + struct pci_dev *pdev;
> > > > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > > > +
> > > > > + /* See the attach function, only PCI devices for now */
> > > > > + if (!dev_is_pci(dev))
> > > > > + return;
> > > > > +
> > > > > + if (hvdom->num_attchd == 0)
> > > > > + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > > > +
> > > > > + pdev = to_pci_dev(dev);
> > > > > +
> > > > > + if (hvdom->attached_dom) {
> > > > > + hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > > > +
> > > > > + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > > > + * next.
> > > > > + */
> > > > > + } else {
> > > > > + hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > > > + }
> > > > > +
> > > > > + hvdom->num_attchd--;
> > > >
> > > > Shouldn't this be modified iff the detach succeeded?
> > >
> > > We want to still free the domain and not let it get stuck. The purpose
> > > is more to make sure detach was called before domain free.
> > >
> >
> > How can one debug subsequent errors if num_attchd is decremented
> > unconditionally? In reality the device is left attached, but the related
> > kernel metadata is gone.
>
> Error is printed in case of failed detach. If there is panic, at least
> you can get some info about the device. Metadata in hypervisor is
> around if failed.
>
With this approach the only thing left is a kernel message.
But if the state is kept intact, one could collect a kernel core and
analyze it.
And note that there won't be a hypervisor core by default: our main
context with the upstreamed version of the driver is L1VH, and a kernel
core is the only thing a third-party customer can provide for our
analysis.
Thanks,
Stanislav
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-27 3:07 ` Mukesh R
@ 2026-01-27 18:57 ` Stanislav Kinsburskii
2026-01-30 22:17 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-01-27 18:57 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
> On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> > On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> > > On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > >
> > > > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > > > generates an unmapped gpa intercept. In this path, lookup the PCI
> > > > > resource pfn for the guest gpa, and ask the hypervisor to map it
> > > > > via hypercall. The PCI resource pfn is maintained by the VFIO driver,
> > > > > and obtained via fixup_user_fault call (similar to KVM).
> > > > >
> > > > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > ---
> > > > > drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 115 insertions(+)
> > > > >
> > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > > > };
> > > > > } __packed;
> > > > > +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> > > > > +static int __init setup_hv_full_mmio(char *str)
> > > > > +{
> > > > > + hv_nofull_mmio = true;
> > > > > + return 0;
> > > > > +}
> > > > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > > > +
> > > > > struct mshv_root mshv_root;
> > > > > enum hv_scheduler_type hv_scheduler_type;
> > > > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > > }
> > > > > #ifdef CONFIG_X86_64
> > > > > +
> > > > > +/*
> > > > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > > > + * else just return -errno.
> > > > > + */
> > > > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > > > + u64 *mmio_pfnp)
> > > > > +{
> > > > > + struct vm_area_struct *vma;
> > > > > + bool is_mmio;
> > > > > + u64 uaddr;
> > > > > + struct mshv_mem_region *mreg;
> > > > > + struct follow_pfnmap_args pfnmap_args;
> > > > > + int rc = -EINVAL;
> > > > > +
> > > > > + /*
> > > > > + * Do not allow mem region to be deleted beneath us. VFIO uses
> > > > > + * useraddr vma to lookup pci bar pfn.
> > > > > + */
> > > > > + spin_lock(&pt->pt_mem_regions_lock);
> > > > > +
> > > > > + /* Get the region again under the lock */
> > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > + goto unlock_pt_out;
> > > > > +
> > > > > + uaddr = mreg->start_uaddr +
> > > > > + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > > > +
> > > > > + mmap_read_lock(current->mm);
> > > >
> > > > Semaphore can't be taken under spinlock.
> >
> > >
> > > Yeah, something didn't feel right here and I meant to recheck, now regret
> > > rushing to submit the patch.
> > >
> > > Rethinking, I think the pt_mem_regions_lock is not needed to protect
> > > the uaddr because unmap will properly serialize via the mm lock.
> > >
> > >
> > > > > + vma = vma_lookup(current->mm, uaddr);
> > > > > + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> > > >
> > > > Why this check is needed again?
> > >
> > > To make sure region did not change. This check is under lock.
> > >
> >
> > How can this happen? One can't change VMA type without unmapping it
> > first. And unmapping it leads to a kernel MMIO region state dangling
> > around without corresponding user space mapping.
>
> Right, and vm_flags would not have the expected mmio bits then.
>
> > This is similar to dangling pinned regions and should likely be
> > addressed the same way, by utilizing MMU notifiers to destroy memory
> > regions if the VMA is detached.
>
> I don't think we need that. Either it succeeds if the region did not
> change at all, or just fails.
>
I'm afraid we do: if the driver mapped a page with the previous
memory region, and then the region is unmapped, the page will stay
mapped in the hypervisor but will be considered free by the kernel,
which in turn will lead to a GPF upon the next allocation.
With pinned regions the issue is similar but less impacting: pages can't
be released by user space unmapping and thus will simply be leaked, but
the system stays intact.
MMIO regions are similar to movable regions in this regard: they don't
reference the user pages, and thus this guest region replacement is a
straight path to a kernel panic.
>
> > > > The region type is stored on the region itself.
> > > > And the type is checked on the caller side.
> > > >
> > > > > + if (!is_mmio)
> > > > > + goto unlock_mmap_out;
> > > > > +
> > > > > + pfnmap_args.vma = vma;
> > > > > + pfnmap_args.address = uaddr;
> > > > > +
> > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > + if (rc) {
> > > > > + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > > > + NULL);
> > > > > + if (rc)
> > > > > + goto unlock_mmap_out;
> > > > > +
> > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > + if (rc)
> > > > > + goto unlock_mmap_out;
> > > > > + }
> > > > > +
> > > > > + *mmio_pfnp = pfnmap_args.pfn;
> > > > > + follow_pfnmap_end(&pfnmap_args);
> > > > > +
> > > > > +unlock_mmap_out:
> > > > > + mmap_read_unlock(current->mm);
> > > > > +unlock_pt_out:
> > > > > + spin_unlock(&pt->pt_mem_regions_lock);
> > > > > + return rc;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > > > + * and resolve if possible.
> > > > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > > > + */
> > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > > > +{
> > > > > + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > > > + struct hv_x64_memory_intercept_message *msg;
> > > > > + union hv_x64_memory_access_info accinfo;
> > > > > + u64 gfn, mmio_spa, numpgs;
> > > > > + struct mshv_mem_region *mreg;
> > > > > + int rc;
> > > > > + struct mshv_partition *pt = vp->vp_partition;
> > > > > +
> > > > > + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > > > + accinfo = msg->memory_access_info;
> > > > > +
> > > > > + if (!accinfo.gva_gpa_valid)
> > > > > + return false;
> > > > > +
> > > > > + /* Do a fast check and bail if non mmio intercept */
> > > > > + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > >
> > > > This call needs to be protected by the spinlock.
> > >
> > > This is sorta fast path to bail. We recheck under partition lock above.
> > >
> >
> > Accessing the list of regions without lock is unsafe.
>
> I am not sure why? This check is done by a vcpu thread, so regions
> will not have just gone away.
>
These are shared resources. Multiple VP threads get into this function
simultaneously, so there is a race already. But that one we can live
with without locking, as they don't mutate the list of regions.
The issue happens when the VMM adds or removes a region: that mutates
the list and races with VP threads doing this lookup.
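A minimal userspace sketch of the locking rule being argued for (the names are illustrative, not the mshv ones): every lookup of the shared region list takes the same lock the add/remove paths take, so VP-thread lookups can never observe the list mid-mutation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct toy_region {
	unsigned long start_gfn, nr_pages;
	struct toy_region *next;
};

static pthread_mutex_t regions_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_region *regions;	/* shared list, mutated by the "VMM" */

/* Lookup under the lock, as the VP intercept path should do. */
static struct toy_region *region_by_gfn(unsigned long gfn)
{
	struct toy_region *r, *found = NULL;

	pthread_mutex_lock(&regions_lock);
	for (r = regions; r; r = r->next)
		if (gfn >= r->start_gfn && gfn < r->start_gfn + r->nr_pages) {
			found = r;
			break;
		}
	pthread_mutex_unlock(&regions_lock);
	return found;
}

/* The "VMM" add path takes the same lock before mutating the list. */
static void region_add(struct toy_region *r)
{
	pthread_mutex_lock(&regions_lock);
	r->next = regions;
	regions = r;
	pthread_mutex_unlock(&regions_lock);
}
```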
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> > Thanks,
> > Stanislav
> >
> > > Thanks,
> > > -Mukesh
> > >
> > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > + return false;
> > > > > +
> > > > > + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > > > + if (rc)
> > > > > + return false;
> > > > > +
> > > > > + if (!hv_nofull_mmio) { /* default case */
> > > > > + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > > > + gfn = mreg->start_gfn;
> > > > > + numpgs = mreg->nr_pages;
> > > > > + } else
> > > > > + numpgs = 1;
> > > > > +
> > > > > + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > > > +
> > > > > + return rc == 0;
> > > > > +}
> > > > > +
> > > > > static struct mshv_mem_region *
> > > > > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > > {
> > > > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > > return ret;
> > > > > }
> > > > > +
> > > > > #else /* CONFIG_X86_64 */
> > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > > > static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > > #endif /* CONFIG_X86_64 */
> > > > > static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > > {
> > > > > switch (vp->vp_intercept_msg_page->header.message_type) {
> > > > > + case HVMSG_UNMAPPED_GPA:
> > > > > + return mshv_handle_unmapped_gpa(vp);
> > > > > case HVMSG_GPA_INTERCEPT:
> > > > > return mshv_handle_gpa_intercept(vp);
> > > > > }
> > > > > --
> > > > > 2.51.2.vfs.0.1
> > > > >
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-24 2:01 ` Mukesh R
@ 2026-01-27 19:21 ` Jacob Pan
2026-01-27 22:31 ` Jacob Pan
2026-01-30 23:44 ` Mukesh R
0 siblings, 2 replies; 69+ messages in thread
From: Jacob Pan @ 2026-01-27 19:21 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
Hi Mukesh,
On Fri, 23 Jan 2026 18:01:29 -0800
Mukesh R <mrathor@linux.microsoft.com> wrote:
> On 1/21/26 21:18, Jacob Pan wrote:
> > Hi Mukesh,
> >
> > On Mon, 19 Jan 2026 22:42:27 -0800
> > Mukesh R <mrathor@linux.microsoft.com> wrote:
> >
> >> From: Mukesh Rathor <mrathor@linux.microsoft.com>
> >>
> >> Add a new file to implement management of device domains, mapping
> >> and unmapping of iommu memory, and other iommu_ops to fit within
> >> the VFIO framework for PCI passthru on Hyper-V running Linux as
> >> root or L1VH parent. This also implements direct attach mechanism
> >> for PCI passthru, and it is also made to work within the VFIO
> >> framework.
> >>
> >> At a high level, during boot the hypervisor creates a default
> >> identity domain and attaches all devices to it. This nicely maps
> >> to Linux iommu subsystem IOMMU_DOMAIN_IDENTITY domain. As a
> >> result, Linux does not need to explicitly ask Hyper-V to attach
> >> devices and do maps/unmaps during boot. As mentioned previously,
> >> Hyper-V supports two ways to do PCI passthru:
> >>
> >> 1. Device Domain: root must create a device domain in the
> >> hypervisor, and do map/unmap hypercalls for mapping and unmapping
> >> guest RAM. All hypervisor communications use device id of type PCI
> >> for identifying and referencing the device.
> >>
> >> 2. Direct Attach: the hypervisor will simply use the guest's HW
> >> page table for mappings, thus the host need not do map/unmap
> >> device memory hypercalls. As such, direct attach passthru
> >> setup during guest boot is extremely fast. A direct attached device
> >> must be referenced via logical device id and not via the PCI
> >> device id.
> >>
> >> At present, L1VH root/parent only supports direct attaches. Also
> >> direct attach is default in non-L1VH cases because there are some
> >> significant performance issues with device domain implementation
> >> currently for guests with higher RAM (say more than 8GB), and that
> >> unfortunately cannot be addressed in the short term.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >> MAINTAINERS | 1 +
> >> arch/x86/include/asm/mshyperv.h | 7 +-
> >> arch/x86/kernel/pci-dma.c | 2 +
> >> drivers/iommu/Makefile | 2 +-
> >> drivers/iommu/hyperv-iommu.c | 876
> >> ++++++++++++++++++++++++++++++++ include/linux/hyperv.h |
> >> 6 + 6 files changed, 890 insertions(+), 4 deletions(-)
> >> create mode 100644 drivers/iommu/hyperv-iommu.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 381a0e086382..63160cee942c 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -11741,6 +11741,7 @@ F: drivers/hid/hid-hyperv.c
> >> F: drivers/hv/
> >> F: drivers/infiniband/hw/mana/
> >> F: drivers/input/serio/hyperv-keyboard.c
> >> +F: drivers/iommu/hyperv-iommu.c
> > Given we are also developing a guest iommu driver on hyperv, I
> > think it is more clear to name them accordingly. Perhaps,
> > hyperv-iommu-root.c?
>
> well, l1vh is not quite root, more like a parent. But we've been using
> l1vh root loosely to mean l1vh parent. so probably ok to rename it
> to hyperv-iommu-root.c. I prefer not calling it parent or something
> like that.
yeah, something specific and different from the guest driver will do.
> >> F: drivers/iommu/hyperv-irq.c
> >> F: drivers/net/ethernet/microsoft/
> >> F: drivers/net/hyperv/
> >> diff --git a/arch/x86/include/asm/mshyperv.h
> >> b/arch/x86/include/asm/mshyperv.h index 97477c5a8487..e4ccdbbf1d12
> >> 100644 --- a/arch/x86/include/asm/mshyperv.h
> >> +++ b/arch/x86/include/asm/mshyperv.h
> >> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
> >> #endif
> >>
> >> #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
> >> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> >> -{ return false; } /* temporary */
> >> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
> >> u64 hv_build_devid_oftype(struct pci_dev *pdev, enum
> >> hv_device_type type); +u64 hv_iommu_get_curr_partid(void);
> >> #else /* CONFIG_HYPERV_IOMMU */
> >> static inline bool hv_pcidev_is_attached_dev(struct pci_dev
> >> *pdev) { return false; }
> >> -
> >> static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
> >> enum hv_device_type type)
> >> { return 0; }
> >> +static inline u64 hv_iommu_get_curr_partid(void)
> >> +{ return HV_PARTITION_ID_INVALID; }
> >>
> >> #endif /* CONFIG_HYPERV_IOMMU */
> >>
> >> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> >> index 6267363e0189..cfeee6505e17 100644
> >> --- a/arch/x86/kernel/pci-dma.c
> >> +++ b/arch/x86/kernel/pci-dma.c
> >> @@ -8,6 +8,7 @@
> >> #include <linux/gfp.h>
> >> #include <linux/pci.h>
> >> #include <linux/amd-iommu.h>
> >> +#include <linux/hyperv.h>
> >>
> >> #include <asm/proto.h>
> >> #include <asm/dma.h>
> >> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
> >> gart_iommu_hole_init();
> >> amd_iommu_detect();
> >> detect_intel_iommu();
> >> + hv_iommu_detect();
> > Will this driver be x86 only?
> Yes for now.
If there is nothing x86 specific in this driver (assuming the
hypercalls here are not x86 only), maybe you can move it to the generic
startup code.
> >> swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
> >> }
> >>
> >> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> >> index 598c39558e7d..cc9774864b00 100644
> >> --- a/drivers/iommu/Makefile
> >> +++ b/drivers/iommu/Makefile
> >> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
> >> obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> >> obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> >> obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> >> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
> >> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
> > DMA and IRQ remapping should be separate
>
> not sure i follow.
In the IOMMU subsystem, DMA remapping and IRQ remapping can be turned
on/off independently; e.g. you could have an option to turn on IRQ
remapping w/o DMA remapping. But here you tie them together.
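For instance, the split could look something like this in the iommu Makefile, with the DMA-remapping object behind its own config symbol (the CONFIG_HYPERV_DMA_REMAP name here is hypothetical, just to show the shape):

```make
# IRQ remapping stays under the existing symbol
obj-$(CONFIG_HYPERV_IOMMU)     += hyperv-irq.o
# DMA remapping gated independently (symbol name is illustrative)
obj-$(CONFIG_HYPERV_DMA_REMAP) += hyperv-iommu.o
```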
>
> >> obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> >> obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> >> obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> >> diff --git a/drivers/iommu/hyperv-iommu.c
> >> b/drivers/iommu/hyperv-iommu.c new file mode 100644
> >> index 000000000000..548483fec6b1
> >> --- /dev/null
> >> +++ b/drivers/iommu/hyperv-iommu.c
> >> @@ -0,0 +1,876 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Hyper-V root vIOMMU driver.
> >> + * Copyright (C) 2026, Microsoft, Inc.
> >> + */
> >> +
> >> +#include <linux/module.h>
> > I don't think this is needed since this driver cannot be a module
> >
> >> +#include <linux/pci.h>
> >> +#include <linux/dmar.h>
> > should not depend on Intel's DMAR
> >
> >> +#include <linux/dma-map-ops.h>
> >> +#include <linux/interval_tree.h>
> >> +#include <linux/hyperv.h>
> >> +#include "dma-iommu.h"
> >> +#include <asm/iommu.h>
> >> +#include <asm/mshyperv.h>
> >> +
> >> +/* We will not claim these PCI devices, eg hypervisor needs it for
> >> debugger */ +static char *pci_devs_to_skip;
> >> +static int __init hv_iommu_setup_skip(char *str)
> >> +{
> >> + pci_devs_to_skip = str;
> >> +
> >> + return 0;
> >> +}
> >> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> >> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> >> +
> >> +bool hv_no_attdev; /* disable direct device attach for
> >> passthru */ +EXPORT_SYMBOL_GPL(hv_no_attdev);
> >> +static int __init setup_hv_no_attdev(char *str)
> >> +{
> >> + hv_no_attdev = true;
> >> + return 0;
> >> +}
> >> +__setup("hv_no_attdev", setup_hv_no_attdev);
> >> +
> >> +/* Iommu device that we export to the world. HyperV supports max
> >> of one */ +static struct iommu_device hv_virt_iommu;
> >> +
> >> +struct hv_domain {
> >> + struct iommu_domain iommu_dom;
> >> + u32 domid_num; /* as opposed
> >> to domain_id.type */
> >> + u32 num_attchd; /* number of
> >> currently attached devices */
> > rename to num_dev_attached?
> >
> >> + bool attached_dom; /* is this direct
> >> attached dom? */
> >> + spinlock_t mappings_lock; /* protects
> >> mappings_tree */
> >> + struct rb_root_cached mappings_tree; /* iova to pa lookup
> >> tree */ +};
> >> +
> >> +#define to_hv_domain(d) container_of(d, struct hv_domain,
> >> iommu_dom) +
> >> +struct hv_iommu_mapping {
> >> + phys_addr_t paddr;
> >> + struct interval_tree_node iova;
> >> + u32 flags;
> >> +};
> >> +
> >> +/*
> >> + * By default, during boot the hypervisor creates one Stage 2 (S2)
> >> default
> >> + * domain. Stage 2 means that the page table is controlled by the
> >> hypervisor.
> >> + * S2 default: access to entire root partition memory. This for
> >> us easily
> >> + * maps to IOMMU_DOMAIN_IDENTITY in the iommu
> >> subsystem, and
> >> + * is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the
> >> hypervisor.
> >> + *
> >> + * Device Management:
> >> + * There are two ways to manage device attaches to domains:
> >> + * 1. Domain Attach: A device domain is created in the
> >> hypervisor, the
> >> + * device is attached to this domain, and
> >> then memory
> >> + * ranges are mapped in the map callbacks.
> >> + * 2. Direct Attach: No need to create a domain in the
> >> hypervisor for direct
> >> + * attached devices. A hypercall is made
> >> to tell the
> >> + * hypervisor to attach the device to a
> >> guest. There is
> >> + * no need for explicit memory mappings
> >> because the
> >> + * hypervisor will just use the guest HW
> >> page table.
> >> + *
> >> + * Since a direct attach is much faster, it is the default. This
> >> can be
> >> + * changed via hv_no_attdev.
> >> + *
> >> + * L1VH: hypervisor only supports direct attach.
> >> + */
> >> +
> >> +/*
> >> + * Create dummy domain to correspond to hypervisor prebuilt
> >> default identity
> >> + * domain (dummy because we do not make hypercall to create them).
> >> + */
> >> +static struct hv_domain hv_def_identity_dom;
> >> +
> >> +static bool hv_special_domain(struct hv_domain *hvdom)
> >> +{
> >> + return hvdom == &hv_def_identity_dom;
> >> +}
> >> +
> >> +struct iommu_domain_geometry default_geometry = (struct
> >> iommu_domain_geometry) {
> >> + .aperture_start = 0,
> >> + .aperture_end = -1UL,
> >> + .force_aperture = true,
> >> +};
> >> +
> >> +/*
> >> + * Since the relevant hypercalls can only fit less than 512 PFNs
> >> in the pfn
> >> + * array, report 1M max.
> >> + */
> >> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
> >> +
> >> +static u32 unique_id; /* unique numeric id of a new
> >> domain */ +
> >> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
> >> + struct device *dev);
> >> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom,
> >> ulong iova,
> >> + size_t pgsize, size_t pgcount,
> >> + struct iommu_iotlb_gather
> >> *gather); +
> >> +/*
> >> + * If the current thread is a VMM thread, return the partition id
> >> of the VM it
> >> + * is managing, else return HV_PARTITION_ID_INVALID.
> >> + */
> >> +u64 hv_iommu_get_curr_partid(void)
> >> +{
> >> + u64 (*fn)(pid_t pid);
> >> + u64 partid;
> >> +
> >> + fn = symbol_get(mshv_pid_to_partid);
> >> + if (!fn)
> >> + return HV_PARTITION_ID_INVALID;
> >> +
> >> + partid = fn(current->tgid);
> >> + symbol_put(mshv_pid_to_partid);
> >> +
> >> + return partid;
> >> +}
> > This function is not iommu specific. Maybe move it to mshv code?
>
> Well, it is getting the information from mshv by calling a function
> there for iommu, and is not needed if no HYPER_IOMMU. So this is
> probably the best place for it.
>
ok, maybe move it to mshv after we have a second user. But the function
name can be just hv_get_curr_partid(void), no?
> >> +
> >> +/* If this is a VMM thread, then this domain is for a guest VM */
> >> +static bool hv_curr_thread_is_vmm(void)
> >> +{
> >> + return hv_iommu_get_curr_partid() !=
> >> HV_PARTITION_ID_INVALID; +}
> >> +
> >> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap
> >> cap) +{
> >> + switch (cap) {
> >> + case IOMMU_CAP_CACHE_COHERENCY:
> >> + return true;
> >> + default:
> >> + return false;
> >> + }
> >> + return false;
> >> +}
> >> +
> >> +/*
> >> + * Check if given pci device is a direct attached device. Caller
> >> must have
> >> + * verified pdev is a valid pci device.
> >> + */
> >> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
> >> +{
> >> + struct iommu_domain *iommu_domain;
> >> + struct hv_domain *hvdom;
> >> + struct device *dev = &pdev->dev;
> >> +
> >> + iommu_domain = iommu_get_domain_for_dev(dev);
> >> + if (iommu_domain) {
> >> + hvdom = to_hv_domain(iommu_domain);
> >> + return hvdom->attached_dom;
> >> + }
> >> +
> >> + return false;
> >> +}
> >> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
> > Attached domain can change anytime, what guarantee does the caller
> > have?
>
> Not sure I understand what can change: the device moving from attached
> to non-attached? or the domain getting deleted? In any case, this is
> called from leaf functions, so that should not happen... and it
> will return false if the device did somehow get removed.
>
I was thinking the device could be attached to a different domain type at
runtime, e.g. via sysfs to identity or DMA. But I guess here it is a static
attachment, either for l1vh or root.
> >> +
> >> +/* Create a new device domain in the hypervisor */
> >> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
> >> +{
> >> + u64 status;
> >> + unsigned long flags;
> >> + struct hv_input_device_domain *ddp;
> >> + struct hv_input_create_device_domain *input;
> > nit: use consistent coding style, inverse Christmas tree.
> >
> >> +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + ddp = &input->device_domain;
> >> + ddp->partition_id = HV_PARTITION_ID_SELF;
> >> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> >> + ddp->domain_id.id = hvdom->domid_num;
> >> +
> >> +
> >> input->create_device_domain_flags.forward_progress_required = 1;
> >> + input->create_device_domain_flags.inherit_owning_vtl = 0;
> >> +
> >> + status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN,
> >> input, NULL); +
> >> + local_irq_restore(flags);
> >> +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> +
> >> + return hv_result_to_errno(status);
> >> +}
> >> +
> >> +/* During boot, all devices are attached to this */
> >> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct
> >> device *dev) +{
> >> + return &hv_def_identity_dom.iommu_dom;
> >> +}
> >> +
> >> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
> >> device *dev) +{
> >> + struct hv_domain *hvdom;
> >> + int rc;
> >> +
> >> + if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
> >> !hv_no_attdev) {
> >> + pr_err("Hyper-V: l1vh iommu does not support host
> >> devices\n");
> > why is this an error if the user chose not to do direct attach?
>
> Like the error message says: on l1vh, direct attaches of host devices
> (eg dpdk) are not supported, and l1vh only does direct attaches. IOW,
> no host devices on l1vh.
>
This hv_no_attdev flag is really confusing to me: by default
hv_no_attdev is false, which allows direct attach, and you are saying
l1vh allows it.
Why does this flag also control host device attachment in l1vh? If you
can tell the difference between a direct host device attach and other
direct attaches, why not always reject host attach in l1vh?
> >> + return NULL;
> >> + }
> >> +
> >> + hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> >> + if (hvdom == NULL)
> >> + goto out;
> >> +
> >> + spin_lock_init(&hvdom->mappings_lock);
> >> + hvdom->mappings_tree = RB_ROOT_CACHED;
> >> +
> >> + if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /*
> >> ie, 0 */
> > This is true only when unique_id wraps around, right? Then this
> > driver stops working?
>
> Correct. It's a u32, so if my math is right, and a device is attached
> every second, it will take 136 years to wrap! Did I get that right?
>
This is still an unnecessary vulnerability.
> > can you use an IDR for the unique_id and free it as you detach
> > instead of doing this cyclic allocation?
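To make the suggestion concrete: a rough userspace model of
allocate-on-attach / free-on-detach id management. In the kernel this
would simply be ida_alloc_range()/ida_free(); the bitmap, MAX_DOMS and
helper names below are invented for illustration. Freed ids become
reusable, so allocation can only fail while all ids are simultaneously
busy, never because a monotonic counter wrapped.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of free-on-detach domain id allocation. A small
 * bitmap stands in for the kernel IDA; bit n set means id n+1 is in
 * use. Id 0 stays reserved for HV_DEVICE_DOMAIN_ID_S2_DEFAULT.
 */
#define MAX_DOMS 64

static uint64_t domid_bitmap;

static int domid_alloc(void)
{
	for (int i = 0; i < MAX_DOMS; i++) {
		if (!(domid_bitmap & (1ULL << i))) {
			domid_bitmap |= 1ULL << i;
			return i + 1;	/* skip reserved id 0 */
		}
	}
	return -1;	/* all ids busy right now, not a permanent wrap */
}

static void domid_free(int id)
{
	assert(id >= 1 && id <= MAX_DOMS);
	domid_bitmap &= ~(1ULL << (id - 1));
}
```

hv_iommu_domain_free() could then call domid_free(hvdom->domid_num)
once the delete hypercall succeeds, instead of leaking the id.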
> >
> >> + goto out_free;
> >> +
> >> + hvdom->domid_num = unique_id;
> >> + hvdom->iommu_dom.geometry = default_geometry;
> >> + hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> >> +
> >> + /* For guests, by default we do direct attaches, so no
> >> domain in hyp */
> >> + if (hv_curr_thread_is_vmm() && !hv_no_attdev)
> >> + hvdom->attached_dom = true;
> >> + else {
> >> + rc = hv_iommu_create_hyp_devdom(hvdom);
> >> + if (rc)
> >> + goto out_free_id;
> >> + }
> >> +
> >> + return &hvdom->iommu_dom;
> >> +
> >> +out_free_id:
> >> + unique_id--;
> >> +out_free:
> >> + kfree(hvdom);
> >> +out:
> >> + return NULL;
> >> +}
> >> +
> >> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> >> +{
> >> + struct hv_domain *hvdom = to_hv_domain(immdom);
> >> + unsigned long flags;
> >> + u64 status;
> >> + struct hv_input_delete_device_domain *input;
> >> +
> >> + if (hv_special_domain(hvdom))
> >> + return;
> >> +
> >> + if (hvdom->num_attchd) {
> >> + pr_err("Hyper-V: can't free busy iommu domain
> >> (%p)\n", immdom);
> >> + return;
> >> + }
> >> +
> >> + if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
> >> + struct hv_input_device_domain *ddp;
> >> +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + ddp = &input->device_domain;
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + ddp->partition_id = HV_PARTITION_ID_SELF;
> >> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> >> + ddp->domain_id.id = hvdom->domid_num;
> >> +
> >> + status =
> >> hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
> >> + NULL);
> >> + local_irq_restore(flags);
> >> +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> + }
>
> > you could free the domid here, no?
> sorry, don't follow what you mean by domid, you mean unique_id?
>
yes.
> >> +
> >> + kfree(hvdom);
> >> +}
> >> +
> >> +/* Attach a device to a domain previously created in the
> >> hypervisor */ +static int hv_iommu_att_dev2dom(struct hv_domain
> >> *hvdom, struct pci_dev *pdev) +{
> >> + unsigned long flags;
> >> + u64 status;
> >> + enum hv_device_type dev_type;
> >> + struct hv_input_attach_device_domain *input;
> >> +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> >> + input->device_domain.domain_id.type =
> >> HV_DEVICE_DOMAIN_TYPE_S2;
> >> + input->device_domain.domain_id.id = hvdom->domid_num;
> >> +
> >> + /* NB: Upon guest shutdown, device is re-attached to the
> >> default domain
> >> + * without explicit detach.
> >> + */
> >> + if (hv_l1vh_partition())
> >> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> >> + else
> >> + dev_type = HV_DEVICE_TYPE_PCI;
> >> +
> >> + input->device_id.as_uint64 = hv_build_devid_oftype(pdev,
> >> dev_type); +
> >> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN,
> >> input, NULL);
> >> + local_irq_restore(flags);
> >> +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> +
> >> + return hv_result_to_errno(status);
> >> +}
> >> +
> >> +/* Caller must have validated that dev is a valid pci dev */
> >> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
> >> +{
> >> + struct hv_input_attach_device *input;
> >> + u64 status;
> >> + int rc;
> >> + unsigned long flags;
> >> + union hv_device_id host_devid;
> >> + enum hv_device_type dev_type;
> >> + u64 ptid = hv_iommu_get_curr_partid();
> >> +
> >> + if (ptid == HV_PARTITION_ID_INVALID) {
> >> + pr_err("Hyper-V: Invalid partition id in direct
> >> attach\n");
> >> + return -EINVAL;
> >> + }
> >> +
> >> + if (hv_l1vh_partition())
> >> + dev_type = HV_DEVICE_TYPE_LOGICAL;
> >> + else
> >> + dev_type = HV_DEVICE_TYPE_PCI;
> >> +
> >> + host_devid.as_uint64 = hv_build_devid_oftype(pdev,
> >> dev_type); +
> >> + do {
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> + input->partition_id = ptid;
> >> + input->device_id = host_devid;
> >> +
> >> + /* Hypervisor associates logical_id with this
> >> device, and in
> >> + * some hypercalls like retarget interrupts,
> >> logical_id must be
> >> + * used instead of the BDF. It is a required
> >> parameter.
> >> + */
> >> + input->attdev_flags.logical_id = 1;
> >> + input->logical_devid =
> >> + hv_build_devid_oftype(pdev,
> >> HV_DEVICE_TYPE_LOGICAL); +
> >> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE,
> >> input, NULL);
> >> + local_irq_restore(flags);
> >> +
> >> + if (hv_result(status) ==
> >> HV_STATUS_INSUFFICIENT_MEMORY) {
> >> + rc = hv_call_deposit_pages(NUMA_NO_NODE,
> >> ptid, 1);
> >> + if (rc)
> >> + break;
> >> + }
> >> + } while (hv_result(status) ==
> >> HV_STATUS_INSUFFICIENT_MEMORY); +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> +
> >> + return hv_result_to_errno(status);
> >> +}
> >> +
> >> +/* This to attach a device to both host app (like DPDK) and a
> >> guest VM */
> > The IOMMU driver should be agnostic to the type of consumer,
> > whether a userspace driver or a VM. This comment is not necessary.
> >
> >> +static int hv_iommu_attach_dev(struct iommu_domain *immdom,
> >> struct device *dev,
> >> + struct iommu_domain *old)
> > This does not match upstream kernel prototype, which kernel version
> > is this based on? I will stop here for now.
>
> As I mentioned in the cover letter:
> Based on: 8f0b4cce4481 (origin/hyperv-next)
>
where is this repo?
> which is now 6.19 based.
>
> > struct iommu_domain_ops {
> > int (*attach_dev)(struct iommu_domain *domain, struct device
> > *dev);
>
> I think you got it backwards: 6.6 has this; 6.19 has an extra parameter.
>
you are right, this is a very recent change. my bad.
>
>
> >> +{
> >> + struct pci_dev *pdev;
> >> + int rc;
> >> + struct hv_domain *hvdom_new = to_hv_domain(immdom);
> >> + struct hv_domain *hvdom_prev = dev_iommu_priv_get(dev);
> >> +
> >> + /* Only allow PCI devices for now */
> >> + if (!dev_is_pci(dev))
> >> + return -EINVAL;
> >> +
> >> + pdev = to_pci_dev(dev);
> >> +
> >> + /* l1vh does not support host device (eg DPDK) passthru */
> >> + if (hv_l1vh_partition() && !hv_special_domain(hvdom_new)
> >> &&
> >> + !hvdom_new->attached_dom)
> >> + return -EINVAL;
> >> +
> >> + /*
> >> + * VFIO does not do explicit detach calls, hence check
> >> first if we need
> >> + * to detach first. Also, in case of guest shutdown, it's
> >> the VMM
> >> + * thread that attaches it back to the
> >> hv_def_identity_dom, and
> >> + * hvdom_prev will not be null then. It is null during
> >> boot.
> >> + */
> >> + if (hvdom_prev)
> >> + if (!hv_l1vh_partition() ||
> >> !hv_special_domain(hvdom_prev))
> >> +
> >> hv_iommu_detach_dev(&hvdom_prev->iommu_dom, dev); +
> >> + if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
> >> + dev_iommu_priv_set(dev, hvdom_new); /* sets
> >> "private" field */
> >> + return 0;
> >> + }
> >> +
> >> + if (hvdom_new->attached_dom)
> >> + rc = hv_iommu_direct_attach_device(pdev);
> >> + else
> >> + rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
> >> +
> >> + if (rc && hvdom_prev) {
> >> + int rc1;
> >> +
> >> + if (hvdom_prev->attached_dom)
> >> + rc1 = hv_iommu_direct_attach_device(pdev);
> >> + else
> >> + rc1 = hv_iommu_att_dev2dom(hvdom_prev,
> >> pdev); +
> >> + if (rc1)
> >> + pr_err("Hyper-V: iommu could not restore
> >> orig device state.. dev:%s\n",
> >> + dev_name(dev));
> >> + }
> >> +
> >> + if (rc == 0) {
> >> + dev_iommu_priv_set(dev, hvdom_new); /* sets
> >> "private" field */
> >> + hvdom_new->num_attchd++;
> >> + }
> >> +
> >> + return rc;
> >> +}
> >> +
> >> +static void hv_iommu_det_dev_from_guest(struct hv_domain *hvdom,
> >> + struct pci_dev *pdev)
> >> +{
> >> + struct hv_input_detach_device *input;
> >> + u64 status, log_devid;
> >> + unsigned long flags;
> >> +
> >> + log_devid = hv_build_devid_oftype(pdev,
> >> HV_DEVICE_TYPE_LOGICAL); +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + input->partition_id = hv_iommu_get_curr_partid();
> >> + input->logical_devid = log_devid;
> >> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input,
> >> NULL);
> >> + local_irq_restore(flags);
> >> +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> +}
> >> +
> >> +static void hv_iommu_det_dev_from_dom(struct hv_domain *hvdom,
> >> + struct pci_dev *pdev)
> >> +{
> >> + u64 status, devid;
> >> + unsigned long flags;
> >> + struct hv_input_detach_device_domain *input;
> >> +
> >> + devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
> >> +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + input->partition_id = HV_PARTITION_ID_SELF;
> >> + input->device_id.as_uint64 = devid;
> >> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN,
> >> input, NULL);
> >> + local_irq_restore(flags);
> >> +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> +}
> >> +
> >> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
> >> struct device *dev) +{
> >> + struct pci_dev *pdev;
> >> + struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +
> >> + /* See the attach function, only PCI devices for now */
> >> + if (!dev_is_pci(dev))
> >> + return;
> >> +
> >> + if (hvdom->num_attchd == 0)
> >> + pr_warn("Hyper-V: num_attchd is zero (%s)\n",
> >> dev_name(dev)); +
> >> + pdev = to_pci_dev(dev);
> >> +
> >> + if (hvdom->attached_dom) {
> >> + hv_iommu_det_dev_from_guest(hvdom, pdev);
> >> +
> >> + /* Do not reset attached_dom, hv_iommu_unmap_pages
> >> happens
> >> + * next.
> >> + */
> >> + } else {
> >> + hv_iommu_det_dev_from_dom(hvdom, pdev);
> >> + }
> >> +
> >> + hvdom->num_attchd--;
> >> +}
> >> +
> >> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> >> + unsigned long iova,
> >> phys_addr_t paddr,
> >> + size_t size, u32 flags)
> >> +{
> >> + unsigned long irqflags;
> >> + struct hv_iommu_mapping *mapping;
> >> +
> >> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> >> + if (!mapping)
> >> + return -ENOMEM;
> >> +
> >> + mapping->paddr = paddr;
> >> + mapping->iova.start = iova;
> >> + mapping->iova.last = iova + size - 1;
> >> + mapping->flags = flags;
> >> +
> >> + spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
> >> + interval_tree_insert(&mapping->iova,
> >> &hvdom->mappings_tree);
> >> + spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> >> + unsigned long iova, size_t
> >> size) +{
> >> + unsigned long flags;
> >> + size_t unmapped = 0;
> >> + unsigned long last = iova + size - 1;
> >> + struct hv_iommu_mapping *mapping = NULL;
> >> + struct interval_tree_node *node, *next;
> >> +
> >> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> >> + next = interval_tree_iter_first(&hvdom->mappings_tree,
> >> iova, last);
> >> + while (next) {
> >> + node = next;
> >> + mapping = container_of(node, struct
> >> hv_iommu_mapping, iova);
> >> + next = interval_tree_iter_next(node, iova, last);
> >> +
> >> + /* Trying to split a mapping? Not supported for
> >> now. */
> >> + if (mapping->iova.start < iova)
> >> + break;
> >> +
> >> + unmapped += mapping->iova.last -
> >> mapping->iova.start
> >> + 1; +
> >> + interval_tree_remove(node, &hvdom->mappings_tree);
> >> + kfree(mapping);
> >> + }
> >> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> >> +
> >> + return unmapped;
> >> +}
> >> +
> >> +/* Return: must return exact status from the hypercall without
> >> changes */ +static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
> >> + unsigned long iova, phys_addr_t paddr,
> >> + unsigned long npages, u32 map_flags)
> >> +{
> >> + u64 status;
> >> + int i;
> >> + struct hv_input_map_device_gpa_pages *input;
> >> + unsigned long flags, pfn = paddr >> HV_HYP_PAGE_SHIFT;
> >> +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> >> + input->device_domain.domain_id.type =
> >> HV_DEVICE_DOMAIN_TYPE_S2;
> >> + input->device_domain.domain_id.id = hvdom->domid_num;
> >> + input->map_flags = map_flags;
> >> + input->target_device_va_base = iova;
> >> +
> >> + pfn = paddr >> HV_HYP_PAGE_SHIFT;
> >> + for (i = 0; i < npages; i++, pfn++)
> >> + input->gpa_page_list[i] = pfn;
> >> +
> >> + status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES,
> >> npages, 0,
> >> + input, NULL);
> >> +
> >> + local_irq_restore(flags);
> >> + return status;
> >> +}
> >> +
> >> +/*
> >> + * The core VFIO code loops over memory ranges calling this
> >> function with
> >> + * the largest size from HV_IOMMU_PGSIZES. cond_resched() is in
> >> vfio_iommu_map.
> >> + */
> >> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong
> >> iova,
> >> + phys_addr_t paddr, size_t pgsize,
> >> size_t pgcount,
> >> + int prot, gfp_t gfp, size_t *mapped)
> >> +{
> >> + u32 map_flags;
> >> + int ret;
> >> + u64 status;
> >> + unsigned long npages, done = 0;
> >> + struct hv_domain *hvdom = to_hv_domain(immdom);
> >> + size_t size = pgsize * pgcount;
> >> +
> >> + map_flags = HV_MAP_GPA_READABLE; /* required */
> >> + map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> >> +
> >> + ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size,
> >> map_flags);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + if (hvdom->attached_dom) {
> >> + *mapped = size;
> >> + return 0;
> >> + }
> >> +
> >> + npages = size >> HV_HYP_PAGE_SHIFT;
> >> + while (done < npages) {
> >> + ulong completed, remain = npages - done;
> >> +
> >> + status = hv_iommu_map_pgs(hvdom, iova, paddr,
> >> remain,
> >> + map_flags);
> >> +
> >> + completed = hv_repcomp(status);
> >> + done = done + completed;
> >> + iova = iova + (completed << HV_HYP_PAGE_SHIFT);
> >> + paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
> >> +
> >> + if (hv_result(status) ==
> >> HV_STATUS_INSUFFICIENT_MEMORY) {
> >> + ret = hv_call_deposit_pages(NUMA_NO_NODE,
> >> +
> >> hv_current_partition_id,
> >> + 256);
> >> + if (ret)
> >> + break;
> >> + }
> >> + if (!hv_result_success(status))
> >> + break;
> >> + }
> >> +
> >> + if (!hv_result_success(status)) {
> >> + size_t done_size = done << HV_HYP_PAGE_SHIFT;
> >> +
> >> + hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
> >> + done, npages, iova);
> >> + /*
> >> + * lookup tree has all mappings [0 - size-1].
> >> Below unmap will
> >> + * only remove from [0 - done], we need to remove
> >> second chunk
> >> + * [done+1 - size-1].
> >> + */
> >> + hv_iommu_del_tree_mappings(hvdom, iova, size -
> >> done_size);
> >> + hv_iommu_unmap_pages(immdom, iova - done_size,
> >> pgsize,
> >> + done, NULL);
> >> + if (mapped)
> >> + *mapped = 0;
> >> + } else
> >> + if (mapped)
> >> + *mapped = size;
> >> +
> >> + return hv_result_to_errno(status);
> >> +}
> >> +
> >> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom,
> >> ulong iova,
> >> + size_t pgsize, size_t pgcount,
> >> + struct iommu_iotlb_gather
> >> *gather) +{
> >> + unsigned long flags, npages;
> >> + struct hv_input_unmap_device_gpa_pages *input;
> >> + u64 status;
> >> + struct hv_domain *hvdom = to_hv_domain(immdom);
> >> + size_t unmapped, size = pgsize * pgcount;
> >> +
> >> + unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
> >> + if (unmapped < size)
> >> + pr_err("%s: could not delete all mappings
> >> (%lx:%lx/%lx)\n",
> >> + __func__, iova, unmapped, size);
> >> +
> >> + if (hvdom->attached_dom)
> >> + return size;
> >> +
> >> + npages = size >> HV_HYP_PAGE_SHIFT;
> >> +
> >> + local_irq_save(flags);
> >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> + memset(input, 0, sizeof(*input));
> >> +
> >> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> >> + input->device_domain.domain_id.type =
> >> HV_DEVICE_DOMAIN_TYPE_S2;
> >> + input->device_domain.domain_id.id = hvdom->domid_num;
> >> + input->target_device_va_base = iova;
> >> +
> >> + status =
> >> hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> >> + 0, input, NULL);
> >> + local_irq_restore(flags);
> >> +
> >> + if (!hv_result_success(status))
> >> + hv_status_err(status, "\n");
> >> +
> >> + return unmapped;
> >> +}
> >> +
> >> +static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain
> >> *immdom,
> >> + dma_addr_t iova)
> >> +{
> >> + u64 paddr = 0;
> >> + unsigned long flags;
> >> + struct hv_iommu_mapping *mapping;
> >> + struct interval_tree_node *node;
> >> + struct hv_domain *hvdom = to_hv_domain(immdom);
> >> +
> >> + spin_lock_irqsave(&hvdom->mappings_lock, flags);
> >> + node = interval_tree_iter_first(&hvdom->mappings_tree,
> >> iova, iova);
> >> + if (node) {
> >> + mapping = container_of(node, struct
> >> hv_iommu_mapping, iova);
> >> + paddr = mapping->paddr + (iova -
> >> mapping->iova.start);
> >> + }
> >> + spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
> >> +
> >> + return paddr;
> >> +}
> >> +
> >> +/*
> >> + * Currently, hypervisor does not provide list of devices it is
> >> using
> >> + * dynamically. So use this to allow users to manually specify
> >> devices that
> >> + * should be skipped. (eg. hypervisor debugger using some network
> >> device).
> >> + */
> >> +static struct iommu_device *hv_iommu_probe_device(struct device
> >> *dev) +{
> >> + if (!dev_is_pci(dev))
> >> + return ERR_PTR(-ENODEV);
> >> +
> >> + if (pci_devs_to_skip && *pci_devs_to_skip) {
> >> + int rc, pos = 0;
> >> + int parsed;
> >> + int segment, bus, slot, func;
> >> + struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> + do {
> >> + parsed = 0;
> >> +
> >> + rc = sscanf(pci_devs_to_skip + pos, "
> >> (%x:%x:%x.%x) %n",
> >> + &segment, &bus, &slot, &func,
> >> &parsed);
> >> + if (rc)
> >> + break;
> >> + if (parsed <= 0)
> >> + break;
> >> +
> >> + if (pci_domain_nr(pdev->bus) == segment &&
> >> + pdev->bus->number == bus &&
> >> + PCI_SLOT(pdev->devfn) == slot &&
> >> + PCI_FUNC(pdev->devfn) == func) {
> >> +
> >> + dev_info(dev, "skipped by Hyper-V
> >> IOMMU\n");
> >> + return ERR_PTR(-ENODEV);
> >> + }
> >> + pos += parsed;
> >> +
> >> + } while (pci_devs_to_skip[pos]);
> >> + }
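On the parsing itself: sscanf() returns the number of conversions, so
with a well-formed entry rc is 4 and the `if (rc) break` above exits the
loop before the BDF comparison; unless I am misreading, that was
probably meant to be `if (rc != 4)`. A quick userspace check of the
format string (parse_one() is a hypothetical helper, not part of the
patch):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Userspace check of the skip-list format used above. The module
 * parameter is expected to look like "(seg:bus:slot.func)", optionally
 * repeated, e.g. "(0000:00:1f.2) (0002:03:00.0)". The " %n" at the end
 * records how many characters one entry consumed, for advancing pos.
 */
static int parse_one(const char *s, int *seg, int *bus, int *slot,
		     int *func, int *consumed)
{
	*consumed = 0;
	return sscanf(s, " (%x:%x:%x.%x) %n", seg, bus, slot, func,
		      consumed);
}
```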
> >> +
> >> + /* Device will be explicitly attached to the default
> >> domain, so no need
> >> + * to do dev_iommu_priv_set() here.
> >> + */
> >> +
> >> + return &hv_virt_iommu;
> >> +}
> >> +
> >> +static void hv_iommu_probe_finalize(struct device *dev)
> >> +{
> >> + struct iommu_domain *immdom =
> >> iommu_get_domain_for_dev(dev); +
> >> + if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
> >> + iommu_setup_dma_ops(dev);
> >> + else
> >> + set_dma_ops(dev, NULL);
> >> +}
> >> +
> >> +static void hv_iommu_release_device(struct device *dev)
> >> +{
> >> + struct hv_domain *hvdom = dev_iommu_priv_get(dev);
> >> +
> >> + /* Need to detach device from device domain if necessary.
> >> */
> >> + if (hvdom)
> >> + hv_iommu_detach_dev(&hvdom->iommu_dom, dev);
> >> +
> >> + dev_iommu_priv_set(dev, NULL);
> >> + set_dma_ops(dev, NULL);
> >> +}
> >> +
> >> +static struct iommu_group *hv_iommu_device_group(struct device
> >> *dev) +{
> >> + if (dev_is_pci(dev))
> >> + return pci_device_group(dev);
> >> + else
> >> + return generic_device_group(dev);
> >> +}
> >> +
> >> +static int hv_iommu_def_domain_type(struct device *dev)
> >> +{
> >> + /* The hypervisor always creates this by default during
> >> boot */
> >> + return IOMMU_DOMAIN_IDENTITY;
> >> +}
> >> +
> >> +static struct iommu_ops hv_iommu_ops = {
> >> + .capable = hv_iommu_capable,
> >> + .domain_alloc_identity =
> >> hv_iommu_domain_alloc_identity,
> >> + .domain_alloc_paging =
> >> hv_iommu_domain_alloc_paging,
> >> + .probe_device = hv_iommu_probe_device,
> >> + .probe_finalize = hv_iommu_probe_finalize,
> >> + .release_device = hv_iommu_release_device,
> >> + .def_domain_type = hv_iommu_def_domain_type,
> >> + .device_group = hv_iommu_device_group,
> >> + .default_domain_ops = &(const struct iommu_domain_ops) {
> >> + .attach_dev = hv_iommu_attach_dev,
> >> + .map_pages = hv_iommu_map_pages,
> >> + .unmap_pages = hv_iommu_unmap_pages,
> >> + .iova_to_phys = hv_iommu_iova_to_phys,
> >> + .free = hv_iommu_domain_free,
> >> + },
> >> + .owner = THIS_MODULE,
> >> +};
> >> +
> >> +static void __init hv_initialize_special_domains(void)
> >> +{
> >> + hv_def_identity_dom.iommu_dom.geometry = default_geometry;
> >> + hv_def_identity_dom.domid_num =
> >> HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */ +}
> > This could be initialized statically.
> >
> >> +
> >> +static int __init hv_iommu_init(void)
> >> +{
> >> + int ret;
> >> + struct iommu_device *iommup = &hv_virt_iommu;
> >> +
> >> + if (!hv_is_hyperv_initialized())
> >> + return -ENODEV;
> >> +
> >> + ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s",
> >> "hyperv-iommu");
> >> + if (ret) {
> >> + pr_err("Hyper-V: iommu_device_sysfs_add failed:
> >> %d\n", ret);
> >> + return ret;
> >> + }
> >> +
> >> + /* This must come before iommu_device_register because the
> >> latter calls
> >> + * into the hooks.
> >> + */
> >> + hv_initialize_special_domains();
> >> +
> >> + ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
> >> + if (ret) {
> >> + pr_err("Hyper-V: iommu_device_register failed:
> >> %d\n", ret);
> >> + goto err_sysfs_remove;
> >> + }
> >> +
> >> + pr_info("Hyper-V IOMMU initialized\n");
> >> +
> >> + return 0;
> >> +
> >> +err_sysfs_remove:
> >> + iommu_device_sysfs_remove(iommup);
> >> + return ret;
> >> +}
> >> +
> >> +void __init hv_iommu_detect(void)
> >> +{
> >> + if (no_iommu || iommu_detected)
> >> + return;
> >> +
> >> + /* For l1vh, always expose an iommu unit */
> >> + if (!hv_l1vh_partition())
> >> + if (!(ms_hyperv.misc_features &
> >> HV_DEVICE_DOMAIN_AVAILABLE))
> >> + return;
> >> +
> >> + iommu_detected = 1;
> >> + x86_init.iommu.iommu_init = hv_iommu_init;
> >> +
> >> + pci_request_acs();
> >> +}
> >> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> >> index dfc516c1c719..2ad111727e82 100644
> >> --- a/include/linux/hyperv.h
> >> +++ b/include/linux/hyperv.h
> >> @@ -1767,4 +1767,10 @@ static inline unsigned long
> >> virt_to_hvpfn(void *addr) #define HVPFN_DOWN(x) ((x) >>
> >> HV_HYP_PAGE_SHIFT) #define page_to_hvpfn(page)
> >> (page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
> >> +#ifdef CONFIG_HYPERV_IOMMU
> >> +void __init hv_iommu_detect(void);
> >> +#else
> >> +static inline void hv_iommu_detect(void) { }
> >> +#endif /* CONFIG_HYPERV_IOMMU */
> >> +
> >> #endif /* _HYPERV_H */
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-27 19:21 ` Jacob Pan
@ 2026-01-27 22:31 ` Jacob Pan
2026-01-30 22:10 ` Mukesh R
2026-01-30 23:44 ` Mukesh R
1 sibling, 1 reply; 69+ messages in thread
From: Jacob Pan @ 2026-01-27 22:31 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
Hi Mukesh,
> > >> +
> > >> + if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
> > >> !hv_no_attdev) {
> > >> + pr_err("Hyper-V: l1vh iommu does not support
> > >> host devices\n");
> > > why is this an error if the user chose not to do direct
> > > attach?
> >
> > Like the error message says: on l1vh, direct attaches of host
> > devices (eg dpdk) are not supported, and l1vh only does direct
> > attaches. IOW, no host devices on l1vh.
> >
> This hv_no_attdev flag is really confusing to me: by default
> hv_no_attdev is false, which allows direct attach, and you are saying
> l1vh allows it.
>
> Why does this flag also control host device attachment in l1vh? If you
> can tell the difference between a direct host device attach and other
> direct attaches, why not always reject host attach in l1vh?
On second thought, if the hv_no_attdev knob is only meant to control
host domain attach vs. direct attach, then it is irrelevant on L1VH.
Would it make more sense to rename this to something like
hv_host_disable_direct_attach? That would better reflect its scope and
allow it to be ignored under L1VH, and reduce the risk of users
misinterpreting or misusing it.
* Re: [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg
2026-01-20 6:42 ` [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg Mukesh R
@ 2026-01-28 14:03 ` Manivannan Sadhasivam
0 siblings, 0 replies; 69+ messages in thread
From: Manivannan Sadhasivam @ 2026-01-28 14:03 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:23PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Main change here is to rename hv_compose_msi_msg to
> hv_vmbus_compose_msi_msg as we introduce hv_compose_msi_msg in upcoming
> patches that builds MSI messages for both VMBus and non-VMBus cases. VMBus
> is not used on baremetal root partition for example.
> While at it, replace
> spaces with tabs and fix some formatting involving excessive line wraps.
>
Don't mix up cleanup changes. Do it in a separate patch.
- Mani
> There is no functional change.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> drivers/pci/controller/pci-hyperv.c | 95 +++++++++++++++--------------
> 1 file changed, 48 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1e237d3538f9..8bc6a38c9b5a 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -30,7 +30,7 @@
> * function's configuration space is zero.
> *
> * The rest of this driver mostly maps PCI concepts onto underlying Hyper-V
> - * facilities. For instance, the configuration space of a function exposed
> + * facilities. For instance, the configuration space of a function exposed
> * by Hyper-V is mapped into a single page of memory space, and the
> * read and write handlers for config space must be aware of this mechanism.
> * Similarly, device setup and teardown involves messages sent to and from
> @@ -109,33 +109,33 @@ enum pci_message_type {
> /*
> * Version 1.1
> */
> - PCI_MESSAGE_BASE = 0x42490000,
> - PCI_BUS_RELATIONS = PCI_MESSAGE_BASE + 0,
> - PCI_QUERY_BUS_RELATIONS = PCI_MESSAGE_BASE + 1,
> - PCI_POWER_STATE_CHANGE = PCI_MESSAGE_BASE + 4,
> + PCI_MESSAGE_BASE = 0x42490000,
> + PCI_BUS_RELATIONS = PCI_MESSAGE_BASE + 0,
> + PCI_QUERY_BUS_RELATIONS = PCI_MESSAGE_BASE + 1,
> + PCI_POWER_STATE_CHANGE = PCI_MESSAGE_BASE + 4,
> PCI_QUERY_RESOURCE_REQUIREMENTS = PCI_MESSAGE_BASE + 5,
> - PCI_QUERY_RESOURCE_RESOURCES = PCI_MESSAGE_BASE + 6,
> - PCI_BUS_D0ENTRY = PCI_MESSAGE_BASE + 7,
> - PCI_BUS_D0EXIT = PCI_MESSAGE_BASE + 8,
> - PCI_READ_BLOCK = PCI_MESSAGE_BASE + 9,
> - PCI_WRITE_BLOCK = PCI_MESSAGE_BASE + 0xA,
> - PCI_EJECT = PCI_MESSAGE_BASE + 0xB,
> - PCI_QUERY_STOP = PCI_MESSAGE_BASE + 0xC,
> - PCI_REENABLE = PCI_MESSAGE_BASE + 0xD,
> - PCI_QUERY_STOP_FAILED = PCI_MESSAGE_BASE + 0xE,
> - PCI_EJECTION_COMPLETE = PCI_MESSAGE_BASE + 0xF,
> - PCI_RESOURCES_ASSIGNED = PCI_MESSAGE_BASE + 0x10,
> - PCI_RESOURCES_RELEASED = PCI_MESSAGE_BASE + 0x11,
> - PCI_INVALIDATE_BLOCK = PCI_MESSAGE_BASE + 0x12,
> - PCI_QUERY_PROTOCOL_VERSION = PCI_MESSAGE_BASE + 0x13,
> - PCI_CREATE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x14,
> - PCI_DELETE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x15,
> + PCI_QUERY_RESOURCE_RESOURCES = PCI_MESSAGE_BASE + 6,
> + PCI_BUS_D0ENTRY = PCI_MESSAGE_BASE + 7,
> + PCI_BUS_D0EXIT = PCI_MESSAGE_BASE + 8,
> + PCI_READ_BLOCK = PCI_MESSAGE_BASE + 9,
> + PCI_WRITE_BLOCK = PCI_MESSAGE_BASE + 0xA,
> + PCI_EJECT = PCI_MESSAGE_BASE + 0xB,
> + PCI_QUERY_STOP = PCI_MESSAGE_BASE + 0xC,
> + PCI_REENABLE = PCI_MESSAGE_BASE + 0xD,
> + PCI_QUERY_STOP_FAILED = PCI_MESSAGE_BASE + 0xE,
> + PCI_EJECTION_COMPLETE = PCI_MESSAGE_BASE + 0xF,
> + PCI_RESOURCES_ASSIGNED = PCI_MESSAGE_BASE + 0x10,
> + PCI_RESOURCES_RELEASED = PCI_MESSAGE_BASE + 0x11,
> + PCI_INVALIDATE_BLOCK = PCI_MESSAGE_BASE + 0x12,
> + PCI_QUERY_PROTOCOL_VERSION = PCI_MESSAGE_BASE + 0x13,
> + PCI_CREATE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x14,
> + PCI_DELETE_INTERRUPT_MESSAGE = PCI_MESSAGE_BASE + 0x15,
> PCI_RESOURCES_ASSIGNED2 = PCI_MESSAGE_BASE + 0x16,
> PCI_CREATE_INTERRUPT_MESSAGE2 = PCI_MESSAGE_BASE + 0x17,
> PCI_DELETE_INTERRUPT_MESSAGE2 = PCI_MESSAGE_BASE + 0x18, /* unused */
> PCI_BUS_RELATIONS2 = PCI_MESSAGE_BASE + 0x19,
> - PCI_RESOURCES_ASSIGNED3 = PCI_MESSAGE_BASE + 0x1A,
> - PCI_CREATE_INTERRUPT_MESSAGE3 = PCI_MESSAGE_BASE + 0x1B,
> + PCI_RESOURCES_ASSIGNED3 = PCI_MESSAGE_BASE + 0x1A,
> + PCI_CREATE_INTERRUPT_MESSAGE3 = PCI_MESSAGE_BASE + 0x1B,
> PCI_MESSAGE_MAXIMUM
> };
>
> @@ -1775,20 +1775,21 @@ static u32 hv_compose_msi_req_v1(
> * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is
> * not irrelevant because Hyper-V chooses the physical CPU to handle the
> * interrupts based on the vCPU specified in message sent to the vPCI VSP in
> - * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest,
> - * but assigning too many vPCI device interrupts to the same pCPU can cause a
> - * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V
> - * to spread out the pCPUs that it selects.
> + * hv_vmbus_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the
> + * guest, but assigning too many vPCI device interrupts to the same pCPU can
> + * cause a performance bottleneck. So we spread out the dummy vCPUs to influence
> + * Hyper-V to spread out the pCPUs that it selects.
> *
> * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu()
> * to always return the same dummy vCPU, because a second call to
> - * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a
> - * new pCPU for the interrupt. But for the multi-MSI case, the second call to
> - * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the
> - * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that
> - * the pCPUs are spread out. All interrupts for a multi-MSI device end up using
> - * the same pCPU, even though the vCPUs will be spread out by later calls
> - * to hv_irq_unmask(), but that is the best we can do now.
> + * hv_vmbus_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to
> + * choose a new pCPU for the interrupt. But for the multi-MSI case, the second
> + * call to hv_vmbus_compose_msi_msg() exits without sending a message to the
> + * vPCI VSP, so the original dummy vCPU is used. This dummy vCPU must be
> + * round-robin'ed so that the pCPUs are spread out. All interrupts for a
> + * multi-MSI device end up using the same pCPU, even though the vCPUs will be
> + * spread out by later calls to hv_irq_unmask(), but that is the best we can do
> + * now.
> *
> * With Hyper-V in Nov 2022, the HVCALL_RETARGET_INTERRUPT hypercall does *not*
> * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an
> @@ -1863,7 +1864,7 @@ static u32 hv_compose_msi_req_v3(
> }
>
> /**
> - * hv_compose_msi_msg() - Supplies a valid MSI address/data
> + * hv_vmbus_compose_msi_msg() - Supplies a valid MSI address/data
> * @data: Everything about this MSI
> * @msg: Buffer that is filled in by this function
> *
> @@ -1873,7 +1874,7 @@ static u32 hv_compose_msi_req_v3(
> * response supplies a data value and address to which that data
> * should be written to trigger that interrupt.
> */
> -static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> +static void hv_vmbus_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> {
> struct hv_pcibus_device *hbus;
> struct vmbus_channel *channel;
> @@ -1955,7 +1956,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> return;
> }
> /*
> - * The vector we select here is a dummy value. The correct
> + * The vector we select here is a dummy value. The correct
> * value gets sent to the hypervisor in unmask(). This needs
> * to be aligned with the count, and also not zero. Multi-msi
> * is powers of 2 up to 32, so 32 will always work here.
> @@ -2047,7 +2048,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>
> /*
> * Make sure that the ring buffer data structure doesn't get
> - * freed while we dereference the ring buffer pointer. Test
> + * freed while we dereference the ring buffer pointer. Test
> * for the channel's onchannel_callback being NULL within a
> * sched_lock critical section. See also the inline comments
> * in vmbus_reset_channel_cb().
> @@ -2147,7 +2148,7 @@ static const struct msi_parent_ops hv_pcie_msi_parent_ops = {
> /* HW Interrupt Chip Descriptor */
> static struct irq_chip hv_msi_irq_chip = {
> .name = "Hyper-V PCIe MSI",
> - .irq_compose_msi_msg = hv_compose_msi_msg,
> + .irq_compose_msi_msg = hv_vmbus_compose_msi_msg,
> .irq_set_affinity = irq_chip_set_affinity_parent,
> .irq_ack = irq_chip_ack_parent,
> .irq_eoi = irq_chip_eoi_parent,
> @@ -2159,8 +2160,8 @@ static int hv_pcie_domain_alloc(struct irq_domain *d, unsigned int virq, unsigne
> void *arg)
> {
> /*
> - * TODO: Allocating and populating struct tran_int_desc in hv_compose_msi_msg()
> - * should be moved here.
> + * TODO: Allocating and populating struct tran_int_desc in
> + * hv_vmbus_compose_msi_msg() should be moved here.
> */
> int ret;
>
> @@ -2227,7 +2228,7 @@ static int hv_pcie_init_irq_domain(struct hv_pcibus_device *hbus)
> /**
> * get_bar_size() - Get the address space consumed by a BAR
> * @bar_val: Value that a BAR returned after -1 was written
> - * to it.
> + * to it.
> *
> * This function returns the size of the BAR, rounded up to 1
> * page. It has to be rounded up because the hypervisor's page
> @@ -2573,7 +2574,7 @@ static void q_resource_requirements(void *context, struct pci_response *resp,
> * new_pcichild_device() - Create a new child device
> * @hbus: The internal struct tracking this root PCI bus.
> * @desc: The information supplied so far from the host
> - * about the device.
> + * about the device.
> *
> * This function creates the tracking structure for a new child
> * device and kicks off the process of figuring out what it is.
> @@ -3100,7 +3101,7 @@ static void hv_pci_onchannelcallback(void *context)
> * sure that the packet pointer is still valid during the call:
> * here 'valid' means that there's a task still waiting for the
> * completion, and that the packet data is still on the waiting
> - * task's stack. Cf. hv_compose_msi_msg().
> + * task's stack. Cf. hv_vmbus_compose_msi_msg().
> */
> comp_packet->completion_func(comp_packet->compl_ctxt,
> response,
> @@ -3417,7 +3418,7 @@ static int hv_allocate_config_window(struct hv_pcibus_device *hbus)
> * vmbus_allocate_mmio() gets used for allocating both device endpoint
> * resource claims (those which cannot be overlapped) and the ranges
> * which are valid for the children of this bus, which are intended
> - * to be overlapped by those children. Set the flag on this claim
> + * to be overlapped by those children. Set the flag on this claim
> * meaning that this region can't be overlapped.
> */
>
> @@ -4066,7 +4067,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> irq_data = irq_get_irq_data(entry->irq);
> if (WARN_ON_ONCE(!irq_data))
> return -EINVAL;
> - hv_compose_msi_msg(irq_data, &entry->msg);
> + hv_vmbus_compose_msi_msg(irq_data, &entry->msg);
> }
> return 0;
> }
> @@ -4074,7 +4075,7 @@ static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
> /*
> * Upon resume, pci_restore_msi_state() -> ... -> __pci_write_msi_msg()
> * directly writes the MSI/MSI-X registers via MMIO, but since Hyper-V
> - * doesn't trap and emulate the MMIO accesses, here hv_compose_msi_msg()
> + * doesn't trap and emulate the MMIO accesses, here hv_vmbus_compose_msi_msg()
> * must be used to ask Hyper-V to re-create the IOMMU Interrupt Remapping
> * Table entries.
> */
> --
> 2.51.2.vfs.0.1
>
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device
2026-01-24 0:42 ` Mukesh R
2026-01-26 20:50 ` Stanislav Kinsburskii
@ 2026-01-28 14:36 ` Manivannan Sadhasivam
1 sibling, 0 replies; 69+ messages in thread
From: Manivannan Sadhasivam @ 2026-01-28 14:36 UTC (permalink / raw)
To: Mukesh R
Cc: Stanislav Kinsburskii, linux-kernel, linux-hyperv,
linux-arm-kernel, iommu, linux-pci, linux-arch, kys, haiyangz,
wei.liu, decui, longli, catalin.marinas, will, tglx, mingo, bp,
dave.hansen, hpa, joro, lpieralisi, kwilczynski, robh, bhelgaas,
arnd, nunodasneves, mhklinux, romank
On Fri, Jan 23, 2026 at 04:42:54PM -0800, Mukesh R wrote:
> On 1/20/26 14:22, Stanislav Kinsburskii wrote:
> > On Mon, Jan 19, 2026 at 10:42:25PM -0800, Mukesh R wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > >
> > > On Hyper-V, most hypercalls related to PCI passthru (to map/unmap regions,
> > > interrupts, etc.) need a device id as a parameter. This device id refers
> > > to that specific device during the lifetime of passthru.
> > >
> > > An L1VH VM only contains VMBus based devices. A device id for a VMBus
> > > device is slightly different in that it uses the hv_pcibus_device info
> > > for building it to make sure it matches exactly what the hypervisor
> > > expects. This VMBus based device id is needed when attaching devices in
> > > an L1VH based guest VM. Before building it, a check is done to make sure
> > > the device is a valid VMBus device.
> > >
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > > arch/x86/include/asm/mshyperv.h | 2 ++
> > > drivers/pci/controller/pci-hyperv.c | 29 +++++++++++++++++++++++++++++
> > > 2 files changed, 31 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> > > index eef4c3a5ba28..0d7fdfb25e76 100644
> > > --- a/arch/x86/include/asm/mshyperv.h
> > > +++ b/arch/x86/include/asm/mshyperv.h
> > > @@ -188,6 +188,8 @@ bool hv_vcpu_is_preempted(int vcpu);
> > > static inline void hv_apic_init(void) {}
> > > #endif
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
> > > +
> > > struct irq_domain *hv_create_pci_msi_domain(void);
> > > int hv_map_msi_interrupt(struct irq_data *data,
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index 8bc6a38c9b5a..40f0b06bb966 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -579,6 +579,8 @@ static void hv_pci_onchannelcallback(void *context);
> > > #define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
> > > #define HV_MSI_CHIP_FLAGS MSI_CHIP_FLAG_SET_ACK
> > > +static bool hv_vmbus_pci_device(struct pci_bus *pbus);
> > > +
> >
> > Why not move this static function definition above the caller instead of
> > defining the prototype?
>
> Did you see the function implementation? It has other dependencies that
> come later; it would need a code reorg.
>
> Thanks,
> -Mukesh
>
>
> > > static int hv_pci_irqchip_init(void)
> > > {
> > > return 0;
> > > @@ -598,6 +600,26 @@ static unsigned int hv_msi_get_int_vector(struct irq_data *data)
> > > #define hv_msi_prepare pci_msi_prepare
> > > +u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
> > > +{
> > > + u64 u64val;
> >
> > This variable is redundant.
>
> Not really. It helps with debugging by allowing a quick print, and is
> harmless.
>
Such a debug print does not exist now, so there is no need for the variable; drop it.
- Mani
--
மணிவண்ணன் சதாசிவம்
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-27 22:31 ` Jacob Pan
@ 2026-01-30 22:10 ` Mukesh R
0 siblings, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-30 22:10 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/27/26 14:31, Jacob Pan wrote:
> Hi Mukesh,
>
>>>>> +
>>>>> + if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
>>>>> !hv_no_attdev) {
>>>>> + pr_err("Hyper-V: l1vh iommu does not support
>>>>> host devices\n");
>>>> why is this an error if the user chose not to do direct
>>>> attach?
>>>
>>> Like the error message says: on l1vh, direct attaches of host
>>> devices (eg dpdk) is not supported. and l1vh only does direct
>>> attaches. IOW, no host devices on l1vh.
>>>
>> This hv_no_attdev flag is really confusing to me, by default
>> hv_no_attdev is false, which allows direct attach. And you are saying
>> l1vh allows it.
>>
>> Why does this flag also control host device attachment in l1vh? If you
>> can tell the difference between direct host device attach and other
>> direct attach, why don't you always reject host attach in l1vh?
> On second thought, if the hv_no_attdev knob is only meant to control
> host domain attach vs. direct attach, then it is irrelevant on L1VH.
>
> Would it make more sense to rename this to something like
> hv_host_disable_direct_attach? That would better reflect its scope and
> allow it to be ignored under L1VH, and reduce the risk of users
> misinterpreting or misusing it.
It would, but it is a kernel parameter and needs to be terse. It would
be documented properly though, so we should be ok.
Thanks,
-Mukesh
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-27 18:57 ` Stanislav Kinsburskii
@ 2026-01-30 22:17 ` Mukesh R
2026-02-02 16:30 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-30 22:17 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/27/26 10:57, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
>> On 1/26/26 10:15, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
>>>> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>>>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>
>>>>>> Upon guest access, in case of missing mmio mapping, the hypervisor
>>>>>> generates an unmapped gpa intercept. In this path, lookup the PCI
>>>>>> resource pfn for the guest gpa, and ask the hypervisor to map it
>>>>>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>>>>>> and obtained via fixup_user_fault call (similar to KVM).
>>>>>>
>>>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>> ---
>>>>>> drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 115 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>>>> index 03f3aa9f5541..4c8bc7cd0888 100644
>>>>>> --- a/drivers/hv/mshv_root_main.c
>>>>>> +++ b/drivers/hv/mshv_root_main.c
>>>>>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>>>>>> };
>>>>>> } __packed;
>>>>>> +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
>>>>>> +static int __init setup_hv_full_mmio(char *str)
>>>>>> +{
>>>>>> + hv_nofull_mmio = true;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>>>>>> +
>>>>>> struct mshv_root mshv_root;
>>>>>> enum hv_scheduler_type hv_scheduler_type;
>>>>>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>>>>>> }
>>>>>> #ifdef CONFIG_X86_64
>>>>>> +
>>>>>> +/*
>>>>>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>>>>>> + * else just return -errno.
>>>>>> + */
>>>>>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>>>>>> + u64 *mmio_pfnp)
>>>>>> +{
>>>>>> + struct vm_area_struct *vma;
>>>>>> + bool is_mmio;
>>>>>> + u64 uaddr;
>>>>>> + struct mshv_mem_region *mreg;
>>>>>> + struct follow_pfnmap_args pfnmap_args;
>>>>>> + int rc = -EINVAL;
>>>>>> +
>>>>>> + /*
>>>>>> + * Do not allow mem region to be deleted beneath us. VFIO uses
>>>>>> + * useraddr vma to lookup pci bar pfn.
>>>>>> + */
>>>>>> + spin_lock(&pt->pt_mem_regions_lock);
>>>>>> +
>>>>>> + /* Get the region again under the lock */
>>>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>> + goto unlock_pt_out;
>>>>>> +
>>>>>> + uaddr = mreg->start_uaddr +
>>>>>> + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>>>>>> +
>>>>>> + mmap_read_lock(current->mm);
>>>>>
>>>>> A semaphore can't be taken under a spinlock.
>>>
>>>>
>>>> Yeah, something didn't feel right here and I meant to recheck; now I regret
>>>> rushing to submit the patch.
>>>>
>>>> Rethinking, I think the pt_mem_regions_lock is not needed to protect
>>>> the uaddr because unmap will properly serialize via the mm lock.
>>>>
>>>>
>>>>>> + vma = vma_lookup(current->mm, uaddr);
>>>>>> + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>>>>>
>>>>> Why is this check needed again?
>>>>
>>>> To make sure the region did not change. This check is under the lock.
>>>>
>>>
>>> How can this happen? One can't change VMA type without unmapping it
>>> first. And unmapping it leaves kernel MMIO region state dangling
>>> around without a corresponding user space mapping.
>>
>> Right, and vm_flags would not be mmio expected then.
>>
>>> This is similar to dangling pinned regions and should likely be
>>> addressed the same way by utilizing MMU notifiers to destroy memory
>>> regions if the VMA is detached.
>>
>> I don't think we need that. Either it succeeds if the region did not
>> change at all, or just fails.
>>
>
> I'm afraid we do, as if the driver mapped a page with the previous
> memory region, and then the region is unmapped, the page will stay
> mapped in the hypervisor, but will be considered free by the kernel, which
> in turn will lead to a GPF upon the next allocation.
There are no ram pages for mmio regions. Also, we don't do much with
mmio regions other than tell the hyp about it.
Thanks,
-Mukesh
> With pinned regions the issue is similar but less impactful: pages can't
> be released by user space unmapping and thus will be simply leaked, but
> the system stays intact.
>
> MMIO regions are similar to movable regions in this regard: they don't
> reference the user pages, and thus this guest region replacement is a
> straight way to a kernel panic.
>
>>
>>>>> The region type is stored on the region itself.
>>>>> And the type is checked on the caller side.
>>>>>
>>>>>> + if (!is_mmio)
>>>>>> + goto unlock_mmap_out;
>>>>>> +
>>>>>> + pfnmap_args.vma = vma;
>>>>>> + pfnmap_args.address = uaddr;
>>>>>> +
>>>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>>>> + if (rc) {
>>>>>> + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>>>>>> + NULL);
>>>>>> + if (rc)
>>>>>> + goto unlock_mmap_out;
>>>>>> +
>>>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>>>> + if (rc)
>>>>>> + goto unlock_mmap_out;
>>>>>> + }
>>>>>> +
>>>>>> + *mmio_pfnp = pfnmap_args.pfn;
>>>>>> + follow_pfnmap_end(&pfnmap_args);
> >>>>>> +
>>>>>> +unlock_mmap_out:
>>>>>> + mmap_read_unlock(current->mm);
>>>>>> +unlock_pt_out:
>>>>>> + spin_unlock(&pt->pt_mem_regions_lock);
>>>>>> + return rc;
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>>>>>> + * and resolve if possible.
>>>>>> + * Returns: True if valid mmio intercept and it was handled, else false
>>>>>> + */
>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>>>>>> +{
>>>>>> + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>>>>>> + struct hv_x64_memory_intercept_message *msg;
>>>>>> + union hv_x64_memory_access_info accinfo;
>>>>>> + u64 gfn, mmio_spa, numpgs;
>>>>>> + struct mshv_mem_region *mreg;
>>>>>> + int rc;
>>>>>> + struct mshv_partition *pt = vp->vp_partition;
>>>>>> +
>>>>>> + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>>>>>> + accinfo = msg->memory_access_info;
>>>>>> +
>>>>>> + if (!accinfo.gva_gpa_valid)
>>>>>> + return false;
>>>>>> +
>>>>>> + /* Do a fast check and bail if non mmio intercept */
>>>>>> + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>>>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>
>>>>> This call needs to be protected by the spinlock.
>>>>
>>>> This is sort of a fast path to bail. We recheck under the partition lock above.
>>>>
>>>
>>> Accessing the list of regions without lock is unsafe.
>>
>> I am not sure why? This check is done by a vcpu thread, so regions
>> will not have just gone away.
>>
>
> This is a shared resource. Multiple VP threads get into this function
> simultaneously, so there is a race already. But this one we can live
> with without locking as they don't mutate the list of the regions.
>
> The issue happens when the VMM adds or removes another region, as that
> mutates the list and races with VP threads doing this lookup.
>
> Thanks,
> Stanislav
>
>
>> Thanks,
>> -Mukesh
>>
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>> + return false;
>>>>>> +
>>>>>> + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>>>>>> + if (rc)
>>>>>> + return false;
>>>>>> +
>>>>>> + if (!hv_nofull_mmio) { /* default case */
>>>>>> + gfn = mreg->start_gfn;
>>>>>> + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>>>>>> + numpgs = mreg->nr_pages;
>>>>>> + } else
>>>>>> + numpgs = 1;
>>>>>> +
>>>>>> + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>>>>>> +
>>>>>> + return rc == 0;
>>>>>> +}
>>>>>> +
>>>>>> static struct mshv_mem_region *
>>>>>> mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>>>>>> {
>>>>>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>>>>> return ret;
>>>>>> }
>>>>>> +
>>>>>> #else /* CONFIG_X86_64 */
>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>>>>>> static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>>>>>> #endif /* CONFIG_X86_64 */
>>>>>> static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>>>>>> {
>>>>>> switch (vp->vp_intercept_msg_page->header.message_type) {
>>>>>> + case HVMSG_UNMAPPED_GPA:
>>>>>> + return mshv_handle_unmapped_gpa(vp);
>>>>>> case HVMSG_GPA_INTERCEPT:
>>>>>> return mshv_handle_gpa_intercept(vp);
>>>>>> }
>>>>>> --
>>>>>> 2.51.2.vfs.0.1
>>>>>>
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-27 18:46 ` Stanislav Kinsburskii
@ 2026-01-30 22:51 ` Mukesh R
2026-02-02 16:20 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-01-30 22:51 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On 1/27/26 10:46, Stanislav Kinsburskii wrote:
> On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
>> On 1/26/26 07:57, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
>>>> On 1/20/26 16:12, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
>>>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>
>>>>>> Add a new file to implement management of device domains, mapping and
>>>>>> unmapping of iommu memory, and other iommu_ops to fit within the VFIO
>>>>>> framework for PCI passthru on Hyper-V running Linux as root or L1VH
>>>>>> parent. This also implements direct attach mechanism for PCI passthru,
>>>>>> and it is also made to work within the VFIO framework.
>>>>>>
>>>>>> At a high level, during boot the hypervisor creates a default identity
>>>>>> domain and attaches all devices to it. This nicely maps to Linux iommu
>>>>>> subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
>>>>>> need to explicitly ask Hyper-V to attach devices and do maps/unmaps
>>>>>> during boot. As mentioned previously, Hyper-V supports two ways to do
>>>>>> PCI passthru:
>>>>>>
>>>>>> 1. Device Domain: root must create a device domain in the hypervisor,
>>>>>> and do map/unmap hypercalls for mapping and unmapping guest RAM.
>>>>>> All hypervisor communications use device id of type PCI for
>>>>>> identifying and referencing the device.
>>>>>>
>>>>>> 2. Direct Attach: the hypervisor will simply use the guest's HW
>>>>>> page table for mappings, thus the host need not do map/unmap
>>>>>> device memory hypercalls. As such, direct attach passthru setup
>>>>>> during guest boot is extremely fast. A direct attached device
>>>>>> must be referenced via logical device id and not via the PCI
>>>>>> device id.
>>>>>>
>>>>>> At present, L1VH root/parent only supports direct attaches. Also direct
>>>>>> attach is default in non-L1VH cases because there are some significant
>>>>>> performance issues with device domain implementation currently for guests
>>>>>> with higher RAM (say more than 8GB), and that unfortunately cannot be
>>>>>> addressed in the short term.
>>>>>>
>>>>>
>>>>> <snip>
>>>>>
>>>
>>> <snip>
>>>
>>>>>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
>>>>>> +{
>>>>>> + struct pci_dev *pdev;
>>>>>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>>>>>> +
>>>>>> + /* See the attach function, only PCI devices for now */
>>>>>> + if (!dev_is_pci(dev))
>>>>>> + return;
>>>>>> +
>>>>>> + if (hvdom->num_attchd == 0)
>>>>>> + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
>>>>>> +
>>>>>> + pdev = to_pci_dev(dev);
>>>>>> +
>>>>>> + if (hvdom->attached_dom) {
>>>>>> + hv_iommu_det_dev_from_guest(hvdom, pdev);
>>>>>> +
>>>>>> + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
>>>>>> + * next.
>>>>>> + */
>>>>>> + } else {
>>>>>> + hv_iommu_det_dev_from_dom(hvdom, pdev);
>>>>>> + }
>>>>>> +
>>>>>> + hvdom->num_attchd--;
>>>>>
>>>>> Shouldn't this be modified iff the detach succeeded?
>>>>
>> We still want to free the domain and not let it get stuck. The purpose
>>>> is more to make sure detach was called before domain free.
>>>>
>>>
>>> How can one debug subsequent errors if num_attchd is decremented
>>> unconditionally? In reality the device is left attached, but the related
>>> kernel metadata is gone.
>>
>> An error is printed in case of a failed detach. If there is a panic, at
>> least you can get some info about the device. The metadata in the
>> hypervisor is still around if the detach failed.
>>
>
> With this approach the only thing left is a kernel message.
> But if the state is kept intact, one could collect a kernel core and
> analyze it.
Again, most of the Linux state is cleaned up; the only remaining state is in
the hypervisor, and the hypervisor can totally protect itself and the devices.
So there is not much in a kernel core, as it got cleaned up already.
Think of this as an additional check; we can remove it in the future after
it stands the test of time. Until then, every debugging bit helps.
> And note that there won't be a hypervisor core by default: our main
> context with the upstreamed version of the driver is L1VH, and a kernel
> core is the only thing a third party customer can provide for our
> analysis.
Wei can correct me, but we are not only l1vh focused here. There is
work going on on all fronts.
Thanks,
-Mukesh
> Thanks,
> Stanislav
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-27 19:21 ` Jacob Pan
2026-01-27 22:31 ` Jacob Pan
@ 2026-01-30 23:44 ` Mukesh R
1 sibling, 0 replies; 69+ messages in thread
From: Mukesh R @ 2026-01-30 23:44 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 1/27/26 11:21, Jacob Pan wrote:
> Hi Mukesh,
>
> On Fri, 23 Jan 2026 18:01:29 -0800
> Mukesh R <mrathor@linux.microsoft.com> wrote:
>
>> On 1/21/26 21:18, Jacob Pan wrote:
>>> Hi Mukesh,
>>>
>>> On Mon, 19 Jan 2026 22:42:27 -0800
>>> Mukesh R <mrathor@linux.microsoft.com> wrote:
>>>
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>
>>>> Add a new file to implement management of device domains, mapping
>>>> and unmapping of iommu memory, and other iommu_ops to fit within
>>>> the VFIO framework for PCI passthru on Hyper-V running Linux as
>>>> root or L1VH parent. This also implements direct attach mechanism
>>>> for PCI passthru, and it is also made to work within the VFIO
>>>> framework.
>>>>
>>>> At a high level, during boot the hypervisor creates a default
>>>> identity domain and attaches all devices to it. This nicely maps
>>>> to Linux iommu subsystem IOMMU_DOMAIN_IDENTITY domain. As a
>>>> result, Linux does not need to explicitly ask Hyper-V to attach
>>>> devices and do maps/unmaps during boot. As mentioned previously,
>>>> Hyper-V supports two ways to do PCI passthru:
>>>>
>>>> 1. Device Domain: root must create a device domain in the
>>>> hypervisor, and do map/unmap hypercalls for mapping and unmapping
>>>> guest RAM. All hypervisor communications use device id of type PCI
>>>> for identifying and referencing the device.
>>>>
>>>> 2. Direct Attach: the hypervisor will simply use the guest's HW
>>>> page table for mappings, thus the host need not do map/unmap
>>>> device memory hypercalls. As such, direct attach passthru
>>>> setup during guest boot is extremely fast. A direct attached device
>>>> must be referenced via logical device id and not via the PCI
>>>> device id.
>>>>
>>>> At present, L1VH root/parent only supports direct attaches. Also
>>>> direct attach is default in non-L1VH cases because there are some
>>>> significant performance issues with device domain implementation
>>>> currently for guests with higher RAM (say more than 8GB), and that
>>>> unfortunately cannot be addressed in the short term.
>>>>
>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>> ---
>>>> MAINTAINERS | 1 +
>>>> arch/x86/include/asm/mshyperv.h | 7 +-
>>>> arch/x86/kernel/pci-dma.c | 2 +
>>>> drivers/iommu/Makefile | 2 +-
>>>> drivers/iommu/hyperv-iommu.c | 876
>>>> ++++++++++++++++++++++++++++++++ include/linux/hyperv.h |
>>>> 6 + 6 files changed, 890 insertions(+), 4 deletions(-)
>>>> create mode 100644 drivers/iommu/hyperv-iommu.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 381a0e086382..63160cee942c 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -11741,6 +11741,7 @@ F: drivers/hid/hid-hyperv.c
>>>> F: drivers/hv/
>>>> F: drivers/infiniband/hw/mana/
>>>> F: drivers/input/serio/hyperv-keyboard.c
>>>> +F: drivers/iommu/hyperv-iommu.c
>>> Given we are also developing a guest iommu driver on hyperv, I
>>> think it is more clear to name them accordingly. Perhaps,
>>> hyperv-iommu-root.c?
>>
>> well, l1vh is not quite root, more like a parent. But we've been using
>> l1vh root loosely to mean l1vh parent. So it's probably ok to rename it
>> to hyperv-iommu-root.c. I prefer not calling it parent or something
>> like that.
> yeah, something specific and different than the guest driver will do.
>
>>>> F: drivers/iommu/hyperv-irq.c
>>>> F: drivers/net/ethernet/microsoft/
>>>> F: drivers/net/hyperv/
>>>> diff --git a/arch/x86/include/asm/mshyperv.h
>>>> b/arch/x86/include/asm/mshyperv.h index 97477c5a8487..e4ccdbbf1d12
>>>> 100644 --- a/arch/x86/include/asm/mshyperv.h
>>>> +++ b/arch/x86/include/asm/mshyperv.h
>>>> @@ -189,16 +189,17 @@ static inline void hv_apic_init(void) {}
>>>> #endif
>>>>
>>>> #if IS_ENABLED(CONFIG_HYPERV_IOMMU)
>>>> -static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>>>> -{ return false; } /* temporary */
>>>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
>>>> u64 hv_build_devid_oftype(struct pci_dev *pdev, enum
>>>> hv_device_type type); +u64 hv_iommu_get_curr_partid(void);
>>>> #else /* CONFIG_HYPERV_IOMMU */
>>>> static inline bool hv_pcidev_is_attached_dev(struct pci_dev
>>>> *pdev) { return false; }
>>>> -
>>>> static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
>>>> enum hv_device_type type)
>>>> { return 0; }
>>>> +static inline u64 hv_iommu_get_curr_partid(void)
>>>> +{ return HV_PARTITION_ID_INVALID; }
>>>>
>>>> #endif /* CONFIG_HYPERV_IOMMU */
>>>>
>>>> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
>>>> index 6267363e0189..cfeee6505e17 100644
>>>> --- a/arch/x86/kernel/pci-dma.c
>>>> +++ b/arch/x86/kernel/pci-dma.c
>>>> @@ -8,6 +8,7 @@
>>>> #include <linux/gfp.h>
>>>> #include <linux/pci.h>
>>>> #include <linux/amd-iommu.h>
>>>> +#include <linux/hyperv.h>
>>>>
>>>> #include <asm/proto.h>
>>>> #include <asm/dma.h>
>>>> @@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
>>>> gart_iommu_hole_init();
>>>> amd_iommu_detect();
>>>> detect_intel_iommu();
>>>> + hv_iommu_detect();
>>> Will this driver be x86 only?
>> Yes for now.
> If there is nothing x86 specific in this driver (assuming the
> hypercalls here are not x86 only), maybe you can move to the generic
> startup code.
It's x86 specific:
x86_init.iommu.iommu_init = hv_iommu_init
>>>> swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
>>>> }
>>>>
>>>> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
>>>> index 598c39558e7d..cc9774864b00 100644
>>>> --- a/drivers/iommu/Makefile
>>>> +++ b/drivers/iommu/Makefile
>>>> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>>>> obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>>>> obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>>>> obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>>>> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
>>>> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o hyperv-iommu.o
>>> DMA and IRQ remapping should be separate
>>
>> Not sure I follow.
> In IOMMU subsystem, DMA remapping and IRQ remapping can be turned
> on/off independently. e.g. you could have an option to turn on IRQ
> remapping w/o DMA remapping. But here you tied them together.
oh, you are talking about the config option, yeah, I will move
CONFIG_IRQ_REMAP from Kconfig to here.
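Splitting the two objects under separate config symbols, as discussed above, might look roughly like this in the Makefile (a sketch only; whether hyperv-irq.o ends up under CONFIG_IRQ_REMAP or a Hyper-V-specific symbol is up to the author):

```make
# DMA remapping and IRQ remapping selectable independently
obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
obj-$(CONFIG_IRQ_REMAP)    += hyperv-irq.o
```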
>>
>>>> obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>>>> obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
>>>> obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
>>>> diff --git a/drivers/iommu/hyperv-iommu.c
>>>> b/drivers/iommu/hyperv-iommu.c new file mode 100644
>>>> index 000000000000..548483fec6b1
>>>> --- /dev/null
>>>> +++ b/drivers/iommu/hyperv-iommu.c
>>>> @@ -0,0 +1,876 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Hyper-V root vIOMMU driver.
>>>> + * Copyright (C) 2026, Microsoft, Inc.
>>>> + */
>>>> +
>>>> +#include <linux/module.h>
>>> I don't think this is needed since this driver cannot be a module
>>>
>>>> +#include <linux/pci.h>
>>>> +#include <linux/dmar.h>
>>> should not depend on Intel's DMAR
>>>
>>>> +#include <linux/dma-map-ops.h>
>>>> +#include <linux/interval_tree.h>
>>>> +#include <linux/hyperv.h>
>>>> +#include "dma-iommu.h"
>>>> +#include <asm/iommu.h>
>>>> +#include <asm/mshyperv.h>
>>>> +
>>>> +/* We will not claim these PCI devices, eg hypervisor needs it for
>>>> debugger */ +static char *pci_devs_to_skip;
>>>> +static int __init hv_iommu_setup_skip(char *str)
>>>> +{
>>>> + pci_devs_to_skip = str;
>>>> +
>>>> + return 0;
>>>> +}
>>>> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
>>>> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
>>>> +
>>>> +bool hv_no_attdev; /* disable direct device attach for
>>>> passthru */ +EXPORT_SYMBOL_GPL(hv_no_attdev);
>>>> +static int __init setup_hv_no_attdev(char *str)
>>>> +{
>>>> + hv_no_attdev = true;
>>>> + return 0;
>>>> +}
>>>> +__setup("hv_no_attdev", setup_hv_no_attdev);
>>>> +
>>>> +/* Iommu device that we export to the world. HyperV supports max
>>>> of one */ +static struct iommu_device hv_virt_iommu;
>>>> +
>>>> +struct hv_domain {
>>>> + struct iommu_domain iommu_dom;
>>>> + u32 domid_num; /* as opposed
>>>> to domain_id.type */
>>>> + u32 num_attchd; /* number of
>>>> currently attached devices */
>>> rename to num_dev_attached?
>>>
>>>> + bool attached_dom; /* is this direct
>>>> attached dom? */
>>>> + spinlock_t mappings_lock; /* protects
>>>> mappings_tree */
>>>> + struct rb_root_cached mappings_tree; /* iova to pa lookup
>>>> tree */ +};
>>>> +
>>>> +#define to_hv_domain(d) container_of(d, struct hv_domain,
>>>> iommu_dom) +
>>>> +struct hv_iommu_mapping {
>>>> + phys_addr_t paddr;
>>>> + struct interval_tree_node iova;
>>>> + u32 flags;
>>>> +};
>>>> +
>>>> +/*
>>>> + * By default, during boot the hypervisor creates one Stage 2 (S2)
>>>> default
>>>> + * domain. Stage 2 means that the page table is controlled by the
>>>> hypervisor.
>>>> + * S2 default: access to entire root partition memory. This for
>>>> us easily
>>>> + * maps to IOMMU_DOMAIN_IDENTITY in the iommu
>>>> subsystem, and
>>>> + * is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the
>>>> hypervisor.
>>>> + *
>>>> + * Device Management:
>>>> + * There are two ways to manage device attaches to domains:
>>>> + * 1. Domain Attach: A device domain is created in the
>>>> hypervisor, the
>>>> + * device is attached to this domain, and
>>>> then memory
>>>> + * ranges are mapped in the map callbacks.
>>>> + * 2. Direct Attach: No need to create a domain in the
>>>> hypervisor for direct
>>>> + * attached devices. A hypercall is made
>>>> to tell the
>>>> + * hypervisor to attach the device to a
>>>> guest. There is
>>>> + * no need for explicit memory mappings
>>>> because the
>>>> + * hypervisor will just use the guest HW
>>>> page table.
>>>> + *
>>>> + * Since a direct attach is much faster, it is the default. This
>>>> can be
>>>> + * changed via hv_no_attdev.
>>>> + *
>>>> + * L1VH: hypervisor only supports direct attach.
>>>> + */
>>>> +
>>>> +/*
>>>> + * Create dummy domain to correspond to hypervisor prebuilt
>>>> default identity
>>>> + * domain (dummy because we do not make hypercall to create them).
>>>> + */
>>>> +static struct hv_domain hv_def_identity_dom;
>>>> +
>>>> +static bool hv_special_domain(struct hv_domain *hvdom)
>>>> +{
>>>> + return hvdom == &hv_def_identity_dom;
>>>> +}
>>>> +
>>>> +struct iommu_domain_geometry default_geometry = (struct
>>>> iommu_domain_geometry) {
>>>> + .aperture_start = 0,
>>>> + .aperture_end = -1UL,
>>>> + .force_aperture = true,
>>>> +};
>>>> +
>>>> +/*
>>>> + * Since the relevant hypercalls can only fit less than 512 PFNs
>>>> in the pfn
>>>> + * array, report 1M max.
>>>> + */
>>>> +#define HV_IOMMU_PGSIZES (SZ_4K | SZ_1M)
>>>> +
>>>> +static u32 unique_id; /* unique numeric id of a new
>>>> domain */ +
>>>> +static void hv_iommu_detach_dev(struct iommu_domain *immdom,
>>>> + struct device *dev);
>>>> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom,
>>>> ulong iova,
>>>> + size_t pgsize, size_t pgcount,
>>>> + struct iommu_iotlb_gather
>>>> *gather); +
>>>> +/*
>>>> + * If the current thread is a VMM thread, return the partition id
>>>> of the VM it
>>>> + * is managing, else return HV_PARTITION_ID_INVALID.
>>>> + */
>>>> +u64 hv_iommu_get_curr_partid(void)
>>>> +{
>>>> + u64 (*fn)(pid_t pid);
>>>> + u64 partid;
>>>> +
>>>> + fn = symbol_get(mshv_pid_to_partid);
>>>> + if (!fn)
>>>> + return HV_PARTITION_ID_INVALID;
>>>> +
>>>> + partid = fn(current->tgid);
>>>> + symbol_put(mshv_pid_to_partid);
>>>> +
>>>> + return partid;
>>>> +}
>>> This function is not iommu specific. Maybe move it to mshv code?
>>
>> Well, it is getting the information from mshv by calling a function
>> there on behalf of the iommu, and is not needed without
>> CONFIG_HYPERV_IOMMU. So this is probably the best place for it.
>>
> ok, maybe move it to mshv after we have a second user. But the function
> name can be just hv_get_curr_partid(void), no?
It could, but by convention all public funcs here are hv_iommu_xxx,
and other reviewers might object... We really need a virt/mshv/
subdirectory; not sure it's worth creating now for just one
function. So maybe we just live with it for now. We do have a work
item to move some things from drivers/hv to virt/mshv/, so this
can get added to that whenever it happens.
>>>> +
>>>> +/* If this is a VMM thread, then this domain is for a guest VM */
>>>> +static bool hv_curr_thread_is_vmm(void)
>>>> +{
>>>> + return hv_iommu_get_curr_partid() !=
>>>> HV_PARTITION_ID_INVALID; +}
>>>> +
>>>> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap
>>>> cap) +{
>>>> + switch (cap) {
>>>> + case IOMMU_CAP_CACHE_COHERENCY:
>>>> + return true;
>>>> + default:
>>>> + return false;
>>>> + }
>>>> + return false;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Check if given pci device is a direct attached device. Caller
>>>> must have
>>>> + * verified pdev is a valid pci device.
>>>> + */
>>>> +bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
>>>> +{
>>>> + struct iommu_domain *iommu_domain;
>>>> + struct hv_domain *hvdom;
>>>> + struct device *dev = &pdev->dev;
>>>> +
>>>> + iommu_domain = iommu_get_domain_for_dev(dev);
>>>> + if (iommu_domain) {
>>>> + hvdom = to_hv_domain(iommu_domain);
>>>> + return hvdom->attached_dom;
>>>> + }
>>>> +
>>>> + return false;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
>>> Attached domain can change anytime, what guarantee does the caller
>>> have?
>>
>> Not sure I understand what can change: the device moving from attached
>> to non-attached? or the domain getting deleted? In any case, this is
>> called from leaf functions, so that should not happen... and it
>> will return false if the device did somehow get removed.
>>
> I was thinking the device can be attached to a different domain type at
> runtime, e.g. via sysfs to identity or DMA. But I guess here is a static
> attachment either for l1vh or root.
That is correct. Supporting that would be extra work; it can be added
if there is a good use case/demand.
>>>> +
>>>> +/* Create a new device domain in the hypervisor */
>>>> +static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
>>>> +{
>>>> + u64 status;
>>>> + unsigned long flags;
>>>> + struct hv_input_device_domain *ddp;
>>>> + struct hv_input_create_device_domain *input;
>>> nit: use consistent coding style, inverse Christmas tree.
>>>
>>>> +
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + memset(input, 0, sizeof(*input));
>>>> +
>>>> + ddp = &input->device_domain;
>>>> + ddp->partition_id = HV_PARTITION_ID_SELF;
>>>> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> + ddp->domain_id.id = hvdom->domid_num;
>>>> +
>>>> +
>>>> input->create_device_domain_flags.forward_progress_required = 1;
>>>> + input->create_device_domain_flags.inherit_owning_vtl = 0;
>>>> +
>>>> + status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN,
>>>> input, NULL); +
>>>> + local_irq_restore(flags);
>>>> +
>>>> + if (!hv_result_success(status))
>>>> + hv_status_err(status, "\n");
>>>> +
>>>> + return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +/* During boot, all devices are attached to this */
>>>> +static struct iommu_domain *hv_iommu_domain_alloc_identity(struct
>>>> device *dev) +{
>>>> + return &hv_def_identity_dom.iommu_dom;
>>>> +}
>>>> +
>>>> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
>>>> device *dev) +{
>>>> + struct hv_domain *hvdom;
>>>> + int rc;
>>>> +
>>>> + if (hv_l1vh_partition() && !hv_curr_thread_is_vmm() &&
>>>> !hv_no_attdev) {
>>>> + pr_err("Hyper-V: l1vh iommu does not support host
>>>> devices\n");
>>> why is this an error if user input choose not to do direct attach?
>>
>> Like the error message says: on l1vh, direct attaches of host devices
>> (e.g. dpdk) are not supported, and l1vh only does direct attaches. IOW,
>> no host devices on l1vh.
>>
> This hv_no_attdev flag is really confusing to me, by default
> hv_no_attdev is false, which allows direct attach. And you are saying
> l1vh allows it.
Well, at the time of this design/coding, my understanding was we'd have
mapped devices on l1vh also. But now it looks like that would be a bit
later than sooner .. unless AI bots start dumping code of course :) :)..
I could remove it from the if statement and add it back when the support
is added, but it is harmless and one less thing to remember.
> Why does this flag also control host device attachment in l1vh? If you
> can tell the difference between direct host device attach and other
> direct attach, why don't you always reject host attach in l1vh?
>
>>>> + return NULL;
>>>> + }
>>>> +
>>>> + hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
>>>> + if (hvdom == NULL)
>>>> + goto out;
>>>> +
>>>> + spin_lock_init(&hvdom->mappings_lock);
>>>> + hvdom->mappings_tree = RB_ROOT_CACHED;
>>>> +
>>>> + if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_DEFAULT) /*
>>>> ie, 0 */
>>> This is true only when unique_id wraps around, right? Then this
>>> driver stops working?
>>
>> Correct. It's a u32, so if my math is right, and a device is attached
>> every second, it will take 136 years to wrap! Did I get that right?
>>
> This is still an unnecessary vulnerability.
Device passthru will fail, but it will not cause any corruption or
data theft issues... I can make it u64 if that gives extra peace of
mind; not worth all that machinery for a case that will almost never
happen.
>>> can you use an IDR for the unique_id and free it as you detach
>>> instead of doing this cyclic allocation?
>>>
>>>> + goto out_free;
>>>> +
>>>> + hvdom->domid_num = unique_id;
>>>> + hvdom->iommu_dom.geometry = default_geometry;
>>>> + hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
>>>> +
>>>> + /* For guests, by default we do direct attaches, so no
>>>> domain in hyp */
>>>> + if (hv_curr_thread_is_vmm() && !hv_no_attdev)
>>>> + hvdom->attached_dom = true;
>>>> + else {
>>>> + rc = hv_iommu_create_hyp_devdom(hvdom);
>>>> + if (rc)
>>>> + goto out_free_id;
>>>> + }
>>>> +
>>>> + return &hvdom->iommu_dom;
>>>> +
>>>> +out_free_id:
>>>> + unique_id--;
>>>> +out_free:
>>>> + kfree(hvdom);
>>>> +out:
>>>> + return NULL;
>>>> +}
>>>> +
>>>> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
>>>> +{
>>>> + struct hv_domain *hvdom = to_hv_domain(immdom);
>>>> + unsigned long flags;
>>>> + u64 status;
>>>> + struct hv_input_delete_device_domain *input;
>>>> +
>>>> + if (hv_special_domain(hvdom))
>>>> + return;
>>>> +
>>>> + if (hvdom->num_attchd) {
>>>> + pr_err("Hyper-V: can't free busy iommu domain
>>>> (%p)\n", immdom);
>>>> + return;
>>>> + }
>>>> +
>>>> + if (!hv_curr_thread_is_vmm() || hv_no_attdev) {
>>>> + struct hv_input_device_domain *ddp;
>>>> +
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + ddp = &input->device_domain;
>>>> + memset(input, 0, sizeof(*input));
>>>> +
>>>> + ddp->partition_id = HV_PARTITION_ID_SELF;
>>>> + ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
>>>> + ddp->domain_id.id = hvdom->domid_num;
>>>> +
>>>> + status =
>>>> hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
>>>> + NULL);
>>>> + local_irq_restore(flags);
>>>> +
>>>> + if (!hv_result_success(status))
>>>> + hv_status_err(status, "\n");
>>>> + }
>>
>>> you could free the domid here, no?
>> sorry, don't follow what you mean by domid, you mean unique_id?
>>
> yes.
No, it's just a sequential counter; nothing tracks which ids are in
use, so there is nothing to free here.
>>>> +
>>>> + kfree(hvdom);
>>>> +}
>>>> +
>>>> +/* Attach a device to a domain previously created in the
>>>> hypervisor */ +static int hv_iommu_att_dev2dom(struct hv_domain
>>>> *hvdom, struct pci_dev *pdev) +{
>>>> + unsigned long flags;
>>>> + u64 status;
>>>> + enum hv_device_type dev_type;
>>>> + struct hv_input_attach_device_domain *input;
>>>> +
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + memset(input, 0, sizeof(*input));
>>>> +
>>>> + input->device_domain.partition_id = HV_PARTITION_ID_SELF;
>>>> + input->device_domain.domain_id.type =
>>>> HV_DEVICE_DOMAIN_TYPE_S2;
>>>> + input->device_domain.domain_id.id = hvdom->domid_num;
>>>> +
>>>> + /* NB: Upon guest shutdown, device is re-attached to the
>>>> default domain
>>>> + * without explicit detach.
>>>> + */
>>>> + if (hv_l1vh_partition())
>>>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>>>> + else
>>>> + dev_type = HV_DEVICE_TYPE_PCI;
>>>> +
>>>> + input->device_id.as_uint64 = hv_build_devid_oftype(pdev,
>>>> dev_type); +
>>>> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN,
>>>> input, NULL);
>>>> + local_irq_restore(flags);
>>>> +
>>>> + if (!hv_result_success(status))
>>>> + hv_status_err(status, "\n");
>>>> +
>>>> + return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +/* Caller must have validated that dev is a valid pci dev */
>>>> +static int hv_iommu_direct_attach_device(struct pci_dev *pdev)
>>>> +{
>>>> + struct hv_input_attach_device *input;
>>>> + u64 status;
>>>> + int rc;
>>>> + unsigned long flags;
>>>> + union hv_device_id host_devid;
>>>> + enum hv_device_type dev_type;
>>>> + u64 ptid = hv_iommu_get_curr_partid();
>>>> +
>>>> + if (ptid == HV_PARTITION_ID_INVALID) {
>>>> + pr_err("Hyper-V: Invalid partition id in direct
>>>> attach\n");
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> + if (hv_l1vh_partition())
>>>> + dev_type = HV_DEVICE_TYPE_LOGICAL;
>>>> + else
>>>> + dev_type = HV_DEVICE_TYPE_PCI;
>>>> +
>>>> + host_devid.as_uint64 = hv_build_devid_oftype(pdev,
>>>> dev_type); +
>>>> + do {
>>>> + local_irq_save(flags);
>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + memset(input, 0, sizeof(*input));
>>>> + input->partition_id = ptid;
>>>> + input->device_id = host_devid;
>>>> +
>>>> + /* Hypervisor associates logical_id with this
>>>> device, and in
>>>> + * some hypercalls like retarget interrupts,
>>>> logical_id must be
>>>> + * used instead of the BDF. It is a required
>>>> parameter.
>>>> + */
>>>> + input->attdev_flags.logical_id = 1;
>>>> + input->logical_devid =
>>>> + hv_build_devid_oftype(pdev,
>>>> HV_DEVICE_TYPE_LOGICAL); +
>>>> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE,
>>>> input, NULL);
>>>> + local_irq_restore(flags);
>>>> +
>>>> + if (hv_result(status) ==
>>>> HV_STATUS_INSUFFICIENT_MEMORY) {
>>>> + rc = hv_call_deposit_pages(NUMA_NO_NODE,
>>>> ptid, 1);
>>>> + if (rc)
>>>> + break;
>>>> + }
>>>> + } while (hv_result(status) ==
>>>> HV_STATUS_INSUFFICIENT_MEMORY); +
>>>> + if (!hv_result_success(status))
>>>> + hv_status_err(status, "\n");
>>>> +
>>>> + return hv_result_to_errno(status);
>>>> +}
>>>> +
>>>> +/* This to attach a device to both host app (like DPDK) and a
>>>> guest VM */
>>> The IOMMU driver should be agnostic to the type of consumer,
>>> whether a userspace driver or a VM. This comment is not necessary.
>>>
>>>> +static int hv_iommu_attach_dev(struct iommu_domain *immdom,
>>>> struct device *dev,
>>>> + struct iommu_domain *old)
>>> This does not match upstream kernel prototype, which kernel version
>>> is this based on? I will stop here for now.
>>
>> As I mentioned in the cover letter:
>> Based on: 8f0b4cce4481 (origin/hyperv-next)
>>
> where is this repo?
https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
branch: hyperv-next
All our hyperv/mshv related patch submissions are merged there
first by Wei.
Thanks,
-Mukesh
.. deleted ......
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu
2026-01-30 22:51 ` Mukesh R
@ 2026-02-02 16:20 ` Stanislav Kinsburskii
0 siblings, 0 replies; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-02-02 16:20 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Fri, Jan 30, 2026 at 02:51:19PM -0800, Mukesh R wrote:
> On 1/27/26 10:46, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 07:02:29PM -0800, Mukesh R wrote:
> > > On 1/26/26 07:57, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 23, 2026 at 05:26:19PM -0800, Mukesh R wrote:
> > > > > On 1/20/26 16:12, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 19, 2026 at 10:42:27PM -0800, Mukesh R wrote:
> > > > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > >
> > > > > > > Add a new file to implement management of device domains, mapping and
> > > > > > > unmapping of iommu memory, and other iommu_ops to fit within the VFIO
> > > > > > > framework for PCI passthru on Hyper-V running Linux as root or L1VH
> > > > > > > parent. This also implements direct attach mechanism for PCI passthru,
> > > > > > > and it is also made to work within the VFIO framework.
> > > > > > >
> > > > > > > At a high level, during boot the hypervisor creates a default identity
> > > > > > > domain and attaches all devices to it. This nicely maps to Linux iommu
> > > > > > > subsystem IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
> > > > > > > need to explicitly ask Hyper-V to attach devices and do maps/unmaps
> > > > > > > during boot. As mentioned previously, Hyper-V supports two ways to do
> > > > > > > PCI passthru:
> > > > > > >
> > > > > > > 1. Device Domain: root must create a device domain in the hypervisor,
> > > > > > > and do map/unmap hypercalls for mapping and unmapping guest RAM.
> > > > > > > All hypervisor communications use device id of type PCI for
> > > > > > > identifying and referencing the device.
> > > > > > >
> > > > > > > 2. Direct Attach: the hypervisor will simply use the guest's HW
> > > > > > > page table for mappings, thus the host need not do map/unmap
> > > > > > > device memory hypercalls. As such, direct attach passthru setup
> > > > > > > during guest boot is extremely fast. A direct attached device
> > > > > > > must be referenced via logical device id and not via the PCI
> > > > > > > device id.
> > > > > > >
> > > > > > > At present, L1VH root/parent only supports direct attaches. Also direct
> > > > > > > attach is default in non-L1VH cases because there are some significant
> > > > > > > performance issues with device domain implementation currently for guests
> > > > > > > with higher RAM (say more than 8GB), and that unfortunately cannot be
> > > > > > > addressed in the short term.
> > > > > > >
> > > > > >
> > > > > > <snip>
> > > > > >
> > > >
> > > > <snip>
> > > >
> > > > > > > +static void hv_iommu_detach_dev(struct iommu_domain *immdom, struct device *dev)
> > > > > > > +{
> > > > > > > + struct pci_dev *pdev;
> > > > > > > + struct hv_domain *hvdom = to_hv_domain(immdom);
> > > > > > > +
> > > > > > > + /* See the attach function, only PCI devices for now */
> > > > > > > + if (!dev_is_pci(dev))
> > > > > > > + return;
> > > > > > > +
> > > > > > > + if (hvdom->num_attchd == 0)
> > > > > > > + pr_warn("Hyper-V: num_attchd is zero (%s)\n", dev_name(dev));
> > > > > > > +
> > > > > > > + pdev = to_pci_dev(dev);
> > > > > > > +
> > > > > > > + if (hvdom->attached_dom) {
> > > > > > > + hv_iommu_det_dev_from_guest(hvdom, pdev);
> > > > > > > +
> > > > > > > + /* Do not reset attached_dom, hv_iommu_unmap_pages happens
> > > > > > > + * next.
> > > > > > > + */
> > > > > > > + } else {
> > > > > > > + hv_iommu_det_dev_from_dom(hvdom, pdev);
> > > > > > > + }
> > > > > > > +
> > > > > > > + hvdom->num_attchd--;
> > > > > >
> > > > > > Shouldn't this be modified iff the detach succeeded?
> > > > >
> > > > > We want to still free the domain and not let it get stuck. The purpose
> > > > > is more to make sure detach was called before domain free.
> > > > >
> > > >
> > > > How can one debug subsequent errors if num_attchd is decremented
> > > > unconditionally? In reality the device is left attached, but the related
> > > > kernel metadata is gone.
> > >
> > > Error is printed in case of failed detach. If there is panic, at least
> > > you can get some info about the device. Metadata in hypervisor is
> > > around if failed.
> > >
> >
> > With this approach the only thing left is a kernel message.
> > But if the state is kept intact, one could collect a kernel core and
> > analyze it.
>
> Again, most of the Linux state is cleaned up; the only state is in the
> hypervisor, and the hypervisor can totally protect itself and devices.
> So there is not much in the kernel core as it got cleaned up already.
> Think of this as an additional check; we can remove it in the future
> after it stands the test of time. Until then, every debugging bit helps.
>
Again, the hypervisor state is not accessible from the kernel core in
L1VH.
> > And note, that there won't be a hypervisor core by default: our main
> > context with the usptreamed version of the driver is L1VH and a kernel
> > core is the only thing a third party customer can provide for our
> > analysis.
>
> Wei can correct me, but we are not only l1vh focused here. There is
> work going on on all fronts.
>
In upstream, it does not matter what the work is focused on. The only
thing that matters is that the functionality is merged and available.
Once the code is merged upstream, it becomes available to third-party
customers. They can use it in any way they see fit.
The only product we support with upstream code is L1VH. We should keep
it crisp, clear, and easy to debug.
This change does not help debugging. It only sweeps the issue under the
carpet, with no justification.
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
> > Thanks,
> > Stanislav
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-01-30 22:17 ` Mukesh R
@ 2026-02-02 16:30 ` Stanislav Kinsburskii
2026-02-04 22:52 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-02-02 16:30 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On Fri, Jan 30, 2026 at 02:17:24PM -0800, Mukesh R wrote:
> On 1/27/26 10:57, Stanislav Kinsburskii wrote:
> > On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
> > > On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> > > > > On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > >
> > > > > > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > > > > > generates an unmapped gpa intercept. In this path, lookup the PCI
> > > > > > > resource pfn for the guest gpa, and ask the hypervisor to map it
> > > > > > > via hypercall. The PCI resource pfn is maintained by the VFIO driver,
> > > > > > > and obtained via fixup_user_fault call (similar to KVM).
> > > > > > >
> > > > > > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > ---
> > > > > > > drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > > > > > 1 file changed, 115 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > > > > > };
> > > > > > > } __packed;
> > > > > > > +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> > > > > > > +static int __init setup_hv_full_mmio(char *str)
> > > > > > > +{
> > > > > > > + hv_nofull_mmio = true;
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > > > > > +
> > > > > > > struct mshv_root mshv_root;
> > > > > > > enum hv_scheduler_type hv_scheduler_type;
> > > > > > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > > > > }
> > > > > > > #ifdef CONFIG_X86_64
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > > > > > + * else just return -errno.
> > > > > > > + */
> > > > > > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > > > > > + u64 *mmio_pfnp)
> > > > > > > +{
> > > > > > > + struct vm_area_struct *vma;
> > > > > > > + bool is_mmio;
> > > > > > > + u64 uaddr;
> > > > > > > + struct mshv_mem_region *mreg;
> > > > > > > + struct follow_pfnmap_args pfnmap_args;
> > > > > > > + int rc = -EINVAL;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Do not allow mem region to be deleted beneath us. VFIO uses
> > > > > > > + * useraddr vma to lookup pci bar pfn.
> > > > > > > + */
> > > > > > > + spin_lock(&pt->pt_mem_regions_lock);
> > > > > > > +
> > > > > > > + /* Get the region again under the lock */
> > > > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > + goto unlock_pt_out;
> > > > > > > +
> > > > > > > + uaddr = mreg->start_uaddr +
> > > > > > > + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > > > > > +
> > > > > > > + mmap_read_lock(current->mm);
> > > > > >
> > > > > > Semaphore can't be taken under spinlock.
> > > >
> > > > >
> > > > > Yeah, something didn't feel right here and I meant to recheck; now I regret
> > > > > rushing to submit the patch.
> > > > >
> > > > > Rethinking, I think the pt_mem_regions_lock is not needed to protect
> > > > > the uaddr because unmap will properly serialize via the mm lock.
> > > > >
> > > > >
> > > > > > > + vma = vma_lookup(current->mm, uaddr);
> > > > > > > + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> > > > > >
> > > > > > Why this check is needed again?
> > > > >
> > > > > To make sure region did not change. This check is under lock.
> > > > >
> > > >
> > > > How can this happen? One can't change VMA type without unmapping it
> > > > first. And unmapping it leads to a kernel MMIO region state dangling
> > > > around without corresponding user space mapping.
> > >
> > > Right, and vm_flags would not be mmio expected then.
> > >
> > > > This is similar to dangling pinned regions and should likely be
> > > > addressed the same way by utilizing MMU notifiers to destroy memory
> > > > regions if the VMA is detached.
> > >
> > > I don't think we need that. Either it succeeds if the region did not
> > > change at all, or just fails.
> > >
> >
> > I'm afraid we do: if the driver mapped a page with the previous
> > memory region, and then the region is unmapped, the page will stay
> > mapped in the hypervisor but will be considered free by the kernel,
> > which in turn will lead to a GPF upon the next allocation.
>
> There are no ram pages for mmio regions. Also, we don't do much with
> mmio regions other than tell the hyp about them.
>
So, are you saying that the hypervisor does not use these pages and only
tracks them? That would make things easier.
However, if we later try to map a GPA that is already mapped, will the
hypervisor return an error?
Thanks,
Stanislav
> Thanks,
> -Mukesh
>
>
> > With pinned regions the issue is similar but less impactful: pages can't
> > be released by user space unmapping and thus will simply be leaked, but
> > the system stays intact.
> >
> > MMIO regions are similar to movable regions in this regard: they don't
> > reference the user pages, and thus this guest region replacement is a
> > straight path to a kernel panic.
> >
> > >
> > > > > > The region type is stored on the region itself.
> > > > > > And the type is checked on the caller side.
> > > > > >
> > > > > > > + if (!is_mmio)
> > > > > > > + goto unlock_mmap_out;
> > > > > > > +
> > > > > > > + pfnmap_args.vma = vma;
> > > > > > > + pfnmap_args.address = uaddr;
> > > > > > > +
> > > > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > + if (rc) {
> > > > > > > + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > > > > > + NULL);
> > > > > > > + if (rc)
> > > > > > > + goto unlock_mmap_out;
> > > > > > > +
> > > > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > + if (rc)
> > > > > > > + goto unlock_mmap_out;
> > > > > > > + }
> > > > > > > +
> > > > > > > + *mmio_pfnp = pfnmap_args.pfn;
> > > > > > > + follow_pfnmap_end(&pfnmap_args);
> > > > > > > +
> > > > > > > +unlock_mmap_out:
> > > > > > > + mmap_read_unlock(current->mm);
> > > > > > > +unlock_pt_out:
> > > > > > > + spin_unlock(&pt->pt_mem_regions_lock);
> > > > > > > + return rc;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > > > > > + * and resolve if possible.
> > > > > > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > > > > > + */
> > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > > > > > +{
> > > > > > > + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > > > > > + struct hv_x64_memory_intercept_message *msg;
> > > > > > > + union hv_x64_memory_access_info accinfo;
> > > > > > > + u64 gfn, mmio_spa, numpgs;
> > > > > > > + struct mshv_mem_region *mreg;
> > > > > > > + int rc;
> > > > > > > + struct mshv_partition *pt = vp->vp_partition;
> > > > > > > +
> > > > > > > + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > > > > > + accinfo = msg->memory_access_info;
> > > > > > > +
> > > > > > > + if (!accinfo.gva_gpa_valid)
> > > > > > > + return false;
> > > > > > > +
> > > > > > > + /* Do a fast check and bail if non mmio intercept */
> > > > > > > + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > >
> > > > > > This call needs to be protected by the spinlock.
> > > > >
> > > > > This is sorta fast path to bail. We recheck under partition lock above.
> > > > >
> > > >
> > > > Accessing the list of regions without lock is unsafe.
> > >
> > > I am not sure why? This check is done by a vcpu thread, so regions
> > > will not have just gone away.
> > >
> >
> > These are shared resources. Multiple VP threads get into this function
> > simultaneously, so there is a race already. But this one we can live
> > with without locking as they don't mutate the list of the regions.
> >
> > The issue happens when the VMM adds or removes another region as it mutates
> > the list and races with VP threads doing this lookup.
> >
> > Thanks,
> > Stanislav
> >
> >
> > > Thanks,
> > > -Mukesh
> > >
> > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > > Thanks,
> > > > > -Mukesh
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > + return false;
> > > > > > > +
> > > > > > > + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > > > > > + if (rc)
> > > > > > > + return false;
> > > > > > > +
> > > > > > > + if (!hv_nofull_mmio) { /* default case */
> > > > > > > + gfn = mreg->start_gfn;
> > > > > > > + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > > > > > + numpgs = mreg->nr_pages;
> > > > > > > + } else
> > > > > > > + numpgs = 1;
> > > > > > > +
> > > > > > > + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > > > > > +
> > > > > > > + return rc == 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > static struct mshv_mem_region *
> > > > > > > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > > > > {
> > > > > > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > > > > return ret;
> > > > > > > }
> > > > > > > +
> > > > > > > #else /* CONFIG_X86_64 */
> > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > > > > > static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > > > > #endif /* CONFIG_X86_64 */
> > > > > > > static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > > > > {
> > > > > > > switch (vp->vp_intercept_msg_page->header.message_type) {
> > > > > > > + case HVMSG_UNMAPPED_GPA:
> > > > > > > + return mshv_handle_unmapped_gpa(vp);
> > > > > > > case HVMSG_GPA_INTERCEPT:
> > > > > > > return mshv_handle_gpa_intercept(vp);
> > > > > > > }
> > > > > > > --
> > > > > > > 2.51.2.vfs.0.1
> > > > > > >
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-02-02 16:30 ` Stanislav Kinsburskii
@ 2026-02-04 22:52 ` Mukesh R
2026-02-05 16:28 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-02-04 22:52 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 2/2/26 08:30, Stanislav Kinsburskii wrote:
> On Fri, Jan 30, 2026 at 02:17:24PM -0800, Mukesh R wrote:
>> On 1/27/26 10:57, Stanislav Kinsburskii wrote:
>>> On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
>>>> On 1/26/26 10:15, Stanislav Kinsburskii wrote:
>>>>> On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
>>>>>> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
>>>>>>> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>>>>>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>>>
>>>>>>>> Upon guest access, in case of missing mmio mapping, the hypervisor
>>>>>>>> generates an unmapped gpa intercept. In this path, lookup the PCI
>>>>>>>> resource pfn for the guest gpa, and ask the hypervisor to map it
>>>>>>>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>>>>>>>> and obtained via fixup_user_fault call (similar to KVM).
>>>>>>>>
>>>>>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>>> ---
>>>>>>>> drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>>>>>>>> 1 file changed, 115 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>>>>>> index 03f3aa9f5541..4c8bc7cd0888 100644
>>>>>>>> --- a/drivers/hv/mshv_root_main.c
>>>>>>>> +++ b/drivers/hv/mshv_root_main.c
>>>>>>>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>>>>>>>> };
>>>>>>>> } __packed;
>>>>>>>> +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
>>>>>>>> +static int __init setup_hv_full_mmio(char *str)
>>>>>>>> +{
>>>>>>>> + hv_nofull_mmio = true;
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>>>>>>>> +
>>>>>>>> struct mshv_root mshv_root;
>>>>>>>> enum hv_scheduler_type hv_scheduler_type;
>>>>>>>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>>>>>>>> }
>>>>>>>> #ifdef CONFIG_X86_64
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>>>>>>>> + * else just return -errno.
>>>>>>>> + */
>>>>>>>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>>>>>>>> + u64 *mmio_pfnp)
>>>>>>>> +{
>>>>>>>> + struct vm_area_struct *vma;
>>>>>>>> + bool is_mmio;
>>>>>>>> + u64 uaddr;
>>>>>>>> + struct mshv_mem_region *mreg;
>>>>>>>> + struct follow_pfnmap_args pfnmap_args;
>>>>>>>> + int rc = -EINVAL;
>>>>>>>> +
>>>>>>>> + /*
>>>>>>>> + * Do not allow mem region to be deleted beneath us. VFIO uses
>>>>>>>> + * useraddr vma to lookup pci bar pfn.
>>>>>>>> + */
>>>>>>>> + spin_lock(&pt->pt_mem_regions_lock);
>>>>>>>> +
>>>>>>>> + /* Get the region again under the lock */
>>>>>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>>>> + goto unlock_pt_out;
>>>>>>>> +
>>>>>>>> + uaddr = mreg->start_uaddr +
>>>>>>>> + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>>>>>>>> +
>>>>>>>> + mmap_read_lock(current->mm);
>>>>>>>
>>>>>>> Semaphore can't be taken under spinlock.
>>>>>
>>>>>>
>>>>>> Yeah, something didn't feel right here and I meant to recheck, now regret
>>>>>> rushing to submit the patch.
>>>>>>
>>>>>> Rethinking, I think the pt_mem_regions_lock is not needed to protect
>>>>>> the uaddr because unmap will properly serialize via the mm lock.
>>>>>>
>>>>>>
>>>>>>>> + vma = vma_lookup(current->mm, uaddr);
>>>>>>>> + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>>>>>>>
>>>>>>> Why this check is needed again?
>>>>>>
>>>>>> To make sure region did not change. This check is under lock.
>>>>>>
>>>>>
>>>>> How can this happen? One can't change VMA type without unmapping it
>>>>> first. And unmapping it leads to a kernel MMIO region state dangling
>>>>> around without corresponding user space mapping.
>>>>
>>>> Right, and vm_flags would not be mmio expected then.
>>>>
>>>>> This is similar to dangling pinned regions and should likely be
>>>>> addressed the same way by utilizing MMU notifiers to destroy memory
>>>>> regions if the VMA is detached.
>>>>
>>>> I don't think we need that. Either it succeeds if the region did not
>>>> change at all, or just fails.
>>>>
>>>
>>> I'm afraid we do, as if the driver mapped a page with the previous
>>> memory region, and then the region is unmapped, the page will stay
>>> mapped in the hypervisor, but will be considered free by kernel, which
>>> in turn will lead to GPF upon next allocation.
>>
>> There are no ram pages for mmio regions. Also, we don't do much with
>> mmio regions other than tell the hyp about it.
>>
>
> So, are you saying that the hypervisor does not use these pages and only
> tracks them? That would make things easier.
> However, if we later try to map a GPA that is already mapped, will the
> hypervisor return an error?
Hypervisor does not return an error.
> Thanks,
> Stanislav
>
>> Thanks,
>> -Mukesh
>>
>>
>>> With pinned regions the issue is similar but less impacting: pages can't
>>> be released by user space unmapping and thus will be simply leaked, but
>>> the system stays intact.
>>>
>>> MMIO regions are similar to movable regions in this regard: they don't
>>> reference the user pages, and thus this guest region replacement is a
>>> straight way to kernel panic.
>>>
>>>>
>>>>>>> The region type is stored on the region itself.
>>>>>>> And the type is checked on the caller side.
>>>>>>>
>>>>>>>> + if (!is_mmio)
>>>>>>>> + goto unlock_mmap_out;
>>>>>>>> +
>>>>>>>> + pfnmap_args.vma = vma;
>>>>>>>> + pfnmap_args.address = uaddr;
>>>>>>>> +
>>>>>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>>>>>> + if (rc) {
>>>>>>>> + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>>>>>>>> + NULL);
>>>>>>>> + if (rc)
>>>>>>>> + goto unlock_mmap_out;
>>>>>>>> +
>>>>>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>>>>>> + if (rc)
>>>>>>>> + goto unlock_mmap_out;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + *mmio_pfnp = pfnmap_args.pfn;
>>>>>>>> + follow_pfnmap_end(&pfnmap_args);
>>>>>>>> +
>>>>>>>> +unlock_mmap_out:
>>>>>>>> + mmap_read_unlock(current->mm);
>>>>>>>> +unlock_pt_out:
>>>>>>>> + spin_unlock(&pt->pt_mem_regions_lock);
>>>>>>>> + return rc;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>>>>>>>> + * and resolve if possible.
>>>>>>>> + * Returns: True if valid mmio intercept and it was handled, else false
>>>>>>>> + */
>>>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>>>>>>>> +{
>>>>>>>> + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>>>>>>>> + struct hv_x64_memory_intercept_message *msg;
>>>>>>>> + union hv_x64_memory_access_info accinfo;
>>>>>>>> + u64 gfn, mmio_spa, numpgs;
>>>>>>>> + struct mshv_mem_region *mreg;
>>>>>>>> + int rc;
>>>>>>>> + struct mshv_partition *pt = vp->vp_partition;
>>>>>>>> +
>>>>>>>> + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>>>>>>>> + accinfo = msg->memory_access_info;
>>>>>>>> +
>>>>>>>> + if (!accinfo.gva_gpa_valid)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>> + /* Do a fast check and bail if non mmio intercept */
>>>>>>>> + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>>>>>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>>>
>>>>>>> This call needs to be protected by the spinlock.
>>>>>>
>>>>>> This is sorta fast path to bail. We recheck under partition lock above.
>>>>>>
>>>>>
>>>>> Accessing the list of regions without lock is unsafe.
>>>>
>>>> I am not sure why? This check is done by a vcpu thread, so regions
>>>> will not have just gone away.
>>>>
>>>
>>> These are shared resources. Multiple VP threads get into this function
>>> simultaneously, so there is a race already. But this one we can live
>>> with without locking as they don't mutate the list of the regions.
>>>
>>> The issue happens when the VMM adds or removes another region as it mutates
>>> the list and races with VP threads doing this lookup.
>>>
>>> Thanks,
>>> Stanislav
>>>
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>> Thanks,
>>>>>> -Mukesh
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Stanislav
>>>>>>>
>>>>>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>> + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>>>>>>>> + if (rc)
>>>>>>>> + return false;
>>>>>>>> +
>>>>>>>> + if (!hv_nofull_mmio) { /* default case */
>>>>>>>> + gfn = mreg->start_gfn;
>>>>>>>> + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>>>>>>>> + numpgs = mreg->nr_pages;
>>>>>>>> + } else
>>>>>>>> + numpgs = 1;
>>>>>>>> +
>>>>>>>> + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>>>>>>>> +
>>>>>>>> + return rc == 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> static struct mshv_mem_region *
>>>>>>>> mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>>>>>>>> {
>>>>>>>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>>>>>>> return ret;
>>>>>>>> }
>>>>>>>> +
>>>>>>>> #else /* CONFIG_X86_64 */
>>>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>>>>>>>> static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>>>>>>>> #endif /* CONFIG_X86_64 */
>>>>>>>> static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>>>>>>>> {
>>>>>>>> switch (vp->vp_intercept_msg_page->header.message_type) {
>>>>>>>> + case HVMSG_UNMAPPED_GPA:
>>>>>>>> + return mshv_handle_unmapped_gpa(vp);
>>>>>>>> case HVMSG_GPA_INTERCEPT:
>>>>>>>> return mshv_handle_gpa_intercept(vp);
>>>>>>>> }
>>>>>>>> --
>>>>>>>> 2.51.2.vfs.0.1
>>>>>>>>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-02-04 22:52 ` Mukesh R
@ 2026-02-05 16:28 ` Stanislav Kinsburskii
2026-02-05 17:57 ` Mukesh R
0 siblings, 1 reply; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-02-05 16:28 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On Wed, Feb 04, 2026 at 02:52:54PM -0800, Mukesh R wrote:
> On 2/2/26 08:30, Stanislav Kinsburskii wrote:
> > On Fri, Jan 30, 2026 at 02:17:24PM -0800, Mukesh R wrote:
> > > On 1/27/26 10:57, Stanislav Kinsburskii wrote:
> > > > On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
> > > > > On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> > > > > > On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> > > > > > > On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > > > > > > > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > > > > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > > >
> > > > > > > > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > > > > > > > generates an unmapped gpa intercept. In this path, lookup the PCI
> > > > > > > > > resource pfn for the guest gpa, and ask the hypervisor to map it
> > > > > > > > > via hypercall. The PCI resource pfn is maintained by the VFIO driver,
> > > > > > > > > and obtained via fixup_user_fault call (similar to KVM).
> > > > > > > > >
> > > > > > > > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > > > ---
> > > > > > > > > drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > > > > > > > 1 file changed, 115 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > > > > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > > > > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > > > > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > > > > > > > };
> > > > > > > > > } __packed;
> > > > > > > > > +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> > > > > > > > > +static int __init setup_hv_full_mmio(char *str)
> > > > > > > > > +{
> > > > > > > > > + hv_nofull_mmio = true;
> > > > > > > > > + return 0;
> > > > > > > > > +}
> > > > > > > > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > > > > > > > +
> > > > > > > > > struct mshv_root mshv_root;
> > > > > > > > > enum hv_scheduler_type hv_scheduler_type;
> > > > > > > > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > > > > > > }
> > > > > > > > > #ifdef CONFIG_X86_64
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > > > > > > > + * else just return -errno.
> > > > > > > > > + */
> > > > > > > > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > > > > > > > + u64 *mmio_pfnp)
> > > > > > > > > +{
> > > > > > > > > + struct vm_area_struct *vma;
> > > > > > > > > + bool is_mmio;
> > > > > > > > > + u64 uaddr;
> > > > > > > > > + struct mshv_mem_region *mreg;
> > > > > > > > > + struct follow_pfnmap_args pfnmap_args;
> > > > > > > > > + int rc = -EINVAL;
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * Do not allow mem region to be deleted beneath us. VFIO uses
> > > > > > > > > + * useraddr vma to lookup pci bar pfn.
> > > > > > > > > + */
> > > > > > > > > + spin_lock(&pt->pt_mem_regions_lock);
> > > > > > > > > +
> > > > > > > > > + /* Get the region again under the lock */
> > > > > > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > > > + goto unlock_pt_out;
> > > > > > > > > +
> > > > > > > > > + uaddr = mreg->start_uaddr +
> > > > > > > > > + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > > > > > > > +
> > > > > > > > > + mmap_read_lock(current->mm);
> > > > > > > >
> > > > > > > > Semaphore can't be taken under spinlock.
> > > > > >
> > > > > > >
> > > > > > > Yeah, something didn't feel right here and I meant to recheck, now regret
> > > > > > > rushing to submit the patch.
> > > > > > >
> > > > > > > Rethinking, I think the pt_mem_regions_lock is not needed to protect
> > > > > > > the uaddr because unmap will properly serialize via the mm lock.
> > > > > > >
> > > > > > >
> > > > > > > > > + vma = vma_lookup(current->mm, uaddr);
> > > > > > > > > + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> > > > > > > >
> > > > > > > > Why this check is needed again?
> > > > > > >
> > > > > > > To make sure region did not change. This check is under lock.
> > > > > > >
> > > > > >
> > > > > > How can this happen? One can't change VMA type without unmapping it
> > > > > > first. And unmapping it leads to a kernel MMIO region state dangling
> > > > > > around without corresponding user space mapping.
> > > > >
> > > > > Right, and vm_flags would not be mmio expected then.
> > > > >
> > > > > > This is similar to dangling pinned regions and should likely be
> > > > > > addressed the same way by utilizing MMU notifiers to destroy memory
> > > > > > regions if the VMA is detached.
> > > > >
> > > > > I don't think we need that. Either it succeeds if the region did not
> > > > > change at all, or just fails.
> > > > >
> > > >
> > > > I'm afraid we do, as if the driver mapped a page with the previous
> > > > memory region, and then the region is unmapped, the page will stay
> > > > mapped in the hypervisor, but will be considered free by kernel, which
> > > > in turn will lead to GPF upon next allocation.
> > >
> > > There are no ram pages for mmio regions. Also, we don't do much with
> > > mmio regions other than tell the hyp about it.
> > >
> >
> > So, are you saying that the hypervisor does not use these pages and only
> > tracks them? That would make things easier.
> > However, if we later try to map a GPA that is already mapped, will the
> > hypervisor return an error?
>
> Hypervisor does not return an error.
>
So, what happens if we map a GPA that is already mapped? Does it just
remap it to the new PFN?
Thanks,
Stanislav
>
>
> > Thanks,
> > Stanislav
> >
> > > Thanks,
> > > -Mukesh
> > >
> > >
> > > > > With pinned regions the issue is similar but less impacting: pages can't
> > > > be released by user space unmapping and thus will be simply leaked, but
> > > > the system stays intact.
> > > >
> > > > > MMIO regions are similar to movable regions in this regard: they don't
> > > > > reference the user pages, and thus this guest region replacement is a
> > > > > straight way to kernel panic.
> > > >
> > > > >
> > > > > > > > The region type is stored on the region itself.
> > > > > > > > And the type is checked on the caller side.
> > > > > > > >
> > > > > > > > > + if (!is_mmio)
> > > > > > > > > + goto unlock_mmap_out;
> > > > > > > > > +
> > > > > > > > > + pfnmap_args.vma = vma;
> > > > > > > > > + pfnmap_args.address = uaddr;
> > > > > > > > > +
> > > > > > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > > > + if (rc) {
> > > > > > > > > + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > > > > > > > + NULL);
> > > > > > > > > + if (rc)
> > > > > > > > > + goto unlock_mmap_out;
> > > > > > > > > +
> > > > > > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > > > + if (rc)
> > > > > > > > > + goto unlock_mmap_out;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + *mmio_pfnp = pfnmap_args.pfn;
> > > > > > > > > + follow_pfnmap_end(&pfnmap_args);
> > > > > > > > > +
> > > > > > > > > +unlock_mmap_out:
> > > > > > > > > + mmap_read_unlock(current->mm);
> > > > > > > > > +unlock_pt_out:
> > > > > > > > > + spin_unlock(&pt->pt_mem_regions_lock);
> > > > > > > > > + return rc;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > > > > > > > + * and resolve if possible.
> > > > > > > > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > > > > > > > + */
> > > > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > > > > > > > +{
> > > > > > > > > + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > > > > > > > + struct hv_x64_memory_intercept_message *msg;
> > > > > > > > > + union hv_x64_memory_access_info accinfo;
> > > > > > > > > + u64 gfn, mmio_spa, numpgs;
> > > > > > > > > + struct mshv_mem_region *mreg;
> > > > > > > > > + int rc;
> > > > > > > > > + struct mshv_partition *pt = vp->vp_partition;
> > > > > > > > > +
> > > > > > > > > + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > > > > > > > + accinfo = msg->memory_access_info;
> > > > > > > > > +
> > > > > > > > > + if (!accinfo.gva_gpa_valid)
> > > > > > > > > + return false;
> > > > > > > > > +
> > > > > > > > > + /* Do a fast check and bail if non mmio intercept */
> > > > > > > > > + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > > > > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > > >
> > > > > > > > This call needs to be protected by the spinlock.
> > > > > > >
> > > > > > > This is sorta fast path to bail. We recheck under partition lock above.
> > > > > > >
> > > > > >
> > > > > > Accessing the list of regions without lock is unsafe.
> > > > >
> > > > > I am not sure why? This check is done by a vcpu thread, so regions
> > > > > will not have just gone away.
> > > > >
> > > >
> > > > These are shared resources. Multiple VP threads get into this function
> > > > simultaneously, so there is a race already. But this one we can live
> > > > with without locking as they don't mutate the list of the regions.
> > > >
> > > > The issue happens when the VMM adds or removes another region as it mutates
> > > > the list and races with VP threads doing this lookup.
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > >
> > > > > Thanks,
> > > > > -Mukesh
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > > > Thanks,
> > > > > > > -Mukesh
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > > > + return false;
> > > > > > > > > +
> > > > > > > > > + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > > > > > > > + if (rc)
> > > > > > > > > + return false;
> > > > > > > > > +
> > > > > > > > > + if (!hv_nofull_mmio) { /* default case */
> > > > > > > > > + gfn = mreg->start_gfn;
> > > > > > > > > + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > > > > > > > + numpgs = mreg->nr_pages;
> > > > > > > > > + } else
> > > > > > > > > + numpgs = 1;
> > > > > > > > > +
> > > > > > > > > + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > > > > > > > +
> > > > > > > > > + return rc == 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > static struct mshv_mem_region *
> > > > > > > > > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > > > > > > {
> > > > > > > > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > > > > > > return ret;
> > > > > > > > > }
> > > > > > > > > +
> > > > > > > > > #else /* CONFIG_X86_64 */
> > > > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > > > > > > > static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > > > > > > #endif /* CONFIG_X86_64 */
> > > > > > > > > static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > > > > > > {
> > > > > > > > > switch (vp->vp_intercept_msg_page->header.message_type) {
> > > > > > > > > + case HVMSG_UNMAPPED_GPA:
> > > > > > > > > + return mshv_handle_unmapped_gpa(vp);
> > > > > > > > > case HVMSG_GPA_INTERCEPT:
> > > > > > > > > return mshv_handle_gpa_intercept(vp);
> > > > > > > > > }
> > > > > > > > > --
> > > > > > > > > 2.51.2.vfs.0.1
> > > > > > > > >
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-02-05 16:28 ` Stanislav Kinsburskii
@ 2026-02-05 17:57 ` Mukesh R
2026-02-05 18:31 ` Stanislav Kinsburskii
0 siblings, 1 reply; 69+ messages in thread
From: Mukesh R @ 2026-02-05 17:57 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On 2/5/26 08:28, Stanislav Kinsburskii wrote:
> On Wed, Feb 04, 2026 at 02:52:54PM -0800, Mukesh R wrote:
>> On 2/2/26 08:30, Stanislav Kinsburskii wrote:
>>> On Fri, Jan 30, 2026 at 02:17:24PM -0800, Mukesh R wrote:
>>>> On 1/27/26 10:57, Stanislav Kinsburskii wrote:
>>>>> On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
>>>>>> On 1/26/26 10:15, Stanislav Kinsburskii wrote:
>>>>>>> On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
>>>>>>>> On 1/20/26 17:53, Stanislav Kinsburskii wrote:
>>>>>>>>> On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
>>>>>>>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>>>>>
>>>>>>>>>> Upon guest access, in case of missing mmio mapping, the hypervisor
>>>>>>>>>> generates an unmapped gpa intercept. In this path, lookup the PCI
>>>>>>>>>> resource pfn for the guest gpa, and ask the hypervisor to map it
>>>>>>>>>> via hypercall. The PCI resource pfn is maintained by the VFIO driver,
>>>>>>>>>> and obtained via fixup_user_fault call (similar to KVM).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
>>>>>>>>>> 1 file changed, 115 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>>>>>>>> index 03f3aa9f5541..4c8bc7cd0888 100644
>>>>>>>>>> --- a/drivers/hv/mshv_root_main.c
>>>>>>>>>> +++ b/drivers/hv/mshv_root_main.c
>>>>>>>>>> @@ -56,6 +56,14 @@ struct hv_stats_page {
>>>>>>>>>> };
>>>>>>>>>> } __packed;
>>>>>>>>>> +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
>>>>>>>>>> +static int __init setup_hv_full_mmio(char *str)
>>>>>>>>>> +{
>>>>>>>>>> + hv_nofull_mmio = true;
>>>>>>>>>> + return 0;
>>>>>>>>>> +}
>>>>>>>>>> +__setup("hv_nofull_mmio", setup_hv_full_mmio);
>>>>>>>>>> +
>>>>>>>>>> struct mshv_root mshv_root;
>>>>>>>>>> enum hv_scheduler_type hv_scheduler_type;
>>>>>>>>>> @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
>>>>>>>>>> }
>>>>>>>>>> #ifdef CONFIG_X86_64
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
>>>>>>>>>> + * else just return -errno.
>>>>>>>>>> + */
>>>>>>>>>> +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
>>>>>>>>>> + u64 *mmio_pfnp)
>>>>>>>>>> +{
>>>>>>>>>> + struct vm_area_struct *vma;
>>>>>>>>>> + bool is_mmio;
>>>>>>>>>> + u64 uaddr;
>>>>>>>>>> + struct mshv_mem_region *mreg;
>>>>>>>>>> + struct follow_pfnmap_args pfnmap_args;
>>>>>>>>>> + int rc = -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + /*
>>>>>>>>>> + * Do not allow mem region to be deleted beneath us. VFIO uses
>>>>>>>>>> + * useraddr vma to lookup pci bar pfn.
>>>>>>>>>> + */
>>>>>>>>>> + spin_lock(&pt->pt_mem_regions_lock);
>>>>>>>>>> +
>>>>>>>>>> + /* Get the region again under the lock */
>>>>>>>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>>>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>>>>>> + goto unlock_pt_out;
>>>>>>>>>> +
>>>>>>>>>> + uaddr = mreg->start_uaddr +
>>>>>>>>>> + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
>>>>>>>>>> +
>>>>>>>>>> + mmap_read_lock(current->mm);
>>>>>>>>>
>>>>>>>>> Semaphore can't be taken under spinlock.
>>>>>>>
>>>>>>>>
>>>>>>>> Yeah, something didn't feel right here and I meant to recheck, now regret
>>>>>>>> rushing to submit the patch.
>>>>>>>>
>>>>>>>> Rethinking, I think the pt_mem_regions_lock is not needed to protect
>>>>>>>> the uaddr because unmap will properly serialize via the mm lock.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> + vma = vma_lookup(current->mm, uaddr);
>>>>>>>>>> + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
>>>>>>>>>
>>>>>>>>> Why this check is needed again?
>>>>>>>>
>>>>>>>> To make sure region did not change. This check is under lock.
>>>>>>>>
>>>>>>>
>>>>>>> How can this happen? One can't change VMA type without unmapping it
>>>>>>> first. And unmapping it leads to a kernel MMIO region state dangling
>>>>>>> around without corresponding user space mapping.
>>>>>>
>>>>>> Right, and vm_flags would not be mmio expected then.
>>>>>>
>>>>>>> This is similar to dangling pinned regions and should likely be
>>>>>>> addressed the same way by utilizing MMU notifiers to destroy memory
>>>>>>> regions if the VMA is detached.
>>>>>>
>>>>>> I don't think we need that. Either it succeeds if the region did not
>>>>>> change at all, or just fails.
>>>>>>
>>>>>
>>>>> I'm afraid we do, as if the driver mapped a page with the previous
>>>>> memory region, and then the region is unmapped, the page will stay
>>>>> mapped in the hypervisor, but will be considered free by kernel, which
>>>>> in turn will lead to GPF upon next allocation.
>>>>
>>>> There are no ram pages for mmio regions. Also, we don't do much with
>>>> mmio regions other than tell the hyp about it.
>>>>
>>>
>>> So, are you saying that the hypervisor does not use these pages and only
>>> tracks them? That would make things easier.
>>> However, if we later try to map a GPA that is already mapped, will the
>>> hypervisor return an error?
>>
>> Hypervisor does not return an error.
>>
>
> So, what happens if we map a GPA that is already mapped? Does it just
> remap it to the new PFN?
yes, otherwise it would return an error, right?
> Thanks,
> Stanislav
>
>>
>>
>>> Thanks,
>>> Stanislav
>>>
>>>> Thanks,
>>>> -Mukesh
>>>>
>>>>
>>>>> With pinned regions the issue is similar but less impacting: pages can't
>>>>> be released by user space unmapping and thus will be simply leaked, but
>>>>> the system stays intact.
>>>>>
>>>>> MMIO regions are simila to movable region in this regard: they don't
>>>>> reference the user pages, and thus this guest region replaement is a
>>>>> stright wat to kernel panic.
>>>>>
>>>>>>
>>>>>>>>> The region type is stored on the region itself.
>>>>>>>>> And the type is checked on the caller side.
>>>>>>>>>
>>>>>>>>>> + if (!is_mmio)
>>>>>>>>>> + goto unlock_mmap_out;
>>>>>>>>>> +
>>>>>>>>>> + pfnmap_args.vma = vma;
>>>>>>>>>> + pfnmap_args.address = uaddr;
>>>>>>>>>> +
>>>>>>>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>>>>>>>> + if (rc) {
>>>>>>>>>> + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
>>>>>>>>>> + NULL);
>>>>>>>>>> + if (rc)
>>>>>>>>>> + goto unlock_mmap_out;
>>>>>>>>>> +
>>>>>>>>>> + rc = follow_pfnmap_start(&pfnmap_args);
>>>>>>>>>> + if (rc)
>>>>>>>>>> + goto unlock_mmap_out;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + *mmio_pfnp = pfnmap_args.pfn;
>>>>>>>>>> + follow_pfnmap_end(&pfnmap_args);
>>>>>>>>>> +
>>>>>>>>>> +unlock_mmap_out:
>>>>>>>>>> + mmap_read_unlock(current->mm);
>>>>>>>>>> +unlock_pt_out:
>>>>>>>>>> + spin_unlock(&pt->pt_mem_regions_lock);
>>>>>>>>>> + return rc;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>> + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
>>>>>>>>>> + * and resolve if possible.
>>>>>>>>>> + * Returns: True if valid mmio intercept and it was handled, else false
>>>>>>>>>> + */
>>>>>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
>>>>>>>>>> +{
>>>>>>>>>> + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
>>>>>>>>>> + struct hv_x64_memory_intercept_message *msg;
>>>>>>>>>> + union hv_x64_memory_access_info accinfo;
>>>>>>>>>> + u64 gfn, mmio_spa, numpgs;
>>>>>>>>>> + struct mshv_mem_region *mreg;
>>>>>>>>>> + int rc;
>>>>>>>>>> + struct mshv_partition *pt = vp->vp_partition;
>>>>>>>>>> +
>>>>>>>>>> + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
>>>>>>>>>> + accinfo = msg->memory_access_info;
>>>>>>>>>> +
>>>>>>>>>> + if (!accinfo.gva_gpa_valid)
>>>>>>>>>> + return false;
>>>>>>>>>> +
>>>>>>>>>> + /* Do a fast check and bail if non mmio intercept */
>>>>>>>>>> + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
>>>>>>>>>> + mreg = mshv_partition_region_by_gfn(pt, gfn);
>>>>>>>>>
>>>>>>>>> This call needs to be protected by the spinlock.
>>>>>>>>
>>>>>>>> This is sort of a fast path to bail early. We recheck under the partition lock above.
>>>>>>>>
>>>>>>>
>>>>>>> Accessing the list of regions without lock is unsafe.
>>>>>>
>>>>>> I am not sure why? This check is done by a vcpu thread, so regions
>>>>>> will not have just gone away.
>>>>>>
>>>>>
>>>>> This is a shared resource. Multiple VP threads get into this function
>>>>> simultaneously, so there is a race already. But this one we can live
>>>>> with without locking as they don't mutate the list of the regions.
>>>>>
>>>>> The issue happens when the VMM adds or removes another region, as it mutates
>>>>> the list and races with VP threads doing this lookup.
>>>>>
>>>>> Thanks,
>>>>> Stanislav
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> -Mukesh
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Stanislav
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Mukesh
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Stanislav
>>>>>>>>>
>>>>>>>>>> + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
>>>>>>>>>> + return false;
>>>>>>>>>> +
>>>>>>>>>> + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
>>>>>>>>>> + if (rc)
>>>>>>>>>> + return false;
>>>>>>>>>> +
>>>>>>>>>> + if (!hv_nofull_mmio) { /* default case */
>>>>>>>>>> + gfn = mreg->start_gfn;
>>>>>>>>>> + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
>>>>>>>>>> + numpgs = mreg->nr_pages;
>>>>>>>>>> + } else
>>>>>>>>>> + numpgs = 1;
>>>>>>>>>> +
>>>>>>>>>> + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
>>>>>>>>>> +
>>>>>>>>>> + return rc == 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> static struct mshv_mem_region *
>>>>>>>>>> mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
>>>>>>>>>> {
>>>>>>>>>> @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
>>>>>>>>>> return ret;
>>>>>>>>>> }
>>>>>>>>>> +
>>>>>>>>>> #else /* CONFIG_X86_64 */
>>>>>>>>>> +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
>>>>>>>>>> static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
>>>>>>>>>> #endif /* CONFIG_X86_64 */
>>>>>>>>>> static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
>>>>>>>>>> {
>>>>>>>>>> switch (vp->vp_intercept_msg_page->header.message_type) {
>>>>>>>>>> + case HVMSG_UNMAPPED_GPA:
>>>>>>>>>> + return mshv_handle_unmapped_gpa(vp);
>>>>>>>>>> case HVMSG_GPA_INTERCEPT:
>>>>>>>>>> return mshv_handle_gpa_intercept(vp);
>>>>>>>>>> }
>>>>>>>>>> --
>>>>>>>>>> 2.51.2.vfs.0.1
>>>>>>>>>>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru
2026-02-05 17:57 ` Mukesh R
@ 2026-02-05 18:31 ` Stanislav Kinsburskii
0 siblings, 0 replies; 69+ messages in thread
From: Stanislav Kinsburskii @ 2026-02-05 18:31 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux
On Thu, Feb 05, 2026 at 09:57:20AM -0800, Mukesh R wrote:
> On 2/5/26 08:28, Stanislav Kinsburskii wrote:
> > On Wed, Feb 04, 2026 at 02:52:54PM -0800, Mukesh R wrote:
> > > On 2/2/26 08:30, Stanislav Kinsburskii wrote:
> > > > On Fri, Jan 30, 2026 at 02:17:24PM -0800, Mukesh R wrote:
> > > > > On 1/27/26 10:57, Stanislav Kinsburskii wrote:
> > > > > > On Mon, Jan 26, 2026 at 07:07:22PM -0800, Mukesh R wrote:
> > > > > > > On 1/26/26 10:15, Stanislav Kinsburskii wrote:
> > > > > > > > On Fri, Jan 23, 2026 at 06:19:15PM -0800, Mukesh R wrote:
> > > > > > > > > On 1/20/26 17:53, Stanislav Kinsburskii wrote:
> > > > > > > > > > On Mon, Jan 19, 2026 at 10:42:30PM -0800, Mukesh R wrote:
> > > > > > > > > > > From: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > > > > >
> > > > > > > > > > > Upon guest access, in case of missing mmio mapping, the hypervisor
> > > > > > > > > > > generates an unmapped gpa intercept. In this path, look up the PCI
> > > > > > > > > > > resource pfn for the guest gpa and ask the hypervisor to map it
> > > > > > > > > > > via hypercall. The PCI resource pfn is maintained by the VFIO driver
> > > > > > > > > > > and obtained via a fixup_user_fault call (similar to KVM).
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > > > > > > > > > ---
> > > > > > > > > > > drivers/hv/mshv_root_main.c | 115 ++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > 1 file changed, 115 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > > > > > > > > > index 03f3aa9f5541..4c8bc7cd0888 100644
> > > > > > > > > > > --- a/drivers/hv/mshv_root_main.c
> > > > > > > > > > > +++ b/drivers/hv/mshv_root_main.c
> > > > > > > > > > > @@ -56,6 +56,14 @@ struct hv_stats_page {
> > > > > > > > > > > };
> > > > > > > > > > > } __packed;
> > > > > > > > > > > +bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
> > > > > > > > > > > +static int __init setup_hv_full_mmio(char *str)
> > > > > > > > > > > +{
> > > > > > > > > > > + hv_nofull_mmio = true;
> > > > > > > > > > > + return 0;
> > > > > > > > > > > +}
> > > > > > > > > > > +__setup("hv_nofull_mmio", setup_hv_full_mmio);
> > > > > > > > > > > +
> > > > > > > > > > > struct mshv_root mshv_root;
> > > > > > > > > > > enum hv_scheduler_type hv_scheduler_type;
> > > > > > > > > > > @@ -612,6 +620,109 @@ mshv_partition_region_by_gfn(struct mshv_partition *partition, u64 gfn)
> > > > > > > > > > > }
> > > > > > > > > > > #ifdef CONFIG_X86_64
> > > > > > > > > > > +
> > > > > > > > > > > +/*
> > > > > > > > > > > + * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
> > > > > > > > > > > + * else just return -errno.
> > > > > > > > > > > + */
> > > > > > > > > > > +static int mshv_chk_get_mmio_start_pfn(struct mshv_partition *pt, u64 gfn,
> > > > > > > > > > > + u64 *mmio_pfnp)
> > > > > > > > > > > +{
> > > > > > > > > > > + struct vm_area_struct *vma;
> > > > > > > > > > > + bool is_mmio;
> > > > > > > > > > > + u64 uaddr;
> > > > > > > > > > > + struct mshv_mem_region *mreg;
> > > > > > > > > > > + struct follow_pfnmap_args pfnmap_args;
> > > > > > > > > > > + int rc = -EINVAL;
> > > > > > > > > > > +
> > > > > > > > > > > + /*
> > > > > > > > > > > + * Do not allow mem region to be deleted beneath us. VFIO uses
> > > > > > > > > > > + * useraddr vma to lookup pci bar pfn.
> > > > > > > > > > > + */
> > > > > > > > > > > + spin_lock(&pt->pt_mem_regions_lock);
> > > > > > > > > > > +
> > > > > > > > > > > + /* Get the region again under the lock */
> > > > > > > > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > > > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > > > > > + goto unlock_pt_out;
> > > > > > > > > > > +
> > > > > > > > > > > + uaddr = mreg->start_uaddr +
> > > > > > > > > > > + ((gfn - mreg->start_gfn) << HV_HYP_PAGE_SHIFT);
> > > > > > > > > > > +
> > > > > > > > > > > + mmap_read_lock(current->mm);
> > > > > > > > > >
> > > > > > > > > > Semaphore can't be taken under spinlock.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Yeah, something didn't feel right here and I meant to recheck, now regret
> > > > > > > > > rushing to submit the patch.
> > > > > > > > >
> > > > > > > > > Rethinking, I think the pt_mem_regions_lock is not needed to protect
> > > > > > > > > the uaddr because unmap will properly serialize via the mm lock.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > + vma = vma_lookup(current->mm, uaddr);
> > > > > > > > > > > + is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
> > > > > > > > > >
> > > > > > > > > > Why is this check needed again?
> > > > > > > > >
> > > > > > > > > To make sure region did not change. This check is under lock.
> > > > > > > > >
> > > > > > > >
> > > > > > > > How can this happen? One can't change VMA type without unmapping it
> > > > > > > > first. And unmapping it leads to a kernel MMIO region state dangling
> > > > > > > > around without corresponding user space mapping.
> > > > > > >
> > > > > > > Right, and vm_flags would not have the expected mmio flags then.
> > > > > > >
> > > > > > > > This is similar to dangling pinned regions and should likely be
> > > > > > > > addressed the same way by utilizing MMU notifiers to destroy memory
> > > > > > > > regions if the VMA is detached.
> > > > > > >
> > > > > > > I don't think we need that. Either it succeeds if the region did not
> > > > > > > change at all, or just fails.
> > > > > > >
> > > > > >
> > > > > > I'm afraid we do, as if the driver mapped a page with the previous
> > > > > > memory region, and then the region is unmapped, the page will stay
> > > > > > mapped in the hypervisor, but will be considered free by the kernel, which
> > > > > > in turn will lead to a GPF upon the next allocation.
> > > > >
> > > > > There are no ram pages for mmio regions. Also, we don't do much with
> > > > > mmio regions other than tell the hyp about it.
> > > > >
> > > >
> > > > So, are you saying that the hypervisor does not use these pages and only
> > > > tracks them? That would make things easier.
> > > > However, if we later try to map a GPA that is already mapped, will the
> > > > hypervisor return an error?
> > >
> > > Hypervisor does not return an error.
> > >
> >
> > So, what happens if we map a GPA that is already mapped? Does it just
> > remap it to the new PFN?
>
> Yes, otherwise it would return an error, right?
>
I see.
Please summarize and document this behaviour in the commit message.
Thanks,
Stanislav
> > Thanks,
> > Stanislav
> >
> > >
> > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > > Thanks,
> > > > > -Mukesh
> > > > >
> > > > >
> > > > > > With pinned regions the issue is similar but less impactful: pages can't
> > > > > > be released by user space unmapping and thus will simply be leaked, but
> > > > > > the system stays intact.
> > > > > >
> > > > > > MMIO regions are similar to movable regions in this regard: they don't
> > > > > > reference the user pages, and thus this guest region replacement is a
> > > > > > straight path to a kernel panic.
> > > > > >
> > > > > > >
> > > > > > > > > > The region type is stored on the region itself.
> > > > > > > > > > And the type is checked on the caller side.
> > > > > > > > > >
> > > > > > > > > > > + if (!is_mmio)
> > > > > > > > > > > + goto unlock_mmap_out;
> > > > > > > > > > > +
> > > > > > > > > > > + pfnmap_args.vma = vma;
> > > > > > > > > > > + pfnmap_args.address = uaddr;
> > > > > > > > > > > +
> > > > > > > > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > > > > > + if (rc) {
> > > > > > > > > > > + rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
> > > > > > > > > > > + NULL);
> > > > > > > > > > > + if (rc)
> > > > > > > > > > > + goto unlock_mmap_out;
> > > > > > > > > > > +
> > > > > > > > > > > + rc = follow_pfnmap_start(&pfnmap_args);
> > > > > > > > > > > + if (rc)
> > > > > > > > > > > + goto unlock_mmap_out;
> > > > > > > > > > > + }
> > > > > > > > > > > +
> > > > > > > > > > > + *mmio_pfnp = pfnmap_args.pfn;
> > > > > > > > > > > + follow_pfnmap_end(&pfnmap_args);
> > > > > > > > > > > +
> > > > > > > > > > > +unlock_mmap_out:
> > > > > > > > > > > + mmap_read_unlock(current->mm);
> > > > > > > > > > > +unlock_pt_out:
> > > > > > > > > > > + spin_unlock(&pt->pt_mem_regions_lock);
> > > > > > > > > > > + return rc;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +/*
> > > > > > > > > > > + * At present, the only unmapped gpa is mmio space. Verify if it's mmio
> > > > > > > > > > > + * and resolve if possible.
> > > > > > > > > > > + * Returns: True if valid mmio intercept and it was handled, else false
> > > > > > > > > > > + */
> > > > > > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
> > > > > > > > > > > +{
> > > > > > > > > > > + struct hv_message *hvmsg = vp->vp_intercept_msg_page;
> > > > > > > > > > > + struct hv_x64_memory_intercept_message *msg;
> > > > > > > > > > > + union hv_x64_memory_access_info accinfo;
> > > > > > > > > > > + u64 gfn, mmio_spa, numpgs;
> > > > > > > > > > > + struct mshv_mem_region *mreg;
> > > > > > > > > > > + int rc;
> > > > > > > > > > > + struct mshv_partition *pt = vp->vp_partition;
> > > > > > > > > > > +
> > > > > > > > > > > + msg = (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
> > > > > > > > > > > + accinfo = msg->memory_access_info;
> > > > > > > > > > > +
> > > > > > > > > > > + if (!accinfo.gva_gpa_valid)
> > > > > > > > > > > + return false;
> > > > > > > > > > > +
> > > > > > > > > > > + /* Do a fast check and bail if non mmio intercept */
> > > > > > > > > > > + gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
> > > > > > > > > > > + mreg = mshv_partition_region_by_gfn(pt, gfn);
> > > > > > > > > >
> > > > > > > > > > This call needs to be protected by the spinlock.
> > > > > > > > >
> > > > > > > > > This is sort of a fast path to bail early. We recheck under the partition lock above.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Accessing the list of regions without lock is unsafe.
> > > > > > >
> > > > > > > I am not sure why? This check is done by a vcpu thread, so regions
> > > > > > > will not have just gone away.
> > > > > > >
> > > > > >
> > > > > > This is a shared resource. Multiple VP threads get into this function
> > > > > > simultaneously, so there is a race already. But this one we can live
> > > > > > with without locking as they don't mutate the list of the regions.
> > > > > >
> > > > > > The issue happens when the VMM adds or removes another region, as it mutates
> > > > > > the list and races with VP threads doing this lookup.
> > > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > -Mukesh
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > -Mukesh
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Stanislav
> > > > > > > > > >
> > > > > > > > > > > + if (mreg == NULL || mreg->type != MSHV_REGION_TYPE_MMIO)
> > > > > > > > > > > + return false;
> > > > > > > > > > > +
> > > > > > > > > > > + rc = mshv_chk_get_mmio_start_pfn(pt, gfn, &mmio_spa);
> > > > > > > > > > > + if (rc)
> > > > > > > > > > > + return false;
> > > > > > > > > > > +
> > > > > > > > > > > + if (!hv_nofull_mmio) { /* default case */
> > > > > > > > > > > + gfn = mreg->start_gfn;
> > > > > > > > > > > + mmio_spa = mmio_spa - (gfn - mreg->start_gfn);
> > > > > > > > > > > + numpgs = mreg->nr_pages;
> > > > > > > > > > > + } else
> > > > > > > > > > > + numpgs = 1;
> > > > > > > > > > > +
> > > > > > > > > > > + rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
> > > > > > > > > > > +
> > > > > > > > > > > + return rc == 0;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > static struct mshv_mem_region *
> > > > > > > > > > > mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
> > > > > > > > > > > {
> > > > > > > > > > > @@ -666,13 +777,17 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > > > > > > > > > return ret;
> > > > > > > > > > > }
> > > > > > > > > > > +
> > > > > > > > > > > #else /* CONFIG_X86_64 */
> > > > > > > > > > > +static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp) { return false; }
> > > > > > > > > > > static bool mshv_handle_gpa_intercept(struct mshv_vp *vp) { return false; }
> > > > > > > > > > > #endif /* CONFIG_X86_64 */
> > > > > > > > > > > static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
> > > > > > > > > > > {
> > > > > > > > > > > switch (vp->vp_intercept_msg_page->header.message_type) {
> > > > > > > > > > > + case HVMSG_UNMAPPED_GPA:
> > > > > > > > > > > + return mshv_handle_unmapped_gpa(vp);
> > > > > > > > > > > case HVMSG_GPA_INTERCEPT:
> > > > > > > > > > > return mshv_handle_gpa_intercept(vp);
> > > > > > > > > > > }
> > > > > > > > > > > --
> > > > > > > > > > > 2.51.2.vfs.0.1
> > > > > > > > > > >
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH v0 02/15] x86/hyperv: cosmetic changes in irqdomain.c for readability
2026-01-20 6:42 ` [PATCH v0 02/15] x86/hyperv: cosmetic changes in irqdomain.c for readability Mukesh R
@ 2026-02-05 18:47 ` Anirudh Rayabharam
0 siblings, 0 replies; 69+ messages in thread
From: Anirudh Rayabharam @ 2026-02-05 18:47 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:17PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> Make cosmetic changes:
> o Rename struct pci_dev *dev to *pdev since there are cases of
> struct device *dev in the file and all over the kernel
> o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
> of building different types of device ids
> o Fix checkpatch.pl issues with return and extraneous printk
> o Replace spaces with tabs
> o Rename struct hv_devid *xxx to struct hv_devid *hv_devid given code
> paths involve many types of device ids
> o Fix indentation in a large if block by using goto.
>
> There are no functional changes.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
> arch/x86/hyperv/irqdomain.c | 197 +++++++++++++++++++-----------------
> 1 file changed, 103 insertions(+), 94 deletions(-)
>
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index c3ba12b1bc07..f6b61483b3b8 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -1,5 +1,4 @@
> // SPDX-License-Identifier: GPL-2.0
> -
> /*
> * Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
> *
> @@ -14,8 +13,8 @@
> #include <linux/irqchip/irq-msi-lib.h>
> #include <asm/mshyperv.h>
>
> -static int hv_map_interrupt(union hv_device_id device_id, bool level,
> - int cpu, int vector, struct hv_interrupt_entry *entry)
> +static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
> + int cpu, int vector, struct hv_interrupt_entry *ret_entry)
> {
> struct hv_input_map_device_interrupt *input;
> struct hv_output_map_device_interrupt *output;
> @@ -32,7 +31,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
> intr_desc = &input->interrupt_descriptor;
> memset(input, 0, sizeof(*input));
> input->partition_id = hv_current_partition_id;
> - input->device_id = device_id.as_uint64;
> + input->device_id = hv_devid.as_uint64;
> intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
> intr_desc->vector_count = 1;
> intr_desc->target.vector = vector;
> @@ -44,7 +43,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
>
> intr_desc->target.vp_set.valid_bank_mask = 0;
> intr_desc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
> - nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
> + nr_bank = cpumask_to_vpset(&intr_desc->target.vp_set, cpumask_of(cpu));
> if (nr_bank < 0) {
> local_irq_restore(flags);
> pr_err("%s: unable to generate VP set\n", __func__);
> @@ -61,7 +60,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
>
> status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
> input, output);
> - *entry = output->interrupt_entry;
> + *ret_entry = output->interrupt_entry;
>
> local_irq_restore(flags);
>
> @@ -71,21 +70,19 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
> return hv_result_to_errno(status);
> }
>
> -static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
> +static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
> {
> unsigned long flags;
> struct hv_input_unmap_device_interrupt *input;
> - struct hv_interrupt_entry *intr_entry;
> u64 status;
>
> local_irq_save(flags);
> input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>
> memset(input, 0, sizeof(*input));
> - intr_entry = &input->interrupt_entry;
> input->partition_id = hv_current_partition_id;
> input->device_id = id;
> - *intr_entry = *old_entry;
> + input->interrupt_entry = *irq_entry;
>
> status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
> local_irq_restore(flags);
> @@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
> return 0;
> }
>
> -static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
> +static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
> {
> - union hv_device_id dev_id;
> + int pos;
> + union hv_device_id hv_devid;
> struct rid_data data = {
> .bridge = NULL,
> - .rid = PCI_DEVID(dev->bus->number, dev->devfn)
> + .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
> };
>
> - pci_for_each_dma_alias(dev, get_rid_cb, &data);
> + pci_for_each_dma_alias(pdev, get_rid_cb, &data);
>
> - dev_id.as_uint64 = 0;
> - dev_id.device_type = HV_DEVICE_TYPE_PCI;
> - dev_id.pci.segment = pci_domain_nr(dev->bus);
> + hv_devid.as_uint64 = 0;
> + hv_devid.device_type = HV_DEVICE_TYPE_PCI;
> + hv_devid.pci.segment = pci_domain_nr(pdev->bus);
>
> - dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
> - dev_id.pci.bdf.device = PCI_SLOT(data.rid);
> - dev_id.pci.bdf.function = PCI_FUNC(data.rid);
> - dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
> + hv_devid.pci.bdf.bus = PCI_BUS_NUM(data.rid);
> + hv_devid.pci.bdf.device = PCI_SLOT(data.rid);
> + hv_devid.pci.bdf.function = PCI_FUNC(data.rid);
> + hv_devid.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
>
> - if (data.bridge) {
> - int pos;
> + if (data.bridge == NULL)
> + goto out;
>
> - /*
> - * Microsoft Hypervisor requires a bus range when the bridge is
> - * running in PCI-X mode.
> - *
> - * To distinguish conventional vs PCI-X bridge, we can check
> - * the bridge's PCI-X Secondary Status Register, Secondary Bus
> - * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
> - * Specification Revision 1.0 5.2.2.1.3.
> - *
> - * Value zero means it is in conventional mode, otherwise it is
> - * in PCI-X mode.
> - */
> + /*
> + * Microsoft Hypervisor requires a bus range when the bridge is
> + * running in PCI-X mode.
> + *
> + * To distinguish conventional vs PCI-X bridge, we can check
> + * the bridge's PCI-X Secondary Status Register, Secondary Bus
> + * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
> + * Specification Revision 1.0 5.2.2.1.3.
> + *
> + * Value zero means it is in conventional mode, otherwise it is
> + * in PCI-X mode.
> + */
>
> - pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> - if (pos) {
> - u16 status;
> + pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
> + if (pos) {
> + u16 status;
>
> - pci_read_config_word(data.bridge, pos +
> - PCI_X_BRIDGE_SSTATUS, &status);
> + pci_read_config_word(data.bridge, pos + PCI_X_BRIDGE_SSTATUS,
> + &status);
>
> - if (status & PCI_X_SSTATUS_FREQ) {
> - /* Non-zero, PCI-X mode */
> - u8 sec_bus, sub_bus;
> + if (status & PCI_X_SSTATUS_FREQ) {
> + /* Non-zero, PCI-X mode */
> + u8 sec_bus, sub_bus;
>
> - dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
> + hv_devid.pci.source_shadow =
> + HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
>
> - pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
> - dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
> - pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
> - dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
> - }
> + pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS,
> + &sec_bus);
> + hv_devid.pci.shadow_bus_range.secondary_bus = sec_bus;
> + pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS,
> + &sub_bus);
> + hv_devid.pci.shadow_bus_range.subordinate_bus = sub_bus;
> }
> }
>
> - return dev_id;
> +out:
> + return hv_devid;
> }
>
> -/**
> - * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
> +/*
> + * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
> * @data: Describes the IRQ
> * @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
> *
> @@ -188,22 +189,23 @@ int hv_map_msi_interrupt(struct irq_data *data,
> {
> struct irq_cfg *cfg = irqd_cfg(data);
> struct hv_interrupt_entry dummy;
> - union hv_device_id device_id;
> + union hv_device_id hv_devid;
> struct msi_desc *msidesc;
> - struct pci_dev *dev;
> + struct pci_dev *pdev;
> int cpu;
>
> msidesc = irq_data_get_msi_desc(data);
> - dev = msi_desc_to_pci_dev(msidesc);
> - device_id = hv_build_pci_dev_id(dev);
> + pdev = msi_desc_to_pci_dev(msidesc);
> + hv_devid = hv_build_devid_type_pci(pdev);
> cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
>
> - return hv_map_interrupt(device_id, false, cpu, cfg->vector,
> + return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
> out_entry ? out_entry : &dummy);
> }
> EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
>
> -static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
> +static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
> + struct msi_msg *msg)
> {
> /* High address is always 0 */
> msg->address_hi = 0;
> @@ -211,17 +213,19 @@ static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi
> msg->data = entry->msi_entry.data.as_uint32;
> }
>
> -static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
> +static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> + struct hv_interrupt_entry *irq_entry);
> +
> static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> {
> struct hv_interrupt_entry *stored_entry;
> struct irq_cfg *cfg = irqd_cfg(data);
> struct msi_desc *msidesc;
> - struct pci_dev *dev;
> + struct pci_dev *pdev;
> int ret;
>
> msidesc = irq_data_get_msi_desc(data);
> - dev = msi_desc_to_pci_dev(msidesc);
> + pdev = msi_desc_to_pci_dev(msidesc);
>
> if (!cfg) {
> pr_debug("%s: cfg is NULL", __func__);
> @@ -240,7 +244,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> stored_entry = data->chip_data;
> data->chip_data = NULL;
>
> - ret = hv_unmap_msi_interrupt(dev, stored_entry);
> + ret = hv_unmap_msi_interrupt(pdev, stored_entry);
>
> kfree(stored_entry);
>
> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> }
>
> stored_entry = kzalloc(sizeof(*stored_entry), GFP_ATOMIC);
> - if (!stored_entry) {
> - pr_debug("%s: failed to allocate chip data\n", __func__);
> + if (!stored_entry)
> return;
> - }
>
> ret = hv_map_msi_interrupt(data, stored_entry);
> if (ret) {
> @@ -262,18 +264,21 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>
> data->chip_data = stored_entry;
> entry_to_msi_msg(data->chip_data, msg);
> -
> - return;
> }
>
> -static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
> +static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
> + struct hv_interrupt_entry *irq_entry)
> {
> - return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry);
> + union hv_device_id hv_devid;
> +
> + hv_devid = hv_build_devid_type_pci(pdev);
> + return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
> }
>
> -static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
> +/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
> +static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
> {
> - struct hv_interrupt_entry old_entry;
> + struct hv_interrupt_entry irq_entry;
> struct msi_msg msg;
>
> if (!irqd->chip_data) {
> @@ -281,13 +286,13 @@ static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
> return;
> }
>
> - old_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
> - entry_to_msi_msg(&old_entry, &msg);
> + irq_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
> + entry_to_msi_msg(&irq_entry, &msg);
>
> kfree(irqd->chip_data);
> irqd->chip_data = NULL;
>
> - (void)hv_unmap_msi_interrupt(dev, &old_entry);
> + (void)hv_unmap_msi_interrupt(pdev, &irq_entry);
> }
>
> /*
> @@ -302,7 +307,8 @@ static struct irq_chip hv_pci_msi_controller = {
> };
>
> static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> - struct irq_domain *real_parent, struct msi_domain_info *info)
> + struct irq_domain *real_parent,
> + struct msi_domain_info *info)
> {
> struct irq_chip *chip = info->chip;
>
> @@ -317,7 +323,8 @@ static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
> }
>
> #define HV_MSI_FLAGS_SUPPORTED (MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
> -#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
> +#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | \
> + MSI_FLAG_USE_DEF_CHIP_OPS)
>
> static struct msi_parent_ops hv_msi_parent_ops = {
> .supported_flags = HV_MSI_FLAGS_SUPPORTED,
> @@ -329,14 +336,13 @@ static struct msi_parent_ops hv_msi_parent_ops = {
> .init_dev_msi_info = hv_init_dev_msi_info,
> };
>
> -static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs,
> - void *arg)
> +static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq,
> + unsigned int nr_irqs, void *arg)
> {
> /*
> - * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e. everything except
> - * entry_to_msi_msg() should be in here.
> + * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e.
> + * everything except entry_to_msi_msg() should be in here.
> */
> -
> int ret;
>
> ret = irq_domain_alloc_irqs_parent(d, virq, nr_irqs, arg);
> @@ -344,13 +350,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
> return ret;
>
> for (int i = 0; i < nr_irqs; ++i) {
> - irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
> - handle_edge_irq, NULL, "edge");
> + irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
> + NULL, handle_edge_irq, NULL, "edge");
> }
> +
> return 0;
> }
>
> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> + unsigned int nr_irqs)
> {
> for (int i = 0; i < nr_irqs; ++i) {
> struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
> @@ -362,6 +370,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
>
> hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
> }
> +
> irq_domain_free_irqs_top(d, virq, nr_irqs);
> }
>
> @@ -394,25 +403,25 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
>
> int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
> {
> - union hv_device_id device_id;
> + union hv_device_id hv_devid;
>
> - device_id.as_uint64 = 0;
> - device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
> - device_id.ioapic.ioapic_id = (u8)ioapic_id;
> + hv_devid.as_uint64 = 0;
> + hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
> + hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
>
> - return hv_unmap_interrupt(device_id.as_uint64, entry);
> + return hv_unmap_interrupt(hv_devid.as_uint64, entry);
> }
> EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
>
> int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
> struct hv_interrupt_entry *entry)
> {
> - union hv_device_id device_id;
> + union hv_device_id hv_devid;
>
> - device_id.as_uint64 = 0;
> - device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
> - device_id.ioapic.ioapic_id = (u8)ioapic_id;
> + hv_devid.as_uint64 = 0;
> + hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
> + hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
>
> - return hv_map_interrupt(device_id, level, cpu, vector, entry);
> + return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
> }
> EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
> --
> 2.51.2.vfs.0.1
>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
* Re: [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
2026-01-20 19:08 ` kernel test robot
2026-01-20 21:09 ` kernel test robot
@ 2026-02-05 18:48 ` Anirudh Rayabharam
2 siblings, 0 replies; 69+ messages in thread
From: Anirudh Rayabharam @ 2026-02-05 18:48 UTC (permalink / raw)
To: Mukesh R
Cc: linux-kernel, linux-hyperv, linux-arm-kernel, iommu, linux-pci,
linux-arch, kys, haiyangz, wei.liu, decui, longli,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, joro,
lpieralisi, kwilczynski, mani, robh, bhelgaas, arnd, nunodasneves,
mhklinux, romank
On Mon, Jan 19, 2026 at 10:42:16PM -0800, Mukesh R wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com>
>
> This file actually implements irq remapping, so rename it to the more
> appropriate hyperv-irq.c. A new file named hyperv-iommu.c will be
> introduced later. Also, drop the CONFIG_IRQ_REMAP guard from the file
> and express the dependency via select IRQ_REMAP in Kconfig.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> ---
> MAINTAINERS | 2 +-
> drivers/iommu/Kconfig | 1 +
> drivers/iommu/Makefile | 2 +-
> drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 4 ----
> 4 files changed, 3 insertions(+), 6 deletions(-)
> rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5b11839cba9d..381a0e086382 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11741,7 +11741,7 @@ F: drivers/hid/hid-hyperv.c
> F: drivers/hv/
> F: drivers/infiniband/hw/mana/
> F: drivers/input/serio/hyperv-keyboard.c
> -F: drivers/iommu/hyperv-iommu.c
> +F: drivers/iommu/hyperv-irq.c
> F: drivers/net/ethernet/microsoft/
> F: drivers/net/hyperv/
> F: drivers/pci/controller/pci-hyperv-intf.c
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 99095645134f..b4cc2b42b338 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -355,6 +355,7 @@ config HYPERV_IOMMU
> bool "Hyper-V IRQ Handling"
> depends on HYPERV && X86
> select IOMMU_API
> + select IRQ_REMAP
> default HYPERV
> help
> Stub IOMMU driver to handle IRQs to support Hyper-V Linux
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 8e8843316c4b..598c39558e7d 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
> obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
> obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
> obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> -obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> +obj-$(CONFIG_HYPERV_IOMMU) += hyperv-irq.o
> obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
> obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
> similarity index 99%
> rename from drivers/iommu/hyperv-iommu.c
> rename to drivers/iommu/hyperv-irq.c
> index 0961ac805944..1944440a5004 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-irq.c
> @@ -24,8 +24,6 @@
>
> #include "irq_remapping.h"
>
> -#ifdef CONFIG_IRQ_REMAP
> -
> /*
> * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
> * Redirection Table. Hyper-V exposes one single IO-APIC and so define
> @@ -330,5 +328,3 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
> .alloc = hyperv_root_irq_remapping_alloc,
> .free = hyperv_root_irq_remapping_free,
> };
> -
> -#endif
> --
> 2.51.2.vfs.0.1
>
Thread overview: 69+ messages (newest: 2026-02-05 18:50 UTC)
2026-01-20 6:42 [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Mukesh R
2026-01-20 6:42 ` [PATCH v0 01/15] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c Mukesh R
2026-01-20 19:08 ` kernel test robot
2026-01-20 21:09 ` kernel test robot
2026-02-05 18:48 ` Anirudh Rayabharam
2026-01-20 6:42 ` [PATCH v0 02/15] x86/hyperv: cosmetic changes in irqdomain.c for readability Mukesh R
2026-02-05 18:47 ` Anirudh Rayabharam
2026-01-20 6:42 ` [PATCH v0 03/15] x86/hyperv: add insufficient memory support in irqdomain.c Mukesh R
2026-01-21 0:53 ` kernel test robot
2026-01-20 6:42 ` [PATCH v0 04/15] mshv: Provide a way to get partition id if running in a VMM process Mukesh R
2026-01-23 18:23 ` Nuno Das Neves
2026-01-20 6:42 ` [PATCH v0 05/15] mshv: Declarations and definitions for VFIO-MSHV bridge device Mukesh R
2026-01-23 18:25 ` Nuno Das Neves
2026-01-24 0:36 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 06/15] mshv: Implement mshv bridge device for VFIO Mukesh R
2026-01-20 16:09 ` Stanislav Kinsburskii
2026-01-23 18:32 ` Nuno Das Neves
2026-01-24 0:37 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 07/15] mshv: Add ioctl support for MSHV-VFIO bridge device Mukesh R
2026-01-20 16:13 ` Stanislav Kinsburskii
2026-01-20 6:42 ` [PATCH v0 08/15] PCI: hv: rename hv_compose_msi_msg to hv_vmbus_compose_msi_msg Mukesh R
2026-01-28 14:03 ` Manivannan Sadhasivam
2026-01-20 6:42 ` [PATCH v0 09/15] mshv: Import data structs around device domains and irq remapping Mukesh R
2026-01-20 22:17 ` Stanislav Kinsburskii
2026-01-24 0:38 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 10/15] PCI: hv: Build device id for a VMBus device Mukesh R
2026-01-20 22:22 ` Stanislav Kinsburskii
2026-01-24 0:42 ` Mukesh R
2026-01-26 20:50 ` Stanislav Kinsburskii
2026-01-28 14:36 ` Manivannan Sadhasivam
2026-01-20 6:42 ` [PATCH v0 11/15] x86/hyperv: Build logical device ids for PCI passthru hcalls Mukesh R
2026-01-20 22:27 ` Stanislav Kinsburskii
2026-01-24 0:44 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 12/15] x86/hyperv: Implement hyperv virtual iommu Mukesh R
2026-01-21 0:12 ` Stanislav Kinsburskii
2026-01-24 1:26 ` Mukesh R
2026-01-26 15:57 ` Stanislav Kinsburskii
2026-01-27 3:02 ` Mukesh R
2026-01-27 18:46 ` Stanislav Kinsburskii
2026-01-30 22:51 ` Mukesh R
2026-02-02 16:20 ` Stanislav Kinsburskii
2026-01-22 5:18 ` Jacob Pan
2026-01-24 2:01 ` Mukesh R
2026-01-27 19:21 ` Jacob Pan
2026-01-27 22:31 ` Jacob Pan
2026-01-30 22:10 ` Mukesh R
2026-01-30 23:44 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 13/15] x86/hyperv: Basic interrupt support for direct attached devices Mukesh R
2026-01-21 0:47 ` Stanislav Kinsburskii
2026-01-24 2:08 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 14/15] mshv: Remove mapping of mmio space during map user ioctl Mukesh R
2026-01-21 1:41 ` Stanislav Kinsburskii
2026-01-23 18:34 ` Nuno Das Neves
2026-01-24 2:12 ` Mukesh R
2026-01-20 6:42 ` [PATCH v0 15/15] mshv: Populate mmio mappings for PCI passthru Mukesh R
2026-01-20 19:52 ` kernel test robot
2026-01-21 1:53 ` Stanislav Kinsburskii
2026-01-24 2:19 ` Mukesh R
2026-01-26 18:15 ` Stanislav Kinsburskii
2026-01-27 3:07 ` Mukesh R
2026-01-27 18:57 ` Stanislav Kinsburskii
2026-01-30 22:17 ` Mukesh R
2026-02-02 16:30 ` Stanislav Kinsburskii
2026-02-04 22:52 ` Mukesh R
2026-02-05 16:28 ` Stanislav Kinsburskii
2026-02-05 17:57 ` Mukesh R
2026-02-05 18:31 ` Stanislav Kinsburskii
2026-01-20 21:50 ` [PATCH v0 00/15] PCI passthru on Hyper-V (Part I) Jacob Pan
2026-01-24 2:27 ` Mukesh R