* [PATCH V1 1/3] mshv: Import declarations for irq remap and add irqbypass support
From: Mukesh R @ 2026-05-12 2:12 UTC
To: hpa, robin.murphy, robh, linux-hyperv, linux-kernel, iommu,
linux-pci, linux-arch
In-Reply-To: <20260512021242.1679786-1-mrathor@linux.microsoft.com>
For the irq map/remap hypercalls, copy relevant data structures from
hypervisor public headers into Linux equivalents. Also, update Kconfig and
mshv_irqfd for irqbypass. Please note, irqbypass is required for doing
passthru on MSHV. This is because there is really no way of knowing the
Linux irq in the mshv_irqfd_assign and mshv_irqfd_update paths without it.
The Linux irq is set up upfront by VFIO before irqfd assign/update happens.
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
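The reasoning above (the device side knows the Linux irq first, and the
irqfd side learns it later through irqbypass) can be sketched as a small
userspace model. This is only an illustration under stated assumptions: all
model_* names are hypothetical and are not the kernel irqbypass API, and the
model only covers the ordering described in the commit message, where the
producer registers before the consumer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these names are NOT kernel API. */
struct model_producer {
	const void *token;	/* shared key, e.g. an eventfd context */
	int irq;		/* Linux irq already set up by the device side */
};

struct model_consumer {
	const void *token;	/* same key, supplied at irqfd assign time */
	int irq;		/* learned from the matching producer, -1 if none */
};

#define MODEL_MAX_PRODUCERS 8

struct model_manager {
	struct model_producer producers[MODEL_MAX_PRODUCERS];
	size_t nr_producers;
};

/* Device side registers first: it is the only party that knows the irq. */
static int model_register_producer(struct model_manager *m,
				   const void *token, int irq)
{
	assert(token != NULL);

	if (m->nr_producers == MODEL_MAX_PRODUCERS)
		return -1;

	m->producers[m->nr_producers].token = token;
	m->producers[m->nr_producers].irq = irq;
	m->nr_producers++;
	return 0;
}

/* irqfd side registers later and learns the irq by matching the token. */
static int model_register_consumer(const struct model_manager *m,
				   struct model_consumer *c,
				   const void *token)
{
	size_t i;

	c->token = token;
	c->irq = -1;

	for (i = 0; i < m->nr_producers; i++) {
		if (m->producers[i].token == token) {
			c->irq = m->producers[i].irq;
			return 0;
		}
	}
	return -1;	/* no producer with this token: irq stays unknown */
}
```

In this model, a consumer whose token has no matching producer is left with
irq == -1, which mirrors the point that the irqfd paths cannot discover the
Linux irq on their own.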
drivers/hv/Kconfig | 1 +
drivers/hv/mshv_eventfd.h | 3 +++
include/hyperv/hvgdk_mini.h | 3 +++
include/hyperv/hvhdk.h | 17 +++++++++++++++++
4 files changed, 24 insertions(+)
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 7937ac0cbd0f..c831fe25ca2b 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -75,6 +75,7 @@ config MSHV_ROOT
# no particular order, making it impossible to reassemble larger pages
depends on PAGE_SIZE_4KB
select EVENTFD
+ select IRQ_BYPASS_MANAGER
select VIRT_XFER_TO_GUEST_WORK
select HMM_MIRROR
select MMU_NOTIFIER
diff --git a/drivers/hv/mshv_eventfd.h b/drivers/hv/mshv_eventfd.h
index 464c6b81ab33..ff4dd24b8ad4 100644
--- a/drivers/hv/mshv_eventfd.h
+++ b/drivers/hv/mshv_eventfd.h
@@ -9,6 +9,7 @@
#define __LINUX_MSHV_EVENTFD_H
#include <linux/poll.h>
+#include <linux/irqbypass.h>
#include "mshv.h"
#include "mshv_root.h"
@@ -37,6 +38,8 @@ struct mshv_irqfd {
struct mshv_irqfd_resampler *irqfd_resampler;
struct eventfd_ctx *irqfd_resamplefd;
struct hlist_node irqfd_resampler_hnode;
+ struct irq_bypass_consumer irqfd_bypass_cons;
+ struct irq_bypass_producer *irqfd_bypass_prod;
};
void mshv_eventfd_init(struct mshv_partition *partition);
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index da622fb06440..1ef480825705 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -59,6 +59,8 @@ struct hv_u128 {
#define HV_PARTITION_ID_INVALID ((u64)0)
#define HV_PARTITION_ID_SELF ((u64)-1)
+#define HV_MAX_VPS 256 /* HV_MAXIMUM_PROCESSORS */
+
/* Hyper-V specific model specific registers (MSRs) */
#if defined(CONFIG_X86)
@@ -508,6 +510,7 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
#define HVCALL_UNMAP_VP_STATE_PAGE 0x00e2
#define HVCALL_GET_VP_STATE 0x00e3
#define HVCALL_SET_VP_STATE 0x00e4
+#define HVCALL_GET_VPSET_FROM_MDA 0x00e5
#define HVCALL_GET_VP_CPUID_VALUES 0x00f4
#define HVCALL_GET_PARTITION_PROPERTY_EX 0x0101
#define HVCALL_MMIO_READ 0x0106
diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 5e83d3714966..d0a892347ab1 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -952,4 +952,21 @@ struct hv_input_modify_sparse_spa_page_host_access {
#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE 0x4
#define HV_MODIFY_SPA_PAGE_HOST_ACCESS_HUGE_PAGE 0x8
+#ifdef CONFIG_X86
+
+struct hv_input_get_vp_set_from_mda { /* HV_INPUT_GET_VP_SET_FROM_MDA */
+ u64 target_partid;
+ u64 dest_address;
+ u8 input_vtl;
+ u8 destmode_logical; /* true => mode is logical */
+ u16 reserved0; /* mbz */
+ u32 reserved1; /* mbz */
+} __packed;
+
+union hv_output_get_vp_set_from_mda { /* HV_OUTPUT_GET_VP_SET_FROM_MDA */
+ struct hv_vpset target_vpset;
+ u64 bitset_buffer[HV_GENERIC_SET_QWORD_COUNT(HV_MAX_VPS)];
+} __packed;
+
+#endif /* CONFIG_X86 */
#endif /* _HV_HVHDK_H */
--
2.51.2.vfs.0.1
* [PATCH V1 0/3] PCI passthru on Hyper-V (Part II)
From: Mukesh R @ 2026-05-12 2:12 UTC
To: hpa, robin.murphy, robh, linux-hyperv, linux-kernel, iommu,
linux-pci, linux-arch
This patch series implements the interrupt remapping part of the PCI
passthru feature on Hyper-V when Linux is running as a privileged VM.
These patches complement Part I of the feature at:
https://lore.kernel.org/linux-hyperv/20260512020259.1678627-1-mrathor@linux.microsoft.com/T/#t
Testing and other details are listed there.
Changes in V1:
o rebase to above V3 of Part I
o check for NULL irqdata->parent_data->chip before calling
irq_chip_unmask_parent().
Thanks,
-Mukesh
Mukesh R (3):
mshv: Import declarations for irq remap and add irqbypass support
hyperv: Implement irq remap for passthru devices
mshv: Implement guest irq migration for passthru devices
arch/x86/hyperv/irqdomain.c | 18 +-
drivers/hv/Kconfig | 1 +
drivers/hv/mshv_eventfd.c | 501 +++++++++++++++++++++++++++-
drivers/hv/mshv_eventfd.h | 3 +
drivers/iommu/hyperv-iommu-root.c | 14 +
drivers/pci/controller/pci-hyperv.c | 10 +
include/asm-generic/mshyperv.h | 3 +
include/hyperv/hvgdk_mini.h | 3 +
include/hyperv/hvhdk.h | 17 +
9 files changed, 564 insertions(+), 6 deletions(-)
--
2.51.2.vfs.0.1
* [PATCH V3 10/11] mshv: Populate mmio mappings for PCI passthru
From: Mukesh R @ 2026-05-12 2:02 UTC
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Upon guest access, if the mmio mapping is missing, the hypervisor
generates an unmapped gpa intercept. In this path, look up the PCI
resource pfn for the guest gpa, and ask the hypervisor to map it
via hypercall. The PCI resource pfn is maintained by the VFIO driver,
and obtained via a fixup_user_fault call (similar to KVM).
Also, VFIO no longer puts the mmio pfn in vma->vm_pgoff. So, remove
the code that uses it to map mmio space. It is broken and will cause
a panic.
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
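The address arithmetic in the intercept path can be sketched in isolation
as follows. This is a userspace model, not the driver code: the model_*
types and names are hypothetical, the 4 KiB page shift is an assumption,
and the rewind to the region start only makes sense if the mmio range
backing the region is physically contiguous, which is what the default
(full region) case in the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB hypervisor pages */

/* Hypothetical stand-in for the fields of a memory region used here. */
struct model_region {
	uint64_t start_gfn;	/* first guest frame number of the region */
	uint64_t nr_pages;	/* region length in pages */
	uint64_t start_uaddr;	/* VMM userspace address backing start_gfn */
};

/* What would be handed to the map hypercall. */
struct model_map_req {
	uint64_t gfn;
	uint64_t pfn;
	uint64_t numpgs;
};

/* Translate the faulting gfn into the VMM address used for pfn lookup. */
static uint64_t model_gfn_to_uaddr(const struct model_region *rg, uint64_t gfn)
{
	assert(gfn >= rg->start_gfn && gfn - rg->start_gfn < rg->nr_pages);
	return rg->start_uaddr + ((gfn - rg->start_gfn) << MODEL_PAGE_SHIFT);
}

/*
 * Build the map request. When mapping the full region, rewind both the gfn
 * and the pfn by the fault offset so the request starts at the region base.
 */
static struct model_map_req
model_build_map_req(const struct model_region *rg, uint64_t fault_gfn,
		    uint64_t fault_pfn, bool map_full_region)
{
	struct model_map_req req = { fault_gfn, fault_pfn, 1 };

	if (map_full_region) {
		req.pfn = fault_pfn - (fault_gfn - rg->start_gfn);
		req.gfn = rg->start_gfn;
		req.numpgs = rg->nr_pages;
	}
	return req;
}
```

With map_full_region set to false, the request covers only the faulting
page, which corresponds to setting hv_nofull_mmio in the patch.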
drivers/hv/mshv_root_main.c | 113 ++++++++++++++++++++++++++++++------
1 file changed, 96 insertions(+), 17 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 6ceb5f608589..a7864463961b 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -46,6 +46,9 @@ MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
#define HV_VP_COUNTER_ROOT_DISPATCH_THREAD_BLOCKED 95
#endif
+static bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
+module_param(hv_nofull_mmio, bool, 0644);
+
struct mshv_root mshv_root;
enum hv_scheduler_type hv_scheduler_type;
@@ -641,6 +644,94 @@ mshv_partition_region_by_gfn_get(struct mshv_partition *p, u64 gfn)
return region;
}
+/*
+ * Check if uaddr is for mmio range. If yes, return 0 with mmio_pfn filled in
+ * else just return -errno.
+ */
+static int mshv_chk_get_mmio_start_pfn(u64 uaddr, u64 *mmio_pfnp)
+{
+ struct vm_area_struct *vma;
+ bool is_mmio;
+ struct follow_pfnmap_args pfnmap_args;
+ int rc = -EINVAL;
+
+ mmap_read_lock(current->mm);
+ vma = vma_lookup(current->mm, uaddr);
+ is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
+ if (!is_mmio)
+ goto unlock_mmap_out;
+
+ pfnmap_args.vma = vma;
+ pfnmap_args.address = uaddr;
+
+ rc = follow_pfnmap_start(&pfnmap_args);
+ if (rc) {
+ rc = fixup_user_fault(current->mm, uaddr, FAULT_FLAG_WRITE,
+ NULL);
+ if (rc)
+ goto unlock_mmap_out;
+
+ rc = follow_pfnmap_start(&pfnmap_args);
+ if (rc)
+ goto unlock_mmap_out;
+ }
+
+ *mmio_pfnp = pfnmap_args.pfn;
+ follow_pfnmap_end(&pfnmap_args);
+
+unlock_mmap_out:
+ mmap_read_unlock(current->mm);
+ return rc;
+}
+
+/*
+ * Check if the unmapped gpa belongs to mmio space. If yes, resolve it.
+ *
+ * Returns: True if valid mmio intercept and handled, else false.
+ */
+static bool mshv_handle_unmapped_gpa(struct mshv_vp *vp)
+{
+ struct hv_message *hvmsg = vp->vp_intercept_msg_page;
+ u64 gfn, uaddr, mmio_spa, numpgs;
+ struct mshv_mem_region *rg;
+ int rc = -EINVAL;
+ struct mshv_partition *pt = vp->vp_partition;
+#if defined(CONFIG_X86_64)
+ struct hv_x64_memory_intercept_message *msg =
+ (struct hv_x64_memory_intercept_message *)hvmsg->u.payload;
+#elif defined(CONFIG_ARM64)
+ struct hv_arm64_memory_intercept_message *msg =
+ (struct hv_arm64_memory_intercept_message *)hvmsg->u.payload;
+#endif
+
+ gfn = msg->guest_physical_address >> HV_HYP_PAGE_SHIFT;
+
+ rg = mshv_partition_region_by_gfn_get(pt, gfn);
+ if (rg == NULL)
+ return false;
+ if (rg->mreg_type != MSHV_REGION_TYPE_MMIO)
+ goto put_rg_out;
+
+ uaddr = rg->start_uaddr + ((gfn - rg->start_gfn) << HV_HYP_PAGE_SHIFT);
+
+ rc = mshv_chk_get_mmio_start_pfn(uaddr, &mmio_spa);
+ if (rc)
+ goto put_rg_out;
+
+ if (!hv_nofull_mmio) { /* default case */
+ mmio_spa = mmio_spa - (gfn - rg->start_gfn);
+ gfn = rg->start_gfn;
+ numpgs = rg->nr_pages;
+ } else
+ numpgs = 1;
+
+ rc = hv_call_map_mmio_pages(pt->pt_id, gfn, mmio_spa, numpgs);
+
+put_rg_out:
+ mshv_region_put(rg);
+ return rc == 0;
+}
+
/**
* mshv_handle_gpa_intercept - Handle GPA (Guest Physical Address) intercepts.
* @vp: Pointer to the virtual processor structure.
@@ -699,6 +790,8 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
static bool mshv_vp_handle_intercept(struct mshv_vp *vp)
{
switch (vp->vp_intercept_msg_page->header.message_type) {
+ case HVMSG_UNMAPPED_GPA:
+ return mshv_handle_unmapped_gpa(vp);
case HVMSG_GPA_INTERCEPT:
return mshv_handle_gpa_intercept(vp);
}
@@ -1322,16 +1415,8 @@ static int mshv_prepare_pinned_region(struct mshv_mem_region *region)
}
/*
- * This maps two things: guest RAM and for pci passthru mmio space.
- *
- * mmio:
- * - vfio overloads vm_pgoff to store the mmio start pfn/spa.
- * - Two things need to happen for mapping mmio range:
- * 1. mapped in the uaddr so VMM can access it.
- * 2. mapped in the hwpt (gfn <-> mmio phys addr) so guest can access it.
- *
- * This function takes care of the second. The first one is managed by vfio,
- * and hence is taken care of via vfio_pci_mmap_fault().
+ * This is called for both user ram and mmio space. The mmio space is not
+ * mapped here, but later during intercept on demand.
*/
static long
mshv_map_user_memory(struct mshv_partition *partition,
@@ -1340,7 +1425,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
struct mshv_mem_region *region;
struct vm_area_struct *vma;
bool is_mmio;
- ulong mmio_pfn;
long ret;
if (mem->flags & BIT(MSHV_SET_MEM_BIT_UNMAP) ||
@@ -1350,7 +1434,6 @@ mshv_map_user_memory(struct mshv_partition *partition,
mmap_read_lock(current->mm);
vma = vma_lookup(current->mm, mem->userspace_addr);
is_mmio = vma ? !!(vma->vm_flags & (VM_IO | VM_PFNMAP)) : 0;
- mmio_pfn = is_mmio ? vma->vm_pgoff : 0;
mmap_read_unlock(current->mm);
if (!vma)
@@ -1376,11 +1459,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
region->nr_pages,
HV_MAP_GPA_NO_ACCESS, NULL);
break;
- case MSHV_REGION_TYPE_MMIO:
- ret = hv_call_map_mmio_pages(partition->pt_id,
- region->start_gfn,
- mmio_pfn,
- region->nr_pages);
+ default:
break;
}
--
2.51.2.vfs.0.1
* [PATCH V3 11/11] mshv: Mark mem regions as non-movable upfront if device passthru
From: Mukesh R @ 2026-05-12 2:02 UTC
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
If a VM is started with a device attached, the mem regions must be marked
non-movable, as the device attach hypercall right away allows the use of
SLAT for IOMMU. Marking them non-movable forces mapping of the entire
guest RAM in the SLAT at the time of region creation, with the region
pinned. Also, because a device could be dynamically attached much later
in the life of a VM, create a module parameter (hv_no_movbl_pgs) to
disable movable pages that users can set if they anticipate such an action.
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
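The region type decision described above can be sketched as a pure
function. This is a userspace model under assumptions: the model_* names
and the enum values are hypothetical, and the final fallback to a movable
type is assumed from context rather than copied from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical region types for this model only. */
enum model_region_type {
	MODEL_REGION_MMIO,
	MODEL_REGION_MEM_PINNED,
	MODEL_REGION_MEM_MOVABLE,	/* assumed fallback type */
};

struct model_partition {
	bool regions_pinned;	/* set once a passthru device is created */
	bool encrypted;		/* encrypted partitions are always pinned */
};

/* Mirrors the three conditions that force pinning in the patch. */
static bool model_must_pin(const struct model_partition *pt,
			   bool no_movable_param)
{
	assert(pt != NULL);
	return pt->regions_pinned || pt->encrypted || no_movable_param;
}

/*
 * Pick the type for a new region. movable_init_ok models whether setting
 * up a movable region succeeded; on failure the region falls back to
 * pinned.
 */
static enum model_region_type
model_pick_region_type(const struct model_partition *pt, bool is_mmio,
		       bool no_movable_param, bool movable_init_ok)
{
	if (is_mmio)
		return MODEL_REGION_MMIO;
	if (model_must_pin(pt, no_movable_param) || !movable_init_ok)
		return MODEL_REGION_MEM_PINNED;
	return MODEL_REGION_MEM_MOVABLE;
}
```

Note that in this model only regions created after regions_pinned is set
are affected, which is why the commit message suggests the parameter for
users who expect to attach a device later.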
drivers/hv/mshv_root.h | 1 +
drivers/hv/mshv_root_main.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index b9880d0bdc4d..d57c26950203 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -141,6 +141,7 @@ struct mshv_partition {
pid_t pt_vmm_tgid;
bool import_completed;
bool pt_initialized;
+ bool pt_regions_pinned;
#if IS_ENABLED(CONFIG_DEBUG_FS)
struct dentry *pt_stats_dentry;
struct dentry *pt_vp_dentry;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index a7864463961b..ac71534733bd 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -49,6 +49,10 @@ MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
static bool hv_nofull_mmio; /* don't map entire mmio region upon fault */
module_param(hv_nofull_mmio, bool, 0644);
+static bool hv_no_movbl_pgs; /* disable movable pages completely */
+module_param(hv_no_movbl_pgs, bool, 0644);
+MODULE_PARM_DESC(hv_no_movbl_pgs, "If set, don't do movable pages for VMs");
+
struct mshv_root mshv_root;
enum hv_scheduler_type hv_scheduler_type;
@@ -1303,6 +1307,12 @@ static void mshv_async_hvcall_handler(void *data, u64 *status)
*status = partition->async_hypercall_status;
}
+static bool mshv_do_pt_regions_pinned(struct mshv_partition *pt)
+{
+ return pt->pt_regions_pinned || mshv_partition_encrypted(pt) ||
+ hv_no_movbl_pgs;
+}
+
/*
* NB: caller checks and makes sure mem->size is page aligned
* Returns: 0 with regionpp updated on success, or -errno
@@ -1333,7 +1343,7 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
if (is_mmio)
rg->mreg_type = MSHV_REGION_TYPE_MMIO;
- else if (mshv_partition_encrypted(partition) ||
+ else if (mshv_do_pt_regions_pinned(partition) ||
!mshv_region_movable_init(rg))
rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
else
@@ -1808,6 +1818,9 @@ static long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
if (copy_to_user(uarg, &devargk, sizeof(devargk)))
return -EFAULT; /* cleanup in mshv_device_fop_release() */
+ /* For now, all regions must be pinned if there is device passthru. */
+ partition->pt_regions_pinned = true;
+
return 0;
undo_out:
--
2.51.2.vfs.0.1
* [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: Mukesh R @ 2026-05-12 2:02 UTC
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Add a new file to implement management of device domains, mapping and
unmapping of IOMMU memory, and other iommu_ops to fit within the VFIO
framework for PCI passthru on Hyper-V running Linux as baremetal root
or L1VH root. This also implements the direct attach mechanism (see
below), a special feature of Hyper-V for PCI passthru, and it is also
made to work within the VFIO framework.
At a high level, during boot the hypervisor creates a default identity
domain and attaches all devices to it. This maps nicely to the Linux IOMMU
subsystem's IOMMU_DOMAIN_IDENTITY domain. As a result, Linux does not
need to explicitly ask Hyper-V to attach devices and do maps/unmaps
during boot. As mentioned previously, Hyper-V supports two ways to do
PCI passthru:
1. Device Domain (aka Domain Attach): root must create a device domain
in the hypervisor, and do map/unmap hypercalls for mapping and
unmapping guest RAM for DMA. All hypervisor communications use
device ID of type PCI for identifying and referencing the device.
2. Direct Attach: the hypervisor will simply use the guest's HW
page table for mappings, thus the root need not map/unmap guest
memory for DMA. As such, direct attach passthru setup during guest
boot is extremely fast. A direct attached device must always be
referenced via logical device ID and not via the PCI device ID.
At present, L1VH root only supports direct attaches. Also, direct attach is
the default in non-L1VH cases because the domain attach implementation
currently has significant performance issues for guests with more RAM
(say more than 8GB), and that unfortunately cannot be addressed in the
short term.
Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
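The choice between the two attach mechanisms described above can be
sketched as a small decision function. This is a userspace model under
assumptions: the model_* names are hypothetical, and the "rejected" outcome
folds together the two places where the driver refuses an L1VH request
(domain allocation and device attach) into a single result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of choosing an attach mechanism. */
enum model_attach_mode {
	MODEL_ATTACH_DIRECT,	/* hypervisor reuses the guest HW page table */
	MODEL_ATTACH_DOMAIN,	/* root creates a domain and maps/unmaps */
	MODEL_ATTACH_REJECTED,	/* not supported in this configuration */
};

struct model_attach_ctx {
	bool owner_is_vmm;	/* domain is being created for a guest VM */
	bool no_attdev;		/* models the hv_no_attdev boot option */
	bool is_l1vh;		/* running as L1VH root */
};

static enum model_attach_mode
model_pick_attach_mode(const struct model_attach_ctx *ctx)
{
	assert(ctx != NULL);

	/* Guests get direct attach by default, as it is much faster. */
	if (ctx->owner_is_vmm && !ctx->no_attdev)
		return MODEL_ATTACH_DIRECT;

	/* L1VH only supports direct attach, so anything else is refused. */
	if (ctx->is_l1vh)
		return MODEL_ATTACH_REJECTED;

	/* Host apps, or guests with direct attach disabled. */
	return MODEL_ATTACH_DOMAIN;
}
```

For example, a host application on a baremetal root ends up with domain
attach, while the same request on L1VH root is refused.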
MAINTAINERS | 1 +
arch/x86/kernel/pci-dma.c | 2 +
drivers/iommu/Kconfig | 5 +-
drivers/iommu/Makefile | 1 +
drivers/iommu/hyperv-iommu-root.c | 918 ++++++++++++++++++++++++++++++
include/asm-generic/mshyperv.h | 17 +
include/linux/hyperv.h | 6 +
7 files changed, 947 insertions(+), 3 deletions(-)
create mode 100644 drivers/iommu/hyperv-iommu-root.c
diff --git a/MAINTAINERS b/MAINTAINERS
index f803a6a38fee..8ae040b89a56 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11914,6 +11914,7 @@ F: drivers/clocksource/hyperv_timer.c
F: drivers/hid/hid-hyperv.c
F: drivers/hv/
F: drivers/input/serio/hyperv-keyboard.c
+F: drivers/iommu/hyperv-iommu-root.c
F: drivers/iommu/hyperv-irq.c
F: drivers/net/ethernet/microsoft/
F: drivers/net/hyperv/
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 6267363e0189..cfeee6505e17 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -8,6 +8,7 @@
#include <linux/gfp.h>
#include <linux/pci.h>
#include <linux/amd-iommu.h>
+#include <linux/hyperv.h>
#include <asm/proto.h>
#include <asm/dma.h>
@@ -105,6 +106,7 @@ void __init pci_iommu_alloc(void)
gart_iommu_hole_init();
amd_iommu_detect();
detect_intel_iommu();
+ hv_iommu_detect();
swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
}
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index f86262b11416..7909cf4373a6 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -352,13 +352,12 @@ config MTK_IOMMU_V1
if unsure, say N here.
config HYPERV_IOMMU
- bool "Hyper-V IRQ Handling"
+ bool "Hyper-V IOMMU Unit"
depends on HYPERV && X86
select IOMMU_API
default HYPERV
help
- Stub IOMMU driver to handle IRQs to support Hyper-V Linux
- guest and root partitions.
+ Hyper-V pseudo IOMMU unit.
config VIRTIO_IOMMU
tristate "Virtio IOMMU driver"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 335ea77cced6..296fbc6ca829 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
obj-$(CONFIG_HYPERV) += hyperv-irq.o
+obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu-root.o
obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
new file mode 100644
index 000000000000..a2e0f6cc78e6
--- /dev/null
+++ b/drivers/iommu/hyperv-iommu-root.c
@@ -0,0 +1,918 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Hyper-V root vIOMMU driver.
+ * Copyright (C) 2026, Microsoft, Inc.
+ */
+
+#include <linux/pci.h>
+#include <linux/dma-map-ops.h>
+#include <linux/interval_tree.h>
+#include <linux/hyperv.h>
+#include "dma-iommu.h"
+#include <asm/iommu.h>
+#include <asm/mshyperv.h>
+
+/* We will not claim these PCI devices, eg hypervisor needs it for debugger */
+static char *pci_devs_to_skip;
+static int __init hv_iommu_setup_skip(char *str)
+{
+ pci_devs_to_skip = str;
+
+ return 0;
+}
+/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
+__setup("hv_iommu_skip=", hv_iommu_setup_skip);
+
+bool hv_no_attdev; /* disable direct device attach for passthru */
+EXPORT_SYMBOL_GPL(hv_no_attdev);
+static int __init setup_hv_no_attdev(char *str)
+{
+ hv_no_attdev = true;
+ return 0;
+}
+__setup("hv_no_attdev", setup_hv_no_attdev);
+
+/* Iommu device that we export to the world. HyperV supports max of one */
+static struct iommu_device hv_virt_iommu;
+
+struct hv_domain {
+ struct iommu_domain iommu_dom;
+ u32 domid_num; /* as opposed to domain_id.type */
+ bool attached_dom; /* is this direct attached dom? */
+ u64 partid; /* partition id */
+ spinlock_t mappings_lock; /* protects mappings_tree */
+ struct rb_root_cached mappings_tree; /* iova to pa lookup tree */
+};
+
+#define to_hv_domain(d) container_of(d, struct hv_domain, iommu_dom)
+
+struct hv_iommu_mapping {
+ phys_addr_t paddr;
+ struct interval_tree_node iova;
+ u32 flags;
+};
+
+/*
+ * By default, during boot the hypervisor creates one Stage 2 (S2) default
+ * domain. Stage 2 means that the page table is controlled by the hypervisor.
+ * It has two types:
+ * S2 default: access to entire root partition memory. This for us easily
+ * maps to IOMMU_DOMAIN_IDENTITY in the iommu subsystem, and
+ * is called HV_DEVICE_DOMAIN_ID_S2_DEFAULT in the hypervisor.
+ * S2 NULL: Blocks everything except RMRR
+ *
+ * Device Management:
+ * There are two ways to manage device attaches to domains:
+ * 1. Domain Attach: A device domain is created in the hypervisor, the
+ * device is attached to this domain, and then memory
+ * ranges are mapped in the map callbacks.
+ * 2. Direct Attach: No need to create a domain in the hypervisor for direct
+ * attached devices. A hypercall is made to tell the
+ * hypervisor to attach the device to a guest. There is
+ * no need for explicit memory mappings because the
+ * hypervisor will just use the guest HW page table.
+ *
+ * Since a direct attach is much faster, it is the default. This can be
+ * changed via hv_no_attdev.
+ *
+ * L1VH: hypervisor only supports direct attach. Also, there is no S2 default
+ * in the hypervisor, so no explicit attach to S2 needed.
+ */
+
+/*
+ * Create dummy domains to correspond to hypervisor prebuilt default identity
+ * and null domains (dummy because we do not make hypercalls to create them).
+ */
+static struct hv_domain hv_def_identity_dom;
+static struct hv_domain hv_null_dom;
+
+static bool hv_special_domain(struct hv_domain *hvdom)
+{
+ return hvdom == &hv_def_identity_dom || hvdom == &hv_null_dom;
+}
+
+struct iommu_domain_geometry default_geometry = (struct iommu_domain_geometry) {
+ .aperture_start = 0,
+ .aperture_end = -1UL,
+ .force_aperture = true,
+};
+
+#define HV_IOMMU_PGSIZES SZ_4K /* for now, to be enhanced */
+
+static u32 unique_id; /* unique numeric id of a new domain */
+
+static void hv_iommu_detach_dev(struct hv_domain *hvdom, struct device *dev);
+static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather);
+
+/*
+ * If the current thread is a VMM thread, return the partition id of the VM it
+ * is managing, else return HV_PARTITION_ID_INVALID.
+ */
+u64 hv_get_current_partid(void)
+{
+ u64 (*fn)(void);
+ u64 ptid;
+
+ fn = symbol_get(mshv_current_partid);
+ if (!fn)
+ return HV_PARTITION_ID_INVALID;
+
+ ptid = fn();
+ symbol_put(mshv_current_partid);
+
+ return ptid;
+}
+EXPORT_SYMBOL_GPL(hv_get_current_partid);
+
+/* If this is a VMM thread, then this domain is for a guest vm */
+static bool hv_curr_thread_is_vmm(void)
+{
+ return hv_get_current_partid() != HV_PARTITION_ID_INVALID;
+}
+
+/* As opposed to some host app like SPDK etc... */
+static bool hv_dom_owner_is_vmm(struct hv_domain *hvdom)
+{
+ return hvdom && hvdom->partid != HV_PARTITION_ID_INVALID;
+}
+
+static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+ switch (cap) {
+ case IOMMU_CAP_CACHE_COHERENCY:
+ return true;
+ default:
+ return false;
+ }
+}
+
+/*
+ * Check if given pci device is a direct attached device. Caller must have
+ * verified pdev is a valid pci device.
+ */
+bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
+{
+ struct iommu_domain *iommu_domain;
+ struct hv_domain *hvdom;
+ struct device *dev = &pdev->dev;
+
+ iommu_domain = iommu_get_domain_for_dev(dev);
+ if (iommu_domain) {
+ hvdom = to_hv_domain(iommu_domain);
+ return hvdom->attached_dom;
+ }
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(hv_pcidev_is_attached_dev);
+
+bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct hv_domain *hvdom = dev_iommu_priv_get(dev);
+
+ if (hvdom && !hv_special_domain(hvdom))
+ return true;
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(hv_pcidev_is_pthru_dev);
+
+/* Build device id for direct attached devices */
+static u64 hv_build_devid_type_logical(struct pci_dev *pdev)
+{
+ hv_pci_segment segment;
+ union hv_device_id hv_devid;
+ union hv_pci_bdf bdf = {.as_uint16 = 0};
+ u32 rid = PCI_DEVID(pdev->bus->number, pdev->devfn);
+
+ segment = pci_domain_nr(pdev->bus);
+ bdf.bus = PCI_BUS_NUM(rid);
+ bdf.device = PCI_SLOT(rid);
+ bdf.function = PCI_FUNC(rid);
+
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_LOGICAL;
+ hv_devid.logical.id = (u64)segment << 16 | bdf.as_uint16;
+
+ return hv_devid.as_uint64;
+}
+
+u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type)
+{
+ if (type == HV_DEVICE_TYPE_LOGICAL) {
+ if (hv_l1vh_partition())
+ return hv_pci_vmbus_device_id(pdev);
+ else
+ return hv_build_devid_type_logical(pdev);
+ } else if (type == HV_DEVICE_TYPE_PCI)
+#ifdef CONFIG_X86
+ return hv_build_devid_type_pci(pdev);
+#else
+ return 0;
+#endif
+ return 0;
+}
+EXPORT_SYMBOL_GPL(hv_build_devid_oftype);
+
+/* Create a new device domain in the hypervisor */
+static int hv_iommu_create_hyp_devdom(struct hv_domain *hvdom)
+{
+ u64 status;
+ struct hv_input_device_domain *ddp;
+ struct hv_input_create_device_domain *input;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ ddp = &input->device_domain;
+ ddp->partition_id = HV_PARTITION_ID_SELF;
+ ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ ddp->domain_id.id = hvdom->domid_num;
+
+ input->create_device_domain_flags.forward_progress_required = 1;
+ input->create_device_domain_flags.inherit_owning_vtl = 0;
+
+ status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
+
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return hv_result_to_errno(status);
+}
+
+static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
+{
+ struct hv_domain *hvdom;
+ int rc;
+
+ if (hv_l1vh_partition() && !hv_curr_thread_is_vmm()) {
+ pr_err("Hyper-V: l1vh iommu does not support host devices\n");
+ return NULL;
+ }
+
+ hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
+ if (hvdom == NULL)
+ return NULL;
+
+ spin_lock_init(&hvdom->mappings_lock);
+ hvdom->mappings_tree = RB_ROOT_CACHED;
+
+ /* Called under iommu group mutex, so single threaded */
+ if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_NULL) /* ie, UINTMAX */
+ goto out_err;
+
+ hvdom->domid_num = unique_id;
+ hvdom->partid = hv_get_current_partid();
+ hvdom->iommu_dom.geometry = default_geometry;
+ hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
+
+ /* For guests, by default we do direct attaches, so no domain in hyp */
+ if (hv_dom_owner_is_vmm(hvdom) && !hv_no_attdev)
+ hvdom->attached_dom = true;
+ else {
+ rc = hv_iommu_create_hyp_devdom(hvdom);
+ if (rc)
+ goto out_err;
+ }
+
+ return &hvdom->iommu_dom;
+
+out_err:
+ unique_id--;
+ kfree(hvdom);
+ return NULL;
+}
+
+static void hv_iommu_domain_free(struct iommu_domain *immdom)
+{
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+ unsigned long flags;
+ u64 status;
+ struct hv_input_delete_device_domain *input;
+
+ if (hv_special_domain(hvdom))
+ return;
+
+ if (!hv_dom_owner_is_vmm(hvdom) || hv_no_attdev) {
+ struct hv_input_device_domain *ddp;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ ddp = &input->device_domain;
+ memset(input, 0, sizeof(*input));
+
+ ddp->partition_id = HV_PARTITION_ID_SELF;
+ ddp->domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ ddp->domain_id.id = hvdom->domid_num;
+
+ status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input,
+ NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+ }
+
+ kfree(hvdom);
+}
+
+/*
+ * Attach a device to the default domain, or the null domain, or to a domain
+ * previously created in the hypervisor.
+ */
+static int hv_iommu_att_dev2dom(struct hv_domain *hvdom, struct pci_dev *pdev)
+{
+ unsigned long flags;
+ u64 status;
+ enum hv_device_type dev_type;
+ struct hv_input_attach_device_domain *input;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ /* For null domain, hvdom->domid_num == HV_DEVICE_DOMAIN_ID_S2_NULL */
+ input->device_domain.partition_id = HV_PARTITION_ID_SELF;
+ input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ input->device_domain.domain_id.id = hvdom->domid_num;
+
+ /* NB: Upon guest shutdown, device is re-attached to the default domain
+ * without explicit detach.
+ */
+ if (hv_l1vh_partition())
+ dev_type = HV_DEVICE_TYPE_LOGICAL;
+ else
+ dev_type = HV_DEVICE_TYPE_PCI;
+
+ input->device_id.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
+
+ status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return hv_result_to_errno(status);
+}
+
+/* Caller must have validated that dev is a valid pci dev */
+static int hv_iommu_direct_attach_device(struct pci_dev *pdev, u64 ptid)
+{
+ struct hv_input_attach_device *input;
+ u64 status;
+ int rc;
+ unsigned long flags;
+ union hv_device_id host_devid;
+ enum hv_device_type dev_type;
+
+ if (ptid == HV_PARTITION_ID_INVALID) {
+ pr_err("Hyper-V: Invalid partition id in direct attach\n");
+ return -EINVAL;
+ }
+
+ if (hv_l1vh_partition())
+ dev_type = HV_DEVICE_TYPE_LOGICAL;
+ else
+ dev_type = HV_DEVICE_TYPE_PCI;
+
+ host_devid.as_uint64 = hv_build_devid_oftype(pdev, dev_type);
+
+ do {
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+ input->partition_id = ptid;
+ input->device_id = host_devid;
+
+ /* Hypervisor associates logical_id with this device, and in
+ * some hypercalls like retarget interrupts, logical_id must be
+ * used instead of the BDF. It is a required parameter.
+ */
+ input->attdev_flags.logical_id = 1;
+ input->logical_devid =
+ hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
+
+ status = hv_do_hypercall(HVCALL_ATTACH_DEVICE, input, NULL);
+ local_irq_restore(flags);
+
+ if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+ rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);
+ if (rc)
+ break;
+ }
+ } while (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return hv_result_to_errno(status);
+}
+
+/* Attach a device for passthru to guest VMs, host apps like SPDK, etc */
+static int hv_iommu_attach_dev(struct iommu_domain *immdom, struct device *dev,
+ struct iommu_domain *old)
+{
+ struct pci_dev *pdev;
+ int rc;
+ struct hv_domain *hvdom_new = to_hv_domain(immdom);
+ struct hv_domain *hvdom_prev = to_hv_domain(old);
+
+ /* Only allow PCI devices for now */
+ if (!dev_is_pci(dev))
+ return -EINVAL;
+
+ pdev = to_pci_dev(dev);
+
+ if (hv_l1vh_partition() && !hv_special_domain(hvdom_new) &&
+ !hvdom_new->attached_dom)
+ return -EINVAL;
+
+ /* VFIO does not do explicit detach calls, hence check first if we need
+ * to detach first. Also, in case of guest shutdown, it's the VMM
+ * thread that attaches it back to the hv_def_identity_dom, and
+ * hvdom_prev will not be null then. It is null during boot.
+ */
+ if (hvdom_prev)
+ if (!hv_l1vh_partition() || !hv_special_domain(hvdom_prev))
+ hv_iommu_detach_dev(hvdom_prev, dev);
+
+ /* l1vh does not have a default S2 domain in the hypervisor */
+ if (hv_l1vh_partition() && hv_special_domain(hvdom_new)) {
+ dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
+ return 0;
+ }
+
+ if (hvdom_new->attached_dom)
+ rc = hv_iommu_direct_attach_device(pdev, hvdom_new->partid);
+ else
+ rc = hv_iommu_att_dev2dom(hvdom_new, pdev);
+
+ if (rc == 0)
+ dev_iommu_priv_set(dev, hvdom_new); /* sets "private" field */
+ else
+ dev_iommu_priv_set(dev, NULL);
+
+ return rc;
+}
+
+static void hv_iommu_det_dev_from_guest(struct pci_dev *pdev, u64 ptid)
+{
+ struct hv_input_detach_device *input;
+ u64 status, log_devid;
+ unsigned long flags;
+
+ log_devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_LOGICAL);
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->partition_id = ptid;
+ input->logical_devid = log_devid;
+ status = hv_do_hypercall(HVCALL_DETACH_DEVICE, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+}
+
+static void hv_iommu_det_dev_from_dom(struct pci_dev *pdev)
+{
+ u64 status, devid;
+ unsigned long flags;
+ struct hv_input_detach_device_domain *input;
+
+ devid = hv_build_devid_oftype(pdev, HV_DEVICE_TYPE_PCI);
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->partition_id = HV_PARTITION_ID_SELF;
+ input->device_id.as_uint64 = devid;
+ status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+}
+
+static void hv_iommu_detach_dev(struct hv_domain *hvdom, struct device *dev)
+{
+ struct pci_dev *pdev;
+
+ /* See the attach function, only PCI devices for now */
+ if (!dev_is_pci(dev))
+ return;
+
+ pdev = to_pci_dev(dev);
+
+ if (hvdom->attached_dom)
+ hv_iommu_det_dev_from_guest(pdev, hvdom->partid);
+
+ /* Do not clear attached_dom, hv_iommu_unmap_pages happens
+ * next.
+ */
+ else
+ hv_iommu_det_dev_from_dom(pdev);
+}
+
+static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
+ unsigned long iova, phys_addr_t paddr,
+ size_t size, u32 flags)
+{
+ unsigned long irqflags;
+ struct hv_iommu_mapping *mapping;
+
+ mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+ if (!mapping)
+ return -ENOMEM;
+
+ mapping->paddr = paddr;
+ mapping->iova.start = iova;
+ mapping->iova.last = iova + size - 1;
+ mapping->flags = flags;
+
+ spin_lock_irqsave(&hvdom->mappings_lock, irqflags);
+ interval_tree_insert(&mapping->iova, &hvdom->mappings_tree);
+ spin_unlock_irqrestore(&hvdom->mappings_lock, irqflags);
+
+ return 0;
+}
+
+static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
+ unsigned long iova, size_t size)
+{
+ unsigned long flags;
+ size_t unmapped = 0;
+ unsigned long last = iova + size - 1;
+ struct hv_iommu_mapping *mapping = NULL;
+ struct interval_tree_node *node, *next;
+
+ spin_lock_irqsave(&hvdom->mappings_lock, flags);
+ next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
+ while (next) {
+ node = next;
+ mapping = container_of(node, struct hv_iommu_mapping, iova);
+ next = interval_tree_iter_next(node, iova, last);
+
+ /* Trying to split a mapping? Not supported for now. */
+ if (mapping->iova.start < iova)
+ break;
+
+ unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+ interval_tree_remove(node, &hvdom->mappings_tree);
+ kfree(mapping);
+ }
+ spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
+
+ return unmapped;
+}
+
+/* Return: must return exact status from the hypercall without changes */
+static u64 hv_iommu_map_pgs(struct hv_domain *hvdom,
+ unsigned long iova, phys_addr_t paddr,
+ unsigned long npages, u32 map_flags)
+{
+ u64 status;
+ int i;
+ struct hv_input_map_device_gpa_pages *input;
+ unsigned long flags, pfn;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->device_domain.partition_id = HV_PARTITION_ID_SELF;
+ input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ input->device_domain.domain_id.id = hvdom->domid_num;
+ input->map_flags = map_flags;
+ input->target_device_va_base = iova;
+
+ pfn = paddr >> HV_HYP_PAGE_SHIFT;
+ for (i = 0; i < npages; i++, pfn++)
+ input->gpa_page_list[i] = pfn;
+
+ status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_GPA_PAGES, npages, 0,
+ input, NULL);
+
+ local_irq_restore(flags);
+ return status;
+}
+
+#define HV_MAP_DEVICE_GPA_BATCH_SIZE \
+ ((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_map_device_gpa_pages)) \
+ / sizeof(u64))
+
+/*
+ * The core VFIO code loops over memory ranges calling this function with the
+ * largest pgsize from HV_IOMMU_PGSIZES. cond_resched() is in vfio_iommu_map.
+ */
+static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
+ phys_addr_t paddr, size_t pgsize, size_t pgcount,
+ int prot, gfp_t gfp, size_t *mapped)
+{
+ u32 map_flags;
+ int ret;
+ u64 status;
+ unsigned long npages, done = 0;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+ size_t size = pgsize * pgcount;
+
+ map_flags = HV_MAP_GPA_READABLE; /* required */
+ map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
+
+ ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
+ if (ret)
+ return ret;
+
+ if (hvdom->attached_dom) {
+ *mapped = size;
+ return 0;
+ }
+
+ npages = size >> HV_HYP_PAGE_SHIFT;
+ while (done < npages) {
+ ulong completed, remain = npages - done;
+
+ remain = min(remain, HV_MAP_DEVICE_GPA_BATCH_SIZE);
+
+ status = hv_iommu_map_pgs(hvdom, iova, paddr, remain,
+ map_flags);
+
+ completed = hv_repcomp(status);
+ done = done + completed;
+ iova = iova + (completed << HV_HYP_PAGE_SHIFT);
+ paddr = paddr + (completed << HV_HYP_PAGE_SHIFT);
+
+ if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
+ ret = hv_call_deposit_pages(NUMA_NO_NODE,
+ hv_current_partition_id,
+ 256);
+ if (ret)
+ break;
+ continue;
+ }
+ if (!hv_result_success(status))
+ break;
+ }
+
+ if (!hv_result_success(status)) {
+ size_t done_size = done << HV_HYP_PAGE_SHIFT;
+
+ hv_status_err(status, "pgs:%lx/%lx iova:%lx\n",
+ done, npages, iova);
+ /*
+ * lookup tree has all mappings [0 - size-1]. Below unmap will
+ * only remove from [0 - done], we need to remove second chunk
+ * [done+1 - size-1].
+ */
+ hv_iommu_del_tree_mappings(hvdom, iova, size - done_size);
+ hv_iommu_unmap_pages(immdom, iova - done_size, HV_HYP_PAGE_SIZE,
+ done, NULL);
+ if (mapped)
+ *mapped = 0;
+ } else
+ if (mapped)
+ *mapped = size;
+
+ return hv_result_to_errno(status);
+}
+
+static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
+ size_t pgsize, size_t pgcount,
+ struct iommu_iotlb_gather *gather)
+{
+ unsigned long flags, npages;
+ struct hv_input_unmap_device_gpa_pages *input;
+ u64 status;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+ size_t unmapped, size = pgsize * pgcount;
+
+ unmapped = hv_iommu_del_tree_mappings(hvdom, iova, size);
+ if (unmapped < size)
+ pr_err("%s: could not delete all mappings (%lx:%lx/%lx)\n",
+ __func__, iova, unmapped, size);
+
+ if (hvdom->attached_dom)
+ return size;
+
+ npages = size >> HV_HYP_PAGE_SHIFT;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+
+ input->device_domain.partition_id = HV_PARTITION_ID_SELF;
+ input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
+ input->device_domain.domain_id.id = hvdom->domid_num;
+ input->target_device_va_base = iova;
+
+ status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
+ 0, input, NULL);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status))
+ hv_status_err(status, "\n");
+
+ return unmapped;
+}
+
+static phys_addr_t hv_iommu_iova_to_phys(struct iommu_domain *immdom,
+ dma_addr_t iova)
+{
+ unsigned long flags;
+ struct hv_iommu_mapping *mapping;
+ struct interval_tree_node *node;
+ u64 paddr = 0;
+ struct hv_domain *hvdom = to_hv_domain(immdom);
+
+ spin_lock_irqsave(&hvdom->mappings_lock, flags);
+ node = interval_tree_iter_first(&hvdom->mappings_tree, iova, iova);
+ if (node) {
+ mapping = container_of(node, struct hv_iommu_mapping, iova);
+ paddr = mapping->paddr + (iova - mapping->iova.start);
+ }
+ spin_unlock_irqrestore(&hvdom->mappings_lock, flags);
+
+ return paddr;
+}
+
+/*
+ * Currently, the hypervisor does not dynamically provide a list of the
+ * devices it is using. So allow users to manually specify devices that
+ * should be skipped (e.g. a hypervisor debugger using a network device).
+ */
+static struct iommu_device *hv_iommu_probe_device(struct device *dev)
+{
+ if (!dev_is_pci(dev))
+ return ERR_PTR(-ENODEV);
+
+ if (pci_devs_to_skip && *pci_devs_to_skip) {
+ int rc, pos = 0;
+ int parsed;
+ int segment, bus, slot, func;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ do {
+ parsed = 0;
+
+ rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
+ &segment, &bus, &slot, &func, &parsed);
+ if (rc != 4)
+ break;
+ if (parsed <= 0)
+ break;
+
+ if (pci_domain_nr(pdev->bus) == segment &&
+ pdev->bus->number == bus &&
+ PCI_SLOT(pdev->devfn) == slot &&
+ PCI_FUNC(pdev->devfn) == func) {
+
+ dev_info(dev, "skipped by Hyper-V IOMMU\n");
+ return ERR_PTR(-ENODEV);
+ }
+ pos += parsed;
+
+ } while (pci_devs_to_skip[pos]);
+ }
+
+ /* Device will be explicitly attached to the default domain, so no need
+ * to do dev_iommu_priv_set() here.
+ */
+
+ return &hv_virt_iommu;
+}
+
+static void hv_iommu_probe_finalize(struct device *dev)
+{
+ struct iommu_domain *immdom = iommu_get_domain_for_dev(dev);
+
+ if (immdom && immdom->type == IOMMU_DOMAIN_DMA)
+ iommu_setup_dma_ops(dev, immdom);
+ else
+ set_dma_ops(dev, NULL);
+}
+
+static void hv_iommu_release_device(struct device *dev)
+{
+ struct hv_domain *hvdom = dev_iommu_priv_get(dev);
+
+ /* Need to detach device from device domain if necessary. */
+ if (hvdom)
+ hv_iommu_detach_dev(hvdom, dev);
+
+ dev_iommu_priv_set(dev, NULL);
+ set_dma_ops(dev, NULL);
+}
+
+static struct iommu_group *hv_iommu_device_group(struct device *dev)
+{
+ if (dev_is_pci(dev))
+ return pci_device_group(dev);
+ else
+ return generic_device_group(dev);
+}
+
+static int hv_iommu_def_domain_type(struct device *dev)
+{
+ /* The hypervisor always creates this by default during boot */
+ return IOMMU_DOMAIN_IDENTITY;
+}
+
+static struct iommu_ops hv_iommu_ops = {
+ .capable = hv_iommu_capable,
+ .domain_alloc_paging = hv_iommu_domain_alloc_paging,
+ .probe_device = hv_iommu_probe_device,
+ .probe_finalize = hv_iommu_probe_finalize,
+ .release_device = hv_iommu_release_device,
+ .def_domain_type = hv_iommu_def_domain_type,
+ .device_group = hv_iommu_device_group,
+ .default_domain_ops = &(const struct iommu_domain_ops) {
+ .attach_dev = hv_iommu_attach_dev,
+ .map_pages = hv_iommu_map_pages,
+ .unmap_pages = hv_iommu_unmap_pages,
+ .iova_to_phys = hv_iommu_iova_to_phys,
+ .free = hv_iommu_domain_free,
+ },
+ .owner = THIS_MODULE,
+ .identity_domain = &hv_def_identity_dom.iommu_dom,
+ .blocked_domain = &hv_null_dom.iommu_dom,
+};
+
+static const struct iommu_domain_ops hv_special_domain_ops = {
+ .attach_dev = hv_iommu_attach_dev,
+};
+
+static void __init hv_initialize_special_domains(void)
+{
+ hv_def_identity_dom.iommu_dom.type = IOMMU_DOMAIN_IDENTITY;
+ hv_def_identity_dom.iommu_dom.ops = &hv_special_domain_ops;
+ hv_def_identity_dom.iommu_dom.owner = &hv_iommu_ops;
+ hv_def_identity_dom.iommu_dom.geometry = default_geometry;
+ hv_def_identity_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_DEFAULT; /* 0 */
+
+ hv_null_dom.iommu_dom.type = IOMMU_DOMAIN_BLOCKED;
+ hv_null_dom.iommu_dom.ops = &hv_special_domain_ops;
+ hv_null_dom.iommu_dom.owner = &hv_iommu_ops;
+ hv_null_dom.iommu_dom.geometry = default_geometry;
+ hv_null_dom.domid_num = HV_DEVICE_DOMAIN_ID_S2_NULL; /* U32_MAX */
+}
+
+static int __init hv_iommu_init(void)
+{
+ int ret;
+ struct iommu_device *iommup = &hv_virt_iommu;
+
+ if (!hv_is_hyperv_initialized())
+ return -ENODEV;
+
+ ret = iommu_device_sysfs_add(iommup, NULL, NULL, "%s", "hyperv-iommu");
+ if (ret) {
+ pr_err("Hyper-V: iommu_device_sysfs_add failed: %d\n", ret);
+ return ret;
+ }
+
+ /* This must come before iommu_device_register because the latter calls
+ * into the hooks.
+ */
+ hv_initialize_special_domains();
+
+ ret = iommu_device_register(iommup, &hv_iommu_ops, NULL);
+ if (ret) {
+ pr_err("Hyper-V: iommu_device_register failed: %d\n", ret);
+ goto err_sysfs_remove;
+ }
+
+ pr_info("Hyper-V IOMMU initialized\n");
+
+ return 0;
+
+err_sysfs_remove:
+ iommu_device_sysfs_remove(iommup);
+ return ret;
+}
+
+void __init hv_iommu_detect(void)
+{
+ if (no_iommu || iommu_detected)
+ return;
+
+ /* For l1vh, always expose an iommu unit */
+ if (!hv_l1vh_partition())
+ if (!(ms_hyperv.misc_features & HV_DEVICE_DOMAIN_AVAILABLE))
+ return;
+
+ iommu_detected = 1;
+ x86_init.iommu.iommu_init = hv_iommu_init;
+
+ pci_request_acs();
+}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 25ac7ca0fd8b..8d5c610da99a 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -327,6 +327,23 @@ static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
{ return 0; }
#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
+#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
+u64 hv_get_current_partid(void);
+bool hv_pcidev_is_attached_dev(struct pci_dev *pdev);
+bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev);
+u64 hv_build_devid_oftype(struct pci_dev *pdev, enum hv_device_type type);
+#else
+static inline bool hv_pcidev_is_attached_dev(struct pci_dev *pdev)
+{ return false; }
+static inline bool hv_pcidev_is_pthru_dev(struct pci_dev *pdev)
+{ return false; }
+static inline u64 hv_build_devid_oftype(struct pci_dev *pdev,
+ enum hv_device_type type)
+{ return 0; }
+static inline u64 hv_get_current_partid(void)
+{ return HV_PARTITION_ID_INVALID; }
+#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
+
#else /* CONFIG_HYPERV */
static inline void hv_identify_partition_type(void) {}
static inline bool hv_is_hyperv_initialized(void) { return false; }
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 5459e776ec17..6eee1cbf6f23 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1769,4 +1769,10 @@ static inline unsigned long virt_to_hvpfn(void *addr)
#define HVPFN_DOWN(x) ((x) >> HV_HYP_PAGE_SHIFT)
#define page_to_hvpfn(page) (page_to_pfn(page) * NR_HV_HYP_PAGES_IN_PAGE)
+#ifdef CONFIG_HYPERV_IOMMU
+void __init hv_iommu_detect(void);
+#else
+static inline void hv_iommu_detect(void) { }
+#endif /* CONFIG_HYPERV_IOMMU */
+
#endif /* _HYPERV_H */
--
2.51.2.vfs.0.1
* [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
On Hyper-V, most hypercalls related to PCI passthru (mapping/unmapping
regions, interrupts, etc.) take a device ID as a parameter. This device ID
identifies that specific device for the lifetime of the passthru.
An L1VH VM only contains VMBus based devices. The device ID for a VMBus
device is slightly different in that it is built from the hv_pcibus_device
info, so that it matches exactly what the hypervisor expects. This VMBus
based device ID is needed when attaching devices in an L1VH based guest VM.
Add a function to build it and export it. Before building the ID, the
function checks that the device is a valid VMBus device.
In the remaining cases, the PCI device ID is used. So, also make the PCI
device ID build function hv_build_devid_type_pci() public.
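For illustration only, the packing performed by hv_pci_vmbus_device_id()
in this patch can be sketched in plain userspace C. This is a minimal
sketch, not the kernel code: vmbus_devid_sketch and instance_bytes are
hypothetical names standing in for the function and for
hbus->hdev->dev_instance.b[], and the explicit widening to 64 bits before
shifting is an assumption of the sketch, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the packing done by
 * hv_pci_vmbus_device_id(); instance_bytes stands in for
 * hbus->hdev->dev_instance.b[] and func for PCI_FUNC(pdev->devfn).
 */
static uint64_t vmbus_devid_sketch(const uint8_t instance_bytes[16],
				   unsigned int func)
{
	assert(func <= 7);	/* a PCI function number is 3 bits wide */

	/* Widen before shifting so a high bit in byte 5 is not sign-extended. */
	return ((uint64_t)instance_bytes[5] << 24) |
	       ((uint64_t)instance_bytes[4] << 16) |
	       ((uint64_t)instance_bytes[7] << 8) |
	       ((uint64_t)instance_bytes[6] & 0xf8) |
	       func;
}
```

With instance bytes 4..7 equal to 0x12, 0x34, 0xAB, 0x56 and function 3,
the sketch yields 0x341256AB: the low three bits of byte 6 are masked off
and replaced by the function number.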
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
arch/x86/hyperv/irqdomain.c | 9 +++++----
arch/x86/include/asm/mshyperv.h | 6 ++++++
drivers/pci/controller/pci-hyperv.c | 24 ++++++++++++++++++++++++
include/asm-generic/mshyperv.h | 11 +++++++++++
4 files changed, 46 insertions(+), 4 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index b3ad50a874dc..8780573a4332 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -112,7 +112,7 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
return 0;
}
-static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
+u64 hv_build_devid_type_pci(struct pci_dev *pdev)
{
int pos;
union hv_device_id hv_devid;
@@ -172,8 +172,9 @@ static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
}
out:
- return hv_devid;
+ return hv_devid.as_uint64;
}
+EXPORT_SYMBOL_GPL(hv_build_devid_type_pci);
/*
* hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
@@ -196,7 +197,7 @@ int hv_map_msi_interrupt(struct irq_data *data,
msidesc = irq_data_get_msi_desc(data);
pdev = msi_desc_to_pci_dev(msidesc);
- hv_devid = hv_build_devid_type_pci(pdev);
+ hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
@@ -271,7 +272,7 @@ static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
{
union hv_device_id hv_devid;
- hv_devid = hv_build_devid_type_pci(pdev);
+ hv_devid.as_uint64 = hv_build_devid_type_pci(pdev);
return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f64393e853ee..2ef34001f8d3 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -248,6 +248,12 @@ void hv_crash_asm_end(void);
static inline void hv_root_crash_init(void) {}
#endif /* CONFIG_MSHV_ROOT && CONFIG_CRASH_DUMP */
+#if IS_ENABLED(CONFIG_HYPERV_IOMMU)
+u64 hv_build_devid_type_pci(struct pci_dev *pdev);
+#else
+static inline u64 hv_build_devid_type_pci(struct pci_dev *pdev) { return 0; }
+#endif /* IS_ENABLED(CONFIG_HYPERV_IOMMU) */
+
#else /* CONFIG_HYPERV */
static inline void hyperv_init(void) {}
static inline void hyperv_setup_mmu_ops(void) {}
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index cfc8fa403dad..50d793ca8f31 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -573,6 +573,7 @@ struct hv_pci_compl {
};
static void hv_pci_onchannelcallback(void *context);
+static bool hv_vmbus_pci_device(struct pci_bus *pbus);
#ifdef CONFIG_X86
#define DELIVERY_MODE APIC_DELIVERY_MODE_FIXED
@@ -1005,6 +1006,24 @@ static struct irq_domain *hv_pci_get_root_domain(void)
static void hv_arch_irq_unmask(struct irq_data *data) { }
#endif /* CONFIG_ARM64 */
+u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
+{
+ struct hv_pcibus_device *hbus;
+ struct pci_bus *pbus = pdev->bus;
+
+ if (!hv_vmbus_pci_device(pbus))
+ return 0;
+
+ hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
+
+ return (hbus->hdev->dev_instance.b[5] << 24) |
+ (hbus->hdev->dev_instance.b[4] << 16) |
+ (hbus->hdev->dev_instance.b[7] << 8) |
+ (hbus->hdev->dev_instance.b[6] & 0xf8) |
+ PCI_FUNC(pdev->devfn);
+}
+EXPORT_SYMBOL_GPL(hv_pci_vmbus_device_id);
+
/**
* hv_pci_generic_compl() - Invoked for a completion packet
* @context: Set up by the sender of the packet.
@@ -1403,6 +1422,11 @@ static struct pci_ops hv_pcifront_ops = {
.write = hv_pcifront_write_config,
};
+static bool hv_vmbus_pci_device(struct pci_bus *pbus)
+{
+ return pbus->ops == &hv_pcifront_ops;
+}
+
/*
* Paravirtual backchannel
*
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index e8cbc4e3f7ad..25ac7ca0fd8b 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -204,6 +204,9 @@ extern u64 (*hv_read_reference_counter)(void);
/* Sentinel value for an uninitialized entry in hv_vp_index array */
#define VP_INVAL U32_MAX
+/* Forward declarations */
+struct pci_dev;
+
int __init hv_common_init(void);
void __init hv_get_partition_id(void);
void __init hv_common_free(void);
@@ -316,6 +319,14 @@ void hv_para_set_synic_register(unsigned int reg, u64 val);
void hyperv_cleanup(void);
bool hv_query_ext_cap(u64 cap_query);
void hv_setup_dma_ops(struct device *dev, bool coherent);
+
+#if IS_ENABLED(CONFIG_PCI_HYPERV)
+u64 hv_pci_vmbus_device_id(struct pci_dev *pdev);
+#else
+static inline u64 hv_pci_vmbus_device_id(struct pci_dev *pdev)
+{ return 0; }
+#endif /* IS_ENABLED(CONFIG_PCI_HYPERV) */
+
#else /* CONFIG_HYPERV */
static inline void hv_identify_partition_type(void) {}
static inline bool hv_is_hyperv_initialized(void) { return false; }
--
2.51.2.vfs.0.1
* [PATCH V3 07/11] mshv: Import data structs around device passthru from hyperv headers
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Import from the Hyper-V public headers the definitions and declarations
related to attaching and detaching device domains, and to building device
IDs for those purposes.
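As a rough check of the imported layouts, the userspace sketch below
mirrors struct hv_input_device_domain and struct
hv_input_map_device_gpa_pages with fixed-width types. It is only a sketch
under stated assumptions: all *_sketch names are hypothetical, union
hv_input_vtl is assumed to be one byte, the hypervisor page size is
assumed to be 4096 bytes, and the domain ID packing assumes the usual
little-endian bitfield layout (type in bits 0-3, id in bits 32-63).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u	/* assumed value of HV_HYP_PAGE_SIZE */

/* Mirrors struct hv_input_device_domain: 8 + 1 + 7 + 8 = 24 bytes. */
struct device_domain_sketch {
	uint64_t partition_id;
	uint8_t owner_vtl;	/* stands in for union hv_input_vtl */
	uint8_t padding[7];
	uint64_t domain_id;	/* stands in for union hv_device_domain_id */
};

/* Mirrors struct hv_input_map_device_gpa_pages: 24 + 1 + 3 + 4 + 8 = 40. */
struct map_gpa_pages_sketch {
	struct device_domain_sketch device_domain;
	uint8_t target_vtl;
	uint8_t padding[3];
	uint32_t map_flags;
	uint64_t target_device_va_base;
	uint64_t gpa_page_list[];	/* fills the rest of the input page */
};

static_assert(sizeof(struct device_domain_sketch) == 24, "domain header size");
static_assert(sizeof(struct map_gpa_pages_sketch) == 40, "map header size");

/* Number of 64-bit page frame numbers that fit after the fixed header. */
static size_t map_batch_capacity_sketch(void)
{
	return (PAGE_SIZE_SKETCH - sizeof(struct map_gpa_pages_sketch)) /
	       sizeof(uint64_t);
}

/* Pack a domain ID: type in bits 0-3, id in bits 32-63 (assumed layout). */
static uint64_t domain_id_pack_sketch(uint32_t type, uint32_t id)
{
	return ((uint64_t)id << 32) | (type & 0xfu);
}
```

Under these assumptions one input page holds (4096 - 40) / 8 = 507 page
frame numbers, and the stage 2 NULL domain (type 0, id 0xFFFFFFFF) packs
to 0xFFFFFFFF00000000.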
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
include/hyperv/hvgdk_mini.h | 11 ++++
include/hyperv/hvhdk_mini.h | 112 ++++++++++++++++++++++++++++++++++++
2 files changed, 123 insertions(+)
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 6a4e8b9d570f..da622fb06440 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -326,6 +326,9 @@ union hv_hypervisor_version_info {
/* stimer Direct Mode is available */
#define HV_STIMER_DIRECT_MODE_AVAILABLE BIT(19)
+#define HV_DEVICE_DOMAIN_AVAILABLE BIT(24)
+#define HV_S1_DEVICE_DOMAIN_AVAILABLE BIT(25)
+
/*
* Implementation recommendations. Indicates which behaviors the hypervisor
* recommends the OS implement for optimal performance.
@@ -475,6 +478,8 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
#define HVCALL_MAP_DEVICE_INTERRUPT 0x007c
#define HVCALL_UNMAP_DEVICE_INTERRUPT 0x007d
#define HVCALL_RETARGET_INTERRUPT 0x007e
+#define HVCALL_ATTACH_DEVICE 0x0082
+#define HVCALL_DETACH_DEVICE 0x0083
#define HVCALL_NOTIFY_PARTITION_EVENT 0x0087
#define HVCALL_ENTER_SLEEP_STATE 0x0084
#define HVCALL_NOTIFY_PORT_RING_EMPTY 0x008b
@@ -486,9 +491,15 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
#define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af
#define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0
+#define HVCALL_CREATE_DEVICE_DOMAIN 0x00b1
+#define HVCALL_ATTACH_DEVICE_DOMAIN 0x00b2
+#define HVCALL_MAP_DEVICE_GPA_PAGES 0x00b3
+#define HVCALL_UNMAP_DEVICE_GPA_PAGES 0x00b4
#define HVCALL_SIGNAL_EVENT_DIRECT 0x00c0
#define HVCALL_POST_MESSAGE_DIRECT 0x00c1
#define HVCALL_DISPATCH_VP 0x00c2
+#define HVCALL_DETACH_DEVICE_DOMAIN 0x00c4
+#define HVCALL_DELETE_DEVICE_DOMAIN 0x00c5
#define HVCALL_GET_GPA_PAGES_ACCESS_STATES 0x00c9
#define HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d7
#define HVCALL_RELEASE_SPARSE_SPA_PAGE_HOST_ACCESS 0x00d8
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index b4cb2fa26e9b..60425052a799 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -468,6 +468,32 @@ struct hv_send_ipi_ex { /* HV_INPUT_SEND_SYNTHETIC_CLUSTER_IPI_EX */
struct hv_vpset vp_set;
} __packed;
+union hv_attdev_flags { /* HV_ATTACH_DEVICE_FLAGS */
+ struct {
+ u32 logical_id : 1;
+ u32 resvd0 : 1;
+ u32 ats_enabled : 1;
+ u32 virt_func : 1;
+ u32 shared_irq_child : 1;
+ u32 virt_dev : 1;
+ u32 ats_supported : 1;
+ u32 small_irt : 1;
+ u32 resvd : 24;
+ } __packed;
+ u32 as_uint32;
+};
+
+union hv_dev_pci_caps { /* HV_DEVICE_PCI_CAPABILITIES */
+ struct {
+ u32 max_pasid_width : 5;
+ u32 invalidate_qdepth : 5;
+ u32 global_inval : 1;
+ u32 prg_response_req : 1;
+ u32 resvd : 20;
+ } __packed;
+ u32 as_uint32;
+};
+
typedef u16 hv_pci_rid; /* HV_PCI_RID */
typedef u16 hv_pci_segment; /* HV_PCI_SEGMENT */
typedef u64 hv_logical_device_id;
@@ -547,4 +573,90 @@ union hv_device_id { /* HV_DEVICE_ID */
} acpi;
} __packed;
+struct hv_input_attach_device { /* HV_INPUT_ATTACH_DEVICE */
+ u64 partition_id;
+ union hv_device_id device_id;
+ union hv_attdev_flags attdev_flags;
+ u8 attdev_vtl;
+ u8 rsvd0;
+ u16 rsvd1;
+ u64 logical_devid;
+ union hv_dev_pci_caps dev_pcicaps;
+ u16 pf_pci_rid;
+ u16 resvd2;
+} __packed;
+
+struct hv_input_detach_device { /* HV_INPUT_DETACH_DEVICE */
+ u64 partition_id;
+ u64 logical_devid;
+} __packed;
+
+
+/* 3 domain types: stage 1, stage 2, and SOC */
+#define HV_DEVICE_DOMAIN_TYPE_S2 0 /* HV_DEVICE_DOMAIN_ID_TYPE_S2 */
+#define HV_DEVICE_DOMAIN_TYPE_S1 1 /* HV_DEVICE_DOMAIN_ID_TYPE_S1 */
+#define HV_DEVICE_DOMAIN_TYPE_SOC 2 /* HV_DEVICE_DOMAIN_ID_TYPE_SOC */
+
+/* ID for stage 2 default domain and NULL domain */
+#define HV_DEVICE_DOMAIN_ID_S2_DEFAULT 0
+#define HV_DEVICE_DOMAIN_ID_S2_NULL 0xFFFFFFFFULL
+
+union hv_device_domain_id {
+ u64 as_uint64;
+ struct {
+ u32 type : 4;
+ u32 reserved : 28;
+ u32 id;
+ };
+} __packed;
+
+struct hv_input_device_domain { /* HV_INPUT_DEVICE_DOMAIN */
+ u64 partition_id;
+ union hv_input_vtl owner_vtl;
+ u8 padding[7];
+ union hv_device_domain_id domain_id;
+} __packed;
+
+union hv_create_device_domain_flags { /* HV_CREATE_DEVICE_DOMAIN_FLAGS */
+ u32 as_uint32;
+ struct {
+ u32 forward_progress_required : 1;
+ u32 inherit_owning_vtl : 1;
+ u32 reserved : 30;
+ } __packed;
+} __packed;
+
+struct hv_input_create_device_domain { /* HV_INPUT_CREATE_DEVICE_DOMAIN */
+ struct hv_input_device_domain device_domain;
+ union hv_create_device_domain_flags create_device_domain_flags;
+} __packed;
+
+struct hv_input_delete_device_domain { /* HV_INPUT_DELETE_DEVICE_DOMAIN */
+ struct hv_input_device_domain device_domain;
+} __packed;
+
+struct hv_input_attach_device_domain { /* HV_INPUT_ATTACH_DEVICE_DOMAIN */
+ struct hv_input_device_domain device_domain;
+ union hv_device_id device_id;
+} __packed;
+
+struct hv_input_detach_device_domain { /* HV_INPUT_DETACH_DEVICE_DOMAIN */
+ u64 partition_id;
+ union hv_device_id device_id;
+} __packed;
+
+struct hv_input_map_device_gpa_pages { /* HV_INPUT_MAP_DEVICE_GPA_PAGES */
+ struct hv_input_device_domain device_domain;
+ union hv_input_vtl target_vtl;
+ u8 padding[3];
+ u32 map_flags;
+ u64 target_device_va_base;
+ u64 gpa_page_list[];
+} __packed;
+
+struct hv_input_unmap_device_gpa_pages { /* HV_INPUT_UNMAP_DEVICE_GPA_PAGES */
+ struct hv_input_device_domain device_domain;
+ u64 target_device_va_base;
+} __packed;
+
#endif /* _HV_HVHDK_MINI_H */
--
2.51.2.vfs.0.1
* [PATCH V3 06/11] mshv: Add ioctl support for MSHV-VFIO bridge device
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Add ioctl support for creating MSHV devices for a partition. At
present only the VFIO device type is supported, but more could be
added. At a high level, the partition ioctl that creates a device verifies
that it is of type VFIO and does some setup for the bridge code in
mshv_vfio.c. Adapted from the KVM device ioctls.
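The attribute ioctls in this patch forward a command to an optional hook
in the device's ops table and fail with -EPERM when the hook is absent.
A minimal userspace sketch of that dispatch pattern follows. Every name
in it (the *_sketch types, commands, and the example hook) is
hypothetical and stands in for the MSHV types; it is not the driver code
and leaves out copy_from_user() and locking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the MSHV uapi commands and types. */
enum attr_cmd_sketch { SET_ATTR_SKETCH, HAS_ATTR_SKETCH };

struct attr_sketch {
	unsigned int group;
	unsigned long long attr;
};

struct dev_sketch;

struct dev_ops_sketch {
	long (*set_attr)(struct dev_sketch *dev, const struct attr_sketch *attr);
	long (*has_attr)(struct dev_sketch *dev, const struct attr_sketch *attr);
};

struct dev_sketch {
	const struct dev_ops_sketch *ops;
};

/* Forward a command to the matching optional hook, or fail with -EPERM. */
static long dev_attr_dispatch_sketch(struct dev_sketch *dev,
				     enum attr_cmd_sketch cmd,
				     const struct attr_sketch *attr)
{
	assert(dev != NULL && dev->ops != NULL);

	switch (cmd) {
	case SET_ATTR_SKETCH:
		if (dev->ops->set_attr)
			return dev->ops->set_attr(dev, attr);
		break;
	case HAS_ATTR_SKETCH:
		if (dev->ops->has_attr)
			return dev->ops->has_attr(dev, attr);
		break;
	}

	return -EPERM;
}

/* Example hook: report support only for attribute group 1. */
static long has_attr_group1_sketch(struct dev_sketch *dev,
				   const struct attr_sketch *attr)
{
	(void)dev;
	return attr->group == 1 ? 0 : -ENXIO;
}
```

A device type that only implements the "has" hook would answer queries
through it, while any "set" request falls through to -EPERM.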
Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 116 ++++++++++++++++++++++++++++++++++++
1 file changed, 116 insertions(+)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 02c107458be9..6ceb5f608589 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1625,6 +1625,119 @@ mshv_partition_ioctl_initialize(struct mshv_partition *partition)
return ret;
}
+static long mshv_device_attr_ioctl(struct mshv_device *mshv_dev, int cmd,
+ ulong uarg)
+{
+ struct mshv_device_attr attr;
+ const struct mshv_device_ops *devops = mshv_dev->device_ops;
+
+ if (copy_from_user(&attr, (void __user *)uarg, sizeof(attr)))
+ return -EFAULT;
+
+ switch (cmd) {
+ case MSHV_SET_DEVICE_ATTR:
+ if (devops->device_set_attr)
+ return devops->device_set_attr(mshv_dev, &attr);
+ break;
+ case MSHV_HAS_DEVICE_ATTR:
+ if (devops->device_has_attr)
+ return devops->device_has_attr(mshv_dev, &attr);
+ break;
+ }
+
+ return -EPERM;
+}
+
+static long mshv_device_fop_ioctl(struct file *filp, unsigned int cmd,
+ ulong uarg)
+{
+ struct mshv_device *mshv_dev = filp->private_data;
+
+ switch (cmd) {
+ case MSHV_SET_DEVICE_ATTR:
+ case MSHV_HAS_DEVICE_ATTR:
+ return mshv_device_attr_ioctl(mshv_dev, cmd, uarg);
+ }
+
+ return -ENOTTY;
+}
+
+static int mshv_device_fop_release(struct inode *inode, struct file *filp)
+{
+ struct mshv_device *mshv_dev = filp->private_data;
+ struct mshv_partition *partition = mshv_dev->device_pt;
+
+ if (mshv_dev->device_ops->device_release) {
+ mutex_lock(&partition->pt_mutex);
+ hlist_del(&mshv_dev->device_ptnode);
+ mshv_dev->device_ops->device_release(mshv_dev);
+ mutex_unlock(&partition->pt_mutex);
+ }
+
+ mshv_partition_put(partition);
+ return 0;
+}
+
+static const struct file_operations mshv_device_fops = {
+ .owner = THIS_MODULE,
+ .unlocked_ioctl = mshv_device_fop_ioctl,
+ .release = mshv_device_fop_release,
+};
+
+static long mshv_partition_ioctl_create_device(struct mshv_partition *partition,
+ void __user *uarg)
+{
+ long rc;
+ struct mshv_create_device devargk;
+ struct mshv_device *mshv_dev;
+ const struct mshv_device_ops *vfio_ops;
+
+ if (copy_from_user(&devargk, uarg, sizeof(devargk)))
+ return -EFAULT;
+
+ /* At present, only VFIO is supported */
+ if (devargk.type != MSHV_DEV_TYPE_VFIO)
+ return -ENODEV;
+
+ if (devargk.flags & MSHV_CREATE_DEVICE_TEST)
+ return 0;
+
+ /* This is freed later by mshv_vfio_release_device() */
+ mshv_dev = kzalloc(sizeof(*mshv_dev), GFP_KERNEL_ACCOUNT);
+ if (mshv_dev == NULL)
+ return -ENOMEM;
+
+ vfio_ops = &mshv_vfio_device_ops;
+ mshv_dev->device_ops = vfio_ops;
+ mshv_dev->device_pt = partition;
+
+ rc = vfio_ops->device_create(mshv_dev);
+ if (rc < 0) {
+ kfree(mshv_dev);
+ return rc;
+ }
+
+ hlist_add_head(&mshv_dev->device_ptnode, &partition->pt_devices);
+
+ mshv_partition_get(partition);
+ rc = anon_inode_getfd(vfio_ops->device_name, &mshv_device_fops,
+ mshv_dev, O_RDWR | O_CLOEXEC);
+ if (rc < 0)
+ goto undo_out;
+
+ devargk.fd = rc;
+ if (copy_to_user(uarg, &devargk, sizeof(devargk)))
+ return -EFAULT; /* cleanup in mshv_device_fop_release() */
+
+ return 0;
+
+undo_out:
+ hlist_del(&mshv_dev->device_ptnode);
+ vfio_ops->device_release(mshv_dev); /* will kfree(mshv_dev) */
+ mshv_partition_put(partition);
+ return rc;
+}
+
static long
mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
{
@@ -1661,6 +1774,9 @@ mshv_partition_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
case MSHV_ROOT_HVCALL:
ret = mshv_ioctl_passthru_hvcall(partition, true, uarg);
break;
+ case MSHV_CREATE_DEVICE:
+ ret = mshv_partition_ioctl_create_device(partition, uarg);
+ break;
default:
ret = -ENOTTY;
}
--
2.51.2.vfs.0.1
* [PATCH V3 05/11] mshv: Implement mshv bridge device for VFIO
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Add a new file implementing the VFIO-MSHV bridge pseudo device. These
functions are called in the VFIO framework. Credit goes to kvm/vfio.c,
from which this file was adapted.
Co-developed-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
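Note (not part of the commit): the add/del semantics of the bridge's file
list can be modelled in plain userspace C. This is only a sketch under
stated assumptions: struct bridge, struct tracked_file and the two helpers
are hypothetical names, an opaque pointer stands in for struct file *, and
locking and file reference counting are left out. It shows the return codes
the patch uses: -EEXIST for a duplicate add and -ENOENT for deleting a file
that is not tracked.

```c
/* Userspace model of the bridge's file tracking; not kernel code. */
#include <errno.h>
#include <stdlib.h>

struct tracked_file {
	struct tracked_file *next;
	const void *file;	/* stands in for struct file * */
};

struct bridge {
	struct tracked_file *head;
};

/* Like mshv_vfio_file_add(): reject duplicates, then track the file. */
static int bridge_file_add(struct bridge *b, const void *file)
{
	struct tracked_file *tf;

	for (tf = b->head; tf; tf = tf->next)
		if (tf->file == file)
			return -EEXIST;

	tf = calloc(1, sizeof(*tf));
	if (!tf)
		return -ENOMEM;

	/* The patch appends at the tail; order does not matter for lookup. */
	tf->file = file;
	tf->next = b->head;
	b->head = tf;
	return 0;
}

/* Like mshv_vfio_file_del(): -ENOENT unless the file was tracked. */
static int bridge_file_del(struct bridge *b, const void *file)
{
	struct tracked_file **pp;

	for (pp = &b->head; *pp; pp = &(*pp)->next) {
		if ((*pp)->file == file) {
			struct tracked_file *victim = *pp;

			*pp = victim->next;
			free(victim);
			return 0;
		}
	}
	return -ENOENT;
}
```

In the patch itself the same walk happens under mshv_vfio->lock, and each
tracked entry additionally holds a reference on the file.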
drivers/hv/Makefile | 3 +-
drivers/hv/mshv_vfio.c | 211 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/mshv.h | 1 +
3 files changed, 214 insertions(+), 1 deletion(-)
create mode 100644 drivers/hv/mshv_vfio.c
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index 888a748cc7cb..9ab6fc254c38 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -14,7 +14,8 @@ hv_vmbus-y := vmbus_drv.o \
hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
mshv_root-y := mshv_root_main.o mshv_synic.o mshv_eventfd.o mshv_irq.o \
- mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o
+ mshv_root_hv_call.o mshv_portid_table.o mshv_regions.o \
+ mshv_vfio.o
mshv_root-$(CONFIG_DEBUG_FS) += mshv_debugfs.o
mshv_root-$(CONFIG_TRACEPOINTS) += mshv_trace.o
mshv_vtl-y := mshv_vtl_main.o
diff --git a/drivers/hv/mshv_vfio.c b/drivers/hv/mshv_vfio.c
new file mode 100644
index 000000000000..00a97920e25b
--- /dev/null
+++ b/drivers/hv/mshv_vfio.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VFIO-MSHV bridge pseudo device
+ *
+ * Heavily inspired by the VFIO-KVM bridge pseudo device.
+ */
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <asm/mshyperv.h>
+
+#include "mshv.h"
+#include "mshv_root.h"
+
+struct mshv_vfio_file {
+ struct list_head node;
+ struct file *file;
+};
+
+struct mshv_vfio {
+ struct list_head file_list; /* list of struct mshv_vfio_file */
+ struct mutex lock;
+};
+
+static bool mshv_vfio_file_is_valid(struct file *file)
+{
+ bool (*fn)(struct file *file);
+ bool ret;
+
+ fn = symbol_get(vfio_file_is_valid);
+ if (!fn)
+ return false;
+
+ ret = fn(file);
+
+ symbol_put(vfio_file_is_valid);
+
+ return ret;
+}
+
+static long mshv_vfio_file_add(struct mshv_device *mshvdev, unsigned int fd)
+{
+ struct mshv_vfio *mshv_vfio = mshvdev->device_private;
+ struct mshv_vfio_file *mvf;
+ struct file *filp;
+ long ret = 0;
+
+ filp = fget(fd);
+ if (!filp)
+ return -EBADF;
+
+ /* Ensure the FD is a vfio FD. */
+ if (!mshv_vfio_file_is_valid(filp)) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ mutex_lock(&mshv_vfio->lock);
+
+ list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
+ if (mvf->file == filp) {
+ ret = -EEXIST;
+ goto out_unlock;
+ }
+ }
+
+ mvf = kzalloc(sizeof(*mvf), GFP_KERNEL_ACCOUNT);
+ if (!mvf) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ mvf->file = get_file(filp);
+ list_add_tail(&mvf->node, &mshv_vfio->file_list);
+
+out_unlock:
+ mutex_unlock(&mshv_vfio->lock);
+out_fput:
+ fput(filp);
+ return ret;
+}
+
+static long mshv_vfio_file_del(struct mshv_device *mshvdev, unsigned int fd)
+{
+ struct mshv_vfio *mshv_vfio = mshvdev->device_private;
+ struct mshv_vfio_file *mvf;
+ long ret;
+
+ CLASS(fd, f)(fd);
+
+ if (fd_empty(f))
+ return -EBADF;
+
+ ret = -ENOENT;
+ mutex_lock(&mshv_vfio->lock);
+
+ list_for_each_entry(mvf, &mshv_vfio->file_list, node) {
+ if (mvf->file != fd_file(f))
+ continue;
+
+ list_del(&mvf->node);
+ fput(mvf->file);
+ kfree(mvf);
+ ret = 0;
+ break;
+ }
+
+ mutex_unlock(&mshv_vfio->lock);
+ return ret;
+}
+
+static long mshv_vfio_set_file(struct mshv_device *mshvdev, long attr,
+ void __user *arg)
+{
+ int32_t __user *argp = arg;
+ int32_t fd;
+
+ switch (attr) {
+ case MSHV_DEV_VFIO_FILE_ADD:
+ if (get_user(fd, argp))
+ return -EFAULT;
+ return mshv_vfio_file_add(mshvdev, fd);
+
+ case MSHV_DEV_VFIO_FILE_DEL:
+ if (get_user(fd, argp))
+ return -EFAULT;
+ return mshv_vfio_file_del(mshvdev, fd);
+ }
+
+ return -ENXIO;
+}
+
+static long mshv_vfio_set_attr(struct mshv_device *mshvdev,
+ struct mshv_device_attr *attr)
+{
+ switch (attr->group) {
+ case MSHV_DEV_VFIO_FILE:
+ return mshv_vfio_set_file(mshvdev, attr->attr,
+ u64_to_user_ptr(attr->addr));
+ }
+
+ return -ENXIO;
+}
+
+static long mshv_vfio_has_attr(struct mshv_device *mshvdev,
+ struct mshv_device_attr *attr)
+{
+ switch (attr->group) {
+ case MSHV_DEV_VFIO_FILE:
+ switch (attr->attr) {
+ case MSHV_DEV_VFIO_FILE_ADD:
+ case MSHV_DEV_VFIO_FILE_DEL:
+ return 0;
+ }
+
+ break;
+ }
+
+ return -ENXIO;
+}
+
+static long mshv_vfio_create_device(struct mshv_device *mshvdev)
+{
+ struct mshv_device *tmp;
+ struct mshv_vfio *mshv_vfio;
+
+ /* Only one VFIO "device" per VM */
+ hlist_for_each_entry(tmp, &mshvdev->device_pt->pt_devices,
+ device_ptnode)
+ if (tmp->device_ops == &mshv_vfio_device_ops)
+ return -EBUSY;
+
+ mshv_vfio = kzalloc(sizeof(*mshv_vfio), GFP_KERNEL_ACCOUNT);
+ if (mshv_vfio == NULL)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&mshv_vfio->file_list);
+ mutex_init(&mshv_vfio->lock);
+
+ mshvdev->device_private = mshv_vfio;
+
+ return 0;
+}
+
+/* This is called from mshv_device_fop_release() */
+static void mshv_vfio_release_device(struct mshv_device *mshvdev)
+{
+ struct mshv_vfio *mv = mshvdev->device_private;
+ struct mshv_vfio_file *mvf, *tmp;
+
+ list_for_each_entry_safe(mvf, tmp, &mv->file_list, node) {
+ fput(mvf->file);
+ list_del(&mvf->node);
+ kfree(mvf);
+ }
+
+ kfree(mv);
+ kfree(mshvdev);
+}
+
+struct mshv_device_ops mshv_vfio_device_ops = {
+ .device_name = "mshv-vfio",
+ .device_create = mshv_vfio_create_device,
+ .device_release = mshv_vfio_release_device,
+ .device_set_attr = mshv_vfio_set_attr,
+ .device_has_attr = mshv_vfio_has_attr,
+};
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index be6fe3ee8707..b038a79786d2 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -254,6 +254,7 @@ struct mshv_root_hvcall {
#define MSHV_GET_GPAP_ACCESS_BITMAP _IOWR(MSHV_IOCTL, 0x06, struct mshv_gpap_access_bitmap)
/* Generic hypercall */
#define MSHV_ROOT_HVCALL _IOWR(MSHV_IOCTL, 0x07, struct mshv_root_hvcall)
+#define MSHV_CREATE_DEVICE _IOWR(MSHV_IOCTL, 0x08, struct mshv_create_device)
/*
********************************
--
2.51.2.vfs.0.1
^ permalink raw reply related
* [PATCH V3 04/11] mshv: Declarations and definitions for VFIO-MSHV bridge device
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Add the data structures needed by the subsequent patch, which introduces a
new file implementing the VFIO-MSHV bridge pseudo device.
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
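Note (not part of the commit): a VMM would presumably describe a VFIO file
to the bridge device by filling struct mshv_device_attr. The sketch below
copies the uapi definitions from this patch so that it builds standalone;
vfio_file_attr_init() is a hypothetical helper, not something the series
provides. It assumes that addr points at a 32-bit fd, which matches how the
bridge patch reads the argument with get_user().

```c
#include <stdint.h>
#include <string.h>

/* Local copies of the uapi definitions added by this patch. */
#define MSHV_DEV_VFIO_FILE	1
#define MSHV_DEV_VFIO_FILE_ADD	1
#define MSHV_DEV_VFIO_FILE_DEL	2

struct mshv_device_attr {
	uint32_t flags;	/* no flags currently defined */
	uint32_t group;	/* device-defined */
	uint64_t attr;	/* group-defined */
	uint64_t addr;	/* userspace address of attr data */
};

/* Hypothetical helper: describe an add or del of one VFIO fd. */
static void vfio_file_attr_init(struct mshv_device_attr *a, uint64_t op,
				const int32_t *vfio_fd)
{
	memset(a, 0, sizeof(*a));	/* flags must stay zero for now */
	a->group = MSHV_DEV_VFIO_FILE;
	a->attr = op;			/* MSHV_DEV_VFIO_FILE_ADD or _DEL */
	a->addr = (uint64_t)(uintptr_t)vfio_fd;
}
```

The filled structure would then be passed to MSHV_SET_DEVICE_ATTR on the fd
returned by MSHV_CREATE_DEVICE.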
drivers/hv/mshv_root.h | 19 +++++++++++++++++++
include/uapi/linux/mshv.h | 30 ++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index a85c24dcc701..b9880d0bdc4d 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -227,6 +227,25 @@ struct port_table_info {
};
};
+struct mshv_device {
+ const struct mshv_device_ops *device_ops;
+ struct mshv_partition *device_pt;
+ void *device_private;
+ struct hlist_node device_ptnode;
+};
+
+struct mshv_device_ops {
+ const char *device_name;
+ long (*device_create)(struct mshv_device *dev);
+ void (*device_release)(struct mshv_device *dev);
+ long (*device_set_attr)(struct mshv_device *dev,
+ struct mshv_device_attr *attr);
+ long (*device_has_attr)(struct mshv_device *dev,
+ struct mshv_device_attr *attr);
+};
+
+extern struct mshv_device_ops mshv_vfio_device_ops;
+
int mshv_update_routing_table(struct mshv_partition *partition,
const struct mshv_user_irq_entry *entries,
unsigned int numents);
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index 32ff92b6342b..be6fe3ee8707 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -404,4 +404,34 @@ struct mshv_sint_mask {
/* hv_hvcall device */
#define MSHV_HVCALL_SETUP _IOW(MSHV_IOCTL, 0x1E, struct mshv_vtl_hvcall_setup)
#define MSHV_HVCALL _IOWR(MSHV_IOCTL, 0x1F, struct mshv_vtl_hvcall)
+
+/* Device passthru */
+#define MSHV_CREATE_DEVICE_TEST 1
+
+enum {
+ MSHV_DEV_TYPE_VFIO,
+ MSHV_DEV_TYPE_MAX,
+};
+
+struct mshv_create_device {
+ __u32 type; /* in: MSHV_DEV_TYPE_xxx */
+ __u32 fd; /* out: device handle */
+ __u32 flags; /* in: MSHV_CREATE_DEVICE_xxx */
+};
+
+#define MSHV_DEV_VFIO_FILE 1
+#define MSHV_DEV_VFIO_FILE_ADD 1
+#define MSHV_DEV_VFIO_FILE_DEL 2
+
+struct mshv_device_attr {
+ __u32 flags; /* no flags currently defined */
+ __u32 group; /* device-defined */
+ __u64 attr; /* group-defined */
+ __u64 addr; /* userspace address of attr data */
+};
+
+/* Device fds created with MSHV_CREATE_DEVICE */
+#define MSHV_SET_DEVICE_ATTR _IOW(MSHV_IOCTL, 0x00, struct mshv_device_attr)
+#define MSHV_HAS_DEVICE_ATTR _IOW(MSHV_IOCTL, 0x01, struct mshv_device_attr)
+
#endif
--
2.51.2.vfs.0.1
^ permalink raw reply related
* [PATCH V3 03/11] mshv: Provide a way to get partition ID if running in a VMM process
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Many PCI passthru related hypercalls require the partition ID of the target
guest. Guests are managed by the MSHV driver, and the partition ID is
maintained only there. Add a field to the partition struct in the MSHV
driver to save the tgid of the VMM process creating the partition, and
add a function there to retrieve the partition ID if the current process
is a VMM process.
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
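Note (not part of the commit): the lookup amounts to matching the caller's
tgid against the tgid recorded at partition creation. A minimal userspace
model follows. The flat array replaces the driver's RCU-protected hash
table, partid_for_tgid() is a hypothetical name, and PARTID_INVALID is only
a placeholder for HV_PARTITION_ID_INVALID, whose real value comes from the
Hyper-V headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Placeholder (assumption) for HV_PARTITION_ID_INVALID. */
#define PARTID_INVALID	UINT64_MAX

struct partition {
	uint64_t pt_id;		/* partition ID known to the hypervisor */
	pid_t pt_vmm_tgid;	/* tgid of the VMM that created it */
};

/* Return the ID of the first partition created by tgid, if any. */
static uint64_t partid_for_tgid(const struct partition *tbl, size_t n,
				pid_t tgid)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].pt_vmm_tgid == tgid)
			return tbl[i].pt_id;

	return PARTID_INVALID;
}
```

As in the patch, the first match wins, so the model implicitly assumes one
partition per VMM process.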
drivers/hv/mshv_root.h | 1 +
drivers/hv/mshv_root_main.c | 22 ++++++++++++++++++++++
include/asm-generic/mshyperv.h | 5 +++++
3 files changed, 28 insertions(+)
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 1f086dcb7aa1..a85c24dcc701 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -138,6 +138,7 @@ struct mshv_partition {
struct mshv_girq_routing_table __rcu *pt_girq_tbl;
u64 isolation_type;
+ pid_t pt_vmm_tgid;
bool import_completed;
bool pt_initialized;
#if IS_ENABLED(CONFIG_DEBUG_FS)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bd1359eb58dd..02c107458be9 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1908,6 +1908,27 @@ mshv_partition_release(struct inode *inode, struct file *filp)
return 0;
}
+/* Return the partition id if the current process is a VMM process */
+u64 mshv_current_partid(void)
+{
+ struct mshv_partition *pt;
+ int i;
+ u64 ret_ptid = HV_PARTITION_ID_INVALID;
+
+ rcu_read_lock();
+
+ hash_for_each_rcu(mshv_root.pt_htable, i, pt, pt_hnode) {
+ if (pt->pt_vmm_tgid == current->tgid) {
+ ret_ptid = pt->pt_id;
+ break;
+ }
+ }
+
+ rcu_read_unlock();
+ return ret_ptid;
+}
+EXPORT_SYMBOL_GPL(mshv_current_partid);
+
static int
add_partition(struct mshv_partition *partition)
{
@@ -2073,6 +2094,7 @@ mshv_ioctl_create_partition(void __user *user_arg, struct device *module_dev)
goto cleanup_irq_srcu;
partition->pt_id = pt_id;
+ partition->pt_vmm_tgid = current->tgid;
ret = add_partition(partition);
if (ret)
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb..e8cbc4e3f7ad 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -350,6 +350,7 @@ int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
int hv_call_notify_all_processors_started(void);
bool hv_lp_exists(u32 lp_index);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
+u64 mshv_current_partid(void);
#else /* CONFIG_MSHV_ROOT */
static inline bool hv_root_partition(void) { return false; }
@@ -380,6 +381,10 @@ static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u3
{
return -EOPNOTSUPP;
}
+static inline u64 mshv_current_partid(void)
+{
+ return HV_PARTITION_ID_INVALID;
+}
#endif /* CONFIG_MSHV_ROOT */
static inline int hv_deposit_memory(u64 partition_id, u64 status)
--
2.51.2.vfs.0.1
^ permalink raw reply related
* [PATCH V3 02/11] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
Make cosmetic changes:
o Rename struct pci_dev *dev to *pdev since there are cases of
struct device *dev in the file and all over the kernel
o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
of building different types of device IDs
o Fix checkpatch.pl issues with return and extraneous printk
o Replace spaces with tabs
o Rename struct hv_devid *xxx to struct hv_devid *hv_devid given code
paths involve many types of device IDs
o Fix indentation in a large if block by using goto.
There are no functional changes.
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
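Note (not part of the commit): hv_build_devid_type_pci() splits a 16-bit PCI
requester ID into bus, device and function. The standalone sketch below
shows that decomposition. The RID* macros are local stand-ins written to
behave like the kernel's PCI_DEVID, PCI_BUS_NUM, PCI_SLOT and PCI_FUNC, and
struct bdf and rid_to_bdf() are hypothetical names used only here.

```c
#include <stdint.h>

/* Local stand-ins for the kernel's PCI_DEVID/PCI_BUS_NUM/PCI_SLOT/PCI_FUNC. */
#define RID(bus, devfn)	((uint16_t)(((bus) << 8) | (devfn)))
#define RID_BUS(rid)	(((rid) >> 8) & 0xff)
#define RID_SLOT(rid)	(((rid) >> 3) & 0x1f)
#define RID_FUNC(rid)	((rid) & 0x07)

struct bdf {
	uint8_t bus;
	uint8_t device;
	uint8_t function;
};

/* Split a requester ID: bits 15:8 bus, bits 7:3 device, bits 2:0 function. */
static struct bdf rid_to_bdf(uint16_t rid)
{
	struct bdf out = {
		.bus = RID_BUS(rid),
		.device = RID_SLOT(rid),
		.function = RID_FUNC(rid),
	};

	return out;
}
```

For example, a requester ID of 0x3afd splits into bus 0x3a, device 0x1f and
function 5.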
arch/x86/hyperv/irqdomain.c | 198 +++++++++++++++++++-----------------
1 file changed, 104 insertions(+), 94 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 365e364268d9..b3ad50a874dc 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -1,5 +1,4 @@
// SPDX-License-Identifier: GPL-2.0
-
/*
* Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
*
@@ -14,8 +13,8 @@
#include <linux/irqchip/irq-msi-lib.h>
#include <asm/mshyperv.h>
-static int hv_map_interrupt(union hv_device_id device_id, bool level,
- int cpu, int vector, struct hv_interrupt_entry *entry)
+static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
+ int cpu, int vector, struct hv_interrupt_entry *ret_entry)
{
struct hv_input_map_device_interrupt *input;
struct hv_output_map_device_interrupt *output;
@@ -32,7 +31,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
intr_desc = &input->interrupt_descriptor;
memset(input, 0, sizeof(*input));
input->partition_id = hv_current_partition_id;
- input->device_id = device_id.as_uint64;
+ input->device_id = hv_devid.as_uint64;
intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
intr_desc->vector_count = 1;
intr_desc->target.vector = vector;
@@ -44,7 +43,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
intr_desc->target.vp_set.valid_bank_mask = 0;
intr_desc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
- nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
+ nr_bank = cpumask_to_vpset(&intr_desc->target.vp_set, cpumask_of(cpu));
if (nr_bank < 0) {
local_irq_restore(flags);
pr_err("%s: unable to generate VP set\n", __func__);
@@ -61,7 +60,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
input, output);
- *entry = output->interrupt_entry;
+ *ret_entry = output->interrupt_entry;
local_irq_restore(flags);
@@ -71,21 +70,19 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
return hv_result_to_errno(status);
}
-static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
{
unsigned long flags;
struct hv_input_unmap_device_interrupt *input;
- struct hv_interrupt_entry *intr_entry;
u64 status;
local_irq_save(flags);
input = *this_cpu_ptr(hyperv_pcpu_input_arg);
memset(input, 0, sizeof(*input));
- intr_entry = &input->interrupt_entry;
input->partition_id = hv_current_partition_id;
input->device_id = id;
- *intr_entry = *old_entry;
+ input->interrupt_entry = *irq_entry;
status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
local_irq_restore(flags);
@@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
return 0;
}
-static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
+static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
{
- union hv_device_id dev_id;
+ int pos;
+ union hv_device_id hv_devid;
struct rid_data data = {
.bridge = NULL,
- .rid = PCI_DEVID(dev->bus->number, dev->devfn)
+ .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
};
- pci_for_each_dma_alias(dev, get_rid_cb, &data);
+ pci_for_each_dma_alias(pdev, get_rid_cb, &data);
- dev_id.as_uint64 = 0;
- dev_id.device_type = HV_DEVICE_TYPE_PCI;
- dev_id.pci.segment = pci_domain_nr(dev->bus);
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_PCI;
+ hv_devid.pci.segment = pci_domain_nr(pdev->bus);
- dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
- dev_id.pci.bdf.device = PCI_SLOT(data.rid);
- dev_id.pci.bdf.function = PCI_FUNC(data.rid);
- dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
+ hv_devid.pci.bdf.bus = PCI_BUS_NUM(data.rid);
+ hv_devid.pci.bdf.device = PCI_SLOT(data.rid);
+ hv_devid.pci.bdf.function = PCI_FUNC(data.rid);
+ hv_devid.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
- if (data.bridge) {
- int pos;
+ if (data.bridge == NULL)
+ goto out;
- /*
- * Microsoft Hypervisor requires a bus range when the bridge is
- * running in PCI-X mode.
- *
- * To distinguish conventional vs PCI-X bridge, we can check
- * the bridge's PCI-X Secondary Status Register, Secondary Bus
- * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
- * Specification Revision 1.0 5.2.2.1.3.
- *
- * Value zero means it is in conventional mode, otherwise it is
- * in PCI-X mode.
- */
+ /*
+ * Microsoft Hypervisor requires a bus range when the bridge is
+ * running in PCI-X mode.
+ *
+ * To distinguish conventional vs PCI-X bridge, we can check
+ * the bridge's PCI-X Secondary Status Register, Secondary Bus
+ * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
+ * Specification Revision 1.0 5.2.2.1.3.
+ *
+ * Value zero means it is in conventional mode, otherwise it is
+ * in PCI-X mode.
+ */
- pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
- if (pos) {
- u16 status;
+ pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
+ if (pos) {
+ u16 status;
- pci_read_config_word(data.bridge, pos +
- PCI_X_BRIDGE_SSTATUS, &status);
+ pci_read_config_word(data.bridge, pos + PCI_X_BRIDGE_SSTATUS,
+ &status);
- if (status & PCI_X_SSTATUS_FREQ) {
- /* Non-zero, PCI-X mode */
- u8 sec_bus, sub_bus;
+ if (status & PCI_X_SSTATUS_FREQ) {
+ /* Non-zero, PCI-X mode */
+ u8 sec_bus, sub_bus;
- dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
+ hv_devid.pci.source_shadow =
+ HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
- pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
- dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
- pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
- dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
- }
+ pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS,
+ &sec_bus);
+ hv_devid.pci.shadow_bus_range.secondary_bus = sec_bus;
+ pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS,
+ &sub_bus);
+ hv_devid.pci.shadow_bus_range.subordinate_bus = sub_bus;
}
}
- return dev_id;
+out:
+ return hv_devid;
}
-/**
- * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+/*
+ * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
* @data: Describes the IRQ
* @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
*
@@ -188,22 +189,23 @@ int hv_map_msi_interrupt(struct irq_data *data,
{
struct irq_cfg *cfg = irqd_cfg(data);
struct hv_interrupt_entry dummy;
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
struct msi_desc *msidesc;
- struct pci_dev *dev;
+ struct pci_dev *pdev;
int cpu;
msidesc = irq_data_get_msi_desc(data);
- dev = msi_desc_to_pci_dev(msidesc);
- device_id = hv_build_pci_dev_id(dev);
+ pdev = msi_desc_to_pci_dev(msidesc);
+ hv_devid = hv_build_devid_type_pci(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
- return hv_map_interrupt(device_id, false, cpu, cfg->vector,
+ return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
out_entry ? out_entry : &dummy);
}
EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
-static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
+static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
+ struct msi_msg *msg)
{
/* High address is always 0 */
msg->address_hi = 0;
@@ -211,17 +213,19 @@ static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi
msg->data = entry->msi_entry.data.as_uint32;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry);
+
static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct hv_interrupt_entry *stored_entry;
struct irq_cfg *cfg = irqd_cfg(data);
struct msi_desc *msidesc;
- struct pci_dev *dev;
+ struct pci_dev *pdev;
int ret;
msidesc = irq_data_get_msi_desc(data);
- dev = msi_desc_to_pci_dev(msidesc);
+ pdev = msi_desc_to_pci_dev(msidesc);
if (!cfg) {
pr_debug("%s: cfg is NULL", __func__);
@@ -240,7 +244,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
stored_entry = data->chip_data;
data->chip_data = NULL;
- ret = hv_unmap_msi_interrupt(dev, stored_entry);
+ ret = hv_unmap_msi_interrupt(pdev, stored_entry);
kfree(stored_entry);
@@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
}
stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
- if (!stored_entry) {
- pr_debug("%s: failed to allocate chip data\n", __func__);
+ if (!stored_entry)
return;
- }
ret = hv_map_msi_interrupt(data, stored_entry);
if (ret) {
@@ -262,18 +264,21 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
data->chip_data = stored_entry;
entry_to_msi_msg(data->chip_data, msg);
-
- return;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry)
{
- return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry);
+ union hv_device_id hv_devid;
+
+ hv_devid = hv_build_devid_type_pci(pdev);
+ return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
}
-static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
+/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
+static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
{
- struct hv_interrupt_entry old_entry;
+ struct hv_interrupt_entry irq_entry;
struct msi_msg msg;
if (!irqd->chip_data) {
@@ -281,13 +286,13 @@ static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
return;
}
- old_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
- entry_to_msi_msg(&old_entry, &msg);
+ irq_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
+ entry_to_msi_msg(&irq_entry, &msg);
kfree(irqd->chip_data);
irqd->chip_data = NULL;
- (void)hv_unmap_msi_interrupt(dev, &old_entry);
+ (void)hv_unmap_msi_interrupt(pdev, &irq_entry);
}
/*
@@ -302,7 +307,8 @@ static struct irq_chip hv_pci_msi_controller = {
};
static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
- struct irq_domain *real_parent, struct msi_domain_info *info)
+ struct irq_domain *real_parent,
+ struct msi_domain_info *info)
{
struct irq_chip *chip = info->chip;
@@ -317,7 +323,8 @@ static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
}
#define HV_MSI_FLAGS_SUPPORTED (MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
-#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
+#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | \
+ MSI_FLAG_USE_DEF_CHIP_OPS)
static struct msi_parent_ops hv_msi_parent_ops = {
.supported_flags = HV_MSI_FLAGS_SUPPORTED,
@@ -329,14 +336,14 @@ static struct msi_parent_ops hv_msi_parent_ops = {
.init_dev_msi_info = hv_init_dev_msi_info,
};
-static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs,
- void *arg)
+/* Allocate nr_irqs IRQs for the given irq domain */
+static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
{
/*
- * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e. everything except
- * entry_to_msi_msg() should be in here.
+ * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e.
+ * everything except entry_to_msi_msg() should be in here.
*/
-
int ret;
ret = irq_domain_alloc_irqs_parent(d, virq, nr_irqs, arg);
@@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
return ret;
for (int i = 0; i < nr_irqs; ++i) {
- irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
- handle_edge_irq, NULL, "edge");
+ irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
+ NULL, handle_edge_irq, NULL, "edge");
}
+
return 0;
}
-static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
+static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
+ unsigned int nr_irqs)
{
for (int i = 0; i < nr_irqs; ++i) {
struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
@@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
}
+
irq_domain_free_irqs_top(d, virq, nr_irqs);
}
@@ -394,25 +404,25 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
{
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
- device_id.as_uint64 = 0;
- device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
- device_id.ioapic.ioapic_id = (u8)ioapic_id;
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+ hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_unmap_interrupt(device_id.as_uint64, entry);
+ return hv_unmap_interrupt(hv_devid.as_uint64, entry);
}
EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
struct hv_interrupt_entry *entry)
{
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
- device_id.as_uint64 = 0;
- device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
- device_id.ioapic.ioapic_id = (u8)ioapic_id;
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+ hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_map_interrupt(device_id, level, cpu, vector, entry);
+ return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
}
EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
--
2.51.2.vfs.0.1
^ permalink raw reply related
* [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
In-Reply-To: <20260512020259.1678627-1-mrathor@linux.microsoft.com>
This file actually implements irq remapping, so rename it to the more
appropriate hyperv-irq.c. A new file to implement hyperv iommu will be
introduced later. Also, it should not be tied to HYPERV_IOMMU, but to
CONFIG_HYPERV and IRQ_REMAP. The file already has #ifdef CONFIG_IRQ_REMAP.
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
MAINTAINERS | 2 +-
drivers/iommu/Makefile | 2 +-
drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +++---
drivers/iommu/irq_remapping.c | 2 +-
4 files changed, 6 insertions(+), 6 deletions(-)
rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
diff --git a/MAINTAINERS b/MAINTAINERS
index d1cc0e12fe1f..f803a6a38fee 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11914,7 +11914,7 @@ F: drivers/clocksource/hyperv_timer.c
F: drivers/hid/hid-hyperv.c
F: drivers/hv/
F: drivers/input/serio/hyperv-keyboard.c
-F: drivers/iommu/hyperv-iommu.c
+F: drivers/iommu/hyperv-irq.c
F: drivers/net/ethernet/microsoft/
F: drivers/net/hyperv/
F: drivers/pci/controller/pci-hyperv-intf.c
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 0275821f4ef9..335ea77cced6 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -30,7 +30,7 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
-obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
+obj-$(CONFIG_HYPERV) += hyperv-irq.o
obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-irq.c
similarity index 99%
rename from drivers/iommu/hyperv-iommu.c
rename to drivers/iommu/hyperv-irq.c
index 479103261ae6..d11076f906fb 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-irq.c
@@ -8,6 +8,8 @@
* Author : Lan Tianyu <Tianyu.Lan@microsoft.com>
*/
+#ifdef CONFIG_IRQ_REMAP
+
#include <linux/types.h>
#include <linux/interrupt.h>
#include <linux/irq.h>
@@ -24,8 +26,6 @@
#include "irq_remapping.h"
-#ifdef CONFIG_IRQ_REMAP
-
/*
* According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
* Redirection Table. Hyper-V exposes one single IO-APIC and so define
@@ -331,4 +331,4 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
.free = hyperv_root_irq_remapping_free,
};
-#endif
+#endif /* CONFIG_IRQ_REMAP */
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index c2443659812a..41bf65e4ea88 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -108,7 +108,7 @@ int __init irq_remapping_prepare(void)
else if (IS_ENABLED(CONFIG_AMD_IOMMU) &&
amd_iommu_irq_ops.prepare() == 0)
remap_ops = &amd_iommu_irq_ops;
- else if (IS_ENABLED(CONFIG_HYPERV_IOMMU) &&
+ else if (IS_ENABLED(CONFIG_HYPERV) &&
hyperv_irq_remap_ops.prepare() == 0)
remap_ops = &hyperv_irq_remap_ops;
else
--
2.51.2.vfs.0.1
^ permalink raw reply related
* [PATCH V3 00/11] PCI passthru on Hyper-V (Part I)
From: Mukesh R @ 2026-05-12 2:02 UTC (permalink / raw)
To: hpa, robin.murphy, robh, wei.liu, mrathor, mhklinux, muislam,
namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
linux-pci, linux-arch
Cc: kys, haiyangz, decui, longli, tglx, mingo, bp, dave.hansen, x86,
joro, will, lpieralisi, kwilczynski, bhelgaas, arnd, jacob.pan
Implement passthru of PCI devices to unprivileged virtual machines
(VMs) when Linux is running as a privileged VM on Microsoft Hyper-V
hypervisor. This support is made to fit within the workings of VFIO
framework, and any VMM needing to use it must use the VFIO subsystem.
This supports both full device passthru and SR-IOV based VFs.
At a high level, the hypervisor supports traditional mapped iommu domains
that use explicit map and unmap hypercalls for mapping and unmapping guest
RAM into the iommu subsystem. Hyper-V also has a concept of direct attach
devices whereby the iommu subsystem simply uses the guest HW page table
(ept/npt/..). This series adds support for both, and both are made to
work with the VFIO subsystem.
While this Part I focuses on memory mappings, Part II focuses on irq
remapping and irq migrations.
This series is rebased to: 5170a82e8921 (origin/hyperv-next)
Testing:
o Most testing done on hyperv-next:e733a9e28180 using Cloud Hypervisor (51).
o Limited testing on: 5170a82e8921
o Tested with impending Part II irq patches.
o All tests involved passthru of devices using MSIx.
o Following combinations were tested doing PF passthru:
- L1VH(1): test 1: Mellanox ConnectX-6 Lx passthru
test 2: NVIDIA Tesla T4 GPU.
test 3: Both of above simultaneous passthru
- Baremetal dom0/root: All of above.
o VF: Mellanox ConnectX-6 Lx passthru on baremetal dom0/root.
(1) L1VH: this is a semi privileged VM that runs on Windows root on
Hyper-V, and allows users to create more child VMs.
This series strives to establish a baseline. Some pending work items:
o arm64 : some delta to make this work on arm64 (in progress).
o Qemu and OpenVMM support (in progress).
o Further VF testing on l1vh
o device sleep/wakeup.
o More stress testing with high end GPUs
Changes in V3:
o patch #8: fix compiler issues in case of !CONFIG_HYPERV. Also, do forward
declaration of struct pci_dev instead of including pci.h.
o patch #9: minor changes to comments. Pass hv_domain instead of
iommu_domain to hv_iommu_detach_dev() since that's what it needs. Set
device private to null if attach fails. Clamp down the number of PFNs
passed to hv_iommu_map_pgs().
Changes in V2:
o rebase to 5170a82e8921
o minor fixes for arm64 build
o drop patch 03: "x86/hyperv: add insufficient memory support in irqdomain.c"
as that path is no longer used
o drop patch 08: "PCI: hv: rename hv_compose_msi_msg .. " and do it separately
outside this series.
o minor updates to commit messages
Changes in V1:
o patch 1: Don't tie hyperv-irq.c to CONFIG_HYPERV_IOMMU.
o patch 4: Redesigned to address security vulnerability found by copilot
with passing tgid as a parameter. Also, do tgid setting right
after setting pt_id.
o patch 5: Remove unused type parameter from mshv_device_ops.device_create
o patch 7: mshv_partition_ioctl_create_device cleanup on copy_to_user.
o patch 10: Add export of hv_build_devid_type_pci here to get rid of
patch 11.
o patch 12: Move functions to build device ids from patch 11 here for
the benefit of arm64. Rename file to: hyperv-iommu-root.c.
o patch 13: removed to be made part of interrupt part II of this support.
o patch 14: get rid of fast path to reduce review noise.
o New (last) patch to pin RAM regions if a device is passed through to a VM.
Thanks,
-Mukesh
Mukesh R (11):
iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c
x86/hyperv: Cosmetic changes in irqdomain.c for readability
mshv: Provide a way to get partition ID if running in a VMM process
mshv: Declarations and definitions for VFIO-MSHV bridge device
mshv: Implement mshv bridge device for VFIO
mshv: Add ioctl support for MSHV-VFIO bridge device
mshv: Import data structs around device passthru from hyperv headers
PCI: hv: VMBus and PCI device IDs for PCI passthru
x86/hyperv: Implement Hyper-V virtual IOMMU
mshv: Populate mmio mappings for PCI passthru
mshv: Mark mem regions as non-movable upfront if device passthru
MAINTAINERS | 3 +-
arch/x86/hyperv/irqdomain.c | 199 ++--
arch/x86/include/asm/mshyperv.h | 6 +
arch/x86/kernel/pci-dma.c | 2 +
drivers/hv/Makefile | 3 +-
drivers/hv/mshv_root.h | 21 +
drivers/hv/mshv_root_main.c | 266 ++++-
drivers/hv/mshv_vfio.c | 211 ++++
drivers/iommu/Kconfig | 5 +-
drivers/iommu/Makefile | 3 +-
drivers/iommu/hyperv-iommu-root.c | 918 ++++++++++++++++++
.../iommu/{hyperv-iommu.c => hyperv-irq.c} | 6 +-
drivers/iommu/irq_remapping.c | 2 +-
drivers/pci/controller/pci-hyperv.c | 24 +
include/asm-generic/mshyperv.h | 33 +
include/hyperv/hvgdk_mini.h | 11 +
include/hyperv/hvhdk_mini.h | 112 +++
include/linux/hyperv.h | 6 +
include/uapi/linux/mshv.h | 31 +
19 files changed, 1740 insertions(+), 122 deletions(-)
create mode 100644 drivers/hv/mshv_vfio.c
create mode 100644 drivers/iommu/hyperv-iommu-root.c
rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)
--
2.51.2.vfs.0.1
* [PATCH v3 04/10] RDMA: Convert drivers using sizeof() to ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC (permalink / raw)
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
Convert the pattern:
ib_copy_to_udata(udata, &resp, sizeof(resp));
Using Coccinelle:
@@
identifier resp;
expression udata;
@@
- ib_copy_to_udata(udata, &resp, sizeof(resp))
+ ib_respond_udata(udata, resp)
Run another pass with AI to propagate the return code correctly and
remove redundant prints.
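As a rough userspace illustration of the call-site change (not the kernel
implementation), the sketch below models a helper that takes the response
object itself and derives the size, with the caller propagating the returned
error code instead of hard-coding -EFAULT. Every model_* name is hypothetical
and exists only for this sketch; the real ib_respond_udata() is defined
elsewhere in this series and may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct ib_udata: only an output pointer. */
struct model_udata {
	void *outbuf;	/* NULL simulates a faulting user copy */
};

/* Hypothetical copy helper: returns 0 on success or a negative errno. */
static int model_copy_to_udata(struct model_udata *udata, const void *src,
			       size_t len)
{
	assert(udata != NULL);
	if (udata->outbuf == NULL)
		return -EFAULT;
	memcpy(udata->outbuf, src, len);
	return 0;
}

/* Takes the object, not a pointer/size pair, like the converted call sites. */
#define model_respond_udata(udata, resp) \
	model_copy_to_udata((udata), &(resp), sizeof(resp))

/* Hypothetical response struct for the example. */
struct model_pd_resp {
	unsigned int pdid;
};

static int model_alloc_pd(struct model_udata *udata, unsigned int pdid)
{
	struct model_pd_resp uresp = { .pdid = pdid };
	int ret;

	ret = model_respond_udata(udata, uresp);
	if (ret)
		return ret;	/* propagate the helper's code unchanged */
	return 0;
}
```

The point of the macro form is that the size can no longer drift away from
the object being copied, and the caller reports whatever the helper returned.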
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/cxgb4/provider.c | 9 ++++++---
drivers/infiniband/hw/cxgb4/qp.c | 4 ++--
drivers/infiniband/hw/erdma/erdma_verbs.c | 4 ++--
.../infiniband/hw/ionic/ionic_controlpath.c | 8 ++++----
drivers/infiniband/hw/mana/qp.c | 16 ++++------------
drivers/infiniband/hw/mlx4/main.c | 8 ++++----
drivers/infiniband/hw/mlx5/main.c | 5 +++--
drivers/infiniband/hw/mthca/mthca_provider.c | 5 +++--
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 19 +++++++------------
drivers/infiniband/hw/qedr/verbs.c | 7 +------
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 9 ++-------
drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c | 7 +++----
drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c | 6 +++---
.../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c | 11 +++++------
drivers/infiniband/sw/rdmavt/cq.c | 2 +-
drivers/infiniband/sw/rdmavt/qp.c | 3 +--
drivers/infiniband/sw/siw/siw_verbs.c | 10 +++++-----
17 files changed, 56 insertions(+), 77 deletions(-)
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 616019ac1da501..a119e8793aef40 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -52,6 +52,7 @@
#include <rdma/ib_smi.h>
#include <rdma/ib_umem.h>
#include <rdma/ib_user_verbs.h>
+#include <rdma/uverbs_ioctl.h>
#include "iw_cxgb4.h"
@@ -209,8 +210,9 @@ static int c4iw_allocate_pd(struct ib_pd *pd, struct ib_udata *udata)
{
struct c4iw_pd *php = to_c4iw_pd(pd);
struct ib_device *ibdev = pd->device;
- u32 pdid;
struct c4iw_dev *rhp;
+ u32 pdid;
+ int ret;
pr_debug("ibdev %p\n", ibdev);
rhp = (struct c4iw_dev *) ibdev;
@@ -223,9 +225,10 @@ static int c4iw_allocate_pd(struct ib_pd *pd, struct ib_udata *udata)
if (udata) {
struct c4iw_alloc_pd_resp uresp = {.pdid = php->pdid};
- if (ib_copy_to_udata(udata, &uresp, sizeof(uresp))) {
+ ret = ib_respond_udata(udata, uresp);
+ if (ret) {
c4iw_deallocate_pd(&php->ibpd, udata);
- return -EFAULT;
+ return ret;
}
}
mutex_lock(&rhp->rdev.stats.lock);
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index d9a86e4c546189..f9c7030ac6bfd0 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -2280,7 +2280,7 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
ucontext->key += PAGE_SIZE;
}
spin_unlock(&ucontext->mmap_lock);
- ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_free_ma_sync_key;
sq_key_mm->key = uresp.sq_key;
@@ -2777,7 +2777,7 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
uresp.srq_db_gts_key = ucontext->key;
ucontext->key += PAGE_SIZE;
spin_unlock(&ucontext->mmap_lock);
- ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_free_srq_db_key_mm;
srq_key_mm->key = uresp.srq_key;
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 9bba470c6e3257..92a65970ab6fa1 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -1055,7 +1055,7 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
uresp.qp_id = QP_ID(qp);
uresp.rq_offset = qp->user_qp.rq_offset;
- ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_out_cmd;
} else {
@@ -1571,7 +1571,7 @@ int erdma_alloc_ucontext(struct ib_ucontext *ibctx, struct ib_udata *udata)
uresp.dev_id = dev->pdev->device;
- ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_put_mmap_entries;
diff --git a/drivers/infiniband/hw/ionic/ionic_controlpath.c b/drivers/infiniband/hw/ionic/ionic_controlpath.c
index 7051a81cca9420..2b01345848ddb7 100644
--- a/drivers/infiniband/hw/ionic/ionic_controlpath.c
+++ b/drivers/infiniband/hw/ionic/ionic_controlpath.c
@@ -414,7 +414,7 @@ int ionic_alloc_ucontext(struct ib_ucontext *ibctx, struct ib_udata *udata)
if (dev->lif_cfg.rq_expdb)
resp.expdb_qtypes |= IONIC_EXPDB_RQ;
- rc = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ rc = ib_respond_udata(udata, resp);
if (rc)
goto err_resp;
@@ -752,7 +752,7 @@ int ionic_create_ah(struct ib_ah *ibah, struct rdma_ah_init_attr *init_attr,
if (udata) {
resp.ahid = ah->ahid;
- rc = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ rc = ib_respond_udata(udata, resp);
if (rc)
goto err_resp;
}
@@ -1263,7 +1263,7 @@ int ionic_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
if (udata) {
resp.udma_mask = vcq->udma_mask;
- rc = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ rc = ib_respond_udata(udata, resp);
if (rc)
goto err_resp;
}
@@ -2315,7 +2315,7 @@ int ionic_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
resp.rq_cmb = qp->rq_cmb;
}
- rc = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ rc = ib_respond_udata(udata, resp);
if (rc)
goto err_resp;
}
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index ecf5910dbf0702..39d9cdcc5df45a 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -212,13 +212,9 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
if (ret)
goto fail;
- ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
- if (ret) {
- ibdev_dbg(&mdev->ib_dev,
- "Failed to copy to udata create rss-qp, %d\n",
- ret);
+ ret = ib_respond_udata(udata, resp);
+ if (ret)
goto err_disable_vport_rx;
- }
kfree(mana_ind_table);
@@ -353,13 +349,9 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
resp.cqid = send_cq->queue.id;
resp.tx_vp_offset = pd->tx_vp_offset;
- err = ib_copy_to_udata(udata, &resp, sizeof(resp));
- if (err) {
- ibdev_dbg(&mdev->ib_dev,
- "Failed copy udata for create qp-raw, %d\n",
- err);
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_remove_cq_cb;
- }
return 0;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 16e9ce8138cb30..ce77e893065c92 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1121,16 +1121,16 @@ static int mlx4_ib_alloc_ucontext(struct ib_ucontext *uctx,
mutex_init(&context->wqn_ranges_mutex);
if (ibdev->ops.uverbs_abi_ver == MLX4_IB_UVERBS_NO_DEV_CAPS_ABI_VERSION)
- err = ib_copy_to_udata(udata, &resp_v3, sizeof(resp_v3));
+ err = ib_respond_udata(udata, resp_v3);
else
- err = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ err = ib_respond_udata(udata, resp);
if (err) {
mlx4_uar_free(to_mdev(ibdev)->dev, &context->uar);
- return -EFAULT;
+ return err;
}
- return err;
+ return 0;
}
static void mlx4_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 2bb5caf5a89266..0b3eda9b0ad0c4 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2792,9 +2792,10 @@ static int mlx5_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
pd->uid = uid;
if (udata) {
resp.pdn = pd->pdn;
- if (ib_copy_to_udata(udata, &resp, sizeof(resp))) {
+ err = ib_respond_udata(udata, resp);
+ if (err) {
mlx5_cmd_dealloc_pd(to_mdev(ibdev)->mdev, pd->pdn, uid);
- return -EFAULT;
+ return err;
}
}
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index e8d5d865c1f1f7..07c60797c86091 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -311,10 +311,11 @@ static int mthca_alloc_ucontext(struct ib_ucontext *uctx,
return err;
}
- if (ib_copy_to_udata(udata, &uresp, sizeof(uresp))) {
+ err = ib_respond_udata(udata, uresp);
+ if (err) {
mthca_cleanup_user_db_tab(to_mdev(ibdev), &context->uar, context->db_tab);
mthca_uar_free(to_mdev(ibdev), &context->uar);
- return -EFAULT;
+ return err;
}
context->reg_mr_warned = 0;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index a88cc5d84af828..2a174d0fe6ca1e 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -502,7 +502,7 @@ int ocrdma_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata)
resp.dpp_wqe_size = dev->attr.wqe_size;
memcpy(resp.fw_ver, dev->attr.fw_ver, sizeof(resp.fw_ver));
- status = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ status = ib_respond_udata(udata, resp);
if (status)
goto cpy_err;
return 0;
@@ -611,7 +611,7 @@ static int ocrdma_copy_pd_uresp(struct ocrdma_dev *dev, struct ocrdma_pd *pd,
rsp.dpp_page_addr_lo = dpp_page_addr;
}
- status = ib_copy_to_udata(udata, &rsp, sizeof(rsp));
+ status = ib_respond_udata(udata, rsp);
if (status)
goto ucopy_err;
@@ -945,12 +945,9 @@ static int ocrdma_copy_cq_uresp(struct ocrdma_dev *dev, struct ocrdma_cq *cq,
uresp.db_page_addr = ocrdma_get_db_addr(dev, uctx->cntxt_pd->id);
uresp.db_page_size = dev->nic_info.db_page_size;
uresp.phase_change = cq->phase_change ? 1 : 0;
- status = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
- if (status) {
- pr_err("%s(%d) copy error cqid=0x%x.\n",
- __func__, dev->id, cq->id);
+ status = ib_respond_udata(udata, uresp);
+ if (status)
goto err;
- }
status = ocrdma_add_mmap(uctx, uresp.db_page_addr, uresp.db_page_size);
if (status)
goto err;
@@ -1206,11 +1203,9 @@ static int ocrdma_copy_qp_uresp(struct ocrdma_qp *qp,
uresp.dpp_credit = dpp_credit_lmt;
uresp.dpp_offset = dpp_offset;
}
- status = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
- if (status) {
- pr_err("%s(%d) user copy error.\n", __func__, dev->id);
+ status = ib_respond_udata(udata, uresp);
+ if (status)
goto err;
- }
status = ocrdma_add_mmap(pd->uctx, uresp.sq_page_addr[0],
uresp.sq_page_size);
if (status)
@@ -1754,7 +1749,7 @@ static int ocrdma_copy_srq_uresp(struct ocrdma_dev *dev, struct ocrdma_srq *srq,
uresp.db_shift = 16;
}
- status = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ status = ib_respond_udata(udata, uresp);
if (status)
return status;
status = ocrdma_add_mmap(srq->pd->uctx, uresp.rq_page_addr[0],
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 679aa6f3a63bc5..3b86ea1cf88883 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -1251,15 +1251,10 @@ static int qedr_copy_srq_uresp(struct qedr_dev *dev,
struct qedr_srq *srq, struct ib_udata *udata)
{
struct qedr_create_srq_uresp uresp = {};
- int rc;
uresp.srq_id = srq->srq_id;
- rc = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
- if (rc)
- DP_ERR(dev, "create srq: problem copying data to user space\n");
-
- return rc;
+ return ib_respond_udata(udata, uresp);
}
static void qedr_copy_rq_uresp(struct qedr_dev *dev,
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 615de9c4209bf1..e887f03a84d063 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -82,7 +82,6 @@ static void usnic_ib_fw_string_to_u64(char *fw_ver_str, u64 *fw_ver)
static int usnic_ib_fill_create_qp_resp(struct usnic_ib_qp_grp *qp_grp,
struct ib_udata *udata)
{
- struct usnic_ib_dev *us_ibdev;
struct usnic_ib_create_qp_resp resp;
struct pci_dev *pdev;
struct vnic_dev_bar *bar;
@@ -92,7 +91,6 @@ static int usnic_ib_fill_create_qp_resp(struct usnic_ib_qp_grp *qp_grp,
memset(&resp, 0, sizeof(resp));
- us_ibdev = qp_grp->vf->pf;
pdev = usnic_vnic_get_pdev(qp_grp->vf->vnic);
if (!pdev) {
usnic_err("Failed to get pdev of qp_grp %d\n",
@@ -157,12 +155,9 @@ static int usnic_ib_fill_create_qp_resp(struct usnic_ib_qp_grp *qp_grp,
struct usnic_ib_qp_grp_flow, link);
resp.transport = default_flow->trans_type;
- err = ib_copy_to_udata(udata, &resp, sizeof(resp));
- if (err) {
- usnic_err("Failed to copy udata for %s",
- dev_name(&us_ibdev->ib_dev.dev));
+ err = ib_respond_udata(udata, resp);
+ if (err)
return err;
- }
return 0;
}
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c
index bc3adcc1ae67c2..d5bfdbfe1376d1 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c
@@ -203,11 +203,10 @@ int pvrdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
cq->uar = &context->uar;
/* Copy udata back. */
- if (ib_copy_to_udata(udata, &cq_resp, sizeof(cq_resp))) {
- dev_warn(&dev->pdev->dev,
- "failed to copy back udata\n");
+ ret = ib_respond_udata(udata, cq_resp);
+ if (ret) {
pvrdma_destroy_cq(&cq->ibcq, udata);
- return -EINVAL;
+ return ret;
}
}
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c
index d31fb692fcaafb..e69eadde6c26e9 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c
@@ -195,10 +195,10 @@ int pvrdma_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init_attr,
spin_unlock_irqrestore(&dev->srq_tbl_lock, flags);
/* Copy udata back. */
- if (ib_copy_to_udata(udata, &srq_resp, sizeof(srq_resp))) {
- dev_warn(&dev->pdev->dev, "failed to copy back udata\n");
+ ret = ib_respond_udata(udata, srq_resp);
+ if (ret) {
pvrdma_destroy_srq(&srq->ibsrq, udata);
- return -EINVAL;
+ return ret;
}
return 0;
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
index c7c2b41060e526..b9c3202b9545e3 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c
@@ -320,11 +320,11 @@ int pvrdma_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata)
/* copy back to user */
uresp.qp_tab_size = vdev->dsr->caps.max_qp;
- ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ ret = ib_respond_udata(udata, uresp);
if (ret) {
/* pvrdma_dealloc_ucontext() also frees the UAR */
pvrdma_dealloc_ucontext(&context->ibucontext);
- return -EFAULT;
+ return ret;
}
return 0;
@@ -430,11 +430,10 @@ int pvrdma_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
pd_resp.pdn = resp->pd_handle;
if (udata) {
- if (ib_copy_to_udata(udata, &pd_resp, sizeof(pd_resp))) {
- dev_warn(&dev->pdev->dev,
- "failed to copy back protection domain\n");
+ ret = ib_respond_udata(udata, pd_resp);
+ if (ret) {
pvrdma_dealloc_pd(&pd->ibpd, udata);
- return -EFAULT;
+ return ret;
}
}
diff --git a/drivers/infiniband/sw/rdmavt/cq.c b/drivers/infiniband/sw/rdmavt/cq.c
index 30904c6ae852db..45404611c9ce56 100644
--- a/drivers/infiniband/sw/rdmavt/cq.c
+++ b/drivers/infiniband/sw/rdmavt/cq.c
@@ -372,7 +372,7 @@ int rvt_resize_cq(struct ib_cq *ibcq, unsigned int cqe, struct ib_udata *udata)
if (udata && udata->outlen >= sizeof(__u64)) {
__u64 offset = 0;
- ret = ib_copy_to_udata(udata, &offset, sizeof(offset));
+ ret = ib_respond_udata(udata, offset);
if (ret)
goto bail_free;
}
diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c
index 816624e0991a0a..70e7d08fdce692 100644
--- a/drivers/infiniband/sw/rdmavt/qp.c
+++ b/drivers/infiniband/sw/rdmavt/qp.c
@@ -1192,8 +1192,7 @@ int rvt_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init_attr,
if (!qp->r_rq.wq) {
__u64 offset = 0;
- ret = ib_copy_to_udata(udata, &offset,
- sizeof(offset));
+ ret = ib_respond_udata(udata, offset);
if (ret)
goto bail_qpn;
} else {
diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
index 1e1d262a4ae2db..b34f3d6547ffc7 100644
--- a/drivers/infiniband/sw/siw/siw_verbs.c
+++ b/drivers/infiniband/sw/siw/siw_verbs.c
@@ -102,7 +102,7 @@ int siw_alloc_ucontext(struct ib_ucontext *base_ctx, struct ib_udata *udata)
rv = -EINVAL;
goto err_out;
}
- rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rv = ib_respond_udata(udata, uresp);
if (rv)
goto err_out;
@@ -472,7 +472,7 @@ int siw_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
rv = -EINVAL;
goto err_out_xa;
}
- rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rv = ib_respond_udata(udata, uresp);
if (rv)
goto err_out_xa;
}
@@ -1205,7 +1205,7 @@ int siw_create_cq(struct ib_cq *base_cq, const struct ib_cq_init_attr *attr,
rv = -EINVAL;
goto err_out;
}
- rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rv = ib_respond_udata(udata, uresp);
if (rv)
goto err_out;
}
@@ -1386,7 +1386,7 @@ struct ib_mr *siw_reg_user_mr(struct ib_pd *pd, u64 start, u64 len,
rv = -EINVAL;
goto err_out;
}
- rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rv = ib_respond_udata(udata, uresp);
if (rv)
goto err_out;
}
@@ -1646,7 +1646,7 @@ int siw_create_srq(struct ib_srq *base_srq,
rv = -EINVAL;
goto err_out;
}
- rv = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rv = ib_respond_udata(udata, uresp);
if (rv)
goto err_out;
}
--
2.43.0
* [PATCH v3 09/10] RDMA: Add missed = {} initialization to uresp structs
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC (permalink / raw)
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
All of these are fully initialized so no bugs are being fixed. Add
the missing initializer as a precaution against future changes.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 2 +-
drivers/infiniband/hw/erdma/erdma_verbs.c | 2 +-
drivers/infiniband/hw/mlx4/main.c | 4 ++--
drivers/infiniband/hw/mlx5/main.c | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 7ed294516b7edb..ccb362d6d2e669 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1884,7 +1884,7 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
}
if (udata) {
- struct bnxt_re_qp_resp resp;
+ struct bnxt_re_qp_resp resp = {};
resp.qpid = qp->qplib_qp.id;
resp.rsvd = 0;
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 92a65970ab6fa1..c8a35337ba51e8 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -1977,7 +1977,7 @@ int erdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
if (!rdma_is_kernel_res(&ibcq->res)) {
struct erdma_ureq_create_cq ureq;
- struct erdma_uresp_create_cq uresp;
+ struct erdma_uresp_create_cq uresp = {};
ret = ib_copy_validate_udata_in(udata, ureq, rsvd0);
if (ret)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 25f9738bd77223..d50743f090bf21 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1090,8 +1090,8 @@ static int mlx4_ib_alloc_ucontext(struct ib_ucontext *uctx,
struct ib_device *ibdev = uctx->device;
struct mlx4_ib_dev *dev = to_mdev(ibdev);
struct mlx4_ib_ucontext *context = to_mucontext(uctx);
- struct mlx4_ib_alloc_ucontext_resp_v3 resp_v3;
- struct mlx4_ib_alloc_ucontext_resp resp;
+ struct mlx4_ib_alloc_ucontext_resp_v3 resp_v3 = {};
+ struct mlx4_ib_alloc_ucontext_resp resp = {};
int err;
if (!dev->ib_active)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index fb9689e453bce4..03550b12ee6d1c 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2773,7 +2773,7 @@ static int mlx5_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
{
struct mlx5_ib_pd *pd = to_mpd(ibpd);
struct ib_device *ibdev = ibpd->device;
- struct mlx5_ib_alloc_pd_resp resp;
+ struct mlx5_ib_alloc_pd_resp resp = {};
int err;
u32 out[MLX5_ST_SZ_DW(alloc_pd_out)] = {};
u32 in[MLX5_ST_SZ_DW(alloc_pd_in)] = {};
--
2.43.0
* [PATCH v3 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC (permalink / raw)
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
Use copy_to_user() directly since the data is not being placed in the
udata response memory.
It is unclear why this is trying to do two copies, but leave it alone.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/sw/rdmavt/srq.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
index fe125bf85b2726..d022aa56c5bfd5 100644
--- a/drivers/infiniband/sw/rdmavt/srq.c
+++ b/drivers/infiniband/sw/rdmavt/srq.c
@@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
struct rvt_rq tmp_rq = {};
+ __u64 offset_addr;
int ret = 0;
if (attr_mask & IB_SRQ_MAX_WR) {
@@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
return -ENOMEM;
/* Check that we can write the offset to mmap. */
if (udata && udata->inlen >= sizeof(__u64)) {
- __u64 offset_addr;
__u64 offset = 0;
ret = ib_copy_from_udata(&offset_addr, udata,
sizeof(offset_addr));
if (ret)
goto bail_free;
- udata->outbuf = (void __user *)
- (unsigned long)offset_addr;
- ret = ib_copy_to_udata(udata, &offset,
- sizeof(offset));
- if (ret)
+ if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
+ sizeof(offset))) {
+ ret = -EFAULT;
goto bail_free;
+ }
}
spin_lock_irq(&srq->rq.kwq->c_lock);
@@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
* See rvt_mmap() for details.
*/
if (udata && udata->inlen >= sizeof(__u64)) {
- ret = ib_copy_to_udata(udata, &ip->offset,
- sizeof(ip->offset));
- if (ret)
- return ret;
+ if (copy_to_user(u64_to_user_ptr(offset_addr),
+ &ip->offset,
+ sizeof(ip->offset)))
+ return -EFAULT;
}
/*
--
2.43.0
* [PATCH v3 06/10] RDMA/qedr: Replace qedr_ib_copy_to_udata() with ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC (permalink / raw)
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
This is another instance of the min() pattern.
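For readers unfamiliar with the min() pattern, here is a minimal userspace
sketch of it: the response is truncated to whatever output length the caller
provided, so an older userspace with a shorter response struct still works.
It assumes, as this commit message implies but the diff does not show, that
ib_respond_udata() performs an equivalent clamp. The model_* names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct ib_udata. */
struct model_udata {
	void *outbuf;	/* destination buffer; NULL simulates a fault */
	size_t outlen;	/* bytes the caller made available */
};

/* Copy at most outlen bytes of the response; never write past the buffer. */
static int model_respond(struct model_udata *udata, const void *src,
			 size_t len)
{
	size_t n;

	assert(udata != NULL && src != NULL);
	n = len < udata->outlen ? len : udata->outlen;	/* the min() step */
	if (n == 0)
		return 0;
	if (udata->outbuf == NULL)
		return -EFAULT;
	memcpy(udata->outbuf, src, n);
	return 0;
}
```

With a two-field response and an output buffer sized for only one field, only
the first field is written and the memory after it is left untouched.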
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/qedr/verbs.c | 35 +++++-------------------------
1 file changed, 6 insertions(+), 29 deletions(-)
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 3b86ea1cf88883..79190c5b8b50b0 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -64,14 +64,6 @@ enum {
QEDR_USER_MMAP_PHYS_PAGE,
};
-static inline int qedr_ib_copy_to_udata(struct ib_udata *udata, void *src,
- size_t len)
-{
- size_t min_len = min_t(size_t, len, udata->outlen);
-
- return ib_copy_to_udata(udata, src, min_len);
-}
-
int qedr_query_pkey(struct ib_device *ibdev, u32 port, u16 index, u16 *pkey)
{
if (index >= QEDR_ROCE_PKEY_TABLE_LEN)
@@ -340,7 +332,7 @@ int qedr_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata)
uresp.sges_per_srq_wr = dev->attr.max_srq_sge;
uresp.max_cqes = QEDR_MAX_CQES;
- rc = qedr_ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rc = ib_respond_udata(udata, uresp);
if (rc)
goto err;
@@ -459,9 +451,8 @@ int qedr_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
struct qedr_ucontext *context = rdma_udata_to_drv_context(
udata, struct qedr_ucontext, ibucontext);
- rc = qedr_ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+ rc = ib_respond_udata(udata, uresp);
if (rc) {
- DP_ERR(dev, "copy error pd_id=0x%x.\n", pd_id);
dev->ops->rdma_dealloc_pd(dev->rdma_ctx, pd_id);
return rc;
}
@@ -696,12 +687,10 @@ static void qedr_db_recovery_del(struct qedr_dev *dev,
dev->ops->common->db_recovery_del(dev->cdev, db_addr, db_data);
}
-static int qedr_copy_cq_uresp(struct qedr_dev *dev,
- struct qedr_cq *cq, struct ib_udata *udata,
+static int qedr_copy_cq_uresp(struct qedr_cq *cq, struct ib_udata *udata,
u32 db_offset)
{
struct qedr_create_cq_uresp uresp;
- int rc;
memset(&uresp, 0, sizeof(uresp));
@@ -711,11 +700,7 @@ static int qedr_copy_cq_uresp(struct qedr_dev *dev,
uresp.db_rec_addr =
rdma_user_mmap_get_offset(cq->q.db_mmap_entry);
- rc = qedr_ib_copy_to_udata(udata, &uresp, sizeof(uresp));
- if (rc)
- DP_ERR(dev, "copy error cqid=0x%x.\n", cq->icid);
-
- return rc;
+ return ib_respond_udata(udata, uresp);
}
static void consume_cqe(struct qedr_cq *cq)
@@ -994,7 +979,7 @@ int qedr_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
spin_lock_init(&cq->cq_lock);
if (udata) {
- rc = qedr_copy_cq_uresp(dev, cq, udata, db_offset);
+ rc = qedr_copy_cq_uresp(cq, udata, db_offset);
if (rc)
goto err2;
@@ -1298,8 +1283,6 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
struct qedr_qp *qp, struct ib_udata *udata,
struct qedr_create_qp_uresp *uresp)
{
- int rc;
-
memset(uresp, 0, sizeof(*uresp));
if (qedr_qp_has_sq(qp))
@@ -1311,13 +1294,7 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
uresp->atomic_supported = dev->atomic_cap != IB_ATOMIC_NONE;
uresp->qp_id = qp->qp_id;
- rc = qedr_ib_copy_to_udata(udata, uresp, sizeof(*uresp));
- if (rc)
- DP_ERR(dev,
- "create qp: failed a copy to user space with qp icid=0x%x.\n",
- qp->icid);
-
- return rc;
+ return ib_respond_udata(udata, *uresp);
}
static void qedr_reset_qp_hwq_info(struct qedr_qp_hwq_info *qph)
--
2.43.0
* [PATCH v3 10/10] RDMA: Replace memset with = {} pattern for ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC (permalink / raw)
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
Most drivers do this already, but some open-code a memset. Switch
all instances found. qedr_copy_qp_uresp() is already called with
zeroed memory so that memset is redundant.
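For illustration only, a userspace sketch of the two styles this patch swaps. The struct and function names below are hypothetical and are not taken from any driver. The empty initializer is GNU C / C23 syntax. The sketch compares field values only and makes no claim about padding bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response layout, not taken from any driver. */
struct demo_uresp {
	uint32_t id;
	uint32_t flags;
};

/* Old style: declare, then clear with memset() before filling fields. */
static struct demo_uresp fill_with_memset(uint32_t id)
{
	struct demo_uresp uresp;

	memset(&uresp, 0, sizeof(uresp));
	uresp.id = id;
	return uresp;
}

/* New style: zero at declaration using the empty initializer. */
static struct demo_uresp fill_with_initializer(uint32_t id)
{
	struct demo_uresp uresp = {};

	uresp.id = id;
	return uresp;
}

/* Both styles must yield the same field values; flags stays zero. */
static void check_styles_agree(uint32_t id)
{
	struct demo_uresp a = fill_with_memset(id);
	struct demo_uresp b = fill_with_initializer(id);

	assert(a.id == b.id);
	assert(a.flags == 0 && b.flags == 0);
}
```

Zeroing at the declaration also covers early paths that fill only some fields, which a memset() placed later in the function would not.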
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/cxgb4/cq.c | 3 +--
drivers/infiniband/hw/cxgb4/qp.c | 6 ++----
drivers/infiniband/hw/erdma/erdma_verbs.c | 4 +---
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 12 ++++--------
drivers/infiniband/hw/qedr/verbs.c | 6 +-----
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 4 +---
6 files changed, 10 insertions(+), 25 deletions(-)
diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 47508df4cec023..d1517f2560b981 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -1004,7 +1004,7 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
struct c4iw_dev *rhp = to_c4iw_dev(ibcq->device);
struct c4iw_cq *chp = to_c4iw_cq(ibcq);
struct c4iw_create_cq ucmd;
- struct c4iw_create_cq_resp uresp;
+ struct c4iw_create_cq_resp uresp = {};
int ret, wr_len;
size_t memsize, hwentries;
struct c4iw_mm_entry *mm, *mm2;
@@ -1102,7 +1102,6 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
if (!mm2)
goto err_free_mm;
- memset(&uresp, 0, sizeof(uresp));
uresp.qid_mask = rhp->rdev.cqmask;
uresp.cqid = chp->cq.cqid;
uresp.size = chp->cq.size;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index f9c7030ac6bfd0..e295f79e0cd3e5 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -2120,7 +2120,7 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
struct c4iw_pd *php;
struct c4iw_cq *schp;
struct c4iw_cq *rchp;
- struct c4iw_create_qp_resp uresp;
+ struct c4iw_create_qp_resp uresp = {};
unsigned int sqsize, rqsize = 0;
struct c4iw_ucontext *ucontext = rdma_udata_to_drv_context(
udata, struct c4iw_ucontext, ibucontext);
@@ -2242,7 +2242,6 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
goto err_free_sq_db_key;
}
}
- memset(&uresp, 0, sizeof(uresp));
if (t4_sq_onchip(&qhp->wq.sq)) {
ma_sync_key_mm = kmalloc_obj(*ma_sync_key_mm);
if (!ma_sync_key_mm) {
@@ -2686,7 +2685,7 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
struct c4iw_dev *rhp;
struct c4iw_srq *srq = to_c4iw_srq(ib_srq);
struct c4iw_pd *php;
- struct c4iw_create_srq_resp uresp;
+ struct c4iw_create_srq_resp uresp = {};
struct c4iw_ucontext *ucontext;
struct c4iw_mm_entry *srq_key_mm, *srq_db_key_mm;
int rqsize;
@@ -2764,7 +2763,6 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
ret = -ENOMEM;
goto err_free_srq_key_mm;
}
- memset(&uresp, 0, sizeof(uresp));
uresp.flags = srq->flags;
uresp.qid_mask = rhp->rdev.qpmask;
uresp.srqid = srq->wq.qid;
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index c8a35337ba51e8..b59c2e3a5306d1 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -996,7 +996,7 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
struct erdma_ucontext *uctx = rdma_udata_to_drv_context(
udata, struct erdma_ucontext, ibucontext);
struct erdma_ureq_create_qp ureq;
- struct erdma_uresp_create_qp uresp;
+ struct erdma_uresp_create_qp uresp = {};
void *old_entry;
int ret = 0;
@@ -1048,8 +1048,6 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
if (ret)
goto err_out_xa;
- memset(&uresp, 0, sizeof(uresp));
-
uresp.num_sqe = qp->attrs.sq_size;
uresp.num_rqe = qp->attrs.rq_size;
uresp.qp_id = QP_ID(qp);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 2a174d0fe6ca1e..383f1d9c15d151 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -586,11 +586,10 @@ static int ocrdma_copy_pd_uresp(struct ocrdma_dev *dev, struct ocrdma_pd *pd,
u64 db_page_addr;
u64 dpp_page_addr = 0;
u32 db_page_size;
- struct ocrdma_alloc_pd_uresp rsp;
+ struct ocrdma_alloc_pd_uresp rsp = {};
struct ocrdma_ucontext *uctx = rdma_udata_to_drv_context(
udata, struct ocrdma_ucontext, ibucontext);
- memset(&rsp, 0, sizeof(rsp));
rsp.id = pd->id;
rsp.dpp_enabled = pd->dpp_enabled;
db_page_addr = ocrdma_get_db_addr(dev, pd->id);
@@ -930,13 +929,12 @@ static int ocrdma_copy_cq_uresp(struct ocrdma_dev *dev, struct ocrdma_cq *cq,
int status;
struct ocrdma_ucontext *uctx = rdma_udata_to_drv_context(
udata, struct ocrdma_ucontext, ibucontext);
- struct ocrdma_create_cq_uresp uresp;
+ struct ocrdma_create_cq_uresp uresp = {};
/* this must be user flow! */
if (!udata)
return -EINVAL;
- memset(&uresp, 0, sizeof(uresp));
uresp.cq_id = cq->id;
uresp.page_size = PAGE_ALIGN(cq->len);
uresp.num_pages = 1;
@@ -1173,11 +1171,10 @@ static int ocrdma_copy_qp_uresp(struct ocrdma_qp *qp,
{
int status;
u64 usr_db;
- struct ocrdma_create_qp_uresp uresp;
+ struct ocrdma_create_qp_uresp uresp = {};
struct ocrdma_pd *pd = qp->pd;
struct ocrdma_dev *dev = get_ocrdma_dev(pd->ibpd.device);
- memset(&uresp, 0, sizeof(uresp));
usr_db = dev->nic_info.unmapped_db +
(pd->id * dev->nic_info.db_page_size);
uresp.qp_id = qp->id;
@@ -1730,9 +1727,8 @@ static int ocrdma_copy_srq_uresp(struct ocrdma_dev *dev, struct ocrdma_srq *srq,
struct ib_udata *udata)
{
int status;
- struct ocrdma_create_srq_uresp uresp;
+ struct ocrdma_create_srq_uresp uresp = {};
- memset(&uresp, 0, sizeof(uresp));
uresp.rq_dbid = srq->rq.dbid;
uresp.num_rq_pages = 1;
uresp.rq_page_addr[0] = virt_to_phys(srq->rq.va);
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 79190c5b8b50b0..1af908275ca729 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -690,9 +690,7 @@ static void qedr_db_recovery_del(struct qedr_dev *dev,
static int qedr_copy_cq_uresp(struct qedr_cq *cq, struct ib_udata *udata,
u32 db_offset)
{
- struct qedr_create_cq_uresp uresp;
-
- memset(&uresp, 0, sizeof(uresp));
+ struct qedr_create_cq_uresp uresp = {};
uresp.db_offset = db_offset;
uresp.icid = cq->icid;
@@ -1283,8 +1281,6 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
struct qedr_qp *qp, struct ib_udata *udata,
struct qedr_create_qp_uresp *uresp)
{
- memset(uresp, 0, sizeof(*uresp));
-
if (qedr_qp_has_sq(qp))
qedr_copy_sq_uresp(dev, uresp, qp);
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index e887f03a84d063..261f18a8368543 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -82,15 +82,13 @@ static void usnic_ib_fw_string_to_u64(char *fw_ver_str, u64 *fw_ver)
static int usnic_ib_fill_create_qp_resp(struct usnic_ib_qp_grp *qp_grp,
struct ib_udata *udata)
{
- struct usnic_ib_create_qp_resp resp;
+ struct usnic_ib_create_qp_resp resp = {};
struct pci_dev *pdev;
struct vnic_dev_bar *bar;
struct usnic_vnic_res_chunk *chunk;
struct usnic_ib_qp_grp_flow *default_flow;
int i, err;
- memset(&resp, 0, sizeof(resp));
-
pdev = usnic_vnic_get_pdev(qp_grp->vf->vnic);
if (!pdev) {
usnic_err("Failed to get pdev of qp_grp %d\n",
--
2.43.0
* [PATCH v3 03/10] RDMA: Convert drivers using min to ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
Convert the pattern:
ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen));
Using Coccinelle:
@@
identifier resp;
expression udata;
@@
- ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen))
+ ib_respond_udata(udata, resp)
@@
identifier resp;
expression udata;
@@
- ib_copy_to_udata(udata, &resp, min(udata->outlen, sizeof(resp)))
+ ib_respond_udata(udata, resp)
Run another pass with AI to propagate the return code correctly and
remove redundant prints.
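A rough userspace model of the size clamp that the semantic patch folds into the helper. Everything prefixed model_ or demo_ is a hypothetical stand-in, not the kernel implementation. It models only the min(sizeof(resp), outlen) behaviour visible in the pattern above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the output side of struct ib_udata (hypothetical). */
struct model_udata {
	void *outbuf;
	size_t outlen;
};

/* Models ib_copy_to_udata(): copy len bytes into the output buffer. */
static int model_copy_to_udata(struct model_udata *udata, const void *src,
			       size_t len)
{
	assert(udata && src);
	assert(len <= udata->outlen);
	memcpy(udata->outbuf, src, len);
	return 0;
}

#define MODEL_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Models the converted call: the size clamp lives inside the helper. */
#define model_respond_udata(udata, resp) \
	model_copy_to_udata((udata), &(resp), \
			    MODEL_MIN(sizeof(resp), (udata)->outlen))

/* Hypothetical 8-byte response. */
struct demo_resp {
	uint32_t cqn;
	uint32_t reserved;
};

/* Returns how many leading bytes of the user buffer were overwritten. */
static size_t demo_bytes_written(size_t outlen)
{
	unsigned char buf[16];
	struct model_udata udata = { .outbuf = buf, .outlen = outlen };
	struct demo_resp resp = { .cqn = 0x11111111, .reserved = 0x22222222 };
	size_t n = 0;

	assert(outlen <= sizeof(buf));
	memset(buf, 0xAA, sizeof(buf)); /* sentinel marks untouched bytes */
	if (model_respond_udata(&udata, resp))
		return 0;
	while (n < sizeof(buf) && buf[n] != 0xAA)
		n++;
	return n;
}
```

In this model a caller with a short buffer gets a truncated response, and a caller with a long buffer gets exactly sizeof(resp) bytes. That is the same result as the open-coded min() at each call site.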
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/efa/efa_verbs.c | 44 +++++-------------
drivers/infiniband/hw/erdma/erdma_verbs.c | 3 +-
drivers/infiniband/hw/hns/hns_roce_ah.c | 4 +-
drivers/infiniband/hw/hns/hns_roce_cq.c | 3 +-
drivers/infiniband/hw/hns/hns_roce_main.c | 3 +-
drivers/infiniband/hw/hns/hns_roce_pd.c | 8 ++--
drivers/infiniband/hw/hns/hns_roce_qp.c | 13 ++----
drivers/infiniband/hw/hns/hns_roce_srq.c | 6 +--
drivers/infiniband/hw/irdma/verbs.c | 48 +++++++-------------
drivers/infiniband/hw/mana/cq.c | 6 +--
drivers/infiniband/hw/mana/qp.c | 6 +--
drivers/infiniband/hw/mlx5/srq.c | 7 +--
drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c | 8 ++--
13 files changed, 49 insertions(+), 110 deletions(-)
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 3ad5d6e27b1590..395290ab05847a 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -270,13 +270,9 @@ int efa_query_device(struct ib_device *ibdev,
if (dev->neqs)
resp.device_caps |= EFA_QUERY_DEVICE_CAPS_CQ_NOTIFICATIONS;
- err = ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(ibdev,
- "Failed to copy udata for query_device\n");
+ err = ib_respond_udata(udata, resp);
+ if (err)
return err;
- }
}
return 0;
@@ -442,13 +438,9 @@ int efa_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
resp.pdn = result.pdn;
if (udata->outlen) {
- err = ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(&dev->ibdev,
- "Failed to copy udata for alloc_pd\n");
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_dealloc_pd;
- }
}
ibdev_dbg(&dev->ibdev, "Allocated pd[%d]\n", pd->pdn);
@@ -782,14 +774,9 @@ int efa_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init_attr,
qp->max_inline_data = init_attr->cap.max_inline_data;
if (udata->outlen) {
- err = ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(&dev->ibdev,
- "Failed to copy udata for qp[%u]\n",
- create_qp_resp.qp_num);
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_remove_mmap_entries;
- }
}
ibdev_dbg(&dev->ibdev, "Created qp[%d]\n", qp->ibqp.qp_num);
@@ -1226,13 +1213,9 @@ int efa_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
}
if (udata->outlen) {
- err = ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(ibdev,
- "Failed to copy udata for create_cq\n");
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_xa_erase;
- }
}
ibdev_dbg(ibdev, "Created cq[%d], cq depth[%u]. dma[%pad] virt[0x%p]\n",
@@ -1935,8 +1918,7 @@ int efa_alloc_ucontext(struct ib_ucontext *ibucontext, struct ib_udata *udata)
resp.max_tx_batch = dev->dev_attr.max_tx_batch;
resp.min_sq_wr = dev->dev_attr.min_sq_depth;
- err = ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen));
+ err = ib_respond_udata(udata, resp);
if (err)
goto err_dealloc_uar;
@@ -2087,13 +2069,9 @@ int efa_create_ah(struct ib_ah *ibah,
resp.efa_address_handle = result.ah;
if (udata->outlen) {
- err = ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(&dev->ibdev,
- "Failed to copy udata for create_ah response\n");
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_destroy_ah;
- }
}
ibdev_dbg(&dev->ibdev, "Created ah[%d]\n", ah->ah);
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 5523b4e151e1ff..9bba470c6e3257 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -1990,8 +1990,7 @@ int erdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
uresp.cq_id = cq->cqn;
uresp.num_cqe = depth;
- ret = ib_copy_to_udata(udata, &uresp,
- min(sizeof(uresp), udata->outlen));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_free_res;
} else {
diff --git a/drivers/infiniband/hw/hns/hns_roce_ah.c b/drivers/infiniband/hw/hns/hns_roce_ah.c
index 8a605da8a93c97..925ddf15b68102 100644
--- a/drivers/infiniband/hw/hns/hns_roce_ah.c
+++ b/drivers/infiniband/hw/hns/hns_roce_ah.c
@@ -32,6 +32,7 @@
#include <rdma/ib_addr.h>
#include <rdma/ib_cache.h>
+#include <rdma/uverbs_ioctl.h>
#include "hns_roce_device.h"
#include "hns_roce_hw_v2.h"
@@ -112,8 +113,7 @@ int hns_roce_create_ah(struct ib_ah *ibah, struct rdma_ah_init_attr *init_attr,
resp.priority = ah->av.sl;
resp.tc_mode = tc_mode;
memcpy(resp.dmac, ah_attr->roce.dmac, ETH_ALEN);
- ret = ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)));
+ ret = ib_respond_udata(udata, resp);
}
err_out:
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index 621568e114054b..24de651f735e03 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -452,8 +452,7 @@ int hns_roce_create_cq(struct ib_cq *ib_cq, const struct ib_cq_init_attr *attr,
if (udata) {
resp.cqn = hr_cq->cqn;
- ret = ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)));
+ ret = ib_respond_udata(udata, resp);
if (ret)
goto err_cqc;
}
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 0dbe99aab6ad21..c17ff5347a0147 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -477,8 +477,7 @@ static int hns_roce_alloc_ucontext(struct ib_ucontext *uctx,
resp.cqe_size = hr_dev->caps.cqe_sz;
- ret = ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)));
+ ret = ib_respond_udata(udata, resp);
if (ret)
goto error_fail_copy_to_udata;
diff --git a/drivers/infiniband/hw/hns/hns_roce_pd.c b/drivers/infiniband/hw/hns/hns_roce_pd.c
index 225c3e328e0e08..73bb000574c50d 100644
--- a/drivers/infiniband/hw/hns/hns_roce_pd.c
+++ b/drivers/infiniband/hw/hns/hns_roce_pd.c
@@ -30,6 +30,7 @@
* SOFTWARE.
*/
+#include <rdma/uverbs_ioctl.h>
#include "hns_roce_device.h"
void hns_roce_init_pd_table(struct hns_roce_dev *hr_dev)
@@ -61,12 +62,9 @@ int hns_roce_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
if (udata) {
struct hns_roce_ib_alloc_pd_resp resp = {.pdn = pd->pdn};
- ret = ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)));
- if (ret) {
+ ret = ib_respond_udata(udata, resp);
+ if (ret)
ida_free(&pd_ida->ida, id);
- ibdev_err(ib_dev, "failed to copy to udata, ret = %d\n", ret);
- }
}
return ret;
diff --git a/drivers/infiniband/hw/hns/hns_roce_qp.c b/drivers/infiniband/hw/hns/hns_roce_qp.c
index bf04ee84a94392..e333a8c4acb52c 100644
--- a/drivers/infiniband/hw/hns/hns_roce_qp.c
+++ b/drivers/infiniband/hw/hns/hns_roce_qp.c
@@ -1236,12 +1236,9 @@ static int hns_roce_create_qp_common(struct hns_roce_dev *hr_dev,
if (udata) {
resp.cap_flags = hr_qp->en_flags;
- ret = ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)));
- if (ret) {
- ibdev_err(ibdev, "copy qp resp failed!\n");
+ ret = ib_respond_udata(udata, resp);
+ if (ret)
goto err_flow_ctrl;
- }
}
if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_QP_FLOW_CTRL) {
@@ -1494,11 +1491,7 @@ int hns_roce_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
if (udata && udata->outlen) {
resp.tc_mode = hr_qp->tc_mode;
resp.priority = hr_qp->sl;
- ret = ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)));
- if (ret)
- ibdev_err_ratelimited(&hr_dev->ib_dev,
- "failed to copy modify qp resp.\n");
+ ret = ib_respond_udata(udata, resp);
}
out:
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
index 8b94cbdfa54dfa..241fc9980f4f51 100644
--- a/drivers/infiniband/hw/hns/hns_roce_srq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
@@ -477,11 +477,9 @@ int hns_roce_create_srq(struct ib_srq *ib_srq,
if (udata) {
resp.cap_flags = srq->cap_flags;
resp.srqn = srq->srqn;
- if (ib_copy_to_udata(udata, &resp,
- min(udata->outlen, sizeof(resp)))) {
- ret = -EFAULT;
+ ret = ib_respond_udata(udata, resp);
+ if (ret)
goto err_srqc;
- }
}
return 0;
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 17086048d2d7fc..79e72a457e7983 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -325,9 +325,9 @@ static int irdma_alloc_ucontext(struct ib_ucontext *uctx,
uresp.max_pds = iwdev->rf->sc_dev.hw_attrs.max_hw_pds;
uresp.wq_size = iwdev->rf->sc_dev.hw_attrs.max_qp_wr * 2;
uresp.kernel_ver = req.userspace_ver;
- if (ib_copy_to_udata(udata, &uresp,
- min(sizeof(uresp), udata->outlen)))
- return -EFAULT;
+ ret = ib_respond_udata(udata, uresp);
+ if (ret)
+ return ret;
} else {
u64 bar_off = (uintptr_t)iwdev->rf->sc_dev.hw_regs[IRDMA_DB_ADDR_OFFSET];
@@ -354,10 +354,10 @@ static int irdma_alloc_ucontext(struct ib_ucontext *uctx,
uresp.comp_mask |= IRDMA_ALLOC_UCTX_MIN_HW_WQ_SIZE;
uresp.max_hw_srq_quanta = uk_attrs->max_hw_srq_quanta;
uresp.comp_mask |= IRDMA_ALLOC_UCTX_MAX_HW_SRQ_QUANTA;
- if (ib_copy_to_udata(udata, &uresp,
- min(sizeof(uresp), udata->outlen))) {
+ ret = ib_respond_udata(udata, uresp);
+ if (ret) {
rdma_user_mmap_entry_remove(ucontext->db_mmap_entry);
- return -EFAULT;
+ return ret;
}
}
@@ -420,11 +420,9 @@ static int irdma_alloc_pd(struct ib_pd *pd, struct ib_udata *udata)
ibucontext);
irdma_sc_pd_init(dev, sc_pd, pd_id, ucontext->abi_ver);
uresp.pd_id = pd_id;
- if (ib_copy_to_udata(udata, &uresp,
- min(sizeof(uresp), udata->outlen))) {
- err = -EFAULT;
+ err = ib_respond_udata(udata, uresp);
+ if (err)
goto error;
- }
} else {
irdma_sc_pd_init(dev, sc_pd, pd_id, IRDMA_ABI_VER);
}
@@ -1124,10 +1122,8 @@ static int irdma_create_qp(struct ib_qp *ibqp,
uresp.qp_id = qp_num;
uresp.qp_caps = qp->qp_uk.qp_caps;
- err_code = ib_copy_to_udata(udata, &uresp,
- min(sizeof(uresp), udata->outlen));
+ err_code = ib_respond_udata(udata, uresp);
if (err_code) {
- ibdev_dbg(&iwdev->ibdev, "VERBS: copy_to_udata failed\n");
irdma_destroy_qp(&iwqp->ibqp, udata);
return err_code;
}
@@ -1612,12 +1608,9 @@ int irdma_modify_qp_roce(struct ib_qp *ibqp, struct ib_qp_attr *attr,
uresp.push_valid = 1;
uresp.push_offset = iwqp->sc_qp.push_offset;
}
- ret = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp),
- udata->outlen));
+ ret = ib_respond_udata(udata, uresp);
if (ret) {
irdma_remove_push_mmap_entries(iwqp);
- ibdev_dbg(&iwdev->ibdev,
- "VERBS: copy_to_udata failed\n");
return ret;
}
}
@@ -1860,12 +1853,9 @@ int irdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
uresp.push_offset = iwqp->sc_qp.push_offset;
}
- err = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp),
- udata->outlen));
+ err = ib_respond_udata(udata, uresp);
if (err) {
irdma_remove_push_mmap_entries(iwqp);
- ibdev_dbg(&iwdev->ibdev,
- "VERBS: copy_to_udata failed\n");
return err;
}
}
@@ -2418,11 +2408,9 @@ static int irdma_create_srq(struct ib_srq *ibsrq,
resp.srq_id = iwsrq->srq_num;
resp.srq_size = ukinfo->srq_size;
- if (ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen))) {
- err_code = -EPROTO;
+ err_code = ib_respond_udata(udata, resp);
+ if (err_code)
goto srq_destroy;
- }
}
return 0;
@@ -2664,13 +2652,9 @@ static int irdma_create_cq(struct ib_cq *ibcq,
resp.cq_id = info.cq_uk_init_info.cq_id;
resp.cq_size = info.cq_uk_init_info.cq_size;
- if (ib_copy_to_udata(udata, &resp,
- min(sizeof(resp), udata->outlen))) {
- ibdev_dbg(&iwdev->ibdev,
- "VERBS: copy to user data\n");
- err_code = -EPROTO;
+ err_code = ib_respond_udata(udata, resp);
+ if (err_code)
goto cq_destroy;
- }
}
init_completion(&iwcq->free_cq);
@@ -5330,7 +5314,7 @@ static int irdma_create_user_ah(struct ib_ah *ibah,
mutex_unlock(&iwdev->rf->ah_tbl_lock);
uresp.ah_id = ah->sc_ah.ah_info.ah_idx;
- err = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp), udata->outlen));
+ err = ib_respond_udata(udata, uresp);
if (err)
irdma_destroy_ah(ibah, attr->flags);
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index 2d682428ef202a..f2547989f42290 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -79,11 +79,9 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
if (udata) {
resp.cqid = cq->queue.id;
- err = ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(&mdev->ib_dev, "Failed to copy to udata, %d\n", err);
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_remove_cq_cb;
- }
}
spin_lock_init(&cq->cq_lock);
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 0fbcf449c134b5..ecf5910dbf0702 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -557,11 +557,9 @@ static int mana_ib_create_rc_qp(struct ib_qp *ibqp, struct ib_pd *ibpd,
resp.queue_id[j] = qp->rc_qp.queues[i].id;
j++;
}
- err = ib_copy_to_udata(udata, &resp, min(sizeof(resp), udata->outlen));
- if (err) {
- ibdev_dbg(&mdev->ib_dev, "Failed to copy to udata, %d\n", err);
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto destroy_qp;
- }
}
err = mana_table_store_qp(mdev, qp);
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index 852f6f502d14d0..3fb8519a4ce0d7 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -292,12 +292,9 @@ int mlx5_ib_create_srq(struct ib_srq *ib_srq,
.srqn = srq->msrq.srqn,
};
- if (ib_copy_to_udata(udata, &resp, min(udata->outlen,
- sizeof(resp)))) {
- mlx5_ib_dbg(dev, "copy to user failed\n");
- err = -EFAULT;
+ err = ib_respond_udata(udata, resp);
+ if (err)
goto err_core;
- }
}
init_attr->attr.max_wr = srq->msrq.max - 1;
diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
index 16aab967a20308..cefcb243c3a6f2 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c
@@ -406,12 +406,10 @@ int pvrdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init_attr,
qp_resp.qpn = qp->ibqp.qp_num;
qp_resp.qp_handle = qp->qp_handle;
- if (ib_copy_to_udata(udata, &qp_resp,
- min(udata->outlen, sizeof(qp_resp)))) {
- dev_warn(&dev->pdev->dev,
- "failed to copy back udata\n");
+ ret = ib_respond_udata(udata, qp_resp);
+ if (ret) {
__pvrdma_destroy_qp(dev, qp);
- return -EINVAL;
+ return ret;
}
}
--
2.43.0
* [PATCH v3 05/10] RDMA/cxgb4: Convert to ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
These cases carefully work around 32-bit unpadded structures, but
the min integrated into ib_respond_udata() handles this
automatically. Zero-initialize data that would not have been copied.
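A small sketch of why the explicit length arithmetic can go away, assuming an old userspace passes an output buffer that is shorter by the trailing field. The struct layout, the flag value and the 12-byte buffer are hypothetical and chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_64B_CQE 0x1u /* hypothetical flag bit */

/* Hypothetical response; the real cxgb4 layout is not reproduced here. */
struct demo_cq_resp {
	uint64_t key;
	uint32_t cqid;
	uint32_t flags; /* trailing field unknown to old userspace */
};

/* Old scheme: the driver picks the copy length from the ABI variant. */
static size_t old_copy_len(int is_32b_cqe)
{
	return is_32b_cqe ? sizeof(struct demo_cq_resp) - sizeof(uint32_t)
			  : sizeof(struct demo_cq_resp);
}

/* New scheme: the length is clamped to what userspace asked for. */
static size_t new_copy_len(size_t outlen)
{
	return outlen < sizeof(struct demo_cq_resp) ? outlen
						    : sizeof(struct demo_cq_resp);
}

/* New scheme: the flag is set only for userspace that understands it. */
static uint32_t new_flags(int is_32b_cqe)
{
	return is_32b_cqe ? 0 : DEMO_64B_CQE;
}

/* Under the stated assumption both schemes copy the same number of bytes. */
static void check_lengths_match(void)
{
	assert(old_copy_len(1) == new_copy_len(12));
	assert(old_copy_len(0) == new_copy_len(sizeof(struct demo_cq_resp)));
}
```

If a 32-byte-CQE userspace did pass a full-size buffer, the flags field would now be copied too, which is why it has to stay zero in that case.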
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/cxgb4/cq.c | 8 +++-----
drivers/infiniband/hw/cxgb4/provider.c | 5 ++---
2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index e31fb9134aa818..47508df4cec023 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -1115,13 +1115,11 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
/* communicate to the userspace that
* kernel driver supports 64B CQE
*/
- uresp.flags |= C4IW_64B_CQE;
+ if (!ucontext->is_32b_cqe)
+ uresp.flags |= C4IW_64B_CQE;
spin_unlock(&ucontext->mmap_lock);
- ret = ib_copy_to_udata(udata, &uresp,
- ucontext->is_32b_cqe ?
- sizeof(uresp) - sizeof(uresp.flags) :
- sizeof(uresp));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_free_mm2;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index a119e8793aef40..0e3827022c63da 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -80,7 +80,7 @@ static int c4iw_alloc_ucontext(struct ib_ucontext *ucontext,
struct ib_device *ibdev = ucontext->device;
struct c4iw_ucontext *context = to_c4iw_ucontext(ucontext);
struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
- struct c4iw_alloc_ucontext_resp uresp;
+ struct c4iw_alloc_ucontext_resp uresp = {};
int ret = 0;
struct c4iw_mm_entry *mm = NULL;
@@ -106,8 +106,7 @@ static int c4iw_alloc_ucontext(struct ib_ucontext *ucontext,
context->key += PAGE_SIZE;
spin_unlock(&context->mmap_lock);
- ret = ib_copy_to_udata(udata, &uresp,
- sizeof(uresp) - sizeof(uresp.reserved));
+ ret = ib_respond_udata(udata, uresp);
if (ret)
goto err_mm;
--
2.43.0
* [PATCH v3 00/10] Convert all drivers to the new udata response flow
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
Go through the drivers and migrate them to use ib_respond_udata(). Remove
debugging prints on failure paths. Ensure the error propagates from
ib_respond_udata(). Use the = {} pattern to initialize the uresp.
There are a couple of oddball cases which are fixed up in their own
commits, but otherwise this is fairly straightforward.
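To make the error propagation point concrete, a minimal userspace sketch follows. model_respond() is a hypothetical stand-in for a response helper that can fail. The two callers mirror the before and after shapes seen in the driver diffs, not any specific function.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a response helper that fails with a given errno (hypothetical). */
static int model_respond(int fail_with)
{
	assert(fail_with <= 0);
	return fail_with;
}

/* Old pattern: any failure is rewritten to a driver-chosen code. */
static int create_old(int fail_with)
{
	if (model_respond(fail_with))
		return -EPROTO;
	return 0;
}

/* New pattern: the helper's own return code is passed up unchanged. */
static int create_new(int fail_with)
{
	int err;

	err = model_respond(fail_with);
	if (err)
		return err;
	return 0;
}
```

With the new pattern the caller sees whatever the helper reported, so drivers no longer disagree on which errno a failed response copy produces.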
v3:
- Rebased to v7.1-rc3, which already includes the error flow bug fixes, so
those patches are removed from this series
v2: https://patch.msgid.link/r/0-v2-1c49eeb88c48+91-rdma_udata_rep_jgg@nvidia.com
- More patches fixing the pre-existing error flow bugs
found by Sashiko. I left the rvt issue behind because it was too broken
and scary.
v1: https://patch.msgid.link/r/0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com
Jason Gunthorpe (10):
RDMA: Use ib_is_udata_in_empty() for places calling
ib_is_udata_cleared()
IB/rdmavt: Don't abuse udata and ib_respond_udata()
RDMA: Convert drivers using min to ib_respond_udata()
RDMA: Convert drivers using sizeof() to ib_respond_udata()
RDMA/cxgb4: Convert to ib_respond_udata()
RDMA/qedr: Replace qedr_ib_copy_to_udata() with ib_respond_udata()
RDMA/mlx: Replace response_len with ib_respond_udata()
RDMA: Use proper driver data response structs instead of open coding
RDMA: Add missed = {} initialization to uresp structs
RDMA: Replace memset with = {} pattern for ib_respond_udata()
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 2 +-
drivers/infiniband/hw/cxgb4/cq.c | 11 +--
drivers/infiniband/hw/cxgb4/provider.c | 14 +--
drivers/infiniband/hw/cxgb4/qp.c | 10 +--
drivers/infiniband/hw/efa/efa_verbs.c | 87 ++++++-------------
drivers/infiniband/hw/erdma/erdma_verbs.c | 13 ++-
drivers/infiniband/hw/hns/hns_roce_ah.c | 4 +-
drivers/infiniband/hw/hns/hns_roce_cq.c | 3 +-
drivers/infiniband/hw/hns/hns_roce_main.c | 3 +-
drivers/infiniband/hw/hns/hns_roce_pd.c | 8 +-
drivers/infiniband/hw/hns/hns_roce_qp.c | 13 +--
drivers/infiniband/hw/hns/hns_roce_srq.c | 6 +-
.../infiniband/hw/ionic/ionic_controlpath.c | 8 +-
drivers/infiniband/hw/irdma/verbs.c | 48 ++++------
drivers/infiniband/hw/mana/cq.c | 6 +-
drivers/infiniband/hw/mana/qp.c | 22 ++---
drivers/infiniband/hw/mlx4/cq.c | 7 +-
drivers/infiniband/hw/mlx4/main.c | 31 ++++---
drivers/infiniband/hw/mlx4/qp.c | 9 +-
drivers/infiniband/hw/mlx4/srq.c | 12 ++-
drivers/infiniband/hw/mlx5/ah.c | 2 +-
drivers/infiniband/hw/mlx5/cq.c | 7 +-
drivers/infiniband/hw/mlx5/main.c | 16 ++--
drivers/infiniband/hw/mlx5/mr.c | 2 +-
drivers/infiniband/hw/mlx5/qp.c | 17 ++--
drivers/infiniband/hw/mlx5/srq.c | 7 +-
drivers/infiniband/hw/mthca/mthca_provider.c | 40 ++++++---
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 31 +++----
drivers/infiniband/hw/qedr/verbs.c | 48 ++--------
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 13 +--
drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c | 7 +-
drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c | 8 +-
drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c | 6 +-
.../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c | 11 ++-
drivers/infiniband/sw/rdmavt/cq.c | 2 +-
drivers/infiniband/sw/rdmavt/qp.c | 3 +-
drivers/infiniband/sw/rdmavt/srq.c | 19 ++--
drivers/infiniband/sw/siw/siw_verbs.c | 10 +--
38 files changed, 225 insertions(+), 341 deletions(-)
base-commit: 5d6919055dec134de3c40167a490f33c74c12581
--
2.43.0
* [PATCH v3 08/10] RDMA: Use proper driver data response structs instead of open coding
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
At some point the response structs were added and rdma-core is using
them, but the kernel was not changed to use them as well. Replace
the open-coded copy with the right struct and ib_respond_udata().
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/mlx4/cq.c | 7 ++--
drivers/infiniband/hw/mlx4/main.c | 11 ++++--
drivers/infiniband/hw/mlx4/srq.c | 12 ++++---
drivers/infiniband/hw/mlx5/cq.c | 7 ++--
drivers/infiniband/hw/mthca/mthca_provider.c | 35 ++++++++++++++------
5 files changed, 48 insertions(+), 24 deletions(-)
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 7a6eb602d4a6de..7e4505f6c78b30 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -142,6 +142,7 @@ int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
{
struct ib_udata *udata = &attrs->driver_udata;
struct ib_device *ibdev = ibcq->device;
+ struct mlx4_ib_create_cq_resp uresp = {};
int entries = attr->cqe;
int vector = attr->comp_vector;
struct mlx4_ib_dev *dev = to_mdev(ibdev);
@@ -219,10 +220,10 @@ int mlx4_ib_create_user_cq(struct ib_cq *ibcq,
cq->mcq.event = mlx4_ib_cq_event;
cq->mcq.usage = MLX4_RES_USAGE_USER_VERBS;
- if (ib_copy_to_udata(udata, &cq->mcq.cqn, sizeof(__u32))) {
- err = -EFAULT;
+ uresp.cqn = cq->mcq.cqn;
+ err = ib_respond_udata(udata, uresp);
+ if (err)
goto err_cq_free;
- }
return 0;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 4b187ec9e01738..25f9738bd77223 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1199,9 +1199,14 @@ static int mlx4_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
if (err)
return err;
- if (udata && ib_copy_to_udata(udata, &pd->pdn, sizeof(__u32))) {
- mlx4_pd_free(to_mdev(ibdev)->dev, pd->pdn);
- return -EFAULT;
+ if (udata) {
+ struct mlx4_ib_alloc_pd_resp uresp = { .pdn = pd->pdn };
+
+ err = ib_respond_udata(udata, uresp);
+ if (err) {
+ mlx4_pd_free(to_mdev(ibdev)->dev, pd->pdn);
+ return err;
+ }
}
return 0;
}
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 767840736d583b..dd868f9b893d70 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -191,11 +191,15 @@ int mlx4_ib_create_srq(struct ib_srq *ib_srq,
srq->msrq.event = mlx4_ib_srq_event;
srq->ibsrq.ext.xrc.srq_num = srq->msrq.srqn;
- if (udata)
- if (ib_copy_to_udata(udata, &srq->msrq.srqn, sizeof (__u32))) {
- err = -EFAULT;
+ if (udata) {
+ struct mlx4_ib_create_srq_resp uresp = {
+ .srqn = srq->msrq.srqn
+ };
+
+ err = ib_respond_udata(udata, uresp);
+ if (err)
goto err_srq;
- }
+ }
init_attr->attr.max_wr = srq->msrq.max - 1;
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index a76b7a36087d98..c548d4dfbbc96a 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -949,6 +949,7 @@ int mlx5_ib_create_user_cq(struct ib_cq *ibcq,
{
struct ib_udata *udata = &attrs->driver_udata;
struct ib_device *ibdev = ibcq->device;
+ struct mlx5_ib_create_cq_resp uresp = {};
int entries = attr->cqe;
int vector = attr->comp_vector;
struct mlx5_ib_dev *dev = to_mdev(ibdev);
@@ -1015,10 +1016,10 @@ int mlx5_ib_create_user_cq(struct ib_cq *ibcq,
INIT_LIST_HEAD(&cq->wc_list);
- if (ib_copy_to_udata(udata, &cq->mcq.cqn, sizeof(__u32))) {
- err = -EFAULT;
+ uresp.cqn = cq->mcq.cqn;
+ err = ib_respond_udata(udata, uresp);
+ if (err)
goto err_cmd;
- }
kvfree(cqb);
return 0;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 07c60797c86091..afa97d3801f783 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -357,9 +357,12 @@ static int mthca_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
return err;
if (udata) {
- if (ib_copy_to_udata(udata, &pd->pd_num, sizeof (__u32))) {
+ struct mthca_alloc_pd_resp uresp = { .pdn = pd->pd_num };
+
+ err = ib_respond_udata(udata, uresp);
+ if (err) {
mthca_pd_free(to_mdev(ibdev), pd);
- return -EFAULT;
+ return err;
}
}
@@ -428,11 +431,17 @@ static int mthca_create_srq(struct ib_srq *ibsrq,
if (err)
return err;
- if (context && ib_copy_to_udata(udata, &srq->srqn, sizeof(__u32))) {
- mthca_free_srq(to_mdev(ibsrq->device), srq);
- mthca_unmap_user_db(to_mdev(ibsrq->device), &context->uar,
- context->db_tab, ucmd.db_index);
- return -EFAULT;
+ if (context) {
+ struct mthca_create_srq_resp uresp = { .srqn = srq->srqn };
+
+ err = ib_respond_udata(udata, uresp);
+ if (err) {
+ mthca_free_srq(to_mdev(ibsrq->device), srq);
+ mthca_unmap_user_db(to_mdev(ibsrq->device),
+ &context->uar, context->db_tab,
+ ucmd.db_index);
+ return err;
+ }
}
return 0;
@@ -631,10 +640,14 @@ static int mthca_create_cq(struct ib_cq *ibcq,
if (err)
goto err_unmap_arm;
- if (udata && ib_copy_to_udata(udata, &cq->cqn, sizeof(__u32))) {
- mthca_free_cq(to_mdev(ibdev), cq);
- err = -EFAULT;
- goto err_unmap_arm;
+ if (udata) {
+ struct mthca_create_cq_resp uresp = { .cqn = cq->cqn };
+
+ err = ib_respond_udata(udata, uresp);
+ if (err) {
+ mthca_free_cq(to_mdev(ibdev), cq);
+ goto err_unmap_arm;
+ }
}
cq->resize_buf = NULL;
--
2.43.0
* [PATCH v3 07/10] RDMA/mlx: Replace response_len with ib_respond_udata()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
The Mellanox drivers have a pattern where they compute the response
length they think they need based on what the user asked for, then
blindly write that many bytes, ignoring the size limit the user
provided for the response structure.
Drop this and just use ib_respond_udata(), which caps the response
struct to the user's memory. That is fine for what mlx5 is doing.
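The capping behaviour described above can be sketched in userspace roughly as follows. This is a model under stated assumptions, not the kernel helper: struct demo_udata, struct demo_resp, demo_respond_bytes() and the demo_respond() macro are hypothetical names. Only the "copy at most outlen bytes" rule comes from the text; anything else ib_respond_udata() may do, such as how it treats user memory beyond the struct, is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the output half of struct ib_udata. */
struct demo_udata {
	void *outbuf;
	size_t outlen;
};

/* Hypothetical response struct with a self-describing length field. */
struct demo_resp {
	uint32_t dctn;
	uint32_t ece_options;
	uint32_t response_length;
};

/* Copy at most outlen bytes of the response; never write past the user buffer. */
static int demo_respond_bytes(struct demo_udata *udata, const void *resp,
			      size_t len)
{
	size_t n = len < udata->outlen ? len : udata->outlen;

	if (n && !udata->outbuf)
		return -EFAULT;	/* assumed stand-in for a failed user copy */
	if (n)
		memcpy(udata->outbuf, resp, n);
	return 0;
}

/* Takes the struct itself, mirroring the call shape used in the diff. */
#define demo_respond(udata, resp) \
	demo_respond_bytes((udata), &(resp), sizeof(resp))
```

With this shape the caller no longer needs to pass a separately computed length, so a stale or oversized response_length cannot cause a write beyond the buffer the user supplied.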
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/mlx4/main.c | 2 +-
drivers/infiniband/hw/mlx4/qp.c | 2 +-
drivers/infiniband/hw/mlx5/ah.c | 2 +-
drivers/infiniband/hw/mlx5/main.c | 4 ++--
drivers/infiniband/hw/mlx5/mr.c | 2 +-
drivers/infiniband/hw/mlx5/qp.c | 10 +++++-----
6 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ce77e893065c92..4b187ec9e01738 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -626,7 +626,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
}
if (uhw->outlen) {
- err = ib_copy_to_udata(uhw, &resp, resp.response_length);
+ err = ib_respond_udata(uhw, resp);
if (err)
goto out;
}
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index aca8a985ce33cd..8dc4196218bf05 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -4331,7 +4331,7 @@ int mlx4_ib_create_rwq_ind_table(struct ib_rwq_ind_table *rwq_ind_table,
if (udata->outlen) {
resp.response_length = offsetof(typeof(resp), response_length) +
sizeof(resp.response_length);
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
}
return err;
diff --git a/drivers/infiniband/hw/mlx5/ah.c b/drivers/infiniband/hw/mlx5/ah.c
index 531a57f9ee7e8b..a3aa700d08355d 100644
--- a/drivers/infiniband/hw/mlx5/ah.c
+++ b/drivers/infiniband/hw/mlx5/ah.c
@@ -121,7 +121,7 @@ int mlx5_ib_create_ah(struct ib_ah *ibah, struct rdma_ah_init_attr *init_attr,
resp.response_length = min_resp_len;
memcpy(resp.dmac, ah_attr->roce.dmac, ETH_ALEN);
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
if (err)
return err;
}
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 0b3eda9b0ad0c4..fb9689e453bce4 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1356,7 +1356,7 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
}
if (uhw_outlen) {
- err = ib_copy_to_udata(uhw, &resp, resp.response_length);
+ err = ib_respond_udata(uhw, resp);
if (err)
return err;
@@ -2281,7 +2281,7 @@ static int mlx5_ib_alloc_ucontext(struct ib_ucontext *uctx,
goto out_mdev;
resp.response_length = min(udata->outlen, sizeof(resp));
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
if (err)
goto out_mdev;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3b6da45061a552..d7d8f3ae8b647a 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1809,7 +1809,7 @@ int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
resp.response_length =
min(offsetofend(typeof(resp), response_length), udata->outlen);
if (resp.response_length) {
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
if (err)
goto free_mkey;
}
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 23abea36cf71cf..6859e8ba2732ac 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3332,7 +3332,7 @@ int mlx5_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
* including MLX5_IB_QPT_DCT, which doesn't need it.
* In that case, resp will be filled with zeros.
*/
- err = ib_copy_to_udata(udata, &params.resp, params.outlen);
+ err = ib_respond_udata(udata, params.resp);
if (err)
goto destroy_qp;
@@ -4631,7 +4631,7 @@ static int mlx5_ib_modify_dct(struct ib_qp *ibqp, struct ib_qp_attr *attr,
resp.dctn = qp->dct.mdct.mqp.qpn;
if (MLX5_CAP_GEN(dev->mdev, ece_support))
resp.ece_options = MLX5_GET(create_dct_out, out, ece);
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
if (err) {
mlx5_core_destroy_dct(dev, &qp->dct.mdct);
return err;
@@ -4790,7 +4790,7 @@ int mlx5_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
if (!err && resp.response_length &&
udata->outlen >= resp.response_length)
/* Return -EFAULT to the user and expect him to destroy QP. */
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
out:
mutex_unlock(&qp->mutex);
@@ -5490,7 +5490,7 @@ struct ib_wq *mlx5_ib_create_wq(struct ib_pd *pd,
if (udata->outlen) {
resp.response_length = offsetofend(
struct mlx5_ib_create_wq_resp, response_length);
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
if (err)
goto err_copy;
}
@@ -5581,7 +5581,7 @@ int mlx5_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
resp.response_length =
offsetofend(struct mlx5_ib_create_rwq_ind_tbl_resp,
response_length);
- err = ib_copy_to_udata(udata, &resp, resp.response_length);
+ err = ib_respond_udata(udata, resp);
if (err)
goto err_copy;
}
--
2.43.0
* [PATCH v3 01/10] RDMA: Use ib_is_udata_in_empty() for places calling ib_is_udata_cleared()
From: Jason Gunthorpe @ 2026-05-12 0:09 UTC
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler,
Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
Yishai Hadas
Cc: patches
In-Reply-To: <0-v3-4effdebad75a+e1-rdma_udata_rep_jgg@nvidia.com>
Convert the pattern:
if (udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen))
Using Coccinelle:
virtual patch
virtual context
virtual report
@@
expression udata;
@@
(
- udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen)
+ !ib_is_udata_in_empty(udata)
|
- udata->inlen > 0 && !ib_is_udata_cleared(udata, 0, udata->inlen)
+ !ib_is_udata_in_empty(udata)
)
@@
expression udata;
@@
- udata && udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen)
+ !ib_is_udata_in_empty(udata)
These cases already check that any input data the kernel does not
understand is zeroed.
Run another pass with AI to propagate the return code correctly and
remove redundant prints.
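For readers unfamiliar with the check being consolidated, here is a minimal userspace sketch of what such a helper might look like. It follows the int-returning convention visible in the diff hunks (0 means the input is acceptable). struct demo_udata_in and demo_is_udata_in_empty() are hypothetical names, tolerating a NULL udata is inferred from the second Coccinelle rule, and the -EOPNOTSUPP return value is an assumption, since the helper's actual error code is not shown in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of the input half of struct ib_udata. */
struct demo_udata_in {
	const void *inbuf;
	size_t inlen;
};

/*
 * Return 0 when there is no driver input, or when every input byte is
 * zero; otherwise return a negative errno (assumed value).
 */
static int demo_is_udata_in_empty(const struct demo_udata_in *udata)
{
	const unsigned char *p;
	size_t i;

	if (!udata || !udata->inlen)
		return 0;

	p = udata->inbuf;
	for (i = 0; i < udata->inlen; i++)
		if (p[i])
			return -EOPNOTSUPP;
	return 0;
}
```

Folding the NULL test, the length test and the zero scan into one call is what lets each driver hunk shrink to an assignment plus an error check.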
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/hw/efa/efa_verbs.c | 43 +++++++++------------------
drivers/infiniband/hw/mlx4/main.c | 6 ++--
drivers/infiniband/hw/mlx4/qp.c | 7 ++---
drivers/infiniband/hw/mlx5/main.c | 5 ++--
drivers/infiniband/hw/mlx5/qp.c | 7 ++---
5 files changed, 26 insertions(+), 42 deletions(-)
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 7bd0838ebc99e4..3ad5d6e27b1590 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -218,12 +218,9 @@ int efa_query_device(struct ib_device *ibdev,
struct efa_dev *dev = to_edev(ibdev);
int err;
- if (udata && udata->inlen &&
- !ib_is_udata_cleared(udata, 0, udata->inlen)) {
- ibdev_dbg(ibdev,
- "Incompatible ABI params, udata not cleared\n");
- return -EINVAL;
- }
+ err = ib_is_udata_in_empty(udata);
+ if (err)
+ return err;
dev_attr = &dev->dev_attr;
@@ -433,13 +430,9 @@ int efa_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
struct efa_pd *pd = to_epd(ibpd);
int err;
- if (udata->inlen &&
- !ib_is_udata_cleared(udata, 0, udata->inlen)) {
- ibdev_dbg(&dev->ibdev,
- "Incompatible ABI params, udata not cleared\n");
- err = -EINVAL;
+ err = ib_is_udata_in_empty(udata);
+ if (err)
goto err_out;
- }
err = efa_com_alloc_pd(&dev->edev, &result);
if (err)
@@ -982,12 +975,9 @@ int efa_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
if (qp_attr_mask & ~IB_QP_ATTR_STANDARD_BITS)
return -EOPNOTSUPP;
- if (udata->inlen &&
- !ib_is_udata_cleared(udata, 0, udata->inlen)) {
- ibdev_dbg(&dev->ibdev,
- "Incompatible ABI params, udata not cleared\n");
- return -EINVAL;
- }
+ err = ib_is_udata_in_empty(udata);
+ if (err)
+ return err;
cur_state = qp_attr_mask & IB_QP_CUR_STATE ? qp_attr->cur_qp_state :
qp->state;
@@ -1612,13 +1602,11 @@ static struct efa_mr *efa_alloc_mr(struct ib_pd *ibpd, int access_flags,
struct efa_dev *dev = to_edev(ibpd->device);
int supp_access_flags;
struct efa_mr *mr;
+ int ret;
- if (udata && udata->inlen &&
- !ib_is_udata_cleared(udata, 0, udata->inlen)) {
- ibdev_dbg(&dev->ibdev,
- "Incompatible ABI params, udata not cleared\n");
- return ERR_PTR(-EINVAL);
- }
+ ret = ib_is_udata_in_empty(udata);
+ if (ret)
+ return ERR_PTR(ret);
supp_access_flags =
IB_ACCESS_LOCAL_WRITE |
@@ -2082,12 +2070,9 @@ int efa_create_ah(struct ib_ah *ibah,
goto err_out;
}
- if (udata->inlen &&
- !ib_is_udata_cleared(udata, 0, udata->inlen)) {
- ibdev_dbg(&dev->ibdev, "Incompatible ABI params\n");
- err = -EINVAL;
+ err = ib_is_udata_in_empty(udata);
+ if (err)
goto err_out;
- }
memcpy(params.dest_addr, ah_attr->grh.dgid.raw,
sizeof(params.dest_addr));
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 464c9ab4251636..16e9ce8138cb30 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1696,9 +1696,9 @@ static struct ib_flow *mlx4_ib_create_flow(struct ib_qp *qp,
(flow_attr->type != IB_FLOW_ATTR_NORMAL))
return ERR_PTR(-EOPNOTSUPP);
- if (udata &&
- udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen))
- return ERR_PTR(-EOPNOTSUPP);
+ err = ib_is_udata_in_empty(udata);
+ if (err)
+ return ERR_PTR(err);
memset(type, 0, sizeof(type));
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 790be09d985a1a..aca8a985ce33cd 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -4297,10 +4297,9 @@ int mlx4_ib_create_rwq_ind_table(struct ib_rwq_ind_table *rwq_ind_table,
size_t min_resp_len;
int i, err = 0;
- if (udata->inlen > 0 &&
- !ib_is_udata_cleared(udata, 0,
- udata->inlen))
- return -EOPNOTSUPP;
+ err = ib_is_udata_in_empty(udata);
+ if (err)
+ return err;
min_resp_len = offsetof(typeof(resp), reserved) + sizeof(resp.reserved);
if (udata->outlen && udata->outlen < min_resp_len)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 61078281953d6c..2bb5caf5a89266 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -965,8 +965,9 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
resp.response_length = resp_len;
- if (uhw && uhw->inlen && !ib_is_udata_cleared(uhw, 0, uhw->inlen))
- return -EINVAL;
+ err = ib_is_udata_in_empty(uhw);
+ if (err)
+ return err;
memset(props, 0, sizeof(*props));
err = mlx5_query_system_image_guid(ibdev,
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8fd05532c09cc7..23abea36cf71cf 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -5538,10 +5538,9 @@ int mlx5_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
u32 *in;
void *rqtc;
- if (udata->inlen > 0 &&
- !ib_is_udata_cleared(udata, 0,
- udata->inlen))
- return -EOPNOTSUPP;
+ err = ib_is_udata_in_empty(udata);
+ if (err)
+ return err;
if (init_attr->log_ind_tbl_size >
MLX5_CAP_GEN(dev->mdev, log_max_rqt_size)) {
--
2.43.0