Linux PCI subsystem development
From: sashiko-bot@kernel.org
To: "Mukesh R" <mrathor@linux.microsoft.com>
Cc: linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
Date: Wed, 13 May 2026 12:41:25 +0000	[thread overview]
Message-ID: <20260513124126.7D2EAC2BCB7@smtp.kernel.org> (raw)
In-Reply-To: <20260512020259.1678627-10-mrathor@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 10 potential issues to consider:
- [Critical] `hv_iommu_unmap_pages` makes a rep hypercall but ignores the completion count, leaving pages permanently mapped.
- [Critical] Integer truncation in `hv_iommu_unmap_pages` causes incomplete hardware unmapping.
- [Critical] Global `unique_id` data race leads to IOMMU domain collisions and cross-VM DMA sharing.
- [High] The `sscanf` return value check in `hv_iommu_probe_device` breaks the loop on success instead of failure.
- [High] Partial unmaps corrupt the interval tree state, desynchronizing hardware and software.
- [High] Unsafe interval tree iteration leads to skipped nodes during removal.
- [High] Memory leak of interval tree nodes on domain free.
- [High] Ignoring `gfp` flags causes premature OOM and risks sleeping in atomic context.
- [Medium] Missing NULL check on `mapped` pointer in fast path.
- [Low] The `__setup` handlers violate the API by returning 0.
--

commit 4a3c258bcace1805e7afe537c6eb3e451a6fd030
Author: Mukesh R <mrathor@linux.microsoft.com>

x86/hyperv: Implement Hyper-V virtual IOMMU

Add a new file to implement management of device domains, mapping and
unmapping of IOMMU memory, and other iommu_ops to fit within the VFIO
framework for PCI passthru on Hyper-V running Linux as baremetal root
or L1VH root.

> diff --git a/drivers/iommu/hyperv-iommu-root.c b/drivers/iommu/hyperv-iommu-root.c
> new file mode 100644
> index 0000000000000..a2e0f6cc78e66
> --- /dev/null
> +++ b/drivers/iommu/hyperv-iommu-root.c

[ ... ]

> +static char *pci_devs_to_skip;
> +static int __init hv_iommu_setup_skip(char *str)
> +{
> +	pci_devs_to_skip = str;
> +
> +	return 0;
> +}

Should this return 1 instead of 0? Returning 0 causes the kernel to print an
"Unknown kernel command line parameters" warning and to pass the parameter on
to the init process.

> +/* hv_iommu_skip=(SSSS:BB:DD.F)(SSSS:BB:DD.F) */
> +__setup("hv_iommu_skip=", hv_iommu_setup_skip);
> +
> +bool hv_no_attdev;	 /* disable direct device attach for passthru */
> +EXPORT_SYMBOL_GPL(hv_no_attdev);
> +static int __init setup_hv_no_attdev(char *str)
> +{
> +	hv_no_attdev = true;
> +	return 0;
> +}

Should this also return 1 instead of 0 to indicate the parameter was handled?

[ ... ]

> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +	struct hv_domain *hvdom;
> +	int rc;

[ ... ]

> +	/* Called under iommu group mutex, so single threaded */
> +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_NULL)   /* ie, UINTMAX */
> +		goto out_err;

Can this increment of unique_id race if called concurrently?

iommu_domain_alloc() can be called by subsystems without holding global or
group locks. If multiple allocations race, identical domid_num values could be
assigned to different domains, which might place them in the same S2 domain.

[ ... ]

> +static void hv_iommu_domain_free(struct iommu_domain *immdom)
> +{
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	unsigned long flags;
> +	u64 status;
> +	struct hv_input_delete_device_domain *input;
> +
> +	if (hv_special_domain(hvdom))
> +		return;
> +
> +	if (!hv_dom_owner_is_vmm(hvdom) || hv_no_attdev) {

[ ... ]

> +	}
> +
> +	kfree(hvdom);
> +}

Does this code leak the interval tree nodes?

It looks like the remaining struct hv_iommu_mapping nodes in
hvdom->mappings_tree are not freed before kfree(hvdom).

[ ... ]

> +static int hv_iommu_add_tree_mapping(struct hv_domain *hvdom,
> +				     unsigned long iova, phys_addr_t paddr,
> +				     size_t size, u32 flags)
> +{
> +	unsigned long irqflags;
> +	struct hv_iommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);

Should this use the gfp flags passed into hv_iommu_map_pages() instead of
GFP_ATOMIC? That would avoid failing prematurely under memory pressure when
the caller is allowed to sleep.

[ ... ]

> +static size_t hv_iommu_del_tree_mappings(struct hv_domain *hvdom,
> +					unsigned long iova, size_t size)
> +{
> +	unsigned long flags;
> +	size_t unmapped = 0;
> +	unsigned long last = iova + size - 1;
> +	struct hv_iommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&hvdom->mappings_lock, flags);
> +	next = interval_tree_iter_first(&hvdom->mappings_tree, iova, last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct hv_iommu_mapping, iova);
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/* Trying to split a mapping? Not supported for now. */
> +		if (mapping->iova.start < iova)
> +			break;

Does breaking out early here leave stale mappings in the tree?

If the target region is unmapped in hardware, the tracking node would remain,
causing stale lookups. Also, if unmapping the head of a mapping, is the
entire node deleted even though the tail is still mapped?

> +
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> +		interval_tree_remove(node, &hvdom->mappings_tree);

Can erasing the node corrupt the iteration?

Since interval_tree_remove() rebalances the tree and alters parent pointers,
the node cached in next may no longer be a valid resume point, so calling
interval_tree_iter_next() on it in the following loop iteration might skip
overlapping nodes.

[ ... ]

> +static int hv_iommu_map_pages(struct iommu_domain *immdom, ulong iova,
> +			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
> +			      int prot, gfp_t gfp, size_t *mapped)
> +{
> +	u32 map_flags;
> +	int ret;
> +	u64 status;
> +	unsigned long npages, done = 0;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t size = pgsize * pgcount;
> +
> +	map_flags = HV_MAP_GPA_READABLE;	/* required */
> +	map_flags |= prot & IOMMU_WRITE ? HV_MAP_GPA_WRITABLE : 0;
> +
> +	ret = hv_iommu_add_tree_mapping(hvdom, iova, paddr, size, map_flags);
> +	if (ret)
> +		return ret;
> +
> +	if (hvdom->attached_dom) {
> +		*mapped = size;
> +		return 0;
> +	}

Is it possible for mapped to be NULL here?

The fast path assigns to *mapped without validating the pointer, while later
in the function there are defensive checks like if (mapped).

[ ... ]

> +		if (hv_result(status) == HV_STATUS_INSUFFICIENT_MEMORY) {
> +			ret = hv_call_deposit_pages(NUMA_NO_NODE,
> +						    hv_current_partition_id,
> +						    256);

Could this trigger a sleeping in atomic context bug?

hv_call_deposit_pages() typically performs sleepable allocations, but the
caller of hv_iommu_map_pages() might have provided a non-sleeping gfp flag
which was not forwarded.

[ ... ]

> +static size_t hv_iommu_unmap_pages(struct iommu_domain *immdom, ulong iova,
> +				   size_t pgsize, size_t pgcount,
> +				   struct iommu_iotlb_gather *gather)
> +{
> +	unsigned long flags, npages;
> +	struct hv_input_unmap_device_gpa_pages *input;
> +	u64 status;
> +	struct hv_domain *hvdom = to_hv_domain(immdom);
> +	size_t unmapped, size = pgsize * pgcount;

[ ... ]

> +	npages = size >> HV_HYP_PAGE_SHIFT;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +
> +	input->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +	input->device_domain.domain_id.type = HV_DEVICE_DOMAIN_TYPE_S2;
> +	input->device_domain.domain_id.id = hvdom->domid_num;
> +	input->target_device_va_base = iova;
> +
> +	status = hv_do_rep_hypercall(HVCALL_UNMAP_DEVICE_GPA_PAGES, npages,
> +				     0, input, NULL);

Can npages be truncated here?

The hypercall parameter rep_count in hv_do_rep_hypercall() expects a u16.
If npages is >= 65536, it would be silently truncated, causing incomplete
unmapping in the hardware.

Does this also need a loop to handle partial completions?

hv_do_rep_hypercall() might only partially complete. Since the completion count
from hv_repcomp(status) is ignored, remaining pages could be left fully
mapped in the hardware.

[ ... ]

> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-ENODEV);
> +
> +	if (pci_devs_to_skip && *pci_devs_to_skip) {
> +		int rc, pos = 0;
> +		int parsed;
> +		int segment, bus, slot, func;
> +		struct pci_dev *pdev = to_pci_dev(dev);
> +
> +		do {
> +			parsed = 0;
> +
> +			rc = sscanf(pci_devs_to_skip + pos, " (%x:%x:%x.%x) %n",
> +				    &segment, &bus, &slot, &func, &parsed);
> +			if (rc)
> +				break;

Should this check for failure instead of success?

sscanf() returns the number of conversions successfully matched. On a
successful match rc is 4 (nonzero), so the loop breaks prematurely without
comparing the parsed values against the device; the check should probably be
for rc != 4.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260512020259.1678627-1-mrathor@linux.microsoft.com?part=9


Thread overview: 27+ messages
2026-05-12  2:02 [PATCH V3 00/11] PCI passthru on Hyper-V (Part I) Mukesh R
2026-05-12  2:02 ` [PATCH V3 01/11] iommu/hyperv: Rename hyperv-iommu.c to hyperv-irq.c Mukesh R
2026-05-12 10:26   ` Souradeep Chakrabarti
2026-05-12 23:46   ` Jacob Pan
2026-05-13  1:31     ` Mukesh R
2026-05-13  3:15     ` Michael Kelley
2026-05-12  2:02 ` [PATCH V3 02/11] x86/hyperv: Cosmetic changes in irqdomain.c for readability Mukesh R
2026-05-12 10:27   ` Souradeep Chakrabarti
2026-05-13  3:26   ` sashiko-bot
2026-05-12  2:02 ` [PATCH V3 03/11] mshv: Provide a way to get partition ID if running in a VMM process Mukesh R
2026-05-13  3:47   ` sashiko-bot
2026-05-12  2:02 ` [PATCH V3 04/11] mshv: Declarations and definitions for VFIO-MSHV bridge device Mukesh R
2026-05-12 10:26   ` Souradeep Chakrabarti
2026-05-12  2:02 ` [PATCH V3 05/11] mshv: Implement mshv bridge device for VFIO Mukesh R
2026-05-13  5:09   ` sashiko-bot
2026-05-12  2:02 ` [PATCH V3 06/11] mshv: Add ioctl support for MSHV-VFIO bridge device Mukesh R
2026-05-13  5:27   ` sashiko-bot
2026-05-12  2:02 ` [PATCH V3 07/11] mshv: Import data structs around device passthru from hyperv headers Mukesh R
2026-05-12  2:02 ` [PATCH V3 08/11] PCI: hv: VMBus and PCI device IDs for PCI passthru Mukesh R
2026-05-12 17:41   ` Bjorn Helgaas
2026-05-13  6:43   ` sashiko-bot
2026-05-13 15:08   ` Souradeep Chakrabarti
2026-05-13 15:17     ` Souradeep Chakrabarti
2026-05-12  2:02 ` [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU Mukesh R
2026-05-13 12:41   ` sashiko-bot [this message]
2026-05-12  2:02 ` [PATCH V3 10/11] mshv: Populate mmio mappings for PCI passthru Mukesh R
2026-05-12  2:02 ` [PATCH V3 11/11] mshv: Mark mem regions as non-movable upfront if device passthru Mukesh R
