Linux-HyperV List
 help / color / mirror / Atom feed
* [PATCH] mshv: add vmbus dependency
From: Arnd Bergmann @ 2026-05-20  7:40 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Jork Loeser,
	Stanislav Kinsburskii
  Cc: Arnd Bergmann, linux-hyperv, linux-kernel

From: Arnd Bergmann <arnd@arndb.de>

When the vmbus driver is not part of the kernel, the mvhv_root
driver now fails to link:

ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!

Avoid this by adding an explicit Kconfig dependency. Note that
stubbing out the hv_vmbus_exists() based on configuration would
also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.

Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/hv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 52af086fdeb2..21193b571a80 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -75,6 +75,7 @@ config MSHV_ROOT
 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
 	# no particular order, making it impossible to reassemble larger pages
 	depends on PAGE_SIZE_4KB
+	depends on HYPERV_VMBUS
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
-- 
2.39.5


^ permalink raw reply related

* [PATCH] drivers: hv: use kmalloc_array in mshv_root_scheduler_init
From: Can Peng @ 2026-05-20  7:16 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, longli, decui
  Cc: linux-kernel, linux-hyperv, Can Peng

Replace kmalloc() with kmalloc_array() to prevent potential
overflow, as recommended in Documentation/process/deprecated.rst.

Signed-off-by: Can Peng <pengcan@kylinos.cn>
---
 drivers/hv/mshv_root_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bd1359eb58dd..146726cc4e9b 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2241,7 +2241,7 @@ static int mshv_root_scheduler_init(unsigned int cpu)
 	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
 
 	/* Allocate two consecutive pages. One for input, one for output. */
-	p = kmalloc(2 * HV_HYP_PAGE_SIZE, GFP_KERNEL);
+	p = kmalloc_array(2, HV_HYP_PAGE_SIZE, GFP_KERNEL);
 	if (!p)
 		return -ENOMEM;
 
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v1 1/4] iommu: Move Hyper-V IOMMU driver to its own subdirectory
From: Yu Zhang @ 2026-05-20  6:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260515221918.GJ7702@ziepe.ca>

On Fri, May 15, 2026 at 07:19:18PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:05AM +0800, Yu Zhang wrote:
> > From: Easwar Hariharan <eahariha@linux.microsoft.com>
> > 
> > The Hyper-V IOMMU driver currently only supports IRQ remapping.
> > As it will be adding DMA remapping support, prepare a directory
> > to contain all the different feature files.
> 
> Any possibility we could put the irq remapping thing under the irq
> directory?
> 
> The other drivers have it here because they are co-mingled with their
> iommu HW, will hyperv have the same issue?
> 

Good question. I don't think Hyper-V have the same co-mingling issue.

But from a code organization perspective, I think drivers/iommu/hyperv/
is still the most natural place:

- The IRQ remapping framework itself (drivers/iommu/irq_remapping.c
  and its internal header irq_remapping.h) lives under drivers/iommu/,
  and all three backends (intel/, amd/, hyperv/) sit there today.
  hyperv/irq_remapping.c includes that internal header directly.

- irq_remapping_prepare() references the extern hyperv_irq_remap_ops
  declared in that header, so the provider naturally belongs in the
  same tree.

Moving it to e.g. drivers/irqchip/ would break that symmetry.

Yu

> Jason
> 

^ permalink raw reply

* [PATCH net] net: mana: validate rx_req_idx to prevent out-of-bounds array access
From: Aditya Garg @ 2026-05-20  5:15 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, horms, ernis, gargaditya,
	gargaditya, kees, stephen, shacharr, ssengar, linux-hyperv,
	netdev, linux-kernel

In mana_hwc_rx_event_handler(), rx_req_idx is derived from
sge->address in DMA-coherent memory. In Confidential VMs
(SEV-SNP/TDX), this memory is shared unencrypted and HW can modify
WQE contents at any time. No bounds check exists on rx_req_idx,
which can lead to an out-of-bounds access into reqs[].

Add bounds check on rx_req_idx in mana_hwc_rx_event_handler() before
using it to index the reqs[] array.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/hw_channel.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index dbbde0fa57e7..a60f733d1a07 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -266,6 +266,12 @@ static void mana_hwc_rx_event_handler(void *ctx, u32 gdma_rxq_id,
 	rq_base_addr = hwc_rxq->msg_buf->mem_info.dma_handle;
 	rx_req_idx = (sge->address - rq_base_addr) / hwc->max_req_msg_size;
 
+	if (rx_req_idx >= hwc_rxq->msg_buf->num_reqs) {
+		dev_err(hwc->dev, "HWC RX: wrong rx_req_idx=%llu, num_reqs=%u\n",
+			rx_req_idx, hwc_rxq->msg_buf->num_reqs);
+		return;
+	}
+
 	rx_req = &hwc_rxq->msg_buf->reqs[rx_req_idx];
 	resp = (struct gdma_resp_hdr *)rx_req->buf_va;
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Jakub Kicinski @ 2026-05-19 22:55 UTC (permalink / raw)
  To: dipayanroy
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
	leitao, kees, john.fastabend, hawk, bpf, daniel, ast, sdf,
	yury.norov
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mana: Skip redundant detach in queue reset handler if already detached

This commit adds a check for netif_device_present() in the queue reset
handler. If the device is already detached, it skips the redundant
mana_detach() call and proceeds directly to mana_attach().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9e7e4bf526bf6..740f3c34db451 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
[ ... ]
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);

The fix here correctly prevents mana_detach() from running on an already
detached device. However, do identical issues remain in mana_change_mtu(),
mana_set_channels(), mana_set_ringparam(), and mana_bpf()?

If mana_attach() previously failed (for example, due to out of memory),
the port is left with apc->port_is_up = false, but it preserves
apc->port_st_save = true, and calls netif_device_detach(ndev).

If an admin tries to recover the port by retrying a configuration command
like changing the MTU or attaching XDP, these functions appear to
unconditionally call mana_detach(ndev, false) without the
!netif_device_present(ndev) check added here.

Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
port_is_up is already false, port_st_save is overwritten to false,
permanently erasing the intent that the port should be up.

The subsequent call to mana_attach() would then see port_st_save == false,
skip mana_alloc_queues(), set port_is_up = false, and successfully call
netif_device_attach(ndev).

Would this result in the configuration command returning 0 to userspace
while the port queues were never allocated, leaving the interface silently
broken until manually toggled down and up?

^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: Jakub Kicinski @ 2026-05-19 22:55 UTC (permalink / raw)
  To: dipayanroy
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
	leitao, kees, john.fastabend, hawk, bpf, daniel, ast, sdf,
	yury.norov
In-Reply-To: <20260518194654.735580-2-dipayanroy@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mana: Add NULL guards in teardown path to prevent panic on attach failure
    
When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
and mana_dealloc_queues() to prevent NULL pointer dereferences.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a8..9e7e4bf526bf6 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
>  	struct mana_rxq *rxq;
>  	u32 rxq_idx;
>  
> -	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> -		rxq = apc->rxqs[rxq_idx];
> -		if (!rxq)
> -			continue;
> +	if (apc->rxqs) {
>  
> -		mana_destroy_rxq(apc, rxq, true);
> -		apc->rxqs[rxq_idx] = NULL;
> +		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> +			rxq = apc->rxqs[rxq_idx];
> +			if (!rxq)
> +				continue;
> +
> +			mana_destroy_rxq(apc, rxq, true);
> +			apc->rxqs[rxq_idx] = NULL;
> +		}
>  	}

If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
how does the interface recover if the user subsequently brings it back up?

When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
mana_alloc_queues() -> mana_add_rx_queues(), the code does:

    apc->rxqs[i] = rxq;

If apc->rxqs was left as NULL during the teardown, will this cause a NULL
pointer dereference on the next interface bring-up?

[ ... ]
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 * number of queues.
>  	 */
>  
> -	for (i = 0; i < apc->num_queues; i++) {
> -		txq = &apc->tx_qp[i].txq;
> -		tsleep = 1000;
> -		while (atomic_read(&txq->pending_sends) > 0 &&
> -		       time_before(jiffies, timeout)) {
> -			usleep_range(tsleep, tsleep + 1000);
> -			tsleep <<= 1;
> -		}
> -		if (atomic_read(&txq->pending_sends)) {
> -			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> -			if (err) {
> -				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> -					   err, atomic_read(&txq->pending_sends),
> -					   txq->gdma_txq_id);
> +	if (apc->tx_qp) {

In the error path where queue creation fails partway through, will
mana_destroy_txq() be called?

If mana_destroy_txq() unconditionally iterates over all apc->num_queues:

    for (i = 0; i < apc->num_queues; i++) {
        ...
        mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
        ...

Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
that were never fully initialized?

Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
with these uninitialized handles, and could handle 0 corrupt the firmware
state?

> +		for (i = 0; i < apc->num_queues; i++) {
> +			txq = &apc->tx_qp[i].txq;
> +			tsleep = 1000;
> +			while (atomic_read(&txq->pending_sends) > 0 &&
> +			       time_before(jiffies, timeout)) {
> +				usleep_range(tsleep, tsleep + 1000);
> +				tsleep <<= 1;
> +			}
> +			if (atomic_read(&txq->pending_sends)) {
> +				err =
> +				    pcie_flr(to_pci_dev(gd->gdma_context->dev));

If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
internal GDMA state.

Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
and mana_destroy_vport() are called. Will these functions attempt to send
commands to the now-destroyed GDMA admin queue via mana_send_request() and
wait for completions?

Could this lead to cascading timeouts since the hardware was just reset and
cannot process admin commands?

> +				if (err) {
> +					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> +						   err,
> +					    atomic_read(&txq->pending_sends),
> +					    txq->gdma_txq_id);
> +				}
> +				break;
>  			}
> -			break;
>  		}
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: sashiko-bot @ 2026-05-19 21:34 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <e6e63276cca2901641ab39029e4fd3d621b1ee92.1779221799.git.me@berkoc.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [Critical] Using `bytes_recvd` for `memcpy()` without checking `vmbus_recvpacket()` return value leads to a massive heap buffer overflow.
- [High] Strict sizeof() validation incorrectly rejects dynamically-sized SYNTHVID_RESOLUTION_RESPONSE packets.

Pre-existing issues:
- [High] Concurrent lockless write to `hv->init_buf` from VMBus callback allows a malicious host to overwrite data while the guest is validating it.
- [High] Missing `reinit_completion()` before reusing the shared `hv->wait` completion object.
--

commit 8035423b95c9b24007478019421c5f94fe1179ce
Author: Berkant Koc <me@berkoc.com>
Subject: drm/hyperv: validate VMBus packet size in receive callback

This patch passes bytes_recvd into hyperv_receive_sub() to validate that
packets cover the pipe and synthvid headers as well as the type-specific
payload. The memcpy operation is also modified to use bytes_recvd.

Cc: stable@vger.kernel.org # 5.14+

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index c3d0ff229e3d0..12d3feb4f88ac 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
[ ... ]
> @@ -420,26 +420,62 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	return 0;
>  }
>  
> -static void hyperv_receive_sub(struct hv_device *hdev)
> +static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
[ ... ]
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		size_t need = hdr_size;
> +
> +		switch (msg->vid_hdr.type) {
> +		case SYNTHVID_VERSION_RESPONSE:
> +			need += sizeof(struct synthvid_version_resp);
> +			break;
> +		case SYNTHVID_RESOLUTION_RESPONSE:
> +			need += sizeof(struct synthvid_supported_resolution_resp);

[Severity: High]
Does this strict sizeof() check correctly handle dynamically-sized packets?

If the host sends a response sized only for the active elements (up to
resolution_count) rather than the maximum 64 elements, bytes_recvd will
be less than need. Will this cause valid packets to be falsely rejected
and break the driver initialization sequence?

> +			break;
> +		case SYNTHVID_VRAM_LOCATION_ACK:
> +			need += sizeof(struct synthvid_vram_location_ack);
> +			break;
> +		}
> +		if (bytes_recvd < need) {
> +			drm_err_ratelimited(&hv->dev,
> +					    "synthvid packet too small for type %u: %u < %zu\n",
> +					    msg->vid_hdr.type, bytes_recvd, need);
> +			return;
> +		}
> +		memcpy(hv->init_buf, msg, bytes_recvd);

[Severity: High]
This is a pre-existing issue, but does this pattern allow a malicious host
to overwrite data while the guest is validating it?

The callback unconditionally copies incoming response packets into
hv->init_buf via memcpy() and calls complete().

Since the waiting thread processes hv->init_buf in process context without
copying it to isolated local memory, can an untrusted host send a second
packet immediately after validation to trigger a time-of-check to
time-of-use (TOCTOU) data race?

>  		complete(&hv->wait);

[Severity: High]
This isn't a bug introduced by this patch, but does the driver need to call
reinit_completion(&hv->wait) before sending a new request?

If a prior request times out and its response arrives late, the completion
counter is incremented. Will the next wait immediately consume the
completion counter and process stale data left in hv->init_buf?

>  		return;
>  	}
[ ... ]
> @@ -466,7 +502,7 @@ static void hyperv_receive(void *ctx)
>  				       &bytes_recvd, &req_id);
>  		if (bytes_recvd > 0 &&
>  		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
> -			hyperv_receive_sub(hdev);
> +			hyperv_receive_sub(hdev, bytes_recvd);

[Severity: Critical]
If vmbus_recvpacket() encounters a packet larger than VMBUS_MAX_PACKET_SIZE
(16KB), it returns -ENOBUFS and sets bytes_recvd to the required packet
length.

Since ret == 0 is only checked in the while condition, does this pass an
unbounded bytes_recvd into the callback? If recv_buf still contains the
header from a previous successful iteration, it looks like
memcpy(hv->init_buf, msg, bytes_recvd) will copy more than 16KB into
hv->init_buf, causing a heap buffer overflow.

>  	} while (bytes_recvd > 0 && ret == 0);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/cover.1779221339.git.me@berkoc.com?part=2

^ permalink raw reply

* Re: [PATCH v3 1/2] drm/hyperv: validate resolution_count and fix WIN8 fallback
From: sashiko-bot @ 2026-05-19 20:55 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <1b88bc7edeb2f0153475225b67f19aaca629eca8.1779221799.git.me@berkoc.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Unsynchronized access to the shared `hv->init_buf` introduces Double Fetch/TOCTOU vulnerabilities, allowing a malicious host to bypass bounds checks and leak kernel memory.
- [High] The driver reuses the `hv->wait` completion for synchronous messages without re-initializing it, which can cause spurious wakeups and synchronization bypass.
--

commit dde0fd11b79967ca7df11b1049abebec363d25a5
Author: Berkant Koc <me@berkoc.com>

drm/hyperv: validate resolution_count and fix WIN8 fallback

Adds bounds checking for resolution_count to prevent walking past the
supported_resolution array. It also adds fallback handling to ensure WIN8
defaults are used when the resolution probe fails.

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc5268321..c3d0ff229e3d0 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -387,12 +387,15 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	hyperv_sendpacket(hdev, msg);
>  
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);

[Severity: High]
This isn't a bug introduced by this patch, but does the driver need to call
reinit_completion(&hv->wait) before sending requests?

If a delayed or unsolicited response arrives, hyperv_receive_sub() calls
complete(&hv->wait), incrementing its internal counter.  When this function
is called later, wait_for_completion_timeout() will return immediately
without waiting for the actual response.

Could this cause the driver to erroneously parse its own outgoing request
buffer as a response?

>  	if (!t) {
>  		drm_err(dev, "Time out on waiting resolution response\n");
>  		return -ETIMEDOUT;
>  	}
>  
> -	if (msg->resolution_resp.resolution_count == 0) {
> -		drm_err(dev, "No supported resolutions\n");
> +	if (msg->resolution_resp.resolution_count == 0 ||
> +	    msg->resolution_resp.resolution_count >
> +	    SYNTHVID_MAX_RESOLUTION_COUNT) {

[Severity: Critical]
This is a pre-existing issue, but does reading directly from the shared
hv->init_buf introduce a TOCTOU or Double Fetch problem here?

Since hv->init_buf is also used by the asynchronous hyperv_receive_sub()
callback, could a malicious host send a valid count to pass this check,
and then immediately overwrite hv->init_buf with a large value before the
subsequent for-loop evaluates i < msg->resolution_resp.resolution_count?

Additionally, when the driver prepares requests earlier in this function
(e.g., setting msg->vid_hdr.size), an unsolicited response could overwrite
hv->init_buf. Could this trick vmbus_sendpacket() into using a
host-controlled size and leaking kernel memory?

> +		drm_err(dev, "Invalid resolution count: %d\n",
> +			msg->resolution_resp.resolution_count);
>  		return -ENODEV;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/cover.1779221339.git.me@berkoc.com?part=1

^ permalink raw reply

* [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <cover.1779221339.git.me@berkoc.com>

hyperv_receive_sub() reads msg->vid_hdr.type and dispatches into one
of four message-type branches without knowing how many bytes the host
wrote into hv->recv_buf. The completion path then runs
memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE), so the consumer
that wakes on wait_for_completion_timeout() can read up to 16 KiB of
residue from a prior message as if it were the response payload.

Pass bytes_recvd into hyperv_receive_sub() and reject any packet that
does not cover the pipe + synthvid header. For each of the three
completion-driving types (SYNTHVID_VERSION_RESPONSE,
SYNTHVID_RESOLUTION_RESPONSE, SYNTHVID_VRAM_LOCATION_ACK) also
require the type-specific payload before memcpy/complete, and apply
the same rule to SYNTHVID_FEATURE_CHANGE before reading is_dirt_needed.
The memcpy then uses bytes_recvd, which is bounded by
VMBUS_MAX_PACKET_SIZE through the call to vmbus_recvpacket().

Rejected packets are reported via drm_err_ratelimited() rather than
silently dropped, matching the CoCo-hardened pattern in
hv_kvp_onchannelcallback().

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 42 +++++++++++++++++++++--
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index c3d0ff229..12d3feb4f 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -420,26 +420,62 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 	return 0;
 }
 
-static void hyperv_receive_sub(struct hv_device *hdev)
+static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 {
 	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg;
+	size_t hdr_size;
 
 	if (!hv)
 		return;
 
+	hdr_size = sizeof(struct pipe_msg_hdr) +
+		   sizeof(struct synthvid_msg_hdr);
+	if (bytes_recvd < hdr_size) {
+		drm_err_ratelimited(&hv->dev,
+				    "synthvid packet too small for header: %u\n",
+				    bytes_recvd);
+		return;
+	}
+
 	msg = (struct synthvid_msg *)hv->recv_buf;
 
 	/* Complete the wait event */
 	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
-		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
+		size_t need = hdr_size;
+
+		switch (msg->vid_hdr.type) {
+		case SYNTHVID_VERSION_RESPONSE:
+			need += sizeof(struct synthvid_version_resp);
+			break;
+		case SYNTHVID_RESOLUTION_RESPONSE:
+			need += sizeof(struct synthvid_supported_resolution_resp);
+			break;
+		case SYNTHVID_VRAM_LOCATION_ACK:
+			need += sizeof(struct synthvid_vram_location_ack);
+			break;
+		}
+		if (bytes_recvd < need) {
+			drm_err_ratelimited(&hv->dev,
+					    "synthvid packet too small for type %u: %u < %zu\n",
+					    msg->vid_hdr.type, bytes_recvd, need);
+			return;
+		}
+		memcpy(hv->init_buf, msg, bytes_recvd);
 		complete(&hv->wait);
 		return;
 	}
 
 	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
+		if (bytes_recvd < hdr_size +
+		    sizeof(struct synthvid_feature_change)) {
+			drm_err_ratelimited(&hv->dev,
+					    "synthvid feature change packet too small: %u\n",
+					    bytes_recvd);
+			return;
+		}
 		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
 		if (hv->dirt_needed)
 			hyperv_hide_hw_ptr(hv->hdev);
@@ -466,7 +502,7 @@ static void hyperv_receive(void *ctx)
 				       &bytes_recvd, &req_id);
 		if (bytes_recvd > 0 &&
 		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
-			hyperv_receive_sub(hdev);
+			hyperv_receive_sub(hdev, bytes_recvd);
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 1/2] drm/hyperv: validate resolution_count and fix WIN8 fallback
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <cover.1779221339.git.me@berkoc.com>

A SYNTHVID_RESOLUTION_RESPONSE with resolution_count > 64 walks past
the supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT] array in the
parse loop. Bound resolution_count against the array size, folded
into the existing zero-check.

When the WIN10 resolution probe fails, the caller in
hyperv_connect_vsp() left hv->screen_*_max / preferred_* unpopulated,
which sets mode_config.max_width / max_height to 0 and makes
drm_internal_framebuffer_create() reject every userspace framebuffer
with -EINVAL. The pre-WIN10 branch had the same gap for
preferred_width / preferred_height. Use a single post-probe fallback
guarded by screen_width_max == 0 so both paths converge on the WIN8
defaults.

Signed-off-by: Berkant Koc <me@berkoc.com>
Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 051ecc526..c3d0ff229 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -391,8 +391,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		return -ETIMEDOUT;
 	}
 
-	if (msg->resolution_resp.resolution_count == 0) {
-		drm_err(dev, "No supported resolutions\n");
+	if (msg->resolution_resp.resolution_count == 0 ||
+	    msg->resolution_resp.resolution_count >
+	    SYNTHVID_MAX_RESOLUTION_COUNT) {
+		drm_err(dev, "Invalid resolution count: %d\n",
+			msg->resolution_resp.resolution_count);
 		return -ENODEV;
 	}
 
@@ -508,9 +511,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)
 		ret = hyperv_get_supported_resolution(hdev);
 		if (ret)
 			drm_err(dev, "Failed to get supported resolution from host, use default\n");
-	} else {
+	}
+
+	if (!hv->screen_width_max) {
 		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
 		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
+		hv->preferred_width = SYNTHVID_WIDTH_WIN8;
+		hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
 	}
 
 	hv->mmio_megabytes = hdev->channel->offermsg.offer.mmio_megabytes;
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 0/2] drm/hyperv: harden host message parsing
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <20260517-drm-hyperv-cover-v2@berkoc.com>

Two independent issues in the synthetic video driver that both stem
from trusting unvalidated host data.

1/2 bounds resolution_count from SYNTHVID_RESOLUTION_RESPONSE against
the supported_resolution[] array, and populates WIN8 defaults for
hv->screen_*_max / hv->preferred_* in both the WIN10-probe-failure
path and the pre-WIN10 path, so a failed or pre-WIN10 probe yields
a usable display instead of having drm_internal_framebuffer_create()
reject every userspace framebuffer with -EINVAL.

2/2 forwards bytes_recvd from vmbus_recvpacket() into the sub-handler,
rejects packets that do not cover the synthvid header, and additionally
requires the type-specific payload size before memcpy/complete or
before reading the feature-change byte. Rejected packets are logged
via drm_err_ratelimited() instead of being silently dropped, matching
the CoCo-hardened pattern in hv_kvp_onchannelcallback().

Changes since v2 (per review by Michael Kelley):

  1/2: dropped the reinit_completion() change. Kelley pointed out that
       the negotiate-version and update-vram-location timeouts cause
       hyperv_vmbus_probe() to fail and free the device, so the stale
       completion can only outlive its request in hyperv_vmbus_resume()
       after a get_supported_resolution() timeout. That is a narrower
       fix and belongs in a separate patch against the resume path.
       Subject and commit message rewritten to reflect that this patch
       is now bounds-check + WIN8 fallback only. Pre-WIN10 branch now
       also populates hv->preferred_* (Kelley spotted the gap).
       Followed the post-probe-test refactor Kelley suggested: the else
       branch is gone, a single screen_width_max == 0 check covers
       both the pre-WIN10 case and a failed WIN10 probe.

  2/2: dropped the redundant upper bound on bytes_recvd. Added a
       per-type switch for the three completion-driving message types
       (SYNTHVID_VERSION_RESPONSE, SYNTHVID_RESOLUTION_RESPONSE,
       SYNTHVID_VRAM_LOCATION_ACK) so the wait-completion path
       validates payload size before memcpy/complete. Every reject
       path now emits drm_err_ratelimited() rather than returning
       silently. Commit message rewritten to lead with the residue
       read, with "wasteful copy" reframed as the secondary observation.

Changes since v1:
  1/2: bound resolution_count check folded into the existing zero
       check; populate WIN8 defaults when hyperv_get_supported_resolution()
       fails.
  2/2: forward bytes_recvd into hyperv_receive_sub(); enforce the
       pipe + synthvid header minimum; check synthvid_feature_change
       payload size before reading is_dirt_needed.

Both patches carry an Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
trailer per the kernel coding-assistants policy. Code, analysis and
review responses are mine; the model is used as a structured reviewer
under human verification.

base-commit: 4bf5d3da79c48e1df4bab82c9680c53adeff7820

Berkant Koc (2):
  drm/hyperv: validate resolution_count and fix WIN8 fallback
  drm/hyperv: validate VMBus packet size in receive callback

 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 58 ++++++++++++++++++++---
 1 file changed, 52 insertions(+), 6 deletions(-)

--
2.47.3

^ permalink raw reply

* Re: [PATCH v2 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-19 20:20 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Saurabh Sengar, Dexuan Cui, Long Li, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, linux-hyperv, dri-devel,
	linux-kernel
In-Reply-To: <SN6PR02MB415761EE7992EAFA14F2201BD4002@SN6PR02MB4157.namprd02.prod.outlook.com>

Hello Michael,

Thanks for the thorough review. v3 is on the list and addresses each
point:

> Does copying the full 16 KiB break anything? Or are you flagging as just
> wasteful activity?

It is the residue read that is the actual hazard, not the copy cost:
the consumer that wakes on complete() then reads up to 16 KiB of bytes
the host did not write in this packet as if it were the response
payload. The v3 commit message now leads with that and treats
"wasteful" as a secondary observation.

> Version related comments should go below the "---" following the
> Signed-off line.

Moved into the cover letter changelog in v3 so it stays out of git
log.

> The check against VMBUS_MAX_PACKET_SIZE shouldn't be needed.

Dropped. The v3 check is bytes_recvd < hdr_size only.

> In similar cases in other drivers that have been hardened for CoCo VMs,
> the code outputs a rate limited error message. [...] See
> hv_kvp_onchannelcallback() for example.

Done in v3 via drm_err_ratelimited() on every short-packet path
(synthvid header underflow, type-specific payload underflow, feature
change underflow). The driver already uses drm_err_ratelimited() in
hyperv_sendpacket() for the corresponding send path.

> Additional logic is needed here. Each of the three message types
> in the "if" statement has data beyond just the header. Before doing
> the memcpy() and complete(), the code should validate that the msg
> is big enough to contain that expected data.

Fixed in v3. For the three completion types I now compute the required
payload size with a switch on msg->vid_hdr.type and reject the packet
before memcpy/complete:

  SYNTHVID_VERSION_RESPONSE    -> sizeof(struct synthvid_version_resp)
  SYNTHVID_RESOLUTION_RESPONSE -> sizeof(struct synthvid_supported_resolution_resp)
  SYNTHVID_VRAM_LOCATION_ACK   -> sizeof(struct synthvid_vram_location_ack)

The memcpy then uses bytes_recvd, so wait_for_completion_timeout()
consumers never see truncated data and never read past what the host
wrote.

Series: <cover.1779221339.git.me@berkoc.com>

The v3 patches carry an `Assisted-by: Claude:claude-opus-4-7
berkoc-pipeline` trailer per the kernel coding-assistants policy.
Code, analysis and review responses are mine; the model is used as
a structured reviewer under human verification.

Thanks,
Berkant

^ permalink raw reply

* Re: [PATCH v2 1/2] drm/hyperv: validate resolution_count and harden VSP request paths
From: Berkant Koc @ 2026-05-19 20:20 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Saurabh Sengar, Dexuan Cui, Long Li, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, linux-hyperv, dri-devel,
	linux-kernel
In-Reply-To: <SN6PR02MB4157D595B990A321BFA85B40D4002@SN6PR02MB4157.namprd02.prod.outlook.com>

Hello Michael,

Thanks for the CoCo context, that lines up with what is in
vmbus_devs[] for the framebuffer device. The piecemeal approach is
what I am aiming for here.

v3 is on the list and addresses your three concrete points:

> This change should probably be a separate patch, as it's not really
> related to the bounds checking issue.
> [...]
> All that notwithstanding, I don't think your fix is needed in its
> current form.

Dropped from v3. You are right that the negotiate-version and
update-vram-location timeouts let hyperv_vmbus_probe() bail out and
free the device, so the stale-completion path is only reachable from
hyperv_vmbus_resume() after a get_supported_resolution() timeout.
That is a narrower fix and belongs in a separate patch against the
resume path, which I will send afterwards.

> Is there a separate problem here in that preferred_width and
> preferred_height are not set in the pre-WIN10 case?

Yes, separate problem, and I missed it in v2. The pre-WIN10 branch
in hyperv_connect_vsp() sets only screen_*_max and leaves preferred_*
at zero, which is inconsistent with the WIN10-failure path.

> Also, having to duplicate the fallback code is distasteful. Instead
> of having an "else" clause, maybe have a follow-up test for
> screen_width_max [...] being zero as an indicator [...]

Adopted in v3. The else branch is gone; the WIN10 path runs the probe
and the post-probe block applies the WIN8 defaults whenever
screen_width_max is still zero. One source of truth, both paths
converge on it.

> In my view, your commit message is a bit too detailed.

Rewritten in v3. The bounds check and the WIN8 fallback are now one
short paragraph each, focused on the "why".

Series: <cover.1779221339.git.me@berkoc.com>

The v3 patches carry an `Assisted-by: Claude:claude-opus-4-7
berkoc-pipeline` trailer per the kernel coding-assistants policy.
Code, analysis and review responses are mine; the model is used as
a structured reviewer under human verification.

Thanks,
Berkant

^ permalink raw reply

* RE: [PATCH v4 16/18] mshv: Validate scheduler message bounds from hypervisor
From: Michael Kelley @ 2026-05-19 20:21 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177816866691.21765.15605640837157423543.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, May 7, 2026 8:44 AM
> 
> handle_pair_message() iterates up to msg->vp_count without verifying it
> against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
> from untrusted hypervisor data, a malformed message with a large value
> would cause out-of-bounds reads from the partition_ids and vp_indexes
> arrays.
> 
> handle_bitset_message() iterates over set bits in valid_bank_mask (up to
> 64) and advances bank_contents for each one. However, the payload buffer
> only has space for 16 bank entries. A valid_bank_mask with more than 16
> bits set causes bank_contents to read beyond the message buffer.
> 
> Fix both by adding bounds validation:
> - Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
> - Track banks consumed and stop before exceeding buffer capacity
> 
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_synic.c |   20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 89207aad7cf1f..5d509299f14d7 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -190,7 +190,9 @@ static void kick_vp(struct mshv_vp *vp)
>  static void
>  handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>  {
> -	int bank_idx, vps_signaled = 0, bank_mask_size;
> +	int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
> +	const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
> +			      sizeof(u64) - 2; /* subtract format + mask */
>  	struct mshv_partition *partition;
>  	const struct hv_vpset *vpset;
>  	const u64 *bank_contents;
> @@ -230,6 +232,11 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>  		if (bank_idx == bank_mask_size)
>  			break;
> 
> +		if (unlikely(banks_used >= max_banks)) {
> +			pr_debug("valid_bank_mask exceeds buffer capacity\n");
> +			goto unlock_out;
> +		}

Instead of counting the banks while going through the loop, and checking
against the max on each loop iteration, consider doing the following up
front before entering the while loop:

		if (hweight64(vpset->valid_bank_mask) > max_banks) {
			pr_debug("valid_bank_mask exceeds buffer capacity\n");
			return;
		}

Of course, such an approach would not process *any* bits in the message,
but that's probably OK since the message Linux got is malformed.

Michael

> +
>  		while (true) {
>  			struct mshv_vp *vp;
> 
> @@ -258,6 +265,7 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
>  		}
> 
>  		bank_contents++;
> +		banks_used++;
>  	}
> 
>  unlock_out:
> @@ -274,10 +282,18 @@ handle_pair_message(const struct hv_vp_signal_pair_scheduler_message *msg)
>  	struct mshv_partition *partition = NULL;
>  	struct mshv_vp *vp;
>  	int idx;
> +	u8 vp_count = msg->vp_count;
> +
> +	if (unlikely(vp_count > HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT)) {
> +		pr_debug("pair message vp_count %u exceeds max %lu\n",
> +			 vp_count,
> +			 (unsigned long)HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT);
> +		return;
> +	}
> 
>  	rcu_read_lock();
> 
> -	for (idx = 0; idx < msg->vp_count; idx++) {
> +	for (idx = 0; idx < vp_count; idx++) {
>  		u64 partition_id = msg->partition_ids[idx];
>  		u32 vp_index = msg->vp_indexes[idx];
> 
> 
> 


^ permalink raw reply

* RE: [PATCH v4 04/18] mshv: Add NULL check for vp in mshv_try_assert_irq_fast
From: Michael Kelley @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177816860118.21765.7481864545928795603.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, May 7, 2026 8:43 AM
> 
> mshv_try_assert_irq_fast() dereferences the vp pointer obtained from
> pt_vp_array[lapic_apic_id] without checking for NULL or validating that
> lapic_apic_id is within bounds. A spurious interrupt from the hypervisor
> targeting a non-existent VP (or one not yet created) causes a NULL
> pointer dereference and crashes the host.
> 
> Add a bounds check on lapic_apic_id against MSHV_MAX_VPS and a NULL
> check on the vp pointer before dereferencing.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_eventfd.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> index 5995a62aff8d8..b398e58411dd7 100644
> --- a/drivers/hv/mshv_eventfd.c
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -169,7 +169,12 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd)
>  		return -EOPNOTSUPP;
>  #endif
> 
> +	if (irq->lapic_apic_id >= MSHV_MAX_VPS)
> +		return -EINVAL;

APIC IDs are 8-bit values, and indeed lapic_apic_id is set in
mshv_copy_girq_info() after masking the value with 0xFF. So
this check doesn't do anything. (I guess MSHV_MAX_VPS
could be changed to something smaller than 256, but that
seems highly unlikely.) There is extended destination
id functionality that provides for a 15-bit APIC ID, but I don't
see any support for that in the MSHV code.

> +
>  	vp = partition->pt_vp_array[irq->lapic_apic_id];

The APIC ID does *not* equal the Linux CPU number or VP
index in the general case, so this indexing is problematic. APIC
IDs are not dense if the target partition has multiple NUMA
nodes and the number of VPs in a NUMA node is not a power
of 2. In such case, the APIC ID space has gaps. Azure has such
VM sizes, and you can create such a configuration using
the local Hyper-V Manager UI. So having APIC IDs that aren't
dense is a real case.

> +	if (!vp)
> +		return -EINVAL;
> 
>  	if (!vp->vp_register_page)
>  		return -EOPNOTSUPP;
> 
> 


^ permalink raw reply

* RE: [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Michael Kelley @ 2026-05-19 20:07 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177816858406.21765.9718563917415905259.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, May 7, 2026 8:43 AM
> 
> The bounds check inside the PFN-filling loop can return -EINVAL while
> interrupts are disabled via local_irq_save(), leaking IRQ state.
> 
> Remove the check — it is redundant because the loop invariant
> (done + i < page_count == page_struct_count >> large_shift) guarantees
> (done + i) << large_shift < page_struct_count always holds.
> 
> While here, fix type mismatches: change 'int done' to 'u64 done' and
> use u64 for loop and batch-size variables so they match the u64
> page_count they are compared against.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index 129456bd72aba..cc580225e9e45 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  {
>  	struct hv_input_modify_sparse_spa_page_host_access *input_page;
>  	u64 status;
> -	int done = 0;
> +	u64 done = 0;
>  	unsigned long irq_flags, large_shift = 0;
>  	u64 page_count = page_struct_count;
>  	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
> @@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  	}
> 
>  	while (done < page_count) {
> -		ulong i, completed, remain = page_count - done;
> -		int rep_count = min(remain,
> -				        HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
> +		u64 i, completed, remain = page_count - done;
> +		u64 rep_count = min_t(u64, remain,
> + 					HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);

Why does this need to be "min_t()" instead of just "min()"?  Needing min_t()
was the case in times past, but "min()" does a much better job now of handling
mixed integer types. There are a number of patches on LKML that are
replacing min_t() with min(). This change seems to be going the opposite
direction.

> 
>  		local_irq_save(irq_flags);
>  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  		input_page->flags = flags;
>  		input_page->host_access = host_access;
> 
> -		for (i = 0; i < rep_count; i++) {
> -			u64 index = (done + i) << large_shift;
> -
> -			if (index >= page_struct_count)
> -				return -EINVAL;
> -
> +		for (i = 0; i < rep_count; i++)
>  			input_page->spa_page_list[i] =
> -						page_to_pfn(pages[index]);
> -		}
> +				page_to_pfn(pages[(done + i) << large_shift]);
> 
>  		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
>  					     NULL);
> 
The "completed" local variable could be collapsed out by doing:

		done += hv_repcomp(status)

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: sashiko-bot @ 2026-05-19 19:47 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: bpf, linux-hyperv
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The patch correctly prevents `mana_detach()` from running on an already-detached device in the reset handler, but identical bugs remain in `mana_change_mtu()`, `mana_set_channels()`, `mana_set_ringparam()`, and `mana_bpf()`.
--

commit bac73a4812067e75feeec962b27ab9647e30ed9c
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: Skip redundant detach in queue reset handler if already detached

This commit adds a check for netif_device_present() in the queue reset
handler. If the device is already detached, it skips the redundant
mana_detach() call and proceeds directly to mana_attach().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9e7e4bf526bf6..740f3c34db451 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
[ ... ]
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);

The fix here correctly prevents mana_detach() from running on an already
detached device. However, do identical issues remain in mana_change_mtu(),
mana_set_channels(), mana_set_ringparam(), and mana_bpf()?

If mana_attach() previously failed (for example, due to out of memory),
the port is left with apc->port_is_up = false, but it preserves
apc->port_st_save = true, and calls netif_device_detach(ndev).

If an admin tries to recover the port by retrying a configuration command
like changing the MTU or attaching XDP, these functions appear to
unconditionally call mana_detach(ndev, false) without the
!netif_device_present(ndev) check added here.

Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
port_is_up is already false, port_st_save is overwritten to false,
permanently erasing the intent that the port should be up.

The subsequent call to mana_attach() would then see port_st_save == false,
skip mana_alloc_queues(), set port_is_up = false, and successfully call
netif_device_attach(ndev).

Would this result in the configuration command returning 0 to userspace
while the port queues were never allocated, leaving the interface silently
broken until manually toggled down and up?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260518194654.735580-1-dipayanroy@linux.microsoft.com?part=2

^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: sashiko-bot @ 2026-05-19 19:47 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: linux-hyperv, bpf
In-Reply-To: <20260518194654.735580-2-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Missing re-allocation of `apc->rxqs` during interface bring-up causes deterministic NULL pointer dereference
- [High] Uninitialized queue handles (0 and INVALID_MANA_HANDLE) are sent to the firmware during error teardown, corrupting hardware state
- [High] GDMA admin commands are issued immediately after a PCI Function Level Reset (FLR), causing severe cascading timeouts and hardware state violations
--

commit 2129283d37aabfe724d7509e34346eab93d302a2
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: Add NULL guards in teardown path to prevent panic on attach failure
    
When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
and mana_dealloc_queues() to prevent NULL pointer dereferences.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a8..9e7e4bf526bf6 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
>  	struct mana_rxq *rxq;
>  	u32 rxq_idx;
>  
> -	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> -		rxq = apc->rxqs[rxq_idx];
> -		if (!rxq)
> -			continue;
> +	if (apc->rxqs) {
>  
> -		mana_destroy_rxq(apc, rxq, true);
> -		apc->rxqs[rxq_idx] = NULL;
> +		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> +			rxq = apc->rxqs[rxq_idx];
> +			if (!rxq)
> +				continue;
> +
> +			mana_destroy_rxq(apc, rxq, true);
> +			apc->rxqs[rxq_idx] = NULL;
> +		}
>  	}

If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
how does the interface recover if the user subsequently brings it back up?

When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
mana_alloc_queues() -> mana_add_rx_queues(), the code does:

    apc->rxqs[i] = rxq;

If apc->rxqs was left as NULL during the teardown, will this cause a NULL
pointer dereference on the next interface bring-up?

[ ... ]
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 * number of queues.
>  	 */
>  
> -	for (i = 0; i < apc->num_queues; i++) {
> -		txq = &apc->tx_qp[i].txq;
> -		tsleep = 1000;
> -		while (atomic_read(&txq->pending_sends) > 0 &&
> -		       time_before(jiffies, timeout)) {
> -			usleep_range(tsleep, tsleep + 1000);
> -			tsleep <<= 1;
> -		}
> -		if (atomic_read(&txq->pending_sends)) {
> -			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> -			if (err) {
> -				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> -					   err, atomic_read(&txq->pending_sends),
> -					   txq->gdma_txq_id);
> +	if (apc->tx_qp) {

In the error path where queue creation fails partway through, will
mana_destroy_txq() be called?

If mana_destroy_txq() unconditionally iterates over all apc->num_queues:

    for (i = 0; i < apc->num_queues; i++) {
        ...
        mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
        ...

Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
that were never fully initialized?

Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
with these uninitialized handles, and could handle 0 corrupt the firmware
state?

> +		for (i = 0; i < apc->num_queues; i++) {
> +			txq = &apc->tx_qp[i].txq;
> +			tsleep = 1000;
> +			while (atomic_read(&txq->pending_sends) > 0 &&
> +			       time_before(jiffies, timeout)) {
> +				usleep_range(tsleep, tsleep + 1000);
> +				tsleep <<= 1;
> +			}
> +			if (atomic_read(&txq->pending_sends)) {
> +				err =
> +				    pcie_flr(to_pci_dev(gd->gdma_context->dev));

If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
internal GDMA state.

Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
and mana_destroy_vport() are called. Will these functions attempt to send
commands to the now-destroyed GDMA admin queue via mana_send_request() and
wait for completions?

Could this lead to cascading timeouts since the hardware was just reset and
cannot process admin commands?

> +				if (err) {
> +					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> +						   err,
> +					    atomic_read(&txq->pending_sends),
> +					    txq->gdma_txq_id);
> +				}
> +				break;
>  			}
> -			break;
>  		}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260518194654.735580-1-dipayanroy@linux.microsoft.com?part=1

^ permalink raw reply

* RE: [PATCH v2 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Michael Kelley @ 2026-05-19 18:33 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, stable@vger.kernel.org
In-Reply-To: <20260517-drm-hyperv-patch2-v2@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Sunday, May 17, 2026 7:25 AM
> 
> hyperv_receive() reads bytes_recvd from vmbus_recvpacket() but does not
> forward that value to hyperv_receive_sub(). The sub-handler reads
> msg->vid_hdr.type and msg->feature_chg.is_dirt_needed without knowing
> how many bytes the host actually wrote, so a short packet leaves the
> parser reading bytes that the host did not write in this packet.

This is a valid concern. Similar patterns have hopefully been fixed in other
drivers that were hardened for CoCo VMs.

> The unconditional memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE) on the
> wait-completion path also copies the full 16 KiB recv_buf regardless
> of bytes_recvd, including any residue from the prior message.

Does copying the full 16 KiB break anything? Or are you flagging as just
wasteful activity? This gets to what I previously mentioned about commit
messages -- explain "why".  Being wasteful is a valid reason, but it can be
helpful to future readers to be explicit about the "why".

> 
> Pass bytes_recvd into hyperv_receive_sub() and reject any packet shorter
> than the pipe + synthvid header. The dirt-feature branch additionally
> requires the feature_change payload size before reading is_dirt_needed.
> The init_buf copy now uses bytes_recvd as the length argument, which
> keeps it bounded by VMBUS_MAX_PACKET_SIZE through the new upper check.
> 
> Unchanged from v1.

Version related comments should go below the "---" following the Signed-off
line. They will be visible in the patch emails, but won't get put into the
git log when the patch is finally accepted. The Linux kernel doesn't want
to clutter the git log with intermediate version information that accumulates
as the patch is reviewed and revised.

> 
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Cc: stable@vger.kernel.org # 5.14+
> Signed-off-by: Berkant Koc <me@berkoc.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 3b5065fe06e4..cdab4895dd40 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -423,26 +423,35 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	return 0;
>  }
> 
> -static void hyperv_receive_sub(struct hv_device *hdev)
> +static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
>  	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
>  	struct synthvid_msg *msg;
> +	size_t hdr_size;
> 
>  	if (!hv)
>  		return;
> 
> +	hdr_size = sizeof(struct pipe_msg_hdr) +
> +		   sizeof(struct synthvid_msg_hdr);
> +	if (bytes_recvd < hdr_size || bytes_recvd > VMBUS_MAX_PACKET_SIZE)

The check against VMBUS_MAX_PACKET_SIZE shouldn't be needed.
vmbus_recvpacket() takes VMBUS_MAX_PACKET_SIZE as the max
receive size and won't return a bytes_recvd value that is larger.

> +		return;

In similar cases in other drivers that have been hardened for CoCo VMs, the
code outputs a rate limited error message. That should be done here so the
bad received message isn't just ignored. See hv_kvp_onchannelcallback() for
example.

> +
>  	msg = (struct synthvid_msg *)hv->recv_buf;
> 
>  	/* Complete the wait event */
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		memcpy(hv->init_buf, msg, bytes_recvd);

Additional logic is needed here. Each of the three message types
in the "if" statement has data beyond just the header. Before doing
the memcpy() and complete(), the code should validate that the msg
is big enough to contain that expected data.  I.e., it needs to do the
same as you have done for SYNTHVID_FEATURE_CHANGE.
Otherwise, the code that wakes up from the completion
will think that it has valid data to work with, and it may not.

Of course, the problem already exists in the upstream code. But if you
are going to make changes to fix the problem, the fix should fully
handle situation and not just be a partial fix.

>  		complete(&hv->wait);
>  		return;
>  	}
> 
>  	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
> +		if (bytes_recvd < hdr_size +
> +		    sizeof(struct synthvid_feature_change))
> +			return;

Same here.  Output a rate limited error message.

>  		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
>  		if (hv->dirt_needed)
>  			hyperv_hide_hw_ptr(hv->hdev);
> @@ -469,7 +478,7 @@ static void hyperv_receive(void *ctx)
>  				       &bytes_recvd, &req_id);
>  		if (bytes_recvd > 0 &&
>  		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
> -			hyperv_receive_sub(hdev);
> +			hyperv_receive_sub(hdev, bytes_recvd);
>  	} while (bytes_recvd > 0 && ret == 0);
>  }
> 
> --
> 2.47.3
> 

^ permalink raw reply

* RE: [PATCH v2 1/2] drm/hyperv: validate resolution_count and harden VSP request paths
From: Michael Kelley @ 2026-05-19 18:33 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, stable@vger.kernel.org
In-Reply-To: <20260517-drm-hyperv-patch1-v2@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Sunday, May 17, 2026 7:25 AM
> 
> The synthetic video device parses a SYNTHVID_RESOLUTION_RESPONSE that
> contains a u8 resolution_count and a u8 default_resolution_index. The
> existing check rejects resolution_count == 0 and an index greater or
> equal to resolution_count, but does not bound resolution_count itself
> against the fixed supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT]
> array. A host that returns resolution_count > 64 together with an
> in-range default_resolution_index causes the subsequent loop to read
> past the array. Reject any resolution_count that exceeds the array
> bound, folded into the existing zero-check so a single log entry
> remains per failure.

I think this is a good fix, but let me provide some background. A core
assumption has been that the Hyper-V host is not malicious, and data from
the host can be trusted to be well-formed and valid. Linux code interacting
with the host does *not* take a paranoid approach and does *not* validate
every value received from the host. Existing validation is ad hoc and
probably not comprehensive.

A few years ago with the introduction of Confidential Computing and
"CoCo VMs" in in Linux, the assumption changed. (Hyper-V calls these
"Confidential VMs", or sometimes "Isolated VMs" -- unfortunately, the
terminology is a bit scattered.) The host is treated as potentially malicious
and Linux code was hardened against bad values passed from the host. But
not all drivers for Hyper-V synthetic devices were hardened because CoCo VMs
on Hyper-V don't allow all the possible synthetic devices. The Hyper-V synthetic
frame buffer is one such disallowed device, so the Hyper-V DRM driver was not
hardened. The allowed devices can be seen in the vmbus_devs[] array
where .allowed_in_isolation is "true".

All that said, I'm in favor of adding hardening, even if it is done piecemeal.
There won't likely be a comprehensive effort to harden the Hyper-V DRM
driver unless the frame buffer device becomes available in a CoCo VM.

> 
> When that bounds check (or any later failure in
> hyperv_get_supported_resolution()) returns an error, the caller in
> hyperv_connect_vsp() previously logged a warning and continued without
> populating hv->screen_width_max / hv->screen_height_max / preferred_*.
> hyperv_mode_config_init() then set dev->mode_config.max_width and
> max_height to 0, which makes drm_internal_framebuffer_create() reject
> every userspace framebuffer with -EINVAL. Populate the fields with the
> WIN8 defaults that the pre-WIN10 branch already uses so a failed
> resolution probe degrades to a usable display instead of disabling it.

Yes, this also makes sense.

> 
> The driver also issues three sequential VSP requests (negotiate
> version, update VRAM location, get supported resolution) that share a
> single hv->wait completion. None of the call sites call
> reinit_completion() between requests. If wait_for_completion_timeout()
> returns 0 but a delayed response later triggers complete(&hv->wait) in
> the receive callback, the next request's wait can consume that stale
> completion, return immediately, and parse stale data out of
> hv->init_buf. Call reinit_completion() before each send so every
> request waits for its own response.

This change should probably be a separate patch, as it's not really
related to the bounds checking issue. You could even argue that the
bounds check and using the default size values in case of an error
should be separate patches, but it's a judgment call and I could go
either way. Combining the first two works for me, though someone
else might disagree.

All that notwithstanding, I don't think your fix is needed in its current
form. If wait_for_completion_timeout() times out in "negotiate version"
or "update VRAM location", the code returns -ETIMEDOUT. The caller
then bails out and does not send the next message in the sequence.
Eventually hyperv_vmbus_probe() fails and the struct hyperv_drm_device
memory is freed. A similar thing could happen with hyperv_vmbus_resume().
The memory isn't freed, but if the resume fails and is tried again later, it
should be with a fresh copy of the saved memory image. In that case, the
completion should be clean for the retry.

The problem case is "get supported resolution", which would leave the
completion in a pending state if it times out. This could later affect
hyperv_vmbus_resume().  I could see reinit'ing the completion in
hyperv_vmbus_resume() to handle this case.

But double-check my thinking on all this. Maybe I'm missing something.

One other comment:  In my view, your commit message is a bit too
detailed. A commit message should focus on "why" the change, and
less on the details of the implementation. Explaining the failure case
is fine, but don't recapitulate code that is straightforward and obvious.

> 
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Cc: stable@vger.kernel.org # 5.14+
> Signed-off-by: Berkant Koc <me@berkoc.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc526832..3b5065fe06e4 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -223,6 +223,7 @@ static int hyperv_negotiate_version(struct hv_device *hdev, u32 ver)
>  	msg->vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
>  		sizeof(struct synthvid_version_req);
>  	msg->ver_req.version = ver;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);
> 
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
> @@ -257,6 +258,7 @@ int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
>  	msg->vram.user_ctx = vram_pp;
>  	msg->vram.vram_gpa = vram_pp;
>  	msg->vram.is_vram_gpa_specified = 1;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);
> 
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
> @@ -383,6 +385,7 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  		sizeof(struct synthvid_supported_resolution_req);
>  	msg->resolution_req.maximum_resolution_count =
>  		SYNTHVID_MAX_RESOLUTION_COUNT;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);
> 
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
> @@ -391,8 +394,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  		return -ETIMEDOUT;
>  	}
> 
> -	if (msg->resolution_resp.resolution_count == 0) {
> -		drm_err(dev, "No supported resolutions\n");
> +	if (msg->resolution_resp.resolution_count == 0 ||
> +	    msg->resolution_resp.resolution_count >
> +	    SYNTHVID_MAX_RESOLUTION_COUNT) {
> +		drm_err(dev, "Invalid resolution count: %d\n",
> +			msg->resolution_resp.resolution_count);
>  		return -ENODEV;
>  	}
> 
> @@ -506,8 +512,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)
> 
>  	if (hyperv_version_ge(hv->synthvid_version, SYNTHVID_VERSION_WIN10)) {
>  		ret = hyperv_get_supported_resolution(hdev);
> -		if (ret)
> +		if (ret) {
>  			drm_err(dev, "Failed to get supported resolution from host, use default\n");
> +			hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
> +			hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
> +			hv->preferred_width = SYNTHVID_WIDTH_WIN8;
> +			hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
> +		}
>  	} else {
>  		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
>  		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;

Is there a separate problem here in that preferred_width and preferred_height
are not set in the pre-WIN10 case? I'm not familiar enough with DRM driver code
to know how these preferred values are used. It looks inconsistent to have the
two fallback cases be different.

Also, having to duplicate the fallback code is distasteful. Instead of having an
"else" clause, maybe have a follow-up test for screen_width_max (or any of the
values) being zero as an indicator that they haven't been set, and apply the defaults in
that case?

Michael

^ permalink raw reply

* [PATCH net v2] vsock: keep poll shutdown state consistent
From: Ziyu Zhang @ 2026-05-19 16:56 UTC (permalink / raw)
  To: Stefano Garzarella, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stefan Hajnoczi, Michael S . Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Bryan Tan, Vishnu Dasa,
	bcm-kernel-feedback-list, virtualization, netdev, linux-kernel,
	linux-hyperv, kvm, baijiaju1990, r33s3n6, gality369,
	zhenghaoran154, hanguidong02, zzzccc427, Ziyu Zhang

vsock_poll() reads vsk->peer_shutdown before taking the socket lock
to set EPOLLHUP and EPOLLRDHUP, then reads it again after taking
the lock to report EOF readability. A shutdown packet can update
peer_shutdown while poll is waiting for the lock, so one poll invocation
can report EOF readability without the corresponding HUP/RDHUP bits.

For connectible sockets, take one peer_shutdown snapshot after
lock_sock() and use it for all peer-shutdown-derived poll bits. For
datagram sockets, which do not take lock_sock() in poll(), take one
lockless READ_ONCE() snapshot and pair it with WRITE_ONCE() on the
writer side.

This keeps the peer-shutdown-derived bits internally consistent for each
poll pass.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
---
Link: https://lore.kernel.org/netdev/20260516034745.260442-1-ziyuzhang201@gmail.com/

v2:
- Pair lockless READ_ONCE() users with WRITE_ONCE() on peer_shutdown writers.
- Move datagram shutdown handling into the SOCK_DGRAM branch and add a comment.
- Keep one connectible peer_shutdown snapshot after lock_sock() and
  restore the previous shutdown-derived mask ordering.

 net/vmw_vsock/af_vsock.c                | 49 ++++++++++++++++---------
 net/vmw_vsock/hyperv_transport.c        |  9 +++--
 net/vmw_vsock/virtio_transport_common.c | 14 ++++---
 net/vmw_vsock/vmci_transport.c          |  8 ++--
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index adcba1b7b..789b00f6e 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -523,7 +523,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		 */
 		sock_reset_flag(sk, SOCK_DONE);
 		sk->sk_state = TCP_CLOSE;
-		vsk->peer_shutdown = 0;
+		WRITE_ONCE(vsk->peer_shutdown, 0);
 	}
 
 	if (sk->sk_type == SOCK_SEQPACKET) {
@@ -814,7 +814,7 @@ static struct sock *__vsock_create(struct net *net,
 	vsk->rejected = false;
 	vsk->sent_request = false;
 	vsk->ignore_connecting_rst = false;
-	vsk->peer_shutdown = 0;
+	WRITE_ONCE(vsk->peer_shutdown, 0);
 	INIT_DELAYED_WORK(&vsk->connect_work, vsock_connect_timeout);
 	INIT_DELAYED_WORK(&vsk->pending_work, vsock_pending_work);
 
@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
 	return err;
 }
 
+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
+{
+	__poll_t mask = 0;
+
+	/* INET sockets treat local write shutdown and peer write shutdown as a
+	 * case of EPOLLHUP set.
+	 */
+	if (sk->sk_shutdown == SHUTDOWN_MASK ||
+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
+	     (peer_shutdown & SEND_SHUTDOWN)))
+		mask |= EPOLLHUP;
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    peer_shutdown & SEND_SHUTDOWN)
+		mask |= EPOLLRDHUP;
+
+	return mask;
+}
+
 static __poll_t vsock_poll(struct file *file, struct socket *sock,
 			       poll_table *wait)
 {
@@ -1139,24 +1158,17 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 		/* Signify that there has been an error on this socket. */
 		mask |= EPOLLERR;
 
-	/* INET sockets treat local write shutdown and peer write shutdown as a
-	 * case of EPOLLHUP set.
-	 */
-	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
-	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
-	     (vsk->peer_shutdown & SEND_SHUTDOWN))) {
-		mask |= EPOLLHUP;
-	}
-
-	if (sk->sk_shutdown & RCV_SHUTDOWN ||
-	    vsk->peer_shutdown & SEND_SHUTDOWN) {
-		mask |= EPOLLRDHUP;
-	}
-
 	if (sk_is_readable(sk))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
 	if (sock->type == SOCK_DGRAM) {
+		u32 peer_shutdown = READ_ONCE(vsk->peer_shutdown);
+
+		/* DGRAM sockets do not take lock_sock() in poll(), so use one
+		 * lockless snapshot for all shutdown-derived mask bits.
+		 */
+		mask |= vsock_poll_shutdown(sk, peer_shutdown);
+
 		/* For datagram sockets we can read if there is something in
 		 * the queue and write as long as the socket isn't shutdown for
 		 * sending.
@@ -1171,6 +1183,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 
 	} else if (sock_type_connectible(sk->sk_type)) {
 		const struct vsock_transport *transport;
+		u32 peer_shutdown;
 
 		lock_sock(sk);
 
@@ -1203,8 +1216,10 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 		 * terminated should also be considered read, and we check the
 		 * shutdown flag for that.
 		 */
+		peer_shutdown = READ_ONCE(vsk->peer_shutdown);
+		mask |= vsock_poll_shutdown(sk, peer_shutdown);
 		if (sk->sk_shutdown & RCV_SHUTDOWN ||
-		    vsk->peer_shutdown & SEND_SHUTDOWN) {
+		    peer_shutdown & SEND_SHUTDOWN) {
 			mask |= EPOLLIN | EPOLLRDNORM;
 		}
 
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 432fcbbd1..16b981566 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -264,7 +264,7 @@ static void hvs_do_close_lock_held(struct vsock_sock *vsk,
 	struct sock *sk = sk_vsock(vsk);
 
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown = SHUTDOWN_MASK;
+	WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 	if (vsock_stream_has_data(vsk) <= 0)
 		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
@@ -593,7 +593,9 @@ static int hvs_update_recv_data(struct hvsock *hvs)
 		return -EIO;
 
 	if (payload_len == 0)
-		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
+		WRITE_ONCE(hvs->vsk->peer_shutdown,
+			   READ_ONCE(hvs->vsk->peer_shutdown) |
+			   SEND_SHUTDOWN);
 
 	hvs->recv_data_len = payload_len;
 	hvs->recv_data_off = 0;
@@ -715,7 +717,8 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk)
 			return ret;
 		return hvs->recv_data_len;
 	case 0:
-		vsk->peer_shutdown |= SEND_SHUTDOWN;
+		WRITE_ONCE(vsk->peer_shutdown,
+			   READ_ONCE(vsk->peer_shutdown) | SEND_SHUTDOWN);
 		ret = 0;
 		break;
 	default: /* -1 */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index dcc8a1d58..71d8eac82 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1220,7 +1220,7 @@ static void virtio_transport_do_close(struct vsock_sock *vsk,
 	struct sock *sk = sk_vsock(vsk);
 
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown = SHUTDOWN_MASK;
+	WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 	if (vsock_stream_has_data(vsk) <= 0)
 		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
@@ -1411,12 +1411,15 @@ virtio_transport_recv_connected(struct sock *sk,
 	case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
 		sk->sk_write_space(sk);
 		break;
-	case VIRTIO_VSOCK_OP_SHUTDOWN:
+	case VIRTIO_VSOCK_OP_SHUTDOWN: {
+		u32 peer_shutdown = READ_ONCE(vsk->peer_shutdown);
+
 		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
-			vsk->peer_shutdown |= RCV_SHUTDOWN;
+			peer_shutdown |= RCV_SHUTDOWN;
 		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
-			vsk->peer_shutdown |= SEND_SHUTDOWN;
-		if (vsk->peer_shutdown == SHUTDOWN_MASK) {
+			peer_shutdown |= SEND_SHUTDOWN;
+		WRITE_ONCE(vsk->peer_shutdown, peer_shutdown);
+		if (peer_shutdown == SHUTDOWN_MASK) {
 			if (vsock_stream_has_data(vsk) <= 0 && !sock_flag(sk, SOCK_DONE)) {
 				(void)virtio_transport_reset(vsk, NULL);
 				virtio_transport_do_close(vsk, true);
@@ -1431,6 +1434,7 @@ virtio_transport_recv_connected(struct sock *sk,
 		if (le32_to_cpu(virtio_vsock_hdr(skb)->flags))
 			sk->sk_state_change(sk);
 		break;
+	}
 	case VIRTIO_VSOCK_OP_RST:
 		virtio_transport_do_close(vsk, true);
 		break;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 7eccd6708..c2231c402 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -811,7 +811,7 @@ static void vmci_transport_handle_detach(struct sock *sk)
 		/* On a detach the peer will not be sending or receiving
 		 * anymore.
 		 */
-		vsk->peer_shutdown = SHUTDOWN_MASK;
+		WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 
 		/* We should not be sending anymore since the peer won't be
 		 * there to receive, but we can still receive if there is data
@@ -1534,7 +1534,9 @@ static int vmci_transport_recv_connected(struct sock *sk,
 		if (pkt->u.mode) {
 			vsk = vsock_sk(sk);
 
-			vsk->peer_shutdown |= pkt->u.mode;
+			WRITE_ONCE(vsk->peer_shutdown,
+				   READ_ONCE(vsk->peer_shutdown) |
+				   pkt->u.mode);
 			sk->sk_state_change(sk);
 		}
 		break;
@@ -1551,7 +1553,7 @@ static int vmci_transport_recv_connected(struct sock *sk,
 		 * a clean shutdown.
 		 */
 		sock_set_flag(sk, SOCK_DONE);
-		vsk->peer_shutdown = SHUTDOWN_MASK;
+		WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 		if (vsock_stream_has_data(vsk) <= 0)
 			sk->sk_state = TCP_CLOSING;
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
From: Paolo Abeni @ 2026-05-19 13:30 UTC (permalink / raw)
  To: longli
  Cc: kotaranov, kuba, davem, edumazet, andrew+netdev, jgg, leon,
	haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260519132825.172490-1-pabeni@redhat.com>

On 5/19/26 3:28 PM, Paolo Abeni wrote:
> From: AI Reviewer <ai@example.com>
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> net: mana: Create separate EQs for each vPort
> 
> This patch moves EQ allocation from per-device to per-vPort, exports
> mana_create_eq()/mana_destroy_eq() for use by the RDMA driver, takes
> a vport reference for RSS QPs, and adds cross-port PD-sharing
> rejection for raw QPs along with serialization between
> mana_set_channels() and the RDMA vport configuration.
> 
>> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
>> index ac5e75dd3494..f8a9013f0ca3 100644
>> --- a/drivers/infiniband/hw/mana/main.c
>> +++ b/drivers/infiniband/hw/mana/main.c
>> @@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>>  
>>  	pd->vport_use_count++;
>>  	if (pd->vport_use_count > 1) {
>> +		/* Reject cross-port PD sharing. EQs and vport config
>> +		 * are per-port, so the PD must stay bound to the port
>> +		 * that was configured on the first raw QP creation.
>> +		 */
>> +		if (pd->vport_port != port) {
>> +			pd->vport_use_count--;
>> +			mutex_unlock(&pd->vport_mutex);
>> +			ibdev_dbg(&dev->ib_dev,
>> +				  "PD already bound to port %u\n",
>> +				  pd->vport_port);
>> +			return -EINVAL;
>> +		}
>>  		ibdev_dbg(&dev->ib_dev,
>>  			  "Skip as this PD is already configured vport\n");
>>  		mutex_unlock(&pd->vport_mutex);
>>  		return 0;
>>  	}
>>  
>> -	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
>> +	pd->vport_port = port;
>> +
>> +	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
>>  	if (err) {
>>  		pd->vport_use_count--;
>>  		mutex_unlock(&pd->vport_mutex);
> 
> [Low]
> In the failure path of mana_ib_cfg_vport(), if mana_cfg_vport() or the
> later mana_create_eq() fails, vport_use_count is rolled back to 0 but
> pd->vport_port keeps the value just written.
> 
> Is the contract that pd->vport_port is only meaningful while
> vport_use_count > 0?  That invariant is not documented in the struct
> mana_ib_pd comment, and a future reader who consults pd->vport_port
> without first checking vport_use_count under vport_mutex would see stale
> data.  Would it be worth either resetting pd->vport_port on the rollback
> path or documenting the rule explicitly?
> 
>> @@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>>  		return err;
>>  	}
>>  
>> -	mutex_unlock(&pd->vport_mutex);
>>  
>> -	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
>> -	pd->tx_vp_offset = mpc->tx_vp_offset;
>> +	err = mana_create_eq(mpc);
>> +	if (err) {
>> +		mana_uncfg_vport(mpc);
>> +		pd->vport_use_count--;
>> +	} else {
>> +		pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
>> +		pd->tx_vp_offset = mpc->tx_vp_offset;
>> +	}
> 
> [ ... ]
> 
>> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
>> index 0fbcf449c134..108ec4c5ce51 100644
>> --- a/drivers/infiniband/hw/mana/qp.c
>> +++ b/drivers/infiniband/hw/mana/qp.c
>> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>>  
>>  	qp->port = port;
>>  
>> +	/* Take a reference on the vport to ensure EQs outlive this QP.
>> +	 * The vport must already be configured by a raw QP.
>> +	 */
>> +	mutex_lock(&mana_pd->vport_mutex);
>> +	if (!mana_pd->vport_use_count) {
>> +		mutex_unlock(&mana_pd->vport_mutex);
>> +		ret = -EINVAL;
>> +		goto fail;
>> +	}
>> +	mana_pd->vport_use_count++;
>> +	mutex_unlock(&mana_pd->vport_mutex);
>> +
> 
> [Critical]
> The cross-port PD-sharing check added to mana_ib_cfg_vport() is not
> mirrored here.  mana_ib_create_qp_rss() takes the user-supplied port
> from ucmd.port and only checks that vport_use_count is non-zero, so the
> RSS path will happily attach to a different port than the one bound to
> the PD by the first raw QP.
> 
> If a user creates a raw QP on port A (which sets pd->vport_port=A and
> allocates mpc_A->eqs via mana_create_eq()), then creates an RSS QP on
> the same PD with ucmd.port=B where port B has no Ethernet up and no
> RDMA activity, mpc_B->eqs is NULL.  The subsequent code in this same
> function:
> 
> 	eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
> 	cq_spec.attached_eq = eq->eq->id;
> 
> would then dereference NULL through mpc_B->eqs.
> 
> If port B does have Ethernet up, mpc_B->eqs is owned by the Ethernet
> driver and the RSS QP attaches to those EQs.  When the QP is destroyed
> mana_ib_destroy_qp_rss() calls mana_ib_uncfg_vport(mdev, pd, qp->port=B),
> and once pd->vport_use_count reaches 0 mana_ib_uncfg_vport() will run
> mana_destroy_eq(mpc_B) on Ethernet's live EQs and call mana_uncfg_vport
> on a port whose apc->vport_use_count was never incremented by this PD,
> tripping the WARN_ON in mana_uncfg_vport() and corrupting Ethernet's
> vport state.  Meanwhile port A's EQs and vport configuration are leaked
> because nothing on this PD will destroy them.
> 
> Should mana_ib_create_qp_rss() apply the same pd->vport_port == port
> check that mana_ib_cfg_vport() now performs, before bumping
> vport_use_count?

Sashiko reported alredy this problematic pattern in 2 separate places.
Please ensure that there are no other similar buggy usage pattern
elsewhere in the newly introduced code.

/P


^ permalink raw reply

* Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
From: Paolo Abeni @ 2026-05-19 13:28 UTC (permalink / raw)
  To: longli
  Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
	leon, haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-2-longli@microsoft.com>

From: AI Reviewer <ai@example.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: mana: Create separate EQs for each vPort

This patch moves EQ allocation from per-device to per-vPort, exports
mana_create_eq()/mana_destroy_eq() for use by the RDMA driver, takes
a vport reference for RSS QPs, and adds cross-port PD-sharing
rejection for raw QPs along with serialization between
mana_set_channels() and the RDMA vport configuration.

> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..f8a9013f0ca3 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  
>  	pd->vport_use_count++;
>  	if (pd->vport_use_count > 1) {
> +		/* Reject cross-port PD sharing. EQs and vport config
> +		 * are per-port, so the PD must stay bound to the port
> +		 * that was configured on the first raw QP creation.
> +		 */
> +		if (pd->vport_port != port) {
> +			pd->vport_use_count--;
> +			mutex_unlock(&pd->vport_mutex);
> +			ibdev_dbg(&dev->ib_dev,
> +				  "PD already bound to port %u\n",
> +				  pd->vport_port);
> +			return -EINVAL;
> +		}
>  		ibdev_dbg(&dev->ib_dev,
>  			  "Skip as this PD is already configured vport\n");
>  		mutex_unlock(&pd->vport_mutex);
>  		return 0;
>  	}
>  
> -	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
> +	pd->vport_port = port;
> +
> +	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
>  	if (err) {
>  		pd->vport_use_count--;
>  		mutex_unlock(&pd->vport_mutex);

[Low]
In the failure path of mana_ib_cfg_vport(), if mana_cfg_vport() or the
later mana_create_eq() fails, vport_use_count is rolled back to 0 but
pd->vport_port keeps the value just written.

Is the contract that pd->vport_port is only meaningful while
vport_use_count > 0?  That invariant is not documented in the struct
mana_ib_pd comment, and a future reader who consults pd->vport_port
without first checking vport_use_count under vport_mutex would see stale
data.  Would it be worth either resetting pd->vport_port on the rollback
path or documenting the rule explicitly?

> @@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  		return err;
>  	}
>  
> -	mutex_unlock(&pd->vport_mutex);
>  
> -	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> -	pd->tx_vp_offset = mpc->tx_vp_offset;
> +	err = mana_create_eq(mpc);
> +	if (err) {
> +		mana_uncfg_vport(mpc);
> +		pd->vport_use_count--;
> +	} else {
> +		pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> +		pd->tx_vp_offset = mpc->tx_vp_offset;
> +	}

[ ... ]

> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134..108ec4c5ce51 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  
>  	qp->port = port;
>  
> +	/* Take a reference on the vport to ensure EQs outlive this QP.
> +	 * The vport must already be configured by a raw QP.
> +	 */
> +	mutex_lock(&mana_pd->vport_mutex);
> +	if (!mana_pd->vport_use_count) {
> +		mutex_unlock(&mana_pd->vport_mutex);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	mana_pd->vport_use_count++;
> +	mutex_unlock(&mana_pd->vport_mutex);
> +

[Critical]
The cross-port PD-sharing check added to mana_ib_cfg_vport() is not
mirrored here.  mana_ib_create_qp_rss() takes the user-supplied port
from ucmd.port and only checks that vport_use_count is non-zero, so the
RSS path will happily attach to a different port than the one bound to
the PD by the first raw QP.

If a user creates a raw QP on port A (which sets pd->vport_port=A and
allocates mpc_A->eqs via mana_create_eq()), then creates an RSS QP on
the same PD with ucmd.port=B where port B has no Ethernet up and no
RDMA activity, mpc_B->eqs is NULL.  The subsequent code in this same
function:

	eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
	cq_spec.attached_eq = eq->eq->id;

would then dereference NULL through mpc_B->eqs.

If port B does have Ethernet up, mpc_B->eqs is owned by the Ethernet
driver and the RSS QP attaches to those EQs.  When the QP is destroyed
mana_ib_destroy_qp_rss() calls mana_ib_uncfg_vport(mdev, pd, qp->port=B),
and once pd->vport_use_count reaches 0 mana_ib_uncfg_vport() will run
mana_destroy_eq(mpc_B) on Ethernet's live EQs and call mana_uncfg_vport
on a port whose apc->vport_use_count was never incremented by this PD,
tripping the WARN_ON in mana_uncfg_vport() and corrupting Ethernet's
vport state.  Meanwhile port A's EQs and vport configuration are leaked
because nothing on this PD will destroy them.

Should mana_ib_create_qp_rss() apply the same pd->vport_port == port
check that mana_ib_cfg_vport() now performs, before bumping
vport_use_count?

[Medium]
The new "vport must already be configured by a raw QP" requirement is a
visible uAPI change.  Before this patch, EQs were allocated device-wide
on mana_context at probe time, so an RSS QP could be created without any
prior raw QP on the PD.  After the patch, applications that legitimately
use only RSS QPs (RX-only RDMA workloads) will see -EINVAL here unless
they also create a raw QP first.

Was breaking the RSS-only case considered, and would lazy EQ creation on
the first QP regardless of type be feasible to avoid the
incompatibility?

>  	for (i = 0; i < ind_tbl_size; i++) {
>  		struct mana_obj_spec wq_spec = {};
>  		struct mana_obj_spec cq_spec = {};
> @@ -171,13 +184,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  		cq_spec.gdma_region = cq->queue.gdma_region;
>  		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
>  		cq_spec.modr_ctx_id = 0;
> -		eq = &mpc->ac->eqs[cq->comp_vector];
> +		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>  		cq_spec.attached_eq = eq->eq->id;

[ ... ]

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 04350973e19e..4633acc976f0 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -462,10 +467,22 @@ static int mana_set_channels(struct net_device *ndev,
>  	unsigned int new_count = channels->combined_count;
>  	unsigned int old_count = apc->num_queues;
>  	int err;
>  
> +	/* Set channel_changing to block RDMA from grabbing the vport
> +	 * during the detach/attach window. mana_cfg_vport() checks
> +	 * this flag under vport_mutex and returns -EBUSY if set.
> +	 */
> +	mutex_lock(&apc->vport_mutex);
> +	if (!apc->port_is_up && apc->vport_use_count) {
> +		mutex_unlock(&apc->vport_mutex);
> +		return -EBUSY;
> +	}
> +	apc->channel_changing = true;
> +	mutex_unlock(&apc->vport_mutex);

[Low]
The commit message says:

    When the port is down, apc->vport_mutex is held for the entire
    operation since mana_detach()/mana_attach() do not take vport_mutex
    in that case. When the port is up, Ethernet owns the vport
    exclusively so no additional locking is needed.

Does this description match the code?  mana_set_channels() drops
apc->vport_mutex immediately after setting channel_changing=true and
runs mana_pre_alloc_rxbufs(), mana_detach() and mana_attach() without
the mutex held, then re-acquires the mutex only at the end to clear the
flag.

The actual serialization mechanism is the channel_changing boolean
checked by mana_cfg_vport() under vport_mutex, not a long mutex hold.
Could the commit message be updated to match the implemented design so
that future readers do not assume a different invariant?
-- 
This is an AI-generated review.


^ permalink raw reply

* Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: Jason Gunthorpe @ 2026-05-19 12:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Mukesh R, hpa, robin.murphy, robh, wei.liu, mhklinux, muislam,
	namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260518224136.0000403e@linux.microsoft.com>

On Mon, May 18, 2026 at 10:41:36PM -0700, Jacob Pan wrote:

> Just wondering what work is needed to support this "direct attach"? I
> felt this issue is due to trying to cram two distinct domain types
> (paging domain & direct attach) into the VFIO container model where
> only unmanaged paging domain is supported.

Xen has the same issue and you two need to come up with a uniform
solution.

VFIO container can't support it, that's out.

You should be focusing on a iommufd flow that accepts some FD
representing the VM (ie KVM FD) that can be converted by the driver
into a HWPT representing that FD's S2 translation.

> I am thinking if we were to switch to iommufd and let user(vmm) have
> direct control of HWPT, vmm will be able to selectively use a
> different domain type to handle direct attach. 

Yes

> IMHO, it is essentially the same as attaching nest parent domain
> without nested domain immediately attached. The unprivileged guest
> may attach nested domain directly with Hyper-V if nested translation
> is needed.

nest parent domain is really for supporting the viommu objects.. If
you don't have that flow you don't need to worry about that nest
parent stuff.

> From this driver POV, it will allocate a 2nd stage only domain with
> different domain ops (w/o map/unmap) for "direct attach" thus avoid this
> hack.

Yes, from a driver POV you need a unique iommu_domain allocator that
returns an iommu_domain without an ops.

It should probably work similarly to the viommu where the iommufd path
can send in a driver-specific tagged struct that can describe these
special domains.

But be mindful of the lifetime rules, whatever ID is used to describe
the VM at the hypercall boundary has to be bound into a linux FD and
become immutable. The driver has to hold that FD as long as the domain
exists.

Jason

^ permalink raw reply

* Re: [PATCH net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
From: patchwork-bot+netdevbpf @ 2026-05-19 11:10 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, horms, kees, shacharr,
	stephen, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260514194156.466823-1-ernis@linux.microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Thu, 14 May 2026 12:41:51 -0700 you wrote:
> In mana_hwc_rx_event_handler(), resp->response.hwc_msg_id is read from
> DMA-coherent memory and bounds-checked, then mana_hwc_handle_resp()
> re-reads the same field from the same DMA buffer for test_bit() and
> pointer arithmetic.
> 
> DMA-coherent memory is mapped uncacheable on x86 and is shared,
> unencrypted, in Confidential VMs (SEV-SNP/TDX), so each load goes
> directly to host-visible memory. A H/W can modify the value
> between the check and the use, bypassing the bounds validation.
> 
> [...]

Here is the summary with links:
  - [net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
    https://git.kernel.org/netdev/net/c/35f0f0a2536a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox