Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH v3 0/2] drm/hyperv: harden host message parsing
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <20260517-drm-hyperv-cover-v2@berkoc.com>

Two independent issues in the synthetic video driver that both stem
from trusting unvalidated host data.

1/2 bounds resolution_count from SYNTHVID_RESOLUTION_RESPONSE against
the supported_resolution[] array, and populates WIN8 defaults for
hv->screen_*_max / hv->preferred_* in both the WIN10-probe-failure
path and the pre-WIN10 path, so a failed or pre-WIN10 probe yields
a usable display instead of having drm_internal_framebuffer_create()
reject every userspace framebuffer with -EINVAL.

2/2 forwards bytes_recvd from vmbus_recvpacket() into the sub-handler,
rejects packets that do not cover the synthvid header, and additionally
requires the type-specific payload size before memcpy/complete or
before reading the feature-change byte. Rejected packets are logged
via drm_err_ratelimited() instead of being silently dropped, matching
the CoCo-hardened pattern in hv_kvp_onchannelcallback().

Changes since v2 (per review by Michael Kelley):

  1/2: dropped the reinit_completion() change. Kelley pointed out that
       the negotiate-version and update-vram-location timeouts cause
       hyperv_vmbus_probe() to fail and free the device, so the stale
       completion can only outlive its request in hyperv_vmbus_resume()
       after a get_supported_resolution() timeout. That is a narrower
       fix and belongs in a separate patch against the resume path.
       Subject and commit message rewritten to reflect that this patch
       is now bounds-check + WIN8 fallback only. Pre-WIN10 branch now
       also populates hv->preferred_* (Kelley spotted the gap).
       Followed the post-probe-test refactor Kelley suggested: the else
       branch is gone, a single screen_width_max == 0 check covers
       both the pre-WIN10 case and a failed WIN10 probe.

  2/2: dropped the redundant upper bound on bytes_recvd. Added a
       per-type switch for the three completion-driving message types
       (SYNTHVID_VERSION_RESPONSE, SYNTHVID_RESOLUTION_RESPONSE,
       SYNTHVID_VRAM_LOCATION_ACK) so the wait-completion path
       validates payload size before memcpy/complete. Every reject
       path now emits drm_err_ratelimited() rather than returning
       silently. Commit message rewritten to lead with the residue
       read, with "wasteful copy" reframed as the secondary observation.

Changes since v1:
  1/2: bound resolution_count check folded into the existing zero
       check; populate WIN8 defaults when hyperv_get_supported_resolution()
       fails.
  2/2: forward bytes_recvd into hyperv_receive_sub(); enforce the
       pipe + synthvid header minimum; check synthvid_feature_change
       payload size before reading is_dirt_needed.

Both patches carry an Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
trailer per the kernel coding-assistants policy. Code, analysis and
review responses are mine; the model is used as a structured reviewer
under human verification.

base-commit: 4bf5d3da79c48e1df4bab82c9680c53adeff7820

Berkant Koc (2):
  drm/hyperv: validate resolution_count and fix WIN8 fallback
  drm/hyperv: validate VMBus packet size in receive callback

 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 58 ++++++++++++++++++++---
 1 file changed, 52 insertions(+), 6 deletions(-)

--
2.47.3

^ permalink raw reply

* [PATCH v3 1/2] drm/hyperv: validate resolution_count and fix WIN8 fallback
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <cover.1779221339.git.me@berkoc.com>

A SYNTHVID_RESOLUTION_RESPONSE with resolution_count > 64 walks past
the supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT] array in the
parse loop. Bound resolution_count against the array size, folded
into the existing zero-check.

When the WIN10 resolution probe fails, the caller in
hyperv_connect_vsp() left hv->screen_*_max / preferred_* unpopulated,
which sets mode_config.max_width / max_height to 0 and makes
drm_internal_framebuffer_create() reject every userspace framebuffer
with -EINVAL. The pre-WIN10 branch had the same gap for
preferred_width / preferred_height. Use a single post-probe fallback
guarded by screen_width_max == 0 so both paths converge on the WIN8
defaults.

Signed-off-by: Berkant Koc <me@berkoc.com>
Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 051ecc526..c3d0ff229 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -391,8 +391,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		return -ETIMEDOUT;
 	}
 
-	if (msg->resolution_resp.resolution_count == 0) {
-		drm_err(dev, "No supported resolutions\n");
+	if (msg->resolution_resp.resolution_count == 0 ||
+	    msg->resolution_resp.resolution_count >
+	    SYNTHVID_MAX_RESOLUTION_COUNT) {
+		drm_err(dev, "Invalid resolution count: %d\n",
+			msg->resolution_resp.resolution_count);
 		return -ENODEV;
 	}
 
@@ -508,9 +511,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)
 		ret = hyperv_get_supported_resolution(hdev);
 		if (ret)
 			drm_err(dev, "Failed to get supported resolution from host, use default\n");
-	} else {
+	}
+
+	if (!hv->screen_width_max) {
 		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
 		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
+		hv->preferred_width = SYNTHVID_WIDTH_WIN8;
+		hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
 	}
 
 	hv->mmio_megabytes = hdev->channel->offermsg.offer.mmio_megabytes;
-- 
2.47.3


^ permalink raw reply related

* [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <cover.1779221339.git.me@berkoc.com>

hyperv_receive_sub() reads msg->vid_hdr.type and dispatches into one
of four message-type branches without knowing how many bytes the host
wrote into hv->recv_buf. The completion path then runs
memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE), so the consumer
that wakes on wait_for_completion_timeout() can read up to 16 KiB of
residue from a prior message as if it were the response payload.

Pass bytes_recvd into hyperv_receive_sub() and reject any packet that
does not cover the pipe + synthvid header. For each of the three
completion-driving types (SYNTHVID_VERSION_RESPONSE,
SYNTHVID_RESOLUTION_RESPONSE, SYNTHVID_VRAM_LOCATION_ACK) also
require the type-specific payload before memcpy/complete, and apply
the same rule to SYNTHVID_FEATURE_CHANGE before reading is_dirt_needed.
The memcpy then uses bytes_recvd, which is bounded by
VMBUS_MAX_PACKET_SIZE through the call to vmbus_recvpacket().

Rejected packets are reported via drm_err_ratelimited() rather than
silently dropped, matching the CoCo-hardened pattern in
hv_kvp_onchannelcallback().

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 42 +++++++++++++++++++++--
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index c3d0ff229..12d3feb4f 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -420,26 +420,62 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 	return 0;
 }
 
-static void hyperv_receive_sub(struct hv_device *hdev)
+static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 {
 	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg;
+	size_t hdr_size;
 
 	if (!hv)
 		return;
 
+	hdr_size = sizeof(struct pipe_msg_hdr) +
+		   sizeof(struct synthvid_msg_hdr);
+	if (bytes_recvd < hdr_size) {
+		drm_err_ratelimited(&hv->dev,
+				    "synthvid packet too small for header: %u\n",
+				    bytes_recvd);
+		return;
+	}
+
 	msg = (struct synthvid_msg *)hv->recv_buf;
 
 	/* Complete the wait event */
 	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
-		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
+		size_t need = hdr_size;
+
+		switch (msg->vid_hdr.type) {
+		case SYNTHVID_VERSION_RESPONSE:
+			need += sizeof(struct synthvid_version_resp);
+			break;
+		case SYNTHVID_RESOLUTION_RESPONSE:
+			need += sizeof(struct synthvid_supported_resolution_resp);
+			break;
+		case SYNTHVID_VRAM_LOCATION_ACK:
+			need += sizeof(struct synthvid_vram_location_ack);
+			break;
+		}
+		if (bytes_recvd < need) {
+			drm_err_ratelimited(&hv->dev,
+					    "synthvid packet too small for type %u: %u < %zu\n",
+					    msg->vid_hdr.type, bytes_recvd, need);
+			return;
+		}
+		memcpy(hv->init_buf, msg, bytes_recvd);
 		complete(&hv->wait);
 		return;
 	}
 
 	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
+		if (bytes_recvd < hdr_size +
+		    sizeof(struct synthvid_feature_change)) {
+			drm_err_ratelimited(&hv->dev,
+					    "synthvid feature change packet too small: %u\n",
+					    bytes_recvd);
+			return;
+		}
 		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
 		if (hv->dirt_needed)
 			hyperv_hide_hw_ptr(hv->hdev);
@@ -466,7 +502,7 @@ static void hyperv_receive(void *ctx)
 				       &bytes_recvd, &req_id);
 		if (bytes_recvd > 0 &&
 		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
-			hyperv_receive_sub(hdev);
+			hyperv_receive_sub(hdev, bytes_recvd);
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH v3 1/2] drm/hyperv: validate resolution_count and fix WIN8 fallback
From: sashiko-bot @ 2026-05-19 20:55 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <1b88bc7edeb2f0153475225b67f19aaca629eca8.1779221799.git.me@berkoc.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Unsynchronized access to the shared `hv->init_buf` introduces Double Fetch/TOCTOU vulnerabilities, allowing a malicious host to bypass bounds checks and leak kernel memory.
- [High] The driver reuses the `hv->wait` completion for synchronous messages without re-initializing it, which can cause spurious wakeups and synchronization bypass.
--

commit dde0fd11b79967ca7df11b1049abebec363d25a5
Author: Berkant Koc <me@berkoc.com>

drm/hyperv: validate resolution_count and fix WIN8 fallback

Adds bounds checking for resolution_count to prevent walking past the
supported_resolution array. It also adds fallback handling to ensure WIN8
defaults are used when the resolution probe fails.

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc5268321..c3d0ff229e3d0 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -387,12 +387,15 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	hyperv_sendpacket(hdev, msg);
>  
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);

[Severity: High]
This isn't a bug introduced by this patch, but does the driver need to call
reinit_completion(&hv->wait) before sending requests?

If a delayed or unsolicited response arrives, hyperv_receive_sub() calls
complete(&hv->wait), incrementing its internal counter.  When this function
is called later, wait_for_completion_timeout() will return immediately
without waiting for the actual response.

Could this cause the driver to erroneously parse its own outgoing request
buffer as a response?

>  	if (!t) {
>  		drm_err(dev, "Time out on waiting resolution response\n");
>  		return -ETIMEDOUT;
>  	}
>  
> -	if (msg->resolution_resp.resolution_count == 0) {
> -		drm_err(dev, "No supported resolutions\n");
> +	if (msg->resolution_resp.resolution_count == 0 ||
> +	    msg->resolution_resp.resolution_count >
> +	    SYNTHVID_MAX_RESOLUTION_COUNT) {

[Severity: Critical]
This is a pre-existing issue, but does reading directly from the shared
hv->init_buf introduce a TOCTOU or Double Fetch problem here?

Since hv->init_buf is also used by the asynchronous hyperv_receive_sub()
callback, could a malicious host send a valid count to pass this check,
and then immediately overwrite hv->init_buf with a large value before the
subsequent for-loop evaluates i < msg->resolution_resp.resolution_count?

Additionally, when the driver prepares requests earlier in this function
(e.g., setting msg->vid_hdr.size), an unsolicited response could overwrite
hv->init_buf. Could this trick vmbus_sendpacket() into using a
host-controlled size and leaking kernel memory?

> +		drm_err(dev, "Invalid resolution count: %d\n",
> +			msg->resolution_resp.resolution_count);
>  		return -ENODEV;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/cover.1779221339.git.me@berkoc.com?part=1

^ permalink raw reply

* Re: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: sashiko-bot @ 2026-05-19 21:34 UTC (permalink / raw)
  To: Berkant Koc; +Cc: linux-hyperv
In-Reply-To: <e6e63276cca2901641ab39029e4fd3d621b1ee92.1779221799.git.me@berkoc.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:

New issues:
- [Critical] Using `bytes_recvd` for `memcpy()` without checking `vmbus_recvpacket()` return value leads to a massive heap buffer overflow.
- [High] Strict sizeof() validation incorrectly rejects dynamically-sized SYNTHVID_RESOLUTION_RESPONSE packets.

Pre-existing issues:
- [High] Concurrent lockless write to `hv->init_buf` from VMBus callback allows a malicious host to overwrite data while the guest is validating it.
- [High] Missing `reinit_completion()` before reusing the shared `hv->wait` completion object.
--

commit 8035423b95c9b24007478019421c5f94fe1179ce
Author: Berkant Koc <me@berkoc.com>
Subject: drm/hyperv: validate VMBus packet size in receive callback

This patch passes bytes_recvd into hyperv_receive_sub() to validate that
packets cover the pipe and synthvid headers as well as the type-specific
payload. The memcpy operation is also modified to use bytes_recvd.

Cc: stable@vger.kernel.org # 5.14+

> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index c3d0ff229e3d0..12d3feb4f88ac 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
[ ... ]
> @@ -420,26 +420,62 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	return 0;
>  }
>  
> -static void hyperv_receive_sub(struct hv_device *hdev)
> +static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
[ ... ]
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		size_t need = hdr_size;
> +
> +		switch (msg->vid_hdr.type) {
> +		case SYNTHVID_VERSION_RESPONSE:
> +			need += sizeof(struct synthvid_version_resp);
> +			break;
> +		case SYNTHVID_RESOLUTION_RESPONSE:
> +			need += sizeof(struct synthvid_supported_resolution_resp);

[Severity: High]
Does this strict sizeof() check correctly handle dynamically-sized packets?

If the host sends a response sized only for the active elements (up to
resolution_count) rather than the maximum 64 elements, bytes_recvd will
be less than need. Will this cause valid packets to be falsely rejected
and break the driver initialization sequence?

> +			break;
> +		case SYNTHVID_VRAM_LOCATION_ACK:
> +			need += sizeof(struct synthvid_vram_location_ack);
> +			break;
> +		}
> +		if (bytes_recvd < need) {
> +			drm_err_ratelimited(&hv->dev,
> +					    "synthvid packet too small for type %u: %u < %zu\n",
> +					    msg->vid_hdr.type, bytes_recvd, need);
> +			return;
> +		}
> +		memcpy(hv->init_buf, msg, bytes_recvd);

[Severity: High]
This is a pre-existing issue, but does this pattern allow a malicious host
to overwrite data while the guest is validating it?

The callback unconditionally copies incoming response packets into
hv->init_buf via memcpy() and calls complete().

Since the waiting thread processes hv->init_buf in process context without
copying it to isolated local memory, can an untrusted host send a second
packet immediately after validation to trigger a time-of-check to
time-of-use (TOCTOU) data race?

>  		complete(&hv->wait);

[Severity: High]
This isn't a bug introduced by this patch, but does the driver need to call
reinit_completion(&hv->wait) before sending a new request?

If a prior request times out and its response arrives late, the completion
counter is incremented. Will the next wait immediately consume the
completion counter and process stale data left in hv->init_buf?

>  		return;
>  	}
[ ... ]
> @@ -466,7 +502,7 @@ static void hyperv_receive(void *ctx)
>  				       &bytes_recvd, &req_id);
>  		if (bytes_recvd > 0 &&
>  		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
> -			hyperv_receive_sub(hdev);
> +			hyperv_receive_sub(hdev, bytes_recvd);

[Severity: Critical]
If vmbus_recvpacket() encounters a packet larger than VMBUS_MAX_PACKET_SIZE
(16KB), it returns -ENOBUFS and sets bytes_recvd to the required packet
length.

Since ret == 0 is only checked in the while condition, does this pass an
unbounded bytes_recvd into the callback? If recv_buf still contains the
header from a previous successful iteration, it looks like
memcpy(hv->init_buf, msg, bytes_recvd) will copy more than 16KB into
hv->init_buf, causing a heap buffer overflow.

>  	} while (bytes_recvd > 0 && ret == 0);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/cover.1779221339.git.me@berkoc.com?part=2

^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: Jakub Kicinski @ 2026-05-19 22:55 UTC (permalink / raw)
  To: dipayanroy
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
	leitao, kees, john.fastabend, hawk, bpf, daniel, ast, sdf,
	yury.norov
In-Reply-To: <20260518194654.735580-2-dipayanroy@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mana: Add NULL guards in teardown path to prevent panic on attach failure

When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
and mana_dealloc_queues() to prevent NULL pointer dereferences.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a8..9e7e4bf526bf6 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
>  	struct mana_rxq *rxq;
>  	u32 rxq_idx;
>  
> -	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> -		rxq = apc->rxqs[rxq_idx];
> -		if (!rxq)
> -			continue;
> +	if (apc->rxqs) {
>  
> -		mana_destroy_rxq(apc, rxq, true);
> -		apc->rxqs[rxq_idx] = NULL;
> +		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> +			rxq = apc->rxqs[rxq_idx];
> +			if (!rxq)
> +				continue;
> +
> +			mana_destroy_rxq(apc, rxq, true);
> +			apc->rxqs[rxq_idx] = NULL;
> +		}
>  	}

If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
how does the interface recover if the user subsequently brings it back up?

When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
mana_alloc_queues() -> mana_add_rx_queues(), the code does:

    apc->rxqs[i] = rxq;

If apc->rxqs was left as NULL during the teardown, will this cause a NULL
pointer dereference on the next interface bring-up?

[ ... ]
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 * number of queues.
>  	 */
>  
> -	for (i = 0; i < apc->num_queues; i++) {
> -		txq = &apc->tx_qp[i].txq;
> -		tsleep = 1000;
> -		while (atomic_read(&txq->pending_sends) > 0 &&
> -		       time_before(jiffies, timeout)) {
> -			usleep_range(tsleep, tsleep + 1000);
> -			tsleep <<= 1;
> -		}
> -		if (atomic_read(&txq->pending_sends)) {
> -			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> -			if (err) {
> -				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> -					   err, atomic_read(&txq->pending_sends),
> -					   txq->gdma_txq_id);
> +	if (apc->tx_qp) {

In the error path where queue creation fails partway through, will
mana_destroy_txq() be called?

If mana_destroy_txq() unconditionally iterates over all apc->num_queues:

    for (i = 0; i < apc->num_queues; i++) {
        ...
        mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
        ...

Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
that were never fully initialized?

Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
with these uninitialized handles, and could handle 0 corrupt the firmware
state?

> +		for (i = 0; i < apc->num_queues; i++) {
> +			txq = &apc->tx_qp[i].txq;
> +			tsleep = 1000;
> +			while (atomic_read(&txq->pending_sends) > 0 &&
> +			       time_before(jiffies, timeout)) {
> +				usleep_range(tsleep, tsleep + 1000);
> +				tsleep <<= 1;
> +			}
> +			if (atomic_read(&txq->pending_sends)) {
> +				err =
> +				    pcie_flr(to_pci_dev(gd->gdma_context->dev));

If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
internal GDMA state.

Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
and mana_destroy_vport() are called. Will these functions attempt to send
commands to the now-destroyed GDMA admin queue via mana_send_request() and
wait for completions?

Could this lead to cascading timeouts since the hardware was just reset and
cannot process admin commands?

> +				if (err) {
> +					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> +						   err,
> +					    atomic_read(&txq->pending_sends),
> +					    txq->gdma_txq_id);
> +				}
> +				break;
>  			}
> -			break;
>  		}
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Jakub Kicinski @ 2026-05-19 22:55 UTC (permalink / raw)
  To: dipayanroy
  Cc: Jakub Kicinski, kys, haiyangz, wei.liu, decui, andrew+netdev,
	davem, edumazet, pabeni, leon, longli, kotaranov, horms,
	shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
	linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
	leitao, kees, john.fastabend, hawk, bpf, daniel, ast, sdf,
	yury.norov
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net: mana: Skip redundant detach in queue reset handler if already detached

This commit adds a check for netif_device_present() in the queue reset
handler. If the device is already detached, it skips the redundant
mana_detach() call and proceeds directly to mana_attach().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9e7e4bf526bf6..740f3c34db451 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
[ ... ]
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);

The fix here correctly prevents mana_detach() from running on an already
detached device. However, do identical issues remain in mana_change_mtu(),
mana_set_channels(), mana_set_ringparam(), and mana_bpf()?

If mana_attach() previously failed (for example, due to out of memory),
the port is left with apc->port_is_up = false, but it preserves
apc->port_st_save = true, and calls netif_device_detach(ndev).

If an admin tries to recover the port by retrying a configuration command
like changing the MTU or attaching XDP, these functions appear to
unconditionally call mana_detach(ndev, false) without the
!netif_device_present(ndev) check added here.

Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
port_is_up is already false, port_st_save is overwritten to false,
permanently erasing the intent that the port should be up.

The subsequent call to mana_attach() would then see port_st_save == false,
skip mana_alloc_queues(), set port_is_up = false, and successfully call
netif_device_attach(ndev).

Would this result in the configuration command returning 0 to userspace
while the port queues were never allocated, leaving the interface silently
broken until manually toggled down and up?

^ permalink raw reply

* [PATCH net] net: mana: validate rx_req_idx to prevent out-of-bounds array access
From: Aditya Garg @ 2026-05-20  5:15 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, horms, ernis, gargaditya,
	gargaditya, kees, stephen, shacharr, ssengar, linux-hyperv,
	netdev, linux-kernel

In mana_hwc_rx_event_handler(), rx_req_idx is derived from
sge->address in DMA-coherent memory. In Confidential VMs
(SEV-SNP/TDX), this memory is shared unencrypted and HW can modify
WQE contents at any time. No bounds check exists on rx_req_idx,
which can lead to an out-of-bounds access into reqs[].

Add bounds check on rx_req_idx in mana_hwc_rx_event_handler() before
using it to index the reqs[] array.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/hw_channel.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index dbbde0fa57e7..a60f733d1a07 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -266,6 +266,12 @@ static void mana_hwc_rx_event_handler(void *ctx, u32 gdma_rxq_id,
 	rq_base_addr = hwc_rxq->msg_buf->mem_info.dma_handle;
 	rx_req_idx = (sge->address - rq_base_addr) / hwc->max_req_msg_size;
 
+	if (rx_req_idx >= hwc_rxq->msg_buf->num_reqs) {
+		dev_err(hwc->dev, "HWC RX: wrong rx_req_idx=%llu, num_reqs=%u\n",
+			rx_req_idx, hwc_rxq->msg_buf->num_reqs);
+		return;
+	}
+
 	rx_req = &hwc_rxq->msg_buf->reqs[rx_req_idx];
 	resp = (struct gdma_resp_hdr *)rx_req->buf_va;
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v1 1/4] iommu: Move Hyper-V IOMMU driver to its own subdirectory
From: Yu Zhang @ 2026-05-20  6:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260515221918.GJ7702@ziepe.ca>

On Fri, May 15, 2026 at 07:19:18PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:05AM +0800, Yu Zhang wrote:
> > From: Easwar Hariharan <eahariha@linux.microsoft.com>
> > 
> > The Hyper-V IOMMU driver currently only supports IRQ remapping.
> > As it will be adding DMA remapping support, prepare a directory
> > to contain all the different feature files.
> 
> Any possibility we could put the irq remapping thing under the irq
> directory?
> 
> The other drivers have it here because they are co-mingled with their
> iommu HW, will hyperv have the same issue?
> 

Good question. I don't think Hyper-V have the same co-mingling issue.

But from a code organization perspective, I think drivers/iommu/hyperv/
is still the most natural place:

- The IRQ remapping framework itself (drivers/iommu/irq_remapping.c
  and its internal header irq_remapping.h) lives under drivers/iommu/,
  and all three backends (intel/, amd/, hyperv/) sit there today.
  hyperv/irq_remapping.c includes that internal header directly.

- irq_remapping_prepare() references the extern hyperv_irq_remap_ops
  declared in that header, so the provider naturally belongs in the
  same tree.

Moving it to e.g. drivers/irqchip/ would break that symmetry.

Yu

> Jason
> 

^ permalink raw reply

* [PATCH] drivers: hv: use kmalloc_array in mshv_root_scheduler_init
From: Can Peng @ 2026-05-20  7:16 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, longli, decui
  Cc: linux-kernel, linux-hyperv, Can Peng

Replace kmalloc() with kmalloc_array() to prevent potential
overflow, as recommended in Documentation/process/deprecated.rst.

Signed-off-by: Can Peng <pengcan@kylinos.cn>
---
 drivers/hv/mshv_root_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index bd1359eb58dd..146726cc4e9b 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2241,7 +2241,7 @@ static int mshv_root_scheduler_init(unsigned int cpu)
 	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
 
 	/* Allocate two consecutive pages. One for input, one for output. */
-	p = kmalloc(2 * HV_HYP_PAGE_SIZE, GFP_KERNEL);
+	p = kmalloc_array(2, HV_HYP_PAGE_SIZE, GFP_KERNEL);
 	if (!p)
 		return -ENOMEM;
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH] mshv: add vmbus dependency
From: Arnd Bergmann @ 2026-05-20  7:40 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Jork Loeser,
	Stanislav Kinsburskii
  Cc: Arnd Bergmann, linux-hyperv, linux-kernel

From: Arnd Bergmann <arnd@arndb.de>

When the vmbus driver is not part of the kernel, the mvhv_root
driver now fails to link:

ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!

Avoid this by adding an explicit Kconfig dependency. Note that
stubbing out the hv_vmbus_exists() based on configuration would
also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.

Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/hv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 52af086fdeb2..21193b571a80 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -75,6 +75,7 @@ config MSHV_ROOT
 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
 	# no particular order, making it impossible to reassemble larger pages
 	depends on PAGE_SIZE_4KB
+	depends on HYPERV_VMBUS
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
-- 
2.39.5


^ permalink raw reply related

* Re: [PATCH] drivers: hv: use kmalloc_array in mshv_root_scheduler_init
From: sashiko-bot @ 2026-05-20  7:55 UTC (permalink / raw)
  To: Can Peng; +Cc: linux-hyperv
In-Reply-To: <20260520071632.557990-1-pengcan@kylinos.cn>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] A CPU hotplug offline operation can lead to a NULL pointer dereference in `mshv_vp_dispatch` because user tasks can still execute on the offlining CPU after the per-CPU buffers have been freed.
--

commit 182b63fe189dbbf1da723a03d972eaa7cbb71f56
Author: Can Peng <pengcan@kylinos.cn>

drivers: hv: use kmalloc_array in mshv_root_scheduler_init

Replace kmalloc() with kmalloc_array() to prevent potential overflow, as
recommended in Documentation/process/deprecated.rst.

> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index bd1359eb58dd4..146726cc4e9ba 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2241,7 +2241,7 @@ static int mshv_root_scheduler_init(unsigned int cpu)
>  	outputarg = (void **)this_cpu_ptr(root_scheduler_output);
>  
>  	/* Allocate two consecutive pages. One for input, one for output. */
> -	p = kmalloc(2 * HV_HYP_PAGE_SIZE, GFP_KERNEL);
> +	p = kmalloc_array(2, HV_HYP_PAGE_SIZE, GFP_KERNEL);
        ^^^^^^^^^^^^^

[Severity: High]
This isn't a bug introduced by this patch, but is there a pre-existing
issue where a CPU hotplug offline operation can lead to a NULL pointer
dereference in mshv_vp_dispatch()?

The CPU hotplug teardown callback mshv_root_scheduler_cleanup() is registered
at CPUHP_AP_ONLINE_DYN. During a CPU offline operation, it frees the per-CPU
buffers and sets the pointers to NULL:

drivers/hv/mshv_root_main.c:mshv_root_scheduler_cleanup() {
    ...
    *inputarg = NULL;
    *outputarg = NULL;
    ...
}

User tasks can still execute on the offlining CPU at this high state in
the offline sequence. If a vCPU thread scheduled on the offlining CPU invokes
mshv_vp_dispatch() after the teardown callback has run, it unconditionally
dereferences the input buffer:

drivers/hv/mshv_root_main.c:mshv_vp_dispatch() {
    ...
    preempt_disable();
    input = *this_cpu_ptr(root_scheduler_input);
    ...
    memset(input, 0, sizeof(*input));
    ...
}

Does this sequence result in a NULL pointer dereference if the input buffer
pointer has already been cleared by the cleanup callback?

>  	if (!p)
>  		return -ENOMEM;
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260520071632.557990-1-pengcan@kylinos.cn?part=1

^ permalink raw reply

* Re: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-20 13:23 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat, sashiko-reviews
In-Reply-To: <20260519213450.50E611F000E9@smtp.kernel.org>

sashiko-bot@kernel.org wrote:
> - [Critical] Using `bytes_recvd` for `memcpy()` without checking
>   `vmbus_recvpacket()` return value leads to a massive heap buffer
>   overflow.

This one is bounded on this channel. hyperv_connect_vsp() calls
vmbus_open() without setting max_pkt_size, so the inbound ring uses
VMBUS_DEFAULT_MAX_PKT_SIZE (4096) and hv_pkt_iter_first() clamps the
packet length to pkt_buffer_size. bytes_recvd therefore cannot exceed
4096, well under the 16 KiB recv_buf and init_buf, and
vmbus_recvpacket() does not return -ENOBUFS here, so the memcpy length
stays bounded.

I will still gate the dispatch on a successful vmbus_recvpacket()
return in the next revision, as defense in depth, so the bound is
local instead of relying on the ring clamp.

> - [High] Strict sizeof() validation incorrectly rejects
>   dynamically-sized SYNTHVID_RESOLUTION_RESPONSE packets.

Agreed. The response carries resolution_count entries, not the full
SYNTHVID_MAX_RESOLUTION_COUNT array, so checking against
sizeof(struct synthvid_supported_resolution_resp) is too strict. The
next revision validates the fixed prefix, reads and bounds
resolution_count, then requires only the count-sized array.

> - [High] Concurrent lockless write to `hv->init_buf` from VMBus
>   callback allows a malicious host to overwrite data while the guest
>   is validating it.
> - [High] Missing `reinit_completion()` before reusing the shared
>   `hv->wait` completion object.

Both pre-existing. On v2 Michael Kelley suggested splitting the
completion reinit into a separate patch on the resume path. The
init_buf reuse sits in the same area, so I plan to send the reinit and
the related response-type handling as a separate follow-up rather than
fold them into this size-validation change.

Thanks for the review.

Berkant

^ permalink raw reply

* Re: [PATCH v1 1/4] iommu: Move Hyper-V IOMMU driver to its own subdirectory
From: Jason Gunthorpe @ 2026-05-20 13:38 UTC (permalink / raw)
  To: Yu Zhang
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <elo7enhxk5m7x4rc4quiqnkgbwsqa3ex7di2ksv7szu75xqe6x@zum5nwmkl6kn>

On Wed, May 20, 2026 at 02:37:03PM +0800, Yu Zhang wrote:
> On Fri, May 15, 2026 at 07:19:18PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 12, 2026 at 12:24:05AM +0800, Yu Zhang wrote:
> > > From: Easwar Hariharan <eahariha@linux.microsoft.com>
> > > 
> > > The Hyper-V IOMMU driver currently only supports IRQ remapping.
> > > As it will be adding DMA remapping support, prepare a directory
> > > to contain all the different feature files.
> > 
> > Any possibility we could put the irq remapping thing under the irq
> > directory?
> > 
> > The other drivers have it here because they are co-mingled with their
> > iommu HW, will hyperv have the same issue?
> > 
> 
> Good question. I don't think Hyper-V have the same co-mingling issue.
> 
> But from a code organization perspective, I think drivers/iommu/hyperv/
> is still the most natural place:
> 
> - The IRQ remapping framework itself (drivers/iommu/irq_remapping.c
>   and its internal header irq_remapping.h) lives under drivers/iommu/,
>   and all three backends (intel/, amd/, hyperv/) sit there today.
>   hyperv/irq_remapping.c includes that internal header directly.

IMHO it is a fair question if that even belongs under iommu.

I think it was dumped into here because of the co-mingled drivers..

Jason

^ permalink raw reply

* RE: [PATCH] mshv: add vmbus dependency
From: Michael Kelley @ 2026-05-20 13:46 UTC (permalink / raw)
  To: Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Anirudh Rayabharam (Microsoft), Jork Loeser,
	Stanislav Kinsburskii
  Cc: Arnd Bergmann, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20260520074044.923728-1-arnd@kernel.org>

From: Arnd Bergmann <arnd@kernel.org> Sent: Wednesday, May 20, 2026 12:40 AM
> 
> When the vmbus driver is not part of the kernel, the mvhv_root
> driver now fails to link:
> 
> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
> 
> Avoid this by adding an explicit Kconfig dependency. Note that
> stubbing out the hv_vmbus_exists() based on configuration would
> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.

Conceptually, the MSHV root code should not have a dependency on
VMBus. The "does VMBus exist?" question should handled differently
by setting up a boolean in the core Hyper-V code that defaults to "false".
If the VMBus driver loads, it would set the boolean to "true". MSHV
root code would query the boolean.

Michael

> 
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/hv/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 52af086fdeb2..21193b571a80 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
>  	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
>  	# no particular order, making it impossible to reassemble larger pages
>  	depends on PAGE_SIZE_4KB
> +	depends on HYPERV_VMBUS
>  	select EVENTFD
>  	select VIRT_XFER_TO_GUEST_WORK
>  	select HMM_MIRROR
> --
> 2.39.5
> 


^ permalink raw reply

* Re: [PATCH] mshv: add vmbus dependency
From: Arnd Bergmann @ 2026-05-20 13:51 UTC (permalink / raw)
  To: Michael Kelley, Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Anirudh Rayabharam (Microsoft),
	Jork Loeser, Stanislav Kinsburskii
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157CDF10D33D62DD8709D63D4012@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 20, 2026, at 15:46, Michael Kelley wrote:
> From: Arnd Bergmann <arnd@kernel.org> Sent: Wednesday, May 20, 2026 12:40 AM
>> 
>> When the vmbus driver is not part of the kernel, the mvhv_root
>> driver now fails to link:
>> 
>> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>> 
>> Avoid this by adding an explicit Kconfig dependency. Note that
>> stubbing out the hv_vmbus_exists() based on configuration would
>> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>
> Conceptually, the MSHV root code should not have a dependency on
> VMBus. The "does VMBus exist?" question should handled differently
> by setting up a boolean in the core Hyper-V code that defaults to "false".
> If the VMBus driver loads, it would set the boolean to "true". MSHV
> root code would query the boolean.

That makes sense to me, but it's outside of my area for a drive-by
build fix. Please treat my patch as a 'Reported-by' then, I expect
this can easily be addressed by Jork or someone else in the
hyperv team.

      Arnd

^ permalink raw reply

* RE: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Michael Kelley @ 2026-05-20 14:24 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, sashiko-reviews@lists.linux.dev
In-Reply-To: <177928341826.371979.14701698047864220449@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Wednesday, May 20, 2026 6:24 AM
> 
> sashiko-bot@kernel.org wrote:
> > - [Critical] Using `bytes_recvd` for `memcpy()` without checking
> >   `vmbus_recvpacket()` return value leads to a massive heap buffer
> >   overflow.
> 
> This one is bounded on this channel. hyperv_connect_vsp() calls
> vmbus_open() without setting max_pkt_size, so the inbound ring uses
> VMBUS_DEFAULT_MAX_PKT_SIZE (4096) and hv_pkt_iter_first() clamps the
> packet length to pkt_buffer_size. bytes_recvd therefore cannot exceed
> 4096, well under the 16 KiB recv_buf and init_buf, and
> vmbus_recvpacket() does not return -ENOBUFS here, so the memcpy length
> stays bounded.

Actually, the behavior of vmbus_recvpacket() is more subtle,
and the problem pointed out by Sashiko AI is real. I had forgotten
about this subtle behavior in my first reply to you on this topic. :-(

The incoming message from Hyper-V has a length encoded in it.
hv_ringbuffer_read() gets that message length, and if it is larger
than the buflen_parameter, -ENOBUFS is returned. But the
returned buffer_actual_len is also set to the incoming message
length provided by Hyper-V. This allows the caller to realize
it didn't provide enough buffer space, and also tells it how much
buffer space is needed. hv_pci_onchannelcallback() uses
this functionality to retry the receive operation with a larger
buffer. I think all the other callers just treat -ENOBUFS as a
fatal error, which is also fine.

> 
> I will still gate the dispatch on a successful vmbus_recvpacket()
> return in the next revision, as defense in depth, so the bound is
> local instead of relying on the ring clamp.

Indeed, the error check is needed, and it's not just defense in depth.

> 
> > - [High] Strict sizeof() validation incorrectly rejects
> >   dynamically-sized SYNTHVID_RESOLUTION_RESPONSE packets.
> 
> Agreed. The response carries resolution_count entries, not the full
> SYNTHVID_MAX_RESOLUTION_COUNT array, so checking against
> sizeof(struct synthvid_supported_resolution_resp) is too strict. The
> next revision validates the fixed prefix, reads and bounds
> resolution_count, then requires only the count-sized array.

OK, good.

> 
> > - [High] Concurrent lockless write to `hv->init_buf` from VMBus
> >   callback allows a malicious host to overwrite data while the guest
> >   is validating it.

I don't think this actually happens. In hv_pkt_iter_first(), the message
is copied out of the ring buffer into a temporary buffer that is not
explicitly shared with the host. This temporary copy prevents a
malicious host from modifying the data while the guest is validating it.

> > - [High] Missing `reinit_completion()` before reusing the shared
> >   `hv->wait` completion object.
> 
> Both pre-existing. On v2 Michael Kelley suggested splitting the
> completion reinit into a separate patch on the resume path. The
> init_buf reuse sits in the same area, so I plan to send the reinit and
> the related response-type handling as a separate follow-up rather than
> fold them into this size-validation change.

Handling a timeout, and then the host providing a belated response
is a really messy problem. There's some mechanism using request IDs
in hv_ringbuffer_write() and vmbus_sendpacket_getid() to help match
up requests and responses, but my recollection is that even this
extra machinery is not 100% foolproof. You may or may not want
to go down the path of trying to fix it. :-)

A process comment:  The emails for v2 and v3 of your patch set
are being threaded in an unexpected way. They are showing up
as replies under the original v1. See https://lore.kernel.org/linux-hyperv/.
The preferred approach is for each version to start a new email
thread. The cover letter should start the new thread, and the
patches should show up as threaded under the cover letter.

Also, don't post a new version more frequently than every
24 hours at a minimum, and there's no problem with waiting
2 to 3 days. The idea is to give people a chance to review and
provide comments so that you can batch any changes in
response to the feedback.

Michael

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-20 15:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, mhklinux, jacob.pan,
	tgopinath, easwar.hariharan
In-Reply-To: <20260515223139.GK7702@ziepe.ca>

On Fri, May 15, 2026 at 07:31:39PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:07AM +0800, Yu Zhang wrote:
> > +/*
> > + * Identity and blocking domains are static singletons: identity is a 1:1
> > + * passthrough with no page table, blocking rejects all DMA. Neither holds
> > + * per-IOMMU state, so one instance suffices even with multiple vIOMMUs.
> > + */
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
> > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
> 
> Please follow the format of other drivers and statically initialize
> these here instead of in C code.
> 

Thanks! Will do.

> > +static struct iommu_ops hv_iommu_ops;
> > +static LIST_HEAD(hv_iommu_pci_bus_list);
> > +static DEFINE_SPINLOCK(hv_iommu_pci_bus_lock);
> > +
> > +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT)
> > +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> > +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL)
> > +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS)
> 
> prefer to see static inlines
> 

Yes. Thanks.

> > +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> > +{
> > +	u64 status;
> > +	unsigned long flags;
> > +	struct hv_input_detach_device_domain *input;
> > +	struct pci_dev *pdev;
> > +	struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > +	/* See the attach function, only PCI devices for now */
> > +	if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> > +		return;
> > +
> > +	pdev = to_pci_dev(dev);
> > +
> > +	dev_dbg(dev, "detaching from domain %d\n", hv_domain->device_domain.domain_id.id);
> > +
> > +	local_irq_save(flags);
> > +
> > +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +	memset(input, 0, sizeof(*input));
> > +	input->partition_id = HV_PARTITION_ID_SELF;
> > +	if (hv_iommu_lookup_logical_dev_id(pdev, &input->device_id.as_uint64)) {
> > +		local_irq_restore(flags);
> > +		dev_warn(&pdev->dev, "no IOMMU registration for vPCI bus on detach\n");
> > +		return;
> > +	}
> > +	status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> 
> FWIW the hypervisor cannot implement the linux attach semantics if it
> has detach/attach. It must support simply 'attach' which atomically
> changes the translation. With detach you have a confusing problem if
> errors happen in the middle of the sequence the device is left in an
> unclear state. You should at least document what state the hypervisor
> is supposed to leaave the translation iin during these failures..
> 

Thanks for raising this!

My understanding is that the detach hypercall just clears the HW
IOMMU entry for the device (e.g., context table entry on VT-d, DTE
on AMD IOMMU) and flushes the IOTLB. So after detach and before the
subsequent attach, DMA from the device is blocked by the HW IOMMU
- the device is not in an unprotected state.

The detach shall not fail unless the pvIOMMU driver passes incorrect
parameters or is otherwise in a buggy state, or Hyper-V itself has
a bug. 

But making attach atomic might be a cleaner approach (drop the
explicit detach and let the hypervisor handle the domain switch
internally)? I'll look into making that work, though it needs
verification (and possibly changes) on the hypervisor side (these
hypercalls are not exclusively for Linux guest). If that doesn't
work out, I'll keep the two-step approach with comments documenting
the intermediate translation state.

> > +static void hv_iommu_release_device(struct device *dev)
> > +{
> > +	struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +	if (pdev->ats_enabled)
> > +		pci_disable_ats(pdev);
> > +
> > +	dev_iommu_priv_set(dev, NULL);
> > +	set_dma_ops(dev, NULL);
> 
> Does the driver really need to mess with set_dma_ops ? I'd rather not
> see that in new iommu drivers if at all possible :|
> 

No. I should have removed it. Thanks!

> > +static int __init hv_initialize_static_domains(void)
> > +{
> > +	int ret;
> > +	struct hv_iommu_domain *hv_domain;
> > +
> > +	/* Default stage-1 identity domain */
> > +	hv_domain = &hv_identity_domain;
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> > +	if (ret)
> > +		goto delete_identity_domain;
> > +
> > +	hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> > +	hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> > +	hv_domain->domain.owner = &hv_iommu_ops;
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> > +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> 
> identity doesn't use geometry or pgsize_bitmap
> 

Sure. Will remove it.

> > +	/* Default stage-1 blocked domain */
> > +	hv_domain = &hv_blocking_domain;
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret)
> > +		goto delete_identity_domain;
> > +
> > +	ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
> > +	if (ret)
> > +		goto delete_blocked_domain;
> > +
> > +	hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> > +	hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> > +	hv_domain->domain.owner = &hv_iommu_ops;
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> > +	hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> 
> Nor does blocked
> 

Same. Will remove.

> > +#define INTERRUPT_RANGE_START	(0xfee00000)
> > +#define INTERRUPT_RANGE_END	(0xfeefffff)
> > +static void hv_iommu_get_resv_regions(struct device *dev,
> > +		struct list_head *head)
> > +{
> > +	struct iommu_resv_region *region;
> > +
> > +	region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> > +				      INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
> > +				      0, IOMMU_RESV_MSI, GFP_KERNEL);
> 
> Surprised these constants are not discovered from the hv?
> 

Thanks. Since the pvIOMMU currently only targets x86, hardcoding
the x86 architectural MSI range should be reasonable?

If in the future we need additional reserved regions (e.g.,
RMRR-style identity-mapped ranges), those would be discovered
from the hypervisor.

> > +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> > +{
> > +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> > +}
> > +
> > +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> > +				struct iommu_iotlb_gather *iotlb_gather)
> > +{
> > +	hv_flush_device_domain(to_hv_iommu_domain(domain));
> > +
> > +	iommu_put_pages_list(&iotlb_gather->freelist);
> > +}
> 
> Full invalidation only huh?
> 

Page-selective IOTLB flush is added in patch 4/4.

> > +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> > +	.attach_dev	= hv_iommu_attach_dev,
> > +};
> > +
> > +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> > +	.attach_dev	= hv_iommu_attach_dev,
> > +};
> 
> Usually I would expect these to have their own attach
> functions. blocking in particular must have an attach op that cannot
> fail. It is used to recover the device back to a known translation in
> case of cascading other errors.
> 

For blocking domain, the hypercall handler of such attach essentially
disable the translation and IOPF for the device. It should not fail
unless there is a bug in the driver or the hypervisor.
How about something like this?

	static int hv_blocking_attach_dev(domain, dev, old)
	{
		...
		status = hv_do_hypercall(HVCALL_ATTACH, blocking, dev);
		WARN_ON(!hv_result_success(status));
		vdev->hv_domain = blocking;
		return 0;
	}

For identity domain, would it be fine to keep sharing the same attach
callback with paging domain?

> > +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> > +	.attach_dev	= hv_iommu_attach_dev,
> > +	IOMMU_PT_DOMAIN_OPS(x86_64),
> > +	.flush_iotlb_all = hv_iommu_flush_iotlb_all,
> > +	.iotlb_sync = hv_iommu_iotlb_sync,
> > +	.free = hv_iommu_paging_domain_free,
> > +};
> > +
> > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> > +{
> > +	int ret;
> > +	struct hv_iommu_domain *hv_domain;
> > +	struct pt_iommu_x86_64_cfg cfg = {};
> > +
> > +	hv_domain = kzalloc_obj(*hv_domain, GFP_KERNEL);
> > +	if (!hv_domain)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> > +	if (ret) {
> > +		kfree(hv_domain);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	hv_domain->domain.geometry = hv_iommu_device->geometry;
> 
> geoemtry shouldn't be set here, it is overriden by
> pt_iommu_x86_64_init() with the exact page table configuration
> 

Right. Will remove.
> > +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> > +			struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> > +{
> > +	ida_init(&hv_iommu->domain_ids);
> > +
> > +	hv_iommu->cap = hv_iommu_cap->iommu_cap;
> > +	hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> > +	if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> > +	    hv_iommu->max_iova_width > 48) {
> > +		pr_info("5-level paging not supported, limiting iova width to 48.\n");
> > +		hv_iommu->max_iova_width = 48;
> > +	}
> > +
> > +	hv_iommu->geometry = (struct iommu_domain_geometry) {
> > +		.aperture_start = 0,
> > +		.aperture_end = (((u64)1) << hv_iommu->max_iova_width) - 1,
> > +		.force_aperture = true,
> > +	};
> > +
> > +	hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> > +	hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> > +	/* Only x86 page sizes (4K/2M/1G) are supported */
> > +	hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap &
> > +				  (SZ_4K | SZ_2M | SZ_1G);
> > +	if (hv_iommu->pgsize_bitmap != hv_iommu_cap->pgsize_bitmap)
> > +		pr_warn("unsupported page sizes masked: 0x%llx -> 0x%llx\n",
> > +			hv_iommu_cap->pgsize_bitmap, hv_iommu->pgsize_bitmap);
> 
> IKD if you need this logic, the way the page table code is used it
> just sort of falls out naturally that other page sizes are ignored.
> 

Right, the page table code can handle 4K/2M/1G naturally. We'd like
hypervisor to be able to choose to support only a subset of {4K, 2M,
1G}. This masking filters out any unexpected page sizes and falls
back to 4K if nothing valid remains.

The if statement may be a bit paranoid perhaps, but it felt safer
to validate explicitly. Happy to drop it if you think it's unnecessary.


> > +struct hv_iommu_domain {
> > +	union {
> > +		struct iommu_domain    domain;
> > +		struct pt_iommu        pt_iommu;
> > +		struct pt_iommu_x86_64 pt_iommu_x86_64;
> > +	};
> 
> You should retain the build assertions here
> 

Sure. Thanks!

B.R.
Yu

> Jason
> 

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-20 15:50 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com, Mukesh R
In-Reply-To: <SN6PR02MB4157A1B2D9B56062A0917BC5D4042@SN6PR02MB4157.namprd02.prod.outlook.com>

On Fri, May 15, 2026 at 05:36:40PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 9:54 AM
> > 
> > On Fri, May 15, 2026 at 02:51:38PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 7:00 AM
> > > >
> > > > On Thu, May 14, 2026 at 06:13:24PM +0000, Michael Kelley wrote:
> > > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> 
> [....]
> 
> > > > >
> > > > > Previous versions of this function did hv_iommu_detach_dev(). With that call
> > > > > removed from here, hv_iommu_detach_dev() is only called when attaching a
> > > > > domain to a device that already has a domain attached. Is it the case that
> > > > > Hyper-V doesn't require the detach as a cleanup step?
> > > > >
> > > >
> > > > The IOMMU core attaches the device to release_domain (our blocking domain)
> > > > before calling release_device(), so I believe the explicit detach in the RFC
> > > > was redundant. I simply didn't realize that at the time.
> > > >
> > >
> > > Got it. But after the IOMMU core attaches the device to the blocking
> > > domain, there's the possibility that the vPCI device is rescinded by
> > > Hyper-V and it goes away entirely. Or the device might be subjected
> > > to an "unbind/bind" cycle in Linux. Does the detach need to be done
> > > on the blocking domain in such cases? In this version of the patches, the
> > > Hyper-V "attach" and "detach" hypercalls still end up unbalanced. That
> > > seems a bit untidy at best, and I wonder if there are scenarios where
> > > Hyper-V will complain about the lack of balance.
> > >
> > 
> > Thank you, Michael. May I ask what "the vPCI device is rescinded by
> > Hyper-V and it goes away entirely" mean?
> > 
> 
> See the documentation at Documentation/virt/hyperv/vpci.rst in a
> kernel source code tree, and particularly the section entitled "PCI Device
> Removal". Such removals can and do happen in running Azure guest
> VMs. Start with that info and then I'll do my best to answer follow-up
> questions you may have.
> 
> The unbind/bind case is separate, but has some of the same effects in
> that Linux should be removing all setup of the PCI device. There's actually
> two unbind steps -- one to unbind the device-specific driver (e.g., the
> Mellanox MLX5 driver or the NMVe driver) driver from the device, and
> potentially a second to unbind the VMBus vPCI driver from the device.
> These unbind/bind sequences can be done in the Linux guest without
> the Hyper-V host rescinding the device.

Thanks for pointing me to the vpci.rst doc, Michael!

IIUC, for the vPCI rescind case and the VMBus vPCI driver unbind
case, both will trigger iommu_deinit_device(), in which IOMMU core
attaches the device to our blocking domain. And hypervisor will
handle such attaching to the blocking domain by essentially disable
the DMA translation and IOPF for the device. Since the device is
already in a safe state after that, I don't think an additional
detach is necessary.

For the device-specific driver unbind (e.g., unbind nvme then
bind back, or bind to vfio-pci), that is transparent to the IOMMU
layer.

Also, while looking at Jason's comments on the detach/attach
semantics, I'm now thinking that making the attach hypercall
atomic (having the hypervisor handle the domain switch
internally) might be a cleaner approach. That also eliminates
the attach/detach imbalance issue entirely. I need to do some
verification on the hypervisor side first, but will keep you
posted. Thanks!

B.R.
Yu

> 
> Michael
> 

^ permalink raw reply

* [PATCH] hyperv: move vmbus_root_device outside of the VMBus driver
From: Hamza Mahfooz @ 2026-05-20 15:49 UTC (permalink / raw)
  To: linux-hyperv
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stanislav Kinsburskii, Anirudh Rayabharam (Microsoft),
	Jork Loeser, Michael Kelley, Hamza Mahfooz, Arnd Bergmann

Conceptually, we shouldn't have to build the VMBus driver to answer the
question "does VMBus exist?" So, move vmbus_root_device and the
functions that access it directly to hyperv.h. Also, introduce
hv_set_vmbus_root_device() to allow vmbus_drv to set the root device.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/r/20260520074044.923728-1-arnd@kernel.org/
Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
---
 drivers/hv/vmbus_drv.c | 27 ++++++---------------------
 include/linux/hyperv.h | 18 ++++++++++++++++--
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 71f1b4d52f7f..b26e6b42fa49 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -47,9 +47,6 @@ struct vmbus_dynid {
 	struct hv_vmbus_device_id id;
 };
 
-/* VMBus Root Device */
-static struct device  *vmbus_root_device;
-
 static int hyperv_cpuhp_online;
 
 static DEFINE_PER_CPU(long, vmbus_evt);
@@ -95,18 +92,6 @@ static struct resource *fb_mmio;
 static struct resource *hyperv_mmio;
 static DEFINE_MUTEX(hyperv_mmio_lock);
 
-struct device *hv_get_vmbus_root_device(void)
-{
-	return vmbus_root_device;
-}
-EXPORT_SYMBOL_GPL(hv_get_vmbus_root_device);
-
-bool hv_vmbus_exists(void)
-{
-	return vmbus_root_device != NULL;
-}
-EXPORT_SYMBOL_GPL(hv_vmbus_exists);
-
 static u8 channel_monitor_group(const struct vmbus_channel *channel)
 {
 	return (u8)channel->offermsg.monitorid / 32;
@@ -903,7 +888,7 @@ static int vmbus_dma_configure(struct device *child_device)
 	 * On x86/x64 coherence is assumed and these calls have no effect.
 	 */
 	hv_setup_dma_ops(child_device,
-		device_get_dma_attr(vmbus_root_device) == DEV_DMA_COHERENT);
+		device_get_dma_attr(hv_get_vmbus_root_device()) == DEV_DMA_COHERENT);
 	return 0;
 }
 
@@ -2172,7 +2157,7 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 		     &child_device_obj->channel->offermsg.offer.if_instance);
 
 	child_device_obj->device.bus = &hv_bus;
-	child_device_obj->device.parent = vmbus_root_device;
+	child_device_obj->device.parent = hv_get_vmbus_root_device();
 	child_device_obj->device.release = vmbus_device_release;
 
 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
@@ -2561,7 +2546,7 @@ static int vmbus_acpi_add(struct platform_device *pdev)
 	struct acpi_device *ancestor;
 	struct acpi_device *device = ACPI_COMPANION(&pdev->dev);
 
-	vmbus_root_device = &device->dev;
+	hv_set_vmbus_root_device(&device->dev);
 
 	/*
 	 * Older versions of Hyper-V for ARM64 fail to include the _CCA
@@ -2648,7 +2633,7 @@ static int vmbus_device_add(struct platform_device *pdev)
 	struct device_node *np = pdev->dev.of_node;
 	int ret;
 
-	vmbus_root_device = &pdev->dev;
+	hv_set_vmbus_root_device(&pdev->dev);
 
 	ret = of_range_parser_init(&parser, np);
 	if (ret)
@@ -2969,7 +2954,7 @@ static int __init hv_acpi_init(void)
 	if (ret)
 		return ret;
 
-	if (!vmbus_root_device) {
+	if (!hv_vmbus_exists()) {
 		ret = -ENODEV;
 		goto cleanup;
 	}
@@ -3000,7 +2985,7 @@ static int __init hv_acpi_init(void)
 
 cleanup:
 	platform_driver_unregister(&vmbus_platform_driver);
-	vmbus_root_device = NULL;
+	hv_set_vmbus_root_device(NULL);
 	return ret;
 }
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 5459e776ec17..3e8ea7d9653c 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1302,9 +1302,23 @@ static inline void *hv_get_drvdata(struct hv_device *dev)
 	return dev_get_drvdata(&dev->device);
 }
 
-struct device *hv_get_vmbus_root_device(void);
+/* Do *not* use this outside of hyperv.h */
+static struct device *__vmbus_root_device __read_mostly;
 
-bool hv_vmbus_exists(void);
+static inline void hv_set_vmbus_root_device(struct device *dev)
+{
+	__vmbus_root_device = dev;
+}
+
+static inline struct device *hv_get_vmbus_root_device(void)
+{
+	return __vmbus_root_device;
+}
+
+static inline bool hv_vmbus_exists(void)
+{
+	return hv_get_vmbus_root_device();
+}
 
 struct hv_ring_buffer_debug_info {
 	u32 current_interrupt_mask;
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-20 16:27 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	iommu@lists.linux.dev, linux-pci@vger.kernel.org,
	linux-arch@vger.kernel.org, wei.liu@kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, decui@microsoft.com, longli@microsoft.com,
	joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, kwilczynski@kernel.org,
	lpieralisi@kernel.org, mani@kernel.org, robh@kernel.org,
	arnd@arndb.de, jgg@ziepe.ca, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <BN7PR02MB4148D820AEBBBB9A1268D640D4042@BN7PR02MB4148.namprd02.prod.outlook.com>

On Fri, May 15, 2026 at 11:33:21PM +0000, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com> Sent: Friday, May 15, 2026 11:00 AM
> > 
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Friday, May 15, 2026 9:24 AM
> > >
> > > On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> > > > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Monday, May 11, 2026 9:24 AM
> > > > >
> > 
> > [....]
> > 
> > > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +	u16 count = 0;
> > > > > +
> > > > > +	while (nr_pages > 0) {
> > > > > +		unsigned long flush_pages;
> > > > > +		int order;
> > > > > +		unsigned long pfn_align;
> > > > > +		unsigned long size_align;
> > > > > +
> > > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (start_pfn)
> > > > > +			pfn_align = __ffs(start_pfn);
> > > >
> > > > I don't understand why __ffs() is correct here. I would expect
> > > > __fls() so it is consistent with the calculation of size_align. But I
> > > > can only surmise how the hypercall works since there's no
> > > > documentation, so maybe my understanding of the hypercall is
> > > > wrong.   If __ffs really is correct, a comment explaining why
> > > > would help. :-)
> > > >
> > >
> > > The use of __ffs() is intentional. Each flush entry invalidates a
> > > naturally aligned 2^N page block, and the hypervisor requires the
> > > page_number to be aligned to 2^page_mask_shift.
> > >
> > > Here __ffs() and __fls() serve different purposes:
> > > - __ffs(start_pfn) is about the alignment constraint, e.g.,  how
> > > large a block can this address support?
> > > - __fls(nr_pages) is about  the size constraint, e.g., how large
> > > a block can the remaining range hold?
> > >
> > > Taking min() of both ensures each entry is both properly aligned
> > > and within bounds.
> > >
> > > Thanks for raising this - it definitely deserves a comment. I had to
> > > stare at it for a while myself to remember why. :)
> > 
> > Hmmm. Something about this still nags at me. I'll run some
> > experiments to either convince myself that you are right, or to
> > come up with a counterexample.
> 
> I have resolved what was nagging at me. My understanding of how
> _ffs() and __fls() work was wrong. :-(  Your code is correct.
> 
> > 
> > A related thought occurred to me. If each flush entry that is passed
> > to Hyper-V describes a naturally aligned 2^N page block, I don't
> > think the HV_IOMMU_MAX_FLUSH_VA_COUNT can ever
> > be reached. The number of entries is limited by the number of
> > bits in a PFN and the pages count, both of which are 64. And with
> > 52 bit physical addressing and 4KiB pages, the actual size of
> > a PFN and pages count is even smaller than 64.
> > HV_IOMMU_MAX_FLUSH_VA_COUNT is the number of 8 byte
> > union hv_iommu_flush_va entries that fit in a 4KiB page, so
> > it's ~500.
> > 
> > My statement applies to a single flush range. If multiple flush
> > ranges were strung together in a single hypercall, a larger count
> > could be reached, but hv_flush_device_domain_list() only does
> > a single range. So I don't think the overflow case in
> > hv_flush_device_domain_list() can ever happen. But let me
> > do my experiments, and I will also look at this aspect to confirm
> > if it's right.
> 
> My experiments also convince me that the overflow case can't
> happen as long as the hypercall is being populated with entries
> derived from a single range.

Gosh. You're right. I think I've been overthinking this :( 

So we can simplify this to just do the list flush and fall back
to HVCALL_FLUSH_DEVICE_DOMAIN only if the list flush fails.

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Sean Christopherson @ 2026-05-20 16:40 UTC (permalink / raw)
  To: Michael Kelley
  Cc: sashiko-reviews@lists.linux.dev, linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D50A944A2794475E32FED4002@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, May 19, 2026, Michael Kelley wrote:
> From: Sean Christopherson <seanjc@google.com> Sent: Monday, May 18, 2026 3:18 PM
> > > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > > > --- a/arch/x86/kernel/cpu/mshyperv.c
> > > > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > > > @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
> > > >
> > > >  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> > > >  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> > > > -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > > +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> > > > +
> > > > +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> > > > +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > > > +
> > > > +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> > > > +						  tsc_properties);
> > > >  	}
> > >
> > > [ ... ]
> > >
> > > > @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
> > > >  		 * is called.
> > > >  		 */
> > > >  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> > > > -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > > >  	}
> > >
> > > If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
> > > does it bypass the tsc_register_calibration_routines() block entirely?
> > 
> > Yes.
> > 
> > > Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
> > > here, it looks like these VMs will lose the reliable flag.
> > >
> > > Will this inadvertently enable the TSC watchdog, potentially causing a
> > > performance regression if the system falsely marks the TSC as unstable due
> > > to virtualization scheduling delays?
> > 
> > Hmm, I was going to say that the change was intentional and desriable, but looking
> > at this yet again, I don't think that's true.  Enabling HV_EXPOSE_INVARIANT_TSC
> > just means the kernel will (probably) set X86_FEATURE_CONSTANT_TSC and
> > X86_FEATURE_NONSTOP_TSC during early_init_intel(), AFAICT it doesn't lead to
> > X86_FEATURE_TSC_RELIABLE being set.  And I think in this case, marking the TSC
> > as reliable makes sense; even if the kernel doesn't user Hyper-V's calibration
> > info, the host is still clearly telling the guest that the TSC is reliable.
> 
> Yes, I agree. But I'm doubtful that such a combination ever occurs in practice.
> I've never seen an occurrence of a Hyper-V (even really old versions) guest
> where the frequency MSRs are not available or are not accessible. The
> Hyper-V spec allows for either condition, so we have the code to test the
> flags. I've thought of the flags as always set, though I suppose one never
> knows what the future holds.
> 
> > 
> > Michael, does keeping the
> > 
> > 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > 
> > but also passing TSC_FREQ_KNOWN_AND_RELIABLE to the calibration routine make
> > sense?
> 
> I don't see that it would break anything. But it seems a bit disjointed in
> that HV_ACCESS_TSC_INVARIANT is tested in two places in
> ms_hyperv_init_platform(). Does TSC_RELIABLE *need* to be passed to
> tsc_register_calibration_routines() if later code in ms_hyperv_init_platform()
> does setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)?

Sort of?  It's not strictly necessary, but passing in TSC_RELIABLE allows
tsc_register_calibration_routines() to ensure it doesn't clobber a more robust
calibration routine with a "lesser" routine.  I.e. not passing TSC_RELIABLE in
this case would trigger a false positive (and break Hyper-V).

In other words, invoking setup_force_cpu_cap() is a (happy, desirable) side effect,
not the primary goal.
 
> In other words, I'm suggesting let tsc_register_calibration_routines() handle
> the TSC_FREQ_KNOWN case since that's what the calibration routines are all
> about. Leave the setting of X86_FEATURE_TSC_RELIABLE to the later code that
> tests HV_ACCESS_TSC_INVARIANT, instead of duplicating the
> setup_force_cpu_cap() operation.
>
> While combining FREQUENCY_KNOWN and RELIABLE into
> tsc_register_calibration_routines() is convenient, the two
> concepts turn out to be independent when looking strictly at
> the Hyper-V spec and code written to follow that spec.
> Combining them into the same function ends up being clumsy

Yeah, it's a bit awkward for Hyper-V, but Hyper-V is definitely the odd one out
here, in that it has an "out-of-band" feature that marks the TSC as reliable.
All other PV features that trigger overrides of the calibration routines bundle
the RELIABLE aspect with the feature itself.

^ permalink raw reply

* Re: [PATCH] mshv: add vmbus dependency
From: Jork Loeser @ 2026-05-20 17:13 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Michael Kelley, Arnd Bergmann, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Anirudh Rayabharam (Microsoft),
	Stanislav Kinsburskii, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <56938994-3795-415a-8be3-16e28236082b@app.fastmail.com>

On Wed, 20 May 2026, Arnd Bergmann wrote:

> On Wed, May 20, 2026, at 15:46, Michael Kelley wrote:
>> From: Arnd Bergmann <arnd@kernel.org> Sent: Wednesday, May 20, 2026 12:40 AM
>>>
>>> When the vmbus driver is not part of the kernel, the mvhv_root
>>> driver now fails to link:
>>>
>>> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>>>
>>> Avoid this by adding an explicit Kconfig dependency. Note that
>>> stubbing out the hv_vmbus_exists() based on configuration would
>>> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>>
>> Conceptually, the MSHV root code should not have a dependency on
>> VMBus. The "does VMBus exist?" question should handled differently
>> by setting up a boolean in the core Hyper-V code that defaults to "false".
>> If the VMBus driver loads, it would set the boolean to "true". MSHV
>> root code would query the boolean.
>
> That makes sense to me, but it's outside of my area for a drive-by
> build fix. Please treat my patch as a 'Reported-by' then, I expect
> this can easily be addressed by Jork or someone else in the
> hyperv team.

Apologies for letting this linger - we were looking for the best fix 
internally. While Michael is correct that conceptually the dependency is 
not ideal, I think it's the right fix *for now*.

Yes, this would not allow MSHV root without VMBUS, which is relevant for 
MSHV root only, which misses a few other components. For L1VH it is the 
right fix that also enforces module load order.

Ultimately (and this is planned), we need a SYNIC driver, and the 
hv_vmbus_exists() will vanish for this usecase. It is a hack even now.

Best,
Jork


^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-20 17:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Michael Kelley
  Cc: linux-kernel, linux-hyperv, iommu, linux-pci, linux-arch, wei.liu,
	kys, haiyangz, decui, longli, joro, will, robin.murphy, bhelgaas,
	kwilczynski, lpieralisi, mani, robh, arnd, jacob.pan, tgopinath,
	easwar.hariharan
In-Reply-To: <20260515223545.GL7702@ziepe.ca>

On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > +					  unsigned long start,
> > +					  unsigned long end)
> > +{
> > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > +	unsigned long nr_pages = end_pfn - start_pfn;
> > +	u16 count = 0;
> > +
> > +	while (nr_pages > 0) {
> > +		unsigned long flush_pages;
> > +		int order;
> > +		unsigned long pfn_align;
> > +		unsigned long size_align;
> > +
> > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > +			break;
> > +		}
> > +
> > +		if (start_pfn)
> > +			pfn_align = __ffs(start_pfn);
> > +		else
> > +			pfn_align = BITS_PER_LONG - 1;
> > +
> > +		size_align = __fls(nr_pages);
> > +		order = min(pfn_align, size_align);
> > +		iova_list[count].page_mask_shift = order;
> > +		iova_list[count].page_number = start_pfn;
> > +
> > +		flush_pages = 1UL << order;
> > +		start_pfn += flush_pages;
> > +		nr_pages -= flush_pages;
> > +		count++;
> > +	}
> 
> This seems like a really silly hypervisor interface. Why doesn't it
> just accept a normal range? Splitting it into power of two aligned
> ranges is very inefficient.

Fair point. I'm not sure how much flexibility we have to change
this hypercall interface at the moment - it predates the pvIOMMU
work and may have other consumers beyond Linux guest. On the other
hand, having the guest specify 2^N-aligned blocks does save the
hypervisor from having to decompose ranges itself before issuing
hardware invalidation commands - the guest-provided entries can be
fed to the HW more or less directly.

That said, the way I'm currently using this interface may be
more precise than necessary. Maybe we have 2 options:

1) Current approach: decompose the range into multiple exact
   2^N-aligned blocks with no over-flush, but at the cost of
   more complex calculations and more entries.

2) Follow what Intel/AMD drivers do: find a single minimal
   2^N-aligned block that covers the entire range, but may
   over-flush.

Any preference?

@Michael, since you've also been reviewing this patch, I'd
appreciate your thoughts on the above as well. :)

Yu

^ permalink raw reply

* Re: [PATCH] mshv: add vmbus dependency
From: Jork Loeser @ 2026-05-20 17:15 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Stanislav Kinsburskii,
	Arnd Bergmann, linux-hyperv, linux-kernel
In-Reply-To: <20260520074044.923728-1-arnd@kernel.org>

On Wed, 20 May 2026, Arnd Bergmann wrote:

> From: Arnd Bergmann <arnd@arndb.de>
>
> When the vmbus driver is not part of the kernel, the mvhv_root
> driver now fails to link:
>
> ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
>
> Avoid this by adding an explicit Kconfig dependency. Note that
> stubbing out the hv_vmbus_exists() based on configuration would
> also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
>
> Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/hv/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 52af086fdeb2..21193b571a80 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -75,6 +75,7 @@ config MSHV_ROOT
> 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> 	# no particular order, making it impossible to reassemble larger pages
> 	depends on PAGE_SIZE_4KB
> +	depends on HYPERV_VMBUS
> 	select EVENTFD
> 	select VIRT_XFER_TO_GUEST_WORK
> 	select HMM_MIRROR
> -- 
> 2.39.5
>

Yes, this is the right short-term fix. We will need to solve the root case 
(no VMBUS required) with a separate SYNIC driver abstraction.

Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>

Thank you!
Jork


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox