Linux-HyperV List
 help / color / mirror / Atom feed
* RE: [PATCH v4 01/18] mshv: Fix IRQ leak and type hazards in hv_call_modify_spa_host_access
From: Michael Kelley @ 2026-05-19 20:07 UTC (permalink / raw)
  To: Stanislav Kinsburskii, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <177816858406.21765.9718563917415905259.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Thursday, May 7, 2026 8:43 AM
> 
> The bounds check inside the PFN-filling loop can return -EINVAL while
> interrupts are disabled via local_irq_save(), leaking IRQ state.
> 
> Remove the check — it is redundant because the loop invariant
> (done + i < page_count == page_struct_count >> large_shift) guarantees
> (done + i) << large_shift < page_struct_count always holds.
> 
> While here, fix type mismatches: change 'int done' to 'u64 done' and
> use u64 for loop and batch-size variables so they match the u64
> page_count they are compared against.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |   18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index 129456bd72aba..cc580225e9e45 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -1042,7 +1042,7 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  {
>  	struct hv_input_modify_sparse_spa_page_host_access *input_page;
>  	u64 status;
> -	int done = 0;
> +	u64 done = 0;
>  	unsigned long irq_flags, large_shift = 0;
>  	u64 page_count = page_struct_count;
>  	u16 code = acquire ? HVCALL_ACQUIRE_SPARSE_SPA_PAGE_HOST_ACCESS :
> @@ -1059,9 +1059,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  	}
> 
>  	while (done < page_count) {
> -		ulong i, completed, remain = page_count - done;
> -		int rep_count = min(remain,
> -				        HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);
> +		u64 i, completed, remain = page_count - done;
> +		u64 rep_count = min_t(u64, remain,
> + 					HV_MODIFY_SPARSE_SPA_PAGE_HOST_ACCESS_MAX_PAGE_COUNT);

Why does this need to be "min_t()" instead of just "min()"?  Needing min_t()
was the case in times past, but "min()" does a much better job now of handling
mixed integer types. There are a number of patches on LKML that are
replacing min_t() with min(). This change seems to be going the opposite
direction.

> 
>  		local_irq_save(irq_flags);
>  		input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -1075,15 +1075,9 @@ int hv_call_modify_spa_host_access(u64 partition_id, struct page **pages,
>  		input_page->flags = flags;
>  		input_page->host_access = host_access;
> 
> -		for (i = 0; i < rep_count; i++) {
> -			u64 index = (done + i) << large_shift;
> -
> -			if (index >= page_struct_count)
> -				return -EINVAL;
> -
> +		for (i = 0; i < rep_count; i++)
>  			input_page->spa_page_list[i] =
> -						page_to_pfn(pages[index]);
> -		}
> +				page_to_pfn(pages[(done + i) << large_shift]);
> 
>  		status = hv_do_rep_hypercall(code, rep_count, 0, input_page,
>  					     NULL);
> 
The "completed" local variable could be collapsed out by doing:

		done += hv_repcomp(status)

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: sashiko-bot @ 2026-05-19 19:47 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: bpf, linux-hyperv
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] The patch correctly prevents `mana_detach()` from running on an already-detached device in the reset handler, but identical bugs remain in `mana_change_mtu()`, `mana_set_channels()`, `mana_set_ringparam()`, and `mana_bpf()`.
--

commit bac73a4812067e75feeec962b27ab9647e30ed9c
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: Skip redundant detach in queue reset handler if already detached

This commit adds a check for netif_device_present() in the queue reset
handler. If the device is already detached, it skips the redundant
mana_detach() call and proceeds directly to mana_attach().

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9e7e4bf526bf6..740f3c34db451 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -316,12 +316,19 @@ static void mana_per_port_queue_reset_work_handler(struct work_struct *work)
[ ... ]
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);

The fix here correctly prevents mana_detach() from running on an already
detached device. However, do identical issues remain in mana_change_mtu(),
mana_set_channels(), mana_set_ringparam(), and mana_bpf()?

If mana_attach() previously failed (for example, due to out of memory),
the port is left with apc->port_is_up = false, but it preserves
apc->port_st_save = true, and calls netif_device_detach(ndev).

If an admin tries to recover the port by retrying a configuration command
like changing the MTU or attaching XDP, these functions appear to
unconditionally call mana_detach(ndev, false) without the
!netif_device_present(ndev) check added here.

Inside mana_detach(), apc->port_st_save = apc->port_is_up executes. Since
port_is_up is already false, port_st_save is overwritten to false,
permanently erasing the intent that the port should be up.

The subsequent call to mana_attach() would then see port_st_save == false,
skip mana_alloc_queues(), set port_is_up = false, and successfully call
netif_device_attach(ndev).

Would this result in the configuration command returning 0 to userspace
while the port queues were never allocated, leaving the interface silently
broken until manually toggled down and up?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260518194654.735580-1-dipayanroy@linux.microsoft.com?part=2

^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: sashiko-bot @ 2026-05-19 19:47 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: linux-hyperv, bpf
In-Reply-To: <20260518194654.735580-2-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Missing re-allocation of `apc->rxqs` during interface bring-up causes deterministic NULL pointer dereference
- [High] Uninitialized queue handles (0 and INVALID_MANA_HANDLE) are sent to the firmware during error teardown, corrupting hardware state
- [High] GDMA admin commands are issued immediately after a PCI Function Level Reset (FLR), causing severe cascading timeouts and hardware state violations
--

commit 2129283d37aabfe724d7509e34346eab93d302a2
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: Add NULL guards in teardown path to prevent panic on attach failure
    
When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. The patch adds NULL guards for these
pointers in teardown paths such as mana_destroy_vport(), mana_fence_rqs(),
and mana_dealloc_queues() to prevent NULL pointer dereferences.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a8..9e7e4bf526bf6 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[ ... ]
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
>  	struct mana_rxq *rxq;
>  	u32 rxq_idx;
>  
> -	for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> -		rxq = apc->rxqs[rxq_idx];
> -		if (!rxq)
> -			continue;
> +	if (apc->rxqs) {
>  
> -		mana_destroy_rxq(apc, rxq, true);
> -		apc->rxqs[rxq_idx] = NULL;
> +		for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> +			rxq = apc->rxqs[rxq_idx];
> +			if (!rxq)
> +				continue;
> +
> +			mana_destroy_rxq(apc, rxq, true);
> +			apc->rxqs[rxq_idx] = NULL;
> +		}
>  	}

If the teardown path frees apc->rxqs and sets it to NULL on attach failure,
how does the interface recover if the user subsequently brings it back up?

When mana_open() is invoked, it directly calls mana_alloc_queues(). Inside
mana_alloc_queues() -> mana_add_rx_queues(), the code does:

    apc->rxqs[i] = rxq;

If apc->rxqs was left as NULL during the teardown, will this cause a NULL
pointer dereference on the next interface bring-up?

[ ... ]
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
>  	 * number of queues.
>  	 */
>  
> -	for (i = 0; i < apc->num_queues; i++) {
> -		txq = &apc->tx_qp[i].txq;
> -		tsleep = 1000;
> -		while (atomic_read(&txq->pending_sends) > 0 &&
> -		       time_before(jiffies, timeout)) {
> -			usleep_range(tsleep, tsleep + 1000);
> -			tsleep <<= 1;
> -		}
> -		if (atomic_read(&txq->pending_sends)) {
> -			err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> -			if (err) {
> -				netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> -					   err, atomic_read(&txq->pending_sends),
> -					   txq->gdma_txq_id);
> +	if (apc->tx_qp) {

In the error path where queue creation fails partway through, will
mana_destroy_txq() be called?

If mana_destroy_txq() unconditionally iterates over all apc->num_queues:

    for (i = 0; i < apc->num_queues; i++) {
        ...
        mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
        ...

Will it pass 0 or INVALID_MANA_HANDLE to mana_destroy_wq_obj() for queues
that were never fully initialized?

Does mana_destroy_wq_obj() send a MANA_DESTROY_WQ_OBJ command to the hardware
with these uninitialized handles, and could handle 0 corrupt the firmware
state?

> +		for (i = 0; i < apc->num_queues; i++) {
> +			txq = &apc->tx_qp[i].txq;
> +			tsleep = 1000;
> +			while (atomic_read(&txq->pending_sends) > 0 &&
> +			       time_before(jiffies, timeout)) {
> +				usleep_range(tsleep, tsleep + 1000);
> +				tsleep <<= 1;
> +			}
> +			if (atomic_read(&txq->pending_sends)) {
> +				err =
> +				    pcie_flr(to_pci_dev(gd->gdma_context->dev));

If pcie_flr() is triggered here, the PCI Function Level Reset wipes the
internal GDMA state.

Further down in this function, mana_disable_vport_rx(), mana_fence_rqs(),
and mana_destroy_vport() are called. Will these functions attempt to send
commands to the now-destroyed GDMA admin queue via mana_send_request() and
wait for completions?

Could this lead to cascading timeouts since the hardware was just reset and
cannot process admin commands?

> +				if (err) {
> +					netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> +						   err,
> +					    atomic_read(&txq->pending_sends),
> +					    txq->gdma_txq_id);
> +				}
> +				break;
>  			}
> -			break;
>  		}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260518194654.735580-1-dipayanroy@linux.microsoft.com?part=1

^ permalink raw reply

* RE: [PATCH v2 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Michael Kelley @ 2026-05-19 18:33 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, stable@vger.kernel.org
In-Reply-To: <20260517-drm-hyperv-patch2-v2@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Sunday, May 17, 2026 7:25 AM
> 
> hyperv_receive() reads bytes_recvd from vmbus_recvpacket() but does not
> forward that value to hyperv_receive_sub(). The sub-handler reads
> msg->vid_hdr.type and msg->feature_chg.is_dirt_needed without knowing
> how many bytes the host actually wrote, so a short packet leaves the
> parser reading bytes that the host did not write in this packet.

This is a valid concern. Similar patterns have hopefully been fixed in other
drivers that were hardened for CoCo VMs.

> The unconditional memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE) on the
> wait-completion path also copies the full 16 KiB recv_buf regardless
> of bytes_recvd, including any residue from the prior message.

Does copying the full 16 KiB break anything? Or are you flagging as just
wasteful activity? This gets to what I previously mentioned about commit
messages -- explain "why".  Being wasteful is a valid reason, but it can be
helpful to future readers to be explicit about the "why".

> 
> Pass bytes_recvd into hyperv_receive_sub() and reject any packet shorter
> than the pipe + synthvid header. The dirt-feature branch additionally
> requires the feature_change payload size before reading is_dirt_needed.
> The init_buf copy now uses bytes_recvd as the length argument, which
> keeps it bounded by VMBUS_MAX_PACKET_SIZE through the new upper check.
> 
> Unchanged from v1.

Version related comments should go below the "---" following the Signed-off
line. They will be visible in the patch emails, but won't get put into the
git log when the patch is finally accepted. The Linux kernel doesn't want
to clutter the git log with intermediate version information that accumulates
as the patch is reviewed and revised.

> 
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Cc: stable@vger.kernel.org # 5.14+
> Signed-off-by: Berkant Koc <me@berkoc.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 3b5065fe06e4..cdab4895dd40 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -423,26 +423,35 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	return 0;
>  }
> 
> -static void hyperv_receive_sub(struct hv_device *hdev)
> +static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
>  	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
>  	struct synthvid_msg *msg;
> +	size_t hdr_size;
> 
>  	if (!hv)
>  		return;
> 
> +	hdr_size = sizeof(struct pipe_msg_hdr) +
> +		   sizeof(struct synthvid_msg_hdr);
> +	if (bytes_recvd < hdr_size || bytes_recvd > VMBUS_MAX_PACKET_SIZE)

The check against VMBUS_MAX_PACKET_SIZE shouldn't be needed.
vmbus_recvpacket() takes VMBUS_MAX_PACKET_SIZE as the max
receive size and won't return a bytes_recvd value that is larger.

> +		return;

In similar cases in other drivers that have been hardened for CoCo VMs, the
code outputs a rate limited error message. That should be done here so the
bad received message isn't just ignored. See hv_kvp_onchannelcallback() for
example.

> +
>  	msg = (struct synthvid_msg *)hv->recv_buf;
> 
>  	/* Complete the wait event */
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		memcpy(hv->init_buf, msg, bytes_recvd);

Additional logic is needed here. Each of the three message types
in the "if" statement has data beyond just the header. Before doing
the memcpy() and complete(), the code should validate that the msg
is big enough to contain that expected data.  I.e., it needs to do the
same as you have done for SYNTHVID_FEATURE_CHANGE.
Otherwise, the code that wakes up from the completion
will think that it has valid data to work with, and it may not.

Of course, the problem already exists in the upstream code. But if you
are going to make changes to fix the problem, the fix should fully
handle situation and not just be a partial fix.

>  		complete(&hv->wait);
>  		return;
>  	}
> 
>  	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
> +		if (bytes_recvd < hdr_size +
> +		    sizeof(struct synthvid_feature_change))
> +			return;

Same here.  Output a rate limited error message.

>  		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
>  		if (hv->dirt_needed)
>  			hyperv_hide_hw_ptr(hv->hdev);
> @@ -469,7 +478,7 @@ static void hyperv_receive(void *ctx)
>  				       &bytes_recvd, &req_id);
>  		if (bytes_recvd > 0 &&
>  		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
> -			hyperv_receive_sub(hdev);
> +			hyperv_receive_sub(hdev, bytes_recvd);
>  	} while (bytes_recvd > 0 && ret == 0);
>  }
> 
> --
> 2.47.3
> 

^ permalink raw reply

* RE: [PATCH v2 1/2] drm/hyperv: validate resolution_count and harden VSP request paths
From: Michael Kelley @ 2026-05-19 18:33 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat, stable@vger.kernel.org
In-Reply-To: <20260517-drm-hyperv-patch1-v2@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Sunday, May 17, 2026 7:25 AM
> 
> The synthetic video device parses a SYNTHVID_RESOLUTION_RESPONSE that
> contains a u8 resolution_count and a u8 default_resolution_index. The
> existing check rejects resolution_count == 0 and an index greater or
> equal to resolution_count, but does not bound resolution_count itself
> against the fixed supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT]
> array. A host that returns resolution_count > 64 together with an
> in-range default_resolution_index causes the subsequent loop to read
> past the array. Reject any resolution_count that exceeds the array
> bound, folded into the existing zero-check so a single log entry
> remains per failure.

I think this is a good fix, but let me provide some background. A core
assumption has been that the Hyper-V host is not malicious, and data from
the host can be trusted to be well-formed and valid. Linux code interacting
with the host does *not* take a paranoid approach and does *not* validate
every value received from the host. Existing validation is ad hoc and
probably not comprehensive.

A few years ago with the introduction of Confidential Computing and
"CoCo VMs" in in Linux, the assumption changed. (Hyper-V calls these
"Confidential VMs", or sometimes "Isolated VMs" -- unfortunately, the
terminology is a bit scattered.) The host is treated as potentially malicious
and Linux code was hardened against bad values passed from the host. But
not all drivers for Hyper-V synthetic devices were hardened because CoCo VMs
on Hyper-V don't allow all the possible synthetic devices. The Hyper-V synthetic
frame buffer is one such disallowed device, so the Hyper-V DRM driver was not
hardened. The allowed devices can be seen in the vmbus_devs[] array
where .allowed_in_isolation is "true".

All that said, I'm in favor of adding hardening, even if it is done piecemeal.
There won't likely be a comprehensive effort to harden the Hyper-V DRM
driver unless the frame buffer device becomes available in a CoCo VM.

> 
> When that bounds check (or any later failure in
> hyperv_get_supported_resolution()) returns an error, the caller in
> hyperv_connect_vsp() previously logged a warning and continued without
> populating hv->screen_width_max / hv->screen_height_max / preferred_*.
> hyperv_mode_config_init() then set dev->mode_config.max_width and
> max_height to 0, which makes drm_internal_framebuffer_create() reject
> every userspace framebuffer with -EINVAL. Populate the fields with the
> WIN8 defaults that the pre-WIN10 branch already uses so a failed
> resolution probe degrades to a usable display instead of disabling it.

Yes, this also makes sense.

> 
> The driver also issues three sequential VSP requests (negotiate
> version, update VRAM location, get supported resolution) that share a
> single hv->wait completion. None of the call sites call
> reinit_completion() between requests. If wait_for_completion_timeout()
> returns 0 but a delayed response later triggers complete(&hv->wait) in
> the receive callback, the next request's wait can consume that stale
> completion, return immediately, and parse stale data out of
> hv->init_buf. Call reinit_completion() before each send so every
> request waits for its own response.

This change should probably be a separate patch, as it's not really
related to the bounds checking issue. You could even argue that the
bounds check and using the default size values in case of an error
should be separate patches, but it's a judgment call and I could go
either way. Combining the first two works for me, though someone
else might disagree.

All that notwithstanding, I don't think your fix is needed in its current
form. If wait_for_completion_timeout() times out in "negotiate version"
or "update VRAM location", the code returns -ETIMEDOUT. The caller
then bails out and does not send the next message in the sequence.
Eventually hyperv_vmbus_probe() fails and the struct hyperv_drm_device
memory is freed. A similar thing could happen with hyperv_vmbus_resume().
The memory isn't freed, but if the resume fails and is tried again later, it
should be with a fresh copy of the saved memory image. In that case, the
completion should be clean for the retry.

The problem case is "get supported resolution", which would leave the
completion in a pending state if it times out. This could later affect
hyperv_vmbus_resume().  I could see reinit'ing the completion in
hyperv_vmbus_resume() to handle this case.

But double-check my thinking on all this. Maybe I'm missing something.

One other comment:  In my view, your commit message is a bit too
detailed. A commit message should focus on "why" the change, and
less on the details of the implementation. Explaining the failure case
is fine, but don't recapitulate code that is straightforward and obvious.

> 
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Cc: stable@vger.kernel.org # 5.14+
> Signed-off-by: Berkant Koc <me@berkoc.com>
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc526832..3b5065fe06e4 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -223,6 +223,7 @@ static int hyperv_negotiate_version(struct hv_device *hdev, u32 ver)
>  	msg->vid_hdr.size = sizeof(struct synthvid_msg_hdr) +
>  		sizeof(struct synthvid_version_req);
>  	msg->ver_req.version = ver;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);
> 
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
> @@ -257,6 +258,7 @@ int hyperv_update_vram_location(struct hv_device *hdev, phys_addr_t vram_pp)
>  	msg->vram.user_ctx = vram_pp;
>  	msg->vram.vram_gpa = vram_pp;
>  	msg->vram.is_vram_gpa_specified = 1;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);
> 
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
> @@ -383,6 +385,7 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  		sizeof(struct synthvid_supported_resolution_req);
>  	msg->resolution_req.maximum_resolution_count =
>  		SYNTHVID_MAX_RESOLUTION_COUNT;
> +	reinit_completion(&hv->wait);
>  	hyperv_sendpacket(hdev, msg);
> 
>  	t = wait_for_completion_timeout(&hv->wait, VMBUS_VSP_TIMEOUT);
> @@ -391,8 +394,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  		return -ETIMEDOUT;
>  	}
> 
> -	if (msg->resolution_resp.resolution_count == 0) {
> -		drm_err(dev, "No supported resolutions\n");
> +	if (msg->resolution_resp.resolution_count == 0 ||
> +	    msg->resolution_resp.resolution_count >
> +	    SYNTHVID_MAX_RESOLUTION_COUNT) {
> +		drm_err(dev, "Invalid resolution count: %d\n",
> +			msg->resolution_resp.resolution_count);
>  		return -ENODEV;
>  	}
> 
> @@ -506,8 +512,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)
> 
>  	if (hyperv_version_ge(hv->synthvid_version, SYNTHVID_VERSION_WIN10)) {
>  		ret = hyperv_get_supported_resolution(hdev);
> -		if (ret)
> +		if (ret) {
>  			drm_err(dev, "Failed to get supported resolution from host, use default\n");
> +			hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
> +			hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
> +			hv->preferred_width = SYNTHVID_WIDTH_WIN8;
> +			hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
> +		}
>  	} else {
>  		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
>  		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;

Is there a separate problem here in that preferred_width and preferred_height
are not set in the pre-WIN10 case? I'm not familiar enough with DRM driver code
to know how these preferred values are used. It looks inconsistent to have the
two fallback cases be different.

Also, having to duplicate the fallback code is distasteful. Instead of having an
"else" clause, maybe have a follow-up test for screen_width_max (or any of the
values) being zero as an indicator that they haven't been set, and apply the defaults in
that case?

Michael

^ permalink raw reply

* [PATCH net v2] vsock: keep poll shutdown state consistent
From: Ziyu Zhang @ 2026-05-19 16:56 UTC (permalink / raw)
  To: Stefano Garzarella, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Andy King, George Zhang, Dmitry Torokhov,
	K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Stefan Hajnoczi, Michael S . Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Bryan Tan, Vishnu Dasa,
	bcm-kernel-feedback-list, virtualization, netdev, linux-kernel,
	linux-hyperv, kvm, baijiaju1990, r33s3n6, gality369,
	zhenghaoran154, hanguidong02, zzzccc427, Ziyu Zhang

vsock_poll() reads vsk->peer_shutdown before taking the socket lock
to set EPOLLHUP and EPOLLRDHUP, then reads it again after taking
the lock to report EOF readability. A shutdown packet can update
peer_shutdown while poll is waiting for the lock, so one poll invocation
can report EOF readability without the corresponding HUP/RDHUP bits.

For connectible sockets, take one peer_shutdown snapshot after
lock_sock() and use it for all peer-shutdown-derived poll bits. For
datagram sockets, which do not take lock_sock() in poll(), take one
lockless READ_ONCE() snapshot and pair it with WRITE_ONCE() on the
writer side.

This keeps the peer-shutdown-derived bits internally consistent for each
poll pass.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Ziyu Zhang <ziyuzhang201@gmail.com>
---
Link: https://lore.kernel.org/netdev/20260516034745.260442-1-ziyuzhang201@gmail.com/

v2:
- Pair lockless READ_ONCE() users with WRITE_ONCE() on peer_shutdown writers.
- Move datagram shutdown handling into the SOCK_DGRAM branch and add a comment.
- Keep one connectible peer_shutdown snapshot after lock_sock() and
  restore the previous shutdown-derived mask ordering.

 net/vmw_vsock/af_vsock.c                | 49 ++++++++++++++++---------
 net/vmw_vsock/hyperv_transport.c        |  9 +++--
 net/vmw_vsock/virtio_transport_common.c | 14 ++++---
 net/vmw_vsock/vmci_transport.c          |  8 ++--
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index adcba1b7b..789b00f6e 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -523,7 +523,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
 		 */
 		sock_reset_flag(sk, SOCK_DONE);
 		sk->sk_state = TCP_CLOSE;
-		vsk->peer_shutdown = 0;
+		WRITE_ONCE(vsk->peer_shutdown, 0);
 	}
 
 	if (sk->sk_type == SOCK_SEQPACKET) {
@@ -814,7 +814,7 @@ static struct sock *__vsock_create(struct net *net,
 	vsk->rejected = false;
 	vsk->sent_request = false;
 	vsk->ignore_connecting_rst = false;
-	vsk->peer_shutdown = 0;
+	WRITE_ONCE(vsk->peer_shutdown, 0);
 	INIT_DELAYED_WORK(&vsk->connect_work, vsock_connect_timeout);
 	INIT_DELAYED_WORK(&vsk->pending_work, vsock_pending_work);
 
@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int mode)
 	return err;
 }
 
+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
+{
+	__poll_t mask = 0;
+
+	/* INET sockets treat local write shutdown and peer write shutdown as a
+	 * case of EPOLLHUP set.
+	 */
+	if (sk->sk_shutdown == SHUTDOWN_MASK ||
+	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
+	     (peer_shutdown & SEND_SHUTDOWN)))
+		mask |= EPOLLHUP;
+
+	if (sk->sk_shutdown & RCV_SHUTDOWN ||
+	    peer_shutdown & SEND_SHUTDOWN)
+		mask |= EPOLLRDHUP;
+
+	return mask;
+}
+
 static __poll_t vsock_poll(struct file *file, struct socket *sock,
 			       poll_table *wait)
 {
@@ -1139,24 +1158,17 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 		/* Signify that there has been an error on this socket. */
 		mask |= EPOLLERR;
 
-	/* INET sockets treat local write shutdown and peer write shutdown as a
-	 * case of EPOLLHUP set.
-	 */
-	if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
-	    ((sk->sk_shutdown & SEND_SHUTDOWN) &&
-	     (vsk->peer_shutdown & SEND_SHUTDOWN))) {
-		mask |= EPOLLHUP;
-	}
-
-	if (sk->sk_shutdown & RCV_SHUTDOWN ||
-	    vsk->peer_shutdown & SEND_SHUTDOWN) {
-		mask |= EPOLLRDHUP;
-	}
-
 	if (sk_is_readable(sk))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
 	if (sock->type == SOCK_DGRAM) {
+		u32 peer_shutdown = READ_ONCE(vsk->peer_shutdown);
+
+		/* DGRAM sockets do not take lock_sock() in poll(), so use one
+		 * lockless snapshot for all shutdown-derived mask bits.
+		 */
+		mask |= vsock_poll_shutdown(sk, peer_shutdown);
+
 		/* For datagram sockets we can read if there is something in
 		 * the queue and write as long as the socket isn't shutdown for
 		 * sending.
@@ -1171,6 +1183,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 
 	} else if (sock_type_connectible(sk->sk_type)) {
 		const struct vsock_transport *transport;
+		u32 peer_shutdown;
 
 		lock_sock(sk);
 
@@ -1203,8 +1216,10 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
 		 * terminated should also be considered read, and we check the
 		 * shutdown flag for that.
 		 */
+		peer_shutdown = READ_ONCE(vsk->peer_shutdown);
+		mask |= vsock_poll_shutdown(sk, peer_shutdown);
 		if (sk->sk_shutdown & RCV_SHUTDOWN ||
-		    vsk->peer_shutdown & SEND_SHUTDOWN) {
+		    peer_shutdown & SEND_SHUTDOWN) {
 			mask |= EPOLLIN | EPOLLRDNORM;
 		}
 
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 432fcbbd1..16b981566 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -264,7 +264,7 @@ static void hvs_do_close_lock_held(struct vsock_sock *vsk,
 	struct sock *sk = sk_vsock(vsk);
 
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown = SHUTDOWN_MASK;
+	WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 	if (vsock_stream_has_data(vsk) <= 0)
 		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
@@ -593,7 +593,9 @@ static int hvs_update_recv_data(struct hvsock *hvs)
 		return -EIO;
 
 	if (payload_len == 0)
-		hvs->vsk->peer_shutdown |= SEND_SHUTDOWN;
+		WRITE_ONCE(hvs->vsk->peer_shutdown,
+			   READ_ONCE(hvs->vsk->peer_shutdown) |
+			   SEND_SHUTDOWN);
 
 	hvs->recv_data_len = payload_len;
 	hvs->recv_data_off = 0;
@@ -715,7 +717,8 @@ static s64 hvs_stream_has_data(struct vsock_sock *vsk)
 			return ret;
 		return hvs->recv_data_len;
 	case 0:
-		vsk->peer_shutdown |= SEND_SHUTDOWN;
+		WRITE_ONCE(vsk->peer_shutdown,
+			   READ_ONCE(vsk->peer_shutdown) | SEND_SHUTDOWN);
 		ret = 0;
 		break;
 	default: /* -1 */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index dcc8a1d58..71d8eac82 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1220,7 +1220,7 @@ static void virtio_transport_do_close(struct vsock_sock *vsk,
 	struct sock *sk = sk_vsock(vsk);
 
 	sock_set_flag(sk, SOCK_DONE);
-	vsk->peer_shutdown = SHUTDOWN_MASK;
+	WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 	if (vsock_stream_has_data(vsk) <= 0)
 		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
@@ -1411,12 +1411,15 @@ virtio_transport_recv_connected(struct sock *sk,
 	case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
 		sk->sk_write_space(sk);
 		break;
-	case VIRTIO_VSOCK_OP_SHUTDOWN:
+	case VIRTIO_VSOCK_OP_SHUTDOWN: {
+		u32 peer_shutdown = READ_ONCE(vsk->peer_shutdown);
+
 		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_RCV)
-			vsk->peer_shutdown |= RCV_SHUTDOWN;
+			peer_shutdown |= RCV_SHUTDOWN;
 		if (le32_to_cpu(hdr->flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
-			vsk->peer_shutdown |= SEND_SHUTDOWN;
-		if (vsk->peer_shutdown == SHUTDOWN_MASK) {
+			peer_shutdown |= SEND_SHUTDOWN;
+		WRITE_ONCE(vsk->peer_shutdown, peer_shutdown);
+		if (peer_shutdown == SHUTDOWN_MASK) {
 			if (vsock_stream_has_data(vsk) <= 0 && !sock_flag(sk, SOCK_DONE)) {
 				(void)virtio_transport_reset(vsk, NULL);
 				virtio_transport_do_close(vsk, true);
@@ -1431,6 +1434,7 @@ virtio_transport_recv_connected(struct sock *sk,
 		if (le32_to_cpu(virtio_vsock_hdr(skb)->flags))
 			sk->sk_state_change(sk);
 		break;
+	}
 	case VIRTIO_VSOCK_OP_RST:
 		virtio_transport_do_close(vsk, true);
 		break;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 7eccd6708..c2231c402 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -811,7 +811,7 @@ static void vmci_transport_handle_detach(struct sock *sk)
 		/* On a detach the peer will not be sending or receiving
 		 * anymore.
 		 */
-		vsk->peer_shutdown = SHUTDOWN_MASK;
+		WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 
 		/* We should not be sending anymore since the peer won't be
 		 * there to receive, but we can still receive if there is data
@@ -1534,7 +1534,9 @@ static int vmci_transport_recv_connected(struct sock *sk,
 		if (pkt->u.mode) {
 			vsk = vsock_sk(sk);
 
-			vsk->peer_shutdown |= pkt->u.mode;
+			WRITE_ONCE(vsk->peer_shutdown,
+				   READ_ONCE(vsk->peer_shutdown) |
+				   pkt->u.mode);
 			sk->sk_state_change(sk);
 		}
 		break;
@@ -1551,7 +1553,7 @@ static int vmci_transport_recv_connected(struct sock *sk,
 		 * a clean shutdown.
 		 */
 		sock_set_flag(sk, SOCK_DONE);
-		vsk->peer_shutdown = SHUTDOWN_MASK;
+		WRITE_ONCE(vsk->peer_shutdown, SHUTDOWN_MASK);
 		if (vsock_stream_has_data(vsk) <= 0)
 			sk->sk_state = TCP_CLOSING;
 
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
From: Paolo Abeni @ 2026-05-19 13:30 UTC (permalink / raw)
  To: longli
  Cc: kotaranov, kuba, davem, edumazet, andrew+netdev, jgg, leon,
	haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260519132825.172490-1-pabeni@redhat.com>

On 5/19/26 3:28 PM, Paolo Abeni wrote:
> From: AI Reviewer <ai@example.com>
> 
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> net: mana: Create separate EQs for each vPort
> 
> This patch moves EQ allocation from per-device to per-vPort, exports
> mana_create_eq()/mana_destroy_eq() for use by the RDMA driver, takes
> a vport reference for RSS QPs, and adds cross-port PD-sharing
> rejection for raw QPs along with serialization between
> mana_set_channels() and the RDMA vport configuration.
> 
>> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
>> index ac5e75dd3494..f8a9013f0ca3 100644
>> --- a/drivers/infiniband/hw/mana/main.c
>> +++ b/drivers/infiniband/hw/mana/main.c
>> @@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>>  
>>  	pd->vport_use_count++;
>>  	if (pd->vport_use_count > 1) {
>> +		/* Reject cross-port PD sharing. EQs and vport config
>> +		 * are per-port, so the PD must stay bound to the port
>> +		 * that was configured on the first raw QP creation.
>> +		 */
>> +		if (pd->vport_port != port) {
>> +			pd->vport_use_count--;
>> +			mutex_unlock(&pd->vport_mutex);
>> +			ibdev_dbg(&dev->ib_dev,
>> +				  "PD already bound to port %u\n",
>> +				  pd->vport_port);
>> +			return -EINVAL;
>> +		}
>>  		ibdev_dbg(&dev->ib_dev,
>>  			  "Skip as this PD is already configured vport\n");
>>  		mutex_unlock(&pd->vport_mutex);
>>  		return 0;
>>  	}
>>  
>> -	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
>> +	pd->vport_port = port;
>> +
>> +	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
>>  	if (err) {
>>  		pd->vport_use_count--;
>>  		mutex_unlock(&pd->vport_mutex);
> 
> [Low]
> In the failure path of mana_ib_cfg_vport(), if mana_cfg_vport() or the
> later mana_create_eq() fails, vport_use_count is rolled back to 0 but
> pd->vport_port keeps the value just written.
> 
> Is the contract that pd->vport_port is only meaningful while
> vport_use_count > 0?  That invariant is not documented in the struct
> mana_ib_pd comment, and a future reader who consults pd->vport_port
> without first checking vport_use_count under vport_mutex would see stale
> data.  Would it be worth either resetting pd->vport_port on the rollback
> path or documenting the rule explicitly?
> 
>> @@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>>  		return err;
>>  	}
>>  
>> -	mutex_unlock(&pd->vport_mutex);
>>  
>> -	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
>> -	pd->tx_vp_offset = mpc->tx_vp_offset;
>> +	err = mana_create_eq(mpc);
>> +	if (err) {
>> +		mana_uncfg_vport(mpc);
>> +		pd->vport_use_count--;
>> +	} else {
>> +		pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
>> +		pd->tx_vp_offset = mpc->tx_vp_offset;
>> +	}
> 
> [ ... ]
> 
>> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
>> index 0fbcf449c134..108ec4c5ce51 100644
>> --- a/drivers/infiniband/hw/mana/qp.c
>> +++ b/drivers/infiniband/hw/mana/qp.c
>> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>>  
>>  	qp->port = port;
>>  
>> +	/* Take a reference on the vport to ensure EQs outlive this QP.
>> +	 * The vport must already be configured by a raw QP.
>> +	 */
>> +	mutex_lock(&mana_pd->vport_mutex);
>> +	if (!mana_pd->vport_use_count) {
>> +		mutex_unlock(&mana_pd->vport_mutex);
>> +		ret = -EINVAL;
>> +		goto fail;
>> +	}
>> +	mana_pd->vport_use_count++;
>> +	mutex_unlock(&mana_pd->vport_mutex);
>> +
> 
> [Critical]
> The cross-port PD-sharing check added to mana_ib_cfg_vport() is not
> mirrored here.  mana_ib_create_qp_rss() takes the user-supplied port
> from ucmd.port and only checks that vport_use_count is non-zero, so the
> RSS path will happily attach to a different port than the one bound to
> the PD by the first raw QP.
> 
> If a user creates a raw QP on port A (which sets pd->vport_port=A and
> allocates mpc_A->eqs via mana_create_eq()), then creates an RSS QP on
> the same PD with ucmd.port=B where port B has no Ethernet up and no
> RDMA activity, mpc_B->eqs is NULL.  The subsequent code in this same
> function:
> 
> 	eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
> 	cq_spec.attached_eq = eq->eq->id;
> 
> would then dereference NULL through mpc_B->eqs.
> 
> If port B does have Ethernet up, mpc_B->eqs is owned by the Ethernet
> driver and the RSS QP attaches to those EQs.  When the QP is destroyed
> mana_ib_destroy_qp_rss() calls mana_ib_uncfg_vport(mdev, pd, qp->port=B),
> and once pd->vport_use_count reaches 0 mana_ib_uncfg_vport() will run
> mana_destroy_eq(mpc_B) on Ethernet's live EQs and call mana_uncfg_vport
> on a port whose apc->vport_use_count was never incremented by this PD,
> tripping the WARN_ON in mana_uncfg_vport() and corrupting Ethernet's
> vport state.  Meanwhile port A's EQs and vport configuration are leaked
> because nothing on this PD will destroy them.
> 
> Should mana_ib_create_qp_rss() apply the same pd->vport_port == port
> check that mana_ib_cfg_vport() now performs, before bumping
> vport_use_count?

Sashiko reported alredy this problematic pattern in 2 separate places.
Please ensure that there are no other similar buggy usage pattern
elsewhere in the newly introduced code.

/P


^ permalink raw reply

* Re: [PATCH net-next v10 1/6] net: mana: Create separate EQs for each vPort
From: Paolo Abeni @ 2026-05-19 13:28 UTC (permalink / raw)
  To: longli
  Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
	leon, haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260515040508.491748-2-longli@microsoft.com>

From: AI Reviewer <ai@example.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net: mana: Create separate EQs for each vPort

This patch moves EQ allocation from per-device to per-vPort, exports
mana_create_eq()/mana_destroy_eq() for use by the RDMA driver, takes
a vport reference for RSS QPs, and adds cross-port PD-sharing
rejection for raw QPs along with serialization between
mana_set_channels() and the RDMA vport configuration.

> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index ac5e75dd3494..f8a9013f0ca3 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -40,13 +42,27 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  
>  	pd->vport_use_count++;
>  	if (pd->vport_use_count > 1) {
> +		/* Reject cross-port PD sharing. EQs and vport config
> +		 * are per-port, so the PD must stay bound to the port
> +		 * that was configured on the first raw QP creation.
> +		 */
> +		if (pd->vport_port != port) {
> +			pd->vport_use_count--;
> +			mutex_unlock(&pd->vport_mutex);
> +			ibdev_dbg(&dev->ib_dev,
> +				  "PD already bound to port %u\n",
> +				  pd->vport_port);
> +			return -EINVAL;
> +		}
>  		ibdev_dbg(&dev->ib_dev,
>  			  "Skip as this PD is already configured vport\n");
>  		mutex_unlock(&pd->vport_mutex);
>  		return 0;
>  	}
>  
> -	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
> +	pd->vport_port = port;
> +
> +	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id, true);
>  	if (err) {
>  		pd->vport_use_count--;
>  		mutex_unlock(&pd->vport_mutex);

[Low]
In the failure path of mana_ib_cfg_vport(), if mana_cfg_vport() or the
later mana_create_eq() fails, vport_use_count is rolled back to 0 but
pd->vport_port keeps the value just written.

Is the contract that pd->vport_port is only meaningful while
vport_use_count > 0?  That invariant is not documented in the struct
mana_ib_pd comment, and a future reader who consults pd->vport_port
without first checking vport_use_count under vport_mutex would see stale
data.  Would it be worth either resetting pd->vport_port on the rollback
path or documenting the rule explicitly?

> @@ -55,15 +71,23 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
>  		return err;
>  	}
>  
> -	mutex_unlock(&pd->vport_mutex);
>  
> -	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> -	pd->tx_vp_offset = mpc->tx_vp_offset;
> +	err = mana_create_eq(mpc);
> +	if (err) {
> +		mana_uncfg_vport(mpc);
> +		pd->vport_use_count--;
> +	} else {
> +		pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> +		pd->tx_vp_offset = mpc->tx_vp_offset;
> +	}

[ ... ]

> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 0fbcf449c134..108ec4c5ce51 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -155,6 +156,18 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  
>  	qp->port = port;
>  
> +	/* Take a reference on the vport to ensure EQs outlive this QP.
> +	 * The vport must already be configured by a raw QP.
> +	 */
> +	mutex_lock(&mana_pd->vport_mutex);
> +	if (!mana_pd->vport_use_count) {
> +		mutex_unlock(&mana_pd->vport_mutex);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	mana_pd->vport_use_count++;
> +	mutex_unlock(&mana_pd->vport_mutex);
> +

[Critical]
The cross-port PD-sharing check added to mana_ib_cfg_vport() is not
mirrored here.  mana_ib_create_qp_rss() takes the user-supplied port
from ucmd.port and only checks that vport_use_count is non-zero, so the
RSS path will happily attach to a different port than the one bound to
the PD by the first raw QP.

If a user creates a raw QP on port A (which sets pd->vport_port=A and
allocates mpc_A->eqs via mana_create_eq()), then creates an RSS QP on
the same PD with ucmd.port=B where port B has no Ethernet up and no
RDMA activity, mpc_B->eqs is NULL.  The subsequent code in this same
function:

	eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
	cq_spec.attached_eq = eq->eq->id;

would then dereference NULL through mpc_B->eqs.

If port B does have Ethernet up, mpc_B->eqs is owned by the Ethernet
driver and the RSS QP attaches to those EQs.  When the QP is destroyed
mana_ib_destroy_qp_rss() calls mana_ib_uncfg_vport(mdev, pd, qp->port=B),
and once pd->vport_use_count reaches 0 mana_ib_uncfg_vport() will run
mana_destroy_eq(mpc_B) on Ethernet's live EQs and call mana_uncfg_vport
on a port whose apc->vport_use_count was never incremented by this PD,
tripping the WARN_ON in mana_uncfg_vport() and corrupting Ethernet's
vport state.  Meanwhile port A's EQs and vport configuration are leaked
because nothing on this PD will destroy them.

Should mana_ib_create_qp_rss() apply the same pd->vport_port == port
check that mana_ib_cfg_vport() now performs, before bumping
vport_use_count?

[Medium]
The new "vport must already be configured by a raw QP" requirement is a
visible uAPI change.  Before this patch, EQs were allocated device-wide
on mana_context at probe time, so an RSS QP could be created without any
prior raw QP on the PD.  After the patch, applications that legitimately
use only RSS QPs (RX-only RDMA workloads) will see -EINVAL here unless
they also create a raw QP first.

Was breaking the RSS-only case considered, and would lazy EQ creation on
the first QP regardless of type be feasible to avoid the
incompatibility?

>  	for (i = 0; i < ind_tbl_size; i++) {
>  		struct mana_obj_spec wq_spec = {};
>  		struct mana_obj_spec cq_spec = {};
> @@ -171,13 +184,13 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
>  		cq_spec.gdma_region = cq->queue.gdma_region;
>  		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
>  		cq_spec.modr_ctx_id = 0;
> -		eq = &mpc->ac->eqs[cq->comp_vector];
> +		eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>  		cq_spec.attached_eq = eq->eq->id;

[ ... ]

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> index 04350973e19e..4633acc976f0 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> @@ -462,10 +467,22 @@ static int mana_set_channels(struct net_device *ndev,
>  	unsigned int new_count = channels->combined_count;
>  	unsigned int old_count = apc->num_queues;
>  	int err;
>  
> +	/* Set channel_changing to block RDMA from grabbing the vport
> +	 * during the detach/attach window. mana_cfg_vport() checks
> +	 * this flag under vport_mutex and returns -EBUSY if set.
> +	 */
> +	mutex_lock(&apc->vport_mutex);
> +	if (!apc->port_is_up && apc->vport_use_count) {
> +		mutex_unlock(&apc->vport_mutex);
> +		return -EBUSY;
> +	}
> +	apc->channel_changing = true;
> +	mutex_unlock(&apc->vport_mutex);

[Low]
The commit message says:

    When the port is down, apc->vport_mutex is held for the entire
    operation since mana_detach()/mana_attach() do not take vport_mutex
    in that case. When the port is up, Ethernet owns the vport
    exclusively so no additional locking is needed.

Does this description match the code?  mana_set_channels() drops
apc->vport_mutex immediately after setting channel_changing=true and
runs mana_pre_alloc_rxbufs(), mana_detach() and mana_attach() without
the mutex held, then re-acquires the mutex only at the end to clear the
flag.

The actual serialization mechanism is the channel_changing boolean
checked by mana_cfg_vport() under vport_mutex, not a long mutex hold.
Could the commit message be updated to match the implemented design so
that future readers do not assume a different invariant?
-- 
This is an AI-generated review.


^ permalink raw reply

* Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: Jason Gunthorpe @ 2026-05-19 12:52 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Mukesh R, hpa, robin.murphy, robh, wei.liu, mhklinux, muislam,
	namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260518224136.0000403e@linux.microsoft.com>

On Mon, May 18, 2026 at 10:41:36PM -0700, Jacob Pan wrote:

> Just wondering what work is needed to support this "direct attach"? I
> felt this issue is due to trying to cram two distinct domain types
> (paging domain & direct attach) into the VFIO container model where
> only unmanaged paging domain is supported.

Xen has the same issue and you two need to come up with a uniform
solution.

VFIO container can't support it, that's out.

You should be focusing on a iommufd flow that accepts some FD
representing the VM (ie KVM FD) that can be converted by the driver
into a HWPT representing that FD's S2 translation.

> I am thinking if we were to switch to iommufd and let user(vmm) have
> direct control of HWPT, vmm will be able to selectively use a
> different domain type to handle direct attach. 

Yes

> IMHO, it is essentially the same as attaching nest parent domain
> without nested domain immediately attached. The unprivileged guest
> may attach nested domain directly with Hyper-V if nested translation
> is needed.

nest parent domain is really for supporting the viommu objects.. If
you don't have that flow you don't need to worry about that nest
parent stuff.

> From this driver POV, it will allocate a 2nd stage only domain with
> different domain ops (w/o map/unmap) for "direct attach" thus avoid this
> hack.

Yes, from a driver POV you need a unique iommu_domain allocator that
returns an iommu_domain without an ops.

It should probably work similarly to the viommu where the iommufd path
can send in a driver-specific tagged struct that can describe these
special domains.

But be mindful of the lifetime rules, whatever ID is used to describe
the VM at the hypercall boundary has to be bound into a linux FD and
become immutable. The driver has to hold that FD as long as the domain
exists.

Jason

^ permalink raw reply

* Re: [PATCH net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
From: patchwork-bot+netdevbpf @ 2026-05-19 11:10 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, horms, kees, shacharr,
	stephen, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260514194156.466823-1-ernis@linux.microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Thu, 14 May 2026 12:41:51 -0700 you wrote:
> In mana_hwc_rx_event_handler(), resp->response.hwc_msg_id is read from
> DMA-coherent memory and bounds-checked, then mana_hwc_handle_resp()
> re-reads the same field from the same DMA buffer for test_bit() and
> pointer arithmetic.
> 
> DMA-coherent memory is mapped uncacheable on x86 and is shared,
> unencrypted, in Confidential VMs (SEV-SNP/TDX), so each load goes
> directly to host-visible memory. A H/W can modify the value
> between the check and the use, bypassing the bounds validation.
> 
> [...]

Here is the summary with links:
  - [net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
    https://git.kernel.org/netdev/net/c/35f0f0a2536a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v2 0/5] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-05-19 11:09 UTC (permalink / raw)
  To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
	wei.liu, decui, longli, andersson, mathieu.poirier
  Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
	linux-remoteproc
In-Reply-To: <DIG1MUKJVLEO.YGTSSYIO5T1K@kernel.org>

On Mon May 11, 2026 at 8:02 PM CEST, Danilo Krummrich wrote:
> On Tue May 5, 2026 at 3:37 PM CEST, Danilo Krummrich wrote:
>> This series is based on v7.1-rc1 with no additional dependencies, hence those
>> patches can be picked up by subsystems individually.
>
> Gentle ping on this one; I can alternatively also take those patches through the
> driver-core tree if you prefer.

Since the remaining patches of this series have been out since v7.0-rc5 without
any objections, I'm planning to pick it up through the driver-core tree for
v7.2-rc1.

Please let me know if there are any concerns or whether you prefer to take
individual patches through your own tree.

Thanks,
Danilo

^ permalink raw reply

* Re: [PATCH net] net: mana: Fix TOCTOU double-fetch of hwc_msg_id from DMA buffer
From: Paolo Abeni @ 2026-05-19 11:00 UTC (permalink / raw)
  To: Erni Sri Satya Vennela, kys, haiyangz, wei.liu, decui, longli,
	andrew+netdev, davem, edumazet, kuba, dipayanroy, horms, kees,
	shacharr, stephen, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260514194156.466823-1-ernis@linux.microsoft.com>

On 5/14/26 9:41 PM, Erni Sri Satya Vennela wrote:
> In mana_hwc_rx_event_handler(), resp->response.hwc_msg_id is read from
> DMA-coherent memory and bounds-checked, then mana_hwc_handle_resp()
> re-reads the same field from the same DMA buffer for test_bit() and
> pointer arithmetic.
> 
> DMA-coherent memory is mapped uncacheable on x86 and is shared,
> unencrypted, in Confidential VMs (SEV-SNP/TDX), so each load goes
> directly to host-visible memory. A H/W can modify the value
> between the check and the use, bypassing the bounds validation.

Sashiko noted there are more related issues in the nearby code:

https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260514194156.466823-1-ernis%40linux.microsoft.com

you may consider a follow-up.

/P


^ permalink raw reply

* [PATCH net-next v9] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-05-19  6:46 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
	dipayanroy, ernis, kees, linux-hyperv, netdev, linux-kernel,
	linux-rdma

Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.

The debugfs directory for each PCI device is named using pci_name()
(the unique BDF address), and its creation and removal is integrated
into mana_gd_setup() and mana_gd_cleanup_device() respectively, so
that all callers (probe, remove, suspend, resume, shutdown) share a
single code path.

Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
  - num_msix_usable, max_num_queues: Max resources from hardware
  - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
  - num_vports, bm_hostmode: Device configuration

Per-vPort entries (under /sys/kernel/debug/mana/<BDF>/vportN/):
  - port_handle: Hardware vPort handle
  - max_sq, max_rq: Max queues from vPort config
  - indir_table_sz: Indirection table size
  - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
    Last applied steering configuration parameters

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
Change in v9:
* Change steer_update_tab type from u32 to bool and use
  debugfs_create_bool() accordingly
* Guard debugfs_lookup_and_remove() calls in mana_remove() with a
  NULL check on gc->mana_pci_debugfs
* Fix mana_gd_resume() RDMA failure unwind: call mana_rdma_remove()
  to undo partial RDMA state and return err, instead of
  mana_remove(true) + mana_gd_cleanup_device(), avoiding a UAF
  where gf_stats_work could fire against an already-destroyed HWC
Changes in v8:
* Move debugfs_create_u16("num_vports", ...) and
  debugfs_create_u8("bm_hostmode", ...) to after ac->num_ports has been
  assigned and clamped to MAX_PORTS_IN_MANA_DEV, so the value exposed
  via debugfs always reflects the final, hardware-reported count
  rather than a transient zero or unclamped value.
* Update the stale comment above mana_gd_resume() to reflect the new
  rollback-on-failure behavior.
Changes in v7:
* Rebase to latest main.
Changes in v6:
* Move out of patchset and create a separate patch.
Changes in v5:
* Update commit message.
* Fix conflicts to align with the new patches.
* Make it part of patchset.
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
  because "ac" gets freed before debugfs_remove_recursive, to avoid
  Use-After-Free error.
* Add "goto out:" in mana_cfg_vport_steering to avoid populating apc
  values when resp.hdr.status is not NULL.
Changes in v2:
* Add debugfs_remove_recursice for gc>mana_pci_debugfs in
  mana_gd_suspend to handle multiple duplicates creation in
  mana_gd_setup and mana_gd_resume path.
* Move debugfs creation for num_vports and bm_hostmode out of
  if(!resuming) condition since we have to create it again even for
  resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---

 .../net/ethernet/microsoft/mana/gdma_main.c   | 73 ++++++++++---------
 drivers/net/ethernet/microsoft/mana/mana_en.c | 35 +++++++++
 include/net/mana/gdma.h                       |  1 +
 include/net/mana/mana.h                       |  8 ++
 4 files changed, 83 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 3bc3fff55999..712a0881d720 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -227,6 +227,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	if (gc->max_num_queues == 0)
 		return -ENOSPC;
 
+	debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+			   &gc->num_msix_usable);
+	debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+			   &gc->max_num_queues);
+
 	return 0;
 }
 
@@ -1297,6 +1302,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
 		return err ? err : -EPROTO;
 	}
 	gc->pf_cap_flags1 = resp.pf_cap_flags1;
+	gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+	debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+			   &gc->gdma_protocol_ver);
+	debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+			   &gc->pf_cap_flags1);
+
 	if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
 		err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
 		if (err) {
@@ -1976,15 +1988,20 @@ static int mana_gd_setup(struct pci_dev *pdev)
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	int err;
 
+	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+						  mana_debugfs_root);
+
 	err = mana_gd_init_registers(pdev);
 	if (err)
-		return err;
+		goto remove_debugfs;
 
 	mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
 
 	gc->service_wq = alloc_ordered_workqueue("gdma_service_wq", 0);
-	if (!gc->service_wq)
-		return -ENOMEM;
+	if (!gc->service_wq) {
+		err = -ENOMEM;
+		goto remove_debugfs;
+	}
 
 	err = mana_gd_setup_hwc_irqs(pdev);
 	if (err) {
@@ -2025,11 +2042,14 @@ static int mana_gd_setup(struct pci_dev *pdev)
 free_workqueue:
 	destroy_workqueue(gc->service_wq);
 	gc->service_wq = NULL;
+remove_debugfs:
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
 	dev_err(&pdev->dev, "%s failed (error %d)\n", __func__, err);
 	return err;
 }
 
-static void mana_gd_cleanup(struct pci_dev *pdev)
+static void mana_gd_cleanup_device(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 
@@ -2041,6 +2061,10 @@ static void mana_gd_cleanup(struct pci_dev *pdev)
 		destroy_workqueue(gc->service_wq);
 		gc->service_wq = NULL;
 	}
+
+	debugfs_remove_recursive(gc->mana_pci_debugfs);
+	gc->mana_pci_debugfs = NULL;
+
 	dev_dbg(&pdev->dev, "mana gdma cleanup successful\n");
 }
 
@@ -2098,9 +2122,6 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	gc->dev = &pdev->dev;
 	xa_init(&gc->irq_contexts);
 
-	gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
-						  mana_debugfs_root);
-
 	err = mana_gd_setup(pdev);
 	if (err)
 		goto unmap_bar;
@@ -2129,16 +2150,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 cleanup_mana:
 	mana_remove(&gc->mana, false);
 cleanup_gd:
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 unmap_bar:
-	/*
-	 * at this point we know that the other debugfs child dir/files
-	 * are either not yet created or are already cleaned up.
-	 * The pci debugfs folder clean-up now, will only be cleaning up
-	 * adapter-MTU file and apc->mana_pci_debugfs folder.
-	 */
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-	gc->mana_pci_debugfs = NULL;
 	xa_destroy(&gc->irq_contexts);
 	pci_iounmap(pdev, bar0_va);
 free_gc:
@@ -2188,11 +2201,7 @@ static void mana_gd_remove(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, false);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	xa_destroy(&gc->irq_contexts);
 
@@ -2214,15 +2223,11 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
+	mana_gd_cleanup_device(pdev);
 
 	return 0;
 }
 
-/* In case the NIC hardware stops working, the suspend and resume callbacks will
- * fail -- if this happens, it's safer to just report an error than try to undo
- * what has been done.
- */
 int mana_gd_resume(struct pci_dev *pdev)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -2234,13 +2239,17 @@ int mana_gd_resume(struct pci_dev *pdev)
 
 	err = mana_probe(&gc->mana, true);
 	if (err)
-		return err;
+		goto cleanup_gd;
 
 	err = mana_rdma_probe(&gc->mana_ib);
 	if (err)
-		return err;
+		mana_rdma_remove(&gc->mana_ib);
 
-	return 0;
+	return err;
+
+cleanup_gd:
+	mana_gd_cleanup_device(pdev);
+	return err;
 }
 
 /* Quiesce the device for kexec. This is also called upon reboot/shutdown. */
@@ -2253,11 +2262,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
 	mana_rdma_remove(&gc->mana_ib);
 	mana_remove(&gc->mana, true);
 
-	mana_gd_cleanup(pdev);
-
-	debugfs_remove_recursive(gc->mana_pci_debugfs);
-
-	gc->mana_pci_debugfs = NULL;
+	mana_gd_cleanup_device(pdev);
 
 	pci_disable_device(pdev);
 }
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b2faa7cf398f..82f1461a48e9 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1282,6 +1282,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 	apc->port_handle = resp.vport;
 	ether_addr_copy(apc->mac_addr, resp.mac_addr);
 
+	apc->vport_max_sq = *max_sq;
+	apc->vport_max_rq = *max_rq;
+
 	return 0;
 }
 
@@ -1436,6 +1439,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 
 	netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
 		    apc->port_handle, apc->indir_table_sz);
+
+	apc->steer_rx = rx;
+	apc->steer_rss = apc->rss_state;
+	apc->steer_update_tab = update_tab;
+	apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
 out:
 	kfree(req);
 	return err;
@@ -3178,6 +3186,23 @@ static int mana_init_port(struct net_device *ndev)
 	eth_hw_addr_set(ndev, apc->mac_addr);
 	sprintf(vport, "vport%d", port_idx);
 	apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+	debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+			   &apc->port_handle);
+	debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_sq);
+	debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+			   &apc->vport_max_rq);
+	debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+			   &apc->indir_table_sz);
+	debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rx);
+	debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+			   &apc->steer_rss);
+	debugfs_create_bool("steer_update_tab", 0400, apc->mana_port_debugfs,
+			    &apc->steer_update_tab);
+	debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+			   &apc->steer_cqe_coalescing);
 	debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
 			   &apc->speed);
 	return 0;
@@ -3695,6 +3720,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 	if (ac->num_ports > MAX_PORTS_IN_MANA_DEV)
 		ac->num_ports = MAX_PORTS_IN_MANA_DEV;
 
+	debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+			   &ac->num_ports);
+	debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+			  &ac->bm_hostmode);
+
 	ac->per_port_queue_reset_wq =
 		create_singlethread_workqueue("mana_per_port_queue_reset_wq");
 	if (!ac->per_port_queue_reset_wq) {
@@ -3817,6 +3847,11 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 
 	mana_gd_deregister_device(gd);
 
+	if (gc->mana_pci_debugfs) {
+		debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
+		debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
+	}
+
 	if (suspending)
 		return;
 
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 6d836060976a..70d62bc32837 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -442,6 +442,7 @@ struct gdma_context {
 	struct gdma_dev		mana_ib;
 
 	u64 pf_cap_flags1;
+	u64 gdma_protocol_ver;
 
 	struct workqueue_struct *service_wq;
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index aa90a858c8e3..d9c27310fd04 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -568,6 +568,14 @@ struct mana_port_context {
 
 	/* Debugfs */
 	struct dentry *mana_port_debugfs;
+
+	/* Cached vport/steering config for debugfs */
+	u32 vport_max_sq;
+	u32 vport_max_rq;
+	u32 steer_rx;
+	u32 steer_rss;
+	bool steer_update_tab;
+	u32 steer_cqe_coalescing;
 };
 
 netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH V3 09/11] x86/hyperv: Implement Hyper-V virtual IOMMU
From: Jacob Pan @ 2026-05-19  5:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mukesh R, hpa, robin.murphy, robh, wei.liu, mhklinux, muislam,
	namjain, magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd, jacob.pan
In-Reply-To: <20260515182322.GI7702@ziepe.ca>

Hi Jason,

On Fri, 15 May 2026 15:23:22 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, May 11, 2026 at 07:02:57PM -0700, Mukesh R wrote:
> > +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct
> > device *dev) +{
> > +	struct hv_domain *hvdom;
> > +	int rc;
> > +
> > +	if (hv_l1vh_partition() && !hv_curr_thread_is_vmm()) {
> > +		pr_err("Hyper-V: l1vh iommu does not support host
> > devices\n");
> > +		return NULL;
> > +	}
> > +
> > +	hvdom = kzalloc(sizeof(struct hv_domain), GFP_KERNEL);
> > +	if (hvdom == NULL)
> > +		return NULL;
> > +
> > +	spin_lock_init(&hvdom->mappings_lock);
> > +	hvdom->mappings_tree = RB_ROOT_CACHED;
> > +
> > +	/* Called under iommu group mutex, so single threaded */
> > +	if (++unique_id == HV_DEVICE_DOMAIN_ID_S2_NULL)   /* ie,
> > UINTMAX */
> > +		goto out_err;
> > +
> > +	hvdom->domid_num = unique_id;
> > +	hvdom->partid = hv_get_current_partid();
> > +	hvdom->iommu_dom.geometry = default_geometry;
> > +	hvdom->iommu_dom.pgsize_bitmap = HV_IOMMU_PGSIZES;
> > +
> > +	/* For guests, by default we do direct attaches, so no
> > domain in hyp */
> > +	if (hv_dom_owner_is_vmm(hvdom) && !hv_no_attdev)
> > +		hvdom->attached_dom = true;  
> 
> What are you thinking sending something like this?!?!?
> 
> The function is called *alloc domain PAGING*, it does not, and can not
> allocate weird "special" domains that are not PAGING domains. I just
> spent a long time removing all this kind of crazyness from drivers.
> 
> There is alot of other things I don't like in this patch, but this is
> too much.
> 
> You have to drop this "direct attach" idea from the first iteration,
> Linux can't do it without alot more work, you should start with the
> basic paging domain mode.
> 
Just wondering what work is needed to support this "direct attach"? I
felt this issue is due to trying to cram two distinct domain types
(paging domain & direct attach) into the VFIO container model where
only unmanaged paging domain is supported.

I am thinking if we were to switch to iommufd and let user(vmm) have
direct control of HWPT, vmm will be able to selectively use a
different domain type to handle direct attach. IMHO, it is essentially
the same as attaching nest parent domain without nested domain
immediately attached. The unprivileged guest may attach nested domain
directly with Hyper-V if nested translation is needed.

I understand nest parent is still a paging domain today and it is
expected to work with nested domain. So maybe we can make iommufd
accept nest parent w/o paging? i.e. no map/unmap and do not call
iopt_table_add_domain()? or maybe a new object?

From this driver POV, it will allocate a 2nd stage only domain with
different domain ops (w/o map/unmap) for "direct attach" thus avoid this
hack.

Thanks,

Jacob

^ permalink raw reply

* RE: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Michael Kelley @ 2026-05-19  3:12 UTC (permalink / raw)
  To: Sean Christopherson, sashiko-reviews@lists.linux.dev
  Cc: linux-hyperv@vger.kernel.org
In-Reply-To: <aguQIJQYsarMScnl@google.com>

From: Sean Christopherson <seanjc@google.com> Sent: Monday, May 18, 2026 3:18 PM
> 
> +Michael (sashiko was quite stingy with its Cc list)
> 
> On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> > Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> > - [High] Hyper-V VMs with an invariant TSC but no frequency MSRs lose the X86_FEATURE_TSC_RELIABLE capability.
> > --
> >
> > commit 0c0fbc4f0451c30c0e87e0e3591ea4f487eff75c
> > Author: Sean Christopherson <seanjc@google.com>
> >
> > x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
> >
> > This commit adds a tsc_properties parameter to the paravirtual TSC
> > calibration registration routine to consolidate feature flag tracking.
> > It updates various hypervisor initialization paths to pass combined flags
> > instead of manually setting the CPU capabilities.
> >
> > > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > > --- a/arch/x86/kernel/cpu/mshyperv.c
> > > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > > @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
> > >
> > >  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> > >  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> > > -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> > > -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > > +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> > > +
> > > +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> > > +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > > +
> > > +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> > > +						  tsc_properties);
> > >  	}
> >
> > [ ... ]
> >
> > > @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
> > >  		 * is called.
> > >  		 */
> > >  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> > > -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> > >  	}
> >
> > If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
> > does it bypass the tsc_register_calibration_routines() block entirely?
> 
> Yes.
> 
> > Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
> > here, it looks like these VMs will lose the reliable flag.
> >
> > Will this inadvertently enable the TSC watchdog, potentially causing a
> > performance regression if the system falsely marks the TSC as unstable due
> > to virtualization scheduling delays?
> 
> Hmm, I was going to say that the change was intentional and desriable, but looking
> at this yet again, I don't think that's true.  Enabling HV_EXPOSE_INVARIANT_TSC
> just means the kernel will (probably) set X86_FEATURE_CONSTANT_TSC and
> X86_FEATURE_NONSTOP_TSC during early_init_intel(), AFAICT it doesn't lead to
> X86_FEATURE_TSC_RELIABLE being set.  And I think in this case, marking the TSC
> as reliable makes sense; even if the kernel doesn't user Hyper-V's calibration
> info, the host is still clearly telling the guest that the TSC is reliable.

Yes, I agree. But I'm doubtful that such a combination ever occurs in practice.
I've never seen an occurrence of a Hyper-V (even really old versions) guest
where the frequency MSRs are not available or are not accessible. The
Hyper-V spec allows for either condition, so we have the code to test the
flags. I've thought of the flags as always set, though I suppose one never
knows what the future holds.

> 
> Michael, does keeping the
> 
> 		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> 
> but also passing TSC_FREQ_KNOWN_AND_RELIABLE to the calibration routine make
> sense?

I don't see that it would break anything. But it seems a bit disjointed in
that HV_ACCESS_TSC_INVARIANT is tested in two places in
ms_hyperv_init_platform(). Does TSC_RELIABLE *need* to be passed to
tsc_register_calibration_routines() if later code in ms_hyperv_init_platform()
does setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE)? In other words,
I'm suggesting let tsc_register_calibration_routines() handle the 
TSC_FREQ_KNOWN case since that's what the calibration routines are
all about. Leave the setting of X86_FEATURE_TSC_RELIABLE to the
later code that tests HV_ACCESS_TSC_INVARIANT, instead of
duplicating the setup_force_cpu_cap() operation.

While combining FREQUENCY_KNOWN and RELIABLE into
tsc_register_calibration_routines() is convenient, the two
concepts turn out to be independent when looking strictly at
the Hyper-V spec and code written to follow that spec.
Combining them into the same function ends up being clumsy
in this case.

Michael

^ permalink raw reply

* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: David Woodhouse @ 2026-05-18 23:38 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 3647 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
> like this through the tip tree?
> 
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point.  I am reaonsably
> confident in the correctness of the KVM changes.  Michael tested Hyper-V in
> v2, and while there were conflicts when rebasing, they were largely
> superficial (and I've just jinxed myself).  For all other hypervisors, assume
> the code is compile-tested only, but those changes are all quite small and
> straightforward.
> 
> The only changes that are questionable/contentious are the last two patches,
> which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
> AMD (that's the dubious part).  I very deliberately put them last, so that
> they can be dropped at will (I don't care terribly if those patches land).
> To merge them, I would want explicit Acks from Paolo and David W.
> 
> So, except for the last two patches, to get the stuff I really care about
> landed, I think/hope it's just the TSC and guest-side CoCo changes that need
> reviews/acks?
> 
> The primary goal of this series is (or at least was, when I started) to
> fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> hypervisor is used instead of the secure/trusted TSC that is controlled by
> trusted firmware.
> 
> The secondary goal is to draft off of the SNP and TDX changes to slightly
> modernize running under KVM.  Currently, KVM guests will use TSC for
> clocksource, but not sched_clock.  And they ignore Intel's CPUID-based TSC
> and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> guests can use that for the APIC timer period instead of manually calibrating
> the frequency.
> 
> The tertiary goal is to clean up all of the PV clock code to deduplicate logic
> across hypervisors, and to hopefully make it all easier to maintain going
> forward.

I booted this in qemu with -cpu host,+invtsc,+vmware-cpuid-freq

I was expecting to see it eschew the kvmclock and use *only* the TSC.
Is there even any need for 'tsc-early' given that it's *told* the TSC
frequency in CPUID? Shouldn't it have detected that the TSC is known
before init_tsc_clocksource() runs?

And then it even spent some time at boot actually using the kvmclock as
clocksource... when ideally I don't think it would even have *enabled*
it at all?

[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] tsc: Detected 2400.000 MHz processor
[    0.008205] TSC deadline timer available
[    0.008270] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.159085] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.164074] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
[    0.229087] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.337095] clocksource: Switched to clocksource kvm-clock
[    0.345246] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.356201] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
[    0.360560] clocksource: Switched to clocksource tsc


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-18 23:04 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515194705.09AFDC2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] The logic to defer AP kvmclock registration to the CPUHP_AP_ONLINE_DYN state when it is not used as sched_clock is architecturally unsafe and will break timekeeping.
> --
> 
> commit 88625772445f01c4397e3dfd4ee43222efd758cf
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
> 
> This commit prepares for making early_percpu_clock_init() a dedicated
> sched_clock hook. It adds logic to enable kvmclock on APs during CPU online
> if kvmclock is not used as sched_clock, and includes WARNs in the suspend
> and resume paths.
> 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index e7250d21c6721..d3bb281c0805d 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> [ ... ]
> > @@ -141,25 +142,51 @@ static void kvm_restore_sched_clock_state(void)
> [ ... ]
> >  void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> >  {
> > -	/*
> > -	 * Don't disable kvmclock on the BSP during suspend.  If kvmclock is
> > -	 * being used for sched_clock, then it needs to be kept alive until the
> > -	 * last minute, and restored as quickly as possible after resume.
> > -	 */
> > -	if (action != KVM_GUEST_BSP_SUSPEND)
> > +	switch (action) {
> [ ... ]
> > +	case KVM_GUEST_AP_ONLINE:
> > +		/*
> > +		 * Secondary CPUs use dedicated sched_clock hooks to enable
> > +		 * kvmclock early during bringup, there's nothing to be done
> > +		 * when during CPU online.
> > +		 */
> > +		if (kvmclock_is_sched_clock)
> > +			break;
> > +		kvm_register_clock("secondary cpu, online");
> > +		break;
> 
> Will deferring AP kvmclock registration to the CPU online phase break
> timekeeping if kvmclock is still being used as the system clocksource?
> 
> If the user boots with clocksource=kvm-clock or if kvmclock is selected as a
> fallback, it will be used as the clocksource despite not being sched_clock.
> The kvmclock clocksource read function accesses per-CPU memory that the
> hypervisor only populates after the AP registers its clock.
> 
> If early registration in start_secondary() is skipped, as anticipated here,
> an AP will boot, enable local interrupts, and process timer ticks before
> reaching the late CPUHP_AP_ONLINE_DYN hotplug state.
> 
> Could a call to ktime_get() during this window read uninitialized per-CPU
> memory and return 0? It seems this might cause the timekeeper to compute
> massive forward time jumps, leading to immediate timer expirations and
> watchdogs firing.

Hrm, yes, that seems plausible.  Even if it's not actually possible in practice,
replacing x86_cpuinit.early_percpu_clock_init() with a dedicated
paravirt_sched_clock_start_secondary() isn't obviously better, and so definitely
isn't worth risking breakage, at least not in this series.

I'll scrap most of this, but still plumb in AP_ONLINE and BSP_RESUME for
documentation purposes.

> > +	case KVM_GUEST_AP_OFFLINE:
> > +	case KVM_GUEST_SHUTDOWN:
> >  		kvmclock_disable();
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		break;
> > +	}
> >  }
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260515191942.1892718-1-seanjc@google.com?part=27

^ permalink raw reply

* Re: [PATCH v3 31/41] x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
From: Sean Christopherson @ 2026-05-18 22:18 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv, Michael Kelley
In-Reply-To: <20260515194535.56B84C2BCB0@smtp.kernel.org>

+Michael (sashiko was quite stingy with its Cc list)

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Hyper-V VMs with an invariant TSC but no frequency MSRs lose the X86_FEATURE_TSC_RELIABLE capability.
> --
> 
> commit 0c0fbc4f0451c30c0e87e0e3591ea4f487eff75c
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/tsc: Pass KNOWN_FREQ and RELIABLE as params to registration
> 
> This commit adds a tsc_properties parameter to the paravirtual TSC
> calibration registration routine to consolidate feature flag tracking.
> It updates various hypervisor initialization paths to pass combined flags
> instead of manually setting the CPU capabilities.
> 
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -516,8 +516,13 @@ static void __init ms_hyperv_init_platform(void)
> >  
> >  	if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
> >  	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
> > -		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz);
> > -		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > +		enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
> > +
> > +		if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> > +			tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > +
> > +		tsc_register_calibration_routines(hv_get_tsc_khz, hv_get_tsc_khz,
> > +						  tsc_properties);
> >  	}
> 
> [ ... ]
> 
> > @@ -629,7 +634,6 @@ static void __init ms_hyperv_init_platform(void)
> >  		 * is called.
> >  		 */
> >  		wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
> > -		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
> >  	}
> 
> If a Hyper-V VM exposes an invariant TSC but lacks the frequency MSRs,
> does it bypass the tsc_register_calibration_routines() block entirely?

Yes.

> Without the standalone setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE) call
> here, it looks like these VMs will lose the reliable flag.
> 
> Will this inadvertently enable the TSC watchdog, potentially causing a
> performance regression if the system falsely marks the TSC as unstable due
> to virtualization scheduling delays?

Hmm, I was going to say that the change was intentional and desriable, but looking
at this yet again, I don't think that's true.  Enabling HV_EXPOSE_INVARIANT_TSC
just means the kernel will (probably) set X86_FEATURE_CONSTANT_TSC and
X86_FEATURE_NONSTOP_TSC during early_init_intel(), AFAICT it doesn't lead to
X86_FEATURE_TSC_RELIABLE being set.  And I think in this case, marking the TSC
as reliable makes sense; even if the kernel doesn't user Hyper-V's calibration
info, the host is still clearly telling the guest that the TSC is reliable.

Michael, does keeping the

		setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);

but also passing TSC_FREQ_KNOWN_AND_RELIABLE to the calibration routine make
sense?

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Woodhouse, David @ 2026-05-18 21:59 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-3-seanjc@google.com>


[-- Attachment #1.1: Type: text/plain, Size: 999 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> 
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
>  	static_call_update(pv_steal_clock, xen_steal_clock);
>  	paravirt_set_sched_clock(xen_sched_clock);
>  
> -	x86_platform.calibrate_tsc = xen_tsc_khz;
> +	tsc_register_calibration_routines(xen_tsc_khz, NULL);
>  	x86_platform.get_wallclock = xen_get_wallclock;
>  }
>  

xen_tsc_khz() doesn't use CPUID but really *should*.

Care to pull in
https://lore.kernel.org/all/20260509224824.3264567-31-dwmw2@infradead.org/
to your next round please?

(Without the misplaced changes in kvm/x86.c that should have been in
two different prior commits, and are now folded into those correctly in
my kvmclock5 branch ready for the next posting of that).

I'll drop that patch, and the similar x86/kvm one which you *have*
already taken in this series, from my next posting.

Thanks.


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5964 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 215 bytes --]




Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.



[-- Attachment #2.2: Type: text/html, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH net 1/2] net: mana: Add NULL guards in teardown path to prevent panic on attach failure
From: Harshitha Ramamurthy @ 2026-05-18 21:49 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260518194654.735580-2-dipayanroy@linux.microsoft.com>

On Mon, May 18, 2026 at 12:52 PM Dipayaan Roy
<dipayanroy@linux.microsoft.com> wrote:
>
> When queue allocation fails partway through, the error cleanup frees
> and NULLs apc->tx_qp and apc->rxqs. Multiple teardown paths such as
> mana_remove(), mana_change_mtu() recovery, and internal error handling
> in mana_alloc_queues() can subsequently call into functions that
> dereference these pointers without NULL checks:
>
> - mana_chn_setxdp() dereferences apc->rxqs[0], causing a NULL pointer
>   dereference panic (CR2: 0000000000000000 at mana_chn_setxdp+0x26).
> - mana_destroy_vport() iterates apc->rxqs without a NULL check.
> - mana_fence_rqs() iterates apc->rxqs without a NULL check.
> - mana_dealloc_queues() iterates apc->tx_qp without a NULL check.
>
> Add NULL guards for apc->rxqs in mana_fence_rqs(),
> mana_destroy_vport(), and before the mana_chn_setxdp() call. Add a
> NULL guard for apc->tx_qp in mana_dealloc_queues() to skip TX queue
> draining when TX queues were never allocated or already freed.
>
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 70 +++++++++++--------
>  1 file changed, 41 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index 9afc786b297a..0582803907a8 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1727,6 +1727,9 @@ static void mana_fence_rqs(struct mana_port_context *apc)
>         struct mana_rxq *rxq;
>         int err;
>
> +       if (!apc->rxqs)
> +               return;
> +
>         for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
>                 rxq = apc->rxqs[rxq_idx];
>                 err = mana_fence_rq(apc, rxq);
> @@ -2858,13 +2861,16 @@ static void mana_destroy_vport(struct mana_port_context *apc)
>         struct mana_rxq *rxq;
>         u32 rxq_idx;
>
> -       for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> -               rxq = apc->rxqs[rxq_idx];
> -               if (!rxq)
> -                       continue;
> +       if (apc->rxqs) {
>
> -               mana_destroy_rxq(apc, rxq, true);
> -               apc->rxqs[rxq_idx] = NULL;
> +               for (rxq_idx = 0; rxq_idx < apc->num_queues; rxq_idx++) {
> +                       rxq = apc->rxqs[rxq_idx];
> +                       if (!rxq)
> +                               continue;
> +
> +                       mana_destroy_rxq(apc, rxq, true);
> +                       apc->rxqs[rxq_idx] = NULL;
> +               }
>         }
>
>         mana_destroy_txq(apc);
> @@ -3269,7 +3275,8 @@ static int mana_dealloc_queues(struct net_device *ndev)
>         if (apc->port_is_up)
>                 return -EINVAL;
>
> -       mana_chn_setxdp(apc, NULL);
> +       if (apc->rxqs)
> +               mana_chn_setxdp(apc, NULL);

Is this check required? mana_dealloc_queues() is only called by
mana_detach(). And mana_detach() calls `mana_dealloc_queues()` only if
the port state was previously up. apc->port_is_up seems to be set to
true only on successful queue allocation. Can apc->rxqs be NULL here?

>
>         if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
>                 mana_pf_deregister_filter(apc);
> @@ -3287,33 +3294,38 @@ static int mana_dealloc_queues(struct net_device *ndev)
>          * number of queues.
>          */
>
> -       for (i = 0; i < apc->num_queues; i++) {
> -               txq = &apc->tx_qp[i].txq;
> -               tsleep = 1000;
> -               while (atomic_read(&txq->pending_sends) > 0 &&
> -                      time_before(jiffies, timeout)) {
> -                       usleep_range(tsleep, tsleep + 1000);
> -                       tsleep <<= 1;
> -               }
> -               if (atomic_read(&txq->pending_sends)) {
> -                       err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> -                       if (err) {
> -                               netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> -                                          err, atomic_read(&txq->pending_sends),
> -                                          txq->gdma_txq_id);
> +       if (apc->tx_qp) {

And same as above, is it possible for apc->tx_qp to be NULL here?

> +               for (i = 0; i < apc->num_queues; i++) {
> +                       txq = &apc->tx_qp[i].txq;
> +                       tsleep = 1000;
> +                       while (atomic_read(&txq->pending_sends) > 0 &&
> +                              time_before(jiffies, timeout)) {
> +                               usleep_range(tsleep, tsleep + 1000);
> +                               tsleep <<= 1;
> +                       }
> +                       if (atomic_read(&txq->pending_sends)) {
> +                               err =
> +                                   pcie_flr(to_pci_dev(gd->gdma_context->dev));
> +                               if (err) {
> +                                       netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> +                                                  err,
> +                                           atomic_read(&txq->pending_sends),
> +                                           txq->gdma_txq_id);
> +                               }
> +                               break;
>                         }
> -                       break;
>                 }
> -       }
>
> -       for (i = 0; i < apc->num_queues; i++) {
> -               txq = &apc->tx_qp[i].txq;
> -               while ((skb = skb_dequeue(&txq->pending_skbs))) {
> -                       mana_unmap_skb(skb, apc);
> -                       dev_kfree_skb_any(skb);
> +               for (i = 0; i < apc->num_queues; i++) {
> +                       txq = &apc->tx_qp[i].txq;
> +                       while ((skb = skb_dequeue(&txq->pending_skbs))) {
> +                               mana_unmap_skb(skb, apc);
> +                               dev_kfree_skb_any(skb);
> +                       }
> +                       atomic_set(&txq->pending_sends, 0);
>                 }
> -               atomic_set(&txq->pending_sends, 0);
>         }
> +
>         /* We're 100% sure the queues can no longer be woken up, because
>          * we're sure now mana_poll_tx_cq() can't be running.
>          */
> --
> 2.43.0
>
>

^ permalink raw reply

* Re: [PATCH v8 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-05-18 21:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260512022133.856196-1-kuba@kernel.org>

On Mon, May 11, 2026 at 07:21:33PM -0700, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net: mana: force full-page RX buffers via ethtool private flag
> 
> This patch adds an ethtool private flag "full-page-rx" that forces one
> RX buffer per page, bypassing the page_pool fragment path, to address
> a 15-20% throughput regression seen on some 4K PAGE_SIZE ARM64
> platforms under high connection counts.  Toggling the flag triggers
> mana_detach() + mana_attach() to apply the new RX buffer layout.
> 
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > index a28ca461c135..dfd222de11d0 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> 
> [ ... ]
> 
> > @@ -590,6 +609,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
> >  	return 0;
> >  }
> >  
> 
> [ ... ]
> 
> > +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> > +{
> > +	struct mana_port_context *apc = netdev_priv(ndev);
> > +	u32 changed = apc->priv_flags ^ priv_flags;
> > +	u32 old_priv_flags = apc->priv_flags;
> > +	bool schedule_port_reset = false;
> > +	int err = 0;
> > +
> > +	if (!changed)
> > +		return 0;
> > +
> > +	/* Reject unknown bits */
> > +	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> > +		return -EINVAL;
> > +
> > +	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> > +		apc->priv_flags = priv_flags;
> > +
> > +		if (!apc->port_is_up) {
> > +			/* Port is down, flag updated to apply on next up
> > +			 * so just return.
> > +			 */
> > +			return 0;
> > +		}
> > +
> > +		/* Pre-allocate buffers to prevent failure in mana_attach
> > +		 * later
> > +		 */
> > +		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
> > +		if (err) {
> > +			netdev_err(ndev,
> > +				   "Insufficient memory for new allocations\n");
> > +			apc->priv_flags = old_priv_flags;
> > +			return err;
> > +		}
> > +
> > +		err = mana_detach(ndev, false);
> > +		if (err) {
> > +			netdev_err(ndev, "mana_detach failed: %d\n", err);
> > +			apc->priv_flags = old_priv_flags;
> > +
> > +			/* Port is in an inconsistent state. Restore
> > +			 * 'port_is_up' so that queue reset work handler
> > +			 * can properly detach and re-attach.
> > +			 */
> > +			apc->port_is_up = true;
> > +			schedule_port_reset = true;
> > +			goto out;
> > +		}
> > +
> > +		err = mana_attach(ndev);
> > +		if (err) {
> > +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> > +			apc->priv_flags = old_priv_flags;
> > +
> > +			/* Restore 'port_is_up' so the reset work handler
> > +			 * can properly detach/attach. Without this,
> > +			 * the handler sees port_is_up=false and skips
> > +			 * queue allocation, leaving the port dead.
> > +			 */
> > +			apc->port_is_up = true;
> > +			schedule_port_reset = true;
> > +		}
> > +	}
> > +
> > +out:
> > +	mana_pre_dealloc_rxbufs(apc);
> > +
> > +	if (schedule_port_reset)
> > +		queue_work(apc->ac->per_port_queue_reset_wq,
> > +			   &apc->queue_reset_work);
> > +
> > +	return err;
> > +}
> 
> [High]
> Can this failure path oops inside the reset work handler?
> 
> When mana_attach() fails after mana_detach() has already torn the vport
> down, apc->tx_qp has been set to NULL by mana_destroy_txq() and apc->rxqs
> has been cleared by mana_cleanup_port_context().  This code then forces
> apc->port_is_up back to true and queues apc->queue_reset_work:
> 
>     apc->port_is_up = true;
>     schedule_port_reset = true;
> 
> When mana_per_port_queue_reset_work_handler runs it calls
> mana_detach(ndev, false), which latches port_st_save = port_is_up = true
> and, because port_st_save is true, invokes mana_dealloc_queues().  The TX
> drain loop there does:
> 
>     txq = &apc->tx_qp[i].txq;
>     atomic_read(&txq->pending_sends);
> 
> with no NULL check on apc->tx_qp, and mana_destroy_vport further down
> iterates apc->rxqs[rxq_idx] while rxqs is NULL.  Would that produce a
> NULL pointer dereference in the workqueue worker whenever mana_attach()
> fails here (for example on an allocation failure in
> mana_init_port_context or mana_create_txq, or a transient HW command
> failure in mana_query_vport_cfg / mana_cfg_vport)?
> 
> The mana_detach() failure path above (goto out) reaches
> mana_dealloc_queues() through the same chain and looks to have the same
> exposure.
Hi Jakub,

Thanks for your comments, I have sent a separate fixes series to the
net tree that fixes the possible NULL pointer derefernce issue and also
makes the reset handler safe for the case where it runs after a failed
attach:
https://lore.kernel.org/all/20260518194654.735580-1-dipayanroy@linux.microsoft.com/


> 
> For comparison, mana_change_mtu() handles a mana_attach() failure by
> returning the error without scheduling a reset.  Would a similar
> treatment here avoid the asynchronous oops, or is there a reason the
> reset must be scheduled in this specific failure case?
> -- 
> pw-bot: cr

The full-page-rx private flag is intended to be driven by a udev rule
that fires automatically during VM provisioning on affected platforms.
If there is a transient failure, the VM fails to provision, requiring manual
intervention.The reset handler retries the attach, giving the port a
chance to recover to default config autonomously without intervention.

Regards
Dipayaan Roy

^ permalink raw reply

* Re: [PATCH v3 29/41] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
From: Sean Christopherson @ 2026-05-18 21:14 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515194804.AE30FC2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] kvm_sched_clock_read is registered as the system sched_clock before its required offset (kvm_sched_clock_offset) is initialized.
> --
> 
> commit b6e429e4990c00d845488d46e8dc4132f3a2fd8b
> Author: Sean Christopherson <seanjc@google.com>
> x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
> 
> This commit adds a return code to __paravirt_set_sched_clock() to allow the
> kernel to reject attempts to use a PV sched_clock. It updates the caller
> in kvmclock to check the return value and abort initialization if needed.
> 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index d3bb281c0805d..9b3d1ed1a96dc 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> [ ... ]
> > @@ -338,10 +338,12 @@ static int kvmclock_setup_percpu(unsigned int cpu)
> >  
> >  static __init void kvm_sched_clock_init(bool stable)
> >  {
> > +	if (__paravirt_set_sched_clock(kvm_sched_clock_read, stable,
> > +				       kvm_save_sched_clock_state,
> > +				       kvm_restore_sched_clock_state))
> > +		return;
> > +
> >  	kvm_sched_clock_offset = kvm_clock_read();
> 
> Could this cause a massive backward jump in system time if an NMI or early
> exception calls sched_clock() right after registration?
> 
> __paravirt_set_sched_clock() uses static_call_update() to globally publish
> kvm_sched_clock_read() as the live system sched_clock. Since the
> initialization of kvm_sched_clock_offset is now delayed until after this
> registration completes, any early caller will execute kvm_sched_clock_read()
> with a zero offset, returning a huge absolute raw clock value.
> 
> When the code resumes, it sets the offset to the current raw clock value.
> Subsequent sched_clock() calls will abruptly subtract this large offset,
> causing time to jump backwards and violating monotonic clock guarantees.
> 
> Would it be safer to initialize kvm_sched_clock_offset before calling
> __paravirt_set_sched_clock(), since calculating the offset early has no
> side effects even if the registration is subsequently rejected?

Hmm, yes.  I'll also add a comment explaining exactly that.

^ permalink raw reply

* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-18 21:11 UTC (permalink / raw)
  To: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

On Fri, May 15, 2026, Sean Christopherson wrote:
> Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
> like this through the tip tree?  
> 
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point.  I am reaonsably
> confident in the correctness of the KVM changes.  Michael tested Hyper-V in
> v2, and while there were conflicts when rebasing, they were largely
> superficial (and I've just jinxed myself).  For all other hypervisors, assume
> the code is compile-tested only, but those changes are all quite small and
> straightforward.
> 
> The only changes that are questionable/contentious are the last two patches,
> which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
> AMD (that's the dubious part).  I very deliberately put them last, so that
> they can be dropped at will (I don't care terribly if those patches land).
> To merge them, I would want explicit Acks from Paolo and David W.
> 
> So, except for the last two patches, to get the stuff I really care about
> landed, I think/hope it's just the TSC and guest-side CoCo changes that need
> reviews/acks?

FYI, don't bother reviewing this version.  Sashiko found several glaring flaws,
but I just realized that sashiko-bot's emails are only being sent to myself and
linux-hyperv@vger.kernel.org.  I'll make sure to highlight the changes in the
next version.

In the meantime, Sashiko's feedback is archived on lore if you want to see me
get torched by AI :-)

^ permalink raw reply

* Re: [PATCH v3 36/41] x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
From: Sean Christopherson @ 2026-05-18 20:57 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515195557.DEF05C2BCB0@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] lapic_timer_period is incorrectly assigned the raw APIC frequency in kHz, which fails to account for the kernel's HZ configuration and breaks APIC timer calibration for any HZ != 1000.
> --
> 
> commit e5bbef05c7ba30fb00c716accae48693e295289a
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
> 
> This commit reads the local APIC bus frequency from the PV CPUID leaf
> and uses it to directly set lapic_timer_period, bypassing the standard
> calibration process.
> 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> [ ... ]
> > @@ -200,10 +200,19 @@ void kvmclock_cpu_action(enum kvm_guest_cpu_action action)
> >   */
> >  static unsigned long kvm_get_tsc_khz(void)
> >  {
> > +#ifdef CONFIG_X86_LOCAL_APIC
> > +	u32 apic_khz = kvm_para_apic_bus_khz();
> > +
> >  	/*
> > -	 * If KVM advertises the frequency directly in CPUID, use that
> > -	 * instead of reverse-calculating it from the KVM clock data.
> > +	 * Use the TSC frequency from KVM's (and other hypervisors') PV CPUID
> > +	 * leaf when available, instead of reverse-calculating it from the KVM
> > +	 * clock data.  As a bonus, the CPUID leaf also includes the local APIC
> > +	 * bus/timer frequency.
> >  	 */
> > +	if (apic_khz)
> > +		lapic_timer_period = apic_khz;
> 
> Does setting lapic_timer_period directly to apic_khz assume that HZ is
> always configured to 1000?
> 
> The lapic_timer_period variable represents the number of APIC timer ticks
> per jiffy. This is usually calculated as (frequency_in_hz / HZ), which would
> be (apic_khz * 1000) / HZ.

Ugh, TIL.  That's quite evil.  I'll fix in the next version.

^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: Sean Christopherson @ 2026-05-18 20:28 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260515200929.19F2CC2BCB3@smtp.kernel.org>

On Fri, May 15, 2026, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] Bypassing `kvm_sched_clock_init()` when `CONSTANT_TSC` is present
> ignores KVM's `PVCLOCK_TSC_STABLE_BIT`, leading to unfiltered backward time
> jumps in the core scheduler.  --
> 
> commit 57cf68835ef70733c037ca9ffa04fd39e1d70027
> Author: Sean Christopherson <seanjc@google.com>
> 
> x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
> 
> This commit updates kvmclock_init to prefer the TSC over kvmclock for
> sched_clock when the TSC is constant, non-stop, and not marked unstable.
> It does this by bypassing kvm_sched_clock_init and instead modifying the
> TSC properties so that TSC becomes the preferred clocksource.
> 
> Link: https://lore.kernel.org/all/Z4hDK27OV7wK572A@google.com
> 
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> [ ... ]
> > @@ -416,22 +416,22 @@ void __init kvmclock_init(void)
> [ ... ]
> >  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> >  	    !check_tsc_unstable()) {
> >  		kvm_clock.rating = 299;
> >  		tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > +	} else {
> > +		kvm_sched_clock_init(stable);
> >  	}
> >  
> > -	kvm_sched_clock_init(stable);
> 
> Does bypassing kvm_sched_clock_init() when CONSTANT_TSC is present
> accidentally ignore KVM's PVCLOCK_TSC_STABLE_BIT?
> 
> CONSTANT_TSC indicates the hardware TSC rate doesn't change with power
> states, but it doesn't guarantee cross-CPU synchronization, such as across
> multi-socket systems or differing vCPU offsets from hotplug and migration.
> KVM explicitly signals this cross-CPU stability via PVCLOCK_TSC_STABLE_BIT,
> which is mapped to the stable variable.
> 
> Prior to this patch, kvm_sched_clock_init() was always called, invoking
> __paravirt_set_sched_clock() with the stable flag. This allowed the system
> to call clear_sched_clock_stable() if the hypervisor could not guarantee
> synchronization, enabling crucial per-CPU monotonicity filters in
> sched_clock_cpu().
> 
> By skipping kvm_sched_clock_init() entirely and registering
> TSC_FREQ_KNOWN_AND_RELIABLE, the kernel's native cross-CPU TSC synchronization
> checks and the clocksource watchdog appear to be bypassed.

This is very much intented behavior.

> If KVM clears PVCLOCK_TSC_STABLE_BIT, could this result in the guest
> ignoring the condition and using raw, unfiltered rdtsc() values?

Yes, that's the entire point.
 
Note, PVCLOCK_TSC_STABLE_BIT is still set in pvclock_read_flags(&hv_clock_boot[0].pvti),
so anything else that directly consumes PVCLOCK_TSC_STABLE_BIT still does the
right thing.  E.g. kvm_setup_vsyscall_timeinfo() will still set kvm_clock's vDSO
mode to VDSO_CLOCKMODE_PVCLOCK, to communicate that kvm_lock can be used for the
vDSO page, *if* it's chose.

> When tasks migrate between vCPUs with differing TSC offsets, this regression
> might cause backward time jumps in sched_clock(), which breaks rq_clock
> monotonicity and corrupts CFS vruntime and min_vruntime calculations.

Only if the host messed up and incorrectly advertised CONSTANT+NONSTOP.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox