Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH v4 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <cover.1779396074.git.me@berkoc.com>

hyperv_receive_sub() reads msg->vid_hdr.type and dispatches into one
of four message-type branches without knowing how many bytes the host
wrote into hv->recv_buf. The completion path then runs
memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE), so the consumer that
wakes on wait_for_completion_timeout() can read up to 16 KiB of
residue from a prior message as if it were the response payload.

Pass bytes_recvd into hyperv_receive_sub() and reject any packet that
does not cover the pipe + synthvid header. For the three
completion-driving types (SYNTHVID_VERSION_RESPONSE,
SYNTHVID_RESOLUTION_RESPONSE, SYNTHVID_VRAM_LOCATION_ACK) require the
type-specific payload before memcpy/complete, and apply the same rule
to SYNTHVID_FEATURE_CHANGE before reading is_dirt_needed.

SYNTHVID_RESOLUTION_RESPONSE is variable length: the host fills
resolution_count entries, not the full SYNTHVID_MAX_RESOLUTION_COUNT
array. Validate the fixed prefix first so resolution_count can be
read, bound it against the array, then require only the count-sized
array, so the shorter responses the host actually sends are accepted.

Only run the sub-handler when vmbus_recvpacket() returned success. The
memcpy length is bytes_recvd, which is bounded by VMBUS_MAX_PACKET_SIZE
only on a successful receive; on -ENOBUFS vmbus_recvpacket() instead
reports the required length, which can exceed hv->recv_buf, so copying
bytes_recvd would read and write past the 16 KiB buffers. Gating on the
success return keeps the copy bounded.

Rejected packets are reported via drm_err_ratelimited() rather than
silently dropped, matching the CoCo-hardened pattern in
hv_kvp_onchannelcallback().

Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Signed-off-by: Berkant Koc <me@berkoc.com>
Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 63 +++++++++++++++++++++--
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index c3d0ff229..48054b607 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -420,26 +420,81 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 	return 0;
 }
 
-static void hyperv_receive_sub(struct hv_device *hdev)
+static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
 {
 	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
 	struct synthvid_msg *msg;
+	size_t hdr_size;
 
 	if (!hv)
 		return;
 
+	hdr_size = sizeof(struct pipe_msg_hdr) +
+		   sizeof(struct synthvid_msg_hdr);
+	if (bytes_recvd < hdr_size) {
+		drm_err_ratelimited(&hv->dev,
+				    "synthvid packet too small for header: %u\n",
+				    bytes_recvd);
+		return;
+	}
+
 	msg = (struct synthvid_msg *)hv->recv_buf;
 
 	/* Complete the wait event */
 	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
 	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
-		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
+		size_t need = hdr_size;
+
+		switch (msg->vid_hdr.type) {
+		case SYNTHVID_VERSION_RESPONSE:
+			need += sizeof(struct synthvid_version_resp);
+			break;
+		case SYNTHVID_RESOLUTION_RESPONSE:
+			/*
+			 * The resolution response is variable length: the host
+			 * fills resolution_count entries, not the full
+			 * SYNTHVID_MAX_RESOLUTION_COUNT array. Require the fixed
+			 * prefix first so resolution_count can be read, then
+			 * demand exactly the count-sized array.
+			 */
+			need += offsetof(struct synthvid_supported_resolution_resp,
+					 supported_resolution);
+			if (bytes_recvd < need)
+				break;
+			if (msg->resolution_resp.resolution_count >
+			    SYNTHVID_MAX_RESOLUTION_COUNT) {
+				drm_err_ratelimited(&hv->dev,
+						    "synthvid resolution count too large: %u\n",
+						    msg->resolution_resp.resolution_count);
+				return;
+			}
+			need += msg->resolution_resp.resolution_count *
+				sizeof(struct hvd_screen_info);
+			break;
+		case SYNTHVID_VRAM_LOCATION_ACK:
+			need += sizeof(struct synthvid_vram_location_ack);
+			break;
+		}
+		if (bytes_recvd < need) {
+			drm_err_ratelimited(&hv->dev,
+					    "synthvid packet too small for type %u: %u < %zu\n",
+					    msg->vid_hdr.type, bytes_recvd, need);
+			return;
+		}
+		memcpy(hv->init_buf, msg, bytes_recvd);
 		complete(&hv->wait);
 		return;
 	}
 
 	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
+		if (bytes_recvd < hdr_size +
+		    sizeof(struct synthvid_feature_change)) {
+			drm_err_ratelimited(&hv->dev,
+					    "synthvid feature change packet too small: %u\n",
+					    bytes_recvd);
+			return;
+		}
 		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
 		if (hv->dirt_needed)
 			hyperv_hide_hw_ptr(hv->hdev);
@@ -464,9 +519,9 @@ static void hyperv_receive(void *ctx)
 		ret = vmbus_recvpacket(hdev->channel, recv_buf,
 				       VMBUS_MAX_PACKET_SIZE,
 				       &bytes_recvd, &req_id);
-		if (bytes_recvd > 0 &&
+		if (!ret && bytes_recvd > 0 &&
 		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
-			hyperv_receive_sub(hdev);
+			hyperv_receive_sub(hdev, bytes_recvd);
 	} while (bytes_recvd > 0 && ret == 0);
 }
 
-- 
2.47.3


^ permalink raw reply related

* [PATCH v4 1/2] drm/hyperv: validate resolution_count and fix WIN8 fallback
From: Berkant Koc @ 2026-05-19 20:08 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat
In-Reply-To: <cover.1779396074.git.me@berkoc.com>

A SYNTHVID_RESOLUTION_RESPONSE with resolution_count > 64 walks past
the supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT] array in the
parse loop. Bound resolution_count against the array size, folded
into the existing zero-check.

When the WIN10 resolution probe fails, the caller in
hyperv_connect_vsp() left hv->screen_*_max / preferred_* unpopulated,
which sets mode_config.max_width / max_height to 0 and makes
drm_internal_framebuffer_create() reject every userspace framebuffer
with -EINVAL. The pre-WIN10 branch had the same gap for
preferred_width / preferred_height. Use a single post-probe fallback
guarded by screen_width_max == 0 so both paths converge on the WIN8
defaults.

Signed-off-by: Berkant Koc <me@berkoc.com>
Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
Cc: stable@vger.kernel.org # 5.14+
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
---
 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
index 051ecc526..c3d0ff229 100644
--- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
+++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
@@ -391,8 +391,11 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
 		return -ETIMEDOUT;
 	}
 
-	if (msg->resolution_resp.resolution_count == 0) {
-		drm_err(dev, "No supported resolutions\n");
+	if (msg->resolution_resp.resolution_count == 0 ||
+	    msg->resolution_resp.resolution_count >
+	    SYNTHVID_MAX_RESOLUTION_COUNT) {
+		drm_err(dev, "Invalid resolution count: %d\n",
+			msg->resolution_resp.resolution_count);
 		return -ENODEV;
 	}
 
@@ -508,9 +511,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)
 		ret = hyperv_get_supported_resolution(hdev);
 		if (ret)
 			drm_err(dev, "Failed to get supported resolution from host, use default\n");
-	} else {
+	}
+
+	if (!hv->screen_width_max) {
 		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
 		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
+		hv->preferred_width = SYNTHVID_WIDTH_WIN8;
+		hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
 	}
 
 	hv->mmio_megabytes = hdev->channel->offermsg.offer.mmio_megabytes;
-- 
2.47.3


^ permalink raw reply related

* [PATCH v4 0/2] drm/hyperv: harden host message parsing
From: Berkant Koc @ 2026-05-21 20:41 UTC (permalink / raw)
  To: Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv, dri-devel, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Michael Kelley, Thomas Zimmermann,
	Maarten Lankhorst, Maxime Ripard, Deepak Rawat

Two independent issues in the synthetic video driver that both stem
from trusting unvalidated host data.

1/2 bounds resolution_count from SYNTHVID_RESOLUTION_RESPONSE against
the supported_resolution[] array, and populates WIN8 defaults for
hv->screen_*_max / hv->preferred_* in both the WIN10-probe-failure
path and the pre-WIN10 path, so a failed or pre-WIN10 probe yields a
usable display instead of having drm_internal_framebuffer_create()
reject every userspace framebuffer with -EINVAL.

2/2 forwards bytes_recvd from vmbus_recvpacket() into the sub-handler,
rejects packets that do not cover the synthvid header, and requires
the type-specific payload size before memcpy/complete or before
reading the feature-change byte. Rejected packets are logged via
drm_err_ratelimited() instead of being silently dropped, matching the
CoCo-hardened pattern in hv_kvp_onchannelcallback().

1/2 is unchanged from v3 and carries Michael Kelley's Reviewed-by.

Changes since v3 (per review by Michael Kelley):

  2/2: validate SYNTHVID_RESOLUTION_RESPONSE against its actual
       variable length. The response carries resolution_count entries,
       not the full SYNTHVID_MAX_RESOLUTION_COUNT array, so requiring
       sizeof(struct synthvid_supported_resolution_resp) rejected the
       shorter responses the host legitimately sends and broke
       resolution probing. Require the fixed prefix, read and bound
       resolution_count, then require only the count-sized array.

  2/2: only run hyperv_receive_sub() when vmbus_recvpacket() returned
       success. v3 dropped the bytes_recvd upper bound as redundant,
       which holds only on a successful receive: on -ENOBUFS
       vmbus_recvpacket() reports the required length in bytes_recvd,
       which can exceed the 16 KiB hv->recv_buf, and the subsequent
       memcpy(hv->init_buf, msg, bytes_recvd) would read and write
       past both buffers. Gating on the success return restores the
       invariant that made the bound redundant, so an oversized host
       packet is dropped rather than copied.

Changes since v2 (per review by Michael Kelley):

  1/2: dropped the reinit_completion() change; the stale completion can
       only outlive its request in hyperv_vmbus_resume() after a
       get_supported_resolution() timeout, which is a narrower fix that
       belongs in a separate patch against the resume path. Pre-WIN10
       branch now also populates hv->preferred_*. The else branch is
       gone; a single screen_width_max == 0 check covers both the
       pre-WIN10 case and a failed WIN10 probe.

  2/2: added a per-type switch for the three completion-driving message
       types so the wait-completion path validates payload size before
       memcpy/complete. Every reject path emits drm_err_ratelimited()
       rather than returning silently.

Changes since v1:

  1/2: bound resolution_count check folded into the existing zero check;
       populate WIN8 defaults when hyperv_get_supported_resolution()
       fails.
  2/2: forward bytes_recvd into hyperv_receive_sub(); enforce the pipe +
       synthvid header minimum; check synthvid_feature_change payload
       size before reading is_dirt_needed.

The shared init_buf reuse (a duplicate or late host response can
overwrite init_buf between successive request/response cycles) and the
related completion reinit are real but orthogonal to this size
validation. As discussed on v2, they are queued as a separate follow-up
against the resume/expected-type path once this series lands.

This series is verified by static analysis and code inspection against
the synthvid protocol structures and the vmbus_recvpacket() contract. I
do not currently have a Hyper-V test environment to exercise the receive
and resolution-probe paths at runtime, so confirmation from someone who
can run it in a Hyper-V VM would be welcome.

Both patches carry an Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
trailer per the kernel coding-assistants policy. Code, analysis and
review responses are mine; the model is used as a structured reviewer
under human verification.

Berkant Koc (2):
  drm/hyperv: validate resolution_count and fix WIN8 fallback
  drm/hyperv: validate VMBus packet size in receive callback

 drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 76 ++++++++++++++++++++---
 1 file changed, 69 insertions(+), 7 deletions(-)

base-commit: 4bf5d3da79c48e1df4bab82c9680c53adeff7820
-- 
2.47.3

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-21 20:53 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <44e0d60548d317fd59895f18bd17220dfb2f834b.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
> > frequency calibration routines.  This will allow consolidating handling
> > of common TSC properties that are forced by hypervisor (PV routines),
> > and will also allow adding sanity checks to guard against overriding a
> > TSC calibration routine with a routine that is less robust/trusted.
> > 
> > Make the CPU calibration routine optional, as Xen (very sanely) doesn't
> > assume the CPU runs as the same frequency as the TSC.
> > 
> > Wrap the helper in an #ifdef to document that the kernel overrides
> > the native routines when running as a VM, and to guard against unwanted
> > usage.  Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
> > depend on HYPERVISOR_GUEST because it gates both guest and host code.
> > 
> > No functional change intended.
> > 
> > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> > Tested-by: Michael Kelley <mhklinux@outlook.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Mildly concerned that we might want to support multiple options — does
> it have CPUID 0x15? Does it have 0x40000x10? Does it have a pvclock?
> There are various permutations of those which are perhaps best handled
> by *trying* each one, in some order, and populating a struct with the
> answers?
> 
> But on the basis that perfect is the enemy of good,

This has been bothering me too.

Aha!  AHA!  Idea.

... 4 hours later ...

Mhahahaahah, victory is mine!!!!

TL;DR: Overriding x86_platform_ops hooks is dumb.

To your point about making an informed decision, that's essentialy what this series
is already doing, just in a very roundabout way:

  1. x86_platform.calibrate_{cpu,tsc}() are initialized to "native" versions
  2. Hypervisor init code runs and conditionally overrides calibrate_{cpu,tsc}()
  3. CoCo init code runs and conditionally overrides calibrate_{cpu,tsc}()

So the ordering you want is already there, as is "trying" each source to some
extent, in the form of steps #2 and #3 overriding the hooks if and only if their
source of information is valid.  For all intents and purposes, the hardening I
was adding by formalizing the calibration overrides was to enforce the above ordering.

But that's obviously all but impossible to follow, _and_ it's pointless.

For every PV case, including TDX and SNP, "calibration" is simply information
retrieval, i.e. it never changes (barring broken hypervisors/firmware), and the
information is always available during early boot.

Contrast that with the pre-CPUID CPU frequency calibration, where the frequency
might change, the kernel is making a best guest based on other timekeeping sources,
and not all timekeeping sources are available during early boot.

And so overriding x86_platform.calibrate_{cpu,tsc}() for PV code is completely
unecessary, because steps #2 and #3 already know the frequency when they override
the hooks, and "success" is guaranteed, i.e. the kernel won't have to switch to a
"late" calibration flow.

If we provide x86_hyper_init hooks:

	unsigned int (*get_tsc_khz)(void);
	unsigned int (*get_cpu_khz)(void);

then we can kill off x86_platform.calibrate_{cpu,tsc}() entirely, explicitly
define the preferred ordering (user-forced => CoCo => Hypervisor => native), and
depup some of the hypervisor code.

E.g. this is what I've got for the early flow.  Testing now. 

  void __init tsc_early_init(void)
  {
	unsigned int known_cpu_khz = 0, known_tsc_khz = 0;

	if (!boot_cpu_has(X86_FEATURE_TSC))
		return;
	/* Don't change UV TSC multi-chassis synchronization */
	if (is_early_uv_system())
		return;

	if (x86_init.hyper.get_cpu_khz)
		known_cpu_khz = x86_init.hyper.get_cpu_khz();

	if (tsc_early_khz)
		known_tsc_khz = tsc_early_khz;
	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
		known_tsc_khz = snp_secure_tsc_init();
	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
		known_tsc_khz = tdx_tsc_init();

	/*
	 * If the TSC frequency is still unknown, i.e. not provided by the user
	 * or by trusted firmware, try to get it from the hypervisor (which is
	 * untrusted when running as a CoCo guest).
	 */
	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
		known_tsc_khz = x86_init.hyper.get_tsc_khz();

	if (known_tsc_khz)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
		return;
	tsc_enable_sched_clock();
  }

^ permalink raw reply

* Re: [PATCH v3 29/41] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
From: Sean Christopherson @ 2026-05-21 20:35 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <13d79ba1e0450068c9573ccd8deb3ec007aea8d6.camel@infradead.org>

On Thu, May 21, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Add a return code to __paravirt_set_sched_clock() so that the kernel can
> > reject attempts to use a PV sched_clock without breaking the caller.  E.g.
> > when running as a CoCo VM with a secure TSC, using a PV clock is generally
> > undesirable.
> > 
> > Note, kvmclock is the only PV clock that does anything "extra" beyond
> > simply registering itself as sched_clock, i.e. is the only caller that
> > needs to check the new return value.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Oooh... can we use this to reject the kvmclock when we have a stable
> and reliable TSC even for non-CoCo guests?

Yes, but I would much rather "fix" kvmclock to not even attempt to register itself
as the sched_clock (which this series does).

^ permalink raw reply

* Re: [PATCH v3 36/41] x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
From: Sean Christopherson @ 2026-05-21 20:34 UTC (permalink / raw)
  To: David Woodhouse
  Cc: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	pbonzini@redhat.com, kys@microsoft.com, decui@microsoft.com,
	daniel.lezcano@kernel.org, wei.liu@kernel.org,
	peterz@infradead.org, jgross@suse.com, boris.ostrovsky@oracle.com,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	mhklinux@outlook.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <7489ff3cc1ff402bf0ade38272fc52dcbcc75fc1.camel@amazon.co.uk>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > When running as a KVM guest with kvmclock support enabled, stuff the APIC
> > timer period/frequency with the local APIC bus frequency reported in
> > CPUID.0x40000010.EBX instead of trying to calibrate/guess the frequency.
> > 
> > See Documentation/virt/kvm/x86/cpuid.rst for details.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> I still don't much like the way this is done inside kvm_get_tsc_khz().

Yeah, I don't like it either (understatement).  Aha!  native_calibrate_tsc() is
the oddball, all of the PV flows stuff lapic_timer_period when parsing the initial
timing info.  I'll just do that.  Blindly writing a global is all kinds of fugly,
but that's a future
problem to solve.

> We also probably ought to be looking for the timing leaf on other
> hypervisors including VMware 

VMware gets the frequency via hypercall.  Why, I have no idea.  I'll let the
VMware folks deal with that.

	eax = vmware_hypercall3(VMWARE_CMD_GETHZ, UINT_MAX, &ebx, &ecx);

> and probably Bhyve too.  Should it be done somewhere else?

I'm not opposed to that, though I don't know that it'd be a net positive. The
"hard" part of getting the info is finding the CPUID base and checking if the
leaf is available.  Unlike the native CPUID leaf, no math is necessary, and so
once the leaf is obtained, getting the frequency is trivial.

Regardless, I definitely don't want to take it on in this series. :-)


^ permalink raw reply

* Re: [PATCH 1/1] mshv: Add conditional VMBus dependency
From: Arnd Bergmann @ 2026-05-21 20:15 UTC (permalink / raw)
  To: Michael Kelley, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, longli, Jork Loeser, linux-hyperv
  Cc: linux-kernel, hamzamahfooz
In-Reply-To: <20260521164921.1995-1-mhklkml@zohomail.com>

On Thu, May 21, 2026, at 18:49, Michael Kelley wrote:
>
> Existing code ensures that the VMBus driver loads first if it is
> built-in. The VMBus driver uses subsys_initcall(), which is
> initcall level 4. The MSHV root driver uses module_init(), which
> becomes device_init() when built-in, and device_init() is
> initcall level 6.
>
> Reported-by: Arnd Bergmann <arnd@arndb.de>
> Closes: https://lore.kernel.org/all/20260520074044.923728-1-arnd@kernel.org/
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Looks good to me, thanks for fixing it!

Acked-by: Arnd Bergmann <arnd@arndb.de>

>  	/*
>  	 * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
>  	 * See hv_hyp_synic_enable_regs() for that initialization.
>  	 */
> -	bool vmbus_active = hv_vmbus_exists();
> +#if IS_ENABLED(CONFIG_HYPERV_VMBUS)
> +	vmbus_active = hv_vmbus_exists();
> +#endif

I would usually write this as 

        if (IS_ENABLED(CONFIG_HYPERV_VMBUS))
                  vmbus_active = hv_vmbus_exists();

for readability, since the hv_vmbus_exists() declarations is still
visible and the IS_ENABLED() check avoids the link failure.

      ARnd

^ permalink raw reply

* [PATCH 1/1] x86/hyperv: Refactor hv_smp_prepare_cpus()
From: Michael Kelley @ 2026-05-21 19:23 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, x86, hpa, linux-hyperv, linux-kernel, linux-arch

From: Michael Kelley <mhklinux@outlook.com>

hv_smp_prepare_cpus() current handles two disjoint cases: a fully
enlightened SNP guest and running in the root partition. The root
partition case has recently added more steps, and as a result the
function is getting somewhat messy.

Refactor the code by putting the SNP and root cases into separate
functions. For the root case, move most of the code into hv_proc.c,
which is built only when MSHV_ROOT is configured. The move reduces
the surface area between the main code and the root partition
extensions. Several stubs go away, with an overall modest reduction
in lines of code.

No functional change.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 52 +++++++---------------------------
 drivers/hv/hv_proc.c           | 35 +++++++++++++++++++++--
 include/asm-generic/mshyperv.h | 17 ++---------
 3 files changed, 45 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 640e6b223c2d..442156056cd2 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -32,7 +32,6 @@
 #include <asm/msr.h>
 #include <asm/nmi.h>
 #include <clocksource/hyperv_timer.h>
-#include <asm/numa.h>
 #include <asm/svm.h>
 
 /* Is Linux running on nested Microsoft Hypervisor */
@@ -413,46 +412,16 @@ static void __init hv_smp_prepare_boot_cpu(void)
 #endif
 }
 
-static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
+static void __init hv_smp_prepare_cpus_for_snp(unsigned int max_cpus)
 {
-#ifdef CONFIG_X86_64
-	int i;
-	int ret;
-#endif
-
 	native_smp_prepare_cpus(max_cpus);
+	apic->wakeup_secondary_cpu_64 = hv_snp_boot_ap;
+}
 
-	/*
-	 *  Override wakeup_secondary_cpu_64 callback for SEV-SNP
-	 *  enlightened guest.
-	 */
-	if (!ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
-		apic->wakeup_secondary_cpu_64 = hv_snp_boot_ap;
-		return;
-	}
-
-#ifdef CONFIG_X86_64
-	/* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
-	if (num_present_cpus() == 1 || hv_lp_exists(1))
-		return;
-
-	for_each_present_cpu(i) {
-		if (i == 0)
-			continue;
-		ret = hv_call_add_logical_proc(numa_cpu_node(i), i, cpu_physical_id(i));
-		BUG_ON(ret);
-	}
-
-	ret = hv_call_notify_all_processors_started();
-	WARN_ON(ret);
-
-	for_each_present_cpu(i) {
-		if (i == 0)
-			continue;
-		ret = hv_call_create_vp(numa_cpu_node(i), hv_current_partition_id, i, i);
-		BUG_ON(ret);
-	}
-#endif
+static void __init hv_smp_prepare_cpus_for_root(unsigned int max_cpus)
+{
+	native_smp_prepare_cpus(max_cpus);
+	hv_smp_prep_cpus();
 }
 #endif
 
@@ -722,9 +691,10 @@ static void __init ms_hyperv_init_platform(void)
 
 # ifdef CONFIG_SMP
 	smp_ops.smp_prepare_boot_cpu = hv_smp_prepare_boot_cpu;
-	if (hv_root_partition() ||
-	    (!ms_hyperv.paravisor_present && hv_isolation_type_snp()))
-		smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus;
+	if (!ms_hyperv.paravisor_present && hv_isolation_type_snp())
+		smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus_for_snp;
+	else if (hv_root_partition())
+		smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus_for_root;
 # endif
 
 	/*
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 57b2c64197cb..b8aa76a7b19b 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -8,6 +8,7 @@
 #include <linux/minmax.h>
 #include <linux/export.h>
 #include <asm/mshyperv.h>
+#include <asm/numa.h>
 
 /*
  * See struct hv_deposit_memory. The first u64 is partition ID, the rest
@@ -154,7 +155,7 @@ bool hv_result_needs_memory(u64 status)
 }
 EXPORT_SYMBOL_GPL(hv_result_needs_memory);
 
-int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
+static int hv_call_add_logical_proc(int node, u32 lp_index, u32 apic_id)
 {
 	struct hv_input_add_logical_processor *input;
 	struct hv_output_add_logical_processor *output;
@@ -240,7 +241,7 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 }
 EXPORT_SYMBOL_GPL(hv_call_create_vp);
 
-int hv_call_notify_all_processors_started(void)
+static int hv_call_notify_all_processors_started(void)
 {
 	struct hv_input_notify_partition_event *input;
 	u64 status;
@@ -262,7 +263,7 @@ int hv_call_notify_all_processors_started(void)
 	return ret;
 }
 
-bool hv_lp_exists(u32 lp_index)
+static bool hv_lp_exists(u32 lp_index)
 {
 	struct hv_input_get_logical_processor_run_time *input;
 	struct hv_output_get_logical_processor_run_time *output;
@@ -286,3 +287,31 @@ bool hv_lp_exists(u32 lp_index)
 
 	return hv_result_success(status);
 }
+
+void hv_smp_prep_cpus(void)
+{
+#ifdef CONFIG_X86_64
+	int i, ret;
+
+	/* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
+	if (num_present_cpus() == 1 || hv_lp_exists(1))
+		return;
+
+	for_each_present_cpu(i) {
+		if (i == 0)
+			continue;
+		ret = hv_call_add_logical_proc(numa_cpu_node(i), i, cpu_physical_id(i));
+		BUG_ON(ret);
+	}
+
+	ret = hv_call_notify_all_processors_started();
+	WARN_ON(ret);
+
+	for_each_present_cpu(i) {
+		if (i == 0)
+			continue;
+		ret = hv_call_create_vp(numa_cpu_node(i), hv_current_partition_id, i, i);
+		BUG_ON(ret);
+	}
+#endif
+}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb..ea1c4acda1ec 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -346,9 +346,7 @@ static inline bool hv_parent_partition(void)
 bool hv_result_needs_memory(u64 status);
 int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
 int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
-int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
-int hv_call_notify_all_processors_started(void);
-bool hv_lp_exists(u32 lp_index);
+void hv_smp_prep_cpus(void);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
 
 #else /* CONFIG_MSHV_ROOT */
@@ -364,18 +362,7 @@ static inline int hv_call_deposit_pages(int node, u64 partition_id, u32 num_page
 {
 	return -EOPNOTSUPP;
 }
-static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
-{
-	return -EOPNOTSUPP;
-}
-static inline int hv_call_notify_all_processors_started(void)
-{
-	return -EOPNOTSUPP;
-}
-static inline bool hv_lp_exists(u32 lp_index)
-{
-	return false;
-}
+static inline void hv_smp_prep_cpus(void) {}
 static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 {
 	return -EOPNOTSUPP;
-- 
2.25.1


^ permalink raw reply related

* RE: [PATCH v3 2/2] drm/hyperv: validate VMBus packet size in receive callback
From: Michael Kelley @ 2026-05-21 17:19 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat
In-Reply-To: <e6e63276cca2901641ab39029e4fd3d621b1ee92.1779221799.git.me@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Tuesday, May 19, 2026 1:09 PM
> 
> hyperv_receive_sub() reads msg->vid_hdr.type and dispatches into one
> of four message-type branches without knowing how many bytes the host
> wrote into hv->recv_buf. The completion path then runs
> memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE), so the consumer
> that wakes on wait_for_completion_timeout() can read up to 16 KiB of
> residue from a prior message as if it were the response payload.
> 
> Pass bytes_recvd into hyperv_receive_sub() and reject any packet that
> does not cover the pipe + synthvid header. For each of the three
> completion-driving types (SYNTHVID_VERSION_RESPONSE,
> SYNTHVID_RESOLUTION_RESPONSE, SYNTHVID_VRAM_LOCATION_ACK) also
> require the type-specific payload before memcpy/complete, and apply
> the same rule to SYNTHVID_FEATURE_CHANGE before reading is_dirt_needed.
> The memcpy then uses bytes_recvd, which is bounded by
> VMBUS_MAX_PACKET_SIZE through the call to vmbus_recvpacket().
> 
> Rejected packets are reported via drm_err_ratelimited() rather than
> silently dropped, matching the CoCo-hardened pattern in
> hv_kvp_onchannelcallback().

We discussed several issues with this patch in the feedback
from Sashiko. But see one more issue below.

> 
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Cc: stable@vger.kernel.org # 5.14+
> Signed-off-by: Berkant Koc <me@berkoc.com>
> Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 42 +++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index c3d0ff229..12d3feb4f 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -420,26 +420,62 @@ static int hyperv_get_supported_resolution(struct hv_device *hdev)
>  	return 0;
>  }
> 
> -static void hyperv_receive_sub(struct hv_device *hdev)
> +static void hyperv_receive_sub(struct hv_device *hdev, u32 bytes_recvd)
>  {
>  	struct hyperv_drm_device *hv = hv_get_drvdata(hdev);
>  	struct synthvid_msg *msg;
> +	size_t hdr_size;
> 
>  	if (!hv)
>  		return;
> 
> +	hdr_size = sizeof(struct pipe_msg_hdr) +
> +		   sizeof(struct synthvid_msg_hdr);
> +	if (bytes_recvd < hdr_size) {
> +		drm_err_ratelimited(&hv->dev,
> +				    "synthvid packet too small for header: %u\n",
> +				    bytes_recvd);
> +		return;
> +	}
> +
>  	msg = (struct synthvid_msg *)hv->recv_buf;
> 
>  	/* Complete the wait event */
>  	if (msg->vid_hdr.type == SYNTHVID_VERSION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_RESOLUTION_RESPONSE ||
>  	    msg->vid_hdr.type == SYNTHVID_VRAM_LOCATION_ACK) {
> -		memcpy(hv->init_buf, msg, VMBUS_MAX_PACKET_SIZE);
> +		size_t need = hdr_size;
> +
> +		switch (msg->vid_hdr.type) {
> +		case SYNTHVID_VERSION_RESPONSE:
> +			need += sizeof(struct synthvid_version_resp);
> +			break;
> +		case SYNTHVID_RESOLUTION_RESPONSE:
> +			need += sizeof(struct synthvid_supported_resolution_resp);

I'm concerned that this might be too aggressive.  The last element
of struct synthvid_supported_resolution_resp is an array, and there's
a count in the message describing how many elements of the array
are populated. But Hyper-V may not (and probably doesn't) include
unpopulated elements in the response message.  So "need" is likely
calculated as too large. Are you able to test this in a Hyper-V VM to
confirm?

I think you'll find it necessary to first check that enough bytes
have arrived to read the "resolution_count" field, and then use
that value to calculate "need".  There are several other places
in hardened VMBus drivers that use that same two-level
technique. It's a pain, but there's not really any alternative.

Michael

> +			break;
> +		case SYNTHVID_VRAM_LOCATION_ACK:
> +			need += sizeof(struct synthvid_vram_location_ack);
> +			break;
> +		}
> +		if (bytes_recvd < need) {
> +			drm_err_ratelimited(&hv->dev,
> +					    "synthvid packet too small for type %u: %u < %zu\n",
> +					    msg->vid_hdr.type, bytes_recvd, need);
> +			return;
> +		}
> +		memcpy(hv->init_buf, msg, bytes_recvd);
>  		complete(&hv->wait);
>  		return;
>  	}
> 
>  	if (msg->vid_hdr.type == SYNTHVID_FEATURE_CHANGE) {
> +		if (bytes_recvd < hdr_size +
> +		    sizeof(struct synthvid_feature_change)) {
> +			drm_err_ratelimited(&hv->dev,
> +					    "synthvid feature change packet too small: %u\n",
> +					    bytes_recvd);
> +			return;
> +		}
>  		hv->dirt_needed = msg->feature_chg.is_dirt_needed;
>  		if (hv->dirt_needed)
>  			hyperv_hide_hw_ptr(hv->hdev);
> @@ -466,7 +502,7 @@ static void hyperv_receive(void *ctx)
>  				       &bytes_recvd, &req_id);
>  		if (bytes_recvd > 0 &&
>  		    recv_buf->pipe_hdr.type == PIPE_MSG_DATA)
> -			hyperv_receive_sub(hdev);
> +			hyperv_receive_sub(hdev, bytes_recvd);
>  	} while (bytes_recvd > 0 && ret == 0);
>  }
> 
> --
> 2.47.3
> 


^ permalink raw reply

* RE: [PATCH v3 1/2] drm/hyperv: validate resolution_count and fix WIN8 fallback
From: Michael Kelley @ 2026-05-21 17:07 UTC (permalink / raw)
  To: Berkant Koc, Saurabh Sengar, Dexuan Cui, Long Li
  Cc: linux-hyperv@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Michael Kelley, Thomas Zimmermann, Maarten Lankhorst,
	Maxime Ripard, Deepak Rawat
In-Reply-To: <1b88bc7edeb2f0153475225b67f19aaca629eca8.1779221799.git.me@berkoc.com>

From: Berkant Koc <me@berkoc.com> Sent: Tuesday, May 19, 2026 1:08 PM
> 
> A SYNTHVID_RESOLUTION_RESPONSE with resolution_count > 64 walks past
> the supported_resolution[SYNTHVID_MAX_RESOLUTION_COUNT] array in the
> parse loop. Bound resolution_count against the array size, folded
> into the existing zero-check.
> 
> When the WIN10 resolution probe fails, the caller in
> hyperv_connect_vsp() left hv->screen_*_max / preferred_* unpopulated,
> which sets mode_config.max_width / max_height to 0 and makes
> drm_internal_framebuffer_create() reject every userspace framebuffer
> with -EINVAL. The pre-WIN10 branch had the same gap for
> preferred_width / preferred_height. Use a single post-probe fallback
> guarded by screen_width_max == 0 so both paths converge on the WIN8
> defaults.
> 
> Signed-off-by: Berkant Koc <me@berkoc.com>
> Assisted-by: Claude:claude-opus-4-7 berkoc-pipeline
> Fixes: 76c56a5affeb ("drm/hyperv: Add DRM driver for hyperv synthetic video device")
> Cc: stable@vger.kernel.org # 5.14+
> ---
>  drivers/gpu/drm/hyperv/hyperv_drm_proto.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> index 051ecc526..c3d0ff229 100644
> --- a/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> +++ b/drivers/gpu/drm/hyperv/hyperv_drm_proto.c
> @@ -391,8 +391,11 @@ static int hyperv_get_supported_resolution(struct hv_device
> *hdev)
>  		return -ETIMEDOUT;
>  	}
> 
> -	if (msg->resolution_resp.resolution_count == 0) {
> -		drm_err(dev, "No supported resolutions\n");
> +	if (msg->resolution_resp.resolution_count == 0 ||
> +	    msg->resolution_resp.resolution_count >
> +	    SYNTHVID_MAX_RESOLUTION_COUNT) {
> +		drm_err(dev, "Invalid resolution count: %d\n",
> +			msg->resolution_resp.resolution_count);
>  		return -ENODEV;
>  	}
> 
> @@ -508,9 +511,13 @@ int hyperv_connect_vsp(struct hv_device *hdev)
>  		ret = hyperv_get_supported_resolution(hdev);
>  		if (ret)
>  			drm_err(dev, "Failed to get supported resolution from host, use default\n");
> -	} else {
> +	}
> +
> +	if (!hv->screen_width_max) {
>  		hv->screen_width_max = SYNTHVID_WIDTH_WIN8;
>  		hv->screen_height_max = SYNTHVID_HEIGHT_WIN8;
> +		hv->preferred_width = SYNTHVID_WIDTH_WIN8;
> +		hv->preferred_height = SYNTHVID_HEIGHT_WIN8;
>  	}
> 
>  	hv->mmio_megabytes = hdev->channel->offermsg.offer.mmio_megabytes;
> --
> 2.47.3
> 

Looks good to me.

Reviewed-by: Michael Kelley <mhklinux@outlook.com>

^ permalink raw reply

* [PATCH 1/1] mshv: Add conditional VMBus dependency
From: Michael Kelley @ 2026-05-21 16:49 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, jloeser, linux-hyperv
  Cc: linux-kernel, arnd, hamzamahfooz

From: Michael Kelley <mhklinux@outlook.com>

When the VMBus driver is not part of the kernel (CONFIG_HYPERV_VMBUS=n),
the MSHV root driver fails to link:

ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!

Fix this while meeting these requirements:
* It must be possible to include the MSHV root driver without the
  VMBus driver. In such case, the MSHV root driver can be built-in
  to the kernel image, or it can be built as a separate module.
* If both the MSHV root driver and the VMBus driver are present, the
  MSHV root driver and VMBus driver can both be built-in, or they can
  both be separate modules. Or the MSHV root driver can be a module
  while the VMBus driver can be built-in, but the reverse is
  disallowed. Regardless of the build choices, the VMBus driver must
  be loaded before the MSHV driver in order for the SynIC to be
  managed properly (see comments in the MSHV SynIC code).

The fix has two parts:
* Add a Kconfig entry for MSHV_ROOT to depend on HYPERV_VMBUS if
  HYPERV_VMBUS is present. The entry disallows MSHV_ROOT being
  built-in when HYPERV_VMBUS is a module, but without requiring that
  HYPERV_VMBUS be built.
* Add #ifdefs around MSHV SynIC calls to hv_vmbus_exists(). When
  the VMBus driver is present, these calls establish a module
  dependency to ensure that the VMBus driver loads first when both
  are built as modules. But if the VMBus driver is not present,
  the behavior is as if hv_vmbus_exists() returned "false", and
  there is no module dependency.

Existing code ensures that the VMBus driver loads first if it is
built-in. The VMBus driver uses subsys_initcall(), which is
initcall level 4. The MSHV root driver uses module_init(), which
becomes device_init() when built-in, and device_init() is
initcall level 6.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/all/20260520074044.923728-1-arnd@kernel.org/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 drivers/hv/Kconfig      |  1 +
 drivers/hv/mshv_synic.c | 12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 2d0b3fcb0ff8..aa11bcefddf2 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -74,6 +74,7 @@ config MSHV_ROOT
 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
 	# no particular order, making it impossible to reassemble larger pages
 	depends on PAGE_SIZE_4KB
+	depends on HYPERV_VMBUS if HYPERV_VMBUS
 	select EVENTFD
 	select VIRT_XFER_TO_GUEST_WORK
 	select HMM_MIRROR
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 88170ce6b83f..3f72a3dd232d 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -463,11 +463,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
 			&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 			&spages->synic_event_ring_page;
+	bool vmbus_active = false;
+
 	/*
 	 * VMBus owns SIMP/SIEFP/SCONTROL when it is active.
 	 * See hv_hyp_synic_enable_regs() for that initialization.
 	 */
-	bool vmbus_active = hv_vmbus_exists();
+#if IS_ENABLED(CONFIG_HYPERV_VMBUS)
+	vmbus_active = hv_vmbus_exists();
+#endif

 	/*
 	 * Map the SYNIC message page. When VMBus is not active the
@@ -587,8 +591,12 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
 		&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 		&spages->synic_event_ring_page;
+	bool vmbus_active = false;
+
 	/* VMBus owns SIMP/SIEFP/SCONTROL when it is active */
-	bool vmbus_active = hv_vmbus_exists();
+#if IS_ENABLED(CONFIG_HYPERV_VMBUS)
+	vmbus_active = hv_vmbus_exists();
+#endif

 	/* Disable the interrupt */
 	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
-- 
2.25.1

^ permalink raw reply related

* RE: [PATCH] mshv: add vmbus dependency
From: Michael Kelley @ 2026-05-21 15:56 UTC (permalink / raw)
  To: Jork Loeser, Arnd Bergmann
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Anirudh Rayabharam (Microsoft), Stanislav Kinsburskii,
	Arnd Bergmann, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <52a29c5-715e-8ea-af1-dafebfca7a84@linux.microsoft.com>

From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:16 AM
> 
> On Wed, 20 May 2026, Arnd Bergmann wrote:
> 
> > From: Arnd Bergmann <arnd@arndb.de>
> >
> > When the vmbus driver is not part of the kernel, the mvhv_root
> > driver now fails to link:
> >
> > ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
> >
> > Avoid this by adding an explicit Kconfig dependency. Note that
> > stubbing out the hv_vmbus_exists() based on configuration would
> > also work for some cases, but not with MSHV_ROOT=y and HYPERV_VMBUS=m.
> >
> > Fixes: f1a9e67c1138 ("mshv: limit SynIC management to MSHV-owned resources")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> > drivers/hv/Kconfig | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> > index 52af086fdeb2..21193b571a80 100644
> > --- a/drivers/hv/Kconfig
> > +++ b/drivers/hv/Kconfig
> > @@ -75,6 +75,7 @@ config MSHV_ROOT
> > 	# e.g. When withdrawing memory, the hypervisor gives back 4k pages in
> > 	# no particular order, making it impossible to reassemble larger pages
> > 	depends on PAGE_SIZE_4KB
> > +	depends on HYPERV_VMBUS
> > 	select EVENTFD
> > 	select VIRT_XFER_TO_GUEST_WORK
> > 	select HMM_MIRROR
> > --
> > 2.39.5
> >
> 
> Yes, this is the right short-term fix. We will need to solve the root case
> (no VMBUS required) with a separate SYNIC driver abstraction.
> 
> Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>
> 

I have what I think is a better way to fix this. It preserves the
ability to build MSHV without VMBus, while also guaranteeing
that VMBus loads first when present. And it is relatively simple --
hv_vmbus_exists() does not need to be moved out of the
VMBus module. Later today I'll post a separate patch for
consideration.

The separate SynIC driver abstraction can still come later
and improve things further.

Michael

^ permalink raw reply

* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-21 15:45 UTC (permalink / raw)
  To: Jacob Pan, Michael Kelley
  Cc: Yu Zhang, Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <20260520134027.00005e91@linux.microsoft.com>

From: Jacob Pan <jacob.pan@linux.microsoft.com> Sent: Wednesday, May 20, 2026 1:40 PM
> 
> Hi Michael,
> 
> On Wed, 20 May 2026 19:26:24 +0000
> Michael Kelley <mhklinux@outlook.com> wrote:
> 
> > From: Michael Kelley <mhklinux@outlook.com> To: Yu Zhang <zhangyu1@linux.microsoft.com>, Jason Gunthorpe
> >
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > >
> > > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > > +static inline u16 hv_iommu_fill_iova_list(union
> > > > > hv_iommu_flush_va *iova_list,
> > > > > +					  unsigned long start,
> > > > > +					  unsigned long end)
> > > > > +{
> > > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +	u16 count = 0;
> > > > > +
> > > > > +	while (nr_pages > 0) {
> > > > > +		unsigned long flush_pages;
> > > > > +		int order;
> > > > > +		unsigned long pfn_align;
> > > > > +		unsigned long size_align;
> > > > > +
> > > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (start_pfn)
> > > > > +			pfn_align = __ffs(start_pfn);
> > > > > +		else
> > > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > > +
> > > > > +		size_align = __fls(nr_pages);
> > > > > +		order = min(pfn_align, size_align);
> > > > > +		iova_list[count].page_mask_shift = order;
> > > > > +		iova_list[count].page_number = start_pfn;
> > > > > +
> > > > > +		flush_pages = 1UL << order;
> > > > > +		start_pfn += flush_pages;
> > > > > +		nr_pages -= flush_pages;
> > > > > +		count++;
> > > > > +	}
> > > >
> > > > This seems like a really silly hypervisor interface. Why doesn't
> > > > it just accept a normal range? Splitting it into power of two
> > > > aligned ranges is very inefficient.
> > >
> > > Fair point. I'm not sure how much flexibility we have to change
> > > this hypercall interface at the moment - it predates the pvIOMMU
> > > work and may have other consumers beyond Linux guest. On the other
> > > hand, having the guest specify 2^N-aligned blocks does save the
> > > hypervisor from having to decompose ranges itself before issuing
> > > hardware invalidation commands - the guest-provided entries can be
> > > fed to the HW more or less directly.
> > >
> > > That said, the way I'm currently using this interface may be
> > > more precise than necessary. Maybe we have 2 options:
> > >
> > > 1) Current approach: decompose the range into multiple exact
> > >    2^N-aligned blocks with no over-flush, but at the cost of
> > >    more complex calculations and more entries.
> > >
> > > 2) Follow what Intel/AMD drivers do: find a single minimal
> > >    2^N-aligned block that covers the entire range, but may
> > >    over-flush.
> > >
> > > Any preference?
> > >
> > > @Michael, since you've also been reviewing this patch, I'd
> > > appreciate your thoughts on the above as well. :)
> > >
> >
> > I'm just guessing, but perhaps flushing an aligned power-of-2
> > range can be processed by the hypervisor at a relatively fixed
> > cost, regardless of the size. Having the guest do the decomposing
> > of an arbitrary range allows the hypervisor to make use of the
> > existing "rep" hypercall mechanism if the hypercall is taking
> > "too long". The hypervisor can pause its processing, return to
> > the guest temporarily, and then continue the hypercall. If the
> > arbitrary range were passed into the hypercall for the hypervisor
> > to do the decomposing, that pause-and-restart mechanism
> > wouldn't be available.
> >
> > Of course, Linux doesn't really take advantage of the pause to
> > reduce guest interrupt latency because the Hyper-V code in
> > Linux typically disable interrupts around a hypercall due to the
> > way the hypercall input page is allocated. But other guest
> > operating systems might benefit from such a pause. And we could
> > probably fix the Hyper-V code in Linux to allow interrupts during a
> > hypercall pause/restart if long-running hypercalls turn out to be
> > a problem.

> I am not sure if this pause feature is suitable for IOTLB flush at all
> since it is inherently synchronous — the caller must block until all
> invalidations complete. Pausing mid-flush to return to the guest
> doesn't help if the guest can't make forward progress anyway.

I agree that hypercall pause/resume doesn't help with
forward progress. But it could help with interrupt latency in the
guest if the hypercall executes with interrupts enabled in the
guest. During the pause when control returns to the guest,
the guest could take an interrupt, versus the interrupt having
to wait until the entire hypercall completes. And if preemption
is enabled in the guest thread executing the hypercall, the thread
could be descheduled, potentially improving scheduling latency.

At least that's my understanding of why Hyper-V has this pause/
resume mechanism for "rep" hypercalls. :-)

Michael




^ permalink raw reply

* RE: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Michael Kelley @ 2026-05-21 15:39 UTC (permalink / raw)
  To: Yu Zhang, Michael Kelley
  Cc: Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <pxod76qh3jtpvnxdlflvntc5svqgibaeu6tywn2ejrlnea65w3@djehcr3vidnk>

From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Thursday, May 21, 2026 7:34 AM
> 
> On Wed, May 20, 2026 at 07:26:24PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > >
> > > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > > +					  unsigned long start,
> > > > > +					  unsigned long end)
> > > > > +{
> > > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +	u16 count = 0;
> > > > > +
> > > > > +	while (nr_pages > 0) {
> > > > > +		unsigned long flush_pages;
> > > > > +		int order;
> > > > > +		unsigned long pfn_align;
> > > > > +		unsigned long size_align;
> > > > > +
> > > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +			break;
> > > > > +		}
> > > > > +
> > > > > +		if (start_pfn)
> > > > > +			pfn_align = __ffs(start_pfn);
> > > > > +		else
> > > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > > +
> > > > > +		size_align = __fls(nr_pages);
> > > > > +		order = min(pfn_align, size_align);
> > > > > +		iova_list[count].page_mask_shift = order;
> > > > > +		iova_list[count].page_number = start_pfn;
> > > > > +
> > > > > +		flush_pages = 1UL << order;
> > > > > +		start_pfn += flush_pages;
> > > > > +		nr_pages -= flush_pages;
> > > > > +		count++;
> > > > > +	}
> > > >
> > > > This seems like a really silly hypervisor interface. Why doesn't it
> > > > just accept a normal range? Splitting it into power of two aligned
> > > > ranges is very inefficient.
> > >
> > > Fair point. I'm not sure how much flexibility we have to change
> > > this hypercall interface at the moment - it predates the pvIOMMU
> > > work and may have other consumers beyond Linux guest. On the other
> > > hand, having the guest specify 2^N-aligned blocks does save the
> > > hypervisor from having to decompose ranges itself before issuing
> > > hardware invalidation commands - the guest-provided entries can be
> > > fed to the HW more or less directly.
> > >
> > > That said, the way I'm currently using this interface may be
> > > more precise than necessary. Maybe we have 2 options:
> > >
> > > 1) Current approach: decompose the range into multiple exact
> > >    2^N-aligned blocks with no over-flush, but at the cost of
> > >    more complex calculations and more entries.
> > >
> > > 2) Follow what Intel/AMD drivers do: find a single minimal
> > >    2^N-aligned block that covers the entire range, but may
> > >    over-flush.
> > >
> > > Any preference?
> > >
> > > @Michael, since you've also been reviewing this patch, I'd
> > > appreciate your thoughts on the above as well. :)
> > >
> >
> > I'm just guessing, but perhaps flushing an aligned power-of-2
> > range can be processed by the hypervisor at a relatively fixed
> > cost, regardless of the size. Having the guest do the decomposing
> > of an arbitrary range allows the hypervisor to make use of the
> > existing "rep" hypercall mechanism if the hypercall is taking
> > "too long". The hypervisor can pause its processing, return to
> > the guest temporarily, and then continue the hypercall. If the
> > arbitrary range were passed into the hypercall for the hypervisor
> > to do the decomposing, that pause-and-restart mechanism
> > wouldn't be available.
> >
> > Of course, Linux doesn't really take advantage of the pause to
> > reduce guest interrupt latency because the Hyper-V code in
> > Linux typically disable interrupts around a hypercall due to the
> > way the hypercall input page is allocated. But other guest
> > operating systems might benefit from such a pause. And we could
> > probably fix the Hyper-V code in Linux to allow interrupts during a
> > hypercall pause/restart if long-running hypercalls turn out to be
> > a problem.
> >
> > Regarding proposal (1) vs. (2), perhaps you could confirm with
> > the Hyper-V team that flushing an aligned power-of-2 range
> > has relatively fixed cost, regardless of the size. And what do the
> > flush requests coming from the generic IOMMU subsystem look
> > like? Do they match dma_unmap() ranges, which are probably
> > dominated by relatively small ranges of a few pages at most,
> > with a few outliers for disk I/O requests of 1 MiB or some such?
> > If the dominant flush request is for a few pages at most, then
> > doing (2) seems reasonable.
> 
> Thanks for the thoughtful suggestions, Michael!
> 
> I believe the time might be dominated by the number of descriptors,
> instead of the size of each range, especially when device TLB
> invalidations are involved.
> 
> Here's my understanding of what hypervisor does in its handler:
> 
> Hyper-V constructs one IOTLB invalidation descriptor (and possibly
> a Device TLB invalidation descriptor as well) per iova_list entry
> and submits them to the HW invalidation queue, then synchronously
> waits for completion. So multiple 2^N-aligned entries should be less
> efficient than a single larger 2^N aligned one.

Agreed. The hypercall time should be roughly linear in the number
of descriptors.

If the approach is to do a precise flush, my argument is that
it is better for the guest to construct the 2^N aligned descriptors
instead of having the host do it. In the former case, the hypercall
can do the pause/resume thing, which provides the opportunity
to reduce interrupt latency in the guest. In the latter case, it cannot.

But my argument is moot if you do Option 2. And I'm fine with
Option 2 if the assumptions about it are true.

Michael

> 
> Since both options submit 2^N-aligned entries to the hypervisor,
> either one single coarser-grained entry or a precise decomposition,
> I'm now also leaning towards option 2, which is also what Intel/AMD
> drivers do for page-selective IOTLB flush. Simpler guest code, faster
> flush, and the hypervisor can feed the single entry almost directly
> to HW.
> 
> Yu


^ permalink raw reply

* Re: [PATCH net] net: mana: validate rx_req_idx to prevent out-of-bounds array access
From: patchwork-bot+netdevbpf @ 2026-05-21 15:20 UTC (permalink / raw)
  To: Aditya Garg
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, dipayanroy, horms, ernis, gargaditya,
	kees, stephen, shacharr, ssengar, linux-hyperv, netdev,
	linux-kernel
In-Reply-To: <20260520051553.857120-1-gargaditya@linux.microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 19 May 2026 22:15:53 -0700 you wrote:
> In mana_hwc_rx_event_handler(), rx_req_idx is derived from
> sge->address in DMA-coherent memory. In Confidential VMs
> (SEV-SNP/TDX), this memory is shared unencrypted and HW can modify
> WQE contents at any time. No bounds check exists on rx_req_idx,
> which can lead to an out-of-bounds access into reqs[].
> 
> Add bounds check on rx_req_idx in mana_hwc_rx_event_handler() before
> using it to index the reqs[] array.
> 
> [...]

Here is the summary with links:
  - [net] net: mana: validate rx_req_idx to prevent out-of-bounds array access
    https://git.kernel.org/netdev/net/c/b809d0409991

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support
From: Yu Zhang @ 2026-05-21 14:34 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Jason Gunthorpe, linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org, iommu@lists.linux.dev,
	linux-pci@vger.kernel.org, linux-arch@vger.kernel.org,
	wei.liu@kernel.org, kys@microsoft.com, haiyangz@microsoft.com,
	decui@microsoft.com, longli@microsoft.com, joro@8bytes.org,
	will@kernel.org, robin.murphy@arm.com, bhelgaas@google.com,
	kwilczynski@kernel.org, lpieralisi@kernel.org, mani@kernel.org,
	robh@kernel.org, arnd@arndb.de, jacob.pan@linux.microsoft.com,
	tgopinath@linux.microsoft.com,
	easwar.hariharan@linux.microsoft.com
In-Reply-To: <SN6PR02MB4157C1EC7F5F69C5ABDA9C7FD4012@SN6PR02MB4157.namprd02.prod.outlook.com>

On Wed, May 20, 2026 at 07:26:24PM +0000, Michael Kelley wrote:
> From: Yu Zhang <zhangyu1@linux.microsoft.com> Sent: Wednesday, May 20, 2026 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > +					  unsigned long start,
> > > > +					  unsigned long end)
> > > > +{
> > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't it
> > > just accept a normal range? Splitting it into power of two aligned
> > > ranges is very inefficient.
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guest. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> > 
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disable interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.
> 
> Regarding proposal (1) vs. (2), perhaps you could confirm with
> the Hyper-V team that flushing an aligned power-of-2 range
> has relatively fixed cost, regardless of the size. And what do the
> flush requests coming from the generic IOMMU subsystem look
> like? Do they match dma_unmap() ranges, which are probably
> dominated by relatively small ranges of a few pages at most,
> with a few outliers for disk I/O requests of 1 MiB or some such?
> If the dominant flush request is for a few pages at most, then
> doing (2) seems reasonable.

Thanks for the thoughtful suggestions, Michael!

I believe the time might be dominated by the number of descriptors,
instead of the size of each range, especially when device TLB
invalidations are involved.

Here's my understanding of what hypervisor does in its handler:

Hyper-V constructs one IOTLB invalidation descriptor (and possibly
a Device TLB invalidation descriptor as well) per iova_list entry
and submits them to the HW invalidation queue, then synchronously
waits for completion. So multiple 2^N-aligned entries should be less
efficient than a single larger 2^N aligned one.

Since both options submit 2^N-aligned entries to the hypervisor,
either one single coarser-grained entry or a precise decomposition,
I'm now also leaning towards option 2, which is also what Intel/AMD
drivers do for page-selective IOTLB flush. Simpler guest code, faster
flush, and the hypervisor can feed the single entry almost directly
to HW.

Yu

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-21 14:13 UTC (permalink / raw)
  To: Sean Christopherson, Peter Zijlstra
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Juergen Gross,
	Daniel Lezcano, Thomas Gleixner, John Stultz, Rick Edgecombe,
	Vitaly Kuznetsov, Broadcom internal kernel review list,
	Boris Ostrovsky, Stephen Boyd, x86, linux-coco, kvm, linux-hyperv,
	virtualization, linux-kernel, xen-devel, Michael Kelley,
	Tom Lendacky, Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag8K2FRGcoEa-D2Y@google.com>

[-- Attachment #1: Type: text/plain, Size: 2440 bytes --]

On Thu, 2026-05-21 at 06:38 -0700, Sean Christopherson wrote:
> On Thu, May 21, 2026, Peter Zijlstra wrote:
> > On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> > > On Thu, May 21, 2026, David Woodhouse wrote:
> > > > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > > > bringup when it's being used as sched_clock.
> > > > > 
> > > > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > > > initialized and thus registers its hooks.
> > > > > 
> > > > > Opportunsitically make the registration messages more precise to help
> > > > > debug issues where kvmclock is enabled too late.
> > > > 
> > > > That's a hard word to type, isn't it?
> > > 
> > > Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> > > while, it is time...
> > 
> > 'z=' not good enough?
> 
> You people and your fancy ways.  I'm just happy I can get in and out of the editor :-)

I reached the peak of my vi knowledge in about 1995 when I learned that
I could log in on another terminal, kill it from there, and then set
EDITOR=emacs.

Ironically I still find myself doing that kind of thing when I'm
composing a git-send-email cover letter and decide I don't want to send
the series as-is at all. Maybe there's a way to put a poison pill in
the message (or save it unchanged?) to make git *NOT* send anything...
but I always err on the side of caution and just nuke it from orbit, or
at least from another terminal.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-21 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Woodhouse, Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260521131019.GI3126523@noisy.programming.kicks-ass.net>

On Thu, May 21, 2026, Peter Zijlstra wrote:
> On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> > On Thu, May 21, 2026, David Woodhouse wrote:
> > > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > > bringup when it's being used as sched_clock.
> > > > 
> > > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > > initialized and thus registers its hooks.
> > > > 
> > > > Opportunsitically make the registration messages more precise to help
> > > > debug issues where kvmclock is enabled too late.
> > > 
> > > That's a hard word to type, isn't it?
> > 
> > Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> > while, it is time...
> 
> 'z=' not good enough?

You people and your fancy ways.  I'm just happy I can get in and out of the editor :-)

^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Peter Zijlstra @ 2026-05-21 13:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: David Woodhouse, Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Dave Hansen, Andy Lutomirski,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag8Bpc_uVNrNWqfX@google.com>

On Thu, May 21, 2026 at 05:59:17AM -0700, Sean Christopherson wrote:
> On Thu, May 21, 2026, David Woodhouse wrote:
> > On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > > their kvmclock during CPU online.  While a redundant write to the MSR is
> > > technically ok, skip the registration when kvmclock is sched_clock so that
> > > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > > bringup when it's being used as sched_clock.
> > > 
> > > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > > work today, because KVM's hooks happen to run after/before timekeeping's
> > > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > > which syscore_ops are invoked depends entirely on when a subsystem is
> > > initialized and thus registers its hooks.
> > > 
> > > Opportunsitically make the registration messages more precise to help
> > > debug issues where kvmclock is enabled too late.
> > 
> > That's a hard word to type, isn't it?
> 
> Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
> while, it is time...

'z=' not good enough?


^ permalink raw reply

* Re: [PATCH v3 27/41] x86/kvmclock: Enable kvmclock on APs during onlining if kvmclock isn't sched_clock
From: Sean Christopherson @ 2026-05-21 12:59 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <423b37f056f0d4d596d5f4cc73802fb1079ecf63.camel@infradead.org>

On Thu, May 21, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > In anticipation of making x86_cpuinit.early_percpu_clock_init(), i.e.
> > kvm_setup_secondary_clock(), a dedicated sched_clock hook that will be
> > invoked if and only if kvmclock is set as sched_clock, ensure APs enable
> > their kvmclock during CPU online.  While a redundant write to the MSR is
> > technically ok, skip the registration when kvmclock is sched_clock so that
> > it's somewhat obvious that kvmclock *needs* to be enabled during early
> > bringup when it's being used as sched_clock.
> > 
> > Plumb in the BSP's resume path purely for documentation purposes.  Both
> > KVM (as-a-guest) and timekeeping/clocksource hook syscore_ops, and it's
> > not super obvious that using KVM's hooks would be flawed.  E.g. it would
> > work today, because KVM's hooks happen to run after/before timekeeping's
> > hooks during suspend/resume, but that's sheer dumb luck as the order in
> > which syscore_ops are invoked depends entirely on when a subsystem is
> > initialized and thus registers its hooks.
> > 
> > Opportunsitically make the registration messages more precise to help
> > debug issues where kvmclock is enabled too late.
> 
> That's a hard word to type, isn't it?

Heh, you have no idea.  I've been "this" close to creating a VIM binding for a
while, it is time...

^ permalink raw reply

* Re: [PATCH v1 3/4] iommu/hyperv: Add para-virtualized IOMMU support for Hyper-V guest
From: Yu Zhang @ 2026-05-21 12:27 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Jason Gunthorpe, linux-kernel, linux-hyperv, iommu, linux-pci,
	linux-arch, wei.liu, kys, haiyangz, decui, longli, joro, will,
	robin.murphy, bhelgaas, kwilczynski, lpieralisi, mani, robh, arnd,
	mhklinux, tgopinath, easwar.hariharan
In-Reply-To: <20260520112708.00003640@linux.microsoft.com>

On Wed, May 20, 2026 at 11:27:08AM -0700, Jacob Pan wrote:
> Hi Yu,
> 
> On Wed, 20 May 2026 23:25:43 +0800
> Yu Zhang <zhangyu1@linux.microsoft.com> wrote:
> 
> > > > +static const struct iommu_domain_ops
> > > > hv_iommu_identity_domain_ops = {
> > > > +	.attach_dev	= hv_iommu_attach_dev,
> > > > +};
> > > > +
> > > > +static const struct iommu_domain_ops
> > > > hv_iommu_blocking_domain_ops = {
> > > > +	.attach_dev	= hv_iommu_attach_dev,
> > > > +};  
> > > 
> > > Usually I would expect these to have their own attach
> > > functions. blocking in particular must have an attach op that cannot
> > > fail. It is used to recover the device back to a known translation
> > > in case of cascading other errors.
> > >   
> > 
> > For blocking domain, the hypercall handler of such attach essentially
> > disable the translation and IOPF for the device.
> I think this should disable all faults, including unrecoverable fault
> reporting. right?

Yes, it should (e.g., my understanding is it shall set the FPD in scalable
mode context entry). Will double confirm with hypervisor team and make
sure it behaves so.

B.R.
Yu

^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: Dongli Zhang @ 2026-05-21  9:14 UTC (permalink / raw)
  To: Sean Christopherson, kvm
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-38-seanjc@google.com>



On 2026-05-15 12:19 PM, Sean Christopherson wrote:
> Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
> nonstop, and not marked unstable via command line.  I.e. use the same
> criteria as tweaking the clocksource rating so that TSC is preferred over
> kvmclock.  Per the below comment from native_sched_clock(), sched_clock
> is more tolerant of slop than clocksource; using TSC for clocksource but
> not sched_clock makes little to no sense, especially now that KVM CoCo
> guests with a trusted TSC use TSC, not kvmclock.
> 
>         /*
>          * Fall back to jiffies if there's no TSC available:
>          * ( But note that we still use it if the TSC is marked
>          *   unstable. We do this because unlike Time Of Day,
>          *   the scheduler clock tolerates small errors and it's
>          *   very important for it to be as fast as the platform
>          *   can achieve it. )
>          */
> 
> The only advantage of using kvmclock is that doing so allows for early
> and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
> broken for over two years with nary a complaint, i.e. it can't be
> _that_ valuable.  And as above, certain types of KVM guests are losing
> the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
> needs to be decoupled from sched_clock() no matter what.

Has it been broken for two years because of pvclock_clocksource_read_nowd()?

Thank you very much!

Dongli Zhang

^ permalink raw reply

* Re: [PATCH net 2/2] net: mana: Skip redundant detach in queue reset handler if already detached
From: Jakub Kicinski @ 2026-05-21  0:17 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260518194654.735580-3-dipayanroy@linux.microsoft.com>

On Mon, 18 May 2026 12:43:51 -0700 Dipayaan Roy wrote:
> +	/* If already detached (indicates detach succeeded but attach failed
> +	 * previously). Now skip mana detach and just retry mana_attach.
> +	 */
> +	if (!netif_device_present(ndev))
> +		goto attach;
> +
>  	err = mana_detach(ndev, false);
>  	if (err) {
>  		netdev_err(ndev, "mana_detach failed: %d\n", err);
>  		goto dealloc_pre_rxbufs;
>  	}
>  
> +attach:

goto's are acceptable for error unwinding, not to jump around 
a function seemingly to avoid indenting something. Please use
normal constructs or perhaps move the netif_device_present()
into mana_detach() as an early exit condition? 

>  	err = mana_attach(ndev);

^ permalink raw reply

* Re: [PATCH v3 39/41] x86/paravirt: Move using_native_sched_clock() stub into timer.h
From: David Woodhouse @ 2026-05-21  0:00 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-40-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 382 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that timer.h ended up with CONFIG_PARAVIRT #ifdeffery anyways, move the
> PARAVIRT=n using_native_sched_clock() stub into timer.h as a "free"
> optimization.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 38/41] x86/paravirt: kvmclock: Setup kvmclock early iff it's sched_clock
From: David Woodhouse @ 2026-05-20 23:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-39-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 699 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Rework the seemingly generic x86_cpuinit_ops.early_percpu_clock_init hook
> into a dedicated PV sched_clock hook, as the only reason the hook exists
> is to allow kvmclock to enable its PV clock on secondary CPUs before the
> kernel tries to reference sched_clock, e.g. when grabbing a timestamp for
> printk.
> 
> Rearranging the hook doesn't exactly reduce complexity; arguably it does
> the opposite.  But as-is, it's practically impossible to understand *why*
> kvmclock needs to do early configuration.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox