Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: (subset) [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-04-04 15:07 UTC (permalink / raw)
  To: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
  Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
	Danilo Krummrich
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>

On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
> Danilo Krummrich (12):
>   PCI: use generic driver_override infrastructure
>   platform/wmi: use generic driver_override infrastructure
>   vdpa: use generic driver_override infrastructure
>   s390/cio: use generic driver_override infrastructure
>   s390/ap: use generic driver_override infrastructure

Applied to driver-core-testing, thanks!

>   amba: use generic driver_override infrastructure
>   cdx: use generic driver_override infrastructure
>   hv: vmbus: use generic driver_override infrastructure
>   rpmsg: use generic driver_override infrastructure

I have not picked these up, as they have not received ACKs from the
corresponding subsystem maintainers so far.

>   bus: fsl-mc: use generic driver_override infrastructure
>   spi: use generic driver_override infrastructure

These have already been picked up via the respective subsystem trees -- thanks!

Thanks,
Danilo

^ permalink raw reply

* [PATCH v2] Drivers: hv: vmbus: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-04-04 15:01 UTC (permalink / raw)
  To: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
  Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
	Danilo Krummrich, Michael Kelley, Gui-Dong Han
In-Reply-To: <BN7PR02MB414825D0532A1DFE16F3B671D449A@BN7PR02MB4148.namprd02.prod.outlook.com>

When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.

Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.

Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]

Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: d765edbb301c ("vmbus: add driver_override support")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
Changes in v2:
  - Change patch subject and comments according to Michael's suggestion.
---
 drivers/hv/vmbus_drv.c | 43 ++++++++++--------------------------------
 include/linux/hyperv.h |  5 -----
 2 files changed, 10 insertions(+), 38 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc4fc1951ae1..ba7326ec6add 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -541,34 +541,6 @@ static ssize_t device_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(device);
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct hv_device *hv_dev = device_to_hv_device(dev);
-	int ret;
-
-	ret = driver_set_override(dev, &hv_dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct hv_device *hv_dev = device_to_hv_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", hv_dev->driver_override);
-	device_unlock(dev);
-
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 /* Set up per device attributes in /sys/bus/vmbus/devices/<bus device> */
 static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_id.attr,
@@ -599,7 +571,6 @@ static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_channel_vp_mapping.attr,
 	&dev_attr_vendor.attr,
 	&dev_attr_device.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 
@@ -711,9 +682,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
 {
 	const guid_t *guid = &dev->dev_type;
 	const struct hv_vmbus_device_id *id;
+	int ret;
 
-	/* When driver_override is set, only bind to the matching driver */
-	if (dev->driver_override && strcmp(dev->driver_override, drv->name))
+	/* If a driver override is set, only bind to the matching driver */
+	ret = device_match_driver_override(&dev->device, &drv->driver);
+	if (ret == 0)
 		return NULL;
 
 	/* Look at the dynamic ids first, before the static ones */
@@ -721,8 +694,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
 	if (!id)
 		id = hv_vmbus_dev_match(drv->id_table, guid);
 
-	/* driver_override will always match, send a dummy id */
-	if (!id && dev->driver_override)
+	/*
+	 * If there's a matching driver override, this function should succeed,
+	 * thus return a dummy device ID if no matching ID is found.
+	 */
+	if (!id && ret > 0)
 		id = &vmbus_device_null;
 
 	return id;
@@ -1024,6 +1000,7 @@ static const struct dev_pm_ops vmbus_pm = {
 /* The one and only one */
 static const struct bus_type  hv_bus = {
 	.name =		"vmbus",
+	.driver_override =	true,
 	.match =		vmbus_match,
 	.shutdown =		vmbus_shutdown,
 	.remove =		vmbus_remove,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..bf689d07d750 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1272,11 +1272,6 @@ struct hv_device {
 	u16 device_id;
 
 	struct device device;
-	/*
-	 * Driver name to force a match.  Do not set directly, because core
-	 * frees it.  Use driver_set_override() to set or clear it.
-	 */
-	const char *driver_override;
 
 	struct vmbus_channel *channel;
 	struct kset	     *channels_kset;
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH net-next v5 1/3] net: mana: Use pci_name() for debugfs directory naming
From: Simon Horman @ 2026-04-04  9:05 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, shradhagupta, shirazsaleem,
	yury.norov, kees, ssengar, dipayanroy, gargaditya, linux-hyperv,
	netdev, linux-kernel, linux-rdma
In-Reply-To: <20260402182704.2474739-2-ernis@linux.microsoft.com>

On Thu, Apr 02, 2026 at 11:26:55AM -0700, Erni Sri Satya Vennela wrote:
> Use pci_name(pdev) for the per-device debugfs directory instead of
> hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
> previous approach had two issues:
> 
> 1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
>    in environments like generic VFIO passthrough or nested KVM,
>    causing a NULL pointer dereference.
> 
> 2. Multiple PFs would all use "0", and VFs across different PCI
>    domains or buses could share the same slot name, leading to
>    -EEXIST errors from debugfs_create_dir().
> 
> pci_name(pdev) returns the unique BDF address, is always valid, and
> is unique across the system.
> 
> Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>

Hi Erni,

Possibly the code differs between net and net-next.
But if this is fixing a bug in code present in net - as per the cited
commit - then I think it should be a patch that targets net.
With some strategy for merging that change into net-next
if conflicts are expected.

^ permalink raw reply

* Re: [PATCH] mshv: Add tracepoint for GPA intercept handling
From: Wei Liu @ 2026-04-04  6:45 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <20260402-sarcastic-rottweiler-of-criticism-8330ae@anirudhrb>

On Thu, Apr 02, 2026 at 03:45:58PM +0000, Anirudh Rayabharam wrote:
> On Tue, Mar 24, 2026 at 11:59:59PM +0000, Stanislav Kinsburskii wrote:
> > Provide visibility into GPA intercept operations for debugging and
> > performance analysis of Microsoft Hypervisor guest memory management.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |    6 ++++--
> >  drivers/hv/mshv_trace.h     |   29 +++++++++++++++++++++++++++++
> >  2 files changed, 33 insertions(+), 2 deletions(-)
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 

This patch has a dependency on another bug fix. That's not listed
anywhere. In the future, please list the dependency, or post this as
part of a series.

The tracing support is in hyeprv-next. This patch needs to wait.

Wei


^ permalink raw reply

* Re: [PATCH] mshv: Add tracepoint for GPA intercept handling
From: Wei Liu @ 2026-04-04  6:41 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <20260402-sarcastic-rottweiler-of-criticism-8330ae@anirudhrb>

On Thu, Apr 02, 2026 at 03:45:58PM +0000, Anirudh Rayabharam wrote:
> On Tue, Mar 24, 2026 at 11:59:59PM +0000, Stanislav Kinsburskii wrote:
> > Provide visibility into GPA intercept operations for debugging and
> > performance analysis of Microsoft Hypervisor guest memory management.
> > 
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |    6 ++++--
> >  drivers/hv/mshv_trace.h     |   29 +++++++++++++++++++++++++++++
> >  2 files changed, 33 insertions(+), 2 deletions(-)
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 

This doesn't apply cleanly to hyperv-next.

Wei

^ permalink raw reply

* Re: [PATCH net-next 0/4] net: mana: Fix probe/remove error path bugs
From: Mohsin Bashir @ 2026-04-04  6:28 UTC (permalink / raw)
  To: Erni Sri Satya Vennela, kys, haiyangz, wei.liu, decui, longli,
	andrew+netdev, davem, edumazet, kuba, pabeni, ssengar, dipayanroy,
	gargaditya, shirazsaleem, kees, kotaranov, leon, shacharr,
	stephen, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260403070948.9225-1-ernis@linux.microsoft.com>



On 4/3/26 12:09 AM, Erni Sri Satya Vennela wrote:
> Fix four pre-existing bugs in mana_probe()/mana_remove() error handling
> that can cause warnings on uninitialized work structs, masked errors,
> and resource leaks when early probe steps fail.
> 
> Patches 1-2 move work struct initialization (link_change_work and
> gf_stats_work) to before any error path that could trigger
> mana_remove(), preventing WARN_ON in __flush_work() or debug object
> warnings when sync cancellation runs on uninitialized work structs.
> 
> Patch 3 prevents add_adev() from overwriting a port probe error,
> which could leave the driver in a broken state with NULL ports while
> reporting success.
> 
> Patch 4 changes 'goto out' to 'break' in mana_remove()'s port loop
> so that mana_destroy_eq() is always reached, preventing EQ leaks when
> a NULL port is encountered.
> 
> Erni Sri Satya Vennela (4):
>    net: mana: Init link_change_work before potential error paths in probe
>    net: mana: Init gf_stats_work before potential error paths in probe
>    net: mana: Don't overwrite port probe error with add_adev result
>    net: mana: Fix EQ leak in mana_remove on NULL port
> 
>   drivers/net/ethernet/microsoft/mana/mana_en.c | 28 +++++++++----------
>   1 file changed, 14 insertions(+), 14 deletions(-)
> 
I believe mana is already in the mainline so fixes go to the net tree?

^ permalink raw reply

* Re: [PATCH 2/2] Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
From: Wei Liu @ 2026-04-04  6:11 UTC (permalink / raw)
  To: mhklinux
  Cc: kys, haiyangz, wei.liu, decui, longli, tglx, mingo, bp,
	dave.hansen, hpa, maz, bigeasy, x86, linux-kernel, linux-hyperv
In-Reply-To: <20260402202400.1707-3-mhklkml@zohomail.com>

On Thu, Apr 02, 2026 at 01:24:00PM -0700, Michael Kelley wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> The Hyper-V ISRs, for normal guests and when running in the
> hypervisor root patition, are calling add_interrupt_randomness() as a
> primary source of entropy. The call is currently in the ISRs as a common
> place to handle both x86/x64 and arm64. On x86/x64, hypervisor interrupts
> come through a custom sysvec entry, and do not go through a generic
> interrupt handler. On arm64, hypervisor interrupts come through an
> emulated GICv3. GICv3 uses the generic handler handle_percpu_devid_irq(),
> which does not do add_interrupt_randomness() -- unlike its counterpart
> handle_percpu_irq(). But handle_percpu_devid_irq() is now updated to do
> the add_interrupt_randomness(). So add_interrupt_randomness() is now
> needed only in Hyper-V's x86/x64 custom sysvec path.
> 
> Move add_interrupt_randomness() from the Hyper-V ISRs into the Hyper-V
> x86/x64 custom sysvec path, matching the existing STIMER0 sysvec path.
> With this change, add_interrupt_randomness() is no longer called from any
> device drivers, which is appropriate.
> 
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Acked-by: Wei Liu <wei.liu@kernel.org>

^ permalink raw reply

* Re: [PATCH] mshv: Fix infinite fault loop on permission-denied GPA intercepts
From: Wei Liu @ 2026-04-04  5:26 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: Stanislav Kinsburskii, kys, haiyangz, wei.liu, decui, longli,
	linux-hyperv, linux-kernel
In-Reply-To: <20260402-rare-heretic-tench-adfc68@anirudhrb>

On Thu, Apr 02, 2026 at 03:44:58PM +0000, Anirudh Rayabharam wrote:
> On Tue, Mar 24, 2026 at 11:57:40PM +0000, Stanislav Kinsburskii wrote:
> > Prevent infinite fault loops when guests access memory regions without
> > proper permissions. Currently, mshv_handle_gpa_intercept() attempts to
> > remap pages for all faults on movable memory regions, regardless of
> > whether the access type is permitted. When a guest writes to a read-only
> > region, the remap succeeds but the region remains read-only, causing
> > immediate re-fault and spinning the vCPU indefinitely.
> > 
> > Validate intercept access type against region permissions before
> > attempting remaps. Reject writes to non-writable regions and executes to
> > non-executable regions early, returning false to let the VMM handle the
> > intercept appropriately.
> > 
> > This also closes a potential DoS vector where malicious guests could
> > intentionally trigger these fault loops to consume host resources.
> > 
> > Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions")
> > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > ---
> >  drivers/hv/mshv_root_main.c |   15 ++++++++++++---
> >  include/hyperv/hvgdk_mini.h |    6 ++++++
> >  include/hyperv/hvhdk.h      |    4 ++--
> >  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
> 

Applied to hyperv-fixes.

^ permalink raw reply

* Re: [PATCH] PCI: hv: Fix double ida_free in hv_pci_probe error path
From: Wei Liu @ 2026-04-04  5:19 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Sahil Chandna, kys, haiyangz, wei.liu, decui, longli, lpieralisi,
	kwilczynski, robh, bhelgaas, dan.j.williams, mhklinux,
	linux-hyperv, linux-pci, linux-kernel
In-Reply-To: <v7l3vghep7h52f3qyoxdbqzu6un5vhio3njy4u46wjrqzhrrvx@hl5imwzjwocf>

On Sat, Apr 04, 2026 at 08:12:36AM +0530, Manivannan Sadhasivam wrote:
> On Fri, Apr 03, 2026 at 05:09:29AM -0700, Sahil Chandna wrote:
> > If hv_pci_probe() fails after storing the domain number in
> > hbus->bridge->domain_nr, there is a call to free this domain_nr via
> > pci_bus_release_emul_domain_nr(), however, during cleanup, the bridge
> > release callback pci_release_host_bridge_dev() also frees the domain_nr
> > causing ida_free to be called on same ID twice and triggering following
> > warning:
> > 
> >   ida_free called for id=28971 which is not allocated.
> >   WARNING: lib/idr.c:594 at ida_free+0xdf/0x160, CPU#0: kworker/0:2/198
> >   Call Trace:
> >    pci_bus_release_emul_domain_nr+0x17/0x20
> >    pci_release_host_bridge_dev+0x4b/0x60
> >    device_release+0x3b/0xa0
> >    kobject_put+0x8e/0x220
> >    devm_pci_alloc_host_bridge_release+0xe/0x20
> >    devres_release_all+0x9a/0xd0
> >    device_unbind_cleanup+0x12/0xa0
> >    really_probe+0x1c5/0x3f0
> >    vmbus_add_channel_work+0x135/0x1a0
> > 
> > Fix this by letting pci core handle the free domain_nr and remove
> > the explicit free called in pci-hyperv driver.
> > 
> > Fixes: bcce8c74f1ce ("PCI: Enable host bridge emulation for PCI_DOMAINS_GENERIC platforms")
> > Signed-off-by: Sahil Chandna <sahilchandna@linux.microsoft.com>
> 
> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
> 
> Wei: I see that you've taken previous hyperv patches. So feel free to take this
> one also.

Sure. I will pick this up.

Wei

^ permalink raw reply

* Re: [PATCH] PCI: hv: Fix double ida_free in hv_pci_probe error path
From: Manivannan Sadhasivam @ 2026-04-04  2:42 UTC (permalink / raw)
  To: Sahil Chandna
  Cc: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
	robh, bhelgaas, dan.j.williams, mhklinux, linux-hyperv, linux-pci,
	linux-kernel
In-Reply-To: <20260403120933.466259-1-sahilchandna@linux.microsoft.com>

On Fri, Apr 03, 2026 at 05:09:29AM -0700, Sahil Chandna wrote:
> If hv_pci_probe() fails after storing the domain number in
> hbus->bridge->domain_nr, there is a call to free this domain_nr via
> pci_bus_release_emul_domain_nr(), however, during cleanup, the bridge
> release callback pci_release_host_bridge_dev() also frees the domain_nr
> causing ida_free to be called on same ID twice and triggering following
> warning:
> 
>   ida_free called for id=28971 which is not allocated.
>   WARNING: lib/idr.c:594 at ida_free+0xdf/0x160, CPU#0: kworker/0:2/198
>   Call Trace:
>    pci_bus_release_emul_domain_nr+0x17/0x20
>    pci_release_host_bridge_dev+0x4b/0x60
>    device_release+0x3b/0xa0
>    kobject_put+0x8e/0x220
>    devm_pci_alloc_host_bridge_release+0xe/0x20
>    devres_release_all+0x9a/0xd0
>    device_unbind_cleanup+0x12/0xa0
>    really_probe+0x1c5/0x3f0
>    vmbus_add_channel_work+0x135/0x1a0
> 
> Fix this by letting pci core handle the free domain_nr and remove
> the explicit free called in pci-hyperv driver.
> 
> Fixes: bcce8c74f1ce ("PCI: Enable host bridge emulation for PCI_DOMAINS_GENERIC platforms")
> Signed-off-by: Sahil Chandna <sahilchandna@linux.microsoft.com>

Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>

Wei: I see that you've taken previous hyperv patches. So feel free to take this
one also.

We will be proactive from next release onwards ;)

- Mani

> ---
>  drivers/pci/controller/pci-hyperv.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 2c7a406b4ba8..5616ad5d2a8f 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3778,7 +3778,7 @@ static int hv_pci_probe(struct hv_device *hdev,
>  					   hbus->bridge->domain_nr);
>  	if (!hbus->wq) {
>  		ret = -ENOMEM;
> -		goto free_dom;
> +		goto free_bus;
>  	}
> 
>  	hdev->channel->next_request_id_callback = vmbus_next_request_id;
> @@ -3874,8 +3874,6 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	vmbus_close(hdev->channel);
>  destroy_wq:
>  	destroy_workqueue(hbus->wq);
> -free_dom:
> -	pci_bus_release_emul_domain_nr(hbus->bridge->domain_nr);
>  free_bus:
>  	kfree(hbus);
>  	return ret;
> --
> 2.53.0
> 

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply

* RE: [PATCH 2/2] hv: vmbus: Remove vmbus_irq_initialized
From: Michael Kelley @ 2026-04-04  1:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-hyperv@vger.kernel.org,
	linux-rt-devel@lists.linux.dev
  Cc: K. Y. Srinivasan, Dexuan Cui, Haiyang Zhang, Jan Kiszka, Long Li,
	Wei Liu
In-Reply-To: <20260401151517.1743555-3-bigeasy@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Wednesday, April 1, 2026 8:15 AM
> 
> vmbus_irq_initialized is only true if the registration of the per-CPU
> threads succeeded. If it failed, the whole registration aborts and the
> vmbus_exit() path is never called.
> 
> Remove vmbus_irq_initialized.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Reviewed-by: Michael Kelley <mhklinux@outlook.com>
But see comment about the patch Subject prefix from Patch 1 of this series.

> ---
>  drivers/hv/vmbus_drv.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index e44275370ac2a..7417841cd1f70 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1392,8 +1392,6 @@ static void run_vmbus_irqd(unsigned int cpu)
>  	__vmbus_isr();
>  }
> 
> -static bool vmbus_irq_initialized;
> -
>  static struct smp_hotplug_thread vmbus_irq_threads = {
>  	.store                  = &vmbus_irqd,
>  	.setup			= vmbus_irqd_setup,
> @@ -1513,11 +1511,10 @@ static int vmbus_bus_init(void)
>  	 * the VMbus interrupt handler.
>  	 */
> 
> -	if (IS_ENABLED(CONFIG_PREEMPT_RT) && !vmbus_irq_initialized) {
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>  		ret = smpboot_register_percpu_thread(&vmbus_irq_threads);
>  		if (ret)
>  			goto err_kthread;
> -		vmbus_irq_initialized = true;
>  	}
> 
>  	if (vmbus_irq == -1) {
> @@ -1561,10 +1558,8 @@ static int vmbus_bus_init(void)
>  	else
>  		free_percpu_irq(vmbus_irq, &vmbus_evt);
>  err_setup:
> -	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
> -		vmbus_irq_initialized = false;
> -	}
>  err_kthread:
>  	bus_unregister(&hv_bus);
>  	return ret;
> @@ -3033,10 +3028,9 @@ static void __exit vmbus_exit(void)
>  		hv_remove_vmbus_handler();
>  	else
>  		free_percpu_irq(vmbus_irq, &vmbus_evt);
> -	if (IS_ENABLED(CONFIG_PREEMPT_RT) && vmbus_irq_initialized) {
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		smpboot_unregister_percpu_thread(&vmbus_irq_threads);
> -		vmbus_irq_initialized = false;
> -	}
> +
>  	for_each_online_cpu(cpu) {
>  		struct hv_per_cpu_context *hv_cpu
>  			= per_cpu_ptr(hv_context.cpu_context, cpu);
> --
> 2.53.0


^ permalink raw reply

* RE: [PATCH 1/2] hv: vmbus: Replace lockdep_hardirq_threaded() with lockdep annotation
From: Michael Kelley @ 2026-04-04  1:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-hyperv@vger.kernel.org,
	linux-rt-devel@lists.linux.dev
  Cc: K. Y. Srinivasan, Dexuan Cui, Haiyang Zhang, Jan Kiszka, Long Li,
	Wei Liu
In-Reply-To: <20260401151517.1743555-2-bigeasy@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Sent: Wednesday, April 1, 2026 8:15 AM
> 

Nit: For historical consistency, use "Drivers: hv: vmbus:" as the prefix for the
patch "Subject:" line.  Same in Patch 2 of this series.

Also, any reason not to have copied linux-kernel@vger.kernel.org? I know this
is pretty much just a Hyper-V thing, but I would have liked to see what the
Sashiko AI did with these two patches. :-)

> lockdep_hardirq_threaded() is supposed to be used within IRQ core code
> and not within drivers. It is not obvious from within the driver, that
> this is the only interrupt service routing and that it is not shared
> handler.

I presume you meant "routine". And what do you mean by "the only interrupt
service routine"? And why is the lack of obviousness relevant here? I don't have
deep expertise in lockdep, but evidently there's some conclusion to reach and it
would have helped me to have it spelled out.

> 
> Replace lockdep_hardirq_threaded() with a lockdep annotation limiting
> threaded context on PREEMPT_RT to __vmbus_isr().

Again, I'm not clear what "limiting threaded context" means. But see my
additional comment further down.

> 
> Fixes: f8e6343b7a89c ("Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  drivers/hv/vmbus_drv.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index bc4fc1951ae1c..e44275370ac2a 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1407,8 +1407,11 @@ void vmbus_isr(void)
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>  		vmbus_irqd_wake();
>  	} else {
> -		lockdep_hardirq_threaded();

I see two similar occurrences of LD_WAIT_CONFIG in the kernel:
__kfree_rcu_sheaf() and adjacent to printk_legacy_allow_spinlock_enter().
Both occurrences have a multi-line comment explaining the "why". I'd like
to see a similar comment here so that drive-by readers of the code have 
some idea of what's going on. My suggestion is something like this:

   vmbus_isr() always runs at hard IRQ level -- the interrupt is not threaded. It
   calls __vmbus_isr() here, which may obtain the spinlock_t sched_lock for
   a VMBus channel in vmbus_chan_sched(). If CONFIG_PROVE_LOCKING=y,
   lockdep complains because obtaining spinlock_t's is not permitted at hard
   IRQ level in PREEMPT_RT configurations. However, the PREEMPT_RT path
   is handled separately above, so there's actually not a problem. Tell
   lockdep that acquiring the spinlock_t is valid by temporarily raising
   the wait-type to LD_WAIT_CONFIG using the "fake" lock vmbus_map.
   If lockdep is not enabled, the acquire & release of the fake lock are no-ops,
   so performance is not impacted.

Please review my suggested text and revise as appropriate, as I'm far
from an expert on any of this. The above is based on what I've learned
just now from a bit of research.

And thanks for jumping in and making all this better ....

Michael

> +		static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
> +
> +		lock_map_acquire_try(&vmbus_map);
>  		__vmbus_isr();
> +		lock_map_release(&vmbus_map);
>  	}
>  }
>  EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
> --
> 2.53.0

^ permalink raw reply

* Re: [PATCH 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-03 19:19 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
	Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260403-natural-fuzzy-kakapo-f5cbab@anirudhrb>

On Fri, 3 Apr 2026, Anirudh Rayabharam wrote:

> Actually, I believe VMBus does run on a nested root partition. Doesn't
> it? That would then be another case to handle in this patch.

Good catch, v2 series is out.

Best,
Jork

^ permalink raw reply

* [PATCH v2 6/6] mshv: unmap debugfs stats pages on kexec
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260403190613.47026-1-jloeser@linux.microsoft.com>

On L1VH, debugfs stats pages are overlay pages: the kernel allocates
them and registers the GPAs with the hypervisor via
HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
hypervisor across kexec. If the kexec'd kernel reuses those physical
pages, the hypervisor's overlay semantics cause a machine check
exception.

Fix this by calling mshv_debugfs_exit() from the reboot notifier,
which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
kexec. This releases the overlay bindings so the physical pages can be
safely reused. Guard mshv_debugfs_exit() against being called when
init failed.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_debugfs.c   | 7 ++++++-
 drivers/hv/mshv_root_main.c | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
index ebf2549eb44d..f9a4499cf8f3 100644
--- a/drivers/hv/mshv_debugfs.c
+++ b/drivers/hv/mshv_debugfs.c
@@ -676,8 +676,10 @@ int __init mshv_debugfs_init(void)
 
 	mshv_debugfs = debugfs_create_dir("mshv", NULL);
 	if (IS_ERR(mshv_debugfs)) {
+		err = PTR_ERR(mshv_debugfs);
+		mshv_debugfs = NULL;
 		pr_err("%s: failed to create debugfs directory\n", __func__);
-		return PTR_ERR(mshv_debugfs);
+		return err;
 	}
 
 	if (hv_root_partition()) {
@@ -712,6 +714,9 @@ int __init mshv_debugfs_init(void)
 
 void mshv_debugfs_exit(void)
 {
+	if (!mshv_debugfs)
+		return;
+
 	mshv_debugfs_parent_partition_remove();
 
 	if (hv_root_partition()) {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 281f530b68a9..7038fd830646 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2252,6 +2252,7 @@ root_scheduler_deinit(void)
 static int mshv_reboot_notify(struct notifier_block *nb,
 			      unsigned long code, void *unused)
 {
+	mshv_debugfs_exit();
 	cpuhp_remove_state(mshv_cpuhp_online);
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260403190613.47026-1-jloeser@linux.microsoft.com>

Register the mshv reboot notifier for all parent partitions, not just
root. Previously the notifier was gated on hv_root_partition(), so on
L1VH (where hv_root_partition() is false) SINT0, SINT5, and SIRBP were
never cleaned up before kexec. The kexec'd kernel then inherited stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.

The L1VH SIRBP also needs special handling: unlike the root partition
where the hypervisor provides the SIRBP page, L1VH must allocate its
own page and program the GPA into the MSR. Add this allocation to
mshv_synic_init() and the corresponding free to mshv_synic_cleanup().

Remove the unnecessary mshv_root_partition_init/exit wrappers and
register the reboot notifier directly in mshv_parent_partition_init().
Make mshv_reboot_nb static since it no longer needs external linkage.

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_root_main.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e6509c980763..281f530b68a9 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -2256,20 +2256,10 @@ static int mshv_reboot_notify(struct notifier_block *nb,
 	return 0;
 }
 
-struct notifier_block mshv_reboot_nb = {
+static struct notifier_block mshv_reboot_nb = {
 	.notifier_call = mshv_reboot_notify,
 };
 
-static void mshv_root_partition_exit(void)
-{
-	unregister_reboot_notifier(&mshv_reboot_nb);
-}
-
-static int __init mshv_root_partition_init(struct device *dev)
-{
-	return register_reboot_notifier(&mshv_reboot_nb);
-}
-
 static int __init mshv_init_vmm_caps(struct device *dev)
 {
 	int ret;
@@ -2339,8 +2329,7 @@ static int __init mshv_parent_partition_init(void)
 	if (ret)
 		goto remove_cpu_state;
 
-	if (hv_root_partition())
-		ret = mshv_root_partition_init(dev);
+	ret = register_reboot_notifier(&mshv_reboot_nb);
 	if (ret)
 		goto remove_cpu_state;
 
@@ -2368,8 +2357,7 @@ static int __init mshv_parent_partition_init(void)
 deinit_root_scheduler:
 	root_scheduler_deinit();
 exit_partition:
-	if (hv_root_partition())
-		mshv_root_partition_exit();
+	unregister_reboot_notifier(&mshv_reboot_nb);
 remove_cpu_state:
 	cpuhp_remove_state(mshv_cpuhp_online);
 free_synic_pages:
@@ -2387,8 +2375,7 @@ static void __exit mshv_parent_partition_exit(void)
 	misc_deregister(&mshv_dev);
 	mshv_irqfd_wq_cleanup();
 	root_scheduler_deinit();
-	if (hv_root_partition())
-		mshv_root_partition_exit();
+	unregister_reboot_notifier(&mshv_reboot_nb);
 	cpuhp_remove_state(mshv_cpuhp_online);
 	free_percpu(mshv_root.synic_pages);
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260403190613.47026-1-jloeser@linux.microsoft.com>

The SynIC is shared between VMBus and MSHV. VMBus owns the message
page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).

Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
SCONTROL that VMBus already configured, and mshv_synic_cleanup()
disables all of them. This is wrong because MSHV can be torn down
while VMBus is still active. In particular, a kexec reboot notifier
tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
from under VMBus causes its later cleanup to write SynIC MSRs while
SynIC is disabled, which the hypervisor does not tolerate.

Restrict MSHV to managing only the resources it owns:
- SINT0, SINT5: mask on cleanup, unmask on init
- SIRBP: enable/disable as before
- SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
  and nested root partition); on a non-nested root partition VMBus
  doesn't run, so MSHV must enable/disable them

Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/mshv_synic.c | 142 ++++++++++++++++++++++++++--------------
 1 file changed, 94 insertions(+), 48 deletions(-)

diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f8b0337cdc82..7d273766bdb5 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -454,46 +454,72 @@ int mshv_synic_init(unsigned int cpu)
 #ifdef HYPERVISOR_CALLBACK_VECTOR
 	union hv_synic_sint sint;
 #endif
-	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 			&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 			&spages->synic_event_ring_page;
+	/* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
+	bool vmbus_active = !hv_root_partition() || hv_nested;
 
-	/* Setup the Synic's message page */
+	/*
+	 * Map the SYNIC message page. When VMBus is not active the
+	 * hypervisor pre-provisions the SIMP GPA but may not set
+	 * simp_enabled — enable it here.
+	 */
 	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
-	simp.simp_enabled = true;
+	if (!vmbus_active) {
+		simp.simp_enabled = true;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	}
 	*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
 			     HV_HYP_PAGE_SIZE,
 			     MEMREMAP_WB);
 
 	if (!(*msg_page))
-		return -EFAULT;
+		goto cleanup_simp;
 
-	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
-
-	/* Setup the Synic's event flags page */
+	/*
+	 * Map the event flags page. Same as SIMP: enable when
+	 * VMBus is not active, already enabled by VMBus otherwise.
+	 */
 	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
-	siefp.siefp_enabled = true;
+	if (!vmbus_active) {
+		siefp.siefp_enabled = true;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	}
 	*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
 				     PAGE_SIZE, MEMREMAP_WB);
 
 	if (!(*event_flags_page))
-		goto cleanup;
-
-	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+		goto cleanup_siefp;
 
 	/* Setup the Synic's event ring page */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
-	sirbp.sirbp_enabled = true;
-	*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
-				    PAGE_SIZE, MEMREMAP_WB);
 
-	if (!(*event_ring_page))
-		goto cleanup;
+	if (hv_root_partition()) {
+		*event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
+					    PAGE_SIZE, MEMREMAP_WB);
 
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+	} else {
+		/*
+		 * On L1VH the hypervisor does not provide a SIRBP page.
+		 * Allocate one and program its GPA into the MSR.
+		 */
+		*event_ring_page = (struct hv_synic_event_ring_page *)
+			get_zeroed_page(GFP_KERNEL);
+
+		if (!(*event_ring_page))
+			goto cleanup_siefp;
+
+		sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
+				>> PAGE_SHIFT;
+	}
+
+	sirbp.sirbp_enabled = true;
 	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
 
 #ifdef HYPERVISOR_CALLBACK_VECTOR
@@ -515,28 +541,30 @@ int mshv_synic_init(unsigned int cpu)
 			      sint.as_uint64);
 #endif
 
-	/* Enable global synic bit */
-	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
-	sctrl.enable = 1;
-	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	/* When VMBus is active it already enabled SCONTROL. */
+	if (!vmbus_active) {
+		union hv_synic_scontrol sctrl;
+
+		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+		sctrl.enable = 1;
+		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	}
 
 	return 0;
 
-cleanup:
-	if (*event_ring_page) {
-		sirbp.sirbp_enabled = false;
-		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-		memunmap(*event_ring_page);
-	}
-	if (*event_flags_page) {
+cleanup_siefp:
+	if (*event_flags_page)
+		memunmap(*event_flags_page);
+	if (!vmbus_active) {
 		siefp.siefp_enabled = false;
 		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
-		memunmap(*event_flags_page);
 	}
-	if (*msg_page) {
+cleanup_simp:
+	if (*msg_page)
+		memunmap(*msg_page);
+	if (!vmbus_active) {
 		simp.simp_enabled = false;
 		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
-		memunmap(*msg_page);
 	}
 
 	return -EFAULT;
@@ -545,16 +573,15 @@ int mshv_synic_init(unsigned int cpu)
 int mshv_synic_cleanup(unsigned int cpu)
 {
 	union hv_synic_sint sint;
-	union hv_synic_simp simp;
-	union hv_synic_siefp siefp;
 	union hv_synic_sirbp sirbp;
-	union hv_synic_scontrol sctrl;
 	struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
 	struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
 	struct hv_synic_event_flags_page **event_flags_page =
 		&spages->synic_event_flags_page;
 	struct hv_synic_event_ring_page **event_ring_page =
 		&spages->synic_event_ring_page;
+	/* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
+	bool vmbus_active = !hv_root_partition() || hv_nested;
 
 	/* Disable the interrupt */
 	sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
@@ -568,28 +595,47 @@ int mshv_synic_cleanup(unsigned int cpu)
 	hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
 			      sint.as_uint64);
 
-	/* Disable Synic's event ring page */
+	/* Disable SYNIC event ring page owned by MSHV */
 	sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
 	sirbp.sirbp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
-	memunmap(*event_ring_page);
 
-	/* Disable Synic's event flags page */
-	siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
-	siefp.siefp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+	if (hv_root_partition()) {
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		memunmap(*event_ring_page);
+	} else {
+		sirbp.base_sirbp_gpa = 0;
+		hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+		free_page((unsigned long)*event_ring_page);
+	}
+
+	/*
+	 * Release our mappings of the message and event flags pages.
+	 * When VMBus is not active, we enabled SIMP/SIEFP — disable
+	 * them. Otherwise VMBus owns the MSRs — leave them.
+	 */
 	memunmap(*event_flags_page);
+	if (!vmbus_active) {
+		union hv_synic_simp simp;
+		union hv_synic_siefp siefp;
 
-	/* Disable Synic's message page */
-	simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
-	simp.simp_enabled = false;
-	hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+		siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+		siefp.siefp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+
+		simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+		simp.simp_enabled = false;
+		hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+	}
 	memunmap(*msg_page);
 
-	/* Disable global synic bit */
-	sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
-	sctrl.enable = 0;
-	hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	/* When VMBus is active it owns SCONTROL — leave it. */
+	if (!vmbus_active) {
+		union hv_synic_scontrol sctrl;
+
+		sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+		sctrl.enable = 0;
+		hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+	}
 
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser, Anirudh Rayabharam,
	Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260403190613.47026-1-jloeser@linux.microsoft.com>

After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.

Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.

Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.

Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com>
Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com>
Co-developed-by: Mukesh Rathor <mukeshrathor@microsoft.com>
Signed-off-by: Mukesh Rathor <mukeshrathor@microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c |  7 +++++
 drivers/hv/hv_proc.c           | 47 ++++++++++++++++++++++++++++++++++
 include/asm-generic/mshyperv.h | 10 ++++++++
 include/hyperv/hvgdk_mini.h    |  1 +
 include/hyperv/hvhdk_mini.h    | 12 +++++++++
 5 files changed, 77 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 235087456bdf..f653feea880b 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -429,6 +429,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
 	}
 
 #ifdef CONFIG_X86_64
+	/* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
+	if (num_present_cpus() == 1 || hv_lp_exists(1))
+		return;
+
 	for_each_present_cpu(i) {
 		if (i == 0)
 			continue;
@@ -436,6 +440,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
 		BUG_ON(ret);
 	}
 
+	ret = hv_call_notify_all_processors_started();
+	WARN_ON(ret);
+
 	for_each_present_cpu(i) {
 		if (i == 0)
 			continue;
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 5f4fd9c3231c..63a48e5a02c5 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(hv_call_create_vp);
+
+int hv_call_notify_all_processors_started(void)
+{
+	struct hv_input_notify_partition_event *input;
+	u64 status;
+	unsigned long irq_flags;
+	int ret = 0;
+
+	local_irq_save(irq_flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
+	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
+				 input, NULL);
+	local_irq_restore(irq_flags);
+
+	if (!hv_result_success(status)) {
+		hv_status_err(status, "\n");
+		ret = hv_result_to_errno(status);
+	}
+	return ret;
+}
+
+bool hv_lp_exists(u32 lp_index)
+{
+	struct hv_input_get_logical_processor_run_time *input;
+	struct hv_output_get_logical_processor_run_time *output;
+	unsigned long flags;
+	u64 status;
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	input->lp_index = lp_index;
+	status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
+				 input, output);
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status) &&
+	    hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
+		hv_status_err(status, "\n");
+		BUG();
+	}
+
+	return hv_result_success(status);
+}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index d37b68238c97..bf601d67cecb 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
 int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
 int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
+int hv_call_notify_all_processors_started(void);
+bool hv_lp_exists(u32 lp_index);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
 
 #else /* CONFIG_MSHV_ROOT */
@@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
 {
 	return -EOPNOTSUPP;
 }
+static inline int hv_call_notify_all_processors_started(void)
+{
+	return -EOPNOTSUPP;
+}
+static inline bool hv_lp_exists(u32 lp_index)
+{
+	return false;
+}
 static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
 {
 	return -EOPNOTSUPP;
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 056ef7b6b360..f2598e186550 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 /* HV_CALL_CODE */
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE		0x0002
 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST		0x0003
+#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME		0x0004
 #define HVCALL_NOTIFY_LONG_SPIN_WAIT			0x0008
 #define HVCALL_SEND_IPI					0x000b
 #define HVCALL_ENABLE_VP_VTL				0x000f
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 091c03e26046..b4cb2fa26e9b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -362,6 +362,7 @@ union hv_partition_event_input {
 
 enum hv_partition_event {
 	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+	HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
 };
 
 struct hv_input_notify_partition_event {
@@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
 	union hv_partition_event_input input;
 } __packed;
 
+struct hv_input_get_logical_processor_run_time {
+	u32 lp_index;
+} __packed;
+
+struct hv_output_get_logical_processor_run_time {
+	u64 global_time;
+	u64 local_run_time;
+	u64 rsvdz0;
+	u64 hypervisor_time;
+} __packed;
+
 struct hv_lp_startup_status {
 	u64 hv_status;
 	u64 substatus1;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser, Anirudh Rayabharam
In-Reply-To: <20260403190613.47026-1-jloeser@linux.microsoft.com>

Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.

Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
 drivers/hv/vmbus_drv.c         | 1 -
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 89a2eb8a0722..235087456bdf 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -235,8 +235,12 @@ void hv_remove_crash_handler(void)
 #ifdef CONFIG_KEXEC_CORE
 static void hv_machine_shutdown(void)
 {
-	if (kexec_in_progress && hv_kexec_handler)
-		hv_kexec_handler();
+	if (kexec_in_progress) {
+		hv_stimer_global_cleanup();
+
+		if (hv_kexec_handler)
+			hv_kexec_handler();
+	}
 
 	/*
 	 * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 301273d61892..5d1449f8c6ea 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2892,7 +2892,6 @@ static struct platform_driver vmbus_platform_driver = {
 
 static void hv_kexec_handler(void)
 {
-	hv_stimer_global_cleanup();
 	vmbus_initiate_unload(false);
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
 	mb();
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 1/6] Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260403190613.47026-1-jloeser@linux.microsoft.com>

vmbus_alloc_synic_and_connect() declares a local 'int
hyperv_cpuhp_online' that shadows the file-scope global of the same
name. The cpuhp state returned by cpuhp_setup_state() is stored in
the local, leaving the global at 0 (CPUHP_OFFLINE). When
hv_kexec_handler() or hv_machine_shutdown() later call
cpuhp_remove_state(hyperv_cpuhp_online) they pass 0, which hits the
BUG_ON in __cpuhp_remove_state_cpuslocked().

Remove the local declaration so the cpuhp state is stored in the
file-scope global where hv_kexec_handler() and hv_machine_shutdown()
expect it.

Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection")
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
 drivers/hv/vmbus_drv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 3e7a52918ce0..301273d61892 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1430,7 +1430,6 @@ static int vmbus_alloc_synic_and_connect(void)
 {
 	int ret, cpu;
 	struct work_struct __percpu *works;
-	int hyperv_cpuhp_online;

 	ret = hv_synic_alloc();
 	if (ret < 0)
-- 
2.43.0

^ permalink raw reply related

* [PATCH v2 0/6] Hyper-V: kexec fixes for L1VH (mshv)
From: Jork Loeser @ 2026-04-03 19:06 UTC (permalink / raw)
  To: linux-hyperv
  Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
	linux-kernel, linux-arch, Jork Loeser

This series fixes kexec support when Linux runs as an L1 Virtual Host
(L1VH) under Hyper-V, using the MSHV driver to manage child VMs.

1. A variable shadowing bug in vmbus that hides the cpuhp state used
   for teardown.

2. Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
   hv_machine_shutdown(). This ensures stimer cleanup happens before
   the vmbus unload.

3. LP/VP re-creation: after kexec, logical processors and virtual
   processors already exist in the hypervisor. Detect this and skip
   re-adding them.

4-5. SynIC cleanup: the MSHV driver manages its own SynIC resources
     separately from vmbus. Add proper teardown of MSHV-owned SINTs,
     SIMP, and SIEFP on kexec, scoped to only the resources MSHV
     owns.

6. Debugfs stats pages: unmap the VP statistics overlay pages before
   kexec to avoid stale mappings in the new kernel.

Changes since v1:
- Patch 4: account for nested root partitions where VMBus is also
  active (not just L1VH); use a vmbus_active local variable; allocate
  SIRBP L1VH allocation path for when the hypervisor doesn't
  pre-provision the page

Jork Loeser (6):
  Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
  x86/hyperv: move stimer cleanup to hv_machine_shutdown()
  x86/hyperv: Skip LP/VP creation on kexec
  mshv: limit SynIC management to MSHV-owned resources
  mshv: clean up SynIC state on kexec for L1VH
  mshv: unmap debugfs stats pages on kexec

 arch/x86/kernel/cpu/mshyperv.c |  15 +++-
 drivers/hv/hv_proc.c           |  47 +++++++++++
 drivers/hv/mshv_debugfs.c      |   7 +-
 drivers/hv/mshv_root_main.c    |  22 ++---
 drivers/hv/mshv_synic.c        | 142 ++++++++++++++++++++++-----------
 drivers/hv/vmbus_drv.c         |   2 -
 include/asm-generic/mshyperv.h |  10 +++
 include/hyperv/hvgdk_mini.h    |   1 +
 include/hyperv/hvhdk_mini.h    |  12 +++
 9 files changed, 188 insertions(+), 70 deletions(-)

--
2.43.0


^ permalink raw reply

* RE: [PATCH] Drivers: hv: mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER
From: Michael Kelley @ 2026-04-03 18:37 UTC (permalink / raw)
  To: Naman Jain, Michael Kelley, K . Y . Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li
  Cc: Saurabh Sengar, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <9f0e27ac-099b-4b7d-b082-f43245331bbc@linux.microsoft.com>

From: Naman Jain <namjain@linux.microsoft.com> Sent: Friday, April 3, 2026 12:25 AM
> 

Nit: I wonder what's the best prefix to use in the patch Subject field.
"Drivers: hv: mshv_vtl:" is rather long.  There was agreement to use
just "mshv:" for the root partition code, and I probably misused that
in commit 754cf84504ea. How about just "mshv_vtl:" as the prefix for this
patch and other VTL patches going forward?

> On 4/3/2026 9:05 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@linux.microsoft.com> Sent: Tuesday, March 31, 2026 10:40 PM
> >>
> >> When registering VTL0 memory via MSHV_ADD_VTL0_MEMORY, the kernel
> >> computes pgmap->vmemmap_shift as the number of trailing zeros in the
> >> OR of start_pfn and last_pfn, intending to use the largest compound
> >> page order both endpoints are aligned to.
> >>
> >> However, this value is not clamped to MAX_FOLIO_ORDER, so a
> >> sufficiently aligned range (e.g. physical range 0x800000000000-
> >> 0x800080000000, corresponding to start_pfn=0x800000000 with 35
> >> trailing zeros) can produce a shift larger than what
> >> memremap_pages() accepts, triggering a WARN and returning -EINVAL:
> >>
> >>    WARNING: ... memremap_pages+0x512/0x650
> >>    requested folio size unsupported
> >>
> >> The MAX_FOLIO_ORDER check was added by
> >> commit 646b67d57589 ("mm/memremap: reject unreasonable folio/compound
> >> page sizes in memremap_pages()").
> >> When CONFIG_HAVE_GIGANTIC_FOLIOS=y, CONFIG_SPARSEMEM_VMEMMAP=y, and
> >> CONFIG_HUGETLB_PAGE is not set, MAX_FOLIO_ORDER resolves to
> >> (PUD_SHIFT - PAGE_SHIFT) = 18. Any range whose PFN alignment exceeds
> >> order 18 hits this path.
> >
> > I'm not clear on what point you are making with this specific
> > configuration that results in MAX_FOLIO_ORDER being 18. Is it just
> > an example? Is 18 the largest expected value for MAX_FOLIO_ORDER?
> > And note that PUD_SHIFT and PAGE_SHIFT might have different values
> > on arm64 with a page size other than 4K.
> >
> 
> Yes, this was just an example. It is not generalized enough, I will
> remove it.
> MAX_FOLIO_ORDER could go beyond 18.
> 
> >>
> >> Fix this by clamping vmemmap_shift to MAX_FOLIO_ORDER so we always
> >> request the largest order the kernel supports, rather than an
> >> out-of-range value.
> >>
> >> Also fix the error path to propagate the actual error code from
> >> devm_memremap_pages() instead of hard-coding -EFAULT, which was
> >> masking the real -EINVAL return.
> >>
> >> Fixes: 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver")
> >> Cc: <stable@vger.kernel.org>
> >> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
> >> ---
> >>   drivers/hv/mshv_vtl_main.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
> >> index 5856975f32e12..255fed3a740c1 100644
> >> --- a/drivers/hv/mshv_vtl_main.c
> >> +++ b/drivers/hv/mshv_vtl_main.c
> >> @@ -405,8 +405,12 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
> >>   	/*
> >>   	 * Determine the highest page order that can be used for the given memory range.
> >>   	 * This works best when the range is aligned; i.e. both the start and the length.
> >> +	 * Clamp to MAX_FOLIO_ORDER to avoid a WARN in memremap_pages() when the range
> >> +	 * alignment exceeds the maximum supported folio order for this kernel config.
> >>   	 */
> >> -	pgmap->vmemmap_shift = count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn);
> >> +	pgmap->vmemmap_shift = min_t(unsigned long,
> >> +				     count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn),
> >> +				     MAX_FOLIO_ORDER);
> >
> > Is it necessary to use min_t() here, or would min() work?  Neither count_trailing_zeros()
> > nor MAX_FOLIO_ORDER is ever negative, so it seems like just min() would work with
> > no potential for doing a bogus comparison or assignment.
> >
> 
> min could work, yes. I just felt min_t is more safer for comparing these
> two different types of values -
> count_trailing_zeroes being 'int'
> MAX_FOLIO_ORDER being a macro, calculated by applying bit operations.
> 
> and destination being unsigned int.
> 
> 
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          MAX_PAGE_ORDER
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          PFN_SECTION_SHIFT
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          (ilog2(SZ_16G) - PAGE_SHIFT)
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          (ilog2(SZ_1G) - PAGE_SHIFT)
> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          (PUD_SHIFT - PAGE_SHIFT)
> 
> I am fine with anything you suggest here.

There's a fair number of patches on LKML that are replacing min_t() with
min().  At some point in the not-too-distant past, the implementation of
min() was improved to deal with different but compatible integer types.
My sense is that min() is the better choice for general integer comparisons,
particularly when the values are known to be non-negative.

> 
> > The shift is calculated using the originally passed in start_pfn and last_pfn, while the
> > "range" struct in pgmap has an "end" value that is one page less. So is the idea to
> > go ahead and create the mapping with folios of a size that includes that last page,
> > and then just waste the last page of the last folio?
> 
> No, waste does not occur. The vmemmap_shift determines the folio order,
> and memmap_init_zone_device() walks the range [start_pfn, last_pfn) in
> steps of (1 << vmemmap_shift) pages, creating one folio per step. The OR
> operation ensures both boundaries are aligned to multiples of (1 <<
> vmemmap_shift), guaranteeing the range divides evenly into folios with
> no partial folio at the end.
> The intention is to find the highest folio order possible here, and if
> it reaches the MAX_FOLIO_ORDER, restrict vmemmap_shift to it.

OK, I figured out what is confusing me. I had a misunderstanding
when I reviewed this code during its original submission, and that
misunderstanding has influenced my (incorrect) review of this change.

The struct mshv_vtl_ram_disposition that is passed from user space has
two fields: start_pfn and last_pfn. But last_pfn is somewhat misnamed
in my view. For example, an aligned 2 MiB of memory would consist of
512 PFNs. If the first PFN is 0x200, the last PFN would be 0x3FF.  But in
the semantics of the struct, the last_pfn field should be 0x400.

In response to my comments in the original review, you added the comment
about last_pfn being excluded in the pagemap range, which is true. But it's
not because that page is somehow reserved or being wasted. It's because
the range is being described by specifying the PFN *after* the last PFN.

With the start_pfn and last_pfn fields used to determine the highest
page order that can be used, the slightly unorthodox semantics of
last_pfn make that calculation easy. But then you must subtract 1
from last_pfn when setting the range start and end for
devm_memremap_pages() to use. And the code does that, so the code
is all correct. The comment might be improved to speak about the
semantics of the last_pfn field, not that a page of memory is
intentionally being excluded/wasted.  And/or maybe the struct
mshv_vtl_ram_disposition definition should get a comment to clarify
the semantics of last_pfn.

Michael

^ permalink raw reply

* RE: [PATCH v2 4/4] drivers: hv: mark channel attributes as const
From: Michael Kelley @ 2026-04-03 17:46 UTC (permalink / raw)
  To: Thomas Weißschuh, Michael Kelley
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <8d8fdf46-5374-4e23-9c79-140117724441@t-8ch.de>

From: Thomas Weißschuh <linux@weissschuh.net> Sent: Friday, April 3, 2026 9:16 AM
> 
> On 2026-04-03 15:56:54+0000, Michael Kelley wrote:
> > From: Thomas Weißschuh <linux@weissschuh.net> Sent: Friday, April 3, 2026 1:29 AM
> > >
> > > These attributes are never modified, mark them as const.
> > >
> > > Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
> > > Tested-by: Michael Kelley <mhklinux@outlook.com>
> >
> > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> 
> Thanks.
> 
> > But take a look at this analysis from Sashiko AI:
> > https://sashiko.dev/#/patchset/20260403-sysfs-const-hv-v2-0-8932ab8d41db%40weissschuh.net
> >
> > This seems to be a valid concern with your earlier commit 7dd9fdb4939b9
> > "sysfs: attribute_group: enable const variants of is_visible()".
> 
> Indeed, I missed this user of ->is_visible.
> I'll send a fix for that, but it shouldn't affect this series.
> 

Agreed.

^ permalink raw reply

* Re: [PATCH] PCI: hv: Fix double ida_free in hv_pci_probe error path
From: Saurabh Singh Sengar @ 2026-04-03 17:16 UTC (permalink / raw)
  To: Sahil Chandna
  Cc: kys, haiyangz, wei.liu, decui, longli, lpieralisi, kwilczynski,
	mani, robh, bhelgaas, dan.j.williams, mhklinux, linux-hyperv,
	linux-pci, linux-kernel
In-Reply-To: <20260403120933.466259-1-sahilchandna@linux.microsoft.com>

On Fri, Apr 03, 2026 at 05:09:29AM -0700, Sahil Chandna wrote:
> If hv_pci_probe() fails after storing the domain number in
> hbus->bridge->domain_nr, there is a call to free this domain_nr via
> pci_bus_release_emul_domain_nr(), however, during cleanup, the bridge
> release callback pci_release_host_bridge_dev() also frees the domain_nr
> causing ida_free to be called on same ID twice and triggering following
> warning:
> 
>   ida_free called for id=28971 which is not allocated.
>   WARNING: lib/idr.c:594 at ida_free+0xdf/0x160, CPU#0: kworker/0:2/198
>   Call Trace:
>    pci_bus_release_emul_domain_nr+0x17/0x20
>    pci_release_host_bridge_dev+0x4b/0x60
>    device_release+0x3b/0xa0
>    kobject_put+0x8e/0x220
>    devm_pci_alloc_host_bridge_release+0xe/0x20
>    devres_release_all+0x9a/0xd0
>    device_unbind_cleanup+0x12/0xa0
>    really_probe+0x1c5/0x3f0
>    vmbus_add_channel_work+0x135/0x1a0
> 
> Fix this by letting pci core handle the free domain_nr and remove
> the explicit free called in pci-hyperv driver.
> 
> Fixes: bcce8c74f1ce ("PCI: Enable host bridge emulation for PCI_DOMAINS_GENERIC platforms")
> Signed-off-by: Sahil Chandna <sahilchandna@linux.microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 2c7a406b4ba8..5616ad5d2a8f 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3778,7 +3778,7 @@ static int hv_pci_probe(struct hv_device *hdev,
>  					   hbus->bridge->domain_nr);
>  	if (!hbus->wq) {
>  		ret = -ENOMEM;
> -		goto free_dom;
> +		goto free_bus;
>  	}
> 
>  	hdev->channel->next_request_id_callback = vmbus_next_request_id;
> @@ -3874,8 +3874,6 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	vmbus_close(hdev->channel);
>  destroy_wq:
>  	destroy_workqueue(hbus->wq);
> -free_dom:
> -	pci_bus_release_emul_domain_nr(hbus->bridge->domain_nr);
>  free_bus:
>  	kfree(hbus);
>  	return ret;
> --
> 2.53.0
> 

LGTM,
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>

^ permalink raw reply

* Re: [PATCH v2 4/4] drivers: hv: mark channel attributes as const
From: Thomas Weißschuh @ 2026-04-03 16:15 UTC (permalink / raw)
  To: Michael Kelley
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157D5F04608E4E3C21AB56ED45EA@SN6PR02MB4157.namprd02.prod.outlook.com>

On 2026-04-03 15:56:54+0000, Michael Kelley wrote:
> From: Thomas Weißschuh <linux@weissschuh.net> Sent: Friday, April 3, 2026 1:29 AM
> > 
> > These attributes are never modified, mark them as const.
> > 
> > Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
> > Tested-by: Michael Kelley <mhklinux@outlook.com>
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>

Thanks.

> But take a look at this analysis from Sashiko AI:
> https://sashiko.dev/#/patchset/20260403-sysfs-const-hv-v2-0-8932ab8d41db%40weissschuh.net
> 
> This seems to be a valid concern with your earlier commit 7dd9fdb4939b9
> "sysfs: attribute_group: enable const variants of is_visible()".

Indeed, I missed this user of ->is_visible.
I'll send a fix for that, but it shouldn't affect this series.


Thomas

^ permalink raw reply

* Re: [PATCH 3/7] mshv: Rename mshv_mem_region to mshv_region
From: Stanislav Kinsburskii @ 2026-04-03 16:09 UTC (permalink / raw)
  To: Anirudh Rayabharam
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <20260403-gainful-cerulean-asp-5a1a2d@anirudhrb>

On Fri, Apr 03, 2026 at 03:54:52AM +0000, Anirudh Rayabharam wrote:
> On Thu, Apr 02, 2026 at 10:57:09AM -0700, Stanislav Kinsburskii wrote:
> > On Thu, Apr 02, 2026 at 04:33:24PM +0000, Anirudh Rayabharam wrote:
> > > On Wed, Apr 01, 2026 at 10:12:40PM +0000, Stanislav Kinsburskii wrote:
> > > > The mshv_mem_region structure represents guest address space regions,
> > > > which can be either RAM-backed memory or memory-mapped IO regions
> > > > without physical backing. The "mem_" prefix incorrectly suggests the
> > > > structure only handles memory regions, creating confusion about its
> > > > actual purpose.
> > > > 
> > > > Remove the "mem_" prefix to align with existing function naming
> > > > (mshv_region_map, mshv_region_pin, etc.) and accurately reflect that
> > > > this structure manages arbitrary guest address space mappings
> > > > regardless of their backing type.
> > > 
> > > I don't think the "mem_" prefix automatically suggested the backing
> > > type.
> > > 
> > 
> > What else can it suggest?
> 
> To me it just suggested that the struct contained info or properties of
> a guest memory region. The name itself didn't suggest what backing type
> the memory has.
> 

Right, that’s what the old name suggested to me too. And that’s the
problem. It’s inaccurate for MMIO regions. They have nothing to do with
memory. They are never backed by it.

> > 
> > > Isn't mshv_region too vague now? Region of what?
> > > 
> > 
> > The region of address space, which can or can not be backed by memory.
> 
> Yeah but that's not obvious just from the new name. The new name just
> says mshv_region and doesn't what the region is of.
> 

I hear you, but what else can we do? Keeping mshv_mem_region isn’t good
for the reasons above. Renaming it to `mshv_gpa_region` also feels
redundant detailization.

To me, this is a good tradeoff. Yes, it’s a bit too generic. But it’s
concise, and we don’t have any other regions in the codebase. So I don’t
think it will cause confusion.

Thanks,
Stanislav

> Thanks,
> Anirudh.
> 
> > 
> > Thanks,
> > Stanislav
> > 
> > > Thanks,
> > > Anirudh.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox