* [PATCH 0/2] hv: vmbus: Small cleanups
From: Sebastian Andrzej Siewior @ 2026-04-01 15:15 UTC (permalink / raw)
To: linux-hyperv, linux-rt-devel
Cc: K. Y. Srinivasan, Dexuan Cui, Haiyang Zhang, Jan Kiszka, Long Li,
Michael Kelley, Wei Liu, Sebastian Andrzej Siewior
Replace the lockdep_hardirq_threaded() annotation, which does not
belong in drivers, and remove vmbus_irq_initialized, which seems
redundant.
Sebastian Andrzej Siewior (2):
hv: vmbus: Replace lockdep_hardirq_threaded() with lockdep annotation
hv: vmbus: Remove vmbus_irq_initialized
drivers/hv/vmbus_drv.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
--
2.53.0
^ permalink raw reply
* [PATCH 1/2] hv: vmbus: Replace lockdep_hardirq_threaded() with lockdep annotation
From: Sebastian Andrzej Siewior @ 2026-04-01 15:15 UTC (permalink / raw)
To: linux-hyperv, linux-rt-devel
Cc: K. Y. Srinivasan, Dexuan Cui, Haiyang Zhang, Jan Kiszka, Long Li,
Michael Kelley, Wei Liu, Sebastian Andrzej Siewior
In-Reply-To: <20260401151517.1743555-1-bigeasy@linutronix.de>
lockdep_hardirq_threaded() is supposed to be used within IRQ core code
and not within drivers. It is not obvious from within the driver that
this is the only interrupt service routine and that it is not a shared
handler.
Replace lockdep_hardirq_threaded() with a lockdep annotation limiting
threaded context on PREEMPT_RT to __vmbus_isr().
Fixes: f8e6343b7a89c ("Drivers: hv: vmbus: Use kthread for vmbus interrupts on PREEMPT_RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
drivers/hv/vmbus_drv.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc4fc1951ae1c..e44275370ac2a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1407,8 +1407,11 @@ void vmbus_isr(void)
if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
vmbus_irqd_wake();
} else {
- lockdep_hardirq_threaded();
+ static DEFINE_WAIT_OVERRIDE_MAP(vmbus_map, LD_WAIT_CONFIG);
+
+ lock_map_acquire_try(&vmbus_map);
__vmbus_isr();
+ lock_map_release(&vmbus_map);
}
}
EXPORT_SYMBOL_FOR_MODULES(vmbus_isr, "mshv_vtl");
--
2.53.0
^ permalink raw reply related
* [PATCH] Drivers: hv: mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER
From: Naman Jain @ 2026-04-01 5:40 UTC (permalink / raw)
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Michael Kelley
Cc: Saurabh Sengar, Naman Jain, linux-hyperv, linux-kernel
When registering VTL0 memory via MSHV_ADD_VTL0_MEMORY, the kernel
computes pgmap->vmemmap_shift as the number of trailing zeros in the
OR of start_pfn and last_pfn, intending to use the largest compound
page order both endpoints are aligned to.
However, this value is not clamped to MAX_FOLIO_ORDER, so a
sufficiently aligned range (e.g. physical range 0x800000000000-
0x800080000000, corresponding to start_pfn=0x800000000 with 35
trailing zeros) can produce a shift larger than what
memremap_pages() accepts, triggering a WARN and returning -EINVAL:
WARNING: ... memremap_pages+0x512/0x650
requested folio size unsupported
The MAX_FOLIO_ORDER check was added by
commit 646b67d57589 ("mm/memremap: reject unreasonable folio/compound
page sizes in memremap_pages()").
When CONFIG_HAVE_GIGANTIC_FOLIOS=y, CONFIG_SPARSEMEM_VMEMMAP=y, and
CONFIG_HUGETLB_PAGE is not set, MAX_FOLIO_ORDER resolves to
(PUD_SHIFT - PAGE_SHIFT) = 18. Any range whose PFN alignment exceeds
order 18 hits this path.
Fix this by clamping vmemmap_shift to MAX_FOLIO_ORDER so we always
request the largest order the kernel supports, rather than an
out-of-range value.
Also fix the error path to propagate the actual error code from
devm_memremap_pages() instead of hard-coding -EFAULT, which was
masking the real -EINVAL return.
Fixes: 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver")
Cc: <stable@vger.kernel.org>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
drivers/hv/mshv_vtl_main.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index 5856975f32e12..255fed3a740c1 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -405,8 +405,12 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
/*
* Determine the highest page order that can be used for the given memory range.
* This works best when the range is aligned; i.e. both the start and the length.
+ * Clamp to MAX_FOLIO_ORDER to avoid a WARN in memremap_pages() when the range
+ * alignment exceeds the maximum supported folio order for this kernel config.
*/
- pgmap->vmemmap_shift = count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn);
+ pgmap->vmemmap_shift = min_t(unsigned long,
+ count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn),
+ MAX_FOLIO_ORDER);
dev_dbg(vtl->module_dev,
"Add VTL0 memory: start: 0x%llx, end_pfn: 0x%llx, page order: %lu\n",
vtl0_mem.start_pfn, vtl0_mem.last_pfn, pgmap->vmemmap_shift);
@@ -415,7 +419,7 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
if (IS_ERR(addr)) {
dev_err(vtl->module_dev, "devm_memremap_pages error: %ld\n", PTR_ERR(addr));
kfree(pgmap);
- return -EFAULT;
+ return PTR_ERR(addr);
}
/* Don't free pgmap, since it has to stick around until the memory
base-commit: 36ece9697e89016181e5ae87510e40fb31d86f2b
--
2.43.0
^ permalink raw reply related
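The clamping described in the patch above can be sketched in userspace C.
This is only an illustration, not the driver code: the ex_-prefixed names
are hypothetical, and EX_MAX_FOLIO_ORDER is hard-coded to 18, the value the
commit message gives for one particular kernel configuration.

```c
#include <assert.h>

/* Assumed for illustration: the commit message's PUD_SHIFT - PAGE_SHIFT = 18. */
#define EX_MAX_FOLIO_ORDER 18UL

/* Count trailing zero bits; the caller must pass a nonzero value. */
static unsigned long ex_ctz(unsigned long long v)
{
	unsigned long n = 0;

	assert(v != 0);
	while ((v & 1ULL) == 0) {
		v >>= 1;
		n++;
	}
	return n;
}

/*
 * Largest order that both PFNs are aligned to, clamped to the assumed
 * maximum folio order so the result never exceeds what the (hypothetical)
 * consumer accepts.
 */
static unsigned long ex_vmemmap_shift(unsigned long long start_pfn,
				      unsigned long long last_pfn)
{
	unsigned long order = ex_ctz(start_pfn | last_pfn);

	return order < EX_MAX_FOLIO_ORDER ? order : EX_MAX_FOLIO_ORDER;
}
```

With start_pfn=0x800000000 from the commit message, ex_ctz() alone yields 35,
while ex_vmemmap_shift() never returns more than 18 under the assumed limit.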
* Re: [PATCH net-next] net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG
From: patchwork-bot+netdevbpf @ 2026-04-01 3:20 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
shirazsaleem, kees, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260326173101.2010514-1-ernis@linux.microsoft.com>
Hello:
This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 26 Mar 2026 10:30:56 -0700 you wrote:
> As a part of MANA hardening for CVM, validate the adapter_mtu value
> returned from the MANA_QUERY_DEV_CONFIG HWC command.
>
> The adapter_mtu value is used to compute ndev->max_mtu via:
> gc->adapter_mtu - ETH_HLEN. If hardware returns a bogus adapter_mtu
> smaller than ETH_HLEN (e.g. 0), the unsigned subtraction wraps to a
> huge value, silently allowing oversized MTU settings.
>
> [...]
Here is the summary with links:
- [net-next] net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG
https://git.kernel.org/netdev/net-next/c/d7709812e13d
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Alex Williamson @ 2026-03-31 19:18 UTC (permalink / raw)
To: Danilo Krummrich, Jason Gunthorpe
Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Christophe Leroy (CS GROUP), linux-kernel, driver-core,
linuxppc-dev, linux-hyperv, linux-pci, platform-driver-x86,
linux-arm-msm, linux-remoteproc, linux-s390, linux-spi,
virtualization, kvm, xen-devel, linux-arm-kernel, Gui-Dong Han,
alex
In-Reply-To: <DHGTLY0FJWD2.2VLT6NQWF97YY@kernel.org>
On Tue, 31 Mar 2026 10:22:14 +0200
"Danilo Krummrich" <dakr@kernel.org> wrote:
> On Tue Mar 31, 2026 at 10:06 AM CEST, Danilo Krummrich wrote:
> > On Mon Mar 30, 2026 at 10:10 PM CEST, Alex Williamson wrote:
> >> On Mon, 30 Mar 2026 19:38:41 +0200
> >> "Danilo Krummrich" <dakr@kernel.org> wrote:
> >>
> >>> (Cc: Jason)
> >>>
> >>> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
> >>> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >>> > index d43745fe4c84..460852f79f29 100644
> >>> > --- a/drivers/vfio/pci/vfio_pci_core.c
> >>> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> >>> > @@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
> >>> > pdev->is_virtfn && physfn == vdev->pdev) {
> >>> > pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
> >>> > pci_name(pdev));
> >>> > - pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
> >>> > - vdev->vdev.ops->name);
> >>> > - WARN_ON(!pdev->driver_override);
> >>> > + WARN_ON(device_set_driver_override(&pdev->dev,
> >>> > + vdev->vdev.ops->name));
> >>>
> >>> Technically, this is a change in behavior. If vdev->vdev.ops->name is NULL, it
> >>> will trigger the WARN_ON(), whereas before it would have just written "(null)"
> >>> into driver_override.
> >>
> >> It's worse than that. Looking at the implementation in [1], we have:
> >>
> >> +static inline int device_set_driver_override(struct device *dev, const char *s)
> >> +{
> >> + return __device_set_driver_override(dev, s, strlen(s));
> >> +}
> >>
> >> So if name is NULL, we oops in strlen() before we even hit the -EINVAL
> >> and WARN_ON().
> >
> > This was changed in v2 [2] and the actual code in-tree is
> >
> > static inline int device_set_driver_override(struct device *dev, const char *s)
> > {
> > return __device_set_driver_override(dev, s, s ? strlen(s) : 0);
> > }
> >
> > so it does indeed return -EINVAL for a NULL pointer.
Ok, good.
> >> I don't believe we have any vfio-pci variant drivers where the name is
> >> NULL, but kasprintf() handling NULL as "(null)" was a consideration in
> >> this design, that even if there is no name the device is sequestered
> >> with a driver_override that won't match an actual driver.
> >>
> >>> I assume that vfio_pci_core drivers are expected to set the name in struct
> >>> vfio_device_ops in the first place and this code (silently) relies on this
> >>> invariant?
> >>
> >> We do expect that, but it was previously safe either way to make sure
> >> VFs are only bound to the same ops driver or barring that, at least
> >> don't perform a standard driver match. The last thing we want to
> >> happen automatically is for a user owned PF to create SR-IOV VFs that
> >> automatically bind to native kernel drivers.
> >>
> >>> Alex, Jason: Should we keep this hunk above as is and check for a proper name in
> >>> struct vfio_device_ops in vfio_pci_core_register_device() with a subsequent
> >>> patch?
> >>
> >> Given the oops, my preference would be to roll it in here. This change
> >> is what makes it a requirement that name cannot be NULL, where this was
> >> safely handled with kasprintf().
> >
> > Again, no oops here. :)
> >
> > I still think it makes more sense to fail early in
> > vfio_pci_core_register_device(), rather than silently accept "(null)" in
> > driver_override. It also doesn't seem unreasonable with only the WARN_ON(), but
> > I can also just add vdev->vdev.ops->name ?: "(null)".
>
> (Or just skip the call if !vdev->vdev.ops->name, as a user will read "(null)"
> from sysfs either way.)
Hmm, I suppose they would, but there's a fundamental difference between
driver_override being set to "(null)" vs being NULL and only
interpreted as "(null)" via show. The former would prevent ID table
matching, the latter would not. The former is what was intended here,
without realizing both look the same through sysfs and present a
difficult debugging challenge.
Given no oops for strlen() now and no known in-kernel drivers with a
NULL name, I can post a patch that rejects a NULL name in the ops
structure in vfio_pci_core_register_device(), avoiding the entire
situation.
For this,
Acked-by: Alex Williamson <alex@shazbot.org>
Thanks,
Alex
^ permalink raw reply
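The distinction discussed above (an override holding the literal string
"(null)" versus no override at all) can be modelled with a small userspace
sketch. Everything here is hypothetical: struct ex_device and the ex_
functions only imitate the behaviour described in the thread and are not the
driver-core API; the real helper may treat other inputs differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device; NULL means "no override set". */
struct ex_device {
	char *driver_override;
};

/* Reject NULL up front, using the NULL-safe length idiom quoted above. */
static int ex_set_driver_override(struct ex_device *dev, const char *s)
{
	size_t len = s ? strlen(s) : 0;
	char *copy;

	assert(dev != NULL);
	if (!s)
		return -EINVAL;

	copy = malloc(len + 1);
	if (!copy)
		return -ENOMEM;
	memcpy(copy, s, len + 1);

	free(dev->driver_override);
	dev->driver_override = copy;
	return 0;
}

/*
 * Returns 1 if the override names this driver, 0 if an override is set
 * but names another driver, and -1 if no override is set, in which case
 * the caller would fall back to ID-table matching.
 */
static int ex_match_override(const struct ex_device *dev, const char *drv_name)
{
	if (!dev->driver_override)
		return -1;
	return strcmp(dev->driver_override, drv_name) == 0;
}
```

In this model, an override of "(null)" blocks every driver that is not
literally named "(null)", whereas a NULL override returns -1 and leaves
normal matching in place. That is the difference Alex points out, even
though both cases read the same through sysfs.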
* Re: [PATCH net-next] net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
From: Erni Sri Satya Vennela @ 2026-03-31 18:02 UTC (permalink / raw)
To: Paolo Abeni
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, ssengar, dipayanroy, gargaditya, shirazsaleem,
kees, linux-hyperv, netdev, linux-kernel
In-Reply-To: <f0bc585f-b1d2-46e6-b0eb-801881862692@redhat.com>
> +
> > + if (*max_sq == 0 || *max_rq == 0) {
> > + netdev_err(apc->ndev, "Invalid max queues from vPort config\n");
> > + return -EPROTO;
>
> AI review says:
>
> Will returning -EPROTO here expose a pre-existing resource leak in the
> driver's teardown path?
> If mana_query_vport_cfg() returns an error, mana_init_port() fails and
> mana_probe_port() frees the ndev, leaving ac->ports[i] as NULL. In
> mana_probe(), the port initialization loop breaks upon this error, but
> the err variable is then overwritten:
>
> mana_probe() {
> ...
> for (i = 0; i < ac->num_ports; i++) {
> err = mana_probe_port(ac, i, &ac->ports[i]);
> if (err) {
> dev_err(dev, "Probe Failed for port %d\n", i);
> break;
> }
> }
>
> err = add_adev(gd, "eth");
> ...
> }
>
> If add_adev() succeeds, mana_probe() completes successfully instead of
> failing, masking the earlier error while leaving ac->ports[0] as NULL.
> Later, when the driver is unloaded or if add_adev() fails and triggers
> immediate cleanup, mana_remove() is called. It iterates over ac->ports
> and, upon encountering the NULL device, immediately executes goto out:
>
> mana_remove() {
> ...
> for (i = 0; i < ac->num_ports; i++) {
> ndev = ac->ports[i];
> if (!ndev) {
> if (i == 0)
> ...
> goto out;
> }
> ...
> }
>
> mana_destroy_eq(ac);
> out:
> ...
> }
>
> Because the out label in mana_remove() is located after the
> mana_destroy_eq(ac) call, jumping there completely skips destroying the
> event queues allocated earlier by mana_create_eq(ac).
> In a Confidential Virtual Machine context, could an untrusted hypervisor
> repeatedly return invalid configs to continuously leak guest memory and
> hardware queues?
Thank you for the review.
Since these issues are pre-existing, I will address them in a separate
patchset.
The patchset will also include the issues reported in:
[PATCH net-next] net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG
- Vennela
^ permalink raw reply
* Re: [PATCH net-next] net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG
From: Erni Sri Satya Vennela @ 2026-03-31 18:00 UTC (permalink / raw)
To: Paolo Abeni
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, ssengar, dipayanroy, gargaditya, shirazsaleem,
kees, linux-hyperv, netdev, linux-kernel
In-Reply-To: <4ceecee2-ea5a-4026-ad38-66c0a7d263cd@redhat.com>
> > - if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
> > + if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
> > + if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
> > + dev_err(dev, "Adapter MTU too small: %u\n",
> > + resp.adapter_mtu);
> > + return -EPROTO;
>
> AI review says:
>
> If this returns -EPROTO, does the caller mana_probe() jump to an error
> label and call mana_remove()?
> If so, mana_remove() unconditionally calls
> disable_work_sync(&ac->link_change_work) and
> cancel_delayed_work_sync(&ac->gf_stats_work).
> Since mana_query_device_cfg() is called before INIT_WORK() and
> INIT_DELAYED_WORK() in the probe sequence, wouldn't this result in
> calling sync cancellation functions on uninitialized, zeroed work
> structures?
> This can lead to a WARN_ON(!work->func) in __flush_work(), or debug
> object warnings if CONFIG_DEBUG_OBJECTS_WORK is enabled.
> While this initialization issue appears to already exist for other early
> error paths, this new error path can also trigger it.
Thank you for the review.
Since the issue is pre-existing, I will send a fix in a separate patchset.
The patchset will also cover the issue reported with:
[PATCH net-next] net: mana: hardening: Reject zero max_num_queues from
MANA_QUERY_VPORT_CONFIG
- Vennela
^ permalink raw reply
* Re: [PATCH net-next v4] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-03-31 17:58 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, pabeni, kotaranov, horms, shradhagupta, shirazsaleem,
dipayanroy, yury.norov, kees, ssengar, gargaditya, linux-hyperv,
netdev, linux-kernel, linux-rdma
In-Reply-To: <20260330152318.144c1b30@kernel.org>
On Mon, Mar 30, 2026 at 03:23:18PM -0700, Jakub Kicinski wrote:
> On Mon, 30 Mar 2026 12:09:52 -0700 Erni Sri Satya Vennela wrote:
> > Just a quick follow‑up on this. Since these issues were pre‑existing and
> > not introduced by this patch, would you prefer that I send them as a
> > separate fix patch, or fold the fixes into the current patch?
>
> Anything that's pre-existing should be a separate patch, before any new
> code. If the bug exists only in net-next - earlier in the same series,
> if the bug exists in net - posted separately for the net tree.
Thank you for the explanation.
I will send it in a separate patch.
- Vennela
^ permalink raw reply
* Re: [PATCH 06/12] platform/wmi: use generic driver_override infrastructure
From: Ilpo Järvinen @ 2026-03-31 15:50 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP), LKML,
driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Gui-Dong Han
In-Reply-To: <f819b7d8-5c80-4463-9afa-933a2ddc8ab3@kernel.org>
On Tue, 31 Mar 2026, Danilo Krummrich wrote:
> On 3/31/26 5:02 PM, Ilpo Järvinen wrote:
> > I tried applying this to platform-drivers tree but it failed to compile so
> > I ended up dropping the change.
>
> As the cover letter mentions, it sits on top of v7.0-rc5, did you consider this?
I noticed it but I just assumed you were working on top of that, not that
there's something past -rc1 that is required.
> I can also pick it up via the driver-core tree.
If there's some post -rc1 material this depends on, it's probably better
that way.
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
--
i.
^ permalink raw reply
* Re: [PATCH 06/12] platform/wmi: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-31 15:43 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP), LKML,
driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Gui-Dong Han
In-Reply-To: <f15629e4-ef8f-b1b6-0158-064f40f111da@linux.intel.com>
On 3/31/26 5:02 PM, Ilpo Järvinen wrote:
> I tried applying this to platform-drivers tree but it failed to compile so
> I ended up dropping the change.
As the cover letter mentions, it sits on top of v7.0-rc5, did you consider this?
I can also pick it up via the driver-core tree.
Thanks,
Danilo
^ permalink raw reply
* Re: [PATCH 06/12] platform/wmi: use generic driver_override infrastructure
From: Ilpo Järvinen @ 2026-03-31 15:02 UTC (permalink / raw)
To: Danilo Krummrich
Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP), LKML,
driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Gui-Dong Han
In-Reply-To: <20260324005919.2408620-7-dakr@kernel.org>
On Tue, 24 Mar 2026, Danilo Krummrich wrote:
> When a driver is probed through __driver_attach(), the bus' match()
> callback is called without the device lock held, thus accessing the
> driver_override field without a lock, which can cause a UAF.
>
> Fix this by using the driver-core driver_override infrastructure taking
> care of proper locking internally.
>
> Note that calling match() from __driver_attach() without the device lock
> held is intentional. [1]
>
> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
> Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
> Fixes: 12046f8c77e0 ("platform/x86: wmi: Add driver_override support")
> Signed-off-by: Danilo Krummrich <dakr@kernel.org>
> ---
> drivers/platform/wmi/core.c | 36 +++++-------------------------------
> include/linux/wmi.h | 4 ----
> 2 files changed, 5 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/platform/wmi/core.c b/drivers/platform/wmi/core.c
> index b8e6b9a421c6..750e3619724e 100644
> --- a/drivers/platform/wmi/core.c
> +++ b/drivers/platform/wmi/core.c
> @@ -842,39 +842,11 @@ static ssize_t expensive_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(expensive);
>
> -static ssize_t driver_override_show(struct device *dev, struct device_attribute *attr,
> - char *buf)
> -{
> - struct wmi_device *wdev = to_wmi_device(dev);
> - ssize_t ret;
> -
> - device_lock(dev);
> - ret = sysfs_emit(buf, "%s\n", wdev->driver_override);
> - device_unlock(dev);
> -
> - return ret;
> -}
> -
> -static ssize_t driver_override_store(struct device *dev, struct device_attribute *attr,
> - const char *buf, size_t count)
> -{
> - struct wmi_device *wdev = to_wmi_device(dev);
> - int ret;
> -
> - ret = driver_set_override(dev, &wdev->driver_override, buf, count);
> - if (ret < 0)
> - return ret;
> -
> - return count;
> -}
> -static DEVICE_ATTR_RW(driver_override);
> -
> static struct attribute *wmi_attrs[] = {
> &dev_attr_modalias.attr,
> &dev_attr_guid.attr,
> &dev_attr_instance_count.attr,
> &dev_attr_expensive.attr,
> - &dev_attr_driver_override.attr,
> NULL
> };
> ATTRIBUTE_GROUPS(wmi);
> @@ -943,7 +915,6 @@ static void wmi_dev_release(struct device *dev)
> {
> struct wmi_block *wblock = dev_to_wblock(dev);
>
> - kfree(wblock->dev.driver_override);
> kfree(wblock);
> }
>
> @@ -952,10 +923,12 @@ static int wmi_dev_match(struct device *dev, const struct device_driver *driver)
> const struct wmi_driver *wmi_driver = to_wmi_driver(driver);
> struct wmi_block *wblock = dev_to_wblock(dev);
> const struct wmi_device_id *id = wmi_driver->id_table;
> + int ret;
>
> /* When driver_override is set, only bind to the matching driver */
> - if (wblock->dev.driver_override)
> - return !strcmp(wblock->dev.driver_override, driver->name);
> + ret = device_match_driver_override(dev, driver);
> + if (ret >= 0)
> + return ret;
>
> if (id == NULL)
> return 0;
> @@ -1076,6 +1049,7 @@ static struct class wmi_bus_class = {
> static const struct bus_type wmi_bus_type = {
> .name = "wmi",
> .dev_groups = wmi_groups,
> + .driver_override = true,
> .match = wmi_dev_match,
> .uevent = wmi_dev_uevent,
> .probe = wmi_dev_probe,
> diff --git a/include/linux/wmi.h b/include/linux/wmi.h
> index 75cb0c7cfe57..14fb644e1701 100644
> --- a/include/linux/wmi.h
> +++ b/include/linux/wmi.h
> @@ -18,16 +18,12 @@
> * struct wmi_device - WMI device structure
> * @dev: Device associated with this WMI device
> * @setable: True for devices implementing the Set Control Method
> - * @driver_override: Driver name to force a match; do not set directly,
> - * because core frees it; use driver_set_override() to
> - * set or clear it.
> *
> * This represents WMI devices discovered by the WMI driver core.
> */
> struct wmi_device {
> struct device dev;
> bool setable;
> - const char *driver_override;
> };
>
> /**
>
Hi,
I tried applying this to platform-drivers tree but it failed to compile so
I ended up dropping the change.
--
i.
^ permalink raw reply
* Re: [PATCH net-next] net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
From: Paolo Abeni @ 2026-03-31 9:33 UTC (permalink / raw)
To: Erni Sri Satya Vennela, kys, haiyangz, wei.liu, decui, longli,
andrew+netdev, davem, edumazet, kuba, ssengar, dipayanroy,
gargaditya, shirazsaleem, kees, linux-hyperv, netdev,
linux-kernel
In-Reply-To: <20260326174815.2012137-1-ernis@linux.microsoft.com>
On 3/26/26 6:48 PM, Erni Sri Satya Vennela wrote:
> As a part of MANA hardening for CVM, validate that max_num_sq and
> max_num_rq returned by MANA_QUERY_VPORT_CONFIG are not zero. These
> values flow into apc->num_queues, which is used as an allocation count
> and loop bound. A zero value would result in zero-size allocations and
> incorrect driver behavior.
>
> Return -EPROTO if either value is zero.
>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index b39e8b920791..a4197b4b0597 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1249,6 +1249,12 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
>
> *max_sq = resp.max_num_sq;
> *max_rq = resp.max_num_rq;
> +
> + if (*max_sq == 0 || *max_rq == 0) {
> + netdev_err(apc->ndev, "Invalid max queues from vPort config\n");
> + return -EPROTO;
AI review says:
Will returning -EPROTO here expose a pre-existing resource leak in the
driver's teardown path?
If mana_query_vport_cfg() returns an error, mana_init_port() fails and
mana_probe_port() frees the ndev, leaving ac->ports[i] as NULL. In
mana_probe(), the port initialization loop breaks upon this error, but
the err variable is then overwritten:
mana_probe() {
...
for (i = 0; i < ac->num_ports; i++) {
err = mana_probe_port(ac, i, &ac->ports[i]);
if (err) {
dev_err(dev, "Probe Failed for port %d\n", i);
break;
}
}
err = add_adev(gd, "eth");
...
}
If add_adev() succeeds, mana_probe() completes successfully instead of
failing, masking the earlier error while leaving ac->ports[0] as NULL.
Later, when the driver is unloaded or if add_adev() fails and triggers
immediate cleanup, mana_remove() is called. It iterates over ac->ports
and, upon encountering the NULL device, immediately executes goto out:
mana_remove() {
...
for (i = 0; i < ac->num_ports; i++) {
ndev = ac->ports[i];
if (!ndev) {
if (i == 0)
...
goto out;
}
...
}
mana_destroy_eq(ac);
out:
...
}
Because the out label in mana_remove() is located after the
mana_destroy_eq(ac) call, jumping there completely skips destroying the
event queues allocated earlier by mana_create_eq(ac).
In a Confidential Virtual Machine context, could an untrusted hypervisor
repeatedly return invalid configs to continuously leak guest memory and
hardware queues?
^ permalink raw reply
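The label-placement problem raised in the review above can be reduced to a
small userspace sketch. It is not the MANA driver code: struct ex_ctx and
both functions are hypothetical, and ex_remove_fixed() is just one possible
restructuring, not the fix the author intends to send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_PORTS 4

/* Hypothetical context: port pointers plus a flag standing in for the EQs. */
struct ex_ctx {
	void *ports[EX_MAX_PORTS];
	size_t num_ports;
	bool eq_allocated;
};

/* Pattern from the review: the label sits after the EQ teardown. */
static void ex_remove_leaky(struct ex_ctx *c)
{
	size_t i;

	assert(c->num_ports <= EX_MAX_PORTS);
	for (i = 0; i < c->num_ports; i++) {
		if (!c->ports[i])
			goto out;	/* jumps past the teardown below */
		c->ports[i] = NULL;
	}
	c->eq_allocated = false;	/* stands in for destroying the EQs */
out:
	return;
}

/* Alternative: stop walking ports on NULL, but always release the EQs. */
static void ex_remove_fixed(struct ex_ctx *c)
{
	size_t i;

	assert(c->num_ports <= EX_MAX_PORTS);
	for (i = 0; i < c->num_ports; i++) {
		if (!c->ports[i])
			break;
		c->ports[i] = NULL;
	}
	c->eq_allocated = false;
}
```

With ports[0] == NULL, ex_remove_leaky() returns with eq_allocated still
true, which mirrors the skipped teardown the review describes.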
* Re: [PATCH net-next] net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG
From: Paolo Abeni @ 2026-03-31 9:28 UTC (permalink / raw)
To: Erni Sri Satya Vennela, kys, haiyangz, wei.liu, decui, longli,
andrew+netdev, davem, edumazet, kuba, ssengar, dipayanroy,
gargaditya, shirazsaleem, kees, linux-hyperv, netdev,
linux-kernel
In-Reply-To: <20260326173101.2010514-1-ernis@linux.microsoft.com>
On 3/26/26 6:30 PM, Erni Sri Satya Vennela wrote:
> As a part of MANA hardening for CVM, validate the adapter_mtu value
> returned from the MANA_QUERY_DEV_CONFIG HWC command.
>
> The adapter_mtu value is used to compute ndev->max_mtu via:
> gc->adapter_mtu - ETH_HLEN. If hardware returns a bogus adapter_mtu
> smaller than ETH_HLEN (e.g. 0), the unsigned subtraction wraps to a
> huge value, silently allowing oversized MTU settings.
>
> Add a validation check to reject adapter_mtu values below
> ETH_MIN_MTU + ETH_HLEN, returning -EPROTO to fail the device
> configuration early with a clear error message.
>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index b39e8b920791..bd07d17a6017 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1207,10 +1207,16 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
>
> *max_num_vports = resp.max_num_vports;
>
> - if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2)
> + if (resp.hdr.response.msg_version >= GDMA_MESSAGE_V2) {
> + if (resp.adapter_mtu < ETH_MIN_MTU + ETH_HLEN) {
> + dev_err(dev, "Adapter MTU too small: %u\n",
> + resp.adapter_mtu);
> + return -EPROTO;
AI review says:
If this returns -EPROTO, does the caller mana_probe() jump to an error
label and call mana_remove()?
If so, mana_remove() unconditionally calls
disable_work_sync(&ac->link_change_work) and
cancel_delayed_work_sync(&ac->gf_stats_work).
Since mana_query_device_cfg() is called before INIT_WORK() and
INIT_DELAYED_WORK() in the probe sequence, wouldn't this result in
calling sync cancellation functions on uninitialized, zeroed work
structures?
This can lead to a WARN_ON(!work->func) in __flush_work(), or debug
object warnings if CONFIG_DEBUG_OBJECTS_WORK is enabled.
While this initialization issue appears to already exist for other early
error paths, this new error path can also trigger it.
^ permalink raw reply
* Re: [PATCH v2 00/16] Update drivers to use ib_copy_validate_udata_in()
From: Leon Romanovsky @ 2026-03-31 8:35 UTC (permalink / raw)
To: Abhijit Gangurde, Allen Hubbe,
Broadcom internal kernel review list, Bernard Metzler, Bryan Tan,
Cheng Xu, Gal Pressman, Junxian Huang, Kai Shen,
Konstantin Taranov, Krzysztof Czurylo, linux-hyperv, linux-rdma,
Michal Kalderon, Michael Margolin, Nelson Escobar, Satish Kharat,
Selvin Xavier, Yossi Leybovich, Chengchang Tang, Tatyana Nikolova,
Vishnu Dasa, Yishai Hadas, Zhu Yanjun, Jason Gunthorpe
Cc: Long Li, patches
In-Reply-To: <0-v2-f4ac6f418bd6+12c5-rdma_udata_req_jgg@nvidia.com>
On Wed, 25 Mar 2026 18:26:46 -0300, Jason Gunthorpe wrote:
> Progress the uAPI work by shifting nearly all drivers to use
> ib_copy_validate_udata_in() and its variations.
>
> These helpers are easier to use and enforce a tighter uAPI protocol
> for the udata.
>
> v2:
> - Drop EFA patch, rename the field instead
> - Fix the mlx5 mw change, userspace doesn't use the udata struct at all
> v1: https://patch.msgid.link/r/0-v1-2b86f54cda42+7d-rdma_udata_req_jgg@nvidia.com
>
> [...]
Applied, thanks!
[01/16] RDMA: Consolidate patterns with offsetofend() to ib_copy_validate_udata_in()
(no commit info)
[02/16] RDMA: Consolidate patterns with offsetof() to ib_copy_validate_udata_in()
(no commit info)
[03/16] RDMA: Consolidate patterns with sizeof() to ib_copy_validate_udata_in()
(no commit info)
[04/16] RDMA: Use ib_copy_validate_udata_in() for implicit full structs
(no commit info)
[05/16] RDMA/pvrdma: Use ib_copy_validate_udata_in() for srq
(no commit info)
[06/16] RDMA/mlx5: Use ib_copy_validate_udata_in() for SRQ
(no commit info)
[07/16] RDMA/mlx5: Use ib_copy_validate_udata_in() for MW
(no commit info)
[08/16] RDMA/mlx4: Use ib_copy_validate_udata_in()
(no commit info)
[09/16] RDMA/mlx4: Use ib_copy_validate_udata_in() for QP
(no commit info)
[10/16] RDMA/hns: Use ib_copy_validate_udata_in()
(no commit info)
[11/16] RDMA: Use ib_copy_validate_udata_in_cm() for zero comp_mask
(no commit info)
[12/16] RDMA/mlx5: Pull comp_mask validation into ib_copy_validate_udata_in_cm()
(no commit info)
[13/16] RDMA/hns: Add missing comp_mask check in create_qp
(no commit info)
[14/16] RDMA/irdma: Add missing comp_mask check in alloc_ucontext
(no commit info)
[15/16] RDMA: Remove redundant = {} for udata req structs
(no commit info)
[16/16] RDMA/hns: Remove the duplicate calls to ib_copy_validate_udata_in()
(no commit info)
Best regards,
--
Leon Romanovsky <leon@kernel.org>
* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-31 8:22 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Russell King, Greg Kroah-Hartman,
Rafael J. Wysocki, Ioana Ciornei, Nipun Gupta, Nikhil Agarwal,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Bjorn Helgaas, Armin Wolf, Bjorn Andersson, Mathieu Poirier,
Vineeth Vijayan, Peter Oberparleiter, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Christophe Leroy (CS GROUP), linux-kernel, driver-core,
linuxppc-dev, linux-hyperv, linux-pci, platform-driver-x86,
linux-arm-msm, linux-remoteproc, linux-s390, linux-spi,
virtualization, kvm, xen-devel, linux-arm-kernel, Gui-Dong Han
In-Reply-To: <DHGT9XCG8Y96.3IB1EI6FF1ZDZ@kernel.org>
On Tue Mar 31, 2026 at 10:06 AM CEST, Danilo Krummrich wrote:
> On Mon Mar 30, 2026 at 10:10 PM CEST, Alex Williamson wrote:
>> On Mon, 30 Mar 2026 19:38:41 +0200
>> "Danilo Krummrich" <dakr@kernel.org> wrote:
>>
>>> (Cc: Jason)
>>>
>>> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
>>> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>> > index d43745fe4c84..460852f79f29 100644
>>> > --- a/drivers/vfio/pci/vfio_pci_core.c
>>> > +++ b/drivers/vfio/pci/vfio_pci_core.c
>>> > @@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
>>> > pdev->is_virtfn && physfn == vdev->pdev) {
>>> > pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
>>> > pci_name(pdev));
>>> > - pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
>>> > - vdev->vdev.ops->name);
>>> > - WARN_ON(!pdev->driver_override);
>>> > + WARN_ON(device_set_driver_override(&pdev->dev,
>>> > + vdev->vdev.ops->name));
>>>
>>> Technically, this is a change in behavior. If vdev->vdev.ops->name is NULL, it
>>> will trigger the WARN_ON(), whereas before it would have just written "(null)"
>>> into driver_override.
>>
>> It's worse than that. Looking at the implementation in [1], we have:
>>
>> +static inline int device_set_driver_override(struct device *dev, const char *s)
>> +{
>> + return __device_set_driver_override(dev, s, strlen(s));
>> +}
>>
>> So if name is NULL, we oops in strlen() before we even hit the -EINVAL
>> and WARN_ON().
>
> This was changed in v2 [2] and the actual code in-tree is
>
> static inline int device_set_driver_override(struct device *dev, const char *s)
> {
> return __device_set_driver_override(dev, s, s ? strlen(s) : 0);
> }
>
> so it does indeed return -EINVAL for a NULL pointer.
>
>> I don't believe we have any vfio-pci variant drivers where the name is
>> NULL, but kasprintf() handling NULL as "(null)" was a consideration in
>> this design, that even if there is no name the device is sequestered
>> with a driver_override that won't match an actual driver.
>>
>>> I assume that vfio_pci_core drivers are expected to set the name in struct
>>> vfio_device_ops in the first place and this code (silently) relies on this
>>> invariant?
>>
>> We do expect that, but it was previously safe either way to make sure
>> VFs are only bound to the same ops driver or barring that, at least
>> don't perform a standard driver match. The last thing we want to
>> happen automatically is for a user owned PF to create SR-IOV VFs that
>> automatically bind to native kernel drivers.
>>
>>> Alex, Jason: Should we keep this hunk above as is and check for a proper name in
>>> struct vfio_device_ops in vfio_pci_core_register_device() with a subsequent
>>> patch?
>>
>> Given the oops, my preference would be to roll it in here. This change
>> is what makes it a requirement that name cannot be NULL, where this was
>> safely handled with kasprintf().
>
> Again, no oops here. :)
>
> I still think it makes more sense to fail early in
> vfio_pci_core_register_device(), rather than silently accept "(null)" in
> driver_override. It also doesn't seem unreasonable with only the WARN_ON(), but
> I can also just add vdev->vdev.ops->name ?: "(null)".
(Or just skip the call if !vdev->vdev.ops->name, as a user will read "(null)"
from sysfs either way.)
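
The three options mentioned in this reply (fail early at registration, substitute a placeholder, or skip the call) can be compared in a small user-space sketch. All function names below are hypothetical and none of this is vfio code; the placeholder variant uses the portable conditional operator instead of the GNU "?:" shorthand.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Option 1: refuse registration when the ops carry no name (fail early). */
int model_register_device(const char *ops_name)
{
	return ops_name ? 0 : -EINVAL;
}

/* Option 2: substitute a placeholder for a missing name. */
const char *model_override_name(const char *ops_name)
{
	const char *name = ops_name ? ops_name : "(null)";

	assert(name != NULL);
	return name;
}

/* Option 3: skip the call; returns 1 if the override would be set. */
int model_should_set_override(const char *ops_name)
{
	return ops_name != NULL;
}
```

Only option 1 surfaces a missing name to the driver author at registration time; options 2 and 3 defer the problem to whoever later reads driver_override.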
> Please let me know what you prefer.
>
> - Danilo
>
>> [1] https://lore.kernel.org/all/20260302002729.19438-2-dakr@kernel.org/
>
> [2] https://lore.kernel.org/driver-core/20260303115720.48783-1-dakr@kernel.org/
* Re: [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-31 8:06 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Russell King, Greg Kroah-Hartman,
Rafael J. Wysocki, Ioana Ciornei, Nipun Gupta, Nikhil Agarwal,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Bjorn Helgaas, Armin Wolf, Bjorn Andersson, Mathieu Poirier,
Vineeth Vijayan, Peter Oberparleiter, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Christophe Leroy (CS GROUP), linux-kernel, driver-core,
linuxppc-dev, linux-hyperv, linux-pci, platform-driver-x86,
linux-arm-msm, linux-remoteproc, linux-s390, linux-spi,
virtualization, kvm, xen-devel, linux-arm-kernel, Gui-Dong Han
In-Reply-To: <20260330141050.2cb47bd9@shazbot.org>
On Mon Mar 30, 2026 at 10:10 PM CEST, Alex Williamson wrote:
> On Mon, 30 Mar 2026 19:38:41 +0200
> "Danilo Krummrich" <dakr@kernel.org> wrote:
>
>> (Cc: Jason)
>>
>> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
>> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> > index d43745fe4c84..460852f79f29 100644
>> > --- a/drivers/vfio/pci/vfio_pci_core.c
>> > +++ b/drivers/vfio/pci/vfio_pci_core.c
>> > @@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
>> > pdev->is_virtfn && physfn == vdev->pdev) {
>> > pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
>> > pci_name(pdev));
>> > - pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
>> > - vdev->vdev.ops->name);
>> > - WARN_ON(!pdev->driver_override);
>> > + WARN_ON(device_set_driver_override(&pdev->dev,
>> > + vdev->vdev.ops->name));
>>
>> Technically, this is a change in behavior. If vdev->vdev.ops->name is NULL, it
>> will trigger the WARN_ON(), whereas before it would have just written "(null)"
>> into driver_override.
>
> It's worse than that. Looking at the implementation in [1], we have:
>
> +static inline int device_set_driver_override(struct device *dev, const char *s)
> +{
> + return __device_set_driver_override(dev, s, strlen(s));
> +}
>
> So if name is NULL, we oops in strlen() before we even hit the -EINVAL
> and WARN_ON().
This was changed in v2 [2] and the actual code in-tree is
static inline int device_set_driver_override(struct device *dev, const char *s)
{
return __device_set_driver_override(dev, s, s ? strlen(s) : 0);
}
so it does indeed return -EINVAL for a NULL pointer.
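
As a minimal user-space sketch of why the v2 wrapper is safe: the length is only computed for a non-NULL string, so the NULL case reaches the callee's own check instead of strlen(). model_set_override() is a hypothetical stand-in that reproduces only the NULL check described above, not the real __device_set_driver_override().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the inner function. It models one behaviour only:
 * a NULL string is rejected with -EINVAL.
 */
static int model_set_override(const char *s, size_t len)
{
	if (!s)
		return -EINVAL;

	/* The caller is expected to pass the exact string length. */
	assert(len == strlen(s));
	return 0;
}

/* v2-style wrapper: strlen() is never evaluated for a NULL pointer. */
int model_set_override_safe(const char *s)
{
	return model_set_override(s, s ? strlen(s) : 0);
}
```

With this shape, a WARN_ON() around the call fires for a NULL name instead of the kernel faulting inside strlen().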
> I don't believe we have any vfio-pci variant drivers where the name is
> NULL, but kasprintf() handling NULL as "(null)" was a consideration in
> this design, that even if there is no name the device is sequestered
> with a driver_override that won't match an actual driver.
>
>> I assume that vfio_pci_core drivers are expected to set the name in struct
>> vfio_device_ops in the first place and this code (silently) relies on this
>> invariant?
>
> We do expect that, but it was previously safe either way to make sure
> VFs are only bound to the same ops driver or barring that, at least
> don't perform a standard driver match. The last thing we want to
> happen automatically is for a user owned PF to create SR-IOV VFs that
> automatically bind to native kernel drivers.
>
>> Alex, Jason: Should we keep this hunk above as is and check for a proper name in
>> struct vfio_device_ops in vfio_pci_core_register_device() with a subsequent
>> patch?
>
> Given the oops, my preference would be to roll it in here. This change
> is what makes it a requirement that name cannot be NULL, where this was
> safely handled with kasprintf().
Again, no oops here. :)
I still think it makes more sense to fail early in
vfio_pci_core_register_device(), rather than silently accept "(null)" in
driver_override. It also doesn't seem unreasonable with only the WARN_ON(), but
I can also just add vdev->vdev.ops->name ?: "(null)".
Please let me know what you prefer.
- Danilo
> [1] https://lore.kernel.org/all/20260302002729.19438-2-dakr@kernel.org/
[2] https://lore.kernel.org/driver-core/20260303115720.48783-1-dakr@kernel.org/
* Re: [PATCH net-next,v4] net: mana: Force full-page RX buffers via ethtool private flag
From: Jakub Kicinski @ 2026-03-30 22:47 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, leitao, kees, dipayanroy
In-Reply-To: <acrkwuIFyBXhwICF@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Mon, 30 Mar 2026 14:01:54 -0700 Dipayaan Roy wrote:
> On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
> allocation in the RX refill path can cause 15-20% throughput
> regression under high connection counts (>16 TCP streams).
Did you investigate what makes such a difference exactly?
As I said I suspect there are some improvements we could
make in the page pool fragmentation logic that could yield
similar wins without bothering the user.
> Add an ethtool private flag "full-page-rx" that allows the user to
> force one RX buffer per page, bypassing the page_pool fragment path.
> This restores line-rate(180+ Gbps) performance on affected platforms.
>
> Usage:
> ethtool --set-priv-flags eth0 full-page-rx on
>
> There is no behavioral change by default. The flag must be explicitly
> enabled by the user or udev rule.
>
> The existing single-buffer-per-page logic for XDP and jumbo frames is
> consolidated into a new helper mana_use_single_rxbuf_per_page().
ethtool -g rx-buf-len could also fit the bill but I guess this is more
of a hack / workaround than legit config so no strong preference.
> -static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> +static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
> {
> - struct mana_port_context *apc = netdev_priv(ndev);
> unsigned int num_queues = apc->num_queues;
> int i, j;
>
> - if (stringset != ETH_SS_STATS)
> - return;
> for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
> - ethtool_puts(&data, mana_eth_stats[i].name);
> + ethtool_puts(data, mana_eth_stats[i].name);
>
> for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
> - ethtool_puts(&data, mana_hc_stats[i].name);
> + ethtool_puts(data, mana_hc_stats[i].name);
>
> for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
> - ethtool_puts(&data, mana_phy_stats[i].name);
> + ethtool_puts(data, mana_phy_stats[i].name);
>
> for (i = 0; i < num_queues; i++) {
> - ethtool_sprintf(&data, "rx_%d_packets", i);
> - ethtool_sprintf(&data, "rx_%d_bytes", i);
> - ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
> - ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
> - ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
> - ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
> + ethtool_sprintf(data, "rx_%d_packets", i);
Please factor out the noisy, no-op prep work into a separate patch for
ease of review.
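
For readers following the refactor quoted above, here is a minimal user-space sketch of why the helper takes "u8 **data": each write advances a shared cursor, so several helpers can fill one string table in sequence. The model_* names, the "rx_packets_total" entry, and the idea of a second helper emitting the private-flag name are assumptions for illustration; only the 32-byte slot size mirrors the kernel's ETH_GSTRING_LEN.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_GSTRING_LEN 32	/* same value as the kernel's ETH_GSTRING_LEN */

/* Copy one name into the current slot and advance the caller's cursor. */
static void model_puts(unsigned char **data, const char *str)
{
	assert(strlen(str) < MODEL_GSTRING_LEN);
	memset(*data, 0, MODEL_GSTRING_LEN);
	strcpy((char *)*data, str);
	*data += MODEL_GSTRING_LEN;
}

/* Stats helper: takes the cursor by reference so the caller sees it move. */
static void model_strings_stats(unsigned char **data, unsigned int num_queues)
{
	char name[MODEL_GSTRING_LEN];
	unsigned int i;

	model_puts(data, "rx_packets_total");	/* hypothetical global stat */
	for (i = 0; i < num_queues; i++) {
		snprintf(name, sizeof(name), "rx_%u_packets", i);
		model_puts(data, name);
	}
}

/* Fills buf and returns how many slots were written. */
size_t model_get_strings(unsigned char *buf, unsigned int num_queues)
{
	unsigned char *cursor = buf;

	model_strings_stats(&cursor, num_queues);
	/* A second helper continues from where the first one stopped. */
	model_puts(&cursor, "full-page-rx");
	return (size_t)(cursor - buf) / MODEL_GSTRING_LEN;
}
```

The caller must size buf for every slot the helpers will write, which in the real driver is what the string-set count callback is for.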
--
pw-bot: cr
* Re: [PATCH net-next v4] net: mana: Expose hardware diagnostic info via debugfs
From: Jakub Kicinski @ 2026-03-30 22:23 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, pabeni, kotaranov, horms, shradhagupta, shirazsaleem,
dipayanroy, yury.norov, kees, ssengar, gargaditya, linux-hyperv,
netdev, linux-kernel, linux-rdma
In-Reply-To: <acrKgG0USsGABqYT@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>
On Mon, 30 Mar 2026 12:09:52 -0700 Erni Sri Satya Vennela wrote:
> Just a quick follow-up on this. Since these issues were pre-existing and
> not introduced by this patch, would you prefer that I send them as a
> separate fix patch, or fold the fixes into the current patch?
Anything that's pre-existing should be a separate patch, before any new
code. If the bug exists only in net-next, put the fix earlier in the same
series; if the bug exists in net, post it separately for the net tree.
* Re: [PATCH 6/6] mshv: unmap debugfs stats pages on kexec
From: Stanislav Kinsburskii @ 2026-03-30 21:54 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260327201920.2100427-7-jloeser@linux.microsoft.com>
On Fri, Mar 27, 2026 at 01:19:17PM -0700, Jork Loeser wrote:
> On L1VH, debugfs stats pages are overlay pages: the kernel allocates
> them and registers the GPAs with the hypervisor via
> HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
> hypervisor across kexec. If the kexec'd kernel reuses those physical
> pages, the hypervisor's overlay semantics cause a machine check
> exception.
>
> Fix this by calling mshv_debugfs_exit() from the reboot notifier,
> which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
> kexec. This releases the overlay bindings so the physical pages can be
> safely reused. Guard mshv_debugfs_exit() against being called when
> init failed.
>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/mshv_debugfs.c | 7 ++++++-
> drivers/hv/mshv_root_main.c | 1 +
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
> index ebf2549eb44d..f9a4499cf8f3 100644
> --- a/drivers/hv/mshv_debugfs.c
> +++ b/drivers/hv/mshv_debugfs.c
> @@ -676,8 +676,10 @@ int __init mshv_debugfs_init(void)
>
> mshv_debugfs = debugfs_create_dir("mshv", NULL);
> if (IS_ERR(mshv_debugfs)) {
> + err = PTR_ERR(mshv_debugfs);
> + mshv_debugfs = NULL;
> pr_err("%s: failed to create debugfs directory\n", __func__);
> - return PTR_ERR(mshv_debugfs);
> + return err;
> }
>
> if (hv_root_partition()) {
> @@ -712,6 +714,9 @@ int __init mshv_debugfs_init(void)
>
> void mshv_debugfs_exit(void)
> {
> + if (!mshv_debugfs)
nit: checking with IS_ERR_OR_NULL() here would make it unnecessary to set
mshv_debugfs to NULL in the error path of mshv_debugfs_init():
if (!IS_ERR_OR_NULL(mshv_debugfs))
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> + return;
> +
> mshv_debugfs_parent_partition_remove();
>
> if (hv_root_partition()) {
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 281f530b68a9..7038fd830646 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2252,6 +2252,7 @@ root_scheduler_deinit(void)
> static int mshv_reboot_notify(struct notifier_block *nb,
> unsigned long code, void *unused)
> {
> + mshv_debugfs_exit();
> cpuhp_remove_state(mshv_cpuhp_online);
> return 0;
> }
> --
> 2.43.0
>
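
The IS_ERR_OR_NULL() nit in this review can be sketched in user space as follows. The model_* helpers imitate the kernel's error-pointer convention (the top 4095 address values encode a negative errno); they are stand-ins, not the kernel macros, and the guard is written here in early-return form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095	/* mirrors MAX_ERRNO in the kernel's err.h */

/* Encode a negative errno as a pointer value, like ERR_PTR(). */
void *model_err_ptr(long error)
{
	assert(error < 0 && error >= -MODEL_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* True if ptr lies in the range reserved for encoded errors. */
static bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static bool model_is_err_or_null(const void *ptr)
{
	return !ptr || model_is_err(ptr);
}

/* Returns true if the exit path would go on to tear things down. */
bool model_debugfs_exit_runs(const void *mshv_debugfs)
{
	if (model_is_err_or_null(mshv_debugfs))
		return false;	/* init never succeeded: nothing to unmap */
	return true;
}
```

With a guard of this kind, the exit path treats "never initialized" (NULL) and "creation failed" (encoded error) the same way, so the init error path would not need to reset the pointer.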
* Re: [PATCH 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Stanislav Kinsburskii @ 2026-03-30 21:52 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260327201920.2100427-6-jloeser@linux.microsoft.com>
On Fri, Mar 27, 2026 at 01:19:16PM -0700, Jork Loeser wrote:
> Register the mshv reboot notifier for all parent partitions, not just
> root. Previously the notifier was gated on hv_root_partition(), so on
> L1VH (where hv_root_partition() is false) SINT0, SINT5, and SIRBP were
> never cleaned up before kexec. The kexec'd kernel then inherited stale
> unmasked SINTs and an enabled SIRBP pointing to freed memory.
>
> The L1VH SIRBP also needs special handling: unlike the root partition
> where the hypervisor provides the SIRBP page, L1VH must allocate its
> own page and program the GPA into the MSR. Add this allocation to
> mshv_synic_init() and the corresponding free to mshv_synic_cleanup().
>
> Remove the unnecessary mshv_root_partition_init/exit wrappers and
> register the reboot notifier directly in mshv_parent_partition_init().
> Make mshv_reboot_nb static since it no longer needs external linkage.
>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_main.c | 21 ++++-----------------
> drivers/hv/mshv_synic.c | 37 ++++++++++++++++++++++++++++++-------
> 2 files changed, 34 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index e6509c980763..281f530b68a9 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -2256,20 +2256,10 @@ static int mshv_reboot_notify(struct notifier_block *nb,
> return 0;
> }
>
> -struct notifier_block mshv_reboot_nb = {
> +static struct notifier_block mshv_reboot_nb = {
> .notifier_call = mshv_reboot_notify,
> };
>
> -static void mshv_root_partition_exit(void)
> -{
> - unregister_reboot_notifier(&mshv_reboot_nb);
> -}
> -
> -static int __init mshv_root_partition_init(struct device *dev)
> -{
> - return register_reboot_notifier(&mshv_reboot_nb);
> -}
> -
> static int __init mshv_init_vmm_caps(struct device *dev)
> {
> int ret;
> @@ -2339,8 +2329,7 @@ static int __init mshv_parent_partition_init(void)
> if (ret)
> goto remove_cpu_state;
>
> - if (hv_root_partition())
> - ret = mshv_root_partition_init(dev);
> + ret = register_reboot_notifier(&mshv_reboot_nb);
> if (ret)
> goto remove_cpu_state;
>
> @@ -2368,8 +2357,7 @@ static int __init mshv_parent_partition_init(void)
> deinit_root_scheduler:
> root_scheduler_deinit();
> exit_partition:
> - if (hv_root_partition())
> - mshv_root_partition_exit();
> + unregister_reboot_notifier(&mshv_reboot_nb);
> remove_cpu_state:
> cpuhp_remove_state(mshv_cpuhp_online);
> free_synic_pages:
> @@ -2387,8 +2375,7 @@ static void __exit mshv_parent_partition_exit(void)
> misc_deregister(&mshv_dev);
> mshv_irqfd_wq_cleanup();
> root_scheduler_deinit();
> - if (hv_root_partition())
> - mshv_root_partition_exit();
> + unregister_reboot_notifier(&mshv_reboot_nb);
> cpuhp_remove_state(mshv_cpuhp_online);
> free_percpu(mshv_root.synic_pages);
> }
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index 8a7d76a10dc3..32f91a714c97 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -495,13 +495,29 @@ int mshv_synic_init(unsigned int cpu)
>
> /* Setup the Synic's event ring page */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> - sirbp.sirbp_enabled = true;
> - *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> - PAGE_SIZE, MEMREMAP_WB);
>
> - if (!(*event_ring_page))
> - goto cleanup_siefp;
> + if (hv_root_partition()) {
> + *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
> + PAGE_SIZE, MEMREMAP_WB);
> +
> + if (!(*event_ring_page))
> + goto cleanup_siefp;
> + } else {
> + /*
> + * On L1VH the hypervisor does not provide a SIRBP page.
> + * Allocate one and program its GPA into the MSR.
> + */
> + *event_ring_page = (struct hv_synic_event_ring_page *)
> + get_zeroed_page(GFP_KERNEL);
> +
> + if (!(*event_ring_page))
> + goto cleanup_siefp;
>
> + sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
> + >> PAGE_SHIFT;
> + }
> +
> + sirbp.sirbp_enabled = true;
> hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>
> #ifdef HYPERVISOR_CALLBACK_VECTOR
> @@ -581,8 +597,15 @@ int mshv_synic_cleanup(unsigned int cpu)
> /* Disable SYNIC event ring page owned by MSHV */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> sirbp.sirbp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> - memunmap(*event_ring_page);
> +
> + if (hv_root_partition()) {
> + hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> + memunmap(*event_ring_page);
> + } else {
> + sirbp.base_sirbp_gpa = 0;
> + hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> + free_page((unsigned long)*event_ring_page);
> + }
>
> /*
> * Release our mappings of the message and event flags pages.
> --
> 2.43.0
>
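
The L1VH handling of SIRBP in the patch above (program the page's GPA on init, clear it on cleanup) can be sketched with plain shifts and masks. The register layout used here is an assumption: bit 0 as the enable flag, bits 1-11 preserved, bits 12-63 holding the page frame number. The model_* names are hypothetical and no MSR is touched.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* 4 KiB pages, as on x86-64 */

/*
 * Assumed register layout: bit 0 is the enable flag, bits 1-11 are
 * preserved, bits 12-63 hold the guest page frame number.
 */
#define MODEL_SIRBP_ENABLE	UINT64_C(1)
#define MODEL_SIRBP_PRESERVED	UINT64_C(0xffe)

/* L1VH-style init: program the allocated page and set the enable bit. */
uint64_t model_sirbp_enable(uint64_t old_msr, uint64_t page_phys)
{
	uint64_t pfn = page_phys >> MODEL_PAGE_SHIFT;

	/* The page must be page-aligned for the frame number to be exact. */
	assert((page_phys & ((UINT64_C(1) << MODEL_PAGE_SHIFT) - 1)) == 0);
	return (old_msr & MODEL_SIRBP_PRESERVED) |
	       (pfn << MODEL_PAGE_SHIFT) | MODEL_SIRBP_ENABLE;
}

/* L1VH-style cleanup: clear both the enable bit and the frame number. */
uint64_t model_sirbp_disable(uint64_t old_msr)
{
	return old_msr & MODEL_SIRBP_PRESERVED;
}

/* Physical address the register currently points at. */
uint64_t model_sirbp_gpa(uint64_t msr)
{
	return msr & ~(MODEL_SIRBP_PRESERVED | MODEL_SIRBP_ENABLE);
}
```

Clearing the frame number on cleanup, as the patch does for L1VH, means a later kernel cannot inherit a register that still points at a freed page.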
* Re: [PATCH 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Stanislav Kinsburskii @ 2026-03-30 21:51 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260327201920.2100427-5-jloeser@linux.microsoft.com>
On Fri, Mar 27, 2026 at 01:19:15PM -0700, Jork Loeser wrote:
> The SynIC is shared between VMBus and MSHV. VMBus owns the message
> page (SIMP), event flags page (SIEFP), global enable (SCONTROL), and
> SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
>
> Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
> SCONTROL that VMBus already configured, and mshv_synic_cleanup()
> disables all of them. This is wrong because MSHV can be torn down
> while VMBus is still active. In particular, a kexec reboot notifier
> tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out from
> under VMBus causes its later cleanup to write SynIC MSRs while SynIC
> is disabled, which the hypervisor does not tolerate.
>
> Restrict MSHV to managing only the resources it owns:
> - SINT0, SINT5: mask on cleanup, unmask on init
> - SIRBP: enable/disable as before
> - SIMP, SIEFP, SCONTROL: on L1VH leave entirely to VMBus (it
> already enabled them); on root partition VMBus doesn't run, so
> MSHV must enable/disable them
>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/mshv_synic.c | 109 ++++++++++++++++++++++++----------------
> 1 file changed, 67 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
> index f8b0337cdc82..8a7d76a10dc3 100644
> --- a/drivers/hv/mshv_synic.c
> +++ b/drivers/hv/mshv_synic.c
> @@ -454,7 +454,6 @@ int mshv_synic_init(unsigned int cpu)
> #ifdef HYPERVISOR_CALLBACK_VECTOR
> union hv_synic_sint sint;
> #endif
> - union hv_synic_scontrol sctrl;
> struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> struct hv_synic_event_flags_page **event_flags_page =
> @@ -462,28 +461,37 @@ int mshv_synic_init(unsigned int cpu)
> struct hv_synic_event_ring_page **event_ring_page =
> &spages->synic_event_ring_page;
>
> - /* Setup the Synic's message page */
> + /*
> + * Map the SYNIC message page. On root partition the hypervisor
> + * pre-provisions the SIMP GPA but may not set simp_enabled;
> + * on L1VH, VMBus already fully set it up. Enable it on root.
> + */
> simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> - simp.simp_enabled = true;
> + if (hv_root_partition()) {
Is it possible to split out the root partition logic to a separate
function(s) instead of weaving it into this function?
Ideally, there should be a generic function called by VMBUS and a
root partition-specific function called by MSHV if needed.
Thanks,
Stanislav
> + simp.simp_enabled = true;
> + hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + }
> *msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
> HV_HYP_PAGE_SIZE,
> MEMREMAP_WB);
>
> if (!(*msg_page))
> - return -EFAULT;
> -
> - hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + goto cleanup_simp;
>
> - /* Setup the Synic's event flags page */
> + /*
> + * Map the event flags page. Same as SIMP: enable on root,
> + * already enabled by VMBus on L1VH.
> + */
> siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> - siefp.siefp_enabled = true;
> + if (hv_root_partition()) {
> + siefp.siefp_enabled = true;
> + hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> + }
> *event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
> PAGE_SIZE, MEMREMAP_WB);
>
> if (!(*event_flags_page))
> - goto cleanup;
> -
> - hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> + goto cleanup_siefp;
>
> /* Setup the Synic's event ring page */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> @@ -492,7 +500,7 @@ int mshv_synic_init(unsigned int cpu)
> PAGE_SIZE, MEMREMAP_WB);
>
> if (!(*event_ring_page))
> - goto cleanup;
> + goto cleanup_siefp;
>
> hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
>
> @@ -515,28 +523,33 @@ int mshv_synic_init(unsigned int cpu)
> sint.as_uint64);
> #endif
>
> - /* Enable global synic bit */
> - sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> - sctrl.enable = 1;
> - hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + /*
> + * On L1VH, VMBus owns SCONTROL and has already enabled it.
> + * On root partition, VMBus doesn't run so we must enable it.
> + */
> + if (hv_root_partition()) {
> + union hv_synic_scontrol sctrl;
> +
> + sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> + sctrl.enable = 1;
> + hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + }
>
> return 0;
>
> -cleanup:
> - if (*event_ring_page) {
> - sirbp.sirbp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> - memunmap(*event_ring_page);
> - }
> - if (*event_flags_page) {
> +cleanup_siefp:
> + if (*event_flags_page)
> + memunmap(*event_flags_page);
> + if (hv_root_partition()) {
> siefp.siefp_enabled = false;
> hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> - memunmap(*event_flags_page);
> }
> - if (*msg_page) {
> +cleanup_simp:
> + if (*msg_page)
> + memunmap(*msg_page);
> + if (hv_root_partition()) {
> simp.simp_enabled = false;
> hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> - memunmap(*msg_page);
> }
>
> return -EFAULT;
> @@ -545,10 +558,7 @@ int mshv_synic_init(unsigned int cpu)
> int mshv_synic_cleanup(unsigned int cpu)
> {
> union hv_synic_sint sint;
> - union hv_synic_simp simp;
> - union hv_synic_siefp siefp;
> union hv_synic_sirbp sirbp;
> - union hv_synic_scontrol sctrl;
> struct hv_synic_pages *spages = this_cpu_ptr(mshv_root.synic_pages);
> struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
> struct hv_synic_event_flags_page **event_flags_page =
> @@ -568,28 +578,43 @@ int mshv_synic_cleanup(unsigned int cpu)
> hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
> sint.as_uint64);
>
> - /* Disable Synic's event ring page */
> + /* Disable SYNIC event ring page owned by MSHV */
> sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
> sirbp.sirbp_enabled = false;
> hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
> memunmap(*event_ring_page);
>
> - /* Disable Synic's event flags page */
> - siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> - siefp.siefp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> + /*
> + * Release our mappings of the message and event flags pages.
> + * On root partition, we enabled SIMP/SIEFP — disable them.
> + * On L1VH, VMBus owns the MSRs, leave them alone.
> + */
> memunmap(*event_flags_page);
> + if (hv_root_partition()) {
> + union hv_synic_simp simp;
> + union hv_synic_siefp siefp;
>
> - /* Disable Synic's message page */
> - simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> - simp.simp_enabled = false;
> - hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
> + siefp.siefp_enabled = false;
> + hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
> +
> + simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
> + simp.simp_enabled = false;
> + hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
> + }
> memunmap(*msg_page);
>
> - /* Disable global synic bit */
> - sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> - sctrl.enable = 0;
> - hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + /*
> + * On root partition, we enabled SCONTROL in init — disable it.
> + * On L1VH, VMBus owns SCONTROL, leave it alone.
> + */
> + if (hv_root_partition()) {
> + union hv_synic_scontrol sctrl;
> +
> + sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
> + sctrl.enable = 0;
> + hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
> + }
>
> return 0;
> }
> --
> 2.43.0
>
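
One possible shape for the split suggested in this review is sketched below as a user-space model: a generic pair of functions manages the shared SynIC state, and the MSHV path calls them only on the root partition while always managing its own event ring enable. Everything here is hypothetical (function names, the fake MSR array, bit 0 as the enable flag); it is not a proposal for the actual kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fake MSR file for the model; indices are not real MSR numbers. */
enum model_msr {
	MODEL_SIMP,
	MODEL_SIEFP,
	MODEL_SCONTROL,
	MODEL_SIRBP,
	MODEL_MSR_COUNT
};

struct model_synic {
	uint64_t msr[MODEL_MSR_COUNT];
};

#define MODEL_ENABLE UINT64_C(1)	/* bit 0 modelled as the enable flag */

/* Generic part: called by whichever component owns the shared state. */
void model_synic_enable_shared(struct model_synic *s)
{
	assert(s != NULL);
	s->msr[MODEL_SIMP] |= MODEL_ENABLE;
	s->msr[MODEL_SIEFP] |= MODEL_ENABLE;
	s->msr[MODEL_SCONTROL] |= MODEL_ENABLE;
}

void model_synic_disable_shared(struct model_synic *s)
{
	assert(s != NULL);
	s->msr[MODEL_SCONTROL] &= ~MODEL_ENABLE;
	s->msr[MODEL_SIEFP] &= ~MODEL_ENABLE;
	s->msr[MODEL_SIMP] &= ~MODEL_ENABLE;
}

/* MSHV part: shared state is touched only when running as root. */
void model_mshv_synic_init(struct model_synic *s, bool is_root)
{
	if (is_root)
		model_synic_enable_shared(s);
	s->msr[MODEL_SIRBP] |= MODEL_ENABLE;
}

void model_mshv_synic_cleanup(struct model_synic *s, bool is_root)
{
	s->msr[MODEL_SIRBP] &= ~MODEL_ENABLE;
	if (is_root)
		model_synic_disable_shared(s);
}
```

In the L1VH case the shared state is enabled by the VMBus side before the MSHV init runs and is still enabled after the MSHV cleanup, which is the property the patch description asks for.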
* Re: [PATCH 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Stanislav Kinsburskii @ 2026-03-30 21:30 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Roman Kisel, Michael Kelley, linux-kernel, linux-arch,
Anirudh Rayabharam, Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260327201920.2100427-4-jloeser@linux.microsoft.com>
On Fri, Mar 27, 2026 at 01:19:14PM -0700, Jork Loeser wrote:
> After a kexec the logical processors and virtual processors already
> exist in the hypervisor because they were created by the previous
> kernel. Attempting to add them again causes either a BUG_ON or
> corrupted VP state leading to MCEs in the new kernel.
>
> Add hv_lp_exists() to probe whether an LP is already present by
> calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
> LP exists and we skip the add-LP and create-VP loops entirely.
>
> Also add hv_call_notify_all_processors_started() which informs the
> hypervisor that all processors are online. This is required after
> adding LPs (fresh boot) and is a no-op on kexec since we skip that
> path.
>
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com>
> Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com>
> Co-developed-by: Mukesh Rathor <mukeshrathor@microsoft.com>
> Signed-off-by: Mukesh Rathor <mukeshrathor@microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> arch/x86/kernel/cpu/mshyperv.c | 7 +++++
> drivers/hv/hv_proc.c | 47 ++++++++++++++++++++++++++++++++++
> include/asm-generic/mshyperv.h | 10 ++++++++
> include/hyperv/hvgdk_mini.h | 1 +
> include/hyperv/hvhdk_mini.h | 12 +++++++++
> 5 files changed, 77 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 235087456bdf..f653feea880b 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -429,6 +429,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
> }
>
> #ifdef CONFIG_X86_64
> + /* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
> + if (num_present_cpus() == 1 || hv_lp_exists(1))
> + return;
> +
> for_each_present_cpu(i) {
> if (i == 0)
> continue;
> @@ -436,6 +440,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
> BUG_ON(ret);
> }
>
> + ret = hv_call_notify_all_processors_started();
> + WARN_ON(ret);
> +
> for_each_present_cpu(i) {
> if (i == 0)
> continue;
> diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
> index 5f4fd9c3231c..63a48e5a02c5 100644
> --- a/drivers/hv/hv_proc.c
> +++ b/drivers/hv/hv_proc.c
> @@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> return ret;
> }
> EXPORT_SYMBOL_GPL(hv_call_create_vp);
> +
> +int hv_call_notify_all_processors_started(void)
> +{
> + struct hv_input_notify_partition_event *input;
> + u64 status;
> + unsigned long irq_flags;
> + int ret = 0;
> +
> + local_irq_save(irq_flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> + input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
> + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
> + input, NULL);
nit: hv_do_fast_hypercall8() should suffice here and would simplify the
code.
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> + local_irq_restore(irq_flags);
> +
> + if (!hv_result_success(status)) {
> + hv_status_err(status, "\n");
> + ret = hv_result_to_errno(status);
> + }
> + return ret;
> +}
> +
> +bool hv_lp_exists(u32 lp_index)
> +{
> + struct hv_input_get_logical_processor_run_time *input;
> + struct hv_output_get_logical_processor_run_time *output;
> + unsigned long flags;
> + u64 status;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> + input->lp_index = lp_index;
> + status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
> + input, output);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status) &&
> + hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
> + hv_status_err(status, "\n");
> + BUG();
> + }
> +
> + return hv_result_success(status);
> +}
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index d37b68238c97..bf601d67cecb 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
> int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
> int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> +int hv_call_notify_all_processors_started(void);
> +bool hv_lp_exists(u32 lp_index);
> int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>
> #else /* CONFIG_MSHV_ROOT */
> @@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
> {
> return -EOPNOTSUPP;
> }
> +static inline int hv_call_notify_all_processors_started(void)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline bool hv_lp_exists(u32 lp_index)
> +{
> + return false;
> +}
> static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
> {
> return -EOPNOTSUPP;
> diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
> index 056ef7b6b360..f2598e186550 100644
> --- a/include/hyperv/hvgdk_mini.h
> +++ b/include/hyperv/hvgdk_mini.h
> @@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
> /* HV_CALL_CODE */
> #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
> #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
> +#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME 0x0004
> #define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
> #define HVCALL_SEND_IPI 0x000b
> #define HVCALL_ENABLE_VP_VTL 0x000f
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 091c03e26046..b4cb2fa26e9b 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -362,6 +362,7 @@ union hv_partition_event_input {
>
> enum hv_partition_event {
> HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
> + HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
> };
>
> struct hv_input_notify_partition_event {
> @@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
> union hv_partition_event_input input;
> } __packed;
>
> +struct hv_input_get_logical_processor_run_time {
> + u32 lp_index;
> +} __packed;
> +
> +struct hv_output_get_logical_processor_run_time {
> + u64 global_time;
> + u64 local_run_time;
> + u64 rsvdz0;
> + u64 hypervisor_time;
> +} __packed;
> +
> struct hv_lp_startup_status {
> u64 hv_status;
> u64 substatus1;
> --
> 2.43.0
>
* Re: [PATCH 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Stanislav Kinsburskii @ 2026-03-30 21:26 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Roman Kisel, Michael Kelley, linux-kernel, linux-arch,
Anirudh Rayabharam
In-Reply-To: <20260327201920.2100427-3-jloeser@linux.microsoft.com>
On Fri, Mar 27, 2026 at 01:19:13PM -0700, Jork Loeser wrote:
> Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
> hv_machine_shutdown() in the platform code. This ensures stimer cleanup
> happens before the vmbus unload, which is required for root partition
> kexec to work correctly.
>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
> drivers/hv/vmbus_drv.c | 1 -
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 89a2eb8a0722..235087456bdf 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -235,8 +235,12 @@ void hv_remove_crash_handler(void)
> #ifdef CONFIG_KEXEC_CORE
> static void hv_machine_shutdown(void)
> {
> - if (kexec_in_progress && hv_kexec_handler)
> - hv_kexec_handler();
> + if (kexec_in_progress) {
> + hv_stimer_global_cleanup();
> +
> + if (hv_kexec_handler)
> + hv_kexec_handler();
> + }
>
> /*
> * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 301273d61892..5d1449f8c6ea 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2892,7 +2892,6 @@ static struct platform_driver vmbus_platform_driver = {
>
> static void hv_kexec_handler(void)
> {
> - hv_stimer_global_cleanup();
> vmbus_initiate_unload(false);
> /* Make sure conn_state is set as hv_synic_cleanup checks for it */
> mb();
> --
> 2.43.0
>
* Re: [PATCH 1/6] Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
From: Stanislav Kinsburskii @ 2026-03-30 21:25 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Roman Kisel, Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260327201920.2100427-2-jloeser@linux.microsoft.com>
On Fri, Mar 27, 2026 at 01:19:12PM -0700, Jork Loeser wrote:
> vmbus_alloc_synic_and_connect() declares a local 'int
> hyperv_cpuhp_online' that shadows the file-scope global of the same
> name. The cpuhp state returned by cpuhp_setup_state() is stored in
> the local, leaving the global at 0 (CPUHP_OFFLINE). When
> hv_kexec_handler() or hv_machine_shutdown() later call
> cpuhp_remove_state(hyperv_cpuhp_online) they pass 0, which hits the
> BUG_ON in __cpuhp_remove_state_cpuslocked().
>
> Remove the local declaration so the cpuhp state is stored in the
> file-scope global where hv_kexec_handler() and hv_machine_shutdown()
> expect it.
>
> Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection")
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
> ---
> drivers/hv/vmbus_drv.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 3e7a52918ce0..301273d61892 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1430,7 +1430,6 @@ static int vmbus_alloc_synic_and_connect(void)
> {
> int ret, cpu;
> struct work_struct __percpu *works;
> - int hyperv_cpuhp_online;
>
> ret = hv_synic_alloc();
> if (ret < 0)
> --
> 2.43.0
>
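For readers unfamiliar with this class of bug, the shadowing described in
the commit message can be reproduced in a few lines of userspace C. This
is only an illustrative sketch: demo_state, demo_setup_state() and the
other demo_* names are hypothetical stand-ins, not kernel symbols, and the
assert() merely imitates the BUG_ON mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* File-scope state, playing the role of the global hyperv_cpuhp_online. */
static int demo_state;

/* Hypothetical stand-in for cpuhp_setup_state(): returns a nonzero state id. */
static int demo_setup_state(void)
{
	return 42;
}

/* Buggy variant: the local declaration shadows the file-scope variable. */
static void demo_init_shadowed(void)
{
	int demo_state;

	demo_state = demo_setup_state();	/* stored in the local only */
	(void)demo_state;
}

/* Fixed variant: no local declaration, so the file-scope variable is set. */
static void demo_init_fixed(void)
{
	demo_state = demo_setup_state();
}

/* True when a later teardown would see the "never set" value 0. */
static bool demo_teardown_would_trip(void)
{
	return demo_state == 0;
}

/* Teardown that refuses state 0, loosely imitating the BUG_ON in the report. */
static void demo_teardown(void)
{
	assert(demo_state != 0);
	demo_state = 0;
}
```

After demo_init_shadowed() the file-scope value is still 0, so
demo_teardown() would trip its assertion; after demo_init_fixed() it holds
the returned state. Building with GCC's or Clang's -Wshadow flags the local
declaration in the buggy variant.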
* Re: [PATCH v2] Drivers: hv: mshv: fix integer overflow in memory region overlap check
From: Stanislav Kinsburskii @ 2026-03-30 21:13 UTC (permalink / raw)
To: Junrui Luo
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nuno Das Neves, Anirudh Rayabharam, Mukesh Rathor, Muminul Islam,
Praveen K Paladugu, Jinank Jain, linux-hyperv, linux-kernel,
Yuhao Jiang, Roman Kisel, stable
In-Reply-To: <SYBPR01MB788138A30BC69B0F5C3316E5AF54A@SYBPR01MB7881.ausprd01.prod.outlook.com>
On Sat, Mar 28, 2026 at 05:18:45PM +0800, Junrui Luo wrote:
> mshv_partition_create_region() computes mem->guest_pfn + nr_pages to
> check for overlapping regions without verifying u64 wraparound. A
> sufficiently large guest_pfn can cause the addition to overflow,
> bypassing the overlap check and allowing creation of regions that wrap
> around the address space.
>
> Fix by using check_add_overflow() to reject such regions early, and
> validate that the region end does not exceed MAX_PHYSMEM_BITS. These
> checks also protect downstream callers that compute start_gfn +
> nr_pages on stored regions without overflow guards.
>
> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Suggested-by: Roman Kisel <romank@linux.microsoft.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
> ---
> Changes in v2:
> - Add a maximum check suggested by Roman Kisel
> - Link to v1: https://lore.kernel.org/all/SYBPR01MB7881689C0F58149DD986A6D1AF49A@SYBPR01MB7881.ausprd01.prod.outlook.com/
> ---
> drivers/hv/mshv_root_main.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> index 6f42423f7faa..32826247dbce 100644
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -1174,11 +1174,20 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> {
> struct mshv_mem_region *rg;
> u64 nr_pages = HVPFN_DOWN(mem->size);
> + u64 new_region_end;
> +
Minor nit: just "end" or even "tmp" would be sufficient, since it's only
used for the overflow checks. "new_region_end" is a bit verbose and it's
not really "new" per se.
> + /* Reject regions whose end address would wrap around */
> + if (check_add_overflow(mem->guest_pfn, nr_pages, &new_region_end))
> + return -EOVERFLOW;
> +
> + /* Reject regions beyond the maximum physical address */
> + if (new_region_end > HVPFN_DOWN(1ULL << MAX_PHYSMEM_BITS))
This is a PFN, so the check should be against MAX_PHYSMEM_BITS -
PAGE_SHIFT, right?
Or maybe it's even better to use "pfn_valid"?
Thanks,
Stanislav
> + return -EINVAL;
>
> /* Reject overlapping regions */
> spin_lock(&partition->pt_mem_regions_lock);
> hlist_for_each_entry(rg, &partition->pt_mem_regions, hnode) {
> - if (mem->guest_pfn + nr_pages <= rg->start_gfn ||
> + if (new_region_end <= rg->start_gfn ||
> rg->start_gfn + rg->nr_pages <= mem->guest_pfn)
> continue;
> spin_unlock(&partition->pt_mem_regions_lock);
>
> ---
> base-commit: c369299895a591d96745d6492d4888259b004a9e
> change-id: 20260328-fixes-0296eb3dbb52
>
> Best regards,
> --
> Junrui Luo <moonafterrain@outlook.com>
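The wraparound and the two checks discussed in this patch can be tried
outside the kernel. The sketch below is a userspace approximation, not the
driver code: it uses the GCC/Clang builtin __builtin_add_overflow() where
the patch uses check_add_overflow(), and DEMO_PHYS_BITS, DEMO_PAGE_SHIFT
and the demo_* functions are assumed values and hypothetical names chosen
for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel's limits are arch-specific. */
#define DEMO_PHYS_BITS	52
#define DEMO_PAGE_SHIFT	12
#define DEMO_MAX_PFN	(UINT64_C(1) << (DEMO_PHYS_BITS - DEMO_PAGE_SHIFT))

/*
 * Compute the exclusive end PFN of a region.
 * Returns 0 on success, -1 if start + nr_pages wraps around u64,
 * and -2 if the end lies beyond the assumed maximum PFN.
 */
static int demo_region_end(uint64_t start_pfn, uint64_t nr_pages,
			   uint64_t *end)
{
	assert(end != NULL);

	if (__builtin_add_overflow(start_pfn, nr_pages, end))
		return -1;
	if (*end > DEMO_MAX_PFN)
		return -2;
	return 0;
}

/* Half-open interval overlap test on [a_start, a_end) and [b_start, b_end). */
static bool demo_overlaps(uint64_t a_start, uint64_t a_end,
			  uint64_t b_start, uint64_t b_end)
{
	return a_start < b_end && b_start < a_end;
}
```

Without the first check, a start PFN near UINT64_MAX yields a small
wrapped end value, and a comparison of the form "end <= other_start" then
succeeds even though the region wraps around the address space.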