* Re: [PATCH v2] tools: hv: Fix cross-compilation
From: Aditya Garg @ 2026-04-08 12:36 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, gregkh, ssengar,
linux-hyperv, linux-kernel, romank, avladu, vdso, gargaditya
In-Reply-To: <20260407122040.249733-1-gargaditya@linux.microsoft.com>
On 07-04-2026 17:50, Aditya Garg wrote:
> Use the native ARCH only in case it is not set, this will allow the
> cross-compilation where ARCH is explicitly set.
>
> Additionally, simplify the check for ARCH so that fcopy daemon is built
> only for x86_64.
>
> Fixes: 82b0945ce2c2 ("tools: hv: Add new fcopy application based on uio driver")
> Reported-by: Adrian Vladu <avladu@cloudbasesolutions.com>
> Closes: https://lore.kernel.org/linux-hyperv/PR3PR09MB54119DB2FD76977C62D8DD6AB04D2@PR3PR09MB5411.eurprd09.prod.outlook.com/
> Co-developed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
> Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
> ---
> Changes since v1:
> - Dropped the info target printing CC, LD and ARCH
>
> v1: https://lore.kernel.org/all/1733992114-7305-1-git-send-email-ssengar@linux.microsoft.com/
> ---
> tools/hv/Makefile | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/hv/Makefile b/tools/hv/Makefile
> index 34ffcec264ab..e377caf89fb6 100644
> --- a/tools/hv/Makefile
> +++ b/tools/hv/Makefile
> @@ -2,7 +2,7 @@
> # Makefile for Hyper-V tools
> include ../scripts/Makefile.include
>
> -ARCH := $(shell uname -m 2>/dev/null)
> +ARCH ?= $(shell uname -m 2>/dev/null)
> sbindir ?= /usr/sbin
> libexecdir ?= /usr/libexec
> sharedstatedir ?= /var/lib
> @@ -20,7 +20,7 @@ override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
> override CFLAGS += -Wno-address-of-packed-member
>
> ALL_TARGETS := hv_kvp_daemon hv_vss_daemon
> -ifneq ($(ARCH), aarch64)
> +ifeq ($(ARCH), x86_64)
> ALL_TARGETS += hv_fcopy_uio_daemon
> endif
> ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
Sashiko AI review flagged an issue, I tested it and confirmed.
When building via make tools/hv from the top-level kernel directory,
scripts/subarch.include normalizes x86_64 to x86, and since ARCH is
exported, the ?= assignment in tools/hv/Makefile preserves the
normalized value, causing ifeq ($(ARCH), x86_64) to be false and
hv_fcopy_uio_daemon to be silently excluded.
I'll change this to include x86 as well in v3.
Regards,
Aditya
^ permalink raw reply
* Re: [PATCH net-next 0/4] net: mana: Fix probe/remove error path bugs
From: Erni Sri Satya Vennela @ 2026-04-08 10:17 UTC (permalink / raw)
To: Mohsin Bashir
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ssengar, dipayanroy, gargaditya,
shirazsaleem, kees, kotaranov, leon, shacharr, stephen,
linux-hyperv, netdev, linux-kernel
In-Reply-To: <62c3fcc1-6758-47c4-984e-f6940139de0c@gmail.com>
On Fri, Apr 03, 2026 at 11:28:34PM -0700, Mohsin Bashir wrote:
>
>
> On 4/3/26 12:09 AM, Erni Sri Satya Vennela wrote:
> > Fix four pre-existing bugs in mana_probe()/mana_remove() error handling
> > that can cause warnings on uninitialized work structs, masked errors,
> > and resource leaks when early probe steps fail.
> >
> > Patches 1-2 move work struct initialization (link_change_work and
> > gf_stats_work) to before any error path that could trigger
> > mana_remove(), preventing WARN_ON in __flush_work() or debug object
> > warnings when sync cancellation runs on uninitialized work structs.
> >
> > Patch 3 prevents add_adev() from overwriting a port probe error,
> > which could leave the driver in a broken state with NULL ports while
> > reporting success.
> >
> > Patch 4 changes 'goto out' to 'break' in mana_remove()'s port loop
> > so that mana_destroy_eq() is always reached, preventing EQ leaks when
> > a NULL port is encountered.
> >
> > Erni Sri Satya Vennela (4):
> > net: mana: Init link_change_work before potential error paths in probe
> > net: mana: Init gf_stats_work before potential error paths in probe
> > net: mana: Don't overwrite port probe error with add_adev result
> > net: mana: Fix EQ leak in mana_remove on NULL port
> >
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 28 +++++++++----------
> > 1 file changed, 14 insertions(+), 14 deletions(-)
> >
> I believe mana is already in the mainline so fixes go to the net tree?
Thanks for the correction Mohsin.
I'll make this chaneg in the next version.
- Vennela
^ permalink raw reply
* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08 9:24 UTC (permalink / raw)
To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Long Li, lpieralisi@kernel.org, kwilczynski@kernel.org,
mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
Jake Oshins, linux-hyperv@vger.kernel.org,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
matthew.ruffell@canonical.com, kjlx@templeofstupid.com
Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SN6PR02MB415794E53D2B621F6A8BA382D45CA@SN6PR02MB4157.namprd02.prod.outlook.com>
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:15 PM
> > ...
> > Note: we still need to figure out how to address the possible MMIO
> > conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
> > MMIO BARs, but that's of low priority because all PCI devices available
> > to a Linux VM on Azure or on a modern host should use 64-bit BARs and
> > should not use 32-bit BARs -- I checked Mellanox VFs, MANA VFs, NVMe
> > devices, and GPUs in Linux VMs on Azure, and found no 32-bit BARs.
>
> Just to clarify, since this patch is predicated on all BARs being 64-bit,
> hv_pci_alloc_bridge_windows() never encounters a non-zero
> hbus->low_mmio_space, and hence also never allocates from low
> MMIO space. So hv_pci_alloc_bridge_windows() does not need to be
> patched. Is that correct?
Correct. For 32-bit BARs (if any), IMO we can't really do anything for
them in hv_pci_allocate_bridge_windows(), since they must reside
below 4GB.
Note: while the patch doesn't fix the MMIO conflict if there are any
32-bit BARs, the patch doesn't make things worse for 32-bit BARs (if any).
> Taking a broader view, fundamentally the current MMIO location of
> the frame buffer may be unknown to the Linux guest. At the same time,
> Linux must ensure that PCI devices don't get assigned to the MMIO space
> where the frame buffer is located. While the current MMIO location of
> the frame buffer may be unknown, we can assume it was placed in low
> MMIO space by the host -- either Windows Hyper-V or Linux/VMM
> in the root partition, and perhaps as mediated by a paravisor. Probably
> need to confirm with the Linux-in-the-root partition team (and maybe
> the OpenHCL team) that this assumption is true.
IMO this is a good idea! It looks like the framebuffer base always starts
at the beginning of the low MMIO space. We can reserve some
MMIO for the framebuffer at the beginning of the low MMIO space.
> Presumably the
> hyperv_drm driver doesn't need to move the frame buffer, but if it
> does, it must stay in the low MMIO space.
It looks like this assumption is true.
> This patch depends on this assumption, and effectively reserves
> the entire low MMIO space for the frame buffer.
To make it precise, the patch reserves the entire low MMIO space for
the frame buffer and the 32-bit BARs (if any), and there is no MMIO
conflict in the first kernel (assuming hyperv_drm doesn't relocate the
MMIO range), and there can be an MMIO conflict in the
kdump/kexec kernel if there is any 32-bit BAR.
> The low MMIO space
> size defaults to 128 MiB on a local Hyper-V,
Yes, by default, the low MMIO base =0xf800_0000, size=128MB,
but the range [0xfed4_0000, 0xffff_ffff], whose size is 18.75MB,
is reserved for vTPM: see vmbus_walk_resources(). So by default
the available low MMIO size for hyperv_drm is 128 - 18.75 =
109.25 MB.
The size of the framebuffer should be aligned to 2MB, so if the
framebuffer size is bigger than 108MB, it looks like there is no
enough MMIO space in the low MMIO range, e.g. with the below
command:
Set-VMVideo -VMName vm_name -HorizontalResolution 7680
-VerticalResolution 4320 -ResolutionType Maximum
, the resulting max framebuffer size is
7680 * 4320 * 32/8 /1024.0/1024 = 126.5625, which would be
rounded up to 128MB.
However, according to my testing, with the above command,
the low MMIO base = 0xf000_0000, size=256MB, so it's probably
ok to reserve 128 MB for the frame buffer.
In case the low MMIO size is <=64MB, we would want to reserve
less MMIO for the frame buffer.
> and is set to 3 GiB in most
> Azure VMs (or to 1 GiB in an Azure CVM), so that all gets reserved.
>
> A slightly different approach to the whole problem is to change
> vmbus_reserve_fb(). If it is unable to get a non-zero "start" value, then
> it should use the same assumption as above, and reserve a frame buffer
> area starting at the lowest address in low MMIO space. The reserved size
> could be the max possible frame buffer size, which I think is 64 MiB (?).
It can be 128MB with the highest resolution 7680*4320 (I hope the
highest resolution won't become bigger in the future).
> This still leaves low MMIO space for subsequent PCI devices, and allows
> 32-bit BARs to continue to work. This approach requires one further
> assumption, which is that the host, plus any movement by hyperv_drm,
> has kept the frame buffer at the low end of the low MMIO space. From
> what I've seen, that assumption is reality -- the frame buffer always
> starts at the beginning of low MMIO space.
>
> This approach could be taken one step further, where vmbus_reserve_fb()
> *always* reserves 64 MiB starting at the low end of low MMIO space,
> regardless of the value of "start". The messy code for getting "start"
> could be dropped entirely, and the dependency on CONFIG_SYSFB goes
> away. Or maybe still get the value of "start" and "size", and if non-zero
> just do a sanity check that they are within the fixed 64 MiB reserved area.
>
> Thoughts? To me tweaking vmbus_reserve_fb() is a more
> straightforward and explicit way to do the reserving, vs. modifying
> the requested range in the Hyper-V PCI driver.
Agreed. Let me try to make a new patch for review.
> And FWIW, it avoids introducing the 32-bit BAR limitation.
This patch addresses the MMIO conflict for 64-bit BARs and not for
32-bit BARs (if any). The patch does not introduce the 32-bit BAR limitation.
Thanks,
-- Dexuan
^ permalink raw reply
* [PATCH net-next v6] net: mana: Expose hardware diagnostic info via debugfs
From: Erni Sri Satya Vennela @ 2026-04-08 8:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, shradhagupta,
shirazsaleem, yury.norov, kees, ssengar, ernis, dipayanroy,
gargaditya, linux-hyperv, netdev, linux-kernel, linux-rdma
Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.
The debugfs directory for each PCI device is named using pci_name()
(the unique BDF address), and its creation and removal is integrated
into mana_gd_setup() and mana_gd_cleanup_device() respectively, so
that all callers (probe, remove, suspend, resume, shutdown) share a
single code path.
Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
- num_msix_usable, max_num_queues: Max resources from hardware
- gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
- num_vports, bm_hostmode: Device configuration
Per-vPort entries (under /sys/kernel/debug/mana/<BDF>/vportN/):
- port_handle: Hardware vPort handle
- max_sq, max_rq: Max queues from vPort config
- indir_table_sz: Indirection table size
- steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
Last applied steering configuration parameters
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
This patch depends on the following fixes submitted to net:
- "net: mana: Use pci_name() for debugfs directory naming"
- "net: mana: Move current_speed debugfs file to mana_init_port()"
Conflict resolution may be needed when net merges into net-next.
---
Changes in v6:
* Move out of patchset and create a separate patch.
Changes in v5:
* Update commit message.
* Fix conflicts to align with the new patches.
* Make it part of patchset.
Changes in v4:
* Rebase and fix conflicts.
Changes in v3:
* Rename mana_gd_cleanup to mana_gd_cleanup_device.
* Add creation of debugfs entries in mana_gd_setup.
* Add removal of debugfs entries in mana_gd_cleanup_device.
* Remove bm_hostmode and num_vports from debugfs in mana_remove itself,
because "ac" gets freed before debugfs_remove_recursive, to avoid
Use-After-Free error.
* Add "goto out:" in mana_cfg_vport_steering to avoid populating apc
values when resp.hdr.status is not NULL.
Changes in v2:
* Add debugfs_remove_recursice for gc>mana_pci_debugfs in
mana_gd_suspend to handle multiple duplicates creation in
mana_gd_setup and mana_gd_resume path.
* Move debugfs creation for num_vports and bm_hostmode out of
if(!resuming) condition since we have to create it again even for
resume.
* Recreate mana_pci_debugfs in mana_gd_resume.
---
.../net/ethernet/microsoft/mana/gdma_main.c | 59 ++++++++++---------
drivers/net/ethernet/microsoft/mana/mana_en.c | 33 +++++++++++
include/net/mana/gdma.h | 1 +
include/net/mana/mana.h | 8 +++
4 files changed, 74 insertions(+), 27 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..7a99db9afa03 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -194,6 +194,11 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
if (gc->max_num_queues > gc->num_msix_usable - 1)
gc->max_num_queues = gc->num_msix_usable - 1;
+ debugfs_create_u32("num_msix_usable", 0400, gc->mana_pci_debugfs,
+ &gc->num_msix_usable);
+ debugfs_create_u32("max_num_queues", 0400, gc->mana_pci_debugfs,
+ &gc->max_num_queues);
+
return 0;
}
@@ -1264,6 +1269,13 @@ int mana_gd_verify_vf_version(struct pci_dev *pdev)
return err ? err : -EPROTO;
}
gc->pf_cap_flags1 = resp.pf_cap_flags1;
+ gc->gdma_protocol_ver = resp.gdma_protocol_ver;
+
+ debugfs_create_x64("gdma_protocol_ver", 0400, gc->mana_pci_debugfs,
+ &gc->gdma_protocol_ver);
+ debugfs_create_x64("pf_cap_flags1", 0400, gc->mana_pci_debugfs,
+ &gc->pf_cap_flags1);
+
if (resp.pf_cap_flags1 & GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG) {
err = mana_gd_query_hwc_timeout(pdev, &hwc->hwc_timeout);
if (err) {
@@ -1943,15 +1955,20 @@ static int mana_gd_setup(struct pci_dev *pdev)
struct gdma_context *gc = pci_get_drvdata(pdev);
int err;
+ gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+ mana_debugfs_root);
+
err = mana_gd_init_registers(pdev);
if (err)
- return err;
+ goto remove_debugfs;
mana_smc_init(&gc->shm_channel, gc->dev, gc->shm_base);
gc->service_wq = alloc_ordered_workqueue("gdma_service_wq", 0);
- if (!gc->service_wq)
- return -ENOMEM;
+ if (!gc->service_wq) {
+ err = -ENOMEM;
+ goto remove_debugfs;
+ }
err = mana_gd_setup_hwc_irqs(pdev);
if (err) {
@@ -1992,11 +2009,14 @@ static int mana_gd_setup(struct pci_dev *pdev)
free_workqueue:
destroy_workqueue(gc->service_wq);
gc->service_wq = NULL;
+remove_debugfs:
+ debugfs_remove_recursive(gc->mana_pci_debugfs);
+ gc->mana_pci_debugfs = NULL;
dev_err(&pdev->dev, "%s failed (error %d)\n", __func__, err);
return err;
}
-static void mana_gd_cleanup(struct pci_dev *pdev)
+static void mana_gd_cleanup_device(struct pci_dev *pdev)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -2008,6 +2028,10 @@ static void mana_gd_cleanup(struct pci_dev *pdev)
destroy_workqueue(gc->service_wq);
gc->service_wq = NULL;
}
+
+ debugfs_remove_recursive(gc->mana_pci_debugfs);
+ gc->mana_pci_debugfs = NULL;
+
dev_dbg(&pdev->dev, "mana gdma cleanup successful\n");
}
@@ -2065,9 +2089,6 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
gc->dev = &pdev->dev;
xa_init(&gc->irq_contexts);
- gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
- mana_debugfs_root);
-
err = mana_gd_setup(pdev);
if (err)
goto unmap_bar;
@@ -2096,16 +2117,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
cleanup_mana:
mana_remove(&gc->mana, false);
cleanup_gd:
- mana_gd_cleanup(pdev);
+ mana_gd_cleanup_device(pdev);
unmap_bar:
- /*
- * at this point we know that the other debugfs child dir/files
- * are either not yet created or are already cleaned up.
- * The pci debugfs folder clean-up now, will only be cleaning up
- * adapter-MTU file and apc->mana_pci_debugfs folder.
- */
- debugfs_remove_recursive(gc->mana_pci_debugfs);
- gc->mana_pci_debugfs = NULL;
xa_destroy(&gc->irq_contexts);
pci_iounmap(pdev, bar0_va);
free_gc:
@@ -2155,11 +2168,7 @@ static void mana_gd_remove(struct pci_dev *pdev)
mana_rdma_remove(&gc->mana_ib);
mana_remove(&gc->mana, false);
- mana_gd_cleanup(pdev);
-
- debugfs_remove_recursive(gc->mana_pci_debugfs);
-
- gc->mana_pci_debugfs = NULL;
+ mana_gd_cleanup_device(pdev);
xa_destroy(&gc->irq_contexts);
@@ -2181,7 +2190,7 @@ int mana_gd_suspend(struct pci_dev *pdev, pm_message_t state)
mana_rdma_remove(&gc->mana_ib);
mana_remove(&gc->mana, true);
- mana_gd_cleanup(pdev);
+ mana_gd_cleanup_device(pdev);
return 0;
}
@@ -2220,11 +2229,7 @@ static void mana_gd_shutdown(struct pci_dev *pdev)
mana_rdma_remove(&gc->mana_ib);
mana_remove(&gc->mana, true);
- mana_gd_cleanup(pdev);
-
- debugfs_remove_recursive(gc->mana_pci_debugfs);
-
- gc->mana_pci_debugfs = NULL;
+ mana_gd_cleanup_device(pdev);
pci_disable_device(pdev);
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 6302432b9bf6..e7c627e3379a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1276,6 +1276,9 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
apc->port_handle = resp.vport;
ether_addr_copy(apc->mac_addr, resp.mac_addr);
+ apc->vport_max_sq = *max_sq;
+ apc->vport_max_rq = *max_rq;
+
return 0;
}
@@ -1430,6 +1433,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
apc->port_handle, apc->indir_table_sz);
+
+ apc->steer_rx = rx;
+ apc->steer_rss = apc->rss_state;
+ apc->steer_update_tab = update_tab;
+ apc->steer_cqe_coalescing = req->cqe_coalescing_enable;
out:
kfree(req);
return err;
@@ -3154,6 +3162,23 @@ static int mana_init_port(struct net_device *ndev)
eth_hw_addr_set(ndev, apc->mac_addr);
sprintf(vport, "vport%d", port_idx);
apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+
+ debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
+ &apc->port_handle);
+ debugfs_create_u32("max_sq", 0400, apc->mana_port_debugfs,
+ &apc->vport_max_sq);
+ debugfs_create_u32("max_rq", 0400, apc->mana_port_debugfs,
+ &apc->vport_max_rq);
+ debugfs_create_u32("indir_table_sz", 0400, apc->mana_port_debugfs,
+ &apc->indir_table_sz);
+ debugfs_create_u32("steer_rx", 0400, apc->mana_port_debugfs,
+ &apc->steer_rx);
+ debugfs_create_u32("steer_rss", 0400, apc->mana_port_debugfs,
+ &apc->steer_rss);
+ debugfs_create_u32("steer_update_tab", 0400, apc->mana_port_debugfs,
+ &apc->steer_update_tab);
+ debugfs_create_u32("steer_cqe_coalescing", 0400, apc->mana_port_debugfs,
+ &apc->steer_cqe_coalescing);
debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
&apc->speed);
return 0;
@@ -3646,6 +3671,11 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
ac->bm_hostmode = bm_hostmode;
+ debugfs_create_u16("num_vports", 0400, gc->mana_pci_debugfs,
+ &ac->num_ports);
+ debugfs_create_u8("bm_hostmode", 0400, gc->mana_pci_debugfs,
+ &ac->bm_hostmode);
+
if (!resuming) {
ac->num_ports = num_ports;
@@ -3786,6 +3816,9 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
mana_gd_deregister_device(gd);
+ debugfs_lookup_and_remove("bm_hostmode", gc->mana_pci_debugfs);
+ debugfs_lookup_and_remove("num_vports", gc->mana_pci_debugfs);
+
if (suspending)
return;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 7fe3a1b61b2d..c4e3ce5147f7 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -442,6 +442,7 @@ struct gdma_context {
struct gdma_dev mana_ib;
u64 pf_cap_flags1;
+ u64 gdma_protocol_ver;
struct workqueue_struct *service_wq;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 96d21cbbdee2..6d2e05a7368c 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -568,6 +568,14 @@ struct mana_port_context {
/* Debugfs */
struct dentry *mana_port_debugfs;
+
+ /* Cached vport/steering config for debugfs */
+ u32 vport_max_sq;
+ u32 vport_max_rq;
+ u32 steer_rx;
+ u32 steer_rss;
+ u32 steer_update_tab;
+ u32 steer_cqe_coalescing;
};
netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
--
2.34.1
^ permalink raw reply related
* [PATCH net 2/2] net: mana: Move current_speed debugfs file to mana_init_port()
From: Erni Sri Satya Vennela @ 2026-04-08 8:12 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
linux-kernel
In-Reply-To: <20260408081224.302308-1-ernis@linux.microsoft.com>
Move the current_speed debugfs file creation from mana_probe_port() to
mana_init_port(). The file was previously created only during initial
probe, but mana_cleanup_port_context() removes the entire vPort debugfs
directory during detach/attach cycles. Since mana_init_port() recreates
the directory on re-attach, moving current_speed here ensures it survives
these cycles.
Fixes: 75cabb46935b ("net: mana: Add support for net_shaper_ops")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 07630322545f..6302432b9bf6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3154,6 +3154,8 @@ static int mana_init_port(struct net_device *ndev)
eth_hw_addr_set(ndev, apc->mac_addr);
sprintf(vport, "vport%d", port_idx);
apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
+ debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs,
+ &apc->speed);
return 0;
reset_apc:
@@ -3432,8 +3434,6 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
netif_carrier_on(ndev);
- debugfs_create_u32("current_speed", 0400, apc->mana_port_debugfs, &apc->speed);
-
return 0;
free_indir:
--
2.34.1
^ permalink raw reply related
* [PATCH net 1/2] net: mana: Use pci_name() for debugfs directory naming
From: Erni Sri Satya Vennela @ 2026-04-08 8:12 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
linux-kernel
In-Reply-To: <20260408081224.302308-1-ernis@linux.microsoft.com>
Use pci_name(pdev) for the per-device debugfs directory instead of
hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
previous approach had two issues:
1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
in environments like generic VFIO passthrough or nested KVM,
causing a NULL pointer dereference.
2. Multiple PFs would all use "0", and VFs across different PCI
domains or buses could share the same slot name, leading to
-EEXIST errors from debugfs_create_dir().
pci_name(pdev) returns the unique BDF address, is always valid, and is
unique across the system.
Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 43741cd35af8..098fbda0d128 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -2065,11 +2065,8 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
gc->dev = &pdev->dev;
xa_init(&gc->irq_contexts);
- if (gc->is_pf)
- gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
- else
- gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
- mana_debugfs_root);
+ gc->mana_pci_debugfs = debugfs_create_dir(pci_name(pdev),
+ mana_debugfs_root);
err = mana_gd_setup(pdev);
if (err)
--
2.34.1
^ permalink raw reply related
* [PATCH net 0/2] net: mana: Fix debugfs directory naming and file lifecycle
From: Erni Sri Satya Vennela @ 2026-04-08 8:12 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, ernis, ssengar, dipayanroy, gargaditya,
shradhagupta, kees, kotaranov, yury.norov, linux-hyperv, netdev,
linux-kernel
This series fixes two pre-existing debugfs issues in the MANA driver.
Patch 1 fixes the per-device debugfs directory naming to use the unique
PCI BDF address via pci_name(), avoiding a potential NULL pointer
dereference when pdev->slot is NULL (e.g. VFIO passthrough, nested KVM)
and preventing name collisions across multiple PFs or VFs.
Patch 2 moves the current_speed debugfs file creation from
mana_probe_port() to mana_init_port() so it survives detach/attach
cycles triggered by MTU changes or XDP program changes.
Erni Sri Satya Vennela (2):
net: mana: Use pci_name() for debugfs directory naming
net: mana: Move current_speed debugfs file to mana_init_port()
drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 ++-----
drivers/net/ethernet/microsoft/mana/mana_en.c | 4 ++--
2 files changed, 4 insertions(+), 7 deletions(-)
--
2.34.1
^ permalink raw reply
* Re: [PATCH net-next v5 1/3] net: mana: Use pci_name() for debugfs directory naming
From: Erni Sri Satya Vennela @ 2026-04-08 8:12 UTC (permalink / raw)
To: Simon Horman
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, shradhagupta, shirazsaleem,
yury.norov, kees, ssengar, dipayanroy, gargaditya, linux-hyperv,
netdev, linux-kernel, linux-rdma
In-Reply-To: <20260404090514.GS113102@horms.kernel.org>
On Sat, Apr 04, 2026 at 10:05:14AM +0100, Simon Horman wrote:
> On Thu, Apr 02, 2026 at 11:26:55AM -0700, Erni Sri Satya Vennela wrote:
> > Use pci_name(pdev) for the per-device debugfs directory instead of
> > hardcoded "0" for PFs and pci_slot_name(pdev->slot) for VFs. The
> > previous approach had two issues:
> >
> > 1. pci_slot_name() dereferences pdev->slot, which can be NULL for VFs
> > in environments like generic VFIO passthrough or nested KVM,
> > causing a NULL pointer dereference.
> >
> > 2. Multiple PFs would all use "0", and VFs across different PCI
> > domains or buses could share the same slot name, leading to
> > -EEXIST errors from debugfs_create_dir().
> >
> > pci_name(pdev) returns the unique BDF address, is always valid, and
> > is unique across the system.
> >
> > Fixes: 6607c17c6c5e ("net: mana: Enable debugfs files for MANA device")
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
>
> Hi Erni,
>
> Possibly the code differs between net and net-next.
> But if this is fixing a bug in code present in net - as per the cited
> commit - then I think it should be a patch that targets net.
> With some strategy for merging that change into net-next
> if conflicts are expected.
Thankyou for the clarity Simon.
I will send a separate patchset for net tree with the fixes.
- Vennela
^ permalink raw reply
* [PATCH] x86/VMBus: Confidential VMBus for dynamic DMA transfers
From: Tianyu Lan @ 2026-04-08 7:31 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, James.Bottomley,
martin.petersen, apais
Cc: Tianyu Lan, linux-hyperv, linux-kernel, linux-scsi, vdso,
mhklinux
Hyper-V provides Confidential VMBus to communicate between
device model and device guest driver via encrypted/private
memory in Confidential VM. The device model is in OpenHCL
(https://openvmm.dev/guide/user_guide/openhcl.html) that
plays the paravisor role.
For a VMBus device, there are two communication methods to
talk with Host/Hypervisor. 1) VMBUS Ring buffer 2) Dynamic
DMA transfer.
The Confidential VMBus Ring buffer has been upstreamed by
Roman Kisel(commit 6802d8af47d1).
The dynamic DMA transition of VMBus device normally goes
through DMA core and it uses SWIOTLB as bounce buffer in
a CoCo VM.
The Confidential VMBus device can do DMA directly to
private/encrypted memory. Because the swiotlb is decrypted
memory, the DMA transfer must not be bounced through the
swiotlb, so as to preserve confidentiality. This is different
from the default for Linux CoCo VMs, so not use DMA(SWIOTLB)
API in VMBus driver when confidential dynamic DMA transfers
capability is present.
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
---
drivers/scsi/storvsc_drv.c | 28 +++++++++++++++++++++-------
include/linux/hyperv.h | 1 +
2 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index ae1abab97835..79b7611518b7 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1316,7 +1316,8 @@ static void storvsc_on_channel_callback(void *context)
continue;
}
request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
- scsi_dma_unmap(scmnd);
+ if (!device->co_external_memory)
+ scsi_dma_unmap(scmnd);
}
storvsc_on_receive(stor_device, packet, request);
@@ -1339,6 +1340,8 @@ static int storvsc_connect_to_vsp(struct hv_device *device, u32 ring_size,
device->channel->max_pkt_size = STORVSC_MAX_PKT_SIZE;
device->channel->next_request_id_callback = storvsc_next_request_id;
+ if (device->channel->co_external_memory)
+ device->co_external_memory = true;
ret = vmbus_open(device->channel,
ring_size,
@@ -1805,7 +1808,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
unsigned long offset_in_hvpg = offset_in_hvpage(sgl->offset);
unsigned int hvpg_count = HVPFN_UP(offset_in_hvpg + length);
struct scatterlist *sg;
- unsigned long hvpfn, hvpfns_to_add;
+ unsigned long hvpfn, hvpfns_to_add, hvpgoff;
int j, i = 0, sg_count;
payload_sz = (hvpg_count * sizeof(u64) +
@@ -1821,7 +1824,11 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
payload->range.len = length;
payload->range.offset = offset_in_hvpg;
- sg_count = scsi_dma_map(scmnd);
+ if (dev->co_external_memory)
+ sg_count = scsi_sg_count(scmnd);
+ else
+ sg_count = scsi_dma_map(scmnd);
+
if (sg_count < 0) {
ret = SCSI_MLQUEUE_DEVICE_BUSY;
goto err_free_payload;
@@ -1836,9 +1843,16 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
* Such offsets are handled even on other than the first
* sgl entry, provided they are a multiple of PAGE_SIZE.
*/
- hvpfn = HVPFN_DOWN(sg_dma_address(sg));
- hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
- sg_dma_len(sg)) - hvpfn;
+ if (dev->co_external_memory) {
+ hvpgoff = HVPFN_DOWN(sg->offset);
+ hvpfn = page_to_hvpfn(sg_page(sg)) + hvpgoff;
+ hvpfns_to_add = HVPFN_UP(sg->offset + sg->length) -
+ hvpgoff;
+ } else {
+ hvpfn = HVPFN_DOWN(sg_dma_address(sg));
+ hvpfns_to_add = HVPFN_UP(sg_dma_address(sg) +
+ sg_dma_len(sg)) - hvpfn;
+ }
/*
* Fill the next portion of the PFN array with
@@ -1860,7 +1874,7 @@ static enum scsi_qc_status storvsc_queuecommand(struct Scsi_Host *host,
ret = storvsc_do_io(dev, cmd_request, smp_processor_id());
migrate_enable();
- if (ret)
+ if (ret && (!dev->co_external_memory))
scsi_dma_unmap(scmnd);
if (ret == -EAGAIN) {
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..bcb143766d6e 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1285,6 +1285,7 @@ struct hv_device {
/* place holder to keep track of the dir for hv device in debugfs */
struct dentry *debug_dir;
+ bool co_external_memory;
};
--
2.50.1
^ permalink raw reply related
* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08 6:37 UTC (permalink / raw)
To: Michael Kelley, Matthew Ruffell
Cc: bhelgaas@google.com, Haiyang Zhang, Jake Oshins,
kwilczynski@kernel.org, KY Srinivasan,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-pci@vger.kernel.org, Long Li, lpieralisi@kernel.org,
mani@kernel.org, robh@kernel.org, stable@vger.kernel.org,
wei.liu@kernel.org
In-Reply-To: <SN6PR02MB4157F48F3B37D74E3A9F1AFBD45CA@SN6PR02MB4157.namprd02.prod.outlook.com>
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:13 PM
> > ...
> > 96959283a58d adds "select SYSFB if !HYPERV_VTL_MODE", but we can
> > still manually unset CONFIG_SYSFB (I happened to do this when debugging
> > the kdump issue), and hv_pci won't work.
>
> Just curious -- how would you manually unset CONFIG_SYSFB? The kernel
> makefile always resync's .config against the Kconfig rules, which would add
> CONFIG_SYSFB back again. The Kconfig files essentially say that removing
> CONFIG_SYSFB is an invalid configuration.
Sorry, my description above is wrong: on the mainline kernel that has
96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests"),
I'm unable to unset CONFIG_SYSFB.
When I was able to unset CONFIG_SYSFB, I was actually on Ubuntu 22.04
(Ubuntu-azure-6.8-6.8.0-1049.55_22.04.1, released in Feb 2026). I thought the
kernel has 96959283a58d, but actually it doesn't...
> > IMO vmbus_reserve_fb() should unconditionally reserve the frame buffer
> > MMIO range. I'll post a patch like this:
> >
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -2395,10 +2398,8 @@ static void __maybe_unused
> vmbus_reserve_fb(void)
> >
> > if (efi_enabled(EFI_BOOT)) {
> > /* Gen2 VM: get FB base from EFI framebuffer */
> > - if (IS_ENABLED(CONFIG_SYSFB)) {
> > - start = sysfb_primary_display.screen.lfb_base;
> > - size = max_t(__u32, sysfb_primary_display.screen.lfb_size,
> 0x800000);
> > - }
> > + start = sysfb_primary_display.screen.lfb_base;
> > + size = max_t(__u32, sysfb_primary_display.screen.lfb_size,
> 0x800000);
Please ignore the change above.
> On arm64 the existence of sysfb_primary_display is conditional on
> several config variables, including CONFIG_SYSFB and CONFIG_EFI_EARLYCON.
> (see drivers/firmware/efi/efi-init.c) If you can take away CONFIG_SYSFB, you
> could also take away CONFIG_EFI_EARLYCON and end up with build error on
> arm64. So I'm not clear how this approach would be more robust against
> invalid .config changes.
Agreed. Then let's keep vmbus_reserve_fb() as is.
Thanks,
Dexuan
^ permalink raw reply
* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Dexuan Cui @ 2026-04-08 6:15 UTC (permalink / raw)
To: Michael Kelley, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Long Li, lpieralisi@kernel.org, kwilczynski@kernel.org,
mani@kernel.org, robh@kernel.org, bhelgaas@google.com,
Jake Oshins, linux-hyperv@vger.kernel.org,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org, Matthew Ruffell, Krister Johansen
In-Reply-To: <SN6PR02MB415726C7758C985528ACB912D45CA@SN6PR02MB4157.namprd02.prod.outlook.com>
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Sunday, April 5, 2026 4:11 PM
> > ...
> > Unluckily, setup_linux_vesafb() only recognizes the vesafb
> > driver in Linux kernel ("VESA VGA") and the efifb driver ("EFI VGA").
> > It looks like normally arch_options.reuse_video_type is always 0.
> >
> > This means the kdump kernel's screen_info.lfb_base is 0, if
> > hyperv_fb or hyperv_drm loads. In the past, for a Ubuntu kernel
> > with CONFIG_FB_EFI=y, our workaround is blacklisting
> > hyperv_fb or hyperv_drm, so /dev/fb0 is backed by efifb, and
> > the screen_info.lfb_base is correctly set for kdump.
>
> Hmmm. This worse than I thought for x86/x64. In fact, it means
> a part of my commit message for 304386373007 is now wrong. I had
> described everything as working when using the kexec_load() system
> call because the FBIOGET_FSCREENINFO ioctl was returning a good
> value for smem_start (at least with the hyperv_fb driver). But as you
> point out further down, newer versions of the kexec user space program
> are ignoring that smem_start value unless the driver is vesafb or efifb.
>
> Was blacklisting hyperv_fb or hyperv_drm in the kdump kernel
> a workaround we had promulgated in the past? My recollection
> is vague. But no matter.
Blacklisting hyperv_fb or hyperv_drm in the *first* kernel was an
internal workaround, which no longer works since CONFIG_FB_EFI
is not set in the linux-azure kernels.
Thanks,
Dexuan
^ permalink raw reply
* Re: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Thomas Lefebvre @ 2026-04-08 4:13 UTC (permalink / raw)
To: Michael Kelley
Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org
In-Reply-To: <SN6PR02MB4157A58DA24233B77829B59FD45AA@SN6PR02MB4157.namprd02.prod.outlook.com>
Thanks, that's very useful information.
Unfortunately, I can't easily reproduce the issue; I can't seem to get
pvclock_update_vm_gtod_copy() and __get_kvmclock() to run on different vCPUs
which was one of the two required conditions that triggered the
unsigned subtraction wraparound
(the second condition being inconsistent values between L1 vCPUs).
I just upgraded to Windows 11 25H2 and my Hyper-V VM config from v9 to v12,
I now see tsc_reliable and constant_tsc in the L1 Linux VM lscpu and
/sys/devices/system/clocksource/clocksource0/current_clocksource is
tsc.
I'll report back if I still encounter the problem when spinning up L2
Linux VMs with KVM.
On Tue, Apr 7, 2026 at 1:40 PM Michael Kelley <mhklinux@outlook.com> wrote:
>
> From: Thomas Lefebvre <thomas.lefebvre3@gmail.com> Sent: Tuesday, April 7, 2026 12:13 PM
> >
> > Hi everyone, thank you for your attention to this bug report.
> >
> > Michael,
> >
> > 1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
> > and "constant_tsc".
> > $ lscpu | grep tsc_reliable
> > $ lscpu | grep constant_tsc
> > $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> > hyperv_clocksource_tsc_page
> >
> > 2. Windows 10
> > Version 22H2 (OS Build 19045.6466)
> >
> > 3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
> > 0x24e24, misc 0xbed7b2
> >
> > 4. Yes, the laptop hibernates and then resumes.
> > When the problem occurred, the laptop had gone through multiple
> > hibernate and resume cycles.
> > I haven't seen it happen after a full reboot before a hibernate/resume cycle.
> >
> > Thomas
> >
>
> How easy is it for you to reproduce the problem? Would it be feasible
> to get a definitive answer on whether the problem repros after a
> full reboot, but before a hibernate/resume cycle?
>
> There's a known bug Windows 10 Hyper-V where the hardware TSC
> scaling gets messed up after a hibernate/resume cycle, causing the TSC
> values read in the guest to drift from what the Hyper-V host thinks
> the guest's TSC value is. A summary of the problem is here:
> https://github.com/microsoft/WSL/issues/6982#issuecomment-2294892954
>
> Of course, this doesn't sound like your symptom. And Hyper-V is not
> telling your guest that it supports hardware TSC scaling, because the
> HV_ACCESS_TSC_INVARIANT flag is *not* set and the clocksource
> is hyperv_clocksource_tsc_page. But my understanding is that the code
> changes to fix the Hyper-V problem weren't trivial, and I'm speculating
> that maybe you are seeing some other symptom of whatever the
> underlying Hyper-V issue was.
>
> Of course, this is just speculation. If the problem can occur before
> any hibernate/resume cycles are done, then my speculation is
> wrong. But if the problem only happens after a hibernate/resume
> cycle, then this known problem, or something related to it, becomes
> a pretty good candidate. Unfortunately, I'm pretty sure there's no
> fix for Windows 10 Hyper-V. You would need to upgrade to
> Windows 11 22H2 or later.
>
> Michael
^ permalink raw reply
* Re: [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-08 2:55 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Alexander Lobakin, kys, haiyangz, wei.liu, decui, andrew+netdev,
davem, edumazet, pabeni, leon, longli, kotaranov, horms,
shradhagupta, ssengar, ernis, shirazsaleem, linux-hyperv, netdev,
linux-kernel, linux-rdma, stephen, jacob.e.keller, dipayanroy,
leitao, kees
In-Reply-To: <20260407185128.605dcacf@kernel.org>
On Tue, Apr 07, 2026 at 06:51:28PM -0700, Jakub Kicinski wrote:
> On Tue, 7 Apr 2026 15:10:45 +0200 Alexander Lobakin wrote:
> > > On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
> > > fragments for allocation in the RX refill path (~2kB buffer per fragment)
> > > causes 15-20% throughput regression under high connection counts
> > > (>16 TCP streams at 180+ Gbps). Using full-page buffers on these
> > > platforms shows no regression and restores line-rate performance.
> > >
> > > This behavior is observed on a single platform; other platforms
> > > perform better with page_pool fragments, indicating this is not a
> > > page_pool issue but platform-specific.
> > >
> > > This series adds an ethtool private flag "full-page-rx" to let the
> > > user opt in to one RX buffer per page:
> > >
> > > ethtool --set-priv-flags eth0 full-page-rx on
> >
> > Sorry I may've missed the previous threads.
> >
> > Has this approach been discussed here? Private flags are generally
> > discouraged.
> >
> > Alternatively, you can provide Ethtool ops to change the Rx buffer size,
> > so that you'd be able to set it to PAGE_SIZE on affected platforms and
> > the result would be the same.
>
> Actually, hm. Now that you spoke up I wonder how much this is
> an inherent ARM problem vs problem in whatever ARM Microsoft's
> management empire-built themselves into.
>
> Do you have access to any ARM servers? Google says GCP offers ARM
> instances with idpf NICs. So if idpf benefits from the same
> "tuning" we should totally push for a proper API not priv flags.
Hi,
Sharing an observation from earlier, with a different ARM64 fabric/platfrom
when configured with base size of 4Kb and the smae MANA NIC, did not show
this behaviour. In fact, it showed better performance with page fragments
in single as well as multiple connections. Thats why initial version this
patch we wanted to apply the work around only to this specific chip where
the issue is seen with page fragments.
Regards
^ permalink raw reply
* Re: [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-08 1:51 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Dipayaan Roy, kys, haiyangz, wei.liu, decui, andrew+netdev, davem,
edumazet, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees
In-Reply-To: <e80b603d-8be0-4aee-8a31-c9cbb4a8ab00@intel.com>
On Tue, 7 Apr 2026 15:10:45 +0200 Alexander Lobakin wrote:
> > On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
> > fragments for allocation in the RX refill path (~2kB buffer per fragment)
> > causes 15-20% throughput regression under high connection counts
> > (>16 TCP streams at 180+ Gbps). Using full-page buffers on these
> > platforms shows no regression and restores line-rate performance.
> >
> > This behavior is observed on a single platform; other platforms
> > perform better with page_pool fragments, indicating this is not a
> > page_pool issue but platform-specific.
> >
> > This series adds an ethtool private flag "full-page-rx" to let the
> > user opt in to one RX buffer per page:
> >
> > ethtool --set-priv-flags eth0 full-page-rx on
>
> Sorry I may've missed the previous threads.
>
> Has this approach been discussed here? Private flags are generally
> discouraged.
>
> Alternatively, you can provide Ethtool ops to change the Rx buffer size,
> so that you'd be able to set it to PAGE_SIZE on affected platforms and
> the result would be the same.
Actually, hm. Now that you spoke up I wonder how much this is
an inherent ARM problem vs problem in whatever ARM Microsoft's
management empire-built themselves into.
Do you have access to any ARM servers? Google says GCP offers ARM
instances with idpf NICs. So if idpf benefits from the same
"tuning" we should totally push for a proper API not priv flags.
^ permalink raw reply
* [PATCH v3 6/6] mshv: unmap debugfs stats pages on kexec
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>
On L1VH, debugfs stats pages are overlay pages: the kernel allocates
them and registers the GPAs with the hypervisor via
HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
hypervisor across kexec. If the kexec'd kernel reuses those physical
pages, the hypervisor's overlay semantics cause a machine check
exception.
Fix this by calling mshv_debugfs_exit() from the reboot notifier,
which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
kexec. This releases the overlay bindings so the physical pages can be
safely reused. Guard mshv_debugfs_exit() against being called when
init failed.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_debugfs.c | 7 ++++++-
drivers/hv/mshv_synic.c | 1 +
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_debugfs.c b/drivers/hv/mshv_debugfs.c
index 418b6dc8f3c2..3c3e02237ae9 100644
--- a/drivers/hv/mshv_debugfs.c
+++ b/drivers/hv/mshv_debugfs.c
@@ -674,8 +674,10 @@ int __init mshv_debugfs_init(void)
mshv_debugfs = debugfs_create_dir("mshv", NULL);
if (IS_ERR(mshv_debugfs)) {
+ err = PTR_ERR(mshv_debugfs);
+ mshv_debugfs = NULL;
pr_err("%s: failed to create debugfs directory\n", __func__);
- return PTR_ERR(mshv_debugfs);
+ return err;
}
if (hv_root_partition()) {
@@ -710,6 +712,9 @@ int __init mshv_debugfs_init(void)
void mshv_debugfs_exit(void)
{
+ if (!mshv_debugfs)
+ return;
+
mshv_debugfs_parent_partition_remove();
if (hv_root_partition()) {
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 8fe673c876fd..ed025f90003f 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -719,6 +719,7 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
static int mshv_synic_reboot_notify(struct notifier_block *nb,
unsigned long code, void *unused)
{
+ mshv_debugfs_exit();
cpuhp_remove_state(synic_cpuhp_online);
return 0;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v3 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>
The reboot notifier that tears down the SynIC cpuhp state guards the
cleanup with hv_root_partition(), so on L1VH (where
hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
cleaned up before kexec. The kexec'd kernel then inherits stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.
Remove the hv_root_partition() guard so the cleanup runs for all
parent partitions.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_synic.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index f71d5dfce1c1..8fe673c876fd 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -719,9 +719,6 @@ mshv_unregister_doorbell(u64 partition_id, int doorbell_portid)
static int mshv_synic_reboot_notify(struct notifier_block *nb,
unsigned long code, void *unused)
{
- if (!hv_root_partition())
- return 0;
-
cpuhp_remove_state(synic_cpuhp_online);
return 0;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v3 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>
The SynIC is shared between VMBus and MSHV. VMBus owns the message
page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
disables all of them. This is wrong because MSHV can be torn down
while VMBus is still active. In particular, a kexec reboot notifier
tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
from under VMBus causes its later cleanup to write SynIC MSRs while
SynIC is disabled, which the hypervisor does not tolerate.
Restrict MSHV to managing only the resources it owns:
- SINT0, SINT5: mask on cleanup, unmask on init
- SIRBP: enable/disable as before
- SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
and nested root partition); on a non-nested root partition VMBus
doesn't run, so MSHV must enable/disable them
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/mshv_synic.c | 142 ++++++++++++++++++++++++++--------------
1 file changed, 94 insertions(+), 48 deletions(-)
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index e2288a726fec..f71d5dfce1c1 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -456,46 +456,72 @@ static int mshv_synic_cpu_init(unsigned int cpu)
union hv_synic_siefp siefp;
union hv_synic_sirbp sirbp;
union hv_synic_sint sint;
- union hv_synic_scontrol sctrl;
struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
struct hv_synic_event_flags_page **event_flags_page =
&spages->synic_event_flags_page;
struct hv_synic_event_ring_page **event_ring_page =
&spages->synic_event_ring_page;
+ /* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
+ bool vmbus_active = !hv_root_partition() || hv_nested;
- /* Setup the Synic's message page */
+ /*
+ * Map the SYNIC message page. When VMBus is not active the
+ * hypervisor pre-provisions the SIMP GPA but may not set
+ * simp_enabled — enable it here.
+ */
simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
- simp.simp_enabled = true;
+ if (!vmbus_active) {
+ simp.simp_enabled = true;
+ hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ }
*msg_page = memremap(simp.base_simp_gpa << HV_HYP_PAGE_SHIFT,
HV_HYP_PAGE_SIZE,
MEMREMAP_WB);
if (!(*msg_page))
- return -EFAULT;
+ goto cleanup_simp;
- hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
-
- /* Setup the Synic's event flags page */
+ /*
+ * Map the event flags page. Same as SIMP: enable when
+ * VMBus is not active, already enabled by VMBus otherwise.
+ */
siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
- siefp.siefp_enabled = true;
+ if (!vmbus_active) {
+ siefp.siefp_enabled = true;
+ hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+ }
*event_flags_page = memremap(siefp.base_siefp_gpa << PAGE_SHIFT,
PAGE_SIZE, MEMREMAP_WB);
if (!(*event_flags_page))
- goto cleanup;
-
- hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+ goto cleanup_siefp;
/* Setup the Synic's event ring page */
sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
- sirbp.sirbp_enabled = true;
- *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
- PAGE_SIZE, MEMREMAP_WB);
- if (!(*event_ring_page))
- goto cleanup;
+ if (hv_root_partition()) {
+ *event_ring_page = memremap(sirbp.base_sirbp_gpa << PAGE_SHIFT,
+ PAGE_SIZE, MEMREMAP_WB);
+ if (!(*event_ring_page))
+ goto cleanup_siefp;
+ } else {
+ /*
+ * On L1VH the hypervisor does not provide a SIRBP page.
+ * Allocate one and program its GPA into the MSR.
+ */
+ *event_ring_page = (struct hv_synic_event_ring_page *)
+ get_zeroed_page(GFP_KERNEL);
+
+ if (!(*event_ring_page))
+ goto cleanup_siefp;
+
+ sirbp.base_sirbp_gpa = virt_to_phys(*event_ring_page)
+ >> PAGE_SHIFT;
+ }
+
+ sirbp.sirbp_enabled = true;
hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
if (mshv_sint_irq != -1)
@@ -518,28 +544,30 @@ static int mshv_synic_cpu_init(unsigned int cpu)
hv_set_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_DOORBELL_SINT_INDEX,
sint.as_uint64);
- /* Enable global synic bit */
- sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
- sctrl.enable = 1;
- hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ /* When VMBus is active it already enabled SCONTROL. */
+ if (!vmbus_active) {
+ union hv_synic_scontrol sctrl;
+
+ sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+ sctrl.enable = 1;
+ hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ }
return 0;
-cleanup:
- if (*event_ring_page) {
- sirbp.sirbp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
- memunmap(*event_ring_page);
- }
- if (*event_flags_page) {
+cleanup_siefp:
+ if (*event_flags_page)
+ memunmap(*event_flags_page);
+ if (!vmbus_active) {
siefp.siefp_enabled = false;
hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
- memunmap(*event_flags_page);
}
- if (*msg_page) {
+cleanup_simp:
+ if (*msg_page)
+ memunmap(*msg_page);
+ if (!vmbus_active) {
simp.simp_enabled = false;
hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
- memunmap(*msg_page);
}
return -EFAULT;
@@ -548,16 +576,15 @@ static int mshv_synic_cpu_init(unsigned int cpu)
static int mshv_synic_cpu_exit(unsigned int cpu)
{
union hv_synic_sint sint;
- union hv_synic_simp simp;
- union hv_synic_siefp siefp;
union hv_synic_sirbp sirbp;
- union hv_synic_scontrol sctrl;
struct hv_synic_pages *spages = this_cpu_ptr(synic_pages);
struct hv_message_page **msg_page = &spages->hyp_synic_message_page;
struct hv_synic_event_flags_page **event_flags_page =
&spages->synic_event_flags_page;
struct hv_synic_event_ring_page **event_ring_page =
&spages->synic_event_ring_page;
+ /* VMBus runs on L1VH and nested root; it owns SIMP/SIEFP/SCONTROL */
+ bool vmbus_active = !hv_root_partition() || hv_nested;
/* Disable the interrupt */
sint.as_uint64 = hv_get_non_nested_msr(HV_MSR_SINT0 + HV_SYNIC_INTERCEPTION_SINT_INDEX);
@@ -574,28 +601,47 @@ static int mshv_synic_cpu_exit(unsigned int cpu)
if (mshv_sint_irq != -1)
disable_percpu_irq(mshv_sint_irq);
- /* Disable Synic's event ring page */
+ /* Disable SYNIC event ring page owned by MSHV */
sirbp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIRBP);
sirbp.sirbp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
- memunmap(*event_ring_page);
- /* Disable Synic's event flags page */
- siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
- siefp.siefp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+ if (hv_root_partition()) {
+ hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+ memunmap(*event_ring_page);
+ } else {
+ sirbp.base_sirbp_gpa = 0;
+ hv_set_non_nested_msr(HV_MSR_SIRBP, sirbp.as_uint64);
+ free_page((unsigned long)*event_ring_page);
+ }
+
+ /*
+ * Release our mappings of the message and event flags pages.
+ * When VMBus is not active, we enabled SIMP/SIEFP — disable
+ * them. Otherwise VMBus owns the MSRs — leave them.
+ */
memunmap(*event_flags_page);
+ if (!vmbus_active) {
+ union hv_synic_simp simp;
+ union hv_synic_siefp siefp;
- /* Disable Synic's message page */
- simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
- simp.simp_enabled = false;
- hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ siefp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIEFP);
+ siefp.siefp_enabled = false;
+ hv_set_non_nested_msr(HV_MSR_SIEFP, siefp.as_uint64);
+
+ simp.as_uint64 = hv_get_non_nested_msr(HV_MSR_SIMP);
+ simp.simp_enabled = false;
+ hv_set_non_nested_msr(HV_MSR_SIMP, simp.as_uint64);
+ }
memunmap(*msg_page);
- /* Disable global synic bit */
- sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
- sctrl.enable = 0;
- hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ /* When VMBus is active it owns SCONTROL — leave it. */
+ if (!vmbus_active) {
+ union hv_synic_scontrol sctrl;
+
+ sctrl.as_uint64 = hv_get_non_nested_msr(HV_MSR_SCONTROL);
+ sctrl.enable = 0;
+ hv_set_non_nested_msr(HV_MSR_SCONTROL, sctrl.as_uint64);
+ }
return 0;
}
--
2.43.0
^ permalink raw reply related
* [PATCH v3 3/6] x86/hyperv: Skip LP/VP creation on kexec
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser, Anirudh Rayabharam,
Stanislav Kinsburskii, Mukesh Rathor
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>
After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.
Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.
Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
arch/x86/kernel/cpu/mshyperv.c | 7 +++++
drivers/hv/hv_proc.c | 47 ++++++++++++++++++++++++++++++++++
include/asm-generic/mshyperv.h | 10 ++++++++
include/hyperv/hvgdk_mini.h | 1 +
include/hyperv/hvhdk_mini.h | 12 +++++++++
5 files changed, 77 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index e498b6b2ef19..b5b6a58b67b0 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -431,6 +431,10 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
}
#ifdef CONFIG_X86_64
+ /* If AP LPs exist, we are in a kexec'd kernel and VPs already exist */
+ if (num_present_cpus() == 1 || hv_lp_exists(1))
+ return;
+
for_each_present_cpu(i) {
if (i == 0)
continue;
@@ -438,6 +442,9 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
BUG_ON(ret);
}
+ ret = hv_call_notify_all_processors_started();
+ WARN_ON(ret);
+
for_each_present_cpu(i) {
if (i == 0)
continue;
diff --git a/drivers/hv/hv_proc.c b/drivers/hv/hv_proc.c
index 3cb4b2a3035c..57b2c64197cb 100644
--- a/drivers/hv/hv_proc.c
+++ b/drivers/hv/hv_proc.c
@@ -239,3 +239,50 @@ int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
return ret;
}
EXPORT_SYMBOL_GPL(hv_call_create_vp);
+
+int hv_call_notify_all_processors_started(void)
+{
+ struct hv_input_notify_partition_event *input;
+ u64 status;
+ unsigned long irq_flags;
+ int ret = 0;
+
+ local_irq_save(irq_flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ memset(input, 0, sizeof(*input));
+ input->event = HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED;
+ status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT,
+ input, NULL);
+ local_irq_restore(irq_flags);
+
+ if (!hv_result_success(status)) {
+ hv_status_err(status, "\n");
+ ret = hv_result_to_errno(status);
+ }
+ return ret;
+}
+
+bool hv_lp_exists(u32 lp_index)
+{
+ struct hv_input_get_logical_processor_run_time *input;
+ struct hv_output_get_logical_processor_run_time *output;
+ unsigned long flags;
+ u64 status;
+
+ local_irq_save(flags);
+ input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+ input->lp_index = lp_index;
+ status = hv_do_hypercall(HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME,
+ input, output);
+ local_irq_restore(flags);
+
+ if (!hv_result_success(status) &&
+ hv_result(status) != HV_STATUS_INVALID_LP_INDEX) {
+ hv_status_err(status, "\n");
+ BUG();
+ }
+
+ return hv_result_success(status);
+}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index d37b68238c97..bf601d67cecb 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -347,6 +347,8 @@ bool hv_result_needs_memory(u64 status);
int hv_deposit_memory_node(int node, u64 partition_id, u64 status);
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
+int hv_call_notify_all_processors_started(void);
+bool hv_lp_exists(u32 lp_index);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
#else /* CONFIG_MSHV_ROOT */
@@ -366,6 +368,14 @@ static inline int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id)
{
return -EOPNOTSUPP;
}
+static inline int hv_call_notify_all_processors_started(void)
+{
+ return -EOPNOTSUPP;
+}
+static inline bool hv_lp_exists(u32 lp_index)
+{
+ return false;
+}
static inline int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags)
{
return -EOPNOTSUPP;
diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index f9600f87186a..6a4e8b9d570f 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -435,6 +435,7 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */
/* HV_CALL_CODE */
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
+#define HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME 0x0004
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
#define HVCALL_SEND_IPI 0x000b
#define HVCALL_ENABLE_VP_VTL 0x000f
diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 091c03e26046..b4cb2fa26e9b 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -362,6 +362,7 @@ union hv_partition_event_input {
enum hv_partition_event {
HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+ HV_PARTITION_ALL_LOGICAL_PROCESSORS_STARTED = 4,
};
struct hv_input_notify_partition_event {
@@ -369,6 +370,17 @@ struct hv_input_notify_partition_event {
union hv_partition_event_input input;
} __packed;
+struct hv_input_get_logical_processor_run_time {
+ u32 lp_index;
+} __packed;
+
+struct hv_output_get_logical_processor_run_time {
+ u64 global_time;
+ u64 local_run_time;
+ u64 rsvdz0;
+ u64 hypervisor_time;
+} __packed;
+
struct hv_lp_startup_status {
u64 hv_status;
u64 substatus1;
--
2.43.0
^ permalink raw reply related
* [PATCH v3 2/6] x86/hyperv: move stimer cleanup to hv_machine_shutdown()
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser, Anirudh Rayabharam
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>
Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
arch/x86/kernel/cpu/mshyperv.c | 8 ++++++--
drivers/hv/vmbus_drv.c | 1 -
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index a7dfc29d3470..e498b6b2ef19 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -237,8 +237,12 @@ void hv_remove_crash_handler(void)
#ifdef CONFIG_KEXEC_CORE
static void hv_machine_shutdown(void)
{
- if (kexec_in_progress && hv_kexec_handler)
- hv_kexec_handler();
+ if (kexec_in_progress) {
+ hv_stimer_global_cleanup();
+
+ if (hv_kexec_handler)
+ hv_kexec_handler();
+ }
/*
* Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 5e7a6839c933..c5dfe9f3b206 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2891,7 +2891,6 @@ static struct platform_driver vmbus_platform_driver = {
static void hv_kexec_handler(void)
{
- hv_stimer_global_cleanup();
vmbus_initiate_unload(false);
/* Make sure conn_state is set as hv_synic_cleanup checks for it */
mb();
--
2.43.0
^ permalink raw reply related
* [PATCH v3 1/6] Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser
In-Reply-To: <20260408013645.286723-1-jloeser@linux.microsoft.com>
vmbus_alloc_synic_and_connect() declares a local 'int
hyperv_cpuhp_online' that shadows the file-scope global of the same
name. The cpuhp state returned by cpuhp_setup_state() is stored in
the local, leaving the global at 0 (CPUHP_OFFLINE). When
hv_kexec_handler() or hv_machine_shutdown() later call
cpuhp_remove_state(hyperv_cpuhp_online) they pass 0, which hits the
BUG_ON in __cpuhp_remove_state_cpuslocked().
Remove the local declaration so the cpuhp state is stored in the
file-scope global where hv_kexec_handler() and hv_machine_shutdown()
expect it.
Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection")
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
---
drivers/hv/vmbus_drv.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 3faa74e49a6b..5e7a6839c933 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1422,7 +1422,6 @@ static int vmbus_alloc_synic_and_connect(void)
{
int ret, cpu;
struct work_struct __percpu *works;
- int hyperv_cpuhp_online;
ret = hv_synic_alloc();
if (ret < 0)
--
2.43.0
^ permalink raw reply related
* [PATCH v3 0/6] Hyper-V: kexec fixes for L1VH (mshv)
From: Jork Loeser @ 2026-04-08 1:36 UTC (permalink / raw)
To: linux-hyperv
Cc: x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Long Li, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H . Peter Anvin, Arnd Bergmann, Michael Kelley,
linux-kernel, linux-arch, Jork Loeser
This series fixes kexec support when Linux runs as an L1 Virtual Host
(L1VH) under Hyper-V, using the MSHV driver to manage child VMs.
1. A variable shadowing bug in vmbus that hides the cpuhp state used
for teardown.
2. Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown(). This ensures stimer cleanup happens before
the vmbus unload.
3. LP/VP re-creation: after kexec, logical processors and virtual
processors already exist in the hypervisor. Detect this and skip
re-adding them.
4-5. SynIC cleanup: the MSHV driver manages its own SynIC resources
separately from vmbus. Add proper teardown of MSHV-owned SINTs,
SIMP, and SIEFP on kexec, scoped to only the resources MSHV
owns.
6. Debugfs stats pages: unmap the VP statistics overlay pages before
kexec to avoid stale mappings in the new kernel.
Changes since v2:
- Rebased onto linux-next/master to adapt to the upstream SynIC
refactor (commit 5a674ef871fe, "mshv: refactor synic init and
cleanup").
Changes since v1:
- Patch 4: account for nested root partitions where VMBus is also
active (not just L1VH); use a vmbus_active local variable; allocate
SIRBP L1VH allocation path for when the hypervisor doesn't
pre-provision the page
Jork Loeser (6):
Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing
x86/hyperv: move stimer cleanup to hv_machine_shutdown()
x86/hyperv: Skip LP/VP creation on kexec
mshv: limit SynIC management to MSHV-owned resources
mshv: clean up SynIC state on kexec for L1VH
mshv: unmap debugfs stats pages on kexec
arch/x86/kernel/cpu/mshyperv.c | 15 +++-
drivers/hv/hv_proc.c | 47 +++++++++++
drivers/hv/mshv_debugfs.c | 7 +-
drivers/hv/mshv_synic.c | 146 +++++++++++++++++++++------------
drivers/hv/vmbus_drv.c | 2 -
include/asm-generic/mshyperv.h | 10 +++
include/hyperv/hvgdk_mini.h | 1 +
include/hyperv/hvhdk_mini.h | 12 +++
8 files changed, 184 insertions(+), 56 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: [PATCH v2 5/6] mshv: clean up SynIC state on kexec for L1VH
From: Jork Loeser @ 2026-04-08 1:25 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260406-hasty-academic-hippo-a4cfb8@anirudhrb>
On Mon, 6 Apr 2026, Anirudh Rayabharam wrote:
> I believe this code has moved to mshv_synic.c in hyperv-fixes. So, this
> probably won't apply.
Right, this needs a fix.
Best,
Jork
^ permalink raw reply
* RE: [EXTERNAL] [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Long Li @ 2026-04-07 22:30 UTC (permalink / raw)
To: Li Tian, linux-scsi@vger.kernel.org
Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
James E.J. Bottomley, Martin K. Petersen,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260406015344.12566-1-litian@redhat.com>
> -----Original Message-----
> From: Li Tian <litian@redhat.com>
> Sent: Sunday, April 5, 2026 6:54 PM
> To: linux-scsi@vger.kernel.org
> Cc: Li Tian <litian@redhat.com>; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; James E.J. Bottomley
> <James.Bottomley@HansenPartnership.com>; Martin K. Petersen
> <martin.petersen@oracle.com>; linux-hyperv@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: [EXTERNAL] [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN
> truncation for Hyper-V vFC
>
> The storvsc driver has become stricter in handling SRB status codes returned by
> the Hyper-V host. When using Virtual Fibre Channel (vFC) passthrough, the host
> may return SRB_STATUS_DATA_OVERRUN for PERSISTENT_RESERVE_IN
> commands if the allocation length in the CDB does not match the host's expected
> response size.
>
> Currently, this status is treated as a fatal error, propagating
> Host_status=0x07 [DID_ERROR] to the SCSI mid-layer. This causes userspace
> storage utilities (such as sg_persist) to fail with transport errors, even when the
> host has actually returned the requested reservation data in the buffer.
>
> Refactor the existing command-specific workarounds into a new helper function,
> storvsc_host_mishandles_cmd(), and add PERSISTENT_RESERVE_IN to the list of
> commands where SRB status errors should be suppressed for vFC devices. This
> ensures that the SCSI mid-layer processes the returned data buffer instead of
> terminating the command.
>
> Signed-off-by: Li Tian <litian@redhat.com>
Reviewed-by: Long Li <longli@microsoft.com>
> ---
> drivers/scsi/storvsc_drv.c | 32 +++++++++++++++++++++-----------
> 1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index
> ae1abab97835..6977ca8a0658 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1131,6 +1131,26 @@ static void storvsc_command_completion(struct
> storvsc_cmd_request *cmd_request,
> kfree(payload);
> }
>
> +/*
> + * The current SCSI handling on the host side does not correctly handle:
> + * INQUIRY with page code 0x80, MODE_SENSE / MODE_SENSE_10 with
> cmd[2]
> +== 0x1c,
> + * and (for FC) MAINTENANCE_IN / PERSISTENT_RESERVE_IN passthrough.
> + */
> +static bool storvsc_host_mishandles_cmd(u8 opcode, struct hv_device
> +*device) {
> + switch (opcode) {
> + case INQUIRY:
> + case MODE_SENSE:
> + case MODE_SENSE_10:
> + return true;
> + case MAINTENANCE_IN:
> + case PERSISTENT_RESERVE_IN:
> + return hv_dev_is_fc(device);
> + default:
> + return false;
> + }
> +}
> +
> static void storvsc_on_io_completion(struct storvsc_device *stor_device,
> struct vstor_packet *vstor_packet,
> struct storvsc_cmd_request *request) @@ -
> 1141,22 +1161,12 @@ static void storvsc_on_io_completion(struct
> storvsc_device *stor_device,
> stor_pkt = &request->vstor_packet;
>
> /*
> - * The current SCSI handling on the host side does
> - * not correctly handle:
> - * INQUIRY command with page code parameter set to 0x80
> - * MODE_SENSE and MODE_SENSE_10 command with cmd[2] == 0x1c
> - * MAINTENANCE_IN is not supported by HyperV FC passthrough
> - *
> * Setup srb and scsi status so this won't be fatal.
> * We do this so we can distinguish truly fatal failues
> * (srb status == 0x4) and off-line the device in that case.
> */
>
> - if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
> - (stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
> - (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
> - (stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
> - hv_dev_is_fc(device))) {
> + if (storvsc_host_mishandles_cmd(stor_pkt->vm_srb.cdb[0], device)) {
> vstor_packet->vm_srb.scsi_status = 0;
> vstor_packet->vm_srb.srb_status = SRB_STATUS_SUCCESS;
> }
> --
> 2.53.0
^ permalink raw reply
* Re: [PATCH v2 4/6] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-04-07 21:27 UTC (permalink / raw)
To: Anirudh Rayabharam
Cc: linux-hyperv, x86, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H . Peter Anvin, Arnd Bergmann,
Michael Kelley, linux-kernel, linux-arch
In-Reply-To: <20260406-ninja-civet-of-tornado-67ff54@anirudhrb>
On Mon, 6 Apr 2026, Anirudh Rayabharam wrote:
> On Fri, Apr 03, 2026 at 12:06:10PM -0700, Jork Loeser wrote:
>> The SynIC is shared between VMBus and MSHV. VMBus owns the message
>> page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
>> and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
>>
>> Currently mshv_synic_init() redundantly enables SIMP, SIEFP, and
>
> The redundant enable is probably a no-op from the hypervisor side so it
> probably doesn't hurt us. The main problem is with the tear down.
It's an MSR intercept. If we can replace this by an "if()" we shave a few
cycles.
> An alternative approach could be: check if SIMP/SIEFP/SCONTROL is
> already enabled. If so, don't enable it again. If not enabled, enable it
> and keep track of what all stuf we have enabled. Then disable all of
> them during cleanup. This approach makes less assumptions about the
> behavior of the VMBUS driver and what stuff it does or doesn't use.
It would, yes. Then again, we drag yet more state and make debugging more
complicated / less clear to reason what happens dynamically. I had been
debating this briefly myself, and ultimately decided against it for that
very reason.
Best,
Jork
^ permalink raw reply
* RE: [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Michael Kelley @ 2026-04-07 20:40 UTC (permalink / raw)
To: Thomas Lefebvre
Cc: Sean Christopherson, Vitaly Kuznetsov, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org
In-Reply-To: <CAKdXbaXdSWq-NYk8z6_OtSgb6xsp+GJxrnF2iBMRdk0nfYne=A@mail.gmail.com>
From: Thomas Lefebvre <thomas.lefebvre3@gmail.com> Sent: Tuesday, April 7, 2026 12:13 PM
>
> Hi everyone, thank you for your attention to this bug report.
>
> Michael,
>
> 1. No, lscpu in the L1 guest does not show the flags "tsc_reliable"
> and "constant_tsc".
> $ lscpu | grep tsc_reliable
> $ lscpu | grep constant_tsc
> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> hyperv_clocksource_tsc_page
>
> 2. Windows 10
> Version 22H2 (OS Build 19045.6466)
>
> 3. Hyper-V: privilege flags low 0x2e7f, high 0x3b8030, ext 0x2, hints
> 0x24e24, misc 0xbed7b2
>
> 4. Yes, the laptop hibernates and then resumes.
> When the problem occurred, the laptop had gone through multiple
> hibernate and resume cycles.
> I haven't seen it happen after a full reboot before a hibernate/resume cycle.
>
> Thomas
>
How easy is it for you to reproduce the problem? Would it be feasible
to get a definitive answer on whether the problem repros after a
full reboot, but before a hibernate/resume cycle?
There's a known bug Windows 10 Hyper-V where the hardware TSC
scaling gets messed up after a hibernate/resume cycle, causing the TSC
values read in the guest to drift from what the Hyper-V host thinks
the guest's TSC value is. A summary of the problem is here:
https://github.com/microsoft/WSL/issues/6982#issuecomment-2294892954
Of course, this doesn't sound like your symptom. And Hyper-V is not
telling your guest that it supports hardware TSC scaling, because the
HV_ACCESS_TSC_INVARIANT flag is *not* set and the clocksource
is hyperv_clocksource_tsc_page. But my understanding is that the code
changes to fix the Hyper-V problem weren't trivial, and I'm speculating
that maybe you are seeing some other symptom of whatever the
underlying Hyper-V issue was.
Of course, this is just speculation. If the problem can occur before
any hibernate/resume cycles are done, then my speculation is
wrong. But if the problem only happens after a hibernate/resume
cycle, then this known problem, or something related to it, becomes
a pretty good candidate. Unfortunately, I'm pretty sure there's no
fix for Windows 10 Hyper-V. You would need to upgrade to
Windows 11 22H2 or later.
Michael
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox