* Re: [PATCH v4 17/47] x86/kvm: Mark TSC as reliable when it's constant and nonstop
From: David Woodhouse @ 2026-06-01 22:02 UTC (permalink / raw)
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnhnjvfIblFxTFX@google.com>
On Fri, 29 May 2026 11:57:34 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 909d3e5e5bcd5..4fe9c69bf40b3 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > [ ... ]
> > > @@ -1040,7 +1041,20 @@ static void __init kvm_init_platform(void)
> > [ ... ]
> > > - kvmclock_init();
> > > + /*
> > > + * If the TSC counts at a constant frequency across P/T states, counts
> > > + * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > + * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
> > > + * exists purely to honor the TSC being marked unstable via command
> > > + * line, any runtime detection of an unstable will happen after this.
> > > + */
> > > + tsc_is_reliable = boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > + !check_tsc_unstable();
> >
> > [Severity: High]
> > Does this evaluate check_tsc_unstable() too early to catch the command line
> > parameter?
>
> Huh, it does indeed.
>
> > It looks like kvm_init_platform() is called from setup_arch(), but the
> > tsc=unstable kernel parameter is parsed via __setup() later during
> > parse_args() in start_kernel().
> >
> > If check_tsc_unstable() evaluates to 0 here because the parameter hasn't
> > been parsed yet, wouldn't it incorrectly force X86_FEATURE_TSC_RELIABLE
> > and set prefer_tsc to true?
>
> Yep, but this is a pre-existing problem that goes all the way back to the original
> commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if invariant TSC is exposed").
>
> We could try to fix that, but I'm _very_ strongly inclined to add (yet another)
> patch to simply drop the check_tsc_unstable() since it has always been dead code.
Yeah, kill it with fire.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
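
For readers following the ordering problem discussed above, here is a minimal userspace sketch, not kernel code. It assumes a single flag that a later parameter-parsing step sets. The names tsc_marked_unstable, parse_param(), decide_reliable() and reset_state() are hypothetical stand-ins for the state behind check_tsc_unstable() and the __setup() handler. It only illustrates that a decision taken before parsing cannot observe the parameter.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the state that check_tsc_unstable() reports. */
static bool tsc_marked_unstable;

/* Hypothetical stand-in for the later command-line parsing step. */
void parse_param(const char *arg)
{
	if (strcmp(arg, "tsc=unstable") == 0)
		tsc_marked_unstable = true;
}

/* Same shape as the quoted expression: two feature bits and the flag. */
bool decide_reliable(bool constant_tsc, bool nonstop_tsc)
{
	return constant_tsc && nonstop_tsc && !tsc_marked_unstable;
}

/* Put the flag back to its boot-time default. */
void reset_state(void)
{
	tsc_marked_unstable = false;
}
```

Calling decide_reliable(true, true) before parse_param("tsc=unstable") returns true, and the same call afterwards returns false. If the platform hook always runs before the parameter is parsed, as described in the thread, the flag check can never change the outcome.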
* Re: [PATCH v4 18/47] x86/kvm: Get local APIC bus frequency from PV CPUID Timing Info
From: David Woodhouse @ 2026-06-01 22:06 UTC (permalink / raw)
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx, sashiko-reviews
In-Reply-To: <ahnZ8tuHFVFTeofW@google.com>
On Fri, 29 May 2026 11:24:50 -0700, Sean Christopherson wrote:
> On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 4fe9c69bf40b3..c1139182121d5 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> >
> > [ ... ]
> >
> > > @@ -1039,6 +1040,13 @@ static void __init kvm_init_platform(void)
> > > x86_init.hyper.get_tsc_khz = kvm_get_tsc_khz;
> > > x86_init.hyper.get_cpu_khz = kvm_get_tsc_khz;
> > > }
> > > +
> > > +#ifdef CONFIG_X86_LOCAL_APIC
> > > + /* The leaf also includes the local APIC bus/timer frequency.*/
> > > + apic_khz = cpuid_ebx(timing_info_leaf);
> > > + if (apic_khz)
> > > + lapic_timer_period = apic_khz * 1000 / HZ;
> >
> > [Severity: High]
> > Could the intermediate multiplication of apic_khz * 1000 overflow here?
> >
> > Since apic_khz is a u32, the multiplication is evaluated in 32-bit unsigned
> > arithmetic. If the hypervisor reports an APIC timer frequency greater than
> > ~4.29 GHz (apic_khz > 4294967), this calculation will silently overflow.
>
> Hmm, easy enough to use mul_u64_u32_div() (I think that's the right helper for
> this?).

Yep.

> But this problem pre-exists in almost every other path that sets lapic_timer_period.
> So while I tried to avoid doing yet more tangentially related cleanup, it seems
> like adding a helper to set lapic_timer_period is the way to go. That would also
> allow making lapic_timer_period local to arch/x86/kernel/apic/apic.c.
>
> *sigh*
Yay, more patches!
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
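
To make the overflow concern concrete, here is a small userspace sketch, not the kernel change itself. It contrasts the 32-bit multiply from the quoted hunk with a version that widens to 64 bits before multiplying. The function names period_narrow() and period_wide() are hypothetical, and plain widening is used here as an assumption in place of the mul_u64_u32_div() helper mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the quoted expression: the multiply happens in 32 bits. */
uint32_t period_narrow(uint32_t apic_khz, uint32_t hz)
{
	assert(hz != 0);
	return apic_khz * 1000u / hz;	/* wraps once apic_khz exceeds 4294967 */
}

/* Widen first, so the intermediate product cannot wrap. */
uint64_t period_wide(uint32_t apic_khz, uint32_t hz)
{
	assert(hz != 0);
	return (uint64_t)apic_khz * 1000u / hz;
}
```

For a 1 GHz input (apic_khz = 1000000) both versions agree. At apic_khz = 4294968, just past the threshold quoted in the review, the 32-bit product wraps to 704, so the narrow version returns 0 for hz = 1000 while the widened version returns 4294968.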
* Re: [PATCH v4 31/47] x86/vmware: NOP-ify save/restore hooks when using VMware's sched_clock
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150753.714296-1-seanjc@google.com>
On Fri, 29 May 2026 08:07:52 -0700, Sean Christopherson wrote:
> NOP-ify the sched_clock save/restore hooks when using VMware's version of
> sched_clock. This will allow extending paravirt_set_sched_clock() to set
> the save/restore hooks, without having to simultaneously change the
> behavior of VMware guests.
>
> Note, it's not at all obvious that it's safe/correct for VMware guests to
> do nothing on suspend/resume, but that's a pre-existing problem. Leave it
> for a VMware expert to sort out.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v4 30/47] x86/xen/time: NOP-ify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150741.714145-1-seanjc@google.com>
On Fri, 29 May 2026 08:07:41 -0700, Sean Christopherson wrote:
> NOP-ify the x86_platform sched_clock save/restore hooks when setting up
> Xen's PV clock to make it somewhat obvious the hooks aren't used when
> running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* Re: [PATCH v4 46/47] x86/kvmclock: Plumb in AP-online and BSP-resume to kvmlock, for documentation
From: David Woodhouse @ 2026-06-01 22:09 UTC (permalink / raw)
To: seanjc
Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, kas, kys, haiyangz,
wei.liu, decui, longli, ajay.kaher, alexey.makhalov, jan.kiszka,
luto, peterz, jgross, daniel.lezcano, jstultz, hpa,
rick.p.edgecombe, vkuznets, bcm-kernel-feedback-list,
boris.ostrovsky, sboyd, kvm, linux-kernel, linux-coco,
linux-hyperv, virtualization, xen-devel, dwmw, thomas.lendacky,
nikunj, dwmw2, mhklinux, tglx
In-Reply-To: <20260529150833.715042-1-seanjc@google.com>
On Fri, 29 May 2026 08:08:33 -0700, Sean Christopherson wrote:
> Invoke kvmclock_cpu_action() with AP_ONLINE and BSP_RESUME, even though
> kvmclock doesn't need to do anything in either case, so that the asymmetry
> of kvmclock is a detail buried in kvmclock, and to explicitly document
> that doing nothing during those phases is intentional and correct.
>
> For all intents and purposes, no functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
* [PATCH] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: Mukesh R @ 2026-06-01 22:51 UTC (permalink / raw)
To: linux-hyperv, linux-kernel; +Cc: wei.liu
Make cosmetic changes:
o Rename struct pci_dev *dev to *pdev since there are cases of
struct device *dev in the file and all over the kernel
o Rename hv_build_pci_dev_id to hv_build_devid_type_pci in anticipation
of building different types of device ids
o Fix checkpatch.pl issues with return and extraneous printk
o Replace spaces with tabs
o Rename struct hv_devid *xxx to struct hv_devid *hv_devid given code
paths involve many types of device ids
o Fix indentation in a large if block by using goto.
There are no functional changes.
Reviewed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
---
arch/x86/hyperv/irqdomain.c | 198 +++++++++++++++++++-----------------
1 file changed, 104 insertions(+), 94 deletions(-)
diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
index 365e364268d9..b3ad50a874dc 100644
--- a/arch/x86/hyperv/irqdomain.c
+++ b/arch/x86/hyperv/irqdomain.c
@@ -1,5 +1,4 @@
// SPDX-License-Identifier: GPL-2.0
-
/*
* Irqdomain for Linux to run as the root partition on Microsoft Hypervisor.
*
@@ -14,8 +13,8 @@
#include <linux/irqchip/irq-msi-lib.h>
#include <asm/mshyperv.h>
-static int hv_map_interrupt(union hv_device_id device_id, bool level,
- int cpu, int vector, struct hv_interrupt_entry *entry)
+static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
+ int cpu, int vector, struct hv_interrupt_entry *ret_entry)
{
struct hv_input_map_device_interrupt *input;
struct hv_output_map_device_interrupt *output;
@@ -32,7 +31,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
intr_desc = &input->interrupt_descriptor;
memset(input, 0, sizeof(*input));
input->partition_id = hv_current_partition_id;
- input->device_id = device_id.as_uint64;
+ input->device_id = hv_devid.as_uint64;
intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
intr_desc->vector_count = 1;
intr_desc->target.vector = vector;
@@ -44,7 +43,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
intr_desc->target.vp_set.valid_bank_mask = 0;
intr_desc->target.vp_set.format = HV_GENERIC_SET_SPARSE_4K;
- nr_bank = cpumask_to_vpset(&(intr_desc->target.vp_set), cpumask_of(cpu));
+ nr_bank = cpumask_to_vpset(&intr_desc->target.vp_set, cpumask_of(cpu));
if (nr_bank < 0) {
local_irq_restore(flags);
pr_err("%s: unable to generate VP set\n", __func__);
@@ -61,7 +60,7 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
status = hv_do_rep_hypercall(HVCALL_MAP_DEVICE_INTERRUPT, 0, var_size,
input, output);
- *entry = output->interrupt_entry;
+ *ret_entry = output->interrupt_entry;
local_irq_restore(flags);
@@ -71,21 +70,19 @@ static int hv_map_interrupt(union hv_device_id device_id, bool level,
return hv_result_to_errno(status);
}
-static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_interrupt(u64 id, struct hv_interrupt_entry *irq_entry)
{
unsigned long flags;
struct hv_input_unmap_device_interrupt *input;
- struct hv_interrupt_entry *intr_entry;
u64 status;
local_irq_save(flags);
input = *this_cpu_ptr(hyperv_pcpu_input_arg);
memset(input, 0, sizeof(*input));
- intr_entry = &input->interrupt_entry;
input->partition_id = hv_current_partition_id;
input->device_id = id;
- *intr_entry = *old_entry;
+ input->interrupt_entry = *irq_entry;
status = hv_do_hypercall(HVCALL_UNMAP_DEVICE_INTERRUPT, input, NULL);
local_irq_restore(flags);
@@ -115,67 +112,71 @@ static int get_rid_cb(struct pci_dev *pdev, u16 alias, void *data)
return 0;
}
-static union hv_device_id hv_build_pci_dev_id(struct pci_dev *dev)
+static union hv_device_id hv_build_devid_type_pci(struct pci_dev *pdev)
{
- union hv_device_id dev_id;
+ int pos;
+ union hv_device_id hv_devid;
struct rid_data data = {
.bridge = NULL,
- .rid = PCI_DEVID(dev->bus->number, dev->devfn)
+ .rid = PCI_DEVID(pdev->bus->number, pdev->devfn)
};
- pci_for_each_dma_alias(dev, get_rid_cb, &data);
+ pci_for_each_dma_alias(pdev, get_rid_cb, &data);
- dev_id.as_uint64 = 0;
- dev_id.device_type = HV_DEVICE_TYPE_PCI;
- dev_id.pci.segment = pci_domain_nr(dev->bus);
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_PCI;
+ hv_devid.pci.segment = pci_domain_nr(pdev->bus);
- dev_id.pci.bdf.bus = PCI_BUS_NUM(data.rid);
- dev_id.pci.bdf.device = PCI_SLOT(data.rid);
- dev_id.pci.bdf.function = PCI_FUNC(data.rid);
- dev_id.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
+ hv_devid.pci.bdf.bus = PCI_BUS_NUM(data.rid);
+ hv_devid.pci.bdf.device = PCI_SLOT(data.rid);
+ hv_devid.pci.bdf.function = PCI_FUNC(data.rid);
+ hv_devid.pci.source_shadow = HV_SOURCE_SHADOW_NONE;
- if (data.bridge) {
- int pos;
+ if (data.bridge == NULL)
+ goto out;
- /*
- * Microsoft Hypervisor requires a bus range when the bridge is
- * running in PCI-X mode.
- *
- * To distinguish conventional vs PCI-X bridge, we can check
- * the bridge's PCI-X Secondary Status Register, Secondary Bus
- * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
- * Specification Revision 1.0 5.2.2.1.3.
- *
- * Value zero means it is in conventional mode, otherwise it is
- * in PCI-X mode.
- */
+ /*
+ * Microsoft Hypervisor requires a bus range when the bridge is
+ * running in PCI-X mode.
+ *
+ * To distinguish conventional vs PCI-X bridge, we can check
+ * the bridge's PCI-X Secondary Status Register, Secondary Bus
+ * Mode and Frequency bits. See PCI Express to PCI/PCI-X Bridge
+ * Specification Revision 1.0 5.2.2.1.3.
+ *
+ * Value zero means it is in conventional mode, otherwise it is
+ * in PCI-X mode.
+ */
- pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
- if (pos) {
- u16 status;
+ pos = pci_find_capability(data.bridge, PCI_CAP_ID_PCIX);
+ if (pos) {
+ u16 status;
- pci_read_config_word(data.bridge, pos +
- PCI_X_BRIDGE_SSTATUS, &status);
+ pci_read_config_word(data.bridge, pos + PCI_X_BRIDGE_SSTATUS,
+ &status);
- if (status & PCI_X_SSTATUS_FREQ) {
- /* Non-zero, PCI-X mode */
- u8 sec_bus, sub_bus;
+ if (status & PCI_X_SSTATUS_FREQ) {
+ /* Non-zero, PCI-X mode */
+ u8 sec_bus, sub_bus;
- dev_id.pci.source_shadow = HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
+ hv_devid.pci.source_shadow =
+ HV_SOURCE_SHADOW_BRIDGE_BUS_RANGE;
- pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS, &sec_bus);
- dev_id.pci.shadow_bus_range.secondary_bus = sec_bus;
- pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS, &sub_bus);
- dev_id.pci.shadow_bus_range.subordinate_bus = sub_bus;
- }
+ pci_read_config_byte(data.bridge, PCI_SECONDARY_BUS,
+ &sec_bus);
+ hv_devid.pci.shadow_bus_range.secondary_bus = sec_bus;
+ pci_read_config_byte(data.bridge, PCI_SUBORDINATE_BUS,
+ &sub_bus);
+ hv_devid.pci.shadow_bus_range.subordinate_bus = sub_bus;
}
}
- return dev_id;
+out:
+ return hv_devid;
}
-/**
- * hv_map_msi_interrupt() - "Map" the MSI IRQ in the hypervisor.
+/*
+ * hv_map_msi_interrupt() - Map the MSI IRQ in the hypervisor.
* @data: Describes the IRQ
* @out_entry: Hypervisor (MSI) interrupt entry (can be NULL)
*
@@ -188,22 +189,23 @@ int hv_map_msi_interrupt(struct irq_data *data,
{
struct irq_cfg *cfg = irqd_cfg(data);
struct hv_interrupt_entry dummy;
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
struct msi_desc *msidesc;
- struct pci_dev *dev;
+ struct pci_dev *pdev;
int cpu;
msidesc = irq_data_get_msi_desc(data);
- dev = msi_desc_to_pci_dev(msidesc);
- device_id = hv_build_pci_dev_id(dev);
+ pdev = msi_desc_to_pci_dev(msidesc);
+ hv_devid = hv_build_devid_type_pci(pdev);
cpu = cpumask_first(irq_data_get_effective_affinity_mask(data));
- return hv_map_interrupt(device_id, false, cpu, cfg->vector,
+ return hv_map_interrupt(hv_devid, false, cpu, cfg->vector,
out_entry ? out_entry : &dummy);
}
EXPORT_SYMBOL_GPL(hv_map_msi_interrupt);
-static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi_msg *msg)
+static void entry_to_msi_msg(struct hv_interrupt_entry *entry,
+ struct msi_msg *msg)
{
/* High address is always 0 */
msg->address_hi = 0;
@@ -211,17 +213,19 @@ static inline void entry_to_msi_msg(struct hv_interrupt_entry *entry, struct msi
msg->data = entry->msi_entry.data.as_uint32;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry);
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry);
+
static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
struct hv_interrupt_entry *stored_entry;
struct irq_cfg *cfg = irqd_cfg(data);
struct msi_desc *msidesc;
- struct pci_dev *dev;
+ struct pci_dev *pdev;
int ret;
msidesc = irq_data_get_msi_desc(data);
- dev = msi_desc_to_pci_dev(msidesc);
+ pdev = msi_desc_to_pci_dev(msidesc);
if (!cfg) {
pr_debug("%s: cfg is NULL", __func__);
@@ -240,7 +244,7 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
stored_entry = data->chip_data;
data->chip_data = NULL;
- ret = hv_unmap_msi_interrupt(dev, stored_entry);
+ ret = hv_unmap_msi_interrupt(pdev, stored_entry);
kfree(stored_entry);
@@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
}
stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
- if (!stored_entry) {
- pr_debug("%s: failed to allocate chip data\n", __func__);
+ if (!stored_entry)
return;
- }
ret = hv_map_msi_interrupt(data, stored_entry);
if (ret) {
@@ -262,18 +264,21 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
data->chip_data = stored_entry;
entry_to_msi_msg(data->chip_data, msg);
-
- return;
}
-static int hv_unmap_msi_interrupt(struct pci_dev *dev, struct hv_interrupt_entry *old_entry)
+static int hv_unmap_msi_interrupt(struct pci_dev *pdev,
+ struct hv_interrupt_entry *irq_entry)
{
- return hv_unmap_interrupt(hv_build_pci_dev_id(dev).as_uint64, old_entry);
+ union hv_device_id hv_devid;
+
+ hv_devid = hv_build_devid_type_pci(pdev);
+ return hv_unmap_interrupt(hv_devid.as_uint64, irq_entry);
}
-static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
+/* NB: during map, hv_interrupt_entry is saved via data->chip_data */
+static void hv_teardown_msi_irq(struct pci_dev *pdev, struct irq_data *irqd)
{
- struct hv_interrupt_entry old_entry;
+ struct hv_interrupt_entry irq_entry;
struct msi_msg msg;
if (!irqd->chip_data) {
@@ -281,13 +286,13 @@ static void hv_teardown_msi_irq(struct pci_dev *dev, struct irq_data *irqd)
return;
}
- old_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
- entry_to_msi_msg(&old_entry, &msg);
+ irq_entry = *(struct hv_interrupt_entry *)irqd->chip_data;
+ entry_to_msi_msg(&irq_entry, &msg);
kfree(irqd->chip_data);
irqd->chip_data = NULL;
- (void)hv_unmap_msi_interrupt(dev, &old_entry);
+ (void)hv_unmap_msi_interrupt(pdev, &irq_entry);
}
/*
@@ -302,7 +307,8 @@ static struct irq_chip hv_pci_msi_controller = {
};
static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
- struct irq_domain *real_parent, struct msi_domain_info *info)
+ struct irq_domain *real_parent,
+ struct msi_domain_info *info)
{
struct irq_chip *chip = info->chip;
@@ -317,7 +323,8 @@ static bool hv_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
}
#define HV_MSI_FLAGS_SUPPORTED (MSI_GENERIC_FLAGS_MASK | MSI_FLAG_PCI_MSIX)
-#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS)
+#define HV_MSI_FLAGS_REQUIRED (MSI_FLAG_USE_DEF_DOM_OPS | \
+ MSI_FLAG_USE_DEF_CHIP_OPS)
static struct msi_parent_ops hv_msi_parent_ops = {
.supported_flags = HV_MSI_FLAGS_SUPPORTED,
@@ -329,14 +336,14 @@ static struct msi_parent_ops hv_msi_parent_ops = {
.init_dev_msi_info = hv_init_dev_msi_info,
};
-static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs,
- void *arg)
+/* Allocate nr_irqs IRQs for the given irq domain */
+static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
{
/*
- * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e. everything except
- * entry_to_msi_msg() should be in here.
+ * TODO: The allocation bits of hv_irq_compose_msi_msg(), i.e.
+ * everything except entry_to_msi_msg() should be in here.
*/
-
int ret;
ret = irq_domain_alloc_irqs_parent(d, virq, nr_irqs, arg);
@@ -344,13 +351,15 @@ static int hv_msi_domain_alloc(struct irq_domain *d, unsigned int virq, unsigned
return ret;
for (int i = 0; i < nr_irqs; ++i) {
- irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller, NULL,
- handle_edge_irq, NULL, "edge");
+ irq_domain_set_info(d, virq + i, 0, &hv_pci_msi_controller,
+ NULL, handle_edge_irq, NULL, "edge");
}
+
return 0;
}
-static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
+static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
+ unsigned int nr_irqs)
{
for (int i = 0; i < nr_irqs; ++i) {
struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
@@ -362,6 +371,7 @@ static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned
hv_teardown_msi_irq(to_pci_dev(desc->dev), irqd);
}
+
irq_domain_free_irqs_top(d, virq, nr_irqs);
}
@@ -394,25 +404,25 @@ struct irq_domain * __init hv_create_pci_msi_domain(void)
int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry)
{
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
- device_id.as_uint64 = 0;
- device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
- device_id.ioapic.ioapic_id = (u8)ioapic_id;
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+ hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_unmap_interrupt(device_id.as_uint64, entry);
+ return hv_unmap_interrupt(hv_devid.as_uint64, entry);
}
EXPORT_SYMBOL_GPL(hv_unmap_ioapic_interrupt);
int hv_map_ioapic_interrupt(int ioapic_id, bool level, int cpu, int vector,
struct hv_interrupt_entry *entry)
{
- union hv_device_id device_id;
+ union hv_device_id hv_devid;
- device_id.as_uint64 = 0;
- device_id.device_type = HV_DEVICE_TYPE_IOAPIC;
- device_id.ioapic.ioapic_id = (u8)ioapic_id;
+ hv_devid.as_uint64 = 0;
+ hv_devid.device_type = HV_DEVICE_TYPE_IOAPIC;
+ hv_devid.ioapic.ioapic_id = (u8)ioapic_id;
- return hv_map_interrupt(device_id, level, cpu, vector, entry);
+ return hv_map_interrupt(hv_devid, level, cpu, vector, entry);
}
EXPORT_SYMBOL_GPL(hv_map_ioapic_interrupt);
--
2.51.2.vfs.0.1
* Re: [PATCH] x86/hyperv: Cosmetic changes in irqdomain.c for readability
From: sashiko-bot @ 2026-06-01 22:59 UTC (permalink / raw)
To: Mukesh R; +Cc: linux-hyperv
In-Reply-To: <20260601225116.956392-1-mrathor@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
Pre-existing issues:
- [Critical] Allocation or mapping failure in `hv_irq_compose_msi_msg()` causes the function to return without initializing the caller-provided `msg` structure, potentially leading to uncontrolled MSI writes by the device.
- [High] The `hv_msi_domain_free` function fails to iterate through allocated IRQs for multi-vector MSI, leading to leaked memory and stale hypervisor mappings.
--
commit 042e553a37dead023377268bdb9c0d43c5394d75
Author: Mukesh R <mrathor@linux.microsoft.com>
x86/hyperv: Cosmetic changes in irqdomain.c for readability
This commit makes various cosmetic changes to irqdomain.c, such as renaming
variables, fixing indentation, replacing spaces with tabs, and resolving
checkpatch warnings, with no functional changes intended.
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index 365e364268d9a..b3ad50a874dca 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
[ ... ]
> @@ -249,10 +253,8 @@ static void hv_irq_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> }
>
> stored_entry = kzalloc_obj(*stored_entry, GFP_ATOMIC);
> - if (!stored_entry) {
> - pr_debug("%s: failed to allocate chip data\n", __func__);
> + if (!stored_entry)
> return;
> - }
>
> ret = hv_map_msi_interrupt(data, stored_entry);
> if (ret) {
> kfree(stored_entry);
> return;
> }
[Severity: High]
This is a pre-existing issue, but does returning early here leave the
caller-provided msg structure uninitialized?
If kzalloc_obj() or hv_map_msi_interrupt() fails in hv_irq_compose_msi_msg(),
the function returns without populating msg. Because the callback returns
void, the error is swallowed and the caller might write the uninitialized
contents of msg to the device's MSI address/data registers, potentially
causing the device to perform DMA writes to arbitrary or null physical
addresses when an interrupt fires.
[ ... ]
> -static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq, unsigned int nr_irqs)
> +static void hv_msi_domain_free(struct irq_domain *d, unsigned int virq,
> + unsigned int nr_irqs)
> {
> for (int i = 0; i < nr_irqs; ++i) {
> struct irq_data *irqd = irq_domain_get_irq_data(d, virq);
[Severity: High]
This is a pre-existing issue, but does this loop process the same base IRQ
on every iteration?
By calling irq_domain_get_irq_data() with virq rather than virq + i inside
hv_msi_domain_free(), subsequent iterations will skip the remaining
interrupts in a multi-vector MSI scenario. This could leave them unfreed
and potentially lead to leaked kernel memory and stale hypervisor mappings.
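
The loop shape in question can be illustrated with a small userspace sketch. This is not the irqdomain code. The mapped[] array and the functions map_range(), free_range_base_only(), free_range_each() and count_mapped() are hypothetical stand-ins for per-IRQ state and its teardown. The only point is the difference between indexing with the base and indexing with base + i.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS 8

/* Hypothetical per-IRQ state: true while a slot still holds a mapping. */
static bool mapped[NR_SLOTS];

void map_range(unsigned int base, unsigned int nr)
{
	assert(base + nr <= NR_SLOTS);
	for (unsigned int i = 0; i < nr; ++i)
		mapped[base + i] = true;
}

/* Shape flagged in the review: every iteration touches the base slot. */
void free_range_base_only(unsigned int base, unsigned int nr)
{
	assert(base + nr <= NR_SLOTS);
	for (unsigned int i = 0; i < nr; ++i)
		mapped[base] = false;
}

/* Shape that visits each slot in the range exactly once. */
void free_range_each(unsigned int base, unsigned int nr)
{
	assert(base + nr <= NR_SLOTS);
	for (unsigned int i = 0; i < nr; ++i)
		mapped[base + i] = false;
}

/* Count the slots that still hold a mapping. */
unsigned int count_mapped(void)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < NR_SLOTS; ++i)
		n += mapped[i];
	return n;
}
```

After mapping three slots, the base-only variant leaves two of them mapped, while the per-index variant clears all three.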
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260601225116.956392-1-mrathor@linux.microsoft.com?part=1
* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Borislav Petkov @ 2026-06-02 3:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>
On Fri, May 29, 2026 at 07:43:49AM -0700, Sean Christopherson wrote:
> +static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
> +{
> + unsigned int ecx_hz, edx;
> +
> + memset(info, 0, sizeof(*info));
Let's not clear this unnecessarily...
> +
> + if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
> + return -ENOENT;
... just to return here...
> +
> + /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> + cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
> +
> + if (!info->denominator || !info->numerator)
> + return -ENOENT;
... or here.
We wanna clear it here, when we'll return success.
> +
> + /*
> + * Note, some CPUs provide the multiplier information, but not the core
Note: some CPUs...
> + * crystal frequency. The multiplier information is still useful for
> + * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
> + */
> + info->crystal_khz = ecx_hz / 1000;
> + return 0;
> +}
> +
> +int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
> +{
> + if (cpuid_get_tsc_info(info) || !info->crystal_khz)
> + return -ENOENT;
> +
> + info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
> + return 0;
> +}
Unused here. Add it with its first user pls.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: sashiko-bot @ 2026-06-02 10:28 UTC (permalink / raw)
To: Shradha Gupta; +Cc: linux-hyperv
In-Reply-To: <20260601102749.1768304-1-shradhagupta@linux.microsoft.com>
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Missing error check for xa_store() leads to resource leak and potential kernel oops on allocation failure.
--
commit 853688f50dc4ca56dd74698a3ecfdfeaf857e716
Author: Shradha Gupta <shradhagupta@linux.microsoft.com>
net: mana: Optimize irq affinity for low vcpu configs
This patch optimizes irq affinity in the mana driver for low vCPU
environments. When the number of allocated irqs is greater than the vCPU
count, it ensures that queue irqs are spread across all available vCPUs
to prevent softirq handling imbalances.
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
[ ... ]
> @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> * first CPU sibling group since they are already affinitized to HWC IRQ
> */
> cpus_read_lock();
> - if (gc->num_msix_usable <= num_online_cpus())
> - skip_first_cpu = true;
> + if (gc->num_msix_usable <= num_online_cpus()) {
> + err = irq_setup(irqs, nvec, gc->numa_node, true);
> + if (err) {
> + cpus_read_unlock();
> + goto free_irq;
[Severity: High]
This isn't a bug introduced by this patch, but looking at the error handling
path for goto free_irq, is there a potential resource leak if xa_store()
fails earlier in the function?
When setting up the irqs in mana_gd_setup_dyn_irqs(), we call xa_store()
with GFP_KERNEL:
xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
However, the return value is not checked. If this allocation fails, the gic
pointer is not saved. If a subsequent operation like irq_setup() fails and
we jump to the free_irq label, the cleanup loop attempts to retrieve it:
gic = xa_load(&gc->irq_contexts, i);
if (WARN_ON(!gic))
continue;
Because the store failed, xa_load() will return NULL, causing the loop
to continue without calling free_irq() or kfree(gic). Could this result
in an un-freed irq handler that might trigger a panic later if the
module is unloaded and the hardware raises an interrupt?
> + }
> + } else {
[ ... ]
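
The failure mode described above can be sketched in userspace, under stated assumptions. This is not the mana driver or the XArray API. The fixed-size table[], table_store(), table_load() and setup_then_unwind() are hypothetical stand-ins, with a full table playing the role of a failed allocation inside the store. The sketch counts objects that the unwind loop can no longer find when the store result is ignored.

```c
#include <stdlib.h>

#define TABLE_CAP 2

/* Hypothetical per-IRQ context object. */
struct ctx { int irq; };

static struct ctx *table[TABLE_CAP];

/* Returns 0 on success, -1 when the entry cannot be stored. */
int table_store(unsigned int idx, struct ctx *c)
{
	if (idx >= TABLE_CAP)
		return -1;
	table[idx] = c;
	return 0;
}

struct ctx *table_load(unsigned int idx)
{
	return idx < TABLE_CAP ? table[idx] : NULL;
}

/*
 * Allocate and store n contexts, then unwind through the table.
 * Returns how many contexts the unwind loop could not find.
 */
int setup_then_unwind(unsigned int n, int check_store)
{
	int lost = 0;

	for (unsigned int i = 0; i < n; i++) {
		struct ctx *c = malloc(sizeof(*c));

		if (!c)
			break;
		c->irq = (int)i;
		if (table_store(i, c) == 0)
			continue;
		if (check_store) {
			free(c);	/* caller still owns c, so release it */
			break;
		}
		lost++;		/* unchecked: nothing refers to c any more */
		free(c);	/* demo only, real code would have leaked it */
	}

	/* Unwind: only entries that made it into the table are found. */
	for (unsigned int i = 0; i < n; i++) {
		struct ctx *c = table_load(i);

		if (!c)
			continue;
		free(c);
		table[i] = NULL;
	}
	return lost;
}
```

With a table capacity of two and four requested contexts, ignoring the store result loses two objects, while checking it and bailing out on the first failure loses none.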
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260601102749.1768304-1-shradhagupta@linux.microsoft.com?part=1
* [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: LeantionX @ 2026-06-02 15:52 UTC (permalink / raw)
To: netdev
Cc: linux-hyperv, kys, haiyangz, wei.liu, decui, longli,
andrew+netdev, kuba, pabeni, edumazet, davem, stable,
linux-kernel, Anton Leontev
From: Anton Leontev <leontyevantony@gmail.com>
netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.

A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).

Map the fragment page on demand with kmap_local_page()/kunmap_local()
instead. Using pfn_to_page() on pb[i].pfn maps exactly the page
described by the page buffer entry. On configurations without HIGHMEM
(amd64, i386 without CONFIG_HIGHMEM) kmap_local_page() reduces to
page_address(), so this is a no-op there.

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
drivers/net/hyperv/netvsc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..6984f6c97257 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <linux/delay.h>
#include <linux/io.h>
#include <linux/slab.h>
@@ -965,11 +966,13 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
}
for (i = 0; i < page_count; i++) {
- char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
+ struct page *page = pfn_to_page(pb[i].pfn);
+ char *src = kmap_local_page(page);
u32 offset = pb[i].offset;
u32 len = pb[i].len;
memcpy(dest, (src + offset), len);
+ kunmap_local(src);
dest += len;
}
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net-next] net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queries
From: Jakub Kicinski @ 2026-06-02 20:21 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, pabeni, kotaranov, horms, dipayanroy, kees,
linux-hyperv, netdev, linux-kernel, linux-rdma
In-Reply-To: <20260528180757.1536640-1-ernis@linux.microsoft.com>
On Thu, 28 May 2026 11:07:51 -0700 Erni Sri Satya Vennela wrote:
> mana_query_link_cfg() sends an HWC command to firmware on every call,
> but the link speed and QoS values it returns only change when the
> driver explicitly calls mana_set_bw_clamp(). This function is called
> not only by userspace via ethtool get_link_ksettings, but also
> periodically by hv_netvsc through netvsc_get_link_ksettings and by
> the sysfs speed_show attribute via dev_attr_show, resulting in
> unnecessary HWC traffic every few minutes.
mana is ops-locked, right? Because you support net shapers.
Could you instead take the netdev_lock() in the work?
It's already held around the user space originated calls.
^ permalink raw reply
* [PATCH net-next v10 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.
This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.
This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:
ethtool --set-priv-flags eth0 full-page-rx on
There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.
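For readers following along, the decision this flag feeds into can be
sketched in userspace C roughly as follows. This is a simplified model,
not the driver code: SKETCH_PAGE_SIZE, SKETCH_RXBUF_PAD (a made-up
padding value), struct sketch_port and both function names are
assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in values for illustration only; the driver uses PAGE_SIZE and
 * MANA_RXBUF_PAD from the kernel headers.
 */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_RXBUF_PAD	320u	/* hypothetical padding value */

enum sketch_priv_flag_bits {
	SKETCH_PRIV_FLAG_FULL_PAGE_RX = 0,
	SKETCH_PRIV_FLAG_MAX,
};

/* Mask covering every known flag bit: bits [SKETCH_PRIV_FLAG_MAX - 1 : 0]. */
#define SKETCH_PRIV_FLAG_MASK	((1u << SKETCH_PRIV_FLAG_MAX) - 1u)

struct sketch_port {
	uint32_t priv_flags;
	bool xdp_attached;
};

/* Return true when the requested flag word contains only known bits. */
bool sketch_priv_flags_valid(uint32_t requested)
{
	return (requested & ~SKETCH_PRIV_FLAG_MASK) == 0;
}

/* Use one RX buffer per page if the user asked for it, if XDP is
 * attached, or if two buffers of this MTU would not fit in one page.
 */
bool sketch_single_rxbuf_per_page(const struct sketch_port *p, uint32_t mtu)
{
	if (p->priv_flags & (1u << SKETCH_PRIV_FLAG_FULL_PAGE_RX))
		return true;

	if (mtu + SKETCH_RXBUF_PAD > SKETCH_PAGE_SIZE / 2 || p->xdp_attached)
		return true;

	return false;
}
```

With these stand-in values, a 1500-byte MTU keeps the fragment path
unless the flag is set, while a 9000-byte MTU always gets a full page.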
This series depends on the following fixes now merged in net-next:
17bfe0a8c014 ("net: mana: add NULL guards in teardown path to prevent panic")
5b05aa36ee24 ("net: mana: skip redundant detach on already-detached port")
Changes in v10:
- Rebased on net-next which now includes the prerequisite fixes.
- Recovery logic in mana_set_priv_flags() leverages the idempotent
mana_detach() from the merged fixes.
Changes in v9:
- Added correct tree.
Changes in v8:
- Fixed queue_reset_work recovery by restoring port_is_up before
scheduling reset so the handler can properly re-attach.
- Simplified "err && schedule_port_reset" to "schedule_port_reset".
Changes in v7:
- Rebased onto net-next.
- Retained private flag approach after David Wei's testing on
Grace (ARM64) confirmed that fragment mode outperforms
full-page mode on other platforms, validating this is a
single-platform workaround rather than a generic issue.
Changes in v6:
- Added missed maintainers.
Changes in v5:
- Split prep refactor into separate patch (patch 1/2)
Changes in v4:
- Dropped the SMBIOS string parsing and added an ethtool priv flag
to reconfigure the queues with full-page RX buffers.
Changes in v3:
- changed u8* to char*
Changes in v2:
- separate reading string index and the string, remove inline.
Dipayaan Roy (2):
net: mana: refactor mana_get_strings() and mana_get_sset_count() to
use switch
net: mana: force full-page RX buffers via ethtool private flag
drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++-
.../ethernet/microsoft/mana/mana_ethtool.c | 178 +++++++++++++++---
include/net/mana/mana.h | 8 +
3 files changed, 177 insertions(+), 31 deletions(-)
--
2.43.0
^ permalink raw reply
* [PATCH net-next v10 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <20260602202801.1873742-1-dipayanroy@linux.microsoft.com>
Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.
No functional change.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
.../ethernet/microsoft/mana/mana_ethtool.c | 75 ++++++++++++-------
1 file changed, 46 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 04350973e19e..7e79681634db 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
struct mana_port_context *apc = netdev_priv(ndev);
unsigned int num_queues = apc->num_queues;
- if (stringset != ETH_SS_STATS)
+ switch (stringset) {
+ case ETH_SS_STATS:
+ return ARRAY_SIZE(mana_eth_stats) +
+ ARRAY_SIZE(mana_phy_stats) +
+ ARRAY_SIZE(mana_hc_stats) +
+ num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+ default:
return -EINVAL;
-
- return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
- num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+ }
}
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
{
- struct mana_port_context *apc = netdev_priv(ndev);
unsigned int num_queues = apc->num_queues;
int i, j;
- if (stringset != ETH_SS_STATS)
- return;
for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
- ethtool_puts(&data, mana_eth_stats[i].name);
+ ethtool_puts(data, mana_eth_stats[i].name);
for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
- ethtool_puts(&data, mana_hc_stats[i].name);
+ ethtool_puts(data, mana_hc_stats[i].name);
for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
- ethtool_puts(&data, mana_phy_stats[i].name);
+ ethtool_puts(data, mana_phy_stats[i].name);
for (i = 0; i < num_queues; i++) {
- ethtool_sprintf(&data, "rx_%d_packets", i);
- ethtool_sprintf(&data, "rx_%d_bytes", i);
- ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
- ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
- ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
- ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+ ethtool_sprintf(data, "rx_%d_packets", i);
+ ethtool_sprintf(data, "rx_%d_bytes", i);
+ ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+ ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+ ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+ ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
- ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+ ethtool_sprintf(data,
+ "rx_%d_coalesced_cqe_%d",
+ i,
+ j + 2);
}
for (i = 0; i < num_queues; i++) {
- ethtool_sprintf(&data, "tx_%d_packets", i);
- ethtool_sprintf(&data, "tx_%d_bytes", i);
- ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
- ethtool_sprintf(&data, "tx_%d_tso_packets", i);
- ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
- ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
- ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
- ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
- ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
- ethtool_sprintf(&data, "tx_%d_csum_partial", i);
- ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+ ethtool_sprintf(data, "tx_%d_packets", i);
+ ethtool_sprintf(data, "tx_%d_bytes", i);
+ ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+ ethtool_sprintf(data, "tx_%d_tso_packets", i);
+ ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+ ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+ ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+ ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+ ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+ ethtool_sprintf(data, "tx_%d_csum_partial", i);
+ ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+ }
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+
+ switch (stringset) {
+ case ETH_SS_STATS:
+ mana_get_strings_stats(apc, &data);
+ break;
+ default:
+ break;
}
}
--
2.43.0
^ permalink raw reply related
* [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-06-02 20:24 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov,
pavan.chebbi
In-Reply-To: <20260602202801.1873742-1-dipayanroy@linux.microsoft.com>
On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause 15-20% throughput
regression under high connection counts (>16 TCP streams).
Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.
Usage:
ethtool --set-priv-flags eth0 full-page-rx on
There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.
The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 22 +++-
.../ethernet/microsoft/mana/mana_ethtool.c | 103 ++++++++++++++++++
include/net/mana/mana.h | 8 ++
3 files changed, 131 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index db14357d3732..447cecfd3f67 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
return va;
}
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+ /* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+ * in the RX refill path (~2kB buffer) can cause significant throughput
+ * regression under high connection counts. Allow user to force one RX
+ * buffer per page via ethtool private flag to bypass the fragment
+ * path.
+ */
+ if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+ return true;
+
+ /* For xdp and jumbo frames make sure only one packet fits per page. */
+ if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+ return true;
+
+ return false;
+}
+
/* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
/* Calculate datasize first (consistent across all cases) */
*datasize = mtu + ETH_HLEN;
- /* For xdp and jumbo frames make sure only one packet fits per page */
- if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+ if (mana_use_single_rxbuf_per_page(apc, mtu)) {
if (mana_xdp_get(apc)) {
*headroom = XDP_PACKET_HEADROOM;
*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 7e79681634db..f22bbb325948 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
};
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+ [MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
static int mana_get_sset_count(struct net_device *ndev, int stringset)
{
struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
ARRAY_SIZE(mana_phy_stats) +
ARRAY_SIZE(mana_hc_stats) +
num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+ case ETH_SS_PRIV_FLAGS:
+ return MANA_PRIV_FLAG_MAX;
+
default:
return -EINVAL;
}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
}
}
+static void mana_get_strings_priv_flags(u8 **data)
+{
+ int i;
+
+ for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+ ethtool_puts(data, mana_priv_flags[i]);
+}
+
static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
{
struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
case ETH_SS_STATS:
mana_get_strings_stats(apc, &data);
break;
+ case ETH_SS_PRIV_FLAGS:
+ mana_get_strings_priv_flags(&data);
+ break;
default:
break;
}
@@ -590,6 +609,88 @@ static int mana_get_link_ksettings(struct net_device *ndev,
return 0;
}
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+
+ return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+ struct mana_port_context *apc = netdev_priv(ndev);
+ u32 changed = apc->priv_flags ^ priv_flags;
+ u32 old_priv_flags = apc->priv_flags;
+ bool schedule_port_reset = false;
+ int err = 0;
+
+ if (!changed)
+ return 0;
+
+ /* Reject unknown bits */
+ if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+ return -EINVAL;
+
+ if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+ apc->priv_flags = priv_flags;
+
+ if (!apc->port_is_up) {
+ /* Port is down, flag updated to apply on next up
+ * so just return.
+ */
+ return 0;
+ }
+
+ /* Pre-allocate buffers to prevent failure in mana_attach
+ * later
+ */
+ err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+ if (err) {
+ netdev_err(ndev,
+ "Insufficient memory for new allocations\n");
+ apc->priv_flags = old_priv_flags;
+ return err;
+ }
+
+ err = mana_detach(ndev, false);
+ if (err) {
+ netdev_err(ndev, "mana_detach failed: %d\n", err);
+ apc->priv_flags = old_priv_flags;
+
+ /* Port is in an inconsistent state. Restore
+ * 'port_is_up' so that queue reset work handler
+ * can properly detach and re-attach.
+ */
+ apc->port_is_up = true;
+ schedule_port_reset = true;
+ goto out;
+ }
+
+ err = mana_attach(ndev);
+ if (err) {
+ netdev_err(ndev, "mana_attach failed: %d\n", err);
+ apc->priv_flags = old_priv_flags;
+
+ /* Restore 'port_is_up' so the reset work handler
+ * can properly detach/attach. Without this,
+ * the handler sees port_is_up=false and skips
+ * queue allocation, leaving the port dead.
+ */
+ apc->port_is_up = true;
+ schedule_port_reset = true;
+ }
+ }
+
+out:
+ mana_pre_dealloc_rxbufs(apc);
+
+ if (schedule_port_reset)
+ queue_work(apc->ac->per_port_queue_reset_wq,
+ &apc->queue_reset_work);
+
+ return err;
+}
+
const struct ethtool_ops mana_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
.get_ethtool_stats = mana_get_ethtool_stats,
@@ -608,4 +709,6 @@ const struct ethtool_ops mana_ethtool_ops = {
.set_ringparam = mana_set_ringparam,
.get_link_ksettings = mana_get_link_ksettings,
.get_link = ethtool_op_get_link,
+ .get_priv_flags = mana_get_priv_flags,
+ .set_priv_flags = mana_set_priv_flags,
};
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index d9c27310fd04..26fd5e041a47 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
TRI_STATE_TRUE = 1
};
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+ MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+ MANA_PRIV_FLAG_MAX,
+};
+
/* Number of entries for hardware indirection table must be in power of 2 */
#define MANA_INDIRECT_TABLE_MAX_SIZE 512
#define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
u32 rxbpre_headroom;
u32 rxbpre_frag_count;
+ u32 priv_flags;
+
struct bpf_prog *bpf_prog;
/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v8 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-06-02 20:42 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260529105739.5fc5fdf5@kernel.org>
On Fri, May 29, 2026 at 10:57:39AM -0700, Jakub Kicinski wrote:
> On Thu, 28 May 2026 19:55:23 -0700 Dipayaan Roy wrote:
> > As the pre-requisite fixes patches are accepted now:
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=17bfe0a8c014
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5b05aa36ee24
> >
> > Can this series be merged now? Let me know if it needs a rebase or
> > anything else.
>
> If there's any dependency, functional or otherwise on the fixes you
> need to wait until they make their way to net-next.
>
> Otherwise you can rebase and repost now.
Hi Jakub,
I see the fixes have landed in net-next, so I rebased this series and posted it as v10.
https://lore.kernel.org/all/20260602202801.1873742-1-dipayanroy@linux.microsoft.com/
Regards
Dipayaan Roy
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH net-next v11 0/6] net: mana: Per-vPort EQ and MSI-X management
From: Long Li @ 2026-06-02 22:48 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
KY Srinivasan, Wei Liu, Dexuan Cui,
shradhagupta@linux.microsoft.com, Simon Horman,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB6683A7B2415BEAF17BD0EB4ECE092@SA1PR21MB6683.namprd21.prod.outlook.com>
>
> > On Fri, 22 May 2026 19:02:50 -0700 Long Li wrote:
> > > The following changes since commit 95fab46aea57d6d7b76b319341acbefe8a9293c8:
> > >
> > > Merge branch
> > > 'net-convert-atm-xdp-af_iucv-l2tp_ppp-rxrpc-tipc-to-getsockopt_iter'
> > > (2026-05-22 11:11:12 -0700)
> > >
> > > are available in the Git repository at:
> > >
> > > https://github.com/longlimsft/linux.git tags/mana-eq-msi-v11
> > >
> > > for you to fetch changes up to a26d11135abba51e81ae8b9689e288718af95088:
> > >
> > > RDMA/mana_ib: Allocate interrupt contexts on EQs (2026-05-22
> > > 20:35:43 +0000)
> >
> > The branch is no good, it needs to be your patches applied on top of a
> > commit already in Linus's tree. The current branch is on top of
> > net-next, RDMA would have to pull in 100s of networking commits together
> > with your changes.
>
> Hi Jakub,
>
> Thanks for looking into this. Since the RDMA patch (patch 6) depends on the
> networking changes in patches 1-5, could this series go through net-next? I've
> verified that the tag pulls cleanly into the latest net-next.
>
> Leon, Jason - could you provide an Acked-by for patch 6 ("RDMA/mana_ib:
> Allocate interrupt contexts on EQs") so it can be taken through the networking
> tree?
>
> Thanks,
> Long
Hi Jakub,
Thank you for the feedback. Since the RDMA patch builds on the networking changes, would it be possible to take this series through net-next? I've confirmed the tag merges cleanly into the current net-next head.
Leon has acked patch 6 on v2 [1]. There are no significant structural changes to the RDMA driver since then - the differences are minor:
Changes in patch 6 from v2 to v11:
- Error handling updated from NULL/-ENOMEM to IS_ERR()/PTR_ERR() for mana_gd_get_gic() return values
- Added mdev->eqs[i]->eq.irq = gic->irq to populate the irq field on all RDMA EQs for consistency with the Ethernet path
- Introduced a separate msi variable instead of modifying spec.eq.msix_index directly
- Commit message updated
The gdma.h changes are identical to v2.
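As background on the first item, the error-pointer convention can be
modeled in userspace C as below. This is only a sketch of the idiom:
the sketch_* helpers mirror what ERR_PTR()/IS_ERR()/PTR_ERR() do in the
kernel, and sketch_get_gic() is a hypothetical stand-in, since the real
signature of mana_gd_get_gic() is not shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Errno values live in the top 4095 addresses, as in the kernel idiom. */
#define SKETCH_MAX_ERRNO	4095

static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_gic {
	int irq;
};

/* Hypothetical lookup: returns a context or an encoded errno, never NULL. */
struct sketch_gic *sketch_get_gic(int msi, bool simulate_failure)
{
	struct sketch_gic *gic;

	if (simulate_failure)
		return sketch_err_ptr(-ENOSPC);

	gic = malloc(sizeof(*gic));
	if (!gic)
		return sketch_err_ptr(-ENOMEM);

	gic->irq = 100 + msi;	/* arbitrary value for the example */
	return gic;
}

/* Caller side: propagate the specific errno instead of assuming -ENOMEM. */
int sketch_setup_eq(int msi, bool simulate_failure, int *irq_out)
{
	struct sketch_gic *gic = sketch_get_gic(msi, simulate_failure);

	if (sketch_is_err(gic))
		return (int)sketch_ptr_err(gic);

	*irq_out = gic->irq;
	free(gic);
	return 0;
}
```

The point of the conversion is that the caller can distinguish failure
causes, which a bare NULL return cannot express.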
[1] https://lore.kernel.org/all/20260304145923.GE12611@unreal/
Thanks,
Long
^ permalink raw reply
* RE: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Haiyang Zhang @ 2026-06-02 22:54 UTC (permalink / raw)
To: LeantionX, netdev@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org, KY Srinivasan, wei.liu@kernel.org,
Dexuan Cui, Long Li, andrew+netdev@lunn.ch, kuba@kernel.org,
pabeni@redhat.com, edumazet@google.com, davem@davemloft.net,
stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>
> -----Original Message-----
> From: LeantionX <leontyevantony@gmail.com>
> Sent: Tuesday, June 2, 2026 11:52 AM
> To: netdev@vger.kernel.org
> Cc: linux-hyperv@vger.kernel.org; KY Srinivasan <kys@microsoft.com>;
> Haiyang Zhang <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; kuba@kernel.org; pabeni@redhat.com;
> edumazet@google.com; davem@davemloft.net; stable@vger.kernel.org; linux-
> kernel@vger.kernel.org; Anton Leontev <leontyevantony@gmail.com>
> Subject: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in
> netvsc_copy_to_send_buf
>
> From: Anton Leontev <leontyevantony@gmail.com>
>
> netvsc_copy_to_send_buf() copies skb fragment pages into the shared
> VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
> x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
> for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
> address that has no kernel page table entry and lies outside the
> kernel direct map, so the subsequent memcpy() faults. As this happens
> on the transmit softirq path, the fault is fatal.
Please include the stack trace in patch description.
> A HIGHMEM fragment reaches this path whenever the page backing an skb
> fragment lives above the LOWMEM boundary, which is common on a 32-bit
> guest with several GiB of RAM (for example when the in-kernel NFS
> server splices page cache pages directly into the reply skb).
>
> Map the fragment page on demand with kmap_local_page()/kunmap_local()
> instead. Using pfn_to_page() on pb[i].pfn maps exactly the page
> described by the page buffer entry. On configurations without HIGHMEM
> (amd64, i386 without CONFIG_HIGHMEM) kmap_local_page() reduces to
> page_address(), so this is a no-op there.
So, on 64bit kernel, it has no performance impact?
Thanks,
- Haiyang
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH net-next v11 0/6] net: mana: Per-vPort EQ and MSI-X management
From: Jakub Kicinski @ 2026-06-03 0:36 UTC (permalink / raw)
To: Long Li
Cc: Konstantin Taranov, David S . Miller, Paolo Abeni, Eric Dumazet,
Andrew Lunn, Jason Gunthorpe, Leon Romanovsky, Haiyang Zhang,
KY Srinivasan, Wei Liu, Dexuan Cui,
shradhagupta@linux.microsoft.com, Simon Horman,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB6683230E973519C12E2AB797CE122@SA1PR21MB6683.namprd21.prod.outlook.com>
On Tue, 2 Jun 2026 22:48:05 +0000 Long Li wrote:
> Changes in patch 6 from v2 to v11:
>
> - Error handling updated from NULL/-ENOMEM to IS_ERR()/PTR_ERR() for mana_gd_get_gic() return values
> - Added mdev->eqs[i]->eq.irq = gic->irq to populate the irq field on all RDMA EQs for consistency with the Ethernet path
> - Introduced a separate msi variable instead of modifying spec.eq.msix_index directly
> - Commit message updated
>
> The gdma.h changes are identical to v2.
Hm, yes, Leon seems to be AFK since May 19th.
Please repost with his tag included, the list of changes you provided
does seem immaterial. I don't want to merge v11 as is, there's a good
chance people marked this thread as ignored by now.
^ permalink raw reply
* Re: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 5:29 UTC (permalink / raw)
To: Haiyang Zhang
Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
KY Srinivasan, wei.liu@kernel.org, Dexuan Cui, Long Li,
andrew+netdev@lunn.ch, kuba@kernel.org, pabeni@redhat.com,
edumazet@google.com, davem@davemloft.net, stable@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <BY1PR21MB38709E89497445EECE3C931DCA122@BY1PR21MB3870.namprd21.prod.outlook.com>
>
>
>
> > -----Original Message-----
> > From: LeantionX <leontyevantony@gmail.com>
> > Sent: Tuesday, June 2, 2026 11:52 AM
> > To: netdev@vger.kernel.org
> > Cc: linux-hyperv@vger.kernel.org; KY Srinivasan <kys@microsoft.com>;
> > Haiyang Zhang <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> > <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> > andrew+netdev@lunn.ch; kuba@kernel.org; pabeni@redhat.com;
> > edumazet@google.com; davem@davemloft.net; stable@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Anton Leontev <leontyevantony@gmail.com>
> > Subject: [EXTERNAL] [PATCH net] hv_netvsc: use kmap_local_page in
> > netvsc_copy_to_send_buf
> >
> > From: Anton Leontev <leontyevantony@gmail.com>
> >
> > netvsc_copy_to_send_buf() copies skb fragment pages into the shared
> > VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
> > x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
> > for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
> > address that has no kernel page table entry and lies outside the
> > kernel direct map, so the subsequent memcpy() faults. As this happens
> > on the transmit softirq path, the fault is fatal.
> Please include the stack trace in patch description.
>
> > A HIGHMEM fragment reaches this path whenever the page backing an skb
> > fragment lives above the LOWMEM boundary, which is common on a 32-bit
> > guest with several GiB of RAM (for example when the in-kernel NFS
> > server splices page cache pages directly into the reply skb).
> >
> > Map the fragment page on demand with kmap_local_page()/kunmap_local()
> > instead. Using pfn_to_page() on pb[i].pfn maps exactly the page
> > described by the page buffer entry. On configurations without HIGHMEM
> > (amd64, i386 without CONFIG_HIGHMEM) kmap_local_page() reduces to
> > page_address(), so this is a no-op there.
>
> So, on 64bit kernel, it has no performance impact?
>
> Thanks,
> - Haiyang
>
Correct. On 64-bit (and any !CONFIG_HIGHMEM config) all pages are
permanently present in the kernel direct map, so kmap_local_page()
folds to page_address() and kunmap_local() is a no-op. The generated
code is therefore equivalent to the previous direct-map access, with
no extra mapping cost on the tx path.
The kmap is only meaningful on 32-bit CONFIG_HIGHMEM, where the
fragment page may live above the LOWMEM boundary and the old
phys_to_virt() result is invalid.
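To make the difference concrete, here is a small userspace model of the
copy loop. It is a sketch under stated assumptions, not kernel code:
the sketch_* names, the four-page "RAM" array and the lowmem boundary
at pfn 2 are invented for the example, and the mapping counter only
imitates the bookkeeping a real temporary mapping would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_NR_PAGES		4u
#define SKETCH_LOWMEM_PAGES	2u	/* pfns below this are "direct mapped" */

static unsigned char sketch_ram[SKETCH_NR_PAGES * SKETCH_PAGE_SIZE];
static int sketch_live_mappings;	/* highmem mappings currently held */

/* Direct-map lookup: only meaningful for lowmem pfns, NULL otherwise. */
unsigned char *sketch_phys_to_virt(uint32_t pfn)
{
	if (pfn >= SKETCH_LOWMEM_PAGES)
		return NULL;
	return sketch_ram + (size_t)pfn * SKETCH_PAGE_SIZE;
}

/* Temporary mapping: valid for any pfn; lowmem needs no bookkeeping. */
unsigned char *sketch_kmap_local(uint32_t pfn)
{
	if (pfn >= SKETCH_LOWMEM_PAGES)
		sketch_live_mappings++;
	return sketch_ram + (size_t)pfn * SKETCH_PAGE_SIZE;
}

/* Release a mapping; a no-op for lowmem addresses. */
void sketch_kunmap_local(const unsigned char *addr)
{
	size_t pfn = (size_t)(addr - sketch_ram) / SKETCH_PAGE_SIZE;

	if (pfn >= SKETCH_LOWMEM_PAGES)
		sketch_live_mappings--;
}

struct sketch_page_buffer {
	uint32_t pfn;
	uint32_t offset;
	uint32_t len;
};

/* Copy each fragment into dest, mapping its page only for the copy. */
size_t sketch_copy_to_send_buf(unsigned char *dest,
			       const struct sketch_page_buffer *pb,
			       size_t page_count)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < page_count; i++) {
		unsigned char *src = sketch_kmap_local(pb[i].pfn);

		memcpy(dest, src + pb[i].offset, pb[i].len);
		sketch_kunmap_local(src);

		dest += pb[i].len;
		total += pb[i].len;
	}

	return total;
}
```

In this model a fragment on pfn 3 cannot be reached through the
direct-map helper at all, while the map/copy/unmap loop handles it and
leaves no mapping held afterwards.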
Thanks,
Anton
^ permalink raw reply
* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-03 7:08 UTC (permalink / raw)
To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260602102811.A83AD1F00893@smtp.kernel.org>
On Tue, Jun 02, 2026 at 10:28:11AM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [High] Missing error check for xa_store() leads to resource leak and potential kernel oops on allocation failure.
> --
>
> commit 853688f50dc4ca56dd74698a3ecfdfeaf857e716
> Author: Shradha Gupta <shradhagupta@linux.microsoft.com>
>
> net: mana: Optimize irq affinity for low vcpu configs
>
> This patch optimizes irq affinity in the mana driver for low vCPU
> environments. When the number of allocated irqs is greater than the vCPU
> count, it ensures that queue irqs are spread across all available vCPUs
> to prevent softirq handling imbalances.
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> [ ... ]
> > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > * first CPU sibling group since they are already affinitized to HWC IRQ
> > */
> > cpus_read_lock();
> > - if (gc->num_msix_usable <= num_online_cpus())
> > - skip_first_cpu = true;
> > + if (gc->num_msix_usable <= num_online_cpus()) {
> > + err = irq_setup(irqs, nvec, gc->numa_node, true);
> > + if (err) {
> > + cpus_read_unlock();
> > + goto free_irq;
>
> [Severity: High]
> This isn't a bug introduced by this patch, but looking at the error handling
> path for goto free_irq, is there a potential resource leak if xa_store()
> fails earlier in the function?
>
> When setting up the irqs in mana_gd_setup_dyn_irqs(), we call xa_store()
> with GFP_KERNEL:
>
> xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
>
> However, the return value is not checked. If this allocation fails, the gic
> pointer is not saved. If a subsequent operation like irq_setup() fails and
> we jump to the free_irq label, the cleanup loop attempts to retrieve it:
>
> gic = xa_load(&gc->irq_contexts, i);
> if (WARN_ON(!gic))
> continue;
>
> Because the store failed, xa_load() will return NULL, causing the loop
> to continue without calling free_irq() or kfree(gic). Could this result
> in an un-freed irq handler that might trigger a panic later if the
> module is unloaded and the hardware raises an interrupt?
>
> > + }
> > + } else {
> [ ... ]
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260601102749.1768304-1-shradhagupta@linux.microsoft.com?part=1
Since this isn't a bug introduced by this patch, I will fix it in a
separate patch. I will submit the fix for this bug by next week.
Thanks,
Shradha.
* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Mike Rapoport @ 2026-06-03 9:29 UTC (permalink / raw)
To: Jork Loeser
Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
Michael Kelley
In-Reply-To: <f9d95fb1-ef3-d4a-19e4-afe7cdde5d1f@linux.microsoft.com>
On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
> On Sun, 31 May 2026, Mike Rapoport wrote:
>
> > > Patch 19: Export kexec_in_progress for modules
> >
> > Isn't there another way to differentiate kexec reboot?
There's the "kexec reboot" string that is passed as the cmd to the
reboot notifier.
Maybe we can turn that into a better-defined API and use it?
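To make the idea concrete, here is a minimal user-space sketch of the check a
reboot notifier callback could perform on its arguments. It assumes the kexec
path invokes the reboot notifier chain with a restart action and the string
"kexec reboot" as cmd; is_kexec_reboot() and MODEL_SYS_RESTART are
hypothetical names used only for illustration, not an existing kernel API.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for the kernel's SYS_RESTART action code (assumption for this model). */
#define MODEL_SYS_RESTART 0x0001UL

/*
 * Hypothetical helper: returns true when the notifier arguments look like a
 * kexec reboot, i.e. a restart action whose cmd string is "kexec reboot".
 * cmd may be NULL for a plain reboot, so it is checked before comparing.
 */
static bool is_kexec_reboot(unsigned long action, const void *cmd)
{
	if (action != MODEL_SYS_RESTART || cmd == NULL)
		return false;

	return strcmp((const char *)cmd, "kexec reboot") == 0;
}
```

Matching on a free-form string is fragile, which is why a better-defined API
would be preferable to open-coding this comparison in each driver.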
> I could not find one, unfortunately.
>
> > Sincerely yours,
> > Mike.
>
> Best,
> Jork
--
Sincerely yours,
Mike.
* Re: [PATCH v4 07/47] x86/tdx: Force TSC frequency with CPUID-based info provided by the TDX-Module
From: Kiryl Shutsemau @ 2026-06-03 10:02 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, David Woodhouse, Tom Lendacky,
Nikunj A Dadhania, David Woodhouse, Michael Kelley,
Thomas Gleixner
In-Reply-To: <20260529144435.704127-8-seanjc@google.com>
On Fri, May 29, 2026 at 07:43:54AM -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly set the TSC frequency to a known
> value, using CPUID-based information, instead of potentially relying on a
> hypervisor-controlled PV routine. For TDX guests, CPUID.0x15 is always
> emulated by the TDX-Module, i.e. the information from CPUID is more
> trustworthy than the information provided by the hypervisor.
Right. EBX is configurable by TD_PARAMS.TSC_FREQUENCY at TD build. The
rest is fixed.
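For reference, a small sketch of how a TSC frequency follows from the three
CPUID.0x15 registers: EAX is the ratio denominator, EBX the numerator, and ECX
the core crystal frequency in Hz. The helper name tsc_khz_from_cpuid15() is
hypothetical, and the arithmetic is a user-space illustration rather than the
kernel's own calibration code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive the TSC frequency in kHz from CPUID.0x15.
 *   eax_denominator: CPUID.0x15.EAX (ratio denominator)
 *   ebx_numerator:   CPUID.0x15.EBX (ratio numerator)
 *   ecx_crystal_hz:  CPUID.0x15.ECX (core crystal clock, in Hz)
 * Returns 0 when the leaf does not enumerate enough information.
 */
static uint64_t tsc_khz_from_cpuid15(uint32_t eax_denominator,
				     uint32_t ebx_numerator,
				     uint32_t ecx_crystal_hz)
{
	if (eax_denominator == 0 || ebx_numerator == 0 || ecx_crystal_hz == 0)
		return 0;

	/* TSC Hz = crystal Hz * numerator / denominator; widen to avoid overflow. */
	return ((uint64_t)ecx_crystal_hz * ebx_numerator / eax_denominator) / 1000;
}
```

With illustrative inputs of a 25 MHz crystal, a denominator of 1 and a
numerator of 100, this yields 2500000 kHz, so only the numerator in EBX
changes the result when the other two registers are fixed.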
> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency. While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that the APIC
> timer runs at the core crystal frequency when the latter is enumerated via
> CPUID:
>
> The APIC timer frequency will be the processor’s bus clock or core
> crystal clock frequency (when TSC/core crystal clock ratio is enumerated
> in CPUID leaf 0x15).
>
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.
Agreed.
> Deliberately leave CPU frequency calibration as is, since the TDX-Module
> doesn't provide any guarantees with respect to CPUID.0x16.
It is fixed to zeros. Sounds like a guarantee to me :P
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Looks sane to me, including your reasoning about tsc_early_khz= in reply
to Sashiko.
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: sashiko-bot @ 2026-06-03 15:53 UTC (permalink / raw)
To: LeantionX; +Cc: linux-hyperv
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Passing a Hyper-V PFN directly to `pfn_to_page()` causes out-of-bounds memory accesses on architectures where `PAGE_SIZE > 4KB` (e.g., ARM64 with 64KB pages), and loses the intra-page chunk offset.
--
commit a844fbf018cd37eedc9a1f0aea79d6235c00bec3
Author: Anton Leontev <leontyevantony@gmail.com>
hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
This commit replaces phys_to_virt() with kmap_local_page() in
netvsc_copy_to_send_buf() to map skb fragment pages. This is intended to
fix a fatal transmit softirq fault on 32-bit x86 configurations with
HIGHMEM, where phys_to_virt() is only valid for LOWMEM addresses.
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 59e95341f9b1e..6984f6c97257d 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
[ ... ]
> @@ -965,11 +966,13 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
> }
>
> for (i = 0; i < page_count; i++) {
> - char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
> + struct page *page = pfn_to_page(pb[i].pfn);
> + char *src = kmap_local_page(page);
[Severity: Critical]
Does this cause out-of-bounds memory accesses on architectures with a native
page size greater than 4KB?
The pb[i].pfn field contains a Hyper-V PFN, which always operates at 4KB
granularity. Passing this directly to pfn_to_page() assumes it is a native
Linux PFN.
On architectures where the native page size is larger than 4KB (such as
ARM64 with 64KB pages), the Hyper-V PFN is significantly larger than the
corresponding native Linux PFN. Using it with pfn_to_page() would return an
invalid struct page pointer.
Calling kmap_local_page() on this invalid pointer in
netvsc_copy_to_send_buf() and subsequently reading from it could lead to
memory corruption, out-of-bounds accesses, or kernel panics.
Could this also lose the native page offset of the original 4KB chunk within
the larger Linux page, resulting in an incorrect read offset during the
subsequent memcpy?
> u32 offset = pb[i].offset;
> u32 len = pb[i].len;
>
> memcpy(dest, (src + offset), len);
> + kunmap_local(src);
> dest += len;
> }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260602155210.90987-1-leontyevanton1995@gmail.com?part=1
* Re: [PATCH net] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 16:29 UTC (permalink / raw)
To: sashiko-reviews; +Cc: linux-hyperv
In-Reply-To: <20260603155344.7CE871F00893@smtp.kernel.org>
You are right, thanks for catching this. pb[i].pfn is a Hyper-V PFN at
4K granularity, so passing it directly to pfn_to_page() is wrong where
PAGE_SIZE > 4K, and it also drops the intra-page offset. Fixed in v2 by
reconstructing the physical address and using phys_to_page() +
offset_in_page(). Built for i386 (HIGHMEM) and arm64 (64K pages).
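The arithmetic behind that change can be modelled in user space as follows.
This is only a sketch of the address math, not the kernel code: struct
native_loc and hv_to_native() are hypothetical names, and the native page
shift is passed in as a parameter instead of coming from PAGE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

#define HV_HYP_PAGE_SHIFT 12	/* Hyper-V PFNs always use 4 KiB granularity */

/* Hypothetical result type: native page frame plus byte offset inside it. */
struct native_loc {
	uint64_t pfn;
	uint32_t offset;
};

/*
 * Convert a Hyper-V PFN and in-chunk offset to a native PFN and in-page
 * offset. native_page_shift is 12 for 4 KiB pages and 16 for 64 KiB pages.
 */
static struct native_loc hv_to_native(uint64_t hv_pfn, uint32_t hv_offset,
				      unsigned int native_page_shift)
{
	struct native_loc loc;
	uint64_t paddr, page_mask;

	assert(native_page_shift >= HV_HYP_PAGE_SHIFT);

	/* Reconstruct the physical address of the 4 KiB chunk first. */
	paddr = hv_pfn << HV_HYP_PAGE_SHIFT;
	page_mask = ((uint64_t)1 << native_page_shift) - 1;

	/* Native frame number, plus where the chunk sits inside that frame. */
	loc.pfn = paddr >> native_page_shift;
	loc.offset = (uint32_t)(paddr & page_mask) + hv_offset;
	return loc;
}
```

For example, Hyper-V PFN 0x12345 with offset 0x10 maps to native PFN 0x1234
and offset 0x5010 with 64K pages, while with 4K pages it stays at PFN 0x12345
and offset 0x10. Using the raw Hyper-V PFN as a native PFN would therefore
point at the wrong frame whenever the two page sizes differ.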
Thanks,
Anton
On Wed, Jun 3, 2026 at 18:53, <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Critical] Passing a Hyper-V PFN directly to `pfn_to_page()` causes out-of-bounds memory accesses on architectures where `PAGE_SIZE > 4KB` (e.g., ARM64 with 64KB pages), and loses the intra-page chunk offset.
> --
>
> commit a844fbf018cd37eedc9a1f0aea79d6235c00bec3
> Author: Anton Leontev <leontyevantony@gmail.com>
>
> hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
>
> This commit replaces phys_to_virt() with kmap_local_page() in
> netvsc_copy_to_send_buf() to map skb fragment pages. This is intended to
> fix a fatal transmit softirq fault on 32-bit x86 configurations with
> HIGHMEM, where phys_to_virt() is only valid for LOWMEM addresses.
>
> > diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> > index 59e95341f9b1e..6984f6c97257d 100644
> > --- a/drivers/net/hyperv/netvsc.c
> > +++ b/drivers/net/hyperv/netvsc.c
>
> [ ... ]
>
> > @@ -965,11 +966,13 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
> > }
> >
> > for (i = 0; i < page_count; i++) {
> > - char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
> > + struct page *page = pfn_to_page(pb[i].pfn);
> > + char *src = kmap_local_page(page);
>
> [Severity: Critical]
> Does this cause out-of-bounds memory accesses on architectures with a native
> page size greater than 4KB?
>
> The pb[i].pfn field contains a Hyper-V PFN, which always operates at 4KB
> granularity. Passing this directly to pfn_to_page() assumes it is a native
> Linux PFN.
>
> On architectures where the native page size is larger than 4KB (such as
> ARM64 with 64KB pages), the Hyper-V PFN is significantly larger than the
> corresponding native Linux PFN. Using it with pfn_to_page() would return an
> invalid struct page pointer.
>
> Calling kmap_local_page() on this invalid pointer in
> netvsc_copy_to_send_buf() and subsequently reading from it could lead to
> memory corruption, out-of-bounds accesses, or kernel panics.
>
> Could this also lose the native page offset of the original 4KB chunk within
> the larger Linux page, resulting in an incorrect read offset during the
> subsequent memcpy?
>
> > u32 offset = pb[i].offset;
> > u32 len = pb[i].len;
> >
> > memcpy(dest, (src + offset), len);
> > + kunmap_local(src);
> > dest += len;
> > }
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260602155210.90987-1-leontyevanton1995@gmail.com?part=1
* [PATCH net v2] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 16:38 UTC (permalink / raw)
To: netdev
Cc: linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba, pabeni,
edumazet, davem, stable, linux-kernel, Anton Leontev
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>
netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.
A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).
pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity. The
physical address is reconstructed first and phys_to_page() is used to
obtain the native struct page, with offset_in_page() added so the
in-page offset stays correct where PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g.
arm64 with 64K pages). The page is then mapped on demand with
kmap_local_page()/kunmap_local(). On !CONFIG_HIGHMEM configs
kmap_local_page() reduces to page_address(), so this is a no-op there.
Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
v2:
- Reconstruct the physical address from the Hyper-V PFN and use
phys_to_page() + offset_in_page() instead of pfn_to_page() on the
raw PFN, correct where PAGE_SIZE > 4K (e.g. arm64 64K pages).
Reported by Haiyang Zhang.
- Built for i386 (CONFIG_HIGHMEM) and arm64 (64K pages).
drivers/net/hyperv/netvsc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..2038d9f5c9f9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <linux/delay.h>
#include <linux/io.h>
#include <linux/slab.h>
@@ -965,11 +966,14 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
}
for (i = 0; i < page_count; i++) {
- char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
- u32 offset = pb[i].offset;
+ phys_addr_t paddr = pb[i].pfn << HV_HYP_PAGE_SHIFT;
+ struct page *page = phys_to_page(paddr);
+ u32 offset = offset_in_page(paddr) + pb[i].offset;
u32 len = pb[i].len;
+ char *src = kmap_local_page(page);
memcpy(dest, (src + offset), len);
+ kunmap_local(src);
dest += len;
}
--
2.43.0