Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH net v3] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-04 16:59 UTC (permalink / raw)
  To: netdev
  Cc: linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba, pabeni,
	edumazet, davem, stable, linux-kernel, Anton Leontev

netvsc_copy_to_send_buf() copies page buffer entries into the VMBus
send buffer using phys_to_virt() on the entry PFN. Entries for the
RNDIS header and the skb linear data come from kmalloc'd memory and
are always in the kernel direct map, but entries for skb fragments
reference page cache or user pages, which on 32-bit x86 with
CONFIG_HIGHMEM=y can live above the LOWMEM boundary. For such a page
phys_to_virt() returns an address outside the direct map and the
subsequent memcpy() faults on the transmit softirq path, which is
fatal.

Map the pages with kmap_local_page() instead, handling two properties
of the page buffer entries:

 - pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity,
   not a native PFN. Reconstruct the physical address first and derive
   the native page from it, so the mapping stays correct where
   PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g. arm64 with 64K pages).

 - Since commit 41a6328b2c55 ("hv_netvsc: Preserve contiguous PFN
   grouping in the page buffer array"), an entry describes a full
   physically contiguous fragment and pb[i].len can exceed PAGE_SIZE,
   while kmap_local_page() maps a single page. Copy page by page,
   splitting at native page boundaries.

The copy path only handles packets smaller than the send section size
(6144 bytes by default); larger packets take the cp_partial path where
only the RNDIS header is copied. So entries here are bounded by the
section size and a copy is split at most once on 4K-page systems. On
!CONFIG_HIGHMEM configs kmap_local_page() folds to page_address() and
no mapping work is added.

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
v3:
 - Copy page by page: since 41a6328b2c55 a pb entry describes a full
   contiguous fragment and pb[i].len can exceed PAGE_SIZE, while
   kmap_local_page() maps a single page. Split copies at native page
   boundaries.
v2:
 - Derive the native page and in-page offset from the physical
   address instead of passing the Hyper-V 4K PFN to pfn_to_page(),
   correct where PAGE_SIZE > 4K (e.g. arm64 64K pages).

I do not have a 32-bit HIGHMEM Hyper-V setup to exercise this path;
testing help from the Hyper-V folks would be much appreciated.
 drivers/net/hyperv/netvsc.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..4d319c50955e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/slab.h>
@@ -965,12 +966,22 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 	}

 	for (i = 0; i < page_count; i++) {
-		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
-		u32 offset = pb[i].offset;
+		phys_addr_t paddr = (pb[i].pfn << HV_HYP_PAGE_SHIFT) +
+				    pb[i].offset;
 		u32 len = pb[i].len;

-		memcpy(dest, (src + offset), len);
-		dest += len;
+		while (len) {
+			struct page *page = phys_to_page(paddr);
+			u32 off = offset_in_page(paddr);
+			u32 chunk = min_t(u32, len, PAGE_SIZE - off);
+			char *src = kmap_local_page(page);
+
+			memcpy(dest, src + off, chunk);
+			kunmap_local(src);
+			dest += chunk;
+			paddr += chunk;
+			len -= chunk;
+		}
 	}

 	if (padding)
-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH net v2] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: sashiko-bot @ 2026-06-04 16:39 UTC (permalink / raw)
  To: Anton Leontev; +Cc: linux-hyperv
In-Reply-To: <20260603163851.18058-1-leontyevantony@gmail.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] The patch replaces `phys_to_virt()` with `kmap_local_page()` in `netvsc_copy_to_send_buf()` to support HIGHMEM SKB fragments. However, `kmap_local_page()` maps exactly one page (e.g., 4096 bytes). SKB fragments (such as those from TCP sends using `sk_page_frag()` or NFS) can be backed by compound pages where the fragment size (`len`) exceeds a single page, or where `offset + len` crosses a page boundary. The `pb` array populated by `init_page_array()` contains exactly one entry per fragment, meaning `pb[i].len` can exceed `PAGE_SIZE`. When `memcpy(dest, src + offset, len)` is called, it will read past the 4KB window mapped by `kmap_local_page()`. On 32-bit HIGHMEM systems, this accesses adjacent unmapped or unrelated fixmap slots, causing a fatal page fault or memory leakage.
--

commit dfe36312b193784583b28a894f895984d7d0ebba
Author: Anton Leontev <leontyevantony@gmail.com>

hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf

This patch replaces phys_to_virt() with kmap_local_page() to support HIGHMEM
SKB fragments that are beyond the LOWMEM boundary on 32-bit systems.

> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index 59e95341f9b1e..2038d9f5c9f9a 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
[ ... ]
> @@ -965,11 +966,14 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
>  	}
>  
>  	for (i = 0; i < page_count; i++) {
> -		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
> -		u32 offset = pb[i].offset;
> +		phys_addr_t paddr = pb[i].pfn << HV_HYP_PAGE_SHIFT;
> +		struct page *page = phys_to_page(paddr);
> +		u32 offset = offset_in_page(paddr) + pb[i].offset;
>  		u32 len = pb[i].len;
> +		char *src = kmap_local_page(page);
>  
>  		memcpy(dest, (src + offset), len);

[Severity: Critical]
Will this memcpy() read beyond the 1-page mapping established by
kmap_local_page()?

If the fragment size (len) plus offset exceeds PAGE_SIZE, it seems memcpy()
will read past the locally mapped window. SKB fragments can be backed by
compound pages where the fragment size exceeds a single page, and
init_page_array() assigns the full fragment size to pb[i].len:

drivers/net/hyperv/netvsc_drv.c:init_page_array() {
	...
	cur_pb->offset = offset_in_hvpage(offset);
	cur_pb->len = skb_frag_size(frag);
	cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
	...
}

Other parts of the subsystem also document that a pb entry can cross page
boundaries:

drivers/net/hyperv/netvsc.c:netvsc_build_mpb_array() {
	...
	/* Each entry in the input "pb" array must describe a contiguous range
	 * of guest physical memory so that the pfns are sequential if the range
	 * crosses a page boundary. */
	...
	for (j = 0; j < HVPFN_UP(offset + len); j++)
		mpb_entry->pfnArray[j] = pb[i].pfn + j;
	...
}

Could this regression cause a fatal page fault on 32-bit HIGHMEM guests when
an SKB fragment crosses a page boundary, or leak unrelated fixmap memory if
the adjacent slot happens to be mapped?

> +		kunmap_local(src);
>  		dest += len;
>  	}
>

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260603163851.18058-1-leontyevantony@gmail.com?part=1

^ permalink raw reply

* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Mike Rapoport @ 2026-06-04 12:17 UTC (permalink / raw)
  To: Jork Loeser
  Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <3197c9c9-9e4f-c592-bb7-ac422f89115@linux.microsoft.com>

On Wed, Jun 03, 2026 at 10:25:58AM -0700, Jork Loeser wrote:
> 
> 
> On Wed, 3 Jun 2026, Mike Rapoport wrote:
> 
> > On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
> > > On Sun, 31 May 2026, Mike Rapoport wrote:
> > > 
> > > > > Patch 19:      Export kexec_in_progress for modules
> > > > 
> > > > Isn't there another way to differentiate kexec reboot?
> > 
> > There's that "kexec reboot" string passed as the cmd to the reboot
> > notifier.
> > Maybe we can make it somehow more well defined API and use it?
> 
> A string? Dear my - the compiler won't flag it on an API change then, not
> ideal clearly. What's wrong with exporting kexec_in_progress()?

The policy in general is avoid exports unless strictly necessary.
A string can be declared as const char *KEXEC_REBOOT = "kexec reboot" and
used in both kexec and mshv. Not ideal, but still better.

No strong feelings from my side, just EXPORT_SYMBOL there felt a bit off.
 
> Best,
> Jork

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Paolo Abeni @ 2026-06-04 10:45 UTC (permalink / raw)
  To: Shradha Gupta, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
	Michael Kelley, Long Li, Yury Norov
  Cc: linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260601102749.1768304-1-shradhagupta@linux.microsoft.com>

On 6/1/26 12:27 PM, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>

Why do you consider this patch a fix? To me is a configuration
improvement and should land on net-next.

> @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +/* should be called with cpus_read_lock() held */

Minor nit: s/should/must/ or just drop the comment, as
`for_each_online_cpu()` usage implies that.

> +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (len == 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}

As this is another heuristic regarding irq spreading, why don't you
implement that inside irq_setup()?

> @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;
> +		}
> +	} else {
> +		/*
> +		 * When num_msix_usable are more than num_online_cpus, our
> +		 * queue IRQs should be equal to num of online vCPUs.
> +		 * We try to make sure queue IRQs spread across all vCPUs.
> +		 * In such a case NUMA or CPU core affinity does not matter.
> +		 * Note: in this case the total mana IRQ should always be
> +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * However, if CPUs went offline since num_msix_usable was
> +		 * computed, queue IRQs will be more than num_online_cpus().
> +		 * In such cases remaining extra IRQs will retain their default
> +		 * affinity.
> +		 */
> +		int first_unassigned = num_online_cpus();
> +		if (nvec > first_unassigned) {

An empty line is needed between the variable declaration and the code.

/P


^ permalink raw reply

* [PATCH v2 3/3] ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing
From: David Woodhouse @ 2026-06-04  9:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Michael Kelley, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Daniel Lezcano, David Woodhouse, Richard Cochran, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm,
	linux-kernel, linux-hyperv, netdev
In-Reply-To: <20260604095755.64849-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

When the system clocksource is kvmclock or Hyper-V (not the TSC
directly), vmclock_get_crosststamp() previously fell through to a
separate get_cycles() call, losing the atomic pairing between the
system time snapshot and the TSC reading.

Now that ktime_get_snapshot_id() populates hw_cycles with the
underlying TSC value for derived clocksources, use it when available.
This gives a perfect (system_time, tsc) pairing for the device time
calculation.

The SUPPORT_KVMCLOCK wrapper is still needed to convert the TSC into
kvmclock nanoseconds for system_counter->cycles, because otherwise
get_device_system_crosststamp() can't interpret the result against
the system clock.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
 drivers/ptp/ptp_vmclock.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index d6a5a533164a..eebdcd5ebc08 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -140,6 +140,10 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
 			if (sts->pre_sts.cs_id == st->cs_id) {
 				cycle = sts->pre_sts.cycles;
 				sts->post_sts = sts->pre_sts;
+			} else if (sts->pre_sts.hw_csid == st->cs_id &&
+				   sts->pre_sts.hw_cycles) {
+				cycle = sts->pre_sts.hw_cycles;
+				sts->post_sts = sts->pre_sts;
 			} else {
 				cycle = get_cycles();
 				ptp_read_system_postts(sts);
-- 
2.54.0

^ permalink raw reply related

* [PATCH v2 0/3] timekeeping: Implement and use read_snapshot() functionality
From: David Woodhouse @ 2026-06-04  9:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Michael Kelley, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Daniel Lezcano, David Woodhouse, Richard Cochran, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm,
	linux-kernel, linux-hyperv, netdev

Commit ca1ec8bfac8c ("timekeeping: Add clocksource read_snapshot() 
method and hw_cycles to snapshot") provided a way for clocksources like
kvmclock and Hyper-V to expose the underlying hardware counter value
from which their clock reading is calculated.

This is useful for consumers like the KVM masterclock (when hosting
nested guests), and vmclock which needs the hardware counter value to
calculate time according to the information it's given.

Now that Thomas has beaten my initial proof of concept¹ into shape and
merged it (thanks) here are the two providers and the trivial consumer
case in the vmclock driver, reworked accordingly.

I will rework the other four KVM master clock patches into my kvmclock 
series² which I'll rebase on top of tip/timers/ptp in due course.

¹ https://lore.kernel.org/all/20260526230635.136914-1-dwmw2@infradead.org/#r
² https://lore.kernel.org/all/20260509224824.3264567-1-dwmw2@infradead.org/

David Woodhouse (3):
      clocksource/hyperv: Implement read_snapshot() for TSC page clocksource
      x86/kvmclock: Implement read_snapshot() for kvmclock clocksource
      ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing

 arch/x86/kernel/kvmclock.c         | 36 +++++++++++++++++++++++++++++-------
 drivers/clocksource/hyperv_timer.c | 37 +++++++++++++++++++++++++++----------
 drivers/ptp/ptp_vmclock.c          |  4 ++++
 3 files changed, 60 insertions(+), 17 deletions(-)

base-commit: ca1ec8bfac8c95d0fed9e3611ea21400d1f37262

^ permalink raw reply

* [PATCH v2 2/3] x86/kvmclock: Implement read_snapshot() for kvmclock clocksource
From: David Woodhouse @ 2026-06-04  9:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Michael Kelley, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Daniel Lezcano, David Woodhouse, Richard Cochran, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm,
	linux-kernel, linux-hyperv, netdev
In-Reply-To: <20260604095755.64849-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Implement the read_snapshot() callback for the kvmclock clocksource.
This returns the kvmclock nanosecond value (for timekeeping) while
also providing the raw TSC value that was used to compute it.

The TSC is read inside the pvclock seqlock-protected region, ensuring
the raw TSC and derived kvmclock value are atomically paired.

This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like the vmclock PTP driver, which currently has to do a separate call
to get_cycles() to obtain a value at *approximately* the same time, to
feed through the vmclock calculation.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
---
 arch/x86/kernel/kvmclock.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index b5991d53fc0e..cb3d0ca1fa22 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -87,6 +87,27 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)
 	return kvm_clock_read();
 }
 
+static u64 kvm_clock_get_cycles_snapshot(struct clocksource *cs,
+					 struct clocksource_hw_snapshot *chs)
+{
+	struct pvclock_vcpu_time_info *src;
+	unsigned version;
+	u64 ret, tsc;
+
+	preempt_disable_notrace();
+	src = this_cpu_pvti();
+	do {
+		version = pvclock_read_begin(src);
+		tsc = rdtsc_ordered();
+		ret = __pvclock_read_cycles(src, tsc);
+	} while (pvclock_read_retry(src, version));
+	preempt_enable_notrace();
+
+	chs->hw_cycles = tsc;
+	chs->hw_csid = CSID_X86_TSC;
+	return ret;
+}
+
 static noinstr u64 kvm_sched_clock_read(void)
 {
 	return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
@@ -156,13 +177,14 @@ static int kvm_cs_enable(struct clocksource *cs)
 }
 
 static struct clocksource kvm_clock = {
-	.name	= "kvm-clock",
-	.read	= kvm_clock_get_cycles,
-	.rating	= 400,
-	.mask	= CLOCKSOURCE_MASK(64),
-	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
-	.id     = CSID_X86_KVM_CLK,
-	.enable	= kvm_cs_enable,
+	.name		= "kvm-clock",
+	.read		= kvm_clock_get_cycles,
+	.read_snapshot	= kvm_clock_get_cycles_snapshot,
+	.rating		= 400,
+	.mask		= CLOCKSOURCE_MASK(64),
+	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
+	.id		= CSID_X86_KVM_CLK,
+	.enable		= kvm_cs_enable,
 };
 
 static void kvm_register_clock(char *txt)
-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 1/3] clocksource/hyperv: Implement read_snapshot() for TSC page clocksource
From: David Woodhouse @ 2026-06-04  9:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Michael Kelley, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Daniel Lezcano, David Woodhouse, Richard Cochran, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, kvm,
	linux-kernel, linux-hyperv, netdev
In-Reply-To: <20260604095755.64849-1-dwmw2@infradead.org>

From: David Woodhouse <dwmw@amazon.co.uk>

Implement the read_snapshot() callback for the Hyper-V TSC page
clocksource. This returns the derived 10MHz reference time (for
timekeeping) while also providing the raw TSC value that was used
to compute it.

When the TSC page is valid, hv_read_tsc_page_tsc() atomically
captures both values from a single RDTSC inside the sequence-counter
protected read. When the TSC page is invalid (sequence == 0), the
hw_csid and hw_cycles are set to zero indicating no value is available.

This enables ktime_get_snapshot_id() to provide the raw TSC to consumers
like KVM's master clock when running nested guests under Hyper-V.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Assisted-by: Kiro:claude-opus-4.6-1m
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
---
 drivers/clocksource/hyperv_timer.c | 37 ++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index e9f5034a1bc8..df567795d175 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -444,6 +444,22 @@ static u64 notrace read_hv_clock_tsc_cs(struct clocksource *arg)
 	return read_hv_clock_tsc();
 }
 
+static u64 notrace read_hv_clock_tsc_cs_snapshot(struct clocksource *arg,
+						  struct clocksource_hw_snapshot *chs)
+{
+	u64 time;
+
+	if (hv_read_tsc_page_tsc(tsc_page, &chs->hw_cycles, &time)) {
+		chs->hw_csid = CSID_X86_TSC;
+	} else {
+		chs->hw_cycles = 0;
+		chs->hw_csid = CSID_GENERIC;
+		time = read_hv_clock_msr();
+	}
+
+	return time;
+}
+
 static u64 noinstr read_hv_sched_clock_tsc(void)
 {
 	return (read_hv_clock_tsc() - hv_sched_clock_offset) *
@@ -492,18 +508,19 @@ static int hv_cs_enable(struct clocksource *cs)
 #endif
 
 static struct clocksource hyperv_cs_tsc = {
-	.name	= "hyperv_clocksource_tsc_page",
-	.rating	= 500,
-	.read	= read_hv_clock_tsc_cs,
-	.mask	= CLOCKSOURCE_MASK(64),
-	.flags	= CLOCK_SOURCE_IS_CONTINUOUS,
-	.suspend= suspend_hv_clock_tsc,
-	.resume	= resume_hv_clock_tsc,
+	.name			= "hyperv_clocksource_tsc_page",
+	.rating			= 500,
+	.read			= read_hv_clock_tsc_cs,
+	.read_snapshot		= read_hv_clock_tsc_cs_snapshot,
+	.mask			= CLOCKSOURCE_MASK(64),
+	.flags			= CLOCK_SOURCE_IS_CONTINUOUS,
+	.suspend		= suspend_hv_clock_tsc,
+	.resume			= resume_hv_clock_tsc,
 #ifdef HAVE_VDSO_CLOCKMODE_HVCLOCK
-	.enable = hv_cs_enable,
-	.vdso_clock_mode = VDSO_CLOCKMODE_HVCLOCK,
+	.enable			= hv_cs_enable,
+	.vdso_clock_mode	= VDSO_CLOCKMODE_HVCLOCK,
 #else
-	.vdso_clock_mode = VDSO_CLOCKMODE_NONE,
+	.vdso_clock_mode	= VDSO_CLOCKMODE_NONE,
 #endif
 };
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH net 1/2] net: mana: initialize gdma queue id to INVALID_QUEUE_ID
From: Aditya Garg @ 2026-06-04  8:01 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
	kees, shacharr, stephen, gargaditya, gargaditya, ssengar,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260604080137.1995269-1-gargaditya@linux.microsoft.com>

mana_gd_create_mana_wq_cq() leaves queue->id as 0 (from kzalloc_obj())
until mana_create_wq_obj() assigns the firmware-returned id. If creation
fails before that, cleanup calls mana_gd_destroy_cq() with id 0, NULLing
gc->cq_table[0] and silently breaking whichever real CQ owns that slot.

Initialize queue->id to INVALID_QUEUE_ID right after allocation, matching
mana_gd_create_eq(). The existing (id >= max_num_cqs) guard then
short-circuits cleanly.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 712a0881d720..9d145517c6dc 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1200,6 +1200,8 @@ int mana_gd_create_mana_wq_cq(struct gdma_dev *gd,
 	if (!queue)
 		return -ENOMEM;
 
+	queue->id = INVALID_QUEUE_ID;
+
 	gmi = &queue->mem_info;
 	err = mana_gd_alloc_memory(gc, spec->queue_size, gmi);
 	if (err) {
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 2/2] net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
From: Aditya Garg @ 2026-06-04  8:01 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
	kees, shacharr, stephen, gargaditya, gargaditya, ssengar,
	linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260604080137.1995269-1-gargaditya@linux.microsoft.com>

mana_create_txq() has several error paths (after mana_alloc_queues() or
mana_create_wq_obj() failure) where tx_qp[i]->tx_object stays as the
INVALID_MANA_HANDLE sentinel set at allocation. mana_destroy_txq() then
unconditionally calls mana_destroy_wq_obj() with (u64)-1, which firmware
rejects and logs an error.

Mirror the RX-side pattern in mana_destroy_rxq() and skip the destroy
when the handle is still INVALID_MANA_HANDLE.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index db14357d3732..4a2cd800111e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2351,7 +2351,8 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 			netif_napi_del_locked(napi);
 			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
-		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
+		if (apc->tx_qp[i]->tx_object != INVALID_MANA_HANDLE)
+			mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
 
 		mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 0/2] net: mana: fix error-path issues in queue setup
From: Aditya Garg @ 2026-06-04  8:01 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy, ernis,
	kees, shacharr, stephen, gargaditya, gargaditya, ssengar,
	linux-hyperv, netdev, linux-kernel

Two error-path fixes in MANA queue setup, both surfaced during Sashiko
AI review of a recently upstreamed patch series.

Patch 1 initializes queue->id to INVALID_QUEUE_ID in
mana_gd_create_mana_wq_cq() so that a CQ creation failure before the
firmware id is assigned does not NULL gc->cq_table[0] and silently
break whichever real CQ owns that slot. This mirrors the existing
pattern in mana_gd_create_eq().

Patch 2 guards mana_destroy_txq()'s call to mana_destroy_wq_obj() with
an INVALID_MANA_HANDLE check, mirroring mana_destroy_rxq(). Without
it, TX setup failures lead to a firmware-rejected destroy of (u64)-1
and a spurious error in dmesg.

Aditya Garg (2):
  net: mana: initialize gdma queue id to INVALID_QUEUE_ID
  net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check

 drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 ++
 drivers/net/ethernet/microsoft/mana/mana_en.c   | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

-- 
2.43.0

^ permalink raw reply

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-04  4:39 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <10988db9-8a8d-4ad0-917e-317dd4b20253@intel.com>

On Wed, Jun 03, 2026 at 02:49:24PM -0700, Jacob Keller wrote:
> On 6/1/2026 3:27 AM, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated is capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vCPUs are so
> > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > much more than their overheads if they were spread across sibling vCPUs.
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU, while some vCPUs have none.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly.
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
> > 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > Reviewed-by: Simon Horman <horms@kernel.org>
> > ---
> > Changes in v3
> >  * Optimize the comments in mana_gd_setup_dyn_irqs()
> >  * add more details in the dev_dbg for extra IRQs 
> > ---
> > Changes in v2
> >  * Removed the unused skip_first_cpu variable
> >  * fixed exit condition in irq_setup_linear() with len == 0
> >  * changed return type of irq_setup_linear() as it will always be 0
> >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> >  * added appropriate comments to indicate expected behaviour when
> >    IRQs are more than or equal to num_online_cpus()
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 60 ++++++++++++++++---
> >  1 file changed, 53 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 712a0881d720..00a28b3ca0a6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> >  	} else {
> >  		/* If dynamic allocation is enabled we have already allocated
> >  		 * hwc msi
> > +		 * Also, we make sure in this case the following is always true
> > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> >  		 */
> >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> >  	}
> > @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +/* should be called with cpus_read_lock() held */
> > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		if (len == 0)
> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +}
> 
> I would find all of this a bit easier to follow if irq_setup_linear()
> and irq_setup() had a mana prefix so it was more obvious these are
> specific to the driver. Of course irq_setup is pre-existing, and its not
> my driver so do as you will :)
> 
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> >  	struct gdma_irq_context *gic;
> > -	bool skip_first_cpu = false;
> >  	int *irqs, irq, err, i;
> >  
> >  	irqs = kmalloc_objs(int, nvec);
> > @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  		return -ENOMEM;
> >  
> >  	/*
> > +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> > +	 * nvec is only Queue IRQ (HWC already setup).
> >  	 * While processing the next pci irq vector, we start with index 1,
> >  	 * as IRQ vector at index 0 is already processed for HWC.
> >  	 * However, the population of irqs array starts with index 0, to be
> > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > -		skip_first_cpu = true;
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > +		if (err) {
> > +			cpus_read_unlock();
> > +			goto free_irq;
> > +		}
> > +	} else {
> > +		/*
> > +		 * When num_msix_usable are more than num_online_cpus, our
> > +		 * queue IRQs should be equal to num of online vCPUs.
> > +		 * We try to make sure queue IRQs spread across all vCPUs.
> > +		 * In such a case NUMA or CPU core affinity does not matter.
> > +		 * Note: in this case the total mana IRQ should always be
> > +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * However, if CPUs went offline since num_msix_usable was
> > +		 * computed, queue IRQs will be more than num_online_cpus().
> > +		 * In such cases remaining extra IRQs will retain their default
> > +		 * affinity.
> > +		 */
> > +		int first_unassigned = num_online_cpus();
> > +		if (nvec > first_unassigned) {
> > +			char buf[32];
> > +
> > +			if (first_unassigned == nvec - 1)
> > +				snprintf(buf, sizeof(buf), "%d",
> > +					 first_unassigned);
> > +			else
> > +				snprintf(buf, sizeof(buf), "%d-%d",
> > +					 first_unassigned, nvec - 1);
> > +
> > +			dev_dbg(&pdev->dev,
> > +				"MANA IRQ indices #%s will retain the default CPU affinity\n", buf);
> > +		}
> >  
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > -	if (err) {
> > -		cpus_read_unlock();
> > -		goto free_irq;
> > +		irq_setup_linear(irqs, nvec);
> 
> irq_setup() doesn't have a driver prefix, but is actually a static
> function in gdma_main.c, so its implementation is specific to this
> driver despite its name.
> 
> So if I understand this change correctly, if the number of usable MSI-X
> vectors is smaller than the number of CPUs, you contineu to use the
> current irq_setup logic.. otherwise you switch to the simpler "linear"
> logic.
> 
> I guess this means the logic and heuristic used in irq_setup() breaks
> down when the number of vectors is large and number of vCPU is small?
> 
> Makes sense.
> 

Hi Jacob,

Yes, that's the right understanding. 
Regarding the function names, let me take that up in a seperate patch to
add prefixes to all such functions.

Thanks.

> >  	}
> >  
> >  	cpus_read_unlock();
> > 
> > base-commit: 8415598365503ced2e3d019491b0a2756c85c494

^ permalink raw reply

* Re: [PATCH v2 3/4] vmbus: use generic driver_override infrastructure
From: sashiko-bot @ 2026-06-04  4:13 UTC (permalink / raw)
  To: Runyu Xiao; +Cc: linux-hyperv
In-Reply-To: <20260604035239.1711889-4-runyu.xiao@seu.edu.cn>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] A time-of-check to time-of-use (TOCTOU) race condition in `vmbus_probe()` allows `drv->probe` to be called with a NULL `dev_id`, leading to a NULL pointer dereference in drivers like `storvsc_probe()`.
--

commit 76d8c5c515ad429741f331c31ec380bf773956cb
Author: Runyu Xiao <runyu.xiao@seu.edu.cn>

vmbus: use generic driver_override infrastructure

This commit switches the vmbus subsystem to use the generic driver core
driver_override infrastructure rather than bus-private storage. It removes
the private driver_override storage and updates the match path to use
device_match_driver_override() for locked reads, preventing the match
logic from inspecting freed memory during concurrent updates.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/

> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index d28ff45d4cfdd..a81e2b0976367 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
[ ... ]
> @@ -708,9 +679,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
>  {
>  	const guid_t *guid = &dev->dev_type;
>  	const struct hv_vmbus_device_id *id;
> +	int ret;
>  
>  	/* When driver_override is set, only bind to the matching driver */
> -	if (dev->driver_override && strcmp(dev->driver_override, drv->name))
> +	ret = device_match_driver_override(&dev->device, &drv->driver);
> +	if (ret == 0)
>  		return NULL;

[Severity: High]
This is a pre-existing issue, but does this behavior still leave a
time-of-check to time-of-use race condition open when called later
during probing?

The driver core __driver_attach() calls driver_match_device() locklessly,
which delegates to vmbus_match() and eventually hv_vmbus_get_id(). If a
matching driver name is written to the driver_override sysfs attribute (or
a matching dynamic ID is present), it returns a match.

Before __driver_attach() acquires the device lock to call
driver_probe_device(), a concurrent sysfs write can clear the
driver_override attribute or remove the dynamic ID.

Inside vmbus_probe(), which executes with the device lock held,
hv_vmbus_get_id() is called again. Because the state was changed
concurrently, it can evaluate to NULL:

drivers/hv/vmbus_drv.c:vmbus_probe() {
    ...
    dev_id = hv_vmbus_get_id(drv, dev);
    if (drv->probe) {
        ret = drv->probe(dev, dev_id);
    ...
}

Since vmbus_probe() lacks a NULL check for dev_id, it directly calls
drv->probe(dev, dev_id) with dev_id equal to NULL.

Sub-drivers like storvsc_probe() then unconditionally dereference dev_id
(for example, checking dev_id->driver_data), which results in a kernel panic.

Could a NULL check for dev_id be added in vmbus_probe() to fully prevent
this race?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604035239.1711889-1-runyu.xiao@seu.edu.cn?part=3

^ permalink raw reply

* Re: [PATCH v2 4/4] cdx: use generic driver_override infrastructure
From: sashiko-bot @ 2026-06-04  4:08 UTC (permalink / raw)
  To: Runyu Xiao; +Cc: linux-hyperv
In-Reply-To: <20260604035239.1711889-5-runyu.xiao@seu.edu.cn>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Stale kernel-doc entry for removed `@driver_override` member in `struct cdx_device`.
--

commit c9a2a632f33a0544d4ef0145d2020070218d6838
Author: Runyu Xiao <runyu.xiao@seu.edu.cn>

cdx: use generic driver_override infrastructure

This commit switches the CDX subsystem to use the generic driver-core
driver_override infrastructure. It removes the private driver_override
storage and uses device_match_driver_override() to prevent races during
bus matching.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/

[ ... ]

> diff --git a/include/linux/cdx/cdx_bus.h b/include/linux/cdx/cdx_bus.h
> index b1ba97f6c9ad6..f1a107b232daa 100644
> --- a/include/linux/cdx/cdx_bus.h
> +++ b/include/linux/cdx/cdx_bus.h
> @@ -165,7 +165,6 @@ struct cdx_device {
>  	bool enabled;
>  	u32 msi_dev_id;
>  	u32 num_msi;
> -	const char *driver_override;

[Severity: Low]
Does this leave a stale kernel-doc entry for @driver_override in the
struct cdx_device documentation block?

The struct documentation still references the removed field:

 * @driver_override: driver name to force a match; do not set directly,
 *                   because core frees it; use driver_set_override() to
 *                   set or clear it.

This might trigger an excess kerneldoc parameter warning during the kernel
build (with W=1).

>  	struct mutex irqchip_lock;
>  	bool msi_write_pending;
>  };

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260604035239.1711889-1-runyu.xiao@seu.edu.cn?part=4

^ permalink raw reply

* [PATCH v2 3/4] vmbus: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

VMBUS devices still keep driver_override in bus-private storage.

The sysfs write side updates that string through driver_set_override(),
which replaces the pointer and frees the old value. However,
driver_match_device() can call into hv_vmbus_get_id() from
__driver_attach() without holding the device lock, and hv_vmbus_get_id()
still dereferences that private pointer directly.

That means a bind/reprobe path can race with a concurrent
driver_override update and make the match logic inspect freed memory.

Switch vmbus to the driver-core driver_override infrastructure. This
removes the private driver_override storage and uses
device_match_driver_override() for the locked read in the match path.

Keep the existing vmbus semantics intact: if driver_override matches but
no dynamic or static device ID matches, continue to return the dummy
vmbus_device_null ID so override-only binding still works as before.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: d765edbb301c ("vmbus: add driver_override support")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/hv/vmbus_drv.c | 36 +++++-------------------------------
include/linux/hyperv.h |  6 ------
 2 files changed, 5 insertions(+), 37 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index d28ff45d4cfd..a81e2b097636 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -538,34 +538,6 @@ static ssize_t device_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(device);
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct hv_device *hv_dev = device_to_hv_device(dev);
-	int ret;
-
-	ret = driver_set_override(dev, &hv_dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct hv_device *hv_dev = device_to_hv_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", hv_dev->driver_override);
-	device_unlock(dev);
-
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 /* Set up per device attributes in /sys/bus/vmbus/devices/<bus device> */
 static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_id.attr,
@@ -596,7 +568,6 @@ static struct attribute *vmbus_dev_attrs[] = {
 	&dev_attr_channel_vp_mapping.attr,
 	&dev_attr_vendor.attr,
 	&dev_attr_device.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 
@@ -708,9 +679,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
 {
 	const guid_t *guid = &dev->dev_type;
 	const struct hv_vmbus_device_id *id;
+	int ret;
 
 	/* When driver_override is set, only bind to the matching driver */
-	if (dev->driver_override && strcmp(dev->driver_override, drv->name))
+	ret = device_match_driver_override(&dev->device, &drv->driver);
+	if (ret == 0)
 		return NULL;
 
 	/* Look at the dynamic ids first, before the static ones */
@@ -719,7 +692,7 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
 		id = hv_vmbus_dev_match(drv->id_table, guid);
 
 	/* driver_override will always match, send a dummy id */
-	if (!id && dev->driver_override)
+	if (!id && ret > 0)
 		id = &vmbus_device_null;
 
 	return id;
@@ -1021,6 +994,7 @@ static const struct dev_pm_ops vmbus_pm = {
 /* The one and only one */
 static const struct bus_type  hv_bus = {
 	.name =		"vmbus",
+	.driver_override = true,
 	.match =		vmbus_match,
 	.shutdown =		vmbus_shutdown,
 	.remove =		vmbus_remove,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 964f1be8150c..f9ede569602d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1272,12 +1272,6 @@ struct hv_device {
 	u16 device_id;
 
 	struct device device;
-	/*
-	 * Driver name to force a match.  Do not set directly, because core
-	 * frees it.  Use driver_set_override() to set or clear it.
-	 */
-	const char *driver_override;
-
 	struct vmbus_channel *channel;
 	struct kset	     *channels_kset;
 	struct device_dma_parameters dma_parms;
-- 
2.34.1

^ permalink raw reply related

* [PATCH v2 4/4] cdx: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

CDX devices still keep driver_override in bus-private storage.

The sysfs write side updates that string through driver_set_override(),
which replaces the pointer and frees the old value. However,
driver_match_device() can call cdx_bus_match() from __driver_attach()
without holding the device lock, and cdx_bus_match() still dereferences
that private pointer directly.

That means the CDX match path can race with a concurrent
driver_override update and compare against freed memory.

Switch CDX to the driver-core driver_override infrastructure. This
removes the private driver_override storage, lets the core provide the
sysfs attribute, and uses device_match_driver_override() for the locked
read in cdx_bus_match().

Preserve the existing CDX override_only semantics: entries marked
override_only still require a matching driver_override, but ordinary ID
matches continue to work unchanged.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: 48a6c7bced2a ("cdx: add device attributes")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/cdx/cdx.c           | 40 +++++--------------------------------
include/linux/cdx/cdx_bus.h |  1 -
 2 files changed, 5 insertions(+), 36 deletions(-)

diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
index 9196dc50a48d..d3d230247262 100644
--- a/drivers/cdx/cdx.c
+++ b/drivers/cdx/cdx.c
@@ -156,8 +156,6 @@ static int cdx_unregister_device(struct device *dev,
 	} else {
 		cdx_destroy_res_attr(cdx_dev, MAX_CDX_DEV_RESOURCES);
 		debugfs_remove_recursive(cdx_dev->debugfs_dir);
-		kfree(cdx_dev->driver_override);
-		cdx_dev->driver_override = NULL;
 	}
 
 	/*
@@ -268,6 +266,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
 	const struct cdx_driver *cdx_drv = to_cdx_driver(drv);
 	const struct cdx_device_id *found_id = NULL;
 	const struct cdx_device_id *ids;
+	int ret;
 
 	if (cdx_dev->is_bus)
 		return false;
@@ -275,7 +274,8 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
 	ids = cdx_drv->match_id_table;
 
 	/* When driver_override is set, only bind to the matching driver */
-	if (cdx_dev->driver_override && strcmp(cdx_dev->driver_override, drv->name))
+	ret = device_match_driver_override(dev, drv);
+	if (ret == 0)
 		return false;
 
 	found_id = cdx_match_id(ids, cdx_dev);
@@ -289,7 +289,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
 		 */
 		if (!found_id->override_only)
 			return true;
-		if (cdx_dev->driver_override)
+		if (ret > 0)
 			return true;
 
 		ids = found_id + 1;
@@ -453,36 +453,6 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(modalias);
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct cdx_device *cdx_dev = to_cdx_device(dev);
-	int ret;
-
-	if (WARN_ON(dev->bus != &cdx_bus_type))
-		return -EINVAL;
-
-	ret = driver_set_override(dev, &cdx_dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct cdx_device *cdx_dev = to_cdx_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", cdx_dev->driver_override);
-	device_unlock(dev);
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 static ssize_t enable_store(struct device *dev, struct device_attribute *attr,
 			    const char *buf, size_t count)
 {
@@ -552,7 +522,6 @@ static struct attribute *cdx_dev_attrs[] = {
 	&dev_attr_class.attr,
 	&dev_attr_revision.attr,
 	&dev_attr_modalias.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 
@@ -646,6 +615,7 @@ ATTRIBUTE_GROUPS(cdx_bus);
 
 const struct bus_type cdx_bus_type = {
 	.name		= "cdx",
+	.driver_override = true,
 	.match		= cdx_bus_match,
 	.probe		= cdx_probe,
 	.remove		= cdx_remove,
diff --git a/include/linux/cdx/cdx_bus.h b/include/linux/cdx/cdx_bus.h
index b1ba97f6c9ad..f1a107b232da 100644
--- a/include/linux/cdx/cdx_bus.h
+++ b/include/linux/cdx/cdx_bus.h
@@ -165,7 +165,6 @@ struct cdx_device {
 	bool enabled;
 	u32 msi_dev_id;
 	u32 num_msi;
-	const char *driver_override;
 	struct mutex irqchip_lock;
 	bool msi_write_pending;
 };
-- 
2.34.1

^ permalink raw reply related

* [PATCH v2 2/4] rpmsg: core: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

RPMSG still keeps driver_override in bus-private storage.

That private pointer can be updated from the sysfs driver_override
attribute, and also from rpmsg_register_device_override(). Both paths
replace the pointer and can free the old value.

However, driver_match_device() can call rpmsg_dev_match() from
__driver_attach() without holding the device lock, and rpmsg_dev_match()
still dereferences that private pointer directly.

This leaves the match path racing with concurrent driver_override
updates, with the usual risk of comparing against freed memory.

Switch rpmsg to the driver-core driver_override infrastructure. This
removes the private storage, uses device_match_driver_override() for the
locked read in rpmsg_dev_match(), and converts
rpmsg_register_device_override() to device_set_driver_override() so the
in-kernel override path uses the same core-managed storage. With that
storage now owned by struct device, drop the remaining rpmsg transport
release-path frees of rpdev->driver_override as well.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: 39e47767ec9b ("rpmsg: Add driver_override device attribute for rpmsg_device")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/rpmsg/qcom_glink_native.c |  2 --
drivers/rpmsg/rpmsg_core.c        | 41 ++++++--------------------------------
drivers/rpmsg/virtio_rpmsg_bus.c  |  1 -
include/linux/rpmsg.h             |  4 ----
 4 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c
index e7f7831d37f8..11d3007db5cd 100644
--- a/drivers/rpmsg/rpmsg_core.c
+++ b/drivers/rpmsg/rpmsg_core.c
@@ -358,33 +358,6 @@ rpmsg_show_attr(src, src, "0x%x\n");
 rpmsg_show_attr(dst, dst, "0x%x\n");
 rpmsg_show_attr(announce, announce ? "true" : "false", "%s\n");
 
-static ssize_t driver_override_store(struct device *dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
-	int ret;
-
-	ret = driver_set_override(dev, &rpdev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
-	ssize_t len;
-
-	device_lock(dev);
-	len = sysfs_emit(buf, "%s\n", rpdev->driver_override);
-	device_unlock(dev);
-	return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 static ssize_t modalias_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
 {
@@ -405,7 +378,6 @@ static struct attribute *rpmsg_dev_attrs[] = {
 	&dev_attr_dst.attr,
 	&dev_attr_src.attr,
 	&dev_attr_announce.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(rpmsg_dev);
@@ -424,9 +396,11 @@ static int rpmsg_dev_match(struct device *dev, const struct device_driver *drv)
 	const struct rpmsg_driver *rpdrv = to_rpmsg_driver(drv);
 	const struct rpmsg_device_id *ids = rpdrv->id_table;
 	unsigned int i;
+	int ret;
 
-	if (rpdev->driver_override)
-		return !strcmp(rpdev->driver_override, drv->name);
+	ret = device_match_driver_override(dev, drv);
+	if (ret >= 0)
+		return ret;
 
 	if (ids)
 		for (i = 0; ids[i].name[0]; i++)
@@ -533,6 +507,7 @@ static void rpmsg_dev_remove(struct device *dev)
 
 static const struct bus_type rpmsg_bus = {
 	.name		= "rpmsg",
+	.driver_override = true,
 	.match		= rpmsg_dev_match,
 	.dev_groups	= rpmsg_dev_groups,
 	.uevent		= rpmsg_uevent,
@@ -560,9 +535,7 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
 
 	device_initialize(dev);
 	if (driver_override) {
-		ret = driver_set_override(dev, &rpdev->driver_override,
-					  driver_override,
-					  strlen(driver_override));
+		ret = device_set_driver_override(dev, driver_override);
 		if (ret) {
 			dev_err(dev, "device_set_override failed: %d\n", ret);
 			put_device(dev);
@@ -573,8 +546,6 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
 	ret = device_add(dev);
 	if (ret) {
 		dev_err(dev, "device_add failed: %d\n", ret);
-		kfree(rpdev->driver_override);
-		rpdev->driver_override = NULL;
 		put_device(dev);
 	}
 
diff --git a/drivers/rpmsg/qcom_glink_native.c b/drivers/rpmsg/qcom_glink_native.c
index 401a4ece0c97..d9d4468e4cbd 100644
--- a/drivers/rpmsg/qcom_glink_native.c
+++ b/drivers/rpmsg/qcom_glink_native.c
@@ -1626,7 +1626,6 @@ static void qcom_glink_rpdev_release(struct device *dev)
 {
 	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
 
-	kfree(rpdev->driver_override);
 	kfree(rpdev);
 }
 
@@ -1862,7 +1861,6 @@ static void qcom_glink_device_release(struct device *dev)
 
 	/* Release qcom_glink_alloc_channel() reference */
 	kref_put(&channel->refcount, qcom_glink_channel_release);
-	kfree(rpdev->driver_override);
 	kfree(rpdev);
 }
 
diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
index 5ae15111fb4f..1b8bb05924af 100644
--- a/drivers/rpmsg/virtio_rpmsg_bus.c
+++ b/drivers/rpmsg/virtio_rpmsg_bus.c
@@ -374,7 +374,6 @@ static void virtio_rpmsg_release_device(struct device *dev)
 	struct rpmsg_device *rpdev = to_rpmsg_device(dev);
 	struct virtio_rpmsg_channel *vch = to_virtio_rpmsg_channel(rpdev);
 
-	kfree(rpdev->driver_override);
 	kfree(vch);
 }
 
diff --git a/include/linux/rpmsg.h b/include/linux/rpmsg.h
index 83266ce14642..2e40eb54155e 100644
--- a/include/linux/rpmsg.h
+++ b/include/linux/rpmsg.h
@@ -41,9 +41,6 @@ struct rpmsg_channel_info {
  * rpmsg_device - device that belong to the rpmsg bus
  * @dev: the device struct
  * @id: device id (used to match between rpmsg drivers and devices)
- * @driver_override: driver name to force a match; do not set directly,
- *                   because core frees it; use driver_set_override() to
- *                   set or clear it.
  * @src: local address
  * @dst: destination address
  * @ept: the rpmsg endpoint of this channel
@@ -53,7 +50,6 @@ struct rpmsg_channel_info {
 struct rpmsg_device {
 	struct device dev;
 	struct rpmsg_device_id id;
-	const char *driver_override;
 	u32 src;
 	u32 dst;
 	struct rpmsg_endpoint *ept;
-- 
2.34.1

^ permalink raw reply related

* [PATCH v2 1/4] amba: use generic driver_override infrastructure
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao, stable
In-Reply-To: <20260604035239.1711889-1-runyu.xiao@seu.edu.cn>

AMBA devices still keep driver_override in bus-private storage.

The sysfs write side updates that string through driver_set_override(),
which replaces the pointer and frees the old value. However,
driver_match_device() can call amba_match() from __driver_attach()
without holding the device lock, and amba_match() still dereferences
that private pointer directly.

That means a bind/unbind or reprobe path can race with a concurrent
driver_override update and make amba_match() compare against freed
memory.

Fix this by switching AMBA to the driver-core driver_override
infrastructure. This lets the core own the sysfs attribute and storage,
and uses device_match_driver_override() for the locked read in the match
path.

Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/
Fixes: 3cf385713460 ("ARM: 8256/1: driver coamba: add device binding path 'driver_override'")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
drivers/amba/bus.c       | 35 +++++------------------------------
include/linux/amba/bus.h |  5 -----
 2 files changed, 5 insertions(+), 35 deletions(-)

diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 6d479caf89cb..df8333f90906 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -82,33 +82,6 @@ static void amba_put_disable_pclk(struct amba_device *pcdev)
 }
 
 
-static ssize_t driver_override_show(struct device *_dev,
-				    struct device_attribute *attr, char *buf)
-{
-	struct amba_device *dev = to_amba_device(_dev);
-	ssize_t len;
-
-	device_lock(_dev);
-	len = sprintf(buf, "%s\n", dev->driver_override);
-	device_unlock(_dev);
-	return len;
-}
-
-static ssize_t driver_override_store(struct device *_dev,
-				     struct device_attribute *attr,
-				     const char *buf, size_t count)
-{
-	struct amba_device *dev = to_amba_device(_dev);
-	int ret;
-
-	ret = driver_set_override(_dev, &dev->driver_override, buf, count);
-	if (ret)
-		return ret;
-
-	return count;
-}
-static DEVICE_ATTR_RW(driver_override);
-
 #define amba_attr_func(name,fmt,arg...)					\
 static ssize_t name##_show(struct device *_dev,				\
 			   struct device_attribute *attr, char *buf)	\
@@ -126,7 +99,6 @@ amba_attr_func(resource, "\t%016llx\t%016llx\t%016lx\n",
 static struct attribute *amba_dev_attrs[] = {
 	&dev_attr_id.attr,
 	&dev_attr_resource.attr,
-	&dev_attr_driver_override.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(amba_dev);
@@ -209,6 +181,7 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
 {
 	struct amba_device *pcdev = to_amba_device(dev);
 	const struct amba_driver *pcdrv = to_amba_driver(drv);
+	int ret;
 
 	mutex_lock(&pcdev->periphid_lock);
 	if (!pcdev->periphid) {
@@ -230,8 +203,9 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
 	mutex_unlock(&pcdev->periphid_lock);
 
 	/* When driver_override is set, only bind to the matching driver */
-	if (pcdev->driver_override)
-		return !strcmp(pcdev->driver_override, drv->name);
+	ret = device_match_driver_override(dev, drv);
+	if (ret >= 0)
+		return ret;
 
 	return amba_lookup(pcdrv->id_table, pcdev) != NULL;
 }
@@ -435,6 +409,7 @@ static const struct dev_pm_ops amba_pm = {
  */
 const struct bus_type amba_bustype = {
 	.name		= "amba",
+	.driver_override = true,
 	.dev_groups	= amba_dev_groups,
 	.match		= amba_match,
 	.uevent		= amba_uevent,
diff --git a/include/linux/amba/bus.h b/include/linux/amba/bus.h
index 9946276aff73..6c54d5c0d21f 100644
--- a/include/linux/amba/bus.h
+++ b/include/linux/amba/bus.h
@@ -71,11 +71,6 @@ struct amba_device {
 	unsigned int		cid;
 	struct amba_cs_uci_id	uci;
 	unsigned int		irq[AMBA_NR_IRQS];
-	/*
-	 * Driver name to force a match.  Do not set directly, because core
-	 * frees it.  Use driver_set_override() to set or clear it.
-	 */
-	const char		*driver_override;
 };
 
 struct amba_driver {
-- 
2.34.1

^ permalink raw reply related

* [PATCH v2 0/4] Convert remaining buses to generic driver_override handling
From: Runyu Xiao @ 2026-06-04  3:52 UTC (permalink / raw)
  To: gregkh, rafael
  Cc: dakr, driver-core, linux, andersson, mathieu.poirier, kys,
	haiyangz, wei.liu, decui, longli, nipun.gupta, nikhil.agarwal,
	linux-remoteproc, linux-arm-msm, linux-hyperv, linux-kernel,
	jianhao.xu, runyu.xiao
In-Reply-To: <20260602160829.560904-1-runyu.xiao@seu.edu.cn>

This series converts four remaining buses from bus-private
driver_override handling to the generic driver-core infrastructure:

  - AMBA
  - RPMSG
  - VMBUS
  - CDX

These buses still keep private driver_override storage and read it
directly from their match paths. However, bus match() callbacks can be
reached from __driver_attach() without the device lock held, so those
raw reads can race with updates that replace and free the override
string.

The driver core already provides generic driver_override storage and
matching helpers with the required internal locking. Other buses have
already been converted to that model. This series switches the
remaining users above to the same infrastructure by:

  - removing bus-private driver_override storage
  - dropping bus-local driver_override sysfs handling
  - enabling struct bus_type.driver_override
  - using device_match_driver_override() in match paths

Bus-specific behavior is preserved where needed:

  - VMBUS keeps its dummy-id fallback for override-based binding
  - CDX keeps its override_only matching semantics
  - RPMSG converts its in-kernel override registration path to
    device_set_driver_override() and drops the old transport-local
    frees of bus-private override storage

Before preparing this v2 series, I rechecked the affected source paths
against v7.1-rc6. I also reran the existing report-specific no-device
KCSAN stand-ins on a local v7.1-rc6 guest for all four buses. Those
reruns again produced target-stack reports for the corresponding
driver_override update/match paths.

That runtime validation is still stand-in based rather than direct
hardware execution, but it reuses the real driver_set_override() helper
from the running v7.1-rc6 guest kernel and preserves the relevant
patch-local reader/writer contracts and caller chains.

Since v1:
  - reworked the series around the generic driver_override
    infrastructure instead of trying to serialize bus match() with
    device_lock(dev)
  - split the changes by bus
  - preserved VMBUS dummy-id fallback behavior explicitly
  - preserved CDX override_only matching semantics explicitly
  - converted the RPMSG in-kernel override registration path to the
    core helper
  - reran the four report-specific no-device KCSAN stand-ins on a
    local v7.1-rc6 guest and refreshed the validation basis
  - refreshed the commit messages accordingly

Runyu Xiao (4):
  amba: use generic driver_override infrastructure
  rpmsg: core: use generic driver_override infrastructure
  vmbus: use generic driver_override infrastructure
  cdx: use generic driver_override infrastructure

 drivers/amba/bus.c               | 35 +++++--------------------------
 drivers/cdx/cdx.c                | 40 +++++-------------------------------
 drivers/hv/vmbus_drv.c           | 36 +++++---------------------------
 drivers/rpmsg/qcom_glink_native.c |  2 --
 drivers/rpmsg/rpmsg_core.c       | 41 ++++++-------------------------------
 drivers/rpmsg/virtio_rpmsg_bus.c |  1 -
 include/linux/amba/bus.h         |  5 -----
 include/linux/cdx/cdx_bus.h      |  1 -
 include/linux/hyperv.h           |  6 ------
 include/linux/rpmsg.h            |  4 ----
 10 files changed, 21 insertions(+), 150 deletions(-)

-- 
2.34.1

^ permalink raw reply

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Jacob Keller @ 2026-06-03 21:49 UTC (permalink / raw)
  To: Shradha Gupta, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
	Michael Kelley, Long Li, Yury Norov
  Cc: linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260601102749.1768304-1-shradhagupta@linux.microsoft.com>

On 6/1/2026 3:27 AM, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated is capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vCPUs, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vCPUs are so
> few that the softIRQ handling overhead on two IRQs on the same vCPU is
> much more than their overheads if they were spread across sibling vCPUs.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU, while some vCPUs have none.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly.
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Reviewed-by: Simon Horman <horms@kernel.org>
> ---
> Changes in v3
>  * Optimize the comments in mana_gd_setup_dyn_irqs()
>  * add more details in the dev_dbg for extra IRQs 
> ---
> Changes in v2
>  * Removed the unused skip_first_cpu variable
>  * fixed exit condition in irq_setup_linear() with len == 0
>  * changed return type of irq_setup_linear() as it will always be 0
>  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
>  * added appropriate comments to indicate expected behaviour when
>    IRQs are more than or equal to num_online_cpus()
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 60 ++++++++++++++++---
>  1 file changed, 53 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 712a0881d720..00a28b3ca0a6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
>  	} else {
>  		/* If dynamic allocation is enabled we have already allocated
>  		 * hwc msi
> +		 * Also, we make sure in this case the following is always true
> +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
>  		 */
>  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
>  	}
> @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +/* should be called with cpus_read_lock() held */
> +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (len == 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +}

I would find all of this a bit easier to follow if irq_setup_linear()
and irq_setup() had a mana prefix so it was more obvious these are
specific to the driver. Of course irq_setup is pre-existing, and its not
my driver so do as you will :)

> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
>  	struct gdma_irq_context *gic;
> -	bool skip_first_cpu = false;
>  	int *irqs, irq, err, i;
>  
>  	irqs = kmalloc_objs(int, nvec);
> @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  		return -ENOMEM;
>  
>  	/*
> +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> +	 * nvec is only Queue IRQ (HWC already setup).
>  	 * While processing the next pci irq vector, we start with index 1,
>  	 * as IRQ vector at index 0 is already processed for HWC.
>  	 * However, the population of irqs array starts with index 0, to be
> @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> -		skip_first_cpu = true;
> +	if (gc->num_msix_usable <= num_online_cpus()) {
> +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> +		if (err) {
> +			cpus_read_unlock();
> +			goto free_irq;
> +		}
> +	} else {
> +		/*
> +		 * When num_msix_usable are more than num_online_cpus, our
> +		 * queue IRQs should be equal to num of online vCPUs.
> +		 * We try to make sure queue IRQs spread across all vCPUs.
> +		 * In such a case NUMA or CPU core affinity does not matter.
> +		 * Note: in this case the total mana IRQ should always be
> +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * However, if CPUs went offline since num_msix_usable was
> +		 * computed, queue IRQs will be more than num_online_cpus().
> +		 * In such cases remaining extra IRQs will retain their default
> +		 * affinity.
> +		 */
> +		int first_unassigned = num_online_cpus();
> +		if (nvec > first_unassigned) {
> +			char buf[32];
> +
> +			if (first_unassigned == nvec - 1)
> +				snprintf(buf, sizeof(buf), "%d",
> +					 first_unassigned);
> +			else
> +				snprintf(buf, sizeof(buf), "%d-%d",
> +					 first_unassigned, nvec - 1);
> +
> +			dev_dbg(&pdev->dev,
> +				"MANA IRQ indices #%s will retain the default CPU affinity\n", buf);
> +		}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> -	if (err) {
> -		cpus_read_unlock();
> -		goto free_irq;
> +		irq_setup_linear(irqs, nvec);

irq_setup() doesn't have a driver prefix, but is actually a static
function in gdma_main.c, so its implementation is specific to this
driver despite its name.

So if I understand this change correctly, if the number of usable MSI-X
vectors is smaller than the number of CPUs, you contineu to use the
current irq_setup logic.. otherwise you switch to the simpler "linear"
logic.

I guess this means the logic and heuristic used in irq_setup() breaks
down when the number of vectors is large and number of vCPU is small?

Makes sense.

>  	}
>  
>  	cpus_read_unlock();
> 
> base-commit: 8415598365503ced2e3d019491b0a2756c85c494


^ permalink raw reply

* Re: [PATCH net-next v10 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: sashiko-bot @ 2026-06-03 20:28 UTC (permalink / raw)
  To: Dipayaan Roy; +Cc: linux-hyperv, bpf
In-Reply-To: <20260602202801.1873742-3-dipayanroy@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Forcing `apc->port_is_up = true` on the `mana_attach` error path causes a double-destroy of the hardware vport when the interface is later brought down.
- [Medium] Latent bug: Changes to future private flags will be silently ignored if MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF is not simultaneously toggled.
--

commit d7983f6668f209a0b7bfb427c64d5517b326c7b1
Author: Dipayaan Roy <dipayanroy@linux.microsoft.com>

net: mana: force full-page RX buffers via ethtool private flag

Adds an ethtool private flag full-page-rx to bypass page_pool fragment
allocation and force one RX buffer per page. Includes logic in
mana_set_priv_flags to dynamically reconfigure the port on flag change.

> diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> --- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c

[ ... ]

> +static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
> +{
> +	struct mana_port_context *apc = netdev_priv(ndev);
> +	u32 changed = apc->priv_flags ^ priv_flags;
> +	u32 old_priv_flags = apc->priv_flags;
> +	bool schedule_port_reset = false;
> +	int err = 0;
> +
> +	if (!changed)
> +		return 0;
> +
> +	/* Reject unknown bits */
> +	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
> +		return -EINVAL;
> +
> +	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
> +		apc->priv_flags = priv_flags;
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[Severity: Medium]
When additional private flags are added in the future, will changes to those
new flags be silently ignored if MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF isn't
also toggled?

Since this assignment is nested inside the FULL_PAGE_RXBUF check, changing
only a future new flag will bypass this assignment and the function will
return success while discarding the new flag state.

Should this assignment be moved outside the flag-specific conditional block?

> +
> +		if (!apc->port_is_up) {

[ ... ]

> +		err = mana_attach(ndev);
> +		if (err) {
> +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> +			apc->priv_flags = old_priv_flags;
> +
> +			/* Restore 'port_is_up' so the reset work handler
> +			 * can properly detach/attach. Without this,
> +			 * the handler sees port_is_up=false and skips
> +			 * queue allocation, leaving the port dead.
> +			 */
> +			apc->port_is_up = true;
                        ^^^^^^^^^^^^^^^^^^^^^^

[Severity: High]
Does forcing port_is_up to true here cause a double-destroy of the hardware
vport when the interface is later brought down?

If mana_attach fails, it cleans up any partial state by destroying the vport.

Leaving the port artificially marked as up means a subsequent interface down
event or reset worker will see apc->port_is_up as true and call
mana_detach with from_close set to true.

This bypasses the netif_device_present check and triggers a second call to
mana_destroy_vport, which could drop apc->vport_use_count below zero and
send a duplicate deregister message to the firmware for the already-freed
hw_vport_handle.

> +			schedule_port_reset = true;
> +		}
> +	}
> +
> +out:
> +	mana_pre_dealloc_rxbufs(apc);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260602202801.1873742-1-dipayanroy@linux.microsoft.com?part=2

^ permalink raw reply

* Re: [PATCH 0/2] Remove stack ib_udata's
From: Jason Gunthorpe @ 2026-06-03 18:16 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Junxian Huang, Kai Shen, Kalesh AP, Konstantin Taranov,
	Krzysztof Czurylo, Leon Romanovsky, linux-hyperv, linux-rdma,
	Long Li, Michal Kalderon, Nelson Escobar, Satish Kharat,
	Selvin Xavier, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas, Zhu Yanjun
  Cc: Leon Romanovsky, patches
In-Reply-To: <0-v1-922fa8e828ba+f7-ib_udata_stack_jgg@nvidia.com>

On Tue, May 26, 2026 at 01:15:04PM -0300, Jason Gunthorpe wrote:
> Sashiko pointed out these are dangerous, and the create_qp() one is in
> fact a bug. The query_device is just ugly old code.
> 
> Remove the stack ib_udata's from both places.
> 
> Jason Gunthorpe (2):
>   RDMA/core: Don't make a dummy ib_udata on the stack in create_qp
>   RDMA: Update the query_device() op
> 
>  drivers/infiniband/core/core_priv.h           |  2 +-
>  drivers/infiniband/core/device.c              |  3 +--
>  drivers/infiniband/core/ib_core_uverbs.c      | 12 +++++++++++
>  drivers/infiniband/core/rdma_core.h           |  7 +++++++
>  drivers/infiniband/core/uverbs_cmd.c          | 14 +------------
>  drivers/infiniband/core/uverbs_std_types_qp.c |  3 +--
>  drivers/infiniband/core/verbs.c               | 20 ++++++++++---------
>  drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  5 ++++-
>  drivers/infiniband/hw/cxgb4/provider.c        |  8 +++++---
>  drivers/infiniband/hw/erdma/erdma_verbs.c     |  9 +++++++--
>  drivers/infiniband/hw/hns/hns_roce_main.c     |  7 ++++++-
>  drivers/infiniband/hw/ionic/ionic_ibdev.c     |  7 ++++++-
>  drivers/infiniband/hw/irdma/verbs.c           |  8 +++++---
>  drivers/infiniband/hw/mana/main.c             |  7 ++++++-
>  drivers/infiniband/hw/mlx4/main.c             | 13 ++++++------
>  drivers/infiniband/hw/mthca/mthca_provider.c  | 13 +++++++-----
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   |  8 +++++---
>  drivers/infiniband/hw/qedr/verbs.c            |  7 ++++++-
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c  |  8 +++++---
>  .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   |  8 +++++---
>  drivers/infiniband/sw/rdmavt/vt.c             |  9 ++++++---
>  drivers/infiniband/sw/rxe/rxe_verbs.c         | 14 ++++---------
>  drivers/infiniband/sw/siw/siw_verbs.c         |  8 +++++---
>  23 files changed, 124 insertions(+), 76 deletions(-)

Applied to for-next

Jason

^ permalink raw reply

* Re: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
From: Jork Loeser @ 2026-06-03 17:25 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-hyperv, linux-mm, kexec, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Pasha Tatashin, Pratyush Yadav,
	Alexander Graf, Jason Miu, Andrew Morton, David Hildenbrand,
	Muchun Song, Oscar Salvador, Baoquan He, Catalin Marinas,
	Will Deacon, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Kees Cook, Ran Xiaokai,
	Justinien Bouron, Sourabh Jain, Pingfan Liu, Rafael J. Wysocki,
	Mario Limonciello, linux-arm-kernel, x86, linux-kernel,
	Michael Kelley
In-Reply-To: <ah_z3GV55RY3ZnT-@kernel.org>



On Wed, 3 Jun 2026, Mike Rapoport wrote:

> On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
>> On Sun, 31 May 2026, Mike Rapoport wrote:
>>
>>>> Patch 19:      Export kexec_in_progress for modules
>>>
>>> Isn't there another way to differentiate kexec reboot?
>
> There's that "kexec reboot" string passed as the cmd to the reboot
> notifier.
> Maybe we can make it somehow more well defined API and use it?

A string? Dear my - the compiler won't flag it on an API change then, not 
ideal clearly. What's wrong with exporting kexec_in_progress()?

Best,
Jork

^ permalink raw reply

* [PATCH net v2] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 17:25 UTC (permalink / raw)
  To: netdev
  Cc: linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba, pabeni,
	edumazet, davem, stable, linux-kernel, Anton Leontev

netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.

A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).

pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity. The
physical address is reconstructed first and phys_to_page() is used to
obtain the native struct page, with offset_in_page() added so the
in-page offset stays correct where PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g.
arm64 with 64K pages). The page is then mapped on demand with
kmap_local_page()/kunmap_local(). On !CONFIG_HIGHMEM configs
kmap_local_page() reduces to page_address(), so this is a no-op there.

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
v2:
 - Reconstruct the physical address from the Hyper-V PFN and use
   phys_to_page() + offset_in_page() instead of pfn_to_page() on the
   raw PFN, correct where PAGE_SIZE > 4K (e.g. arm64 64K pages).
   Reported by Haiyang Zhang.
 - Built for i386 (CONFIG_HIGHMEM) and arm64 (64K pages).
 drivers/net/hyperv/netvsc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..2038d9f5c9f9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/slab.h>
@@ -965,11 +966,14 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 	}

 	for (i = 0; i < page_count; i++) {
-		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
-		u32 offset = pb[i].offset;
+		phys_addr_t paddr = pb[i].pfn << HV_HYP_PAGE_SHIFT;
+		struct page *page = phys_to_page(paddr);
+		u32 offset = offset_in_page(paddr) + pb[i].offset;
 		u32 len = pb[i].len;
+		char *src = kmap_local_page(page);

 		memcpy(dest, (src + offset), len);
+		kunmap_local(src);
 		dest += len;
 	}

-- 
2.43.0

^ permalink raw reply related

* [PATCH net v2] hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
From: Anton Leontev @ 2026-06-03 16:38 UTC (permalink / raw)
  To: netdev
  Cc: linux-hyperv, haiyangz, kys, wei.liu, decui, longli, kuba, pabeni,
	edumazet, davem, stable, linux-kernel, Anton Leontev
In-Reply-To: <20260602155210.90987-1-leontyevanton1995@gmail.com>

netvsc_copy_to_send_buf() copies skb fragment pages into the shared
VMBus send buffer using phys_to_virt() on the fragment PFN. On 32-bit
x86 with CONFIG_HIGHMEM=y, phys_to_virt() (i.e. __va()) is only valid
for LOWMEM addresses below 896 MiB. For a HIGHMEM page it returns an
address that has no kernel page table entry and lies outside the
kernel direct map, so the subsequent memcpy() faults. As this happens
on the transmit softirq path, the fault is fatal.

A HIGHMEM fragment reaches this path whenever the page backing an skb
fragment lives above the LOWMEM boundary, which is common on a 32-bit
guest with several GiB of RAM (for example when the in-kernel NFS
server splices page cache pages directly into the reply skb).

pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity. The
physical address is reconstructed first and phys_to_page() is used to
obtain the native struct page, with offset_in_page() added so the
in-page offset stays correct where PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g.
arm64 with 64K pages). The page is then mapped on demand with
kmap_local_page()/kunmap_local(). On !CONFIG_HIGHMEM configs
kmap_local_page() reduces to page_address(), so this is a no-op there.

Fixes: c25aaf814a63 ("hyperv: Enable sendbuf mechanism on the send path")
Cc: stable@vger.kernel.org
Signed-off-by: Anton Leontev <leontyevantony@gmail.com>
---
v2:
 - Reconstruct the physical address from the Hyper-V PFN and use
   phys_to_page() + offset_in_page() instead of pfn_to_page() on the
   raw PFN, correct where PAGE_SIZE > 4K (e.g. arm64 64K pages).
   Reported by Haiyang Zhang.
 - Built for i386 (CONFIG_HIGHMEM) and arm64 (64K pages).
 drivers/net/hyperv/netvsc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 59e95341f9b1..2038d9f5c9f9 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -12,6 +12,7 @@
 #include <linux/sched.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/slab.h>
@@ -965,11 +966,14 @@ static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
 	}

 	for (i = 0; i < page_count; i++) {
-		char *src = phys_to_virt(pb[i].pfn << HV_HYP_PAGE_SHIFT);
-		u32 offset = pb[i].offset;
+		phys_addr_t paddr = pb[i].pfn << HV_HYP_PAGE_SHIFT;
+		struct page *page = phys_to_page(paddr);
+		u32 offset = offset_in_page(paddr) + pb[i].offset;
 		u32 len = pb[i].len;
+		char *src = kmap_local_page(page);

 		memcpy(dest, (src + offset), len);
+		kunmap_local(src);
 		dest += len;
 	}

-- 
2.43.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox