Linux-HyperV List


* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Andrew Lunn @ 2026-04-23 16:37 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <aepF3NwyANeklkfD@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

> The root cause is in mana_gd_init_vf_regs(), which computes:
> 
>   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> 
> without validating the offset read from hardware. If the register
> returns a garbage value that is neither within bar 0 bounds nor aligned
> to the 4-byte granularity, the result is an alignment fault.

Is GDMA_REG_SHM_OFFSET special?

What if GDMA_REG_DB_PAGE_SIZE or GDMA_REG_DB_PAGE_OFFSET have returned
garbage? Are you going to die a horrible death as well?

Isn't there a way you can poll the firmware to ask it if it is ready?

And what about the PF case. Can GDMA_PF_REG_SHM_OFF also be garbage?

      Andrew


* [PATCH] mshv: add a missing padding field
From: wei.liu @ 2026-04-23 17:26 UTC (permalink / raw)
  To: Linux on Hyper-V List
  Cc: Wei Liu, Doru Blânzeanu, Magnus Kulke, stable,
	K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Long Li,
	Nuno Das Neves, Roman Kisel, Michael Kelley, Easwar Hariharan,
	open list

From: Wei Liu <wei.liu@kernel.org>

That was missed when importing the header.

Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
Cc: stable@kernel.org
Signed-off-by: Wei Liu <wei.liu@kernel.org>
---
 include/hyperv/hvhdk.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
index 5e83d3714966..ff7ca9ee1bd4 100644
--- a/include/hyperv/hvhdk.h
+++ b/include/hyperv/hvhdk.h
@@ -79,6 +79,7 @@ struct hv_vp_register_page {
 
 		u64 registers[18];
 	};
+	__u8 reserved[8];
 	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
 	union {
 		struct {
-- 
2.43.0



* Re: [PATCH] mshv: add a missing padding field
From: Easwar Hariharan @ 2026-04-23 17:29 UTC (permalink / raw)
  To: wei.liu
  Cc: Linux on Hyper-V List, easwar.hariharan, Doru Blânzeanu,
	Magnus Kulke, stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui,
	Long Li, Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <20260423172625.1189669-2-wei.liu@kernel.org>

On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
> From: Wei Liu <wei.liu@kernel.org>
> 
> That was missed when importing the header.
> 
> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
> Cc: stable@kernel.org
> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> ---
>  include/hyperv/hvhdk.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> index 5e83d3714966..ff7ca9ee1bd4 100644
> --- a/include/hyperv/hvhdk.h
> +++ b/include/hyperv/hvhdk.h
> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
>  
>  		u64 registers[18];
>  	};
> +	__u8 reserved[8];
>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
>  	union {
>  		struct {


This is not a uapi, so why not just use u8 instead of __u8?
Or since it's 8 u8s, a u64?

Thanks,
Easwar (he/him)


* Re: [PATCH] mshv: add a missing padding field
From: Easwar Hariharan @ 2026-04-23 17:32 UTC (permalink / raw)
  To: wei.liu
  Cc: easwar.hariharan, Linux on Hyper-V List, Doru Blânzeanu,
	Magnus Kulke, stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui,
	Long Li, Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <614f1e17-2dba-4529-b067-e1434b74cad8@linux.microsoft.com>

On 4/23/2026 10:29 AM, Easwar Hariharan wrote:
> On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
>> From: Wei Liu <wei.liu@kernel.org>
>>
>> That was missed when importing the header.
>>
>> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
>> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
>> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
>> Cc: stable@kernel.org
>> Signed-off-by: Wei Liu <wei.liu@kernel.org>
>> ---
>>  include/hyperv/hvhdk.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>> index 5e83d3714966..ff7ca9ee1bd4 100644
>> --- a/include/hyperv/hvhdk.h
>> +++ b/include/hyperv/hvhdk.h
>> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
>>  
>>  		u64 registers[18];
>>  	};
>> +	__u8 reserved[8];
>>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
>>  	union {
>>  		struct {
> 
> 
> This is not a uapi, so why not just use u8 instead of __u8?
> Or since it's 8 u8s, a u64?
> 
> Thanks,
> Easwar (he/him)

Hm, occurs to me that this would be used by VMMs, but then the registers
field just above used a u64 instead of a __u64....




* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-23 17:40 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Wednesday, April 15, 2026 8:31 AM
> 
> > From: Michael Kelley <mhklinux@outlook.com> Sent: Wednesday, April 8, 2026 6:54 AM

[snip]

> 
> Another example is: for a Gen2 VM with the below commands:
>    Set-VM -LowMemoryMappedIoSpace 1GB \
>           -VMName decui-u2204-gen2-fb
>    // i.e. the default setting on Azure. Let's ignore CVMs here.

FWIW, I'm seeing that in Gen2 VMs in Azure, the low_mmio_size
is 3 GiB. I'm looking at a D16ds_v5, and a D16lds_v6. The v5 VM
is newly created, while the v6 has been around for a few months.
In a CVM, the low_mmio_size should be 1 GiB. This overall example
is still correct -- it's just the comment that I have doubts about. Or
maybe you are looking at a different VM size that has a different
default?

Some years back, I had gotten into a discussion with Azure about
this size because the swiotlb memory wants to be allocated below
the 4 GiB line, and reserving 3 GiB for low mmio limited the size
of the swiotlb. CVMs were changed to have only 1 GiB for low
mmio because they need a larger swiotlb.


>    Set-VMVideo -VMName decui-u2204-gen2-fb \
>                -HorizontalResolution 4834 \
>                -VerticalResolution 3622 \
>                -ResolutionType Single
> we have:
>     max_fb_size = round_up_to_2MB(4834*3622*4) = 68 MB
>     excess_fb_size = 4MB
>     low_mmio_base = 4GB - 128MB - 4MB * 2
>                   = 4GB - 136 MB = 0xf7800000
>     but 4GB - target_low_mmio_size = 4GB - 1GB, which is
>     smaller than low_mmio_base, so low_mmio_base and
>     fb_mmio_base are both set to 4GB - 1GB = 0xc0000000,
>     and low_mmio_size = 1GB.
>     In this case, we'd like to reserve
>     min(low_mmio_size/2, 128MB) = 128MB for the framebuffer
>     mmio, since the max possible framebuffer so far is 128MB.
> 
> ************************************
> 
> On an ARM64 lab host, I also tested Gen2 VMs (there is no Gen1 VM
> for ARM VMs):
> 
> By default:
>   low_mmio_base = 4GB - 512MB, i.e. 0xe0000000
>   low_mmio_size = 512MB
>   fb_mmio_base = low_mmio_base
>   The default framebuffer size is 3MB
>   (i.e. screen.lfb_size = 3MB) but hyperv_drm:
>   mmio_megabytes = 8 MB, which supports up to 1920 * 1080.
> 
> With the below commands:
>    Set-VM -LowMemoryMappedIoSpace 512MB \
>           -VMName decui-u2204-gen2-fb
>    // the command only accepts a value between 512MB and 3.5GB.
>    Set-VMVideo -VMName decui-u2204-gen2-fb \
>                -HorizontalResolution 4834 \
>                -VerticalResolution 3622 \
>                -ResolutionType Single
> I thought we would have:
>     max_fb_size = round_up_to_2MB(4834*3622*4) = 68 MB
>     excess_fb_size = 4MB
>     low_mmio_base = 4GB - 512MB - 4MB * 2
>                   = 4GB - 520MB
>     fb_mmio_base = low_mmio_base
>     low_mmio_size = 4GB - low_mmio_base = 520MB
> 
>     Since 4GB - target_low_mmio_size = 4GB - 512MB, which is
>     smaller than low_mmio_base, so low_mmio_base and
>     fb_mmio_base would be both set to 4GB - 520MB, and
>     low_mmio_size would be 520MB.
> 
>     However, the actual result is:
>     max_fb_size is indeed 68MB.
>     but fb_mmio_base = low_mmio_base = 4GB - 512MB, and
>     low_mmio_size = 512MB, i.e. the 'excess_fb_size' is not
>     considered on ARM64!
> 
>     In this case, we'd like to reserve
>     min(low_mmio_size/2, 128MB) = 128MB for the framebuffer
>     mmio, since the max possible framebuffer so far is 128MB.
> 
> With the below command:
>    Set-VM -LowMemoryMappedIoSpace 3GB \
>           -VMName decui-u2204-gen2-fb
>    // i.e. the default setting on Azure. Unlike x86-64, an ARM64
>    // VM on Azure has 3GB of mmio below 4GB.

See my previous comment on the same topic. I think arm64
and x86/x64 are the same.

>    Set-VMVideo -VMName decui-u2204-gen2-fb \
>                -HorizontalResolution 4834 \
>                -VerticalResolution 3622 \
>                -ResolutionType Single
> we have:
>     max_fb_size = round_up_to_2MB(4834*3622*4) = 68 MB
>     low_mmio_base = 4GB - 3GB = 1GB = 0x40000000
>     low_mmio_size = 3GB
>     fb_mmio_base = low_mmio_base = 1GB
> 
>     In this case, we'd like to reserve
>     min(low_mmio_size/2, 128MB) = 128MB for the framebuffer
>     mmio, since the max possible framebuffer so far is 128MB.
> 
> ************************************
> 
> To recap, I think the bottom line is:
> 
> a) For Gen2 VMs, we can safely reserve a mmio range starting at
>    sysfb_primary_display.screen.lfb_base with a size of
>    min(low_mmio_size/2, 128MB).
> 
>    If sysfb_primary_display.screen.lfb_base is 0, i.e. in the case
>    of kdump kernel, we use low_mmio_base instead.
>    This should fix the mmio conflict in the kdump kernel.
> 
> b) For Gen1 VMs, let's still only reserve a mmio range starting at
>    4GB - 128MB with a size of 64MB, because when we are in
>    vmbus_reserve_fb(), we still don't know the exact size of the
>    max_fb_size, and we don't want to reserve too much as we would
>    want to reserve some low mmio space for PCI devices with 32-bit
>    BARs (if any).
> 
>    If the user runs Set-VMVideo and needs a framebuffer size
>    bigger than 64MB (IMO this is not a typical scenario in
>    practice), we have to use high mmio for hyperv_drm in the first
>    kernel, and the kdump kernel still suffers from the mmio
>    conflict between hyperv_drm and hv_pci. We encourage Gen1 VM
>    users to upgrade to Gen2 VMs to resolve the issue. Anyway, the
>    mmio conflict is inevitable for Gen1 VMs, if the max required
>    framebuffer size is bigger than 108MB (Note:
>    VTPM_BASE_ADDRESS - (4GB - 128MB) = 109.25MB, and the required
>    framebuffer size is always rounded up to 2MB).

Question about Gen 1 VMs: If the Linux frame buffer driver moves
the frame buffer somewhere other than the default location, and
then the VM does a kexec/kdump, what does the legacy PCI graphic
device BAR report as the frame buffer location? Does it *always*
report 4G-128MB, or does it report the new location? I can run
an experiment to find out, but maybe you've already done so and
not reported that detail here.

Michael
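
The sizing arithmetic quoted above can be checked with a short sketch.
It assumes 4 bytes per pixel and 2 MiB rounding, as in the quoted
examples. The helper names round_up_to_2mb(), max_fb_size() and
fb_reserve_size() are made up for illustration and are not taken from
vmbus_drv.c.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)
#define GB (1024ULL * MB)

/* Round a byte count up to the next multiple of 2 MiB. */
static uint64_t round_up_to_2mb(uint64_t bytes)
{
	return (bytes + 2 * MB - 1) / (2 * MB) * (2 * MB);
}

/* Framebuffer bytes for a 32-bit-per-pixel mode, rounded up to 2 MiB. */
static uint64_t max_fb_size(uint64_t width, uint64_t height)
{
	return round_up_to_2mb(width * height * 4);
}

/* Proposed Gen2 reservation: half of the low MMIO size, capped at 128 MiB. */
static uint64_t fb_reserve_size(uint64_t low_mmio_size)
{
	uint64_t half = low_mmio_size / 2;

	return half < 128 * MB ? half : 128 * MB;
}
```

With these helpers, 4834x3622 gives 68 MiB and 7680x4320 gives 128 MiB,
and any low MMIO size of 256 MiB or more yields a 128 MiB reservation.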


* RE: [PATCH] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Michael Kelley @ 2026-04-23 17:40 UTC (permalink / raw)
  To: Dexuan Cui, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, longli@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	matthew.ruffell@canonical.com, johansen@templeofstupid.com
  Cc: stable@vger.kernel.org
In-Reply-To: <20260416183529.838321-1-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Thursday, April 16, 2026 11:35 AM
> 
> If vmbus_reserve_fb() in the kdump kernel fails to properly reserve the

This problem has wider scope than just kdump. Any kexec'ed kernel would see
the same problem, though kdump is probably the most common case. But the
discussion here, and the mention of kdump in the code comments, should be
adjusted accordingly. 

> framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero [1],
> there is an MMIO conflict between the drivers hyperv_drm and pci-hyperv.

You describe an MMIO "conflict" without giving the details. Is that
intentional to keep the commit message from being too long? It might be
helpful to future readers to say a little more about how PCI devices must not
use MMIO space that the hypervisor has assigned to the frame buffer.

> This is especially an issue if pci-hyperv is built-in and hyperv_drm is
> built as a module. Consequently, the kdump kernel fails to detect PCI
> devices via pci-hyperv, and may fail to mount the root file system,
> which may reside on an NVMe disk.

It might not just be pci-hyperv that conflicts. The recently submitted
dxgkrnl driver also does vmbus_allocate_mmio(), but I haven't looked
at the details of exactly what it is doing.

> 
> On Gen2 VMs, if the screen.lfb_base is 0 in the kdump kernel, fall
> back to the low MMIO base, which should be equal to the framebuffer
> MMIO base (Tested on x64 Windows Server 2016, and on x64 and ARM64 Windows
> Server 2025 and on Azure) [2]. In the first kernel, screen.lfb_base
> is not 0; if the user specifies a high resolution, it's not enough to
> only reserve 8MB: in this case, reserve half of the space below 4GB, but
> cap the reservation to 128MB, which is the required framebuffer size of
> the highest resolution 7680*4320 supported by Hyper-V.

As you noted in the detailed discussion in the other email thread [2],
there's a Gen1 VM case that this patch doesn't fix. For completeness,
perhaps that case should be called out in this commit message.

> 
> Add the cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) check, because a CoCo
> VM (i.e. Confidential VM) on Hyper-V doesn't have any framebuffer
> device, so there is no need to reserve any MMIO for it.
> 
> While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
> the > to >=. Here the 'end' is an inclusive end (typically, it's
> 0xFFFF_FFFF).
> 
> [1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> [2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> CC: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/hv/vmbus_drv.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index f0d0803d1e16..a0b34f9e426a 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -37,6 +37,7 @@
>  #include <linux/dma-map-ops.h>
>  #include <linux/pci.h>
>  #include <linux/export.h>
> +#include <linux/cc_platform.h>
>  #include <clocksource/hyperv_timer.h>
>  #include <asm/mshyperv.h>
>  #include "hyperv_vmbus.h"
> @@ -2327,8 +2328,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
>  		return AE_NO_MEMORY;
> 
>  	/* If this range overlaps the virtual TPM, truncate it. */
> -	if (end > VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
> -		end = VTPM_BASE_ADDRESS;
> +	if (end >= VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
> +		end = VTPM_BASE_ADDRESS - 1;
> 
>  	new_res->name = "hyperv mmio";
>  	new_res->flags = IORESOURCE_MEM;
> @@ -2395,13 +2396,36 @@ static void vmbus_mmio_remove(void)
>  static void __maybe_unused vmbus_reserve_fb(void)
>  {
>  	resource_size_t start = 0, size;
> +	resource_size_t low_mmio_base;
>  	struct pci_dev *pdev;
> 
> +	/* Hyper-V CoCo guests do not have a framebuffer device. */
> +	if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
> +		return;

This test is testing feature "A" (mem encryption) in order to determine
the presence of feature "B" (no framebuffer), because current
configurations happen to always have "A" and "B" at the same time. But
the linkage between the features is tenuous, and if configurations should
change in the future, testing this way could be bogus. It works now, but I'm
leery of depending on the linkage between "A" and "B".

You could set up a "can_have_framebuffer" flag in ms_hyperv_init_platform()
if running in a CVM, and test that flag here. But I'd suggest just dropping
this optimization. CVMs are always Gen2 (and that's not going to change),
so they have plenty of low mmio space. And at the moment, CVMs don't
support PCI devices, so can't encounter a conflict (though conceivably
some new flavor of CVM in the future could support PCI devices).

> +
>  	if (efi_enabled(EFI_BOOT)) {
>  		/* Gen2 VM: get FB base from EFI framebuffer */
>  		if (IS_ENABLED(CONFIG_SYSFB)) {
>  			start = sysfb_primary_display.screen.lfb_base;
>  			size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
> +
> +			low_mmio_base = hyperv_mmio->start;
> +			if (!low_mmio_base || low_mmio_base >= SZ_4G ||
> +			    (start && start < low_mmio_base)) {
> +				pr_warn("Unexpected low mmio base 0x%pa\n", &low_mmio_base);
> +			} else {
> +				/*
> +				 * If the kdump kernel's lfb_base is 0,

As mentioned earlier, this case isn't just kdump kernels.

> +				 * fall back to the low mmio base.
> +				 */
> +				if (!start)
> +					start = low_mmio_base;
> +				/*
> +				 * Reserve half of the space below 4GB for high
> +				 * resolutions, but cap the reservation to 128MB.
> +				 */
> +				size = min((SZ_4G - start) / 2, SZ_128M);
> +			}
>  		}
>  	} else {
>  		/* Gen1 VM: get FB base from PCI */
> @@ -2433,6 +2457,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
>  	 */
>  	for (; !fb_mmio && (size >= 0x100000); size >>= 1)
>  		fb_mmio = __request_region(hyperv_mmio, start, size, fb_mmio_name, 0);

Just above this "for" loop, "start" is tested for 0. This patch eliminates the main
reason start might be 0. But I guess it's still possible that the legacy PCI device BAR
might return 0 for a Gen1 VM? Or you might get 0 if the pr_warn() about low
mmio base is triggered. But I'm thinking maybe a pr_warn() should be done if 
start is zero.

> +
> +	pr_info("hv_mmio=%pR,%pR fb=%pR\n", hyperv_mmio, hyperv_mmio->sibling, fb_mmio);

Outputting the above info is nice!

Michael
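
A sketch of why the quoted VTPM_BASE_ADDRESS hunk matters when 'end' is
inclusive. The constant value 0xfed40000 is an assumption here (it is
consistent with the 109.25 MB figure in the other thread), and
truncate_below_vtpm() is a hypothetical helper, not a function in the
driver.

```c
#include <stdint.h>

/* Assumed value of VTPM_BASE_ADDRESS, for illustration only. */
#define SKETCH_VTPM_BASE 0xfed40000ULL

/*
 * Truncate an inclusive [start, end] range so that it stops just below
 * the vTPM base. Returns the (possibly unchanged) inclusive end.
 */
static uint64_t truncate_below_vtpm(uint64_t start, uint64_t end)
{
	/*
	 * ">=" rather than ">": an inclusive end equal to the vTPM base
	 * still overlaps the vTPM by one byte.
	 */
	if (end >= SKETCH_VTPM_BASE && start < SKETCH_VTPM_BASE)
		end = SKETCH_VTPM_BASE - 1;

	return end;
}
```

With the old "end > VTPM_BASE_ADDRESS" test, a range ending exactly at
the vTPM base would be left untouched, and a truncated range would
still include the first byte of the vTPM.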


* [PATCH] tools/hv: fix parse_ip_val_buffer out-of-bounds write
From: unknownbbqrx @ 2026-04-23 18:06 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: linux-hyperv, linux-kernel, unknownbbqrx


parse_ip_val_buffer() validates the parsed token length against out_len,
but several callers passed MAX_IP_ADDR_SIZE * 2 while the destination
buffers are much smaller stack arrays (e.g. INET6_ADDRSTRLEN).

This can lead to out-of-bounds writes via strcpy() when a long token is
parsed from host-provided IP/subnet strings.

Use size_t for out_len, switch to bounded copy with memcpy() + explicit
NUL termination, and pass the actual destination buffer sizes at all
call sites.

Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>
---
 tools/hv/hv_kvp_daemon.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index c02f8a341..ecf123bce 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -1188,10 +1188,11 @@ static int is_ipv4(char *addr)
 }
 
 static int parse_ip_val_buffer(char *in_buf, int *offset,
-				char *out_buf, int out_len)
+				char *out_buf, size_t out_len)
 {
 	char *x;
 	char *start;
+	size_t copy_len;
 
 	/*
 	 * in_buf has sequence of characters that are separated by
@@ -1214,8 +1215,10 @@ static int parse_ip_val_buffer(char *in_buf, int *offset,
 		while (start[i] == ' ')
 			i++;
 
-		if ((x - start) <= out_len) {
-			strcpy(out_buf, (start + i));
+		copy_len = x - (start + i);
+		if (copy_len < out_len) {
+			memcpy(out_buf, start + i, copy_len);
+			out_buf[copy_len] = '\0';
 			*offset += (x - start) + 1;
 			return 1;
 		}
@@ -1249,7 +1252,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type)
 	memset(addr, 0, sizeof(addr));
 
 	while (parse_ip_val_buffer(ip_string, &offset, addr,
-					(MAX_IP_ADDR_SIZE * 2))) {
+					sizeof(addr))) {
 
 		sub_str[0] = 0;
 		if (is_ipv4(addr)) {
@@ -1374,7 +1377,7 @@ static int process_dns_gateway_nm(FILE *f, char *ip_string, int type,
 		memset(addr, 0, sizeof(addr));
 
 		if (!parse_ip_val_buffer(ip_string, &ip_offset, addr,
-					 (MAX_IP_ADDR_SIZE * 2)))
+					 sizeof(addr)))
 			break;
 
 		ip_ver = ip_version_check(addr);
@@ -1426,12 +1429,11 @@ static int process_ip_string_nm(FILE *f, char *ip_string, char *subnet,
 	memset(subnet_addr, 0, sizeof(subnet_addr));
 
 	while (parse_ip_val_buffer(ip_string, &ip_offset, addr,
-				   (MAX_IP_ADDR_SIZE * 2)) &&
+				   sizeof(addr)) &&
 				   parse_ip_val_buffer(subnet,
-						       &subnet_offset,
-						       subnet_addr,
-						       (MAX_IP_ADDR_SIZE *
-							2))) {
+					       &subnet_offset,
+					       subnet_addr,
+					       sizeof(subnet_addr))) {
 		ip_ver = ip_version_check(addr);
 		if (ip_ver < 0)
 			continue;

base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
prerequisite-patch-id: b61dd51dee390277603975bf729a687113185c3a
prerequisite-patch-id: df28525061dd528875c7c75880b4684d80f4aa7d
prerequisite-patch-id: 64c48c6f2222781631304d9d4d7d1c712c002610
prerequisite-patch-id: 9be258692732026bf560ed9887adbd02a8887263
-- 
2.53.0
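
The bounded-copy pattern the commit message describes can be sketched
in isolation as follows. copy_token() is a hypothetical stand-alone
helper, not the daemon's parse_ip_val_buffer(); it only illustrates
"check length against the real destination size, memcpy(), then
terminate".

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the token [tok, tok + tok_len) into out and NUL-terminate it.
 * Returns 1 on success, 0 if the token plus terminator would not fit
 * into out_len bytes.
 */
static int copy_token(char *out, size_t out_len,
		      const char *tok, size_t tok_len)
{
	/* "<" and not "<=": one byte must remain for the terminator. */
	if (tok_len >= out_len)
		return 0;

	memcpy(out, tok, tok_len);
	out[tok_len] = '\0';
	return 1;
}
```

Callers would pass sizeof() of the actual destination array as out_len,
which is the change the patch makes at each call site.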





* Re: [PATCH net v2] hv_sock: Return -EIO for malformed/short packets
From: patchwork-bot+netdevbpf @ 2026-04-23 18:10 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: kys, haiyangz, wei.liu, longli, sgarzare, davem, edumazet, kuba,
	pabeni, horms, niuxuewei.nxw, linux-hyperv, virtualization,
	netdev, linux-kernel, stable
In-Reply-To: <20260423064811.1371749-1-decui@microsoft.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 22 Apr 2026 23:48:11 -0700 you wrote:
> Commit f63152958994 fixes a regression, however it fails to report an
> error for malformed/short packets -- normally we should never see such
> packets, but let's report an error for them just in case.
> 
> Fixes: f63152958994 ("hv_sock: Report EOF instead of -EIO for FIN")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> 
> [...]

Here is the summary with links:
  - [net,v2] hv_sock: Return -EIO for malformed/short packets
    https://git.kernel.org/netdev/net/c/3d1f20727a63

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] mshv: add a missing padding field
From: Wei Liu @ 2026-04-23 18:14 UTC (permalink / raw)
  To: Easwar Hariharan
  Cc: wei.liu, Linux on Hyper-V List, Doru Blânzeanu, Magnus Kulke,
	stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui, Long Li,
	Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <19a904f4-e26f-4951-85ac-aae537da89cb@linux.microsoft.com>

On Thu, Apr 23, 2026 at 10:32:58AM -0700, Easwar Hariharan wrote:
> On 4/23/2026 10:29 AM, Easwar Hariharan wrote:
> > On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
> >> From: Wei Liu <wei.liu@kernel.org>
> >>
> >> That was missed when importing the header.
> >>
> >> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
> >> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
> >> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
> >> Cc: stable@kernel.org
> >> Signed-off-by: Wei Liu <wei.liu@kernel.org>
> >> ---
> >>  include/hyperv/hvhdk.h | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> >> index 5e83d3714966..ff7ca9ee1bd4 100644
> >> --- a/include/hyperv/hvhdk.h
> >> +++ b/include/hyperv/hvhdk.h
> >> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
> >>  
> >>  		u64 registers[18];
> >>  	};
> >> +	__u8 reserved[8];
> >>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
> >>  	union {
> >>  		struct {
> > 
> > 
> > This is not a uapi, so why not just use u8 instead of __u8?
> > Or since it's 8 u8s, a u64?
> > 
> > Thanks,
> > Easwar (he/him)
> 
> Hm, occurs to me that this would be used by VMMs, but then the registers
> field just above used a u64 instead of a __u64....

I fat-fingered u8 to __u8.  User space code has scripts to massage the
types as needed.

To remain consistent with the existing code, it should be u8.

I can change the type when I commit this.

Wei
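
To illustrate why a missing padding field matters for a layout shared
with the hypervisor, here is a minimal sketch. Both structs are
hypothetical, simplified stand-ins and NOT the real struct
hv_vp_register_page; only the 18-entry register array and the 8-byte
reserved field are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout without the padding field. */
struct sketch_page_without_pad {
	uint64_t registers[18];
	uint64_t xmm[2];	/* stand-in for the fields that follow */
};

/* Hypothetical layout with the 8-byte padding field added. */
struct sketch_page_with_pad {
	uint64_t registers[18];
	uint8_t reserved[8];
	uint64_t xmm[2];	/* stand-in for the fields that follow */
};

/* Without the padding, every later field sits 8 bytes too early. */
_Static_assert(offsetof(struct sketch_page_without_pad, xmm) == 144,
	       "18 u64 registers occupy 144 bytes");
_Static_assert(offsetof(struct sketch_page_with_pad, xmm) == 152,
	       "reserved[8] shifts the following fields by 8 bytes");
```

In this sketch, declaring the field as eight u8s or as a single u64
would give the same offsets, since the preceding array already ends on
an 8-byte boundary.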

> 
> 


* Re: [PATCH] mshv: add a missing padding field
From: Easwar Hariharan @ 2026-04-23 18:16 UTC (permalink / raw)
  To: Wei Liu
  Cc: easwar.hariharan, Linux on Hyper-V List, Doru Blânzeanu,
	Magnus Kulke, stable, K. Y. Srinivasan, Haiyang Zhang, Dexuan Cui,
	Long Li, Nuno Das Neves, Roman Kisel, Michael Kelley, open list
In-Reply-To: <20260423181440.GA1196957@liuwe-devbox-debian-v2.local>

On 4/23/2026 11:14 AM, Wei Liu wrote:
> On Thu, Apr 23, 2026 at 10:32:58AM -0700, Easwar Hariharan wrote:
>> On 4/23/2026 10:29 AM, Easwar Hariharan wrote:
>>> On 4/23/2026 10:26 AM, wei.liu@kernel.org wrote:
>>>> From: Wei Liu <wei.liu@kernel.org>
>>>>
>>>> That was missed when importing the header.
>>>>
>>>> Reported-by: Doru Blânzeanu <dblanzeanu@linux.microsoft.com>
>>>> Reported-by: Magnus Kulke <magnuskulke@linux.microsoft.com>
>>>> Fixes: e68bda71a2384 ("hyperv: Add new Hyper-V headers in include/hyperv")
>>>> Cc: stable@kernel.org
>>>> Signed-off-by: Wei Liu <wei.liu@kernel.org>
>>>> ---
>>>>  include/hyperv/hvhdk.h | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>>>> index 5e83d3714966..ff7ca9ee1bd4 100644
>>>> --- a/include/hyperv/hvhdk.h
>>>> +++ b/include/hyperv/hvhdk.h
>>>> @@ -79,6 +79,7 @@ struct hv_vp_register_page {
>>>>  
>>>>  		u64 registers[18];
>>>>  	};
>>>> +	__u8 reserved[8];
>>>>  	/* Volatile XMM registers (HV_X64_REGISTER_CLASS_XMM) */
>>>>  	union {
>>>>  		struct {
>>>
>>>
>>> This is not a uapi, so why not just use u8 instead of __u8?
>>> Or since it's 8 u8s, a u64?
>>>
>>> Thanks,
>>> Easwar (he/him)
>>
>> Hm, occurs to me that this would be used by VMMs, but then the registers
>> field just above used a u64 instead of a __u64....
> 
> I fat-fingered u8 to __u8.  User space code has scripts to massage the
> types as needed.
> 
> To remain consistent with the existing code, it should be u8.
> 
> I can change the type when I commit this.
> 
> Wei
Thanks, with that fixed:

Reviewed-by: Easwar Hariharan <easwar.hariharan@linux.microsoft.com>


* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Dipayaan Roy @ 2026-04-23 19:14 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <edccaafd-73f3-421d-a48e-a6cb704d39e6@lunn.ch>

On Thu, Apr 23, 2026 at 06:37:04PM +0200, Andrew Lunn wrote:
> > The root cause is in mana_gd_init_vf_regs(), which computes:
> > 
> >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > 
> > without validating the offset read from hardware. If the register
> > returns a garbage value that is neither within bar 0 bounds nor aligned
> > to the 4-byte granularity, the result is an alignment fault.
> 
> Is GDMA_REG_SHM_OFFSET special?
Hi Andrew,
GDMA_REG_SHM_OFFSET is not special. It was simply the only register
read that had no validation at all. The other two registers
(GDMA_REG_DB_PAGE_SIZE, GDMA_REG_DB_PAGE_OFFSET) already have checks
in place. Also, shm_off becomes gc->shm_base (bar0_va + shm_off), and
gc->shm_base is dereferenced via readl() (ldr w1, [x20]) in
mana_smc_poll_register(), which requires 4-byte alignment on arm64
device memory. A misaligned shm_off therefore propagates directly into
a misaligned shm_base and causes an alignment fault (FSC=0x21).
>
> What if GDMA_REG_DB_PAGE_SIZE or GDMA_REG_DB_PAGE_OFFSET have returned
> garbage? Are you going to die a horrible death as well?
Those two already have validation in the current code:

- GDMA_REG_DB_PAGE_SIZE is checked for < SZ_4K (returns -EPROTO)
- GDMA_REG_DB_PAGE_OFFSET is checked for >= bar0_size (returns -EPROTO)

The same checks exist for the PF equivalents (GDMA_PF_REG_DB_PAGE_SIZE
and GDMA_PF_REG_DB_PAGE_OFF) as well.
> 
> Isn't there a way you can poll the firmware to ask it if it is ready?
Unfortunately no, as there is no separate readiness register to
poll.

The existing recovery flow already waits MANA_SERVICE_PERIOD (10
seconds) after suspend before attempting resume. If the registers are
still invalid after that, the -EPROTO triggers a PCI remove/rescan,
which re-probes the device.
> 
> And what about the PF case. Can GDMA_PF_REG_SHM_OFF also be garbage?
Yes. This patch also adds bounds and alignment validation for the PF path:
both GDMA_SRIOV_REG_CFG_BASE_OFF and the SHM offset read via
(sriov_base_off + GDMA_PF_REG_SHM_OFF) are validated before use.
> 
>       Andrew

Regards
Dipayaan Roy
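
A minimal sketch of the kind of bounds and alignment check discussed
above. shm_offset_is_valid() is a hypothetical helper written for
illustration, not the code from the patch, and it assumes the BAR's
mapped base address is itself at least 4-byte aligned.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when an offset read from a device register is safe to
 * use for 32-bit MMIO reads inside a BAR of bar_size bytes.
 */
static bool shm_offset_is_valid(uint64_t off, uint64_t bar_size)
{
	/* A 32-bit read of device memory needs a 4-byte aligned address. */
	if (off & (sizeof(uint32_t) - 1))
		return false;

	/* The whole 32-bit word must lie inside the BAR. */
	if (bar_size < sizeof(uint32_t) || off > bar_size - sizeof(uint32_t))
		return false;

	return true;
}
```

A caller would reject the device state (for example with -EPROTO, as
the existing doorbell checks do) when the helper returns false, rather
than computing shm_base from the offset.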


* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Andrew Lunn @ 2026-04-23 19:44 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <aepviNMszMBtiB/H@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On Thu, Apr 23, 2026 at 12:14:16PM -0700, Dipayaan Roy wrote:
> On Thu, Apr 23, 2026 at 06:37:04PM +0200, Andrew Lunn wrote:
> > > The root cause is in mana_gd_init_vf_regs(), which computes:
> > > 
> > >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > > 
> > > without validating the offset read from hardware. If the register
> > > returns a garbage value that is neither within bar 0 bounds nor aligned
> > > to the 4-byte granularity, the result is an alignment fault.
> > 
> > Is GDMA_REG_SHM_OFFSET special?
> Hi Andrew,
> GDMA_REG_SHM_OFFSET is not special. It was simply the only register
> read that had no validation at all. The other two registers
> (GDMA_REG_DB_PAGE_SIZE, GDMA_REG_DB_PAGE_OFFSET) already have checks
> in place.

I must be missing something:

grep page_size *

gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
gdma_main.c:	void __iomem *addr = gc->db_page_base + gc->db_page_size * db_index;

So if GDMA_REG_DB_PAGE_SIZE returns garbage, it is at least masked,
but it is still a random number.

mana_gd_ring_doorbell() takes this random number, multiplies it by
db_index, adds gc->db_page_base and then does:

writeq(e.as_uint64, addr);

So you write to a random address. 

I don't see any sanity checks here. Can't you check that db_page_size
is at least one of the expected page sizes?
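
A minimal user-space sketch of the kind of sanity check asked about
above. The helper name, the set of accepted sizes (a power of two of at
least 4 KiB) and the BAR-bounds test are all assumptions made for
illustration; none of them is taken from the mana driver or the MANA
hardware specification.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the mana driver.
 * raw     - value as read from the doorbell page size register
 * bar_len - length of the mapped BAR in bytes
 * db_off  - doorbell page offset inside the BAR
 */
static bool db_page_size_is_sane(uint32_t raw, uint64_t bar_len,
				 uint64_t db_off)
{
	uint32_t size = raw & 0xFFFF;	/* same mask the quoted code applies */

	/* Assumed policy: a power of two of at least 4 KiB. */
	if (size < 4096 || (size & (size - 1)) != 0)
		return false;

	/* At least one doorbell page must fit inside the BAR. */
	if (db_off >= bar_len || bar_len - db_off < size)
		return false;

	return true;
}
```

With a check of this shape, a garbage register value would be rejected
at probe time instead of being used later as a multiplier for the
doorbell address.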

   Andrew

^ permalink raw reply

* RE: [PATCH] tools/hv: fix parse_ip_val_buffer out-of-bounds write
From: Michael Kelley @ 2026-04-23 20:28 UTC (permalink / raw)
  To: unknownbbqrx, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, decui@microsoft.com, longli@microsoft.com
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <c9871f25-9d7e-423d-954b-4080d2484cd8@smtp-relay.sendinblue.com>

From: unknownbbqrx <dev@unknownbbqr.xyz> Sent: Thursday, April 23, 2026 11:07 AM
> 
> 
> parse_ip_val_buffer() validates the parsed token length against out_len,
> but several callers passed MAX_IP_ADDR_SIZE * 2 while the destination
> buffers are much smaller stack arrays (e.g. INET6_ADDRSTRLEN).
> 
> This can lead to out-of-bounds writes via strcpy() when a long token is
> parsed from host-provided IP/subnet strings.
> 
> Use size_t for out_len, switch to bounded copy with memcpy() + explicit
> NUL termination, and pass the actual destination buffer sizes at all
> call sites.
> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>

Linux kernel patches must be signed off by a real person's name,
not an unknown alias. In the kernel source code tree, see
Documentation/process/submitting-patches.rst and specifically
the section entitled "Sign your work - the Developer's Certificate
of Origin".  It specifies that the signoff must be done by "a
known identity (sorry, no anonymous contributions)".

Michael

> ---
>  tools/hv/hv_kvp_daemon.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> index c02f8a341..ecf123bce 100644
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1188,10 +1188,11 @@ static int is_ipv4(char *addr)
>  }
> 
>  static int parse_ip_val_buffer(char *in_buf, int *offset,
> -				char *out_buf, int out_len)
> +				char *out_buf, size_t out_len)
>  {
>  	char *x;
>  	char *start;
> +	size_t copy_len;
> 
>  	/*
>  	 * in_buf has sequence of characters that are separated by
> @@ -1214,8 +1215,10 @@ static int parse_ip_val_buffer(char *in_buf, int *offset,
>  		while (start[i] == ' ')
>  			i++;
> 
> -		if ((x - start) <= out_len) {
> -			strcpy(out_buf, (start + i));
> +		copy_len = x - (start + i);
> +		if (copy_len < out_len) {
> +			memcpy(out_buf, start + i, copy_len);
> +			out_buf[copy_len] = '\0';
>  			*offset += (x - start) + 1;
>  			return 1;
>  		}
> @@ -1249,7 +1252,7 @@ static int process_ip_string(FILE *f, char *ip_string, int type)
>  	memset(addr, 0, sizeof(addr));
> 
>  	while (parse_ip_val_buffer(ip_string, &offset, addr,
> -					(MAX_IP_ADDR_SIZE * 2))) {
> +					sizeof(addr))) {
> 
>  		sub_str[0] = 0;
>  		if (is_ipv4(addr)) {
> @@ -1374,7 +1377,7 @@ static int process_dns_gateway_nm(FILE *f, char *ip_string,
> int type,
>  		memset(addr, 0, sizeof(addr));
> 
>  		if (!parse_ip_val_buffer(ip_string, &ip_offset, addr,
> -					 (MAX_IP_ADDR_SIZE * 2)))
> +					 sizeof(addr)))
>  			break;
> 
>  		ip_ver = ip_version_check(addr);
> @@ -1426,12 +1429,11 @@ static int process_ip_string_nm(FILE *f, char *ip_string,
> char *subnet,
>  	memset(subnet_addr, 0, sizeof(subnet_addr));
> 
>  	while (parse_ip_val_buffer(ip_string, &ip_offset, addr,
> -				   (MAX_IP_ADDR_SIZE * 2)) &&
> +				   sizeof(addr)) &&
>  				   parse_ip_val_buffer(subnet,
> -						       &subnet_offset,
> -						       subnet_addr,
> -						       (MAX_IP_ADDR_SIZE *
> -							2))) {
> +					       &subnet_offset,
> +					       subnet_addr,
> +					       sizeof(subnet_addr))) {
>  		ip_ver = ip_version_check(addr);
>  		if (ip_ver < 0)
>  			continue;
> 
> base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
> prerequisite-patch-id: b61dd51dee390277603975bf729a687113185c3a
> prerequisite-patch-id: df28525061dd528875c7c75880b4684d80f4aa7d
> prerequisite-patch-id: 64c48c6f2222781631304d9d4d7d1c712c002610
> prerequisite-patch-id: 9be258692732026bf560ed9887adbd02a8887263
> --
> 2.53.0
> 
> 
> 
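
For reference, the bounded copy described in the quoted commit message
(check the token length against the real destination size, memcpy(),
then terminate explicitly) can be sketched in isolation as follows.
copy_token_bounded() is a hypothetical stand-in written for this
illustration, not a function in hv_kvp_daemon.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the copy step in parse_ip_val_buffer().
 * Copies tok_len bytes of tok into dst, which holds dst_size bytes,
 * and NUL-terminates the result.
 * Returns 1 on success, 0 if the token plus its NUL does not fit.
 */
static int copy_token_bounded(char *dst, size_t dst_size,
			      const char *tok, size_t tok_len)
{
	if (tok_len >= dst_size)
		return 0;	/* no room for the terminating NUL */

	memcpy(dst, tok, tok_len);
	dst[tok_len] = '\0';
	return 1;
}
```

The strict comparison is what leaves room for the terminator; passing
sizeof() of the actual destination array as dst_size is what keeps the
bound honest.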


^ permalink raw reply

* Re: [PATCH net] net: mana: hardening: Validate SHM offset from BAR0 register to prevent crash due to alignment fault
From: Dipayaan Roy @ 2026-04-24  3:28 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
	john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <7c4dbe89-9b51-45d6-ae89-39d4183e66b1@lunn.ch>

On Thu, Apr 23, 2026 at 09:44:04PM +0200, Andrew Lunn wrote:
> On Thu, Apr 23, 2026 at 12:14:16PM -0700, Dipayaan Roy wrote:
> > On Thu, Apr 23, 2026 at 06:37:04PM +0200, Andrew Lunn wrote:
> > > > The root cause is in mana_gd_init_vf_regs(), which computes:
> > > > 
> > > >   gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
> > > > 
> > > > without validating the offset read from hardware. If the register
> > > > returns a garbage value that is neither within bar 0 bounds nor aligned
> > > > to the 4-byte granularity, thus causing the alignment fault.
> > > 
> > > Is GDMA_REG_SHM_OFFSET special?
> > Hi Andrew,
> > GDMA_REG_SHM_OFFSET is not special. It was simply the only register
> > read that had no validation at all. The other two registers
> > (GDMA_REG_DB_PAGE_SIZE, GDMA_REG_DB_PAGE_OFFSET) already have checks
> > in place.
> 
> I must be missing something:
> 
> grep page_size *
> 
> gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_PF_REG_DB_PAGE_SIZE) & 0xFFFF;
> gdma_main.c:	gc->db_page_size = mana_gd_r32(gc, GDMA_REG_DB_PAGE_SIZE) & 0xFFFF;
> gdma_main.c:	void __iomem *addr = gc->db_page_base + gc->db_page_size * db_index;
> 

Hi Andrew,
There are two upstream commits addressing these. I think you may have
missed them, please take a look:

commit fb4b4a05aeeb8b0f253c5ddce21f4635dadc9550
Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Date:   Wed Mar 25 11:04:17 2026 -0700
 
    net: mana: Use at least SZ_4K in doorbell ID range check

commit 89fe91c65992a37863241e35aec151210efc53ce
Author: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Date:   Fri Mar 6 13:12:06 2026 -0800
 
    net: mana: hardening: Validate doorbell ID from GDMA_REGISTER_DEVICE response

> So if GDMA_REG_DB_PAGE_SIZE returns garbage, it is at least masked,
> but it is still a random number.
> 
> mana_gd_ring_doorbell() takes this random number, multiplies it by
> db_index, adds gc->db_page_base and then does:
> 
> writeq(e.as_uint64, addr);
> 
> So you write to a random address. 
> 
> I don't see any sanity checks here. Cannot you check that db_page_size
> is at least one of the expected page sizes?
As mentioned above, the checks are already present in commit 89fe91c65992a37863241e35aec151210efc53ce.
> 
>    Andrew

Regards
Dipayaan Roy

^ permalink raw reply

* [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-04-24  6:17 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable

In the mana driver, the number of IRQs allocated is capped at
min(num_cpu + 1, queue count). In cases where the IRQ count is greater
than the vcpu count, we want to utilize all the vcpus, irrespective of
their NUMA/core bindings.

This is especially important in environments where the number of vcpus
is so small that the softIRQ handling overhead of two IRQs on the same
vcpu is much higher than it would be if they were spread across sibling
vcpus.

This behaviour is more evident with dynamic IRQ allocation. Since MANA
IRQs are assigned at a later stage compared to static allocation, other
device IRQs may already be affinitized to the vCPUs. As a result, IRQ
weights become imbalanced, causing multiple MANA IRQs to land on the
same vCPU.

In such cases, when many parallel TCP connections are tested, the
throughput drops significantly.

Test envs:
=======================================================
Case 1: without this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch
20480		15.65		7.73
10240		15.63		8.93
8192		15.64		9.69
6144		15.64		13.16
4096		15.69		15.75
2048		15.69		15.83
1024		15.71		15.28

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 098fbda0d128..433c044d53c6 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+static int irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		if (len <= 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
@@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	 * first CPU sibling group since they are already affinitized to HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
+	if (gc->num_msix_usable <= num_online_cpus()) {
 		skip_first_cpu = true;
+		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
+	} else {
+		/*
+		 * In case our IRQs are more than num_online_cpus, we try to
+		 * make sure we are using all vcpus. In such a case NUMA or
+		 * CPU core affinity does not matter.
+		 * Note that in this case the total mana IRQ should always be
+		 * num_online_cpu + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * So, the nvec value in this path should always be equal to
+		 * num_online_cpu
+		 */
+		WARN_ON(nvec > num_online_cpus());
+		err = irq_setup_linear(irqs, nvec);
+	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Dipayaan Roy @ 2026-04-24 12:21 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Shiraz Saleem, Michael Kelley, Long Li, Yury Norov, linux-hyperv,
	linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260424061702.1442618-1-shradhagupta@linux.microsoft.com>

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In mana driver, the number of IRQs allocated are capped by the
> min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in the envs where number of vcpus are so
> few that the softIRQ handling overhead on two IRQs on the same vcpu is
> much more than their overheads if they were spread across sibling vcpus
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
> 
> In such cases when many parallel TCP connections are tested, the
> throughput drops significantly
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
We do not need to call rcu_read_lock here, as the caller of this
function has already acquired cpus_read_lock.
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
len is unsigned here, so <= does not make sense. Please change it to int
or, better, use if (!len).
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
nit: typo: should be num_online_cpus
> +		 */
> +		WARN_ON(nvec > num_online_cpus());
> +		err = irq_setup_linear(irqs, nvec);
> +	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1
> 
Regards
Dipayaan Roy
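
The linear assignment under review can be modelled in user space as
below. This is only a sketch: cpus[] stands in for the online CPU mask,
affinity[] for the result of irq_set_affinity_and_hint(), and
assign_linear() is a hypothetical name. Bounding the loop with an index
comparison sidesteps the unsigned "<= 0" test raised in the review.

```c
#include <stddef.h>

/*
 * User-space model only, not driver code.
 * cpus[]     - stand-in for the online CPUs, in iteration order
 * affinity[] - affinity[i] receives the CPU chosen for IRQ i
 * Returns the number of IRQs that were given a CPU.
 */
static size_t assign_linear(const int *cpus, size_t ncpus,
			    int *affinity, size_t nirqs)
{
	size_t i;

	/* One IRQ per CPU, stopping when either list runs out. */
	for (i = 0; i < ncpus && i < nirqs; i++)
		affinity[i] = cpus[i];

	return i;
}
```

If nirqs exceeds ncpus, the trailing entries are left untouched, which
corresponds to the situation the WARN_ON() in the patch is meant to
flag.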

^ permalink raw reply

* Re: [PATCH] mshv: Fix interrupt state corruption in hv_do_map_pfns error path
From: Anirudh Rayabharam @ 2026-04-24 14:35 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177681692062.637858.4160821495321404639.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>

On Wed, Apr 22, 2026 at 12:15:28AM +0000, Stanislav Kinsburskii wrote:
> Restore interrupt state before breaking out of the loop on error.
> 
> The irq_flags are saved before entering the loop, but the early exit
> path on error fails to restore them. This leaves interrupts in an
> inconsistent state and can lead to lockdep warnings or other
> interrupt-related issues.
> 
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
>  drivers/hv/mshv_root_hv_call.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index 7ed623668c8ec..6381f949d9d91 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -237,8 +237,10 @@ static int hv_do_map_pfns(u64 partition_id, u64 gfn, u64 pfns_count,

Umm... I don't see this function in the hyperv-next tree at all.

Anirudh.

>  			} else {
>  				pfnlist[i] = mmio_spa + done + i;
>  			}
> -		if (ret)
> +		if (ret) {
> +			local_irq_restore(irq_flags);
>  			break;
> +		}
>  
>  		status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
>  					     input_page, NULL);
> 
> 

^ permalink raw reply

* Re: [PATCH V1 03/13] x86/hyperv: add insufficient memory support in irqdomain.c
From: Anirudh Rayabharam @ 2026-04-24 14:55 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-4-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:29PM -0700, Mukesh R wrote:
> Intermittent insufficient memory hypercall failure have been observed in
> the current map device interrupt hypercall. In case of such a failure,
> we must deposit more memory and redo the hypercall. Add support for
> that. Deposit memory needs partition id, make that a parameter to the
> map interrupt function.
> 
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/irqdomain.c | 38 +++++++++++++++++++++++++++++++------
>  1 file changed, 32 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/hyperv/irqdomain.c b/arch/x86/hyperv/irqdomain.c
> index b3ad50a874dc..229f986e08ea 100644
> --- a/arch/x86/hyperv/irqdomain.c
> +++ b/arch/x86/hyperv/irqdomain.c
> @@ -13,8 +13,9 @@
>  #include <linux/irqchip/irq-msi-lib.h>
>  #include <asm/mshyperv.h>
>  
> -static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
> -		int cpu, int vector, struct hv_interrupt_entry *ret_entry)
> +static u64 hv_map_interrupt_hcall(u64 ptid, union hv_device_id hv_devid,
> +				  bool level, int cpu, int vector,
> +				  struct hv_interrupt_entry *ret_entry)
>  {
>  	struct hv_input_map_device_interrupt *input;
>  	struct hv_output_map_device_interrupt *output;
> @@ -30,8 +31,10 @@ static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
>  
>  	intr_desc = &input->interrupt_descriptor;
>  	memset(input, 0, sizeof(*input));
> -	input->partition_id = hv_current_partition_id;
> +
> +	input->partition_id = ptid;
>  	input->device_id = hv_devid.as_uint64;
> +
>  	intr_desc->interrupt_type = HV_X64_INTERRUPT_TYPE_FIXED;
>  	intr_desc->vector_count = 1;
>  	intr_desc->target.vector = vector;
> @@ -64,6 +67,28 @@ static int hv_map_interrupt(union hv_device_id hv_devid, bool level,
>  
>  	local_irq_restore(flags);
>  
> +	return status;
> +}
> +
> +static int hv_map_interrupt(u64 ptid, union hv_device_id device_id, bool level,
> +			    int cpu, int vector,
> +			    struct hv_interrupt_entry *ret_entry)
> +{
> +	u64 status;
> +	int rc, deposit_pgs = 16;		/* don't loop forever */
> +
> +	while (deposit_pgs--) {
> +		status = hv_map_interrupt_hcall(ptid, device_id, level, cpu,
> +						vector, ret_entry);
> +
> +		if (hv_result(status) != HV_STATUS_INSUFFICIENT_MEMORY)
> +			break;
> +
> +		rc = hv_call_deposit_pages(NUMA_NO_NODE, ptid, 1);

This code should use the hv_result_needs_memory() and hv_deposit_memory()
helpers instead.
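
The general shape of such a bounded "call, deposit on insufficient
memory, retry" loop is sketched below as a user-space model. The
signatures of hv_result_needs_memory() and hv_deposit_memory() are not
shown in this thread, so the sketch uses function pointers, a fake
hypervisor and made-up status codes as stand-ins; every name in it is
hypothetical.

```c
#include <stdint.h>

#define STATUS_OK		0u
#define STATUS_NEEDS_MEMORY	1u	/* stand-in for an insufficient-memory status */

typedef uint64_t (*hcall_fn)(void *ctx);	/* stand-in for the hypercall */
typedef int (*deposit_fn)(void *ctx);		/* stand-in for the deposit helper */

/* Fake hypervisor: fails until pages_needed deposits have been made. */
struct fake_hv {
	int pages_needed;
	int deposits;
};

static uint64_t fake_hcall(void *ctx)
{
	struct fake_hv *hv = ctx;

	return hv->pages_needed > 0 ? STATUS_NEEDS_MEMORY : STATUS_OK;
}

static int fake_deposit(void *ctx)
{
	struct fake_hv *hv = ctx;

	hv->pages_needed--;
	hv->deposits++;
	return 0;
}

/*
 * Retry the call at most max_tries times, depositing memory after each
 * insufficient-memory result. Returns a non-zero error if a deposit
 * fails; otherwise returns 0 and leaves the last status in *status_out
 * for the caller to inspect.
 */
static int call_with_deposit(hcall_fn hcall, deposit_fn deposit, void *ctx,
			     int max_tries, uint64_t *status_out)
{
	uint64_t status = STATUS_NEEDS_MEMORY;
	int rc = 0;

	while (max_tries-- > 0) {
		status = hcall(ctx);
		if (status != STATUS_NEEDS_MEMORY)
			break;

		rc = deposit(ctx);
		if (rc)
			break;
	}

	*status_out = status;
	return rc;
}
```

The retry cap plays the same role as the "don't loop forever" counter
in the quoted patch: if the budget runs out, the caller still sees the
insufficient-memory status and can fail cleanly.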

Thanks,
Anirudh


^ permalink raw reply

* Re: [PATCH V1 01/13] iommu/hyperv: rename hyperv-iommu.c to hyperv-irq.c
From: Anirudh Rayabharam @ 2026-04-24 14:58 UTC (permalink / raw)
  To: Mukesh R
  Cc: hpa, robin.murphy, robh, wei.liu, mhklinux, muislam, namjain,
	magnuskulke, anbelski, linux-kernel, linux-hyperv, iommu,
	linux-pci, linux-arch, kys, haiyangz, decui, longli, tglx, mingo,
	bp, dave.hansen, x86, joro, will, lpieralisi, kwilczynski,
	bhelgaas, arnd
In-Reply-To: <20260422023239.1171963-2-mrathor@linux.microsoft.com>

On Tue, Apr 21, 2026 at 07:32:27PM -0700, Mukesh R wrote:
> This file actually implements irq remapping, so rename to more appropriate
> hyperv-irq.c. A new file to implement hyperv iommu will be introduced
> later.  Also, it should not be tied to HYPERV_IOMMU, but to CONFIG_HYPERV
> and IRQ_REMAP. The file already has #ifdef CONFIG_IRQ_REMAP.
> 
> Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
> ---
>  MAINTAINERS                                    | 2 +-
>  drivers/iommu/Makefile                         | 2 +-
>  drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} | 2 +-
>  drivers/iommu/irq_remapping.c                  | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
>  rename drivers/iommu/{hyperv-iommu.c => hyperv-irq.c} (99%)

Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>


^ permalink raw reply

* Re: [PATCH 03/23] tick/nohz: Make nohz_full parameter optional
From: Frederic Weisbecker @ 2026-04-24 15:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Thomas Gleixner, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Paul E. McKenney, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Uladzislau Rezki, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Anna-Maria Behnsen,
	Ingo Molnar, Chen Ridong, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, cgroups,
	linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <3b796360-81e4-4f90-9b19-8a9f21cbac07@redhat.com>

On Tue, Apr 21, 2026 at 10:14:09AM -0400, Waiman Long wrote:
> On 4/21/26 4:32 AM, Thomas Gleixner wrote:
> > On Mon, Apr 20 2026 at 23:03, Waiman Long wrote:
> > > To provide nohz_full tick support, there is a set of tick dependency
> > > masks that need to be evaluated on every IRQ and context switch.
> > s/IRQ/interrupt/
> > 
> > This is a changelog and not a SMS service.
> > > Switching on nohz_full tick support at runtime will be problematic
> > > as some of the tick dependency masks may not be properly set causing
> > > problem down the road.
> > That's useless blurb with zero content.
> > 
> > > Allow nohz_full boot option to be specified without any
> > > parameter to force enable nohz_full tick support without any
> > > CPU in the tick_nohz_full_mask yet. The context_tracking_key and
> > > tick_nohz_full_running flag will be enabled in this case to make
> > > tick_nohz_full_enabled() return true.
> > I kinda can crystal-ball what you are trying to say here, but that does
> > not make it qualified as a proper change log.
> > 
> > > There is still a small performance overhead by force enable nohz_full
> > > this way. So it should only be used if there is a chance that some
> > > CPUs may become isolated later via the cpuset isolated partition
> > > functionality and better CPU isolation closed to nohz_full is desired.
> > Why has this key to be enabled on boot if there are no CPUs in the
> > isolated mask?
> > 
> > If you want to manage this dynamically at runtime then enable the key
> > once CPUs are isolated. Yes, it's more work, but that avoids the "should
> > only be used" nonsense and makes this more robust down the road.
> 
> OK, I will try to make it fully dynamic. Of course, it will be more work.

Since the target CPUs will be offline, it should be fine to just enable/disable
the static key and masks at runtime. The only issue I see right now is posix
CPU timers, because the tick dependency is per task/process group. And those
tasks could be migrated to nohz_full CPUs by careless users (even though that's
nonsense) once the targets come online. So the per task/process tick dependency
must be set up unconditionally. I don't think this should bring much noticeable
overhead though.

Oh and the other way to go, that is removing TICK_DEP_BIT_POSIX_TIMER and
forbidding posix cpu timers from running on nohz_full CPUs, would be even more
painful to implement, so I don't have a better idea.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs

^ permalink raw reply

* Re: [PATCH 3/8] firmware: sysfb: Make CONFIG_SYSFB a user-selectable option
From: Javier Martinez Canillas @ 2026-04-24 16:24 UTC (permalink / raw)
  To: Thomas Zimmermann, Arnd Bergmann, Ard Biesheuvel,
	Ilias Apalodimas, Huacai Chen, WANG Xuerui, Maarten Lankhorst,
	Maxime Ripard, Dave Airlie, Simona Vetter, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, longli, Helge Deller
  Cc: linux-arm-kernel, loongarch, linux-efi, linux-riscv, dri-devel,
	linux-hyperv, linux-fbdev
In-Reply-To: <0156562f-5fcf-47ce-8fea-03345f2c3fe6@suse.de>

Thomas Zimmermann <tzimmermann@suse.de> writes:

Hello,

[...]

>>>>>> On Thu, Apr 2, 2026, at 11:09, Thomas Zimmermann wrote:
>>>>>> I don't really like this part of the series and would prefer
>>>>>> to keep CONFIG_SYSFB hidden as much as possible as an x86

I tend to agree with Arnd here; I'm also not seeing much value in
making this symbol user-selectable. For now I would just keep it hidden.

[...]

>> Yes, I saw that as well and don't have an immediate idea for how
>> to best do it. I saw that you already abstracted the access to
>> the screen_info members in drm_sysfb_screen_info.c, which I think
>> is a step in that direction.
>>
>> I also noticed that efidrm is mostly a subset of vesadrm, so
>> in theory they could be merged back into an x86 drm driver
>> along with the drm_sysfb_screen_info helpers, and have a non-x86
>> driver that constructs a drm_sysfb_device directly from the
>> EFI structures.
>
> I would not want to have a unifed driver for all-things-screen_info. The 
> code that can easily be shared is already in the sysfb helpers. But I 
> don't mind adding a separate driver for EFI's Graphics Output Protocol.

I agree. It is much more maintainable to have dedicated DRM drivers that
use shared helpers than to attempt a single driver for different platforms.

As Thomas explained, the maintenance effort is small on the DRM side and he has
done a lot of work to split simpledrm into efidrm, vesadrm and ofdrm.

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat


^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: David Wei @ 2026-04-24 16:24 UTC (permalink / raw)
  To: Dipayaan Roy, Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <aeoVC27mIzoKytqA@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On 2026-04-23 05:48, Dipayaan Roy wrote:
> On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
>> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
>>> I still see roughly a 5% overhead from the atomic refcount operation
>>> itself, but on that platform there is no throughput drop when using
>>> page fragments versus full-page mode.
>>
>> That seems to contradict your claim that it's a problem with a specific
>> platform.. Since we're in the merge window I asked David Wei to try to
>> experiment with disabling page fragmentation on the ARM64 platforms we
>> have at Meta. If it repros we should use the generic rx-buf-len
>> ringparam because more NICs may want to implement this strategy.
> 
> Hi Jakub,
> 
> Thanks. I think I was not precise enough in my previous reply.
> 
> What I meant is that the atomic refcount cost itself does not appear to
> be unique to the affected platform. I see a similar ~5% overhead on
> another ARM64 platform (different vendor) as well. However, on that platform
> there is no throughput delta between fragment mode and full-page mode; both reach
> line rate.
> 
> On the affected platform, fragment mode shows an additional ~15%
> throughput drop versus full-page mode. So the current data suggests that
> the atomic overhead is common, but the throughput regression is not
> explained by that overhead alone and likely depends on an additional
> platform-specific factor.
> 
> Separately, the hardware team collected PCIe traces on the affected
> platform and reported stalls in the fragment-mode case that are not seen
> in full-page mode. They are still investigating the root cause, but
> their current hypothesis is that this is related to that platform’s
> PCIe/root-port microarchitecture rather than to page_pool refcounting
> alone.
> 
> That said, I agree the right direction depends on whether this
> reproduces on other ARM64 platforms. If David is able to reproduce the
> same behavior, then using the generic rx-buf-len ringparam sounds like
> the better direction.
> 
> Please let me know what David finds, and I can rework the patch
> accordingly.

Hi Dipayaan. Can you please share more details on your testing setup?

* What are you using as the test client/server? iperf3 or something
   else?
* What do you mean specifically by "5% overhead from the atomic refcount
   operation"? Some specific function?
* What are you using to measure? perf?
* How many queues, what is the napi softirq affinity?
* How many NUMA nodes? Does the problem only appear when crossing?

Thanks,
David

> 
> 
> Regards
> Dipayaan Roy

^ permalink raw reply

* Re: [PATCH 19/23] cgroup/cpuset: Improve check for calling housekeeping_update()
From: Waiman Long @ 2026-04-24 18:32 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Catalin Marinas, Will Deacon,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Guenter Roeck, Frederic Weisbecker, Paul E. McKenney,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Zqiang, Anna-Maria Behnsen, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: cgroups, linux-doc, linux-kernel, linux-arm-kernel, linux-hyperv,
	linux-hwmon, rcu, netdev, linux-kselftest, Costa Shulyupin,
	Qiliang Yuan
In-Reply-To: <e8824498-f8ec-496a-a21c-d1dc594f4c8e@huaweicloud.com>

On 4/22/26 9:10 PM, Chen Ridong wrote:
>
> On 2026/4/21 11:03, Waiman Long wrote:
>> By making sure that isolated_hk_cpus matches isolated_cpus at boot time,
>> we can more accurately determine if calling housekeeping_update()
>> is needed by comparing if the two cpumasks are equal. The
>> update_housekeeping flag still has a use in cpuset_handle_hotplug()
>> to determine if a work function should be queued to invoke
>> cpuset_update_sd_hk_unlock() as it is not supposed to look at
>> isolated_hk_cpus without holding cpuset_top_mutex.
>>
> Currently, isolated_hk_cpus is updated within the cpuset_mutex critical section
> (before mutex_unlock(&cpuset_mutex)) in cpuset_update_sd_hk_unlock. Therefore, I
> think update_housekeeping can now be removed.

That is true. I will remove it in the next version.

Thanks,
Longman

>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 36 ++++++++++++++++++++----------------
>>   1 file changed, 20 insertions(+), 16 deletions(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index a4eccb0ec0d1..1b0c50b46a49 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1339,26 +1339,29 @@ static void cpuset_update_sd_hk_unlock(void)
>>   	__releases(&cpuset_mutex)
>>   	__releases(&cpuset_top_mutex)
>>   {
>> +	update_housekeeping = false;
>> +
>>   	/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
>>   	if (force_sd_rebuild)
>>   		rebuild_sched_domains_locked();
>>   
>> -	if (update_housekeeping) {
>> -		update_housekeeping = false;
>> -		cpumask_copy(isolated_hk_cpus, isolated_cpus);
>> -
>> -		/*
>> -		 * housekeeping_update() is now called without holding
>> -		 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
>> -		 * is still being held for mutual exclusion.
>> -		 */
>> -		mutex_unlock(&cpuset_mutex);
>> -		cpus_read_unlock();
>> -		WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
>> -		mutex_unlock(&cpuset_top_mutex);
>> -	} else {
>> +	if (cpumask_equal(isolated_hk_cpus, isolated_cpus)) {
>> +		/* No housekeeping cpumask update needed */
>>   		cpuset_full_unlock();
>> +		return;
>>   	}
>> +
>> +	cpumask_copy(isolated_hk_cpus, isolated_cpus);
>> +
>> +	/*
>> +	 * housekeeping_update() is now called without holding
>> +	 * cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
>> +	 * is still being held for mutual exclusion.
>> +	 */
>> +	mutex_unlock(&cpuset_mutex);
>> +	cpus_read_unlock();
>> +	WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus, BIT(HK_TYPE_DOMAIN)));
>> +	mutex_unlock(&cpuset_top_mutex);
>>   }
>>   
>>   /*
>> @@ -3692,10 +3695,11 @@ int __init cpuset_init(void)
>>   
>>   	BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
>>   
>> -	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT))
>> +	if (housekeeping_enabled(HK_TYPE_DOMAIN_BOOT)) {
>>   		cpumask_andnot(isolated_cpus, cpu_possible_mask,
>>   			       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
>> -
>> +		cpumask_copy(isolated_hk_cpus, isolated_cpus);
>> +	}
>>   	return 0;
>>   }
>>   
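The comparison-based check in the patch above can be illustrated with a
small userspace sketch. This is not kernel code: a 64-bit word stands in
for each cpumask, and sync_housekeeping() and hk_updates are hypothetical
names used only for illustration. The real function also drops
cpuset_mutex and cpus_read_lock before calling housekeeping_update(),
which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins: one 64-bit word replaces each cpumask. */
static uint64_t isolated_cpus;     /* desired set of isolated CPUs */
static uint64_t isolated_hk_cpus;  /* set last handed to housekeeping */
static unsigned int hk_updates;    /* counts simulated housekeeping_update() calls */

/* Returns true if the housekeeping copy had to be refreshed. */
static bool sync_housekeeping(void)
{
	/* Mirrors the cpumask_equal() test: equal masks need no update. */
	if (isolated_hk_cpus == isolated_cpus)
		return false;

	/* Mirrors cpumask_copy() followed by housekeeping_update(). */
	isolated_hk_cpus = isolated_cpus;
	hk_updates++;

	/* Postcondition: both masks agree once the update has run. */
	assert(isolated_hk_cpus == isolated_cpus);
	return true;
}
```

Because the decision is derived from the two masks themselves, a second
call with no intervening change is a no-op. That property is what would
make a separate "dirty" flag redundant on this path.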


^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: David Wei @ 2026-04-24 20:05 UTC (permalink / raw)
  To: Dipayaan Roy, Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <aeoVC27mIzoKytqA@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On 2026-04-23 05:48, Dipayaan Roy wrote:
> On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
>> On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
>>> I still see roughly a 5% overhead from the atomic refcount operation
>>> itself, but on that platform there is no throughput drop when using
>>> page fragments versus full-page mode.
>>
>> That seems to contradict your claim that it's a problem with a specific
>> platform.. Since we're in the merge window I asked David Wei to try to
>> experiment with disabling page fragmentation on the ARM64 platforms we
>> have at Meta. If it repros we should use the generic rx-buf-len
>> ringparam because more NICs may want to implement this strategy.
> 
> Hi Jakub,
> 
> Thanks. I think I was not precise enough in my previous reply.
> 
> What I meant is that the atomic refcount cost itself does not appear to
> be unique to the affected platform. I see a similar ~5% overhead on
> another ARM64 platform (different vendor) as well. However, on that platform
> there is no throughput delta between fragment mode and full-page mode; both reach
> line rate.
> 
> On the affected platform, fragment mode shows an additional ~15%
> throughput drop versus full-page mode. So the current data suggests that
> the atomic overhead is common, but the throughput regression is not
> explained by that overhead alone and likely depends on an additional
> platform-specific factor.
> 
> Separately, the hardware team collected PCIe traces on the affected
> platform and reported stalls in the fragment-mode case that are not seen
> in full-page mode. They are still investigating the root cause, but
> their current hypothesis is that this is related to that platform’s
> PCIe/root-port microarchitecture rather than to page_pool refcounting
> alone.
> 
> That said, I agree the right direction depends on whether this
> reproduces on other ARM64 platforms. If David is able to reproduce the
> same behavior, then using the generic rx-buf-len ringparam sounds like
> the better direction.
> 
> Please let me know what David finds, and I can rework the patch
> accordingly.

I ran a test on Grace, 4 KB pages, 72 cores, 1 NUMA node.

Broadcom NIC, bnxt driver, 50 Gbps bandwidth. Hacked it up to either
give me 1 or 2 frags per page. No agg ring, no HDS, no HW GRO.

Use 1 combined queue only for the server. Affinitized its net rx softirq
to run on core 4.

Ran iperf3 server, taskset onto cpu cores 32-47. The iperf3 client is
running on a host w/ same hw in the same region. Using 32 queues, no
softirq affinities. The idea is to hammer page->pp_ref_count from
different cores.

* 1 frag/page  -> 32.3 Gbps
* 2 frags/page -> 36.0 Gbps

Comparing perf, for 2 frags/page the cost of skb_release_data() hitting
pp_ref_count goes up, as expected. Is this what you see? When you say
there's a +5% overhead, what function?

Overall tput is higher with multiple frags. That's to be expected w/
page pool.

There are some 200 Gbps NICs but they're mlx5 so I'd have to redo the
driver hack. Are you going to re-implement this change with rx-buf-len
instead of a private flag? If so, I won't spend more time running this
test.

> 
> 
> Regards
> Dipayaan Roy

^ permalink raw reply

* Re: [PATCH net] net: mana: Optimize irq affinity for low vcpu configs
From: Yury Norov @ 2026-04-24 21:25 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <20260424061702.1442618-1-shradhagupta@linux.microsoft.com>

On Thu, Apr 23, 2026 at 11:17:00PM -0700, Shradha Gupta wrote:
> In the mana driver, the number of IRQs allocated is capped by
> min(num_cpu + 1, queue count). In cases where the IRQ count is greater
> than the vcpu count, we want to utilize all the vcpus, irrespective of
> their NUMA/core bindings.
> 
> This is important, especially in environments where the number of vcpus
> is so small that the softIRQ handling overhead of two IRQs on the same
> vcpu is much higher than it would be if they were spread across sibling
> vcpus.
> 
> This behaviour is more evident with dynamic IRQ allocation. Since MANA
> IRQs are assigned at a later stage compared to static allocation, other
> device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> weights become imbalanced, causing multiple MANA IRQs to land on the
> same vCPU.
> 
> In such cases, when many parallel TCP connections are tested, the
> throughput drops significantly.
> 
> Test envs:
> =======================================================
> Case 1: without this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
> 	TYPE		effective vCPU aff
> =======================================================
> IRQ0:	HWC		0
> IRQ1:	mana_q1		0
> IRQ2:	mana_q2		2
> IRQ3:	mana_q3		0
> IRQ4:	mana_q4		3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU		0	1	2	3
> =======================================================
> pass 1:		38.85	0.03	24.89	24.65
> pass 2:		39.15	0.03	24.57	25.28
> pass 3:		40.36	0.03	23.20	23.17
> 
> =======================================================
> Case 2: with this patch
> =======================================================
> 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> 
>         TYPE            effective vCPU aff
> =======================================================
> IRQ0:   HWC             0
> IRQ1:   mana_q1         0
> IRQ2:   mana_q2         1
> IRQ3:   mana_q3         2
> IRQ4:   mana_q4         3
> 
> %soft on each vCPU(mpstat -P ALL 1) on receiver
> vCPU            0       1       2       3
> =======================================================
> pass 1:         15.42	15.85	14.99	14.51
> pass 2:         15.53	15.94	15.81	15.93
> pass 3:         16.41	16.35	16.40	16.36
> 
> =======================================================
> Throughput Impact(in Gbps, same env)
> =======================================================
> TCP conn	with patch	w/o patch
> 20480		15.65		7.73
> 10240		15.63		8.93
> 8192		15.64		9.69
> 6144		15.64		13.16
> 4096		15.69		15.75
> 2048		15.69		15.83
> 1024		15.71		15.28
> 
> Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> Cc: stable@vger.kernel.org
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 35 +++++++++++++++++--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 098fbda0d128..433c044d53c6 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -1672,6 +1672,23 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
>  	return 0;
>  }
>  
> +static int irq_setup_linear(unsigned int *irqs, unsigned int len)
> +{
> +	int cpu;
> +
> +	rcu_read_lock();
> +	for_each_online_cpu(cpu) {
> +		if (len <= 0)
> +			break;
> +
> +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> +		len--;
> +	}
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
>  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  {
>  	struct gdma_context *gc = pci_get_drvdata(pdev);
> @@ -1722,10 +1739,24 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
>  	 * first CPU sibling group since they are already affinitized to HWC IRQ
>  	 */
>  	cpus_read_lock();
> -	if (gc->num_msix_usable <= num_online_cpus())
> +	if (gc->num_msix_usable <= num_online_cpus()) {
>  		skip_first_cpu = true;
> +		err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);

Then you don't need the 'skip_first_cpu' variable.

> +	} else {
> +		/*
> +		 * In case our IRQs are more than num_online_cpus, we try to
> +		 * make sure we are using all vcpus. In such a case NUMA or
> +		 * CPU core affinity does not matter.
> +		 * Note that in this case the total mana IRQ should always be
> +		 * num_online_cpu + 1. The first HWC IRQ is already handled
> +		 * in HWC setup calls
> +		 * So, the nvec value in this path should always be equal to
> +		 * num_online_cpu
> +		 */
> +		WARN_ON(nvec > num_online_cpus());

That sounds weird. If you don't support more IRQs than CPUs, and want to
warn about it, you'd do that earlier in the function, and align the other
logic accordingly. For example:

        if (WARN_ON(nvec > num_online_cpus()))
                nvec = num_online_cpus();

        irqs = kmalloc_objs(int, nvec);
        if (!irqs)
                return -ENOMEM;

        ...

That way you'll decrease pressure on the allocator.

What would happen with those IRQs beyond num_online_cpus()? Can you explain
it in the comment? I'm not an expert in your driver, but usually if you pass
a vector to a function, and the function is able to handle only a part of it,
it returns the number of processed elements.
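
The "return the number of processed elements" convention can be sketched
in userspace C as below. This is only an illustration, not driver code:
assign_linear(), set_affinity() and the affinity[] table are hypothetical
stand-ins for irq_setup_linear() and irq_set_affinity_and_hint(), and
CPUs are assumed to be numbered 0..ncpus-1.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IRQ 16

/* affinity[irq] = cpu; filled in by the stub below (hypothetical). */
static int affinity[MAX_IRQ];

/* Stub standing in for irq_set_affinity_and_hint(). */
static void set_affinity(unsigned int irq, int cpu)
{
	affinity[irq] = cpu;
}

/*
 * Pin at most one IRQ per CPU, in order. Returns how many entries of
 * irqs[] were actually handled, so the caller can detect leftovers.
 */
static unsigned int assign_linear(const unsigned int *irqs, unsigned int len,
				  unsigned int ncpus)
{
	unsigned int done = 0;

	assert(irqs != NULL);

	while (done < len && done < ncpus) {
		assert(irqs[done] < MAX_IRQ);
		set_affinity(irqs[done], (int)done);
		done++;
	}

	return done;
}
```

With 5 IRQs and 4 CPUs this returns 4, which makes it explicit to the
caller that one IRQ was left without an assignment instead of silently
dropping it.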

Thanks,
Yury

> +		err = irq_setup_linear(irqs, nvec);
> +	}
>  
> -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
>  	if (err) {
>  		cpus_read_unlock();
>  		goto free_irq;
> 
> base-commit: e728258debd553c95d2e70f9cd97c9fde27c7130
> -- 
> 2.34.1

^ permalink raw reply
