Netdev List

* [PATCH net v4 0/2] net: phy: sfp/mdio-i2c: defer RollBall probe + fix mii_bus leak
From: Petr Wozniak @ 2026-06-24  8:48 UTC (permalink / raw)
  To: Russell King, Andrew Lunn, Heiner Kallweit
  Cc: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni,
	netdev, linux-kernel, linux-phy, Maxime Chevallier, Bjorn Mork,
	Aleksander Bajkowski, Marek Behun, Petr Wozniak

This series resends the RollBall bridge probe deferral (a fix for the
regression in commit 8fe125892f40) and adds a related mii_bus leak fix.

Patch 1 fixes a pre-existing mii_bus leak in sfp_i2c_mdiobus_destroy()
that has been present since the helper was introduced in 2022. Patch 2
introduces a new -ENODEV path that destroys the MDIO bus via
sfp_i2c_mdiobus_destroy(), so patch 1 is a prerequisite to avoid leaking
the bus on that path.

v4:
 - Retargeted net-next -> net: both patches carry Fixes: tags and
   8fe125892f40 is now in mainline.
 - Patch 1: added Reviewed-by: Maxime Chevallier.
 - Patch 2: reworked post-probe error handling to drop an over-80-column
   line and moved two block comment terminators to their own line
   (checkpatch); set err = 0 after switching to MDIO_I2C_NONE so the SFP
   state machine does not schedule a redundant 1 s retry. No functional
   change to the probe logic.
v3:
 - Resend: v2 defer patch was corrupted in transit and failed to apply
   (netdev/apply); regenerated against current net-next.
 - Fixed block comment style flagged by checkpatch. No functional change.
 - Added patch 1/2 (sfp: free mii_bus in sfp_i2c_mdiobus_destroy).
v2 (defer):
 - Generalized scope: regression affects boot-inserted and hotplugged
   modules where bridge init exceeds 200 ms; Aleksander Bajkowski
   confirmed FLYPRO SFP-10GT-CS-30M / AQR113C broken when hotplugged.
 - Corrected state machine description (probe runs in SFP_S_INIT after
   SFP_S_WAIT) - Jan Hoffmann.
 - No code changes from v1.
v1: initial submission.

Petr Wozniak (2):
  net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
  net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery

 drivers/net/mdio/mdio-i2c.c   | 15 +++++++++------
 drivers/net/phy/sfp.c         | 23 +++++++++++++++--------
 include/linux/mdio/mdio-i2c.h |  1 +
 3 files changed, 25 insertions(+), 14 deletions(-)

-- 
2.51.0



* [PATCH 6/7] ARM: dts: rockchip: Add RV1126 I2C5
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He
In-Reply-To: <20260624-rv1126-alientek-dlrv1126-v1-0-5aef608a3f64@gmail.com>

The controller is present in the SoC and can be used by boards for
external peripherals, such as an RTC on the Alientek DLRV1126 carrier
board.

Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
 arch/arm/boot/dts/rockchip/rv1126-pinctrl.dtsi | 10 ++++++++++
 arch/arm/boot/dts/rockchip/rv1126.dtsi         | 15 +++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/arm/boot/dts/rockchip/rv1126-pinctrl.dtsi b/arch/arm/boot/dts/rockchip/rv1126-pinctrl.dtsi
index 35ef6732281f..1d883b80aed4 100644
--- a/arch/arm/boot/dts/rockchip/rv1126-pinctrl.dtsi
+++ b/arch/arm/boot/dts/rockchip/rv1126-pinctrl.dtsi
@@ -123,6 +123,16 @@ i2c3m2_xfer: i2c3m2-xfer {
 				<1 RK_PD7 3 &pcfg_pull_none>;
 		};
 	};
+	i2c5 {
+		/omit-if-no-ref/
+		i2c5m0_xfer: i2c5m0-xfer {
+			rockchip,pins =
+				/* i2c5_scl_m0 */
+				<2 RK_PA5 7 &pcfg_pull_none_drv_level_0_smt>,
+				/* i2c5_sda_m0 */
+				<2 RK_PB3 7 &pcfg_pull_none_drv_level_0_smt>;
+		};
+	};
 	i2s0 {
 		i2s0m0_lrck_tx: i2s0m0-lrck-tx {
 			rockchip,pins =
diff --git a/arch/arm/boot/dts/rockchip/rv1126.dtsi b/arch/arm/boot/dts/rockchip/rv1126.dtsi
index 5b1ee06dc035..483576de841e 100644
--- a/arch/arm/boot/dts/rockchip/rv1126.dtsi
+++ b/arch/arm/boot/dts/rockchip/rv1126.dtsi
@@ -23,6 +23,7 @@ aliases {
 		i2c0 = &i2c0;
 		i2c2 = &i2c2;
 		i2c3 = &i2c3;
+		i2c5 = &i2c5;
 		serial0 = &uart0;
 		serial1 = &uart1;
 		serial2 = &uart2;
@@ -400,6 +401,20 @@ i2c3: i2c@ff520000 {
 		status = "disabled";
 	};
 
+	i2c5: i2c@ff540000 {
+		compatible = "rockchip,rv1126-i2c", "rockchip,rk3399-i2c";
+		reg = <0xff540000 0x1000>;
+		interrupts = <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>;
+		clocks = <&cru CLK_I2C5>, <&cru PCLK_I2C5>;
+		clock-names = "i2c", "pclk";
+		pinctrl-names = "default";
+		pinctrl-0 = <&i2c5m0_xfer>;
+		rockchip,grf = <&pmugrf>;
+		#address-cells = <1>;
+		#size-cells = <0>;
+		status = "disabled";
+	};
+
 	pwm8: pwm@ff550000 {
 		compatible = "rockchip,rv1126-pwm", "rockchip,rk3328-pwm";
 		reg = <0xff550000 0x10>;

-- 
2.54.0



* [PATCH 5/7] ARM: dts: rockchip: Add RV1126 GMAC refout clock
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He
In-Reply-To: <20260624-rv1126-alientek-dlrv1126-v1-0-5aef608a3f64@gmail.com>

This clock can be routed to an external Ethernet PHY as its reference
clock. Boards using this clock need the clock to be described so the
dwmac-rk driver can acquire and keep it enabled.

Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
 arch/arm/boot/dts/rockchip/rv1126.dtsi | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/rockchip/rv1126.dtsi b/arch/arm/boot/dts/rockchip/rv1126.dtsi
index d6e8b63daa42..5b1ee06dc035 100644
--- a/arch/arm/boot/dts/rockchip/rv1126.dtsi
+++ b/arch/arm/boot/dts/rockchip/rv1126.dtsi
@@ -624,10 +624,11 @@ gmac: ethernet@ffc40000 {
 		rockchip,grf = <&grf>;
 		clocks = <&cru CLK_GMAC_SRC>, <&cru CLK_GMAC_TX_RX>,
 			 <&cru CLK_GMAC_TX_RX>, <&cru CLK_GMAC_REF>,
+			 <&cru CLK_GMAC_ETHERNET_OUT>,
 			 <&cru ACLK_GMAC>, <&cru PCLK_GMAC>,
 			 <&cru CLK_GMAC_TX_RX>, <&cru CLK_GMAC_PTPREF>;
 		clock-names = "stmmaceth", "mac_clk_rx",
-			      "mac_clk_tx", "clk_mac_ref",
+			      "mac_clk_tx", "clk_mac_ref", "clk_mac_refout",
 			      "aclk_mac", "pclk_mac",
 			      "clk_mac_speed", "ptp_ref";
 		resets = <&cru SRST_GMAC_A>;

-- 
2.54.0



* [PATCH 4/7] net: stmmac: dwmac-rk: Enable refout clock for RGMII
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He
In-Reply-To: <20260624-rv1126-alientek-dlrv1126-v1-0-5aef608a3f64@gmail.com>

Some Rockchip GMAC integrations use clk_mac_refout as an external PHY
reference clock even when the MAC is configured for RGMII.

RV1126 boards can route CLK_GMAC_ETHERNET_OUT to the external PHY as a
25 MHz reference clock. If the driver does not acquire and enable this
clock in RGMII mode, the common clock framework may disable it as unused
and the PHY can lose its reference clock.

Enable the refout clock handling for RGMII in addition to RMII.

Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
index 8d7042e68926..f6fdc0c5b475 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
@@ -1112,7 +1112,8 @@ static int rk_gmac_clk_init(struct plat_stmmacenet_data *plat)
 	bsp_priv->clk_enabled = false;
 
 	bsp_priv->num_clks = ARRAY_SIZE(rk_clocks);
-	if (phy_iface == PHY_INTERFACE_MODE_RMII)
+	if (phy_iface == PHY_INTERFACE_MODE_RMII ||
+	    phy_iface == PHY_INTERFACE_MODE_RGMII)
 		bsp_priv->num_clks += ARRAY_SIZE(rk_rmii_clocks);
 
 	bsp_priv->clks = devm_kcalloc(dev, bsp_priv->num_clks,
@@ -1123,7 +1124,8 @@ static int rk_gmac_clk_init(struct plat_stmmacenet_data *plat)
 	for (i = 0; i < ARRAY_SIZE(rk_clocks); i++)
 		bsp_priv->clks[i].id = rk_clocks[i];
 
-	if (phy_iface == PHY_INTERFACE_MODE_RMII) {
+	if (phy_iface == PHY_INTERFACE_MODE_RMII ||
+	    phy_iface == PHY_INTERFACE_MODE_RGMII) {
 		for (j = 0; j < ARRAY_SIZE(rk_rmii_clocks); j++)
 			bsp_priv->clks[i++].id = rk_rmii_clocks[j];
 	}

-- 
2.54.0



* [PATCH 3/7] dt-bindings: net: rockchip-dwmac: Allow 9 clocks
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He
In-Reply-To: <20260624-rv1126-alientek-dlrv1126-v1-0-5aef608a3f64@gmail.com>

RV1126 has a separate GMAC Ethernet output clock used as the external
PHY reference clock. This clock is described in addition to the existing
GMAC clocks.

Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
 Documentation/devicetree/bindings/net/rockchip-dwmac.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml
index 80c252845349..86a7e83675ae 100644
--- a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml
+++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml
@@ -71,7 +71,7 @@ properties:
 
   clocks:
     minItems: 4
-    maxItems: 8
+    maxItems: 9
 
   clock-names:
     contains:

-- 
2.54.0



* Re: [syzbot ci] Re: nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
From: Sam P @ 2026-06-24  8:46 UTC (permalink / raw)
  To: syzbot ci, davem, david, edumazet, horms, kuba, linux-kernel,
	netdev, oe-linux-nfc, pabeni, stable
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <6a3b838b.c358b6d0.89040.0007.GAE@google.com>

On 24/06/2026 09:13, syzbot ci wrote:
> syzbot ci has tested the following series
> 
> [v1] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
> https://lore.kernel.org/all/20260623222402.175798-1-sam@bynar.io
> * [PATCH net] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
> 
> and found the following issue:
> UBSAN: array-index-out-of-bounds in nci_init_complete_req
> 
> Full report is available here:
> https://ci.syzbot.org/series/2a9a8657-37a3-4dce-8cb5-2035027791dd

Oops, it looks like this patch did introduce a regression due to bad
check ordering. I have a v2 prepared and tested against the syzbot repro
and the NCI selftest; I will submit it once the ~24h patch resend period
is up.

Thanks,
Sam



* [PATCH 2/7] dt-bindings: arm: rockchip: Add Alientek DLRV1126
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He
In-Reply-To: <20260624-rv1126-alientek-dlrv1126-v1-0-5aef608a3f64@gmail.com>

The board consists of a DLRV1126 carrier board and a CLRV1126F core
module based on the Rockchip RV1126 SoC.

Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
 Documentation/devicetree/bindings/arm/rockchip.yaml | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/devicetree/bindings/arm/rockchip.yaml b/Documentation/devicetree/bindings/arm/rockchip.yaml
index 1a9dde18626d..9058f2a461d5 100644
--- a/Documentation/devicetree/bindings/arm/rockchip.yaml
+++ b/Documentation/devicetree/bindings/arm/rockchip.yaml
@@ -162,6 +162,13 @@ properties:
           - const: coolpi,pi-4b
           - const: rockchip,rk3588s
 
+      - description: Alientek CLRV1126F SoM based boards
+        items:
+          - enum:
+              - alientek,dlrv1126
+          - const: alientek,clrv1126f
+          - const: rockchip,rv1126
+
       - description: Edgeble Neural Compute Module 2(Neu2) SoM based boards
         items:
           - const: edgeble,neural-compute-module-2-io   # Edgeble Neural Compute Module 2 IO Board

-- 
2.54.0



* [PATCH 1/7] dt-bindings: vendor-prefixes: add alientek
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He
In-Reply-To: <20260624-rv1126-alientek-dlrv1126-v1-0-5aef608a3f64@gmail.com>

Add a vendor prefix for Alientek, a board and module vendor used by the
ATK-DLRV1126 board.

Link: https://en.alientek.com
Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
 Documentation/devicetree/bindings/vendor-prefixes.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml
index 28784d66ae7b..a23508a61373 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.yaml
+++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml
@@ -88,6 +88,8 @@ patternProperties:
     description: ALFA Network Inc.
   "^algoltek,.*":
     description: AlgolTek, Inc.
+  "^alientek,.*":
+    description: Guangzhou Xingyi Intelligent Technology Co., Ltd.
   "^allegro,.*":
     description: Allegro DVT
   "^allegromicro,.*":

-- 
2.54.0



* [PATCH 0/7] ARM: rockchip: rv1126: Add support for Alientek ATK-DLRV1126
From: Yanan He @ 2026-06-24  8:44 UTC (permalink / raw)
  To: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiko Stuebner,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Wu, Maxime Coquelin, Alexandre Torgue
  Cc: devicetree, linux-kernel, linux-arm-kernel, linux-rockchip,
	netdev, linux-stm32, Yanan He

The ATK-DLRV1126 board consists of a CLRV1126F core module and a
DLRV1126 carrier board. The core module contains the Rockchip RV1126
SoC, eMMC and RK809 PMIC. The carrier board provides Gigabit Ethernet,
SD card, AP6212 WiFi and Bluetooth, PCF8563 RTC, ADC keys, GPIO LEDs and
audio connectors.

This series adds the Alientek vendor prefix and board compatible, updates
the Rockchip DWMAC binding and driver for the RV1126 GMAC reference
output clock, adds missing RV1126 SoC description pieces, and finally
adds the CLRV1126F core module and DLRV1126 carrier board device trees.

The board was tested with Ethernet/NFS boot, eMMC, SD card, SDIO WiFi
enumeration, Bluetooth LE scanning, RTC, ADC keys, GPIO LEDs and RK809
audio card registration.

Signed-off-by: Yanan He <grumpycat921013@gmail.com>
---
Yanan He (7):
      dt-bindings: vendor-prefixes: add alientek
      dt-bindings: arm: rockchip: Add Alientek DLRV1126
      dt-bindings: net: rockchip-dwmac: Allow 9 clocks
      net: stmmac: dwmac-rk: Enable refout clock for RGMII
      ARM: dts: rockchip: Add RV1126 GMAC refout clock
      ARM: dts: rockchip: Add RV1126 I2C5
      ARM: dts: rockchip: Add Alientek DLRV1126

 .../devicetree/bindings/arm/rockchip.yaml          |   7 +
 .../devicetree/bindings/net/rockchip-dwmac.yaml    |   2 +-
 .../devicetree/bindings/vendor-prefixes.yaml       |   2 +
 arch/arm/boot/dts/rockchip/Makefile                |   1 +
 .../dts/rockchip/rv1126-alientek-clrv1126f.dtsi    | 277 +++++++++++++++++++++
 .../boot/dts/rockchip/rv1126-alientek-dlrv1126.dts | 258 +++++++++++++++++++
 arch/arm/boot/dts/rockchip/rv1126-pinctrl.dtsi     |  10 +
 arch/arm/boot/dts/rockchip/rv1126.dtsi             |  18 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c     |   6 +-
 9 files changed, 577 insertions(+), 4 deletions(-)
---
base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
change-id: 20260618-rv1126-alientek-dlrv1126-d94abdcf8580

Best regards,
--  
Yanan He <grumpycat921013@gmail.com>


* [PATCH] xsk: fix memory corruptions in net/core/xdp.c
From: Clement Lecigne @ 2026-06-24  8:41 UTC (permalink / raw)
  To: aleksander.lobakin, edumazet, netdev
  Cc: clecigne, bpf, linux-kernel, kuba, sdf, horms, john.fastabend,
	ast, daniel

From: Clément Lecigne <clecigne@google.com>

Commit 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
introduced a vulnerability in the handling of XDP_PASS for AF_XDP zero-copy
frames.

Note: Currently, this specific AF_XDP zero-copy conversion path is only
reachable from the drivers/net/ethernet/intel/ice driver.

When building an skb, xdp_build_skb_from_zc() uses the chunk size
(xdp->frame_sz) for the allocation. However, napi_build_skb() automatically
reserves space at the end of the allocation for the skb_shared_info
structure. 

Most high-performance UMEM applications use 4 KB chunks, where the
corruption cannot happen. However, if the UMEM is configured with 2 KB
chunks (a very common configuration to maximize packet density in memory),
a standard 1500 MTU packet will trigger the corruption because the required
space exceeds the 2048-byte chunk size:

Headroom (256) + Packet (1514) + skb_shared_info (320) = 2090 bytes

Because 2090 bytes > 2048 bytes and __skb_put() does not perform bounds
checking, the memcpy() writes past the available linear data area and
corrupts the skb_shared_info structure. This can lead to arbitrary code
execution if pointers like destructor_arg are overwritten.
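The arithmetic above can be checked with a small userspace sketch. It is
only a model under stated assumptions: a 64-byte cache line for the
SKB_DATA_ALIGN rounding and the 320-byte skb_shared_info size quoted
above. The helper names (data_align, required_truesize, chunk_too_small)
are illustrative and do not exist in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed constants: 64-byte cache lines and the 320-byte
 * skb_shared_info size quoted in the commit message.
 */
#define CACHE_BYTES	64u
#define SHINFO_BYTES	320u

/* Round x up to a multiple of CACHE_BYTES (models SKB_DATA_ALIGN). */
static uint32_t data_align(uint32_t x)
{
	return (x + CACHE_BYTES - 1) & ~(CACHE_BYTES - 1);
}

/* Bytes the skb needs: aligned headroom + packet, plus aligned shared info. */
static uint32_t required_truesize(uint32_t headroom, uint32_t len)
{
	return data_align(headroom + len) + data_align(SHINFO_BYTES);
}

/* True when a buffer of chunk_size bytes cannot hold the whole layout. */
static bool chunk_too_small(uint32_t chunk_size, uint32_t headroom,
			    uint32_t len)
{
	return required_truesize(headroom, len) > chunk_size;
}
```

With a 256-byte headroom and a 1514-byte frame, the unaligned sum is the
2090 bytes quoted above; with the assumed 64-byte rounding the model asks
for 2112 bytes, which does not fit a 2048-byte chunk but does fit a
4096-byte one.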

Additionally, in xdp_copy_frags_from_zc(), the allocation size is set
strictly to the fragment size (len), but the subsequent memcpy() uses
LARGEST_ALIGN(len). This mismatch results in an out-of-bounds write of
up to 7 bytes, which triggers KASAN warnings and is unsafe despite typical
page pool allocator padding.
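The size mismatch described above can be modelled as follows. This is a
sketch, assuming the 8-byte alignment implied by the "up to 7 bytes"
figure; largest_align and copy_overrun are illustrative names, not kernel
functions.

```c
#include <stdint.h>

/* Assumed alignment: 8 bytes, as implied by the "up to 7 bytes" overrun. */
#define ASSUMED_ALIGN	8u

/* Round len up to a multiple of ASSUMED_ALIGN (models LARGEST_ALIGN). */
static uint32_t largest_align(uint32_t len)
{
	return (len + ASSUMED_ALIGN - 1) & ~(ASSUMED_ALIGN - 1);
}

/* Bytes written past an allocation of alloc_len when the copy length is
 * rounded up but the allocation is not. Returns 0 when the copy fits.
 */
static uint32_t copy_overrun(uint32_t alloc_len, uint32_t len)
{
	uint32_t copied = largest_align(len);

	return copied > alloc_len ? copied - alloc_len : 0;
}
```

For a 1001-byte fragment, allocating exactly 1001 bytes gives an overrun
of 7 bytes in this model, while allocating the rounded-up 1008 bytes gives
none.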

Fix the skb allocation in xdp_build_skb_from_zc() by dynamically
calculating the exact truesize required: the sum of the headroom, the
packet length, and the skb_shared_info overhead, properly aligned via
SKB_DATA_ALIGN.

Fix the out-of-bounds write in xdp_copy_frags_from_zc() by rounding up
the allocation request using LARGEST_ALIGN(len) to match the copy
operation.

Fixes: 560d958c6c68 ("xsk: add generic XSk &xdp_buff -> skb conversion")
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Clément Lecigne <clecigne@google.com>
---
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9890a30584ba..f36d1fb875ab 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -699,7 +699,7 @@ static noinline bool xdp_copy_frags_from_zc(struct sk_buff *skb,
 	for (u32 i = 0; i < nr_frags; i++) {
 		const skb_frag_t *frag = &xinfo->frags[i];
 		u32 len = skb_frag_size(frag);
-		u32 offset, truesize = len;
+		u32 offset, truesize = LARGEST_ALIGN(len);
 		struct page *page;

 		page = page_pool_dev_alloc(pp, &offset, &truesize);
@@ -740,7 +740,9 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
 {
 	const struct xdp_rxq_info *rxq = xdp->rxq;
 	u32 len = xdp->data_end - xdp->data_meta;
-	u32 truesize = xdp->frame_sz;
+	u32 headroom = xdp->data_meta - xdp->data_hard_start;
+	u32 truesize = SKB_DATA_ALIGN(headroom + len) +
+		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	struct sk_buff *skb = NULL;
 	struct page_pool *pp;
 	int metalen;
@@ -762,7 +764,7 @@ struct sk_buff *xdp_build_skb_from_zc(struct xdp_buff *xdp)
 	}

 	skb_mark_for_recycle(skb);
-	skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
+	skb_reserve(skb, headroom);

 	memcpy(__skb_put(skb, len), xdp->data_meta, LARGEST_ALIGN(len));


* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Breno Leitao @ 2026-06-24  8:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Andrew Morton, Arnd Bergmann, Ben Segall,
	Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
	Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
	K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

Hello Sebastian,

First of all, thanks for working on this.

On Tue, Jun 23, 2026 at 04:26:49PM +0200, Sebastian Andrzej Siewior wrote:
> Provide a deferred version of the WARN_ON() macro. It will delay
> flushing the console until a later context. It is needed in a context
> where the caller holds locks which can lead to a deadlock if content is
> flushed to the console driver.
> An example would be a warning from within the scheduler resulting in a
> wake-up of a task.
> 
> Deferring the output works by using printk_deferred_enter()/
> printk_deferred_exit() around the printing output. This must be used in
> a context where the task can't migrate to another CPU. This should
> usually be the case, since the scheduler would acquire the rq lock with
> interrupts disabled, but to be safe preemption is disabled to guarantee
> this.
> 
> In order not to bloat the code on architectures which provide an
> optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> __report_bug() and does not increase the code size.
> 
> Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> macros. Extend __report_bug() to handle the deferred case.

Have you considered an approach similar to printk_deferred_enter(),
where you mark the code region that needs deferral and all WARN() calls
within that region are automatically deferred?

The current proposal requires changing individual WARN() call sites,
but whether they need deferral might depend on the calling context. This
means you'd need to convert many call sites and ensure all nested
warnings are also converted to the deferred variant.
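To make the region-based idea concrete, here is a rough single-threaded
userspace model. Everything in it is hypothetical: warn_defer_enter(),
warn_defer_exit() and warn_report() are names made up for illustration, a
real implementation would need per-CPU state, and the actual console flush
would happen later from a safe context rather than at region exit.

```c
#include <stdbool.h>

static int defer_depth;		/* nesting depth of deferral regions */
static int pending_warnings;	/* warnings queued inside a region */
static int flushed_warnings;	/* warnings already sent to the "console" */

/* Mark the start of a region in which warnings must not flush. */
static void warn_defer_enter(void)
{
	defer_depth++;
}

/* Leave a region; the outermost exit releases the queued warnings. */
static void warn_defer_exit(void)
{
	if (--defer_depth == 0) {
		flushed_warnings += pending_warnings;
		pending_warnings = 0;
	}
}

/* Report a warning. Returns true if it was emitted immediately and
 * false if it was queued because a deferral region is active.
 */
static bool warn_report(void)
{
	if (defer_depth > 0) {
		pending_warnings++;
		return false;
	}
	flushed_warnings++;
	return true;
}
```

The point of the sketch is that the call sites of warn_report() stay
unchanged: whether a warning is deferred depends only on the calling
context, including for warnings raised from nested helpers.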


Thanks,
--breno


* Please apply 736b380e28d0 and eca856950f7c down to 6.1.y
From: Wongi Lee @ 2026-06-24  8:14 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, Sasha Levin, netdev, David Ahern,
	Ido Schimmel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jungwoo Lee

Hi,

Could the following upstream commits be queued for the active stable
trees?

  commit 736b380e28d0480c7bc3e022f1950f31fe53a7c5
  ("ipv6: account for fraggap on the paged allocation path")

  commit eca856950f7cb1a221e02b99d758409f2c5cec42
  ("ipv4: account for fraggap on the paged allocation path")

These fix incorrect fraggap accounting in the paged allocation path.
This can write past skb->end into skb_shared_info when MSG_MORE is 
used together with MSG_SPLICE_PAGES.

Please apply these to 6.1.y, 6.6.y, 6.12.y, 6.18.y, 7.0.y and 7.1.y.

I checked that the IPv6 upstream commit cherry-picks cleanly onto the
following stable branches:

linux-7.0.y
linux-6.18.y
linux-6.12.y
linux-6.6.y
linux-6.1.y

I checked that the IPv4 upstream commit cherry-picks cleanly onto the
following stable branches:

linux-7.0.y
linux-6.18.y

The IPv4 commit needs a small context-only backport for:

linux-6.12.y
linux-6.6.y
linux-6.1.y

Thanks,
Wongi


* Re: [PATCH net V3 1/3] net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
From: Tariq Toukan @ 2026-06-24  7:53 UTC (permalink / raw)
  To: Simon Horman, tariqt
  Cc: andrew+netdev, davem, edumazet, kuba, netdev, pabeni, cratiu,
	eranbe, feliu, haiyangz, lkayal, leon, linux-kernel, linux-rdma,
	mbloch, noren, saeedm, gal, alazar, cjubran, kees, eranbe, saeedm
In-Reply-To: <20260623104624.1073738-3-horms@kernel.org>



On 23/06/2026 13:46, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> 
> Simon says: I think this issue can be addressed by reversing the order
> that patches 1/3 and 2/3 appear in this patch-set.
> 
> ---
> net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
> 
> This commit fixes a zero-sized buffer allocation by using max_nch instead of
> stats_nch when allocating the buffer for HV VHCA stats.
> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
>> index 195863b2c013e..06cbd49d4e98e 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
>> @@ -122,7 +128,7 @@ static void mlx5e_hv_vhca_stats_cleanup(struct mlx5_hv_vhca_agent *agent)
>>   
>>   void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
>>   {
>> -	int buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
>> +	int buf_len = mlx5e_hv_vhca_stats_buf_max_size(priv);
>>   	struct mlx5_hv_vhca_agent *agent;
>>   
>>   	priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
> 
> [Severity: High]
> Does this change introduce a potential double-free regression on memory
> allocation failure?
> 
> When mlx5e_hv_vhca_stats_destroy() cleans up the stats agent, it frees the
> agent but does not set priv->stats_agent.agent to NULL.
> 

Good catch on the double-free, thank you.

Swapping the patches has its own problem: patch 2/3 also modifies 
mlx5e_hv_vhca_stats_create(), so applying the race fix first would leave 
an intermediate tree that still has the zero-sized buffer bug, which is 
exactly the condition the race fix builds on top of.

Instead, I think we'll make patch 1/3 self-contained by including the 
NULL assignments in mlx5e_hv_vhca_stats_destroy():

       mlx5_hv_vhca_agent_destroy(priv->stats_agent.agent);
       priv->stats_agent.agent = NULL;
       kvfree(priv->stats_agent.buf);
       priv->stats_agent.buf = NULL;

This closes the double-free window in the same patch that opens it, with 
no ordering dependency on patch 2/3.
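As a rough userspace illustration of why the NULL assignments help, the
model below replaces the driver objects with plain heap allocations. The
struct and function names are simplified stand-ins, not the mlx5 code, and
it relies on free(NULL) being a no-op, just as the kernel free helpers
accept NULL.

```c
#include <stdlib.h>

struct stats_agent {
	void *agent;
	void *buf;
};

static int free_calls;	/* counts frees of non-NULL pointers */

static void tracked_free(void *p)
{
	if (p)
		free_calls++;
	free(p);
}

/* Destroy that clears the pointers, so a second call frees nothing. */
static void stats_destroy(struct stats_agent *sa)
{
	tracked_free(sa->agent);
	sa->agent = NULL;
	tracked_free(sa->buf);
	sa->buf = NULL;
}

/* Create with an optional simulated buffer allocation failure. The early
 * return leaves sa->agent untouched, which is only safe if destroy
 * cleared it earlier.
 */
static int stats_create(struct stats_agent *sa, int fail_buf_alloc)
{
	sa->buf = fail_buf_alloc ? NULL : calloc(1, 64);
	if (!sa->buf)
		return -1;

	sa->agent = calloc(1, 16);
	if (!sa->agent) {
		free(sa->buf);
		sa->buf = NULL;
		return -1;
	}
	return 0;
}
```

In this model a create, destroy, failed create, destroy sequence performs
exactly two frees. Without the NULL assignments, the last destroy would
free the stale agent pointer a second time.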

> Before this patch, kvzalloc() was called with 0 and returned ZERO_SIZE_PTR,
> so it never returned NULL. It always proceeded to call
> mlx5_hv_vhca_agent_create() which overwrote priv->stats_agent.agent.
> 
> Now that the buffer size is non-zero, if kvzalloc() fails it will return
> NULL and take the early return path in mlx5e_hv_vhca_stats_create():
> 
> 	priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
> 	if (!priv->stats_agent.buf)
> 		return;
> 
> This early return bypasses overwriting the dangling priv->stats_agent.agent.
> If a subsequent device detach occurs, mlx5e_hv_vhca_stats_destroy() could
> dereference and free this stale pointer again.
> 
> I noticed this was later addressed upstream in commit e600849cc1e0
> ("net/mlx5e: Fix HV VHCA stats agent registration race").
> 



* Re: [PATCH net] net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
From: Krzysztof Kozlowski @ 2026-06-24  7:47 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Krzysztof Kozlowski, Jie Luo, Bjorn Andersson, Michael Turquette,
	Stephen Boyd, Brian Masney, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Lei Wei, Suruchi Agarwal, Pavithra R,
	linux-kernel, linux-arm-msm, linux-clk, devicetree, netdev
In-Reply-To: <0c1be662-3c6a-41be-bc19-081d0299ceae@lunn.ch>

On Tue, Jun 23, 2026 at 01:33:05PM +0200, Andrew Lunn wrote:
> > If address did not work for half a year, I really doubt that you commit
> > to above.
> 
> I tend to agree. Maybe we should set it to Orphaned, and then decide
> in 6 months time if it can be set back to Maintained?

That would be a bit stronger than I intended. Jie did actually respond,
so the driver might well be maintained. I also think that Qualcomm is
committed to maintaining it; I only doubt the "Supported" status.

Best regards,
Krzysztof



* Re: [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-24  7:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260623132014-mutt-send-email-mst@kernel.org>


6/23/26 20:26, Michael S. Tsirkin wrote:
> On Tue, Jun 23, 2026 at 06:38:19PM +0300, Arseniy Krasnov wrote:
>> Logically it was based on TCP implementation, so to make further support
>> easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
>> patch only rewrites flag handling (e.g. it doesn't change logic).
>>
>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>
> It seems to change logic though:
>
>> ---
>>  Changelog v1->v2:
>>  * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>>    already added.
>>  Changelog v2->v3:
>>  * Update commit message.
>>  * Remove one empty line.
>>
>>  net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
>>  1 file changed, 22 insertions(+), 25 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 09475007165b..41c2a0b82a8e 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>  		return pkt_len;
>>  
>> -	if (info->msg) {
>> -		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>> -		 * there is no MSG_ZEROCOPY flag set.
>> +	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>> +		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>> +		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>> +		 * handling from 'tcp_sendmsg_locked()'.
>>  		 */
>> -		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> -			info->msg->msg_flags &= ~MSG_ZEROCOPY;
> So previously without SOCK_ZEROCOPY, MSG_ZEROCOPY was always ignored...
>
>
>> +		if (info->msg->msg_ubuf) {
>> +			uarg = info->msg->msg_ubuf;
>> +			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
> now it's not in this case?

Yes, this case is currently for io_uring only, because io_uring doesn't set SOCK_ZEROCOPY to perform zerocopy transmission.
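The behavioural difference can be summarised with a small decision model.
This is only a sketch of the flag handling visible in the quoted diff: the
struct and both functions are illustrative, and "tries zerocopy" here just
means that virtio_transport_can_zcopy() would be consulted.

```c
#include <stdbool.h>

struct send_ctx {
	bool msg_zerocopy;	/* MSG_ZEROCOPY set in msg_flags */
	bool sock_zerocopy;	/* SOCK_ZEROCOPY enabled via setsockopt() */
	bool has_ubuf;		/* caller supplied msg_ubuf (io_uring case) */
};

/* Old logic: MSG_ZEROCOPY is cleared whenever SOCK_ZEROCOPY is not set. */
static bool old_tries_zerocopy(const struct send_ctx *c)
{
	bool flag = c->msg_zerocopy;

	if (!c->sock_zerocopy)
		flag = false;
	return flag;
}

/* New logic: a caller-provided msg_ubuf is honoured even without
 * SOCK_ZEROCOPY; otherwise SOCK_ZEROCOPY is still required.
 */
static bool new_tries_zerocopy(const struct send_ctx *c)
{
	if (!c->msg_zerocopy)
		return false;
	if (c->has_ubuf)
		return true;
	return c->sock_zerocopy;
}
```

The two functions differ only for MSG_ZEROCOPY with msg_ubuf set and
SOCK_ZEROCOPY clear, which is the io_uring case discussed here.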

>
>
> Maybe the right call, but saying "does not change logic" seems wrong.

Agreed, I need to update the commit message again :)

Thanks

>
>
>> +		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>> +			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>> +						    NULL, false);
>> +			if (!uarg) {
>> +				virtio_transport_put_credit(vvs, pkt_len);
>> +				return -ENOMEM;
>> +			}
>>  
>> -		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>  			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>> +			if (!can_zcopy)
>> +				uarg_to_msgzc(uarg)->zerocopy = 0;
>>  
>> +			have_uref = true;
>> +		}
>> +
>> +		/* 'can_zcopy' means that this transmission will be
>> +		 * in zerocopy way (e.g. using 'frags' array).
>> +		 */
>>  		if (can_zcopy)
>>  			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>  					    (MAX_SKB_FRAGS * PAGE_SIZE));
>> -
>> -		if (info->msg->msg_flags & MSG_ZEROCOPY &&
>> -		    info->op == VIRTIO_VSOCK_OP_RW) {
>> -			uarg = info->msg->msg_ubuf;
>> -
>> -			if (!uarg) {
>> -				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> -							    pkt_len, NULL, false);
>> -				if (!uarg) {
>> -					virtio_transport_put_credit(vvs, pkt_len);
>> -					return -ENOMEM;
>> -				}
>> -
>> -				if (!can_zcopy)
>> -					uarg_to_msgzc(uarg)->zerocopy = 0;
>> -
>> -				have_uref = true;
>> -			}
>> -		}
>>  	}
>>  
>>  	rest_len = pkt_len;
>> -- 
>> 2.25.1


* [PATCH net] net: clear transport header during tunnel decapsulation
From: Eric Dumazet @ 2026-06-24  7:32 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Ido Schimmel, David Ahern, netdev, eric.dumazet,
	Eric Dumazet, syzbot+d5d0d598a4cfdfafdc3b

Syzbot triggered a DEBUG_NET_WARN_ON_ONCE(len > INT_MAX) assertion in
pskb_may_pull_reason() called from qdisc_pkt_len_segs_init().

The root cause is a stale, negative transport header offset carried over
during tunnel decapsulation. When a tunnel receiver (e.g., VXLAN or Geneve)
decapsulates a packet, it pulls the outer headers but leaves the transport
header pointing to the outer UDP header. This offset becomes negative
relative to the new skb->data (inner IP header).

If the packet bypasses GRO (e.g., an untrusted GSO packet flagged as
"unexpected GSO" by udp_unexpected_gso() due to missing tunnel GSO bits),
it is flushed directly to the stack as GRO_NORMAL. On ingress, Layer 2 Qdisc
processing (sch_handle_ingress) happens before Layer 3 IP reception
(ip_rcv_core) can run and reset the transport header. Consequently,
qdisc_pkt_len_segs_init() attempts to validate the transport header using
pskb_may_pull(skb, hdr_len + sizeof(tcphdr)). The negative hdr_len overflows
the unsigned cast in pskb_may_pull(), triggering the assertion.
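
As a rough userspace illustration of the overflow described above (a sketch only, with hypothetical helper names that are not kernel functions), a negative offset added to an unsigned header size wraps to a value above INT_MAX:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the "len > INT_MAX" debug check. */
static bool toy_len_trips_warning(unsigned int len)
{
	return len > (unsigned int)INT_MAX;
}

/*
 * Hypothetical helper: hdr_len is the transport header offset relative
 * to skb->data, l4_size is the size of the L4 header to validate.
 */
static bool toy_stale_offset_trips_warning(int hdr_len, unsigned int l4_size)
{
	/*
	 * The int operand is converted to unsigned int before the add,
	 * so a negative sum wraps to a value close to UINT_MAX.
	 */
	unsigned int len = (unsigned int)hdr_len + l4_size;

	return toy_len_trips_warning(len);
}
```

For example, an offset of -26 with a 20 byte header gives -6, which wraps to 0xfffffffa as an unsigned 32-bit value and is therefore greater than INT_MAX.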

Fix this by clearing the transport header to the ~0U sentinel value during
decapsulation. This ensures that:
1) The ingress Qdisc safely skips validation via !skb_transport_header_was_set()
   and returns early without warning.
2) The IP layer (ip_rcv_core) later correctly resets the transport header
   to the inner L4 header offset.

Introduce skb_unset_transport_header() helper and apply it in the main
decapsulation paths:
1) __iptunnel_pull_header() (covering Geneve, GRE, IPIP, SIT, etc.)
2) vxlan_rcv() (covering VXLAN)

This restores skb invariants at the decapsulation boundary without adding
overhead to the Qdisc fast path.

Fixes: 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
Reported-by: syzbot+d5d0d598a4cfdfafdc3b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a3b853b.52ae72c2.136ac7.000c.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Assisted-by: Gemini:gemini-3.1-pro
---
 drivers/net/vxlan/vxlan_core.c | 1 +
 include/linux/skbuff.h         | 5 +++++
 net/ipv4/ip_tunnel_core.c      | 1 +
 3 files changed, 7 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 67c367cc566233e809b0f70e0d939dd1c1ac0d9f..49318ad8164a2f2572fc58c0ed449b68922ae71e 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -1799,6 +1799,7 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)

 	dev_dstats_rx_add(vxlan->dev, skb->len);
 	vxlan_vnifilter_count(vxlan, vni, vninode, VXLAN_VNI_STATS_RX, skb->len);
+	skb_unset_transport_header(skb);
 	gro_cells_receive(&vxlan->gro_cells, skb);

 	rcu_read_unlock();
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 115db8c44db21383632dd150a17c9ddcc03508e4..e8305a0fd3857ab85da4c2e8322989ed93e88d87 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3084,6 +3084,11 @@ static inline bool skb_transport_header_was_set(const struct sk_buff *skb)
 	return skb->transport_header != (typeof(skb->transport_header))~0U;
 }

+static inline void skb_unset_transport_header(struct sk_buff *skb)
+{
+	skb->transport_header = (typeof(skb->transport_header))~0U;
+}
+
 static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
 {
 	DEBUG_NET_WARN_ON_ONCE(!skb_transport_header_was_set(skb));
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index d3c677e9bff2080e4760347a3d873da4e83ac3ca..59192f58da2e3aae19d00505cc3bb04b083b77c5 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -134,6 +134,7 @@ int __iptunnel_pull_header(struct sk_buff *skb, int hdr_len,
 	__vlan_hwaccel_clear_tag(skb);
 	skb_set_queue_mapping(skb, 0);
 	skb_scrub_packet(skb, xnet);
+	skb_unset_transport_header(skb);

 	return iptunnel_pull_offloads(skb);
 }
-- 
2.55.0.rc0.799.gd6f94ed593-goog


* Re: [PATCH bpf-next v8 7/7] selftests/bpf: add bpf_icmp_send recursion test
From: Emil Tsalapatis @ 2026-06-24  7:31 UTC (permalink / raw)
  To: Mahe Tardy, bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song
In-Reply-To: <20260622120515.137082-8-mahe.tardy@gmail.com>

On Mon Jun 22, 2026 at 8:05 AM EDT, Mahe Tardy wrote:
> This test is similar to test_icmp_send_unreach_cgroup but checks that,
> in case of recursion, meaning that the BPF program calling the kfunc was
> re-triggered by the icmp_send done by the kfunc, the kfunc will stop
> early and return -EBUSY.
>
> The test attaches to the root cgroup to ensure the ICMP packet generated
> by the kfunc re-triggers the BPF program. Since it's attached only for
> this recursion test, it should not disrupt the whole network.
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  .../bpf/prog_tests/icmp_send_kfunc.c          | 45 +++++++++++++++
>  tools/testing/selftests/bpf/progs/icmp_send.c | 56 +++++++++++++++++++
>  2 files changed, 101 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> index 66447681f72d..fd4b8fa78a01 100644
> --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> @@ -1,8 +1,10 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <test_progs.h>
>  #include <network_helpers.h>
> +#include <cgroup_helpers.h>
>  #include <linux/errqueue.h>
>  #include <poll.h>
> +#include <unistd.h>
>  #include "icmp_send.skel.h"
>
>  #define TIMEOUT_MS 1000
> @@ -10,6 +12,7 @@
>  #define ICMP_DEST_UNREACH 3
>  #define ICMPV6_DEST_UNREACH 1
>
> +#define ICMP_HOST_UNREACH 1
>  #define ICMP_FRAG_NEEDED 4
>  #define NR_ICMP_UNREACH 15
>  #define ICMPV6_REJECT_ROUTE 6
> @@ -203,3 +206,45 @@ void test_icmp_send_unreach_tc(void)
>  	bpf_link__destroy(link);
>  	icmp_send__destroy(skel);
>  }
> +
> +void test_icmp_send_unreach_recursion(void)
> +{
> +	struct icmp_send *skel;
> +	int cgroup_fd = -1;
> +
> +	skel = icmp_send__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	if (setup_cgroup_environment()) {
> +		fprintf(stderr, "Failed to setup cgroup environment\n");
> +		goto cleanup;
> +	}
> +
> +	cgroup_fd = get_root_cgroup();
> +	if (!ASSERT_OK_FD(cgroup_fd, "get_root_cgroup"))
> +		goto cleanup;
> +
> +	skel->data->target_pid = getpid();
> +	skel->links.recursion =
> +		bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	trigger_prog_read_icmp_errqueue(skel, ICMP_HOST_UNREACH, AF_INET,
> +					"127.0.0.1");
> +
> +	/*
> +	 * Because there's recursion involved, the first call will return at
> +	 * index 1 since it will return the second, and the second call will
> +	 * return at index 0 since it will return the first.
> +	 */
> +	ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
> +	ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
> +
> +cleanup:
> +	cleanup_cgroup_environment();
> +	icmp_send__destroy(skel);
> +	if (cgroup_fd >= 0)
> +		close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> index 5fa5467bdb70..fd9c7684797b 100644
> --- a/tools/testing/selftests/bpf/progs/icmp_send.c
> +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> @@ -13,6 +13,10 @@ __u16 server_port = 0;
>  int unreach_type = 0;
>  int unreach_code = 0;
>  int kfunc_ret = -1;
> +int target_pid = -1;
> +
> +unsigned int rec_count = 0;
> +int rec_kfunc_rets[] = { -1, -1 };
>
>  SEC("cgroup_skb/egress")
>  int egress(struct __sk_buff *skb)
> @@ -125,4 +129,56 @@ int tc_egress(struct __sk_buff *skb)
>  	return TCX_DROP;
>  }
>
> +SEC("cgroup_skb/egress")
> +int recursion(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct icmphdr *icmph;
> +	struct tcphdr *tcph;
> +	struct iphdr *iph;
> +	int ret;
> +
> +	if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
> +		return SK_PASS;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4)
> +		return SK_PASS;
> +
> +	if (iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	if (iph->protocol == IPPROTO_TCP) {
> +		tcph = (void *)iph + iph->ihl * 4;
> +		if ((void *)(tcph + 1) > data_end ||
> +		    tcph->dest != bpf_htons(server_port))
> +			return SK_PASS;
> +	} else if (iph->protocol == IPPROTO_ICMP) {
> +		icmph = (void *)iph + iph->ihl * 4;
> +		if ((void *)(icmph + 1) > data_end ||
> +		    icmph->type != unreach_type ||
> +		    icmph->code != unreach_code)
> +			return SK_PASS;
> +	} else {
> +		return SK_PASS;
> +	}
> +
> +	/*
> +	 * This call will provoke a recursion: the ICMP packet generated by the
> +	 * kfunc will re-trigger this program since we are in the root cgroup in
> +	 * which the kernel ICMP socket belongs. However when re-entering the
> +	 * kfunc, it should return EBUSY.
> +	 */
> +	ret = bpf_icmp_send(skb, unreach_type, unreach_code);
> +	rec_kfunc_rets[rec_count & 1] = ret;
> +	__sync_fetch_and_add(&rec_count, 1);
> +
> +	/* Let the first ICMP error message pass */
> +	if (iph->protocol == IPPROTO_ICMP)
> +		return SK_PASS;
> +
> +	return SK_DROP;
> +}
> +
>  char LICENSE[] SEC("license") = "Dual BSD/GPL";
> --
> 2.34.1



* Re: [PATCH bpf-next v8 4/7] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests
From: Emil Tsalapatis @ 2026-06-24  7:26 UTC (permalink / raw)
  To: Mahe Tardy, bpf
  Cc: andrii, ast, daniel, edumazet, john.fastabend, jordan, kuba,
	martin.lau, netdev, netfilter-devel, pabeni, yonghong.song
In-Reply-To: <20260622120515.137082-5-mahe.tardy@gmail.com>

On Mon Jun 22, 2026 at 8:05 AM EDT, Mahe Tardy wrote:
> This test opens a server and client, enters a new cgroup, attaches a
> cgroup_skb program on egress and calls the bpf_icmp_send function from
> the client egress so that an ICMP unreach control message is sent back
> to the client. It then fetches the message from the error queue to
> confirm the correct ICMP unreach code has been sent.
>
> Note that, for the client, we have to connect in non-blocking mode to
> let the test execute faster. Otherwise, we need to wait for the TCP
> three-way handshake to time out in the kernel before reading the errno.
>
> Also note that we don't set IP_RECVERR on the socket in
> connect_to_fd_nonblock since the error will be transferred anyway in our
> test because the connection is rejected at the beginning of the TCP
> handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err for more details.
>
> Reviewed-by: Jordan Rife <jordan@jrife.io>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

> ---
>  .../bpf/prog_tests/icmp_send_kfunc.c          | 151 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/icmp_send.c |  38 +++++
>  2 files changed, 189 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
>  create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> new file mode 100644
> index 000000000000..f4e5b883d4c8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> @@ -0,0 +1,151 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include <poll.h>
> +#include "icmp_send.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
> +
> +static int connect_to_fd_nonblock(int server_fd)
> +{
> +	struct sockaddr_storage addr;
> +	socklen_t len = sizeof(addr);
> +	int fd, err;
> +
> +	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
> +		return -1;
> +
> +	fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
> +	if (fd < 0)
> +		return -1;
> +
> +	err = connect(fd, (struct sockaddr *)&addr, len);
> +	if (err < 0 && errno != EINPROGRESS) {
> +		close(fd);
> +		return -1;
> +	}
> +
> +	return fd;
> +}
> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> +	struct sock_extended_err *sock_err;
> +	char ctrl_buf[512];
> +	struct msghdr msg = {
> +		.msg_control = ctrl_buf,
> +		.msg_controllen = sizeof(ctrl_buf),
> +	};
> +	struct pollfd pfd = {
> +		.fd = sockfd,
> +		.events = POLLERR,
> +	};
> +	struct cmsghdr *cm;
> +	ssize_t n;
> +
> +	if (!ASSERT_GE(poll(&pfd, 1, TIMEOUT_MS), 1, "poll_errqueue"))
> +		return;
> +
> +	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> +	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> +		return;
> +
> +	cm = CMSG_FIRSTHDR(&msg);
> +	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
> +		return;
> +
> +	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +		if (cm->cmsg_level != IPPROTO_IP || cm->cmsg_type != IP_RECVERR)
> +			continue;
> +
> +		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> +		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> +			       "sock_err_origin_icmp"))
> +			return;
> +		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> +			       "sock_err_type_dest_unreach"))
> +			return;
> +		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> +		return;
> +	}
> +
> +	ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
> +}
> +
> +static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel, int code)
> +{
> +	int srv_fd = -1, client_fd = -1;
> +	struct sockaddr_in addr;
> +	socklen_t len = sizeof(addr);
> +
> +	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0, TIMEOUT_MS);
> +	if (!ASSERT_OK_FD(srv_fd, "start_server"))
> +		return;
> +
> +	if (getsockname(srv_fd, (struct sockaddr *)&addr, &len)) {
> +		close(srv_fd);
> +		return;
> +	}
> +	skel->bss->server_port = ntohs(addr.sin_port);
> +	skel->bss->unreach_code = code;
> +
> +	client_fd = connect_to_fd_nonblock(srv_fd);
> +	if (!ASSERT_OK_FD(client_fd, "client_connect_nonblock")) {
> +		close(srv_fd);
> +		return;
> +	}
> +
> +	/* Skip reading ICMP error queue if code is invalid */
> +	if (code >= 0 && code <= NR_ICMP_UNREACH)
> +		read_icmp_errqueue(client_fd, code);
> +
> +	close(client_fd);
> +	close(srv_fd);
> +}
> +
> +void test_icmp_send_unreach_cgroup(void)
> +{
> +	struct icmp_send *skel;
> +	int cgroup_fd = -1;
> +
> +	skel = icmp_send__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> +	if (!ASSERT_OK_FD(cgroup_fd, "join_cgroup"))
> +		goto cleanup;
> +
> +	skel->links.egress =
> +		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
> +		/*
> +		 * The TCP stack reacts differently when asking for
> +		 * fragmentation, let's ignore it for now.
> +		 */
> +		if (code == ICMP_FRAG_NEEDED)
> +			continue;
> +
> +		trigger_prog_read_icmp_errqueue(skel, code);
> +		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
> +	}
> +
> +	/* Test an invalid code */
> +	trigger_prog_read_icmp_errqueue(skel, -1);
> +	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
> +
> +cleanup:
> +	icmp_send__destroy(skel);
> +	if (cgroup_fd >= 0)
> +		close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> new file mode 100644
> index 000000000000..6d0be0a9afe1
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +/* 127.0.0.1 in host byte order */
> +#define SERVER_IP 0x7F000001
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +__u16 server_port = 0;
> +int unreach_code = 0;
> +int kfunc_ret = -1;
> +
> +SEC("cgroup_skb/egress")
> +int egress(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +	struct tcphdr *tcph;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	tcph = (void *)iph + iph->ihl * 4;
> +	if ((void *)(tcph + 1) > data_end ||
> +	    tcph->dest != bpf_htons(server_port))
> +		return SK_PASS;
> +
> +	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
> +
> +	return SK_DROP;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> --
> 2.34.1



* [PATCH net] net: enetc: fix potential divide-by-zero when num_vsi is zero
From: wei.fang @ 2026-06-24  7:27 UTC (permalink / raw)
  To: claudiu.manoil, vladimir.oltean, xiaoning.wang, andrew+netdev,
	davem, edumazet, kuba, pabeni
  Cc: Frank.Li, wei.fang, imx, netdev, linux-kernel

From: Wei Fang <wei.fang@nxp.com>

On the i.MX94 series, none of the standalone ENETCs support SR-IOV, so
pf->caps.num_vsi is zero. This leads to a divide-by-zero in
enetc4_default_rings_allocation() when distributing rings among the PF
and VFs.

Division by zero is undefined behavior in C. On ARM64, the UDIV/SDIV
instructions silently return zero rather than raising an exception, so
the issue does not cause a visible crash. However, relying on this
behavior is incorrect and poses a cross-platform compatibility risk.

Add an explicit check for num_vsi == 0 and return early after the PF's
rings have been configured.
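
A minimal userspace sketch of the guarded split, for illustration only; struct toy_ring_split and toy_split_rings() are hypothetical names and do not appear in the driver:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: rings per VSI plus the leftover rings. */
struct toy_ring_split {
	unsigned int per_vsi;
	unsigned int remainder;
};

/*
 * Split the rings left over after the PF took pf_rings among num_vsi
 * VSIs. Returns false, leaving *out untouched, when there is nothing
 * to split, so the division below is never reached with num_vsi == 0.
 */
static bool toy_split_rings(unsigned int total, unsigned int pf_rings,
			    unsigned int num_vsi,
			    struct toy_ring_split *out)
{
	unsigned int left;

	assert(out);

	if (!num_vsi || pf_rings > total)
		return false;

	left = total - pf_rings;
	out->per_vsi = left / num_vsi;
	out->remainder = left % num_vsi;

	return true;
}
```

The early return plays the same role as the check added by this patch: the PF configuration is complete before the function bails out, and the modulo and division are only evaluated with a non-zero divisor.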

Fixes: 2d673b0e2f8d ("net: enetc: add standalone ENETC support for i.MX94")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/enetc4_pf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc4_pf.c b/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
index 4e771f852358..437a15bbb47b 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc4_pf.c
@@ -322,6 +322,9 @@ static void enetc4_default_rings_allocation(struct enetc_pf *pf)
 	val = enetc4_psicfgr0_val_construct(false, num_tx_bdr, num_rx_bdr);
 	enetc_port_wr(hw, ENETC4_PSICFGR0(0), val);

+	if (!pf->caps.num_vsi)
+		return;
+
 	num_rx_bdr = pf->caps.num_rx_bdr - num_rx_bdr;
 	rx_rem = num_rx_bdr % pf->caps.num_vsi;
 	num_rx_bdr = num_rx_bdr / pf->caps.num_vsi;
-- 
2.34.1


* [PATCH v5 net] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-24  7:21 UTC (permalink / raw)
  To: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov
  Cc: Shradha Gupta, linux-hyperv, linux-kernel, netdev, Paul Rosswurm,
	Shradha Gupta, Saurabh Singh Sengar, stable, Yury Norov

Before commit 755391121038 ("net: mana: Allocate MSI-X vectors
dynamically"), all the MANA IRQs were assigned statically and together
during early driver load.

After this commit, IRQ allocation for MANA is done in two phases: the
HWC IRQ is allocated first, and the queue IRQs are added dynamically at
a later point. By then, the IRQ weights on the vCPUs can have become
imbalanced, and if the IRQ count is greater than the vCPU count, the
topology-aware IRQ distribution logic in MANA can cause multiple MANA
IRQs to land on the same vCPUs while other sibling vCPUs have none
(case 1).

On SMP-enabled, low-vCPU systems this becomes a bigger problem, because
the softIRQ handling overhead of two IRQs sharing the same vCPU is much
higher than when they are spread across sibling vCPUs.

In such cases, throughput drops significantly when many parallel TCP
connections are tested.

Fix the affinity assignment logic for the case where the IRQ count is
greater than the vCPU count and IRQs are added dynamically, by spreading
the IRQs across all vCPUs irrespective of their NUMA/core bindings
(case 2).
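
As a rough userspace model of the linear spreading (a sketch under simplifying assumptions, not the driver code: toy_irq_setup_linear() is a hypothetical name, the online mask is a plain bool array, and "pinning" just records a CPU number):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk the CPUs in order and give each online CPU exactly one IRQ until
 * the IRQs run out. IRQs left over once the online CPUs are exhausted
 * keep whatever value the caller stored in irq_cpu[] beforehand.
 * Returns the number of IRQs that were pinned.
 */
static size_t toy_irq_setup_linear(const bool *cpu_online, size_t nr_cpus,
				   int *irq_cpu, size_t nr_irqs)
{
	size_t assigned = 0;

	assert(cpu_online && irq_cpu);

	for (size_t cpu = 0; cpu < nr_cpus && assigned < nr_irqs; cpu++) {
		if (!cpu_online[cpu])
			continue;

		irq_cpu[assigned++] = (int)cpu;
	}

	return assigned;
}
```

With four online vCPUs and four queue IRQs this yields the 0, 1, 2, 3 placement shown for case 2; if fewer vCPUs are online than there are IRQs, the trailing IRQs are simply left alone.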

Setting the affinity and hint to NULL was also evaluated. We observed
that if there are pre-existing (non-MANA) IRQs allocated on the VM when
the MANA IRQs are allocated, this approach again leads to clustering of
the MANA queue IRQs (case 3).


=======================================================
Case 1: without this patch
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC		0
IRQ1:	mana_q1		0
IRQ2:	mana_q2		2
IRQ3:	mana_q3		0
IRQ4:	mana_q4		3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU		0	1	2	3
=======================================================
pass 1:		38.85	0.03	24.89	24.65
pass 2:		39.15	0.03	24.57	25.28
pass 3:		40.36	0.03	23.20	23.17

=======================================================
Case 2: with this patch
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

        TYPE            effective vCPU aff
=======================================================
IRQ0:   HWC             0
IRQ1:   mana_q1         0
IRQ2:   mana_q2         1
IRQ3:   mana_q3         2
IRQ4:   mana_q4         3

%soft on each vCPU(mpstat -P ALL 1) on receiver
vCPU            0       1       2       3
=======================================================
pass 1:         15.42	15.85	14.99	14.51
pass 2:         15.53	15.94	15.81	15.93
pass 3:         16.41	16.35	16.40	16.36

=======================================================
Case 3: with affinity set to NULL
=======================================================
4 vCPU(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)

	TYPE		effective vCPU aff
=======================================================
IRQ0:	HWC			0
IRQ1:	mana_q1			2
IRQ2:	mana_q2			3
IRQ3:	mana_q3			2
IRQ4:	mana_q4			3

=======================================================
Throughput Impact(in Gbps, same env)
=======================================================
TCP conn	with patch	w/o patch	aff NULL
20480		15.65		7.73		5.25
10240		15.63		8.93		5.77
8192		15.64		9.69		7.16
6144		15.64		13.16		9.33
4096		15.69		15.75		13.50
2048		15.69		15.83		13.61
1024		15.71		15.28		13.60

Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
Cc: stable@vger.kernel.org
Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Yury Norov <ynorov@nvidia.com>
---
Changes in v5
 * modify commit message to align with fix patch format
---
Changes in v4
 * Add mana prefix on irq_affinity_*() in mana driver
 * Corrected grammar, comment for mana_irq_setup_linear()
 * added new line as per guidelines
 * added case 3 in commit message for when affinity is NULL
---
Changes in v3
 * Optimize the comments in mana_gd_setup_dyn_irqs()
 * add more details in the dev_dbg for extra IRQs
---
Changes in v2
 * Removed the unused skip_first_cpu variable
 * fixed exit condition in irq_setup_linear() with len == 0
 * changed return type of irq_setup_linear() as it will always be 0
 * removed the unnecessary rcu_read_lock() in irq_setup_linear()
 * added appropriate comments to indicate expected behaviour when
   IRQs are more than or equal to num_online_cpus()
---
 .../net/ethernet/microsoft/mana/gdma_main.c   | 78 +++++++++++++++----
 1 file changed, 64 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index a0fdd052d7f1..e8b7ffb47eb9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -210,6 +210,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
 	} else {
 		/* If dynamic allocation is enabled we have already allocated
 		 * hwc msi
+		 * Also, we make sure in this case the following is always true
+		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
 		 */
 		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
 	}
@@ -1909,8 +1911,8 @@ void mana_gd_free_res_map(struct gdma_resource *r)
  * do the same thing.
  */
 
-static int irq_setup(unsigned int *irqs, unsigned int len, int node,
-		     bool skip_first_cpu)
+static int mana_irq_setup_numa_aware(unsigned int *irqs, unsigned int len,
+				     int node, bool skip_first_cpu)
 {
 	const struct cpumask *next, *prev = cpu_none_mask;
 	cpumask_var_t cpus __free(free_cpumask_var);
@@ -1946,11 +1948,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
 	return 0;
 }
 
+/* must be called with cpus_read_lock() held */
+static void mana_irq_setup_linear(unsigned int *irqs, unsigned int len)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (len == 0)
+			break;
+
+		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
+		len--;
+	}
+}
+
 static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 {
 	struct gdma_context *gc = pci_get_drvdata(pdev);
 	struct gdma_irq_context *gic;
-	bool skip_first_cpu = false;
 	int *irqs, err, i, msi;
 
 	irqs = kmalloc_objs(int, nvec);
@@ -1958,10 +1973,12 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 		return -ENOMEM;
 
 	/*
+	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
+	 * nvec is only Queue IRQ (HWC already setup).
 	 * While processing the next pci irq vector, we start with index 1,
 	 * as IRQ vector at index 0 is already processed for HWC.
 	 * However, the population of irqs array starts with index 0, to be
-	 * further used in irq_setup()
+	 * further used in mana_irq_setup_numa_aware()
 	 */
 	for (i = 1; i <= nvec; i++) {
 		msi = i;
@@ -1975,18 +1992,51 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
 	}
 
 	/*
-	 * When calling irq_setup() for dynamically added IRQs, if number of
-	 * CPUs is more than or equal to allocated MSI-X, we need to skip the
-	 * first CPU sibling group since they are already affinitized to HWC IRQ
+	 * When calling mana_irq_setup_numa_aware() for dynamically added IRQs,
+	 * if number of CPUs is more than or equal to allocated MSI-X, we need to
+	 * skip the first CPU sibling group since they are already affinitized to
+	 * HWC IRQ
 	 */
 	cpus_read_lock();
-	if (gc->num_msix_usable <= num_online_cpus())
-		skip_first_cpu = true;
+	if (gc->num_msix_usable <= num_online_cpus()) {
+		err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node,
+						true);
+		if (err) {
+			cpus_read_unlock();
+			goto free_irq;
+		}
+	} else {
+		/*
+		 * When num_msix_usable are more than num_online_cpus, our
+		 * queue IRQs should be equal to num of online vCPUs.
+		 * We try to make sure queue IRQs spread across all vCPUs.
+		 * In such a case NUMA or CPU core affinity does not matter.
+		 * Note: in this case the total mana IRQ should always be
+		 * num_online_cpus + 1. The first HWC IRQ is already handled
+		 * in HWC setup calls
+		 * However, if CPUs went offline since num_msix_usable was
+		 * computed, queue IRQs will be more than num_online_cpus().
+		 * In such cases remaining extra IRQs will retain their default
+		 * affinity.
+		 */
+		int first_unassigned = num_online_cpus();
 
-	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
-	if (err) {
-		cpus_read_unlock();
-		goto free_irq;
+		if (nvec > first_unassigned) {
+			char buf[32];
+
+			if (first_unassigned == nvec - 1)
+				snprintf(buf, sizeof(buf), "%d",
+					 first_unassigned);
+			else
+				snprintf(buf, sizeof(buf), "%d-%d",
+					 first_unassigned, nvec - 1);
+
+			dev_dbg(&pdev->dev,
+				"MANA IRQ indices #%s will retain the default CPU affinity\n",
+				buf);
+		}
+
+		mana_irq_setup_linear(irqs, nvec);
 	}
 
 	cpus_read_unlock();
@@ -2041,7 +2091,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
 		nvec -= 1;
 	}
 
-	err = irq_setup(irqs, nvec, gc->numa_node, false);
+	err = mana_irq_setup_numa_aware(irqs, nvec, gc->numa_node, false);
 	if (err) {
 		cpus_read_unlock();
 		goto free_irq;

base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
-- 
2.34.1



* [syzbot] [net?] WARNING in qdisc_pkt_len_segs_init
From: syzbot @ 2026-06-24  7:20 UTC (permalink / raw)
  To: davem, edumazet, horms, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    9c87e61e3c57 Merge tag 'bpf-next-7.2' of git://git.kernel...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10d2901c580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9a9f723a32776544
dashboard link: https://syzkaller.appspot.com/bug?extid=d5d0d598a4cfdfafdc3b
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/5489ff1c0660/disk-9c87e61e.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/bcdabd5a8fea/vmlinux-9c87e61e.xz
kernel image: https://storage.googleapis.com/syzbot-assets/fa77e3b769c6/bzImage-9c87e61e.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+d5d0d598a4cfdfafdc3b@syzkaller.appspotmail.com

------------[ cut here ]------------
len > ((int)(~0U >> 1))
WARNING: ./include/linux/skbuff.h:2866 at pskb_may_pull_reason include/linux/skbuff.h:2866 [inline], CPU#0: syz.0.5520/24128
WARNING: ./include/linux/skbuff.h:2866 at pskb_may_pull include/linux/skbuff.h:2884 [inline], CPU#0: syz.0.5520/24128
WARNING: ./include/linux/skbuff.h:2866 at qdisc_pkt_len_segs_init+0x4b4/0xa30 net/core/dev.c:4138, CPU#0: syz.0.5520/24128
Modules linked in:
CPU: 0 UID: 0 PID: 24128 Comm: syz.0.5520 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:pskb_may_pull_reason include/linux/skbuff.h:2866 [inline]
RIP: 0010:pskb_may_pull include/linux/skbuff.h:2884 [inline]
RIP: 0010:qdisc_pkt_len_segs_init+0x4b4/0xa30 net/core/dev.c:4138
Code: 00 00 02 00 31 ff e8 1b 3d 49 f8 81 e3 00 00 02 00 0f 85 fd 00 00 00 e8 ca 38 49 f8 45 89 e7 e9 2c ff ff ff e8 bd 38 49 f8 90 <0f> 0b 90 e9 ee fd ff ff 44 89 e7 89 de e8 6a 3a 49 f8 41 39 dc 0f
RSP: 0018:ffffc90000007660 EFLAGS: 00010246
RAX: ffffffff897ccc53 RBX: 0000000000000003 RCX: ffff8880761b8000
RDX: 0000000000000100 RSI: 00000000fffffffa RDI: 0000000000000000
RBP: ffff888034a70b40 R08: ffff88807bd48067 R09: 1ffff1100f7a900c
R10: dffffc0000000000 R11: ffffed100f7a900d R12: 00000000fffffffa
R13: dffffc0000000000 R14: 1ffff1100694e183 R15: ffff888068a29b98
FS:  00007f26d46806c0(0000) GS:ffff888125272000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000010000 CR3: 00000000894be000 CR4: 00000000003526f0
DR0: 0000000000000006 DR1: 0000000000000000 DR2: 000000007fffdff5
DR3: 0000800000000005 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 sch_handle_ingress net/core/dev.c:4479 [inline]
 __netif_receive_skb_core+0x13aa/0x30b0 net/core/dev.c:6057
 __netif_receive_skb_list_core+0x24d/0x830 net/core/dev.c:6281
 __netif_receive_skb_list net/core/dev.c:6348 [inline]
 netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6439
 gro_normal_list include/net/gro.h:523 [inline]
 gro_flush_normal include/net/gro.h:531 [inline]
 napi_complete_done+0x299/0x730 net/core/dev.c:6807
 gro_cell_poll+0x5ab/0x5d0 net/core/gro_cells.c:74
 __napi_poll+0xaa/0x330 net/core/dev.c:7729
 napi_poll net/core/dev.c:7792 [inline]
 net_rx_action+0x61d/0xf50 net/core/dev.c:7949
 handle_softirqs+0x225/0x840 kernel/softirq.c:622
 do_softirq+0x76/0xd0 kernel/softirq.c:523
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
 local_bh_enable include/linux/bottom_half.h:33 [inline]
 tun_rx_batched+0x616/0x790 drivers/net/tun.c:-1
 tun_get_user+0x2b04/0x4350 drivers/net/tun.c:1986
 tun_chr_write_iter+0x113/0x200 drivers/net/tun.c:2032
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x612/0xba0 fs/read_write.c:687
 ksys_write+0x150/0x270 fs/read_write.c:739
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f26d379ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f26d4680028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f26d3a15fa0 RCX: 00007f26d379ce59
RDX: 000000000000fdef RSI: 00002000000002c0 RDI: 000000000000000a
RBP: 00007f26d3832d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f26d3a16038 R14: 00007f26d3a15fa0 R15: 00007f26d3b3fa48
 </TASK>
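
The first line of the warning is the condition that fired: len > ((int)(~0U >> 1)), which is a length check against INT_MAX. The register dump shows 0xfffffffa in both RSI and R12. Assuming that value is the len argument (the report does not say so), it is what a small negative result looks like once stored in an unsigned int. A minimal userspace sketch of the arithmetic, with a hypothetical helper name:

```c
/* Mirrors the condition printed in the warning: len > ((int)(~0U >> 1)).
 * The helper name is made up for illustration; it is not kernel code.
 */
static int len_exceeds_int_max(unsigned int len)
{
	/* (~0U >> 1) is INT_MAX where int is 32 bits wide; the comparison is
	 * done in unsigned arithmetic because len is unsigned.
	 */
	return len > ((int)(~0U >> 1));
}
```

Under this assumption, a caller that computes a length by subtraction and ends up 6 bytes short would pass 0xfffffffa and trip the check, while any ordinary packet length would not.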


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite the report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* Re: [PATCH net 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: kernel test robot @ 2026-06-24  7:15 UTC (permalink / raw)
  To: Jamal Hadi Salim, netdev
  Cc: oe-kbuild-all, davem, edumazet, kuba, pabeni, horms, victor,
	andrew+netdev, zdi-disclosures, stable, Jamal Hadi Salim
In-Reply-To: <20260623184247.508956-1-jhs@mojatatu.com>

Hi Jamal,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Jamal-Hadi-Salim/net-sched-sch_teql-Introduce-slaves_lock-to-avoid-race-condition-and-UAF/20260624-024432
base:   net/main
patch link:    https://lore.kernel.org/r/20260623184247.508956-1-jhs%40mojatatu.com
patch subject: [PATCH net 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
config: sparc-randconfig-r133-20260624 (https://download.01.org/0day-ci/archive/20260624/202606241501.XQBMu4b8-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 15.2.0
sparse: v0.6.5-rc1
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260624/202606241501.XQBMu4b8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606241501.XQBMu4b8-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> net/sched/sch_teql.c:106:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:106:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:106:25: sparse:    struct Qdisc *
   net/sched/sch_teql.c:217:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:217:17: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:217:17: sparse:    struct Qdisc *
   net/sched/sch_teql.c:220:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:220:17: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:220:17: sparse:    struct Qdisc *
   net/sched/sch_teql.c:300:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:300:17: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:300:17: sparse:    struct Qdisc *
   net/sched/sch_teql.c:359:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:359:23: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:359:23: sparse:    struct Qdisc *
   net/sched/sch_teql.c:333:41: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc *
   net/sched/sch_teql.c:333:41: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc *
   net/sched/sch_teql.c:333:41: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:333:41: sparse:    struct Qdisc *
   net/sched/sch_teql.c:349:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc *
   net/sched/sch_teql.c:349:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc *
   net/sched/sch_teql.c:349:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc [noderef] __rcu *
   net/sched/sch_teql.c:349:25: sparse:    struct Qdisc *

vim +106 net/sched/sch_teql.c

    89	
    90	static struct sk_buff *
    91	teql_dequeue(struct Qdisc *sch)
    92	{
    93		struct teql_sched_data *dat = qdisc_priv(sch);
    94		struct netdev_queue *dat_queue;
    95		struct sk_buff *skb;
    96		struct Qdisc *q;
    97	
    98		skb = __skb_dequeue(&dat->q);
    99		dat_queue = netdev_get_tx_queue(dat->m->dev, 0);
   100		q = rcu_dereference_bh(dat_queue->qdisc);
   101	
   102		if (skb == NULL) {
   103			struct net_device *m = qdisc_dev(q);
   104			if (m) {
   105				spin_lock_bh(&dat->m->slaves_lock);
 > 106				rcu_assign_pointer(dat->m->slaves, sch);
   107				spin_unlock_bh(&dat->m->slaves_lock);
   108				netif_wake_queue(m);
   109			}
   110		} else {
   111			qdisc_bstats_update(sch, skb);
   112		}
   113		WRITE_ONCE(sch->q.qlen, dat->q.qlen + READ_ONCE(q->q.qlen));
   114		return skb;
   115	}
   116	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* [syzbot ci] Re: nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
From: syzbot ci @ 2026-06-24  7:13 UTC (permalink / raw)
  To: davem, david, edumazet, horms, kuba, linux-kernel, netdev,
	oe-linux-nfc, pabeni, sam, stable
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260623222402.175798-1-sam@bynar.io>

syzbot ci has tested the following series

[v1] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()
https://lore.kernel.org/all/20260623222402.175798-1-sam@bynar.io
* [PATCH net] nfc: nci: fix uninit-value in nci_core_init_rsp_packet()

and found the following issue:
UBSAN: array-index-out-of-bounds in nci_init_complete_req

Full report is available here:
https://ci.syzbot.org/series/2a9a8657-37a3-4dce-8cb5-2035027791dd

***

UBSAN: array-index-out-of-bounds in nci_init_complete_req

tree:      linux-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base:      a986fde914d88af47eb78fd29c5d1af7952c3500
arch:      amd64
compiler:  Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
config:    https://ci.syzbot.org/builds/80f835c3-e998-47ff-aaa5-24c578af3b4e/config
syz repro: https://ci.syzbot.org/findings/65008893-2498-4786-b913-f2c474a7b34a/syz_repro

------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in net/nfc/nci/core.c:192:7
index 4 is out of range for type '__u8[4]' (aka 'unsigned char[4]')
CPU: 0 UID: 0 PID: 5905 Comm: syz.1.33 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
 __ubsan_handle_out_of_bounds+0xe8/0xf0 lib/ubsan.c:455
 nci_init_complete_req+0x255/0x460 net/nfc/nci/core.c:192
 __nci_request+0x7d/0x300 net/nfc/nci/core.c:108
 nci_open_device net/nfc/nci/core.c:529 [inline]
 nci_dev_up+0x8c3/0xdc0 net/nfc/nci/core.c:643
 nfc_dev_up+0x165/0x350 net/nfc/core.c:118
 nfc_genl_dev_up+0x89/0xe0 net/nfc/netlink.c:775
 genl_family_rcv_msg_doit+0x233/0x340 net/netlink/genetlink.c:1114
 genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
 genl_rcv_msg+0x614/0x7a0 net/netlink/genetlink.c:1209
 netlink_rcv_skb+0x226/0x4a0 net/netlink/af_netlink.c:2556
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x7bb/0x940 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1900
 sock_sendmsg_nosec net/socket.c:775 [inline]
 __sock_sendmsg net/socket.c:790 [inline]
 ____sys_sendmsg+0x9b9/0xa20 net/socket.c:2684
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2738
 __sys_sendmsg net/socket.c:2770 [inline]
 __do_sys_sendmsg net/socket.c:2775 [inline]
 __se_sys_sendmsg net/socket.c:2773 [inline]
 __x64_sys_sendmsg+0x1b1/0x290 net/socket.c:2773
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f55ead9ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f55ebcb9028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f55eb015fa0 RCX: 00007f55ead9ce59
RDX: 0000000004008054 RSI: 0000200000000200 RDI: 0000000000000005
RBP: 00007f55eae32e6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f55eb016038 R14: 00007f55eb015fa0 R15: 00007ffcba11c798
 </TASK>
---[ end trace ]---
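
The UBSAN message says index 4 was used on a '__u8[4]', that is, one element past the end of a four-entry array. A common way for this to happen is a loop bound taken from a device-reported count that is larger than the array. The sketch below shows that pattern with the count clamped before the loop. The struct, its field names and the function are hypothetical; this is not the nci_init_complete_req() code and not the fix under test.

```c
#include <stddef.h>

#define TOY_MAX_RF_INTERFACES 4 /* same size as the '__u8[4]' in the report */

/* Hypothetical device state; the count comes from an untrusted response. */
struct toy_ndev {
	unsigned char num_supported_rf_interfaces;
	unsigned char supported_rf_interfaces[TOY_MAX_RF_INTERFACES];
};

/* Count entries equal to `wanted`, never reading past the array. */
static size_t toy_count_interfaces(const struct toy_ndev *ndev,
				   unsigned char wanted)
{
	size_t n = ndev->num_supported_rf_interfaces;
	size_t i, hits = 0;

	/* Clamp the reported count to the storage that actually exists. */
	if (n > TOY_MAX_RF_INTERFACES)
		n = TOY_MAX_RF_INTERFACES;

	for (i = 0; i < n; i++) {
		if (ndev->supported_rf_interfaces[i] == wanted)
			hits++;
	}
	return hits;
}
```

Clamping at the point of use is one option; validating the count where the response is parsed is another, and keeps later readers from needing their own check.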


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-06-24  7:12 UTC (permalink / raw)
  To: avagin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, safinaskar, torvalds, val, viro, willy
In-Reply-To: <CANaxB-zK5q=Xw6UZTmeFtXsDZjUsPkFk=p485m-wtNTBnf4hgg@mail.gmail.com>

Andrei Vagin <avagin@gmail.com>:
> The CRIU fifo test fails with this change. The problem is that vmsplice
> with SPLICE_F_NONBLOCK to a fifo file descriptor fails with -EOPNOTSUPP.
> 
> It seems we need a fix like this one:
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 429b0714ec57..6fc49e933727 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -1253,6 +1253,7 @@ static int fifo_open(struct inode *inode, struct
> file *filp)
> 
>         /* We can only do regular read/write on fifos */
>         stream_open(inode, filp);
> +       filp->f_mode |= FMODE_NOWAIT;
> 
>         switch (filp->f_mode & (FMODE_READ | FMODE_WRITE)) {
>         case FMODE_READ:

Does CRIU actually rely on the ability to do a SPLICE_F_NONBLOCK vmsplice into
named fifos? Or is this merely a test?

If this is just a test, I think we need not preserve this behavior.

I ran a Debian Code Search with the regex "vmsplice.*SPLICE_F_NONBLOCK" and
found very few packages. It seems all of them use pipes, not named fifos.

(On speed: I still think that my vmsplice patches are a good thing,
despite the performance regressions in CRIU.)
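
For reference, the operation in question can be sketched in userspace as below. This is an illustration only, not the CRIU test: the function name and the path argument are made up, and the return value depends on the kernel it runs on (with the series applied and without FMODE_NOWAIT on fifos, the report above says the call fails with -EOPNOTSUPP).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

/* Create a named fifo at `path`, vmsplice a small buffer into it with
 * SPLICE_F_NONBLOCK, and return the byte count or a negative errno.
 */
static long fifo_vmsplice_nonblock(const char *path)
{
	static char msg[] = "hello";
	struct iovec iov = { .iov_base = msg, .iov_len = sizeof(msg) - 1 };
	long ret;
	int fd;

	unlink(path);
	if (mkfifo(path, 0600) < 0)
		return -errno;

	/* Opening a fifo O_RDWR is Linux-specific; it keeps open() from
	 * blocking while waiting for a peer.
	 */
	fd = open(path, O_RDWR);
	if (fd < 0) {
		ret = -errno;
		unlink(path);
		return ret;
	}

	ret = vmsplice(fd, &iov, 1, SPLICE_F_NONBLOCK);
	if (ret < 0)
		ret = -errno;

	close(fd);
	unlink(path);
	return ret;
}
```

Swapping the mkfifo()/open() pair for pipe() gives the anonymous-pipe variant that the packages found by the code search appear to use.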

-- 
Askar Safin


* Re: [PATCH v3] virtio_net: disable cb when NAPI is busy-polled
From: Michael S. Tsirkin @ 2026-06-24  7:08 UTC (permalink / raw)
  To: Longjun Tang
  Cc: xuanzhuo, jasowang, edumazet, virtualization, netdev, tanglongjun
In-Reply-To: <20260624070206.85467-1-lange_tang@163.com>

On Wed, Jun 24, 2026 at 03:02:06PM +0800, Longjun Tang wrote:
> From: Longjun Tang <tanglongjun@kylinos.cn>
> 
> When busy-poll is active, napi_schedule_prep() returns false in
> virtqueue_napi_schedule(), so virtqueue_disable_cb() is skipped.
> The device may keep firing irqs until reaches virtqueue_napi_complete().
> Under load (received == budget), it will lead to a large number
> of spurious interrupts.
> 
> Fix it by disabling the callback at the virtnet_poll() entry. This keeps
> the callback off while we poll and re-enable

and it is re-enabled

> by virtqueue_napi_complete()
> when going idle.
> 
> Fixes: ceef438d613f ("virtio_net: remove custom busy_poll")
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Longjun Tang <tanglongjun@kylinos.cn>
> 
> ---
> V1 -> V2: Remain agnostic to busy polling
> V2 -> V3: Add fixes tag
> ---
>  drivers/net/virtio_net.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..0a11f2b32500 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3008,6 +3008,11 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>  	unsigned int xdp_xmit = 0;
>  	bool napi_complete;
>  
> +	/* Keep callbacks suppressed for the duration of this poll,
> +	 * busy-poll need.

I don't know what "busy-poll need" means. Just drop this part?
In fact, the whole comment can go; we know virtqueue_disable_cb()
disables callbacks.

> +	 */
> +	virtqueue_disable_cb(rq->vq);
> +
>  	virtnet_poll_cleantx(rq, budget);
>  
>  	received = virtnet_receive(rq, budget, &xdp_xmit);
> -- 
> 2.43.0


