Netdev List
* [net-next PATCH 1/2] net: dsa: tag_rtl8_4: update format description
From: Luiz Angelo Daros de Luca @ 2026-04-08 20:31 UTC
  To: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, Luiz Angelo Daros de Luca,
	Alvin Šipraga, Linus Walleij
In-Reply-To: <20260408-realtek_fixes-v1-0-915ff1404d56@gmail.com>

From: Alvin Šipraga <alsi@bang-olufsen.dk>

Document the updated tag layout fields (EFID, VSEL/VIDX) and clarify
which bits are set/cleared when emitting tags.

Co-developed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
---
 net/dsa/tag_rtl8_4.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/dsa/tag_rtl8_4.c b/net/dsa/tag_rtl8_4.c
index 2464545da4d2..b7ed39c5419f 100644
--- a/net/dsa/tag_rtl8_4.c
+++ b/net/dsa/tag_rtl8_4.c
@@ -17,8 +17,8 @@
  *  |              (8-bit)              |              (8-bit)              |
  *  |          Protocol [0x04]          |              REASON               | b
  *  |-----------------------------------+-----------------------------------| y
- *  |   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    | t
- *  | FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     | e
+ *  |   (1)   |   (3)  |   (1)  |  (3)  | (1)  | (1)  |    (1)     |  (5)   | t
+ *  | EFID_EN |  EFID  | PRI_EN |  PRI  | KEEP | VSEL | LEARN_DIS  |  VIDX  | e
  *  |-----------------------------------+-----------------------------------| s
  *  |   (1)  |                       (15-bit)                               | |
  *  |  ALLOW |                        TX/RX                                 | v
@@ -32,19 +32,22 @@
  *     EtherType |         note that Realtek uses the same EtherType for
  *               |         other incompatible tag formats (e.g. tag_rtl4_a.c)
  *    Protocol   | 0x04: indicates that this tag conforms to this format
- *    X          | reserved
  *   ------------+-------------
  *    REASON     | reason for forwarding packet to CPU
  *               | 0: packet was forwarded or flooded to CPU
  *               | 80: packet was trapped to CPU
- *    FID_EN     | 1: packet has an FID
- *               | 0: no FID
- *    FID        | FID of packet (if FID_EN=1)
+ *    EFID_EN    | 1: packet has an EFID
+ *               | 0: no EFID
+ *    EFID       | Extended filter ID (EFID) of packet (if EFID_EN=1)
  *    PRI_EN     | 1: force priority of packet
  *               | 0: don't force priority
  *    PRI        | priority of packet (if PRI_EN=1)
  *    KEEP       | preserve packet VLAN tag format
+ *    VSEL       | 0: switch should classify packet according to VLAN tag
+ *               | 1: switch should classify packet according to VLAN membership
+ *               |    configuration with index VIDX
  *    LEARN_DIS  | don't learn the source MAC address of the packet
+ *    VIDX       | index of a VLAN membership configuration to use with VSEL
  *    ALLOW      | 1: treat TX/RX field as an allowance port mask, meaning the
  *               |    packet may only be forwarded to ports specified in the
  *               |    mask
@@ -111,7 +114,7 @@ static void rtl8_4_write_tag(struct sk_buff *skb, struct net_device *dev,
 	/* Set Protocol; zero REASON */
 	tag16[1] = htons(FIELD_PREP(RTL8_4_PROTOCOL, RTL8_4_PROTOCOL_RTL8365MB));
 
-	/* Zero FID_EN, FID, PRI_EN, PRI, KEEP; set LEARN_DIS */
+	/* Zero EFID_EN, EFID, PRI_EN, PRI, VSEL, VIDX, KEEP; set LEARN_DIS */
 	tag16[2] = htons(FIELD_PREP(RTL8_4_LEARN_DIS, 1));
 
 	/* Zero ALLOW; set RX (CPU->switch) forwarding port mask */

-- 
2.53.0



* [net-next PATCH 2/2] net: dsa: tag_rtl8_4: set KEEP flag
From: Luiz Angelo Daros de Luca @ 2026-04-08 20:31 UTC
  To: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, Luiz Angelo Daros de Luca,
	Alvin Šipraga, Linus Walleij
In-Reply-To: <20260408-realtek_fixes-v1-0-915ff1404d56@gmail.com>

KEEP=1 is needed because we should respect the format of the packet as
the kernel sends it to us. Unless tx forward offloading is used, the
kernel is giving us the packet exactly as it should leave the specified
port on the wire. Until now this was not needed because the ports were
always functioning in a standalone mode in a VLAN-unaware way, so the
switch would not tag or untag frames anyway. But arguably it should have
been KEEP=1 all along.

Co-developed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
---
 net/dsa/tag_rtl8_4.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/dsa/tag_rtl8_4.c b/net/dsa/tag_rtl8_4.c
index b7ed39c5419f..852c6b88079a 100644
--- a/net/dsa/tag_rtl8_4.c
+++ b/net/dsa/tag_rtl8_4.c
@@ -99,6 +99,7 @@
 #define   RTL8_4_REASON_TRAP		80
 
 #define RTL8_4_LEARN_DIS		BIT(5)
+#define RTL8_4_KEEP			BIT(7)
 
 #define RTL8_4_TX			GENMASK(3, 0)
 #define RTL8_4_RX			GENMASK(10, 0)
@@ -114,8 +115,9 @@ static void rtl8_4_write_tag(struct sk_buff *skb, struct net_device *dev,
 	/* Set Protocol; zero REASON */
 	tag16[1] = htons(FIELD_PREP(RTL8_4_PROTOCOL, RTL8_4_PROTOCOL_RTL8365MB));
 
-	/* Zero EFID_EN, EFID, PRI_EN, PRI, VSEL, VIDX, KEEP; set LEARN_DIS */
-	tag16[2] = htons(FIELD_PREP(RTL8_4_LEARN_DIS, 1));
+	/* Zero EFID_EN, EFID, PRI_EN, PRI, VSEL, VIDX; set KEEP, LEARN_DIS */
+	tag16[2] = htons(FIELD_PREP(RTL8_4_LEARN_DIS, 1) |
+			 FIELD_PREP(RTL8_4_KEEP, 1));
 
 	/* Zero ALLOW; set RX (CPU->switch) forwarding port mask */
 	tag16[3] = htons(FIELD_PREP(RTL8_4_RX, dsa_xmit_port_mask(skb, dev)));

-- 
2.53.0
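
For illustration, the value that ends up in the third 16-bit tag word after this change can be worked out in userspace C. This is a minimal sketch, not the kernel code: only the KEEP and LEARN_DIS positions come from the patches (BIT(7) and BIT(5)); the other TAG_* masks are inferred from the layout diagram in patch 1/2, and all TAG_* and tag_word2_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>	/* htons() */

/* Field masks for the third tag word in host order; bit 15 is the
 * leftmost field of the diagram. Assumption: everything except KEEP
 * and LEARN_DIS is read off the diagram, not taken from the driver.
 */
#define TAG_EFID_EN	(1u << 15)
#define TAG_EFID	(0x7u << 12)
#define TAG_PRI_EN	(1u << 11)
#define TAG_PRI		(0x7u << 8)
#define TAG_KEEP	(1u << 7)	/* RTL8_4_KEEP, BIT(7) */
#define TAG_VSEL	(1u << 6)
#define TAG_LEARN_DIS	(1u << 5)	/* RTL8_4_LEARN_DIS, BIT(5) */
#define TAG_VIDX	(0x1fu << 0)

/* The eight fields should account for every bit of the word. */
_Static_assert((TAG_EFID_EN | TAG_EFID | TAG_PRI_EN | TAG_PRI | TAG_KEEP |
		TAG_VSEL | TAG_LEARN_DIS | TAG_VIDX) == 0xffffu,
	       "fields must cover all 16 bits");

/* Host-order value with KEEP and LEARN_DIS set and all else zero. */
static uint16_t tag_word2_host(void)
{
	return (uint16_t)(TAG_KEEP | TAG_LEARN_DIS);
}

/* Store the word in network byte order, as htons() does for tag16[2]. */
static void tag_word2_wire(uint8_t out[2])
{
	uint16_t be = htons(tag_word2_host());

	assert(out != NULL);
	memcpy(out, &be, sizeof(be));
}
```

With both flags set the host-order value is 0x00a0, so the two bytes on the wire are 0x00 followed by 0xa0.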



* Re: [PATCH bpf v3 0/2] bpf: Fix SOCK_OPS_GET_SK same-register OOB read in sock_ops and add selftest
From: Martin KaFai Lau @ 2026-04-08 20:32 UTC
  To: Jiayuan Chen, Jakub Kicinski
  Cc: bpf, werner, Daniel Borkmann, John Fastabend, Stanislav Fomichev,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Shuah Khan, Sun Jian, linux-kernel, netdev, linux-kselftest
In-Reply-To: <20260407022720.162151-1-jiayuan.chen@linux.dev>

On Tue, Apr 07, 2026 at 10:26:26AM +0800, Jiayuan Chen wrote:
> When a BPF sock_ops program accesses ctx fields with dst_reg == src_reg,
> the SOCK_OPS_GET_SK() and SOCK_OPS_GET_FIELD() macros fail to zero the
> destination register in the !fullsock / !locked_tcp_sock path, leading to
> OOB read (GET_SK) and kernel pointer leak (GET_FIELD).

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>

Jakub, can you help to push it to the net tree? Thanks!


* Re: [PATCH net 2/2] net: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl
From: Joerg Reuter @ 2026-04-08 20:51 UTC
  To: Mashiro Chen
  Cc: netdev, andrew+netdev, davem, edumazet, kuba, pabeni, linux-hams,
	linux-kernel, stable
In-Reply-To: <20260408172358.281186-3-mashiro.chen@mailbox.org>

Hi,

Am Thu, Apr 09, 2026 at 01:23:58AM +0800 schrieb Mashiro Chen:

> If a privileged user (CAP_SYS_RAWIO) sets bufsize to 0, the receive
> interrupt handler later calls dev_alloc_skb(0) and immediately writes
> a KISS type byte via skb_put_u8() into a zero-capacity socket buffer,
> corrupting the adjacent skb_shared_info region.

Oops, that's unfortunate.

> The scc.c comment already states the buffer must not exceed 4096 bytes,
> but this limit is never enforced.

That was a limit 30 years ago when we couldn't have skbs larger than one
page.

I'm not sure if anyone is actually using AX.25 jumbograms with a Zilog SCC
controller; that doesn't make much sense to me. But maybe someone out there
is indeed running IP over huge AX.25 UI frames, so I'm not a fan of
enforcing an upper limit either. It's hamradio, you're supposed to tinker.

I'm okay with a minimum size of 16, of course.

73,
    Joerg

-- 
Joerg Reuter                                    http://yaina.de/jreuter
And I make my way to where the warm scent of soil fills the evening air. 
Everything is waiting quietly out there....                 (Anne Clark)
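
For illustration, a lower-bound-only check along the lines discussed in this reply could look like the sketch below. It is a userspace sketch under stated assumptions, not the driver's ioctl handler: the names scc_check_bufsize() and SCC_MIN_BUFSIZE are hypothetical, and the value 16 is taken from the remark above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical lower bound, from the "minimum size of 16" remark. */
#define SCC_MIN_BUFSIZE	16u

/* Return 0 if bufsize is acceptable, -EINVAL otherwise. No upper bound
 * is enforced, matching the preference stated in the reply.
 */
static int scc_check_bufsize(unsigned int bufsize)
{
	if (bufsize < SCC_MIN_BUFSIZE)
		return -EINVAL;
	return 0;
}

/* Boundary values: 0 and 15 are rejected, 16 and larger are accepted. */
static void scc_check_bufsize_selftest(void)
{
	assert(scc_check_bufsize(0) == -EINVAL);
	assert(scc_check_bufsize(SCC_MIN_BUFSIZE - 1) == -EINVAL);
	assert(scc_check_bufsize(SCC_MIN_BUFSIZE) == 0);
	assert(scc_check_bufsize(65536) == 0);
}
```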


* Re: [PATCH net 1/2] net: hamradio: bpqether: validate frame length in bpq_rcv()
From: Joerg Reuter @ 2026-04-08 21:05 UTC
  To: Mashiro Chen
  Cc: netdev, andrew+netdev, davem, edumazet, kuba, pabeni, linux-hams,
	linux-kernel, stable
In-Reply-To: <20260408172358.281186-2-mashiro.chen@mailbox.org>

Am Thu, Apr 09, 2026 at 01:23:57AM +0800 schrieb Mashiro Chen:
> The BPQ length field is decoded as:
> 
>   len = skb->data[0] + skb->data[1] * 256 - 5;
> 
> If the sender sets bytes [0..1] to values whose combined value is
> less than 5, len becomes negative.  Passing a negative int to
> skb_trim() silently converts to a huge unsigned value, causing the
> function to be a no-op.  The frame is then passed up to AX.25 with
> its original (untrimmed) payload, delivering garbage beyond the
> declared frame boundary.

I don't even know why there is a length field in the first place, and John
G8BPQ doesn't seem to remember either. 

Nothing is supposed to come after the payload, and there should be no
need to skb_trim() at all.

However, since an obviously wrong length field indicates that something is
indeed wrong with that frame, I'm in favor of dropping those frames.

Acked-by: Joerg Reuter <jreuter@yaina.de>

> Cc: stable@vger.kernel.org
> Cc: linux-hams@vger.kernel.org
> Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org>
> ---
>  drivers/net/hamradio/bpqether.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/hamradio/bpqether.c b/drivers/net/hamradio/bpqether.c
> index 045c5177262eaf..214fd1f819a1bb 100644
> --- a/drivers/net/hamradio/bpqether.c
> +++ b/drivers/net/hamradio/bpqether.c
> @@ -187,6 +187,9 @@ static int bpq_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_ty
>  
>  	len = skb->data[0] + skb->data[1] * 256 - 5;
>  
> +	if (len < 0 || len > skb->len - 2)
> +		goto drop_unlock;
> +
>  	skb_pull(skb, 2);	/* Remove the length bytes */
>  	skb_trim(skb, len);	/* Set the length of the data */
>  
> -- 
> 2.53.0
> 

-- 
Joerg Reuter                                    http://yaina.de/jreuter
And I make my way to where the warm scent of soil fills the evening air. 
Everything is waiting quietly out there....                 (Anne Clark)
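
For illustration, the length decoding and the bounds check from the quoted patch can be replayed on a plain byte buffer. This is a userspace sketch, not the driver code: bpq_declared_len() is a hypothetical name, the buffer stands in for skb->data, and frame_len stands in for skb->len before the two length bytes are pulled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the payload length declared by a BPQ frame, or -1 if the frame
 * should be dropped. frame points at the two length bytes; frame_len is
 * the number of bytes available, including those two bytes.
 */
static int bpq_declared_len(const uint8_t *frame, size_t frame_len)
{
	int len;

	assert(frame != NULL);

	if (frame_len < 2)
		return -1;	/* the length field itself is missing */

	/* Little-endian 16-bit value minus 5, as in bpq_rcv(). */
	len = frame[0] + frame[1] * 256 - 5;

	/* Mirrors "len < 0 || len > skb->len - 2" from the patch. */
	if (len < 0 || (size_t)len > frame_len - 2)
		return -1;

	return len;
}
```

A length field of 4 decodes to -1 and is rejected, while a field of 10 is accepted only if at least 5 payload bytes follow the two length bytes.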


* Re: [PATCH net-next v2 3/3] net: mdio: treat PSE EPROBE_DEFER as non-fatal during PHY registration
From: Carlo Szelinsky @ 2026-04-08 21:07 UTC
  To: andrew
  Cc: o.rempel, kory.maincent, andrew+netdev, hkallweit1, linux, kuba,
	davem, edumazet, pabeni, horms, netdev, linux-kernel
In-Reply-To: <8e12f0ac-be0d-4664-a533-df3bd1efb34a@lunn.ch>

So I went ahead and tested the phy_probe() approach on my setup (RTL930x
DSA switch with an I2C Hasivo HS104 PSE controller built as a module).

PoE itself works fine, but phydev->psec never gets set - ethtool just
says "No PSE is attached" on all ports.

Took me a while to figure out what's going on. The problem is how DSA
handles PHYs: when phy_probe() returns -EPROBE_DEFER because the PSE
controller hasn't probed yet, the PHY device is registered but sits
there unprobed. Then the DSA switch comes along, sets up its ports, and
phy_attach_direct() force-binds the generic PHY driver with
device_bind_driver(). Now the device already has a driver, so when the
deferred probe retry kicks in it just skips it. phy_probe() never runs
again and psec stays NULL.

What I'm seeing timing-wise:
  - MDIO scan registers PHYs, phy_probe() defers (no PSE yet)
  - DSA probes, phy_attach_direct() binds genphy
  - t=17s: HS104 finally probes
  - deferred retry: nope, driver already bound
  - t=35s: regulator_late_cleanup (caught by admin_state_synced)

Not sure what the best path forward is here. Should we look at fixing
phy_attach_direct() to handle this case, or go back to the non-fatal
EPROBE_DEFER approach from v2 for now?

Cheers,
Carlo
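
For illustration, the sequence described above can be replayed with a toy state machine. This is a userspace sketch under stated assumptions, not kernel code: the toy_* names are hypothetical stand-ins for phy_probe(), phy_attach_direct() and the deferred-probe retry, and the three booleans are a simplification of the real device state. EPROBE_DEFER uses the kernel's numeric value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EPROBE_DEFER	517	/* same value as in the kernel */

/* Toy device: just enough state to replay the reported sequence. */
struct toy_phy {
	bool pse_ready;		/* has the PSE controller probed yet? */
	bool driver_bound;	/* does the device already have a driver? */
	bool psec_set;		/* stands in for phydev->psec != NULL */
};

/* Stand-in for phy_probe() with the PSE lookup done inside it. */
static int toy_phy_probe(struct toy_phy *phy)
{
	if (!phy->pse_ready)
		return -EPROBE_DEFER;
	phy->psec_set = true;
	phy->driver_bound = true;
	return 0;
}

/* Stand-in for phy_attach_direct() force-binding the generic driver:
 * the device gets a driver, but the probe path is not run again.
 */
static void toy_attach_direct(struct toy_phy *phy)
{
	phy->driver_bound = true;
}

/* Stand-in for the deferred retry: skipped once a driver is bound. */
static int toy_deferred_retry(struct toy_phy *phy)
{
	if (phy->driver_bound)
		return 0;
	return toy_phy_probe(phy);
}

/* Replay the reported order of events; returns the final psec state. */
static bool toy_replay(void)
{
	struct toy_phy phy = { 0 };
	int ret;

	ret = toy_phy_probe(&phy);	/* MDIO scan: PSE not there yet */
	assert(ret == -EPROBE_DEFER);
	toy_attach_direct(&phy);	/* DSA port setup binds genphy */
	phy.pse_ready = true;		/* PSE controller probes later */
	toy_deferred_retry(&phy);	/* retry sees a bound driver */
	return phy.psec_set;
}
```

In this model toy_replay() ends with psec unset, which matches the "No PSE is attached" symptom; dropping the toy_attach_direct() step lets the retry succeed.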


* Re: [PATCH net] ice: stop DCBNL requests during driver unload
From: Tony Nguyen @ 2026-04-08 21:08 UTC
  To: Aleksandr Loktionov, intel-wired-lan; +Cc: netdev, Dave Ertman
In-Reply-To: <20260327072332.130320-8-aleksandr.loktionov@intel.com>



On 3/27/2026 12:23 AM, Aleksandr Loktionov wrote:
> From: Dave Ertman <david.m.ertman@intel.com>
> 
> With a chatty lldpad, DCB configuration requests can arrive through
> the DCBNL API while the driver is tearing down PF resources, leading
> to use-after-free and NULL dereference crashes.
> 
> Set ICE_SHUTTING_DOWN in pf->state at the start of ice_remove() and
> check this bit at the beginning of every DCBNL callback that accesses
> resources freed during the remove path.
> 
> Fixes: b94b013eb626 ("ice: Implement DCBNL support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> 
>   drivers/net/ethernet/intel/ice/ice.h        |  1 +
>   drivers/net/ethernet/intel/ice/ice_dcb_nl.c | 46 +++++++++++++++++++++
>   drivers/net/ethernet/intel/ice/ice_main.c   |  1 +
>   3 files changed, 48 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> index 2b2b22a..052c310 100644
> --- a/drivers/net/ethernet/intel/ice/ice.h
> +++ b/drivers/net/ethernet/intel/ice/ice.h
> @@ -283,6 +283,7 @@ enum ice_pf_state {
>   	ICE_EMPR_RECV,		/* set by OICR handler */
>   	ICE_SUSPENDED,		/* set on module remove path */
>   	ICE_RESET_FAILED,		/* set by reset/rebuild */
> +	ICE_SHUTTING_DOWN,		/* set on module remove path, before releasing resources */

...

> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -5424,6 +5424,7 @@ static void ice_remove(struct pci_dev *pdev)
>   	struct ice_pf *pf = pci_get_drvdata(pdev);
>   	int i;
>   
> +	set_bit(ICE_SHUTTING_DOWN, pf->state);

Nit, but can you name this ICE_REMOVING? Since this is in ice_remove()
rather than ice_shutdown(), it seems more appropriate. It also aligns
with the naming used in ixgbe[vf].

This will need updating as well:
https://lore.kernel.org/intel-wired-lan/20260403054029.3789616-4-aleksandr.loktionov@intel.com/

Also, if you have dependencies on other patches, like the latter one,
please wait until they are applied; otherwise this will not go through
CI properly.

Thanks,
Tony


>   	for (i = 0; i < ICE_MAX_RESET_WAIT; i++) {
>   		if (!ice_is_reset_in_progress(pf->state))
>   			break;
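
For illustration, the "set a flag first, check it in every callback" pattern from the quoted patch can be sketched in userspace C. This is a sketch under stated assumptions, not the ice driver: the toy_* names are hypothetical, a C11 atomic_bool stands in for the bit in pf->state, and -EBUSY is a placeholder since the quoted excerpt does not show what the callbacks return. The sketch is single-threaded and says nothing about callbacks that are already running when the flag is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pf {
	atomic_bool removing;	/* stands in for the new pf->state bit */
	int *dcb_cfg;		/* resource released on the remove path */
};

/* Stand-in for a DCBNL callback: bail out once removal has started,
 * before touching anything the remove path may have released.
 */
static int toy_dcbnl_get(struct toy_pf *pf, int *val)
{
	if (atomic_load(&pf->removing))
		return -EBUSY;	/* placeholder error code */

	assert(pf->dcb_cfg != NULL);
	*val = *pf->dcb_cfg;
	return 0;
}

/* Stand-in for the remove path: set the flag, then release. */
static void toy_remove(struct toy_pf *pf)
{
	atomic_store(&pf->removing, true);
	pf->dcb_cfg = NULL;
}
```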



* Re: [PATCH net] ice: fix netdev bring-up and bring-down in self-test
From: Tony Nguyen @ 2026-04-08 21:12 UTC
  To: Aleksandr Loktionov, intel-wired-lan, Alexander Lobakin
  Cc: netdev, Konstantin Ilichev, Grzegorz Nitka
In-Reply-To: <20260327072332.130320-7-aleksandr.loktionov@intel.com>



On 3/27/2026 12:23 AM, Aleksandr Loktionov wrote:
> From: Konstantin Ilichev <konstantin.ilichev@intel.com>
> 
> When an offline self-test is initiated with ethtool -t, any ongoing
> traffic could get stuck because ice_stop() and ice_open() are called
> without letting the OS know about state transitions.  In most cases
> a write() system call would block.
> 
> Fix this by calling dev_change_flags() to bring the netdev up and
> down, which ensures ndo_open()/ndo_stop() are called and all watchers
> are notified correctly.

+ Olek

AI review reports:

The ethtool core acquires the per-netdev mutex via netdev_lock_ops(dev) 
before invoking the driver's .self_test callback. dev_change_flags() is 
an exported API that explicitly re-acquires this exact same lock:
net/core/dev_api.c:dev_change_flags() {
	...
	netdev_lock_ops(dev);
	ret = netif_change_flags(dev, flags, extack);
	netdev_unlock_ops(dev);
	...
}
Because dev->lock is a standard, non-recursive mutex, this will result 
in a hard deadlock for any driver that opts into request_ops_lock. While 
ice might not currently set this flag, introducing nested lock 
acquisitions of the same mutex guarantees a deadlock as the subsystem 
migrates toward per-netdev locking.


With ice netdev lock changes in progress [1], this would soon become an 
issue.

Thanks,
Tony

[1] https://lore.kernel.org/netdev/20260325200644.2528726-4-anthony.l.nguyen@intel.com/

> Fixes: 0e674aeb0b77 ("ice: Add handler for ethtool selftest")
> Cc: stable@vger.kernel.org
> Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
> Signed-off-by: Konstantin Ilichev <konstantin.ilichev@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> 
>   drivers/net/ethernet/intel/ice/ice_ethtool.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> index 96d95af..2a4f06f 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> @@ -1416,7 +1416,7 @@ ice_self_test(struct net_device *netdev, struct ethtool_test *eth_test,
>   		/* If the device is online then take it offline */
>   		if (if_running)
>   			/* indicate we're in test mode */
> -			ice_stop(netdev);
> +			dev_change_flags(netdev, netdev->flags & ~IFF_UP, NULL);
>   
>   		data[ICE_ETH_TEST_LINK] = ice_link_test(netdev);
>   		data[ICE_ETH_TEST_EEPROM] = ice_eeprom_test(netdev);
> @@ -1434,10 +1434,12 @@ ice_self_test(struct net_device *netdev, struct ethtool_test *eth_test,
>   		clear_bit(ICE_TESTING, pf->state);
>   
>   		if (if_running) {
> -			int status = ice_open(netdev);
> +			int status = dev_change_flags(netdev,
> +						      netdev->flags | IFF_UP,
> +						      NULL);
>   
>   			if (status) {
> -				dev_err(dev, "Could not open device %s, err %d\n",
> +				dev_err(dev, "Could not bring up device %s, err %d\n",
>   					pf->int_name, status);
>   			}
>   		}
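
For illustration, the nested-acquisition problem raised in the review above can be reproduced in userspace with a POSIX error-checking mutex, which reports a relock by the owning thread as EDEADLK instead of hanging. This is a sketch under stated assumptions, not kernel code: the toy_* names are hypothetical, toy_dev_lock stands in for the per-netdev lock, and a kernel mutex would block rather than return an error.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Stand-in for the per-netdev lock. */
static pthread_mutex_t toy_dev_lock;

/* Make the mutex error-checking so a relock is reported, not fatal. */
static void toy_lock_init(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&toy_dev_lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

/* Stand-in for dev_change_flags(): takes the lock itself. */
static int toy_change_flags(void)
{
	int ret = pthread_mutex_lock(&toy_dev_lock);

	if (ret)
		return ret;	/* EDEADLK if the caller already owns it */
	pthread_mutex_unlock(&toy_dev_lock);
	return 0;
}

/* Stand-in for a .self_test callback invoked with the lock held. */
static int toy_self_test_locked(void)
{
	int ret;

	ret = pthread_mutex_lock(&toy_dev_lock);	/* taken by the core */
	assert(ret == 0);
	ret = toy_change_flags();			/* nested acquisition */
	pthread_mutex_unlock(&toy_dev_lock);
	return ret;
}
```

Called on its own, toy_change_flags() succeeds; called from toy_self_test_locked(), the inner lock attempt fails with EDEADLK. Depending on the C library, the program may need to be linked with -pthread.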



* Re: [PATCH net-next] iavf: fix kernel-doc comment style in ethtool ops
From: Tony Nguyen @ 2026-04-08 21:13 UTC
  To: Aleksandr Loktionov, intel-wired-lan; +Cc: netdev, Leszek Pepiak
In-Reply-To: <20260403054321.3791392-1-aleksandr.loktionov@intel.com>



On 4/2/2026 10:43 PM, Aleksandr Loktionov wrote:
> From: Leszek Pepiak <leszek.pepiak@intel.com>
> 
> iavf_get_channels() and iavf_set_channels() use the legacy `**/`
> comment terminator and embed the return description in the body text.
> Convert to proper kernel-doc style: single `*/` terminator and an
> explicit `Return:` section.

My inexact check of iavf shows this:

grep '\*\*/' drivers/net/ethernet/intel/iavf/* | wc -l
248

scripts/kernel-doc -none -Wreturn drivers/net/ethernet/intel/iavf/* 2>&1 | wc -l
336

Since this is not in the context of other changes and this resolves <1%
of the issues, this doesn't seem like much of a net gain. If we're
looking to resolve kernel-doc warnings, perhaps fix one file at a time?

Bonus points if we can remove boilerplate ones :)

Thanks,
Tony

> Signed-off-by: Leszek Pepiak <leszek.pepiak@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> 
> ---
>   drivers/net/ethernet/intel/iavf/iavf_ethtool.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
> index 8188dd4..425acbb 100644
> --- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
> +++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
> @@ -1846,13 +1846,13 @@ static int iavf_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd,
>   	return ret;
>   }
>   /**
> - * iavf_get_channels: get the number of channels supported by the device
> + * iavf_get_channels - get the number of channels supported by the device
>    * @netdev: network interface device structure
>    * @ch: channel information structure
>    *
>    * For the purposes of our device, we only use combined channels, i.e. a tx/rx
>    * queue pair. Report one extra channel to match our "other" MSI-X vector.
> - **/
> + */
>   static void iavf_get_channels(struct net_device *netdev,
>   			      struct ethtool_channels *ch)
>   {
> @@ -1873,14 +1873,15 @@ static void iavf_get_channels(struct net_device *netdev,
>   }
>   
>   /**
> - * iavf_set_channels: set the new channel count
> + * iavf_set_channels - set the new channel count
>    * @netdev: network interface device structure
>    * @ch: channel information structure
>    *
>    * Negotiate a new number of channels with the PF then do a reset.  During
> - * reset we'll realloc queues and fix the RSS table.  Returns 0 on success,
> - * negative on failure.
> - **/
> + * reset we'll realloc queues and fix the RSS table.
> + *
> + * Return: 0 on success, negative on failure.
> + */
>   static int iavf_set_channels(struct net_device *netdev,
>   			     struct ethtool_channels *ch)
>   {



* Re: [net,PATCH] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Marek Vasut @ 2026-04-08 21:21 UTC
  To: Nicolai Buchwitz
  Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ronald Wahl, Yicong Hui,
	linux-kernel
In-Reply-To: <215ad1cc5db3f352ac2a130c07dbd830@tipi-net.de>

On 4/8/26 9:15 PM, Nicolai Buchwitz wrote:
> On 8.4.2026 17:41, Marek Vasut wrote:
>> On 4/8/26 12:54 PM, Nicolai Buchwitz wrote:
>>
>> Hello Nicolai,
>>
>> thank you for testing on the SPI variant, that helped a lot.
>>
>>> In order to make this work I would propose something like this (which 
>>> works in my SPI setup):
>>>
>>> --- a/drivers/net/ethernet/micrel/ks8851_par.c
>>> +++ b/drivers/net/ethernet/micrel/ks8851_par.c
>>> @@ -60,12 +60,14 @@ static void ks8851_lock_par(struct ks8851_net *ks, unsigned long *flags)
>>>   {
>>>       struct ks8851_net_par *ksp = to_ks8851_par(ks);
>>>
>>> +    local_bh_disable();
>>>       spin_lock_irqsave(&ksp->lock, *flags);
>>>   }
>>>
>>>   static void ks8851_unlock_par(struct ks8851_net *ks, unsigned long *flags)
>>>   {
>>>       struct ks8851_net_par *ksp = to_ks8851_par(ks);
>>>
>>>       spin_unlock_irqrestore(&ksp->lock, *flags);
>>> +    local_bh_enable();
>>>   }
>>>
>>> Tested-by: Nicolai Buchwitz <nb@tipi-net.de>  # KS8851 SPI, non-RT 
>>> (regression + proposed fix)
>>
>> Are you also able to test the KS8851 driver with PREEMPT_RT enabled 
>> and heavy iperf3 traffic on the SPI variant ? Does that trigger any 
>> issues ? I ran 'iperf3 -s' on the KS8851 end and 'iperf3 -c 
>> 192.168.1.300 -t 0 --bidir' on the host PC side.
> 
> Successfully tested with both PREEMPT_RT and non-RT kernels using the 
> iperf3 command above - no issues observed. Both builds included the fix 
> from my previous message.
> If there is anything else worth testing on the KS8851 SPI variant, 
> please let me know.

Thank you for that. Could you please add the Tested-by (TB) to v2 too?


* [syzbot ci] Re: net/sched: no longer acquire RTNL in qdisc dumps
From: syzbot ci @ 2026-04-08 20:14 UTC
  To: davem, edumazet, eric.dumazet, horms, jhs, jiri, kuba, kuniyu,
	netdev, pabeni, sdf, toke
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260408125611.3592751-1-edumazet@google.com>

syzbot ci has tested the following series

[v1] net/sched: no longer acquire RTNL in qdisc dumps
https://lore.kernel.org/all/20260408125611.3592751-1-edumazet@google.com
* [PATCH net-next 01/15] net/sched: rename qstats_overlimit_inc() to qstats_cpu_overlimit_inc()
* [PATCH net-next 02/15] net/sched: add qstats_cpu_drop_inc() helper
* [PATCH net-next 03/15] net/sched: add READ_ONCE() in gnet_stats_add_queue[_cpu]
* [PATCH net-next 04/15] net/sched: add qdisc_qlen_inc() and qdisc_qlen_dec()
* [PATCH net-next 05/15] net/sched: annotate data-races around sch->qstats.backlog
* [PATCH net-next 06/15] net/sched: sch_sfb: annotate data-races in sfb_dump_stats()
* [PATCH net-next 07/15] net/sched: sch_red: annotate data-races in red_dump_stats()
* [PATCH net-next 08/15] net/sched: sch_fq_codel: remove data-races from fq_codel_dump_stats()
* [PATCH net-next 09/15] net/sched: sch_pie: annotate data-races in pie_dump_stats()
* [PATCH net-next 10/15] net/sched: sch_fq_pie: annotate data-races in fq_pie_dump_stats()
* [PATCH net-next 11/15] net_sched: sch_hhf: annotate data-races in hhf_dump_stats()
* [PATCH net-next 12/15] net/sched: sch_choke: annotate data-races in choke_dump_stats()
* [PATCH net-next 13/15] net/sched: sch_cake: annotate data-races in cake_dump_stats()
* [PATCH net-next 14/15] net/sched: mq: no longer acquire qdisc spinlocks in dump operations
* [PATCH net-next 15/15] net/sched: convert tc_dump_qdisc() to RCU

and found the following issues:
* WARNING: suspicious RCU usage in mq_dump_common
* WARNING: suspicious RCU usage in mqprio_dump
* WARNING: suspicious RCU usage in tc_fill_qdisc

Full report is available here:
https://ci.syzbot.org/series/a6ab0157-80eb-4d29-ab75-31a471a9070e

***

WARNING: suspicious RCU usage in mq_dump_common

tree:      net-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:      b3e69fc3196fc421e26196e7792f17b0463edc6f
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/f4a44029-347c-4dad-b84f-81c322454de4/config
syz repro: https://ci.syzbot.org/findings/d5d8c727-6baf-4738-bc33-e8a42f539e21/syz_repro

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/sched/sch_mq.c:158 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.1.18/6007:
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x722/0xbe0 net/core/rtnetlink.c:6986

stack backtrace:
CPU: 1 UID: 0 PID: 6007 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 mq_dump_common+0x2fa/0x5e0 net/sched/sch_mq.c:158
 mq_dump+0x7e/0x150 net/sched/sch_mq.c:181
 tc_fill_qdisc+0x663/0x11c0 net/sched/sch_api.c:937
 qdisc_notify+0x1cf/0x440 net/sched/sch_api.c:1033
 notify_and_destroy net/sched/sch_api.c:1058 [inline]
 qdisc_graft+0x114a/0x15b0 net/sched/sch_api.c:1158
 __tc_modify_qdisc net/sched/sch_api.c:1760 [inline]
 tc_modify_qdisc+0x18a4/0x2290 net/sched/sch_api.c:1816
 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6989
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:721 [inline]
 __sock_sendmsg net/socket.c:736 [inline]
 ____sys_sendmsg+0x972/0x9f0 net/socket.c:2585
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2639
 __sys_sendmsg net/socket.c:2671 [inline]
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2674
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1d3839c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f1d392c4028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1d38615fa0 RCX: 00007f1d3839c819
RDX: 0000000000044080 RSI: 0000200000000040 RDI: 0000000000000003
RBP: 00007f1d38432c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1d38616038 R14: 00007f1d38615fa0 R15: 00007fff29960058
 </TASK>

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/sched/sch_api.c:943 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.1.18/6007:
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x722/0xbe0 net/core/rtnetlink.c:6986

stack backtrace:
CPU: 0 UID: 0 PID: 6007 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 tc_fill_qdisc+0xd90/0x11c0 net/sched/sch_api.c:943
 qdisc_notify+0x1cf/0x440 net/sched/sch_api.c:1033
 notify_and_destroy net/sched/sch_api.c:1058 [inline]
 qdisc_graft+0x114a/0x15b0 net/sched/sch_api.c:1158
 __tc_modify_qdisc net/sched/sch_api.c:1760 [inline]
 tc_modify_qdisc+0x18a4/0x2290 net/sched/sch_api.c:1816
 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6989
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:721 [inline]
 __sock_sendmsg net/socket.c:736 [inline]
 ____sys_sendmsg+0x972/0x9f0 net/socket.c:2585
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2639
 __sys_sendmsg net/socket.c:2671 [inline]
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2674
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1d3839c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f1d392c4028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f1d38615fa0 RCX: 00007f1d3839c819
RDX: 0000000000044080 RSI: 0000200000000040 RDI: 0000000000000003
RBP: 00007f1d38432c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f1d38616038 R14: 00007f1d38615fa0 R15: 00007fff29960058
 </TASK>


***

WARNING: suspicious RCU usage in mqprio_dump

tree:      net-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:      b3e69fc3196fc421e26196e7792f17b0463edc6f
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/f4a44029-347c-4dad-b84f-81c322454de4/config
syz repro: https://ci.syzbot.org/findings/159b3573-da89-4289-89cf-85c39c62db59/syz_repro

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/sched/sch_mqprio.c:570 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.1.18/5958:
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x722/0xbe0 net/core/rtnetlink.c:6986

stack backtrace:
CPU: 1 UID: 0 PID: 5958 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 mqprio_dump+0x3db/0x1370 net/sched/sch_mqprio.c:570
 tc_fill_qdisc+0x663/0x11c0 net/sched/sch_api.c:937
 qdisc_notify+0x28c/0x440 net/sched/sch_api.c:1038
 notify_and_destroy net/sched/sch_api.c:1058 [inline]
 qdisc_graft+0x114a/0x15b0 net/sched/sch_api.c:1158
 __tc_modify_qdisc net/sched/sch_api.c:1760 [inline]
 tc_modify_qdisc+0x18a4/0x2290 net/sched/sch_api.c:1816
 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6989
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:721 [inline]
 __sock_sendmsg net/socket.c:736 [inline]
 ____sys_sendmsg+0x972/0x9f0 net/socket.c:2585
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2639
 __sys_sendmsg net/socket.c:2671 [inline]
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2674
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff019b9c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff01ab3e028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007ff019e15fa0 RCX: 00007ff019b9c819
RDX: 0000000020000000 RSI: 0000200000000200 RDI: 0000000000000005
RBP: 00007ff019c32c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ff019e16038 R14: 00007ff019e15fa0 R15: 00007ffc53020ce8
 </TASK>

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/sched/sch_api.c:943 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.1.18/5958:
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x722/0xbe0 net/core/rtnetlink.c:6986

stack backtrace:
CPU: 1 UID: 0 PID: 5958 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 tc_fill_qdisc+0xd90/0x11c0 net/sched/sch_api.c:943
 qdisc_notify+0x28c/0x440 net/sched/sch_api.c:1038
 notify_and_destroy net/sched/sch_api.c:1058 [inline]
 qdisc_graft+0x114a/0x15b0 net/sched/sch_api.c:1158
 __tc_modify_qdisc net/sched/sch_api.c:1760 [inline]
 tc_modify_qdisc+0x18a4/0x2290 net/sched/sch_api.c:1816
 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6989
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:721 [inline]
 __sock_sendmsg net/socket.c:736 [inline]
 ____sys_sendmsg+0x972/0x9f0 net/socket.c:2585
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2639
 __sys_sendmsg net/socket.c:2671 [inline]
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2674
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff019b9c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff01ab3e028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007ff019e15fa0 RCX: 00007ff019b9c819
RDX: 0000000020000000 RSI: 0000200000000200 RDI: 0000000000000005
RBP: 00007ff019c32c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ff019e16038 R14: 00007ff019e15fa0 R15: 00007ffc53020ce8
 </TASK>


***

WARNING: suspicious RCU usage in tc_fill_qdisc

tree:      net-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:      b3e69fc3196fc421e26196e7792f17b0463edc6f
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/f4a44029-347c-4dad-b84f-81c322454de4/config
syz repro: https://ci.syzbot.org/findings/0dada8ec-a4b6-42a1-8516-d70ce8ccccc7/syz_repro

=============================
WARNING: suspicious RCU usage
syzkaller #0 Not tainted
-----------------------------
net/sched/sch_api.c:943 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by syz.0.17/5963:
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock net/core/rtnetlink.c:80 [inline]
 #0: ffffffff8fbca4c8 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x722/0xbe0 net/core/rtnetlink.c:6986

stack backtrace:
CPU: 0 UID: 0 PID: 5963 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 lockdep_rcu_suspicious+0x13f/0x1d0 kernel/locking/lockdep.c:6876
 tc_fill_qdisc+0xd90/0x11c0 net/sched/sch_api.c:943
 qdisc_notify+0x1cf/0x440 net/sched/sch_api.c:1033
 notify_and_destroy net/sched/sch_api.c:1058 [inline]
 qdisc_graft+0x114a/0x15b0 net/sched/sch_api.c:1158
 __tc_modify_qdisc net/sched/sch_api.c:1760 [inline]
 tc_modify_qdisc+0x18a4/0x2290 net/sched/sch_api.c:1816
 rtnetlink_rcv_msg+0x77e/0xbe0 net/core/rtnetlink.c:6989
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2550
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x80f/0x9b0 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:721 [inline]
 __sock_sendmsg net/socket.c:736 [inline]
 ____sys_sendmsg+0x972/0x9f0 net/socket.c:2585
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2639
 __sys_sendmsg net/socket.c:2671 [inline]
 __do_sys_sendmsg net/socket.c:2676 [inline]
 __se_sys_sendmsg net/socket.c:2674 [inline]
 __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2674
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f9e5e39c819
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9e5f2d6028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f9e5e615fa0 RCX: 00007f9e5e39c819
RDX: 0000000000000000 RSI: 00002000000007c0 RDI: 0000000000000003
RBP: 00007f9e5e432c91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f9e5e616038 R14: 00007f9e5e615fa0 R15: 00007ffc6a25e4e8
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.


* [PATCH][next] netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings
From: Gustavo A. R. Silva @ 2026-04-08 21:27 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netfilter-devel, coreteam, netdev, linux-kernel,
	Gustavo A. R. Silva, linux-hardening

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it globally.

struct compat_xt_standard_target and struct compat_xt_error_target are
only used in xt_compat_check_entry_offsets(). Remove these structs and
instead define the same memory layout on the stack via flexible struct
compat_xt_entry_target and DEFINE_RAW_FLEX(). Adjust the rest of the
code accordingly.
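
For readers unfamiliar with on-stack flexible structures, here is a
minimal user-space sketch of the idea. It is not the kernel's
DEFINE_RAW_FLEX(); struct demo_target, DEMO_FLEX_SIZE() and
DEMO_RAW_FLEX() are hypothetical names made up for illustration. The
sketch reserves header-plus-payload bytes in a union and accesses them
through a pointer to the flexible struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a header that ends in a flexible array. */
struct demo_target {
	unsigned short target_size;
	char name[30];
	unsigned char data[];	/* flexible array member */
};

/* Bytes needed for the header plus `count` trailing payload bytes. */
#define DEMO_FLEX_SIZE(count) \
	(offsetof(struct demo_target, data) + (count))

/* Reserve that many bytes on the stack and expose them through a
 * pointer to the flexible struct (illustrative only).
 */
#define DEMO_RAW_FLEX(name, count)				\
	union {							\
		unsigned char bytes[DEMO_FLEX_SIZE(count)];	\
		struct demo_target obj;				\
	} name##_u;						\
	struct demo_target *name = (struct demo_target *)&name##_u

/* Store an int-sized verdict in the payload and read it back. */
static int demo_roundtrip(int verdict)
{
	DEMO_RAW_FLEX(st, sizeof(int));
	int out;

	/* The reserved storage covers the header and the payload. */
	assert(sizeof(st_u.bytes) == DEMO_FLEX_SIZE(sizeof(int)));

	memcpy(st->data, &verdict, sizeof(verdict));
	memcpy(&out, st->data, sizeof(out));
	return out;
}
```

The point of the layout is that the flexible array stays the last
member of its own struct, so no enclosing struct has to place another
member after it.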

With these changes, fix the following warnings:

1 net/netfilter/x_tables.c:816:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
1 net/netfilter/x_tables.c:811:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
---
 net/netfilter/x_tables.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index b39017c80548..a58107038a24 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -817,17 +817,6 @@ int xt_compat_match_to_user(const struct xt_entry_match *m,
 }
 EXPORT_SYMBOL_GPL(xt_compat_match_to_user);
 
-/* non-compat version may have padding after verdict */
-struct compat_xt_standard_target {
-	struct compat_xt_entry_target t;
-	compat_uint_t verdict;
-};
-
-struct compat_xt_error_target {
-	struct compat_xt_entry_target t;
-	char errorname[XT_FUNCTION_MAXNAMELEN];
-};
-
 int xt_compat_check_entry_offsets(const void *base, const char *elems,
 				  unsigned int target_offset,
 				  unsigned int next_offset)
@@ -850,18 +839,25 @@ int xt_compat_check_entry_offsets(const void *base, const char *elems,
 		return -EINVAL;
 
 	if (strcmp(t->u.user.name, XT_STANDARD_TARGET) == 0) {
-		const struct compat_xt_standard_target *st = (const void *)t;
+		DEFINE_RAW_FLEX(const struct compat_xt_entry_target, st, data,
+				sizeof(compat_uint_t));
+		compat_uint_t *verdict = (compat_uint_t *)st->data;
 
-		if (COMPAT_XT_ALIGN(target_offset + sizeof(*st)) != next_offset)
+		st = (const void *)t;
+
+		if (COMPAT_XT_ALIGN(target_offset + __struct_size(st)) !=
+				next_offset)
 			return -EINVAL;
 
-		if (!verdict_ok(st->verdict))
+		if (!verdict_ok(*verdict))
 			return -EINVAL;
 	} else if (strcmp(t->u.user.name, XT_ERROR_TARGET) == 0) {
-		const struct compat_xt_error_target *et = (const void *)t;
+		DEFINE_RAW_FLEX(const struct compat_xt_entry_target, et, data,
+				XT_FUNCTION_MAXNAMELEN);
+		et = (const void *)t;
 
-		if (!error_tg_ok(t->u.target_size, sizeof(*et),
-				 et->errorname, sizeof(et->errorname)))
+		if (!error_tg_ok(t->u.target_size, __struct_size(et),
+				 et->data, __member_size(et->data)))
 			return -EINVAL;
 	}
 
-- 
2.43.0



* Re: [PATCH net] net: ax25: fix integer overflow in ax25_rx_fragment()
From: Joerg Reuter @ 2026-04-08 21:31 UTC (permalink / raw)
  To: Mashiro Chen
  Cc: netdev, davem, edumazet, kuba, pabeni, horms, linux-hams,
	linux-kernel, stable
In-Reply-To: <20260408172521.281365-1-mashiro.chen@mailbox.org>

On Thu, Apr 09, 2026 at 01:25:21AM +0800, Mashiro Chen wrote:
> An attacker on an AX.25 link that supports multi-fragment I-frames
> (AX25_SEG_FIRST / AX25_SEG_REM mechanism) can trigger this by
> sending enough continuation fragments to wrap the 16-bit counter.
> With AX.25 segment numbers limited to 6 bits (max 63 continuation
> fragments), a fragment payload of ~1040 bytes per fragment is
> sufficient to overflow.

Even worse, it's 7 bits: https://www.ax25.net/AX25.2.2-Jul%2098-2.pdf
Figure 6.2 "Segment Header Format". Sigh.
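
The arithmetic behind both figures can be sketched in user-space C.
This is only an illustration: the demo_* helpers are hypothetical
names, not code from net/ax25, and the fragment counts are taken at
face value as "n equally sized fragments". Under that assumption, 63
fragments need a bit over 1040 bytes each to exceed a 16-bit counter,
while 127 fragments need only about 517.

```c
#include <assert.h>
#include <limits.h>

/* Smallest per-fragment payload for which nfrags equally sized
 * fragments add up to more than USHRT_MAX bytes.
 */
static unsigned int demo_min_payload_to_wrap(unsigned int nfrags)
{
	assert(nfrags > 0);
	return USHRT_MAX / nfrags + 1;
}

/* Same comparison as the guard in the quoted patch: widen to
 * unsigned int first so the sum cannot wrap before it is checked.
 */
static int demo_would_overflow(unsigned short fraglen, unsigned int len)
{
	return (unsigned int)fraglen + len > USHRT_MAX;
}

/* What a bare 16-bit accumulation does instead: silently wrap. */
static unsigned short demo_unchecked_add(unsigned short fraglen,
					 unsigned int len)
{
	return (unsigned short)(fraglen + len);
}
```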

Thanks,
     Joerg

Acked-by: Joerg Reuter <jreuter@yaina.de>
> Cc: stable@vger.kernel.org
> Cc: linux-hams@vger.kernel.org
> Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org>
> ---
>  net/ax25/ax25_in.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/ax25/ax25_in.c b/net/ax25/ax25_in.c
> index d75b3e9ed93de8..68202c19b19e3f 100644
> --- a/net/ax25/ax25_in.c
> +++ b/net/ax25/ax25_in.c
> @@ -41,6 +41,11 @@ static int ax25_rx_fragment(ax25_cb *ax25, struct sk_buff *skb)
>  				/* Enqueue fragment */
>  				ax25->fragno = *skb->data & AX25_SEG_REM;
>  				skb_pull(skb, 1);	/* skip fragno */
> +				if ((unsigned int)ax25->fraglen + skb->len > USHRT_MAX) {
> +					skb_queue_purge(&ax25->frag_queue);
> +					ax25->fragno = 0;
> +					return 1;
> +				}
>  				ax25->fraglen += skb->len;
>  				skb_queue_tail(&ax25->frag_queue, skb);
>  
> -- 
> 2.53.0
> 

-- 
Joerg Reuter                                    http://yaina.de/jreuter
And I make my way to where the warm scent of soil fills the evening air. 
Everything is waiting quietly out there....                 (Anne Clark)


* Re: [PATCH net-next v5 00/10] Decouple receive and transmit enablement in team driver
From: Marc Harvey @ 2026-04-08 21:34 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Shuah Khan, Simon Horman, netdev, linux-kernel,
	linux-kselftest
In-Reply-To: <iuhx4kkkftjx52y5qo7w7p6rbxqakq2eu74r66xaepd2p2khjh@kdhljd2c2bpm>

On Wed, Apr 8, 2026 at 2:00 AM Jiri Pirko <jiri@resnulli.us> wrote:
> Wed, Apr 08, 2026 at 02:12:35AM +0200, marcharvey@google.com wrote:
> >On Tue, Apr 7, 2026 at 4:55 AM Jiri Pirko <jiri@resnulli.us> wrote:
> >>
> >> Looks fine to me now. Do you have libteam/teamd counterpart?
> >
> >I don't see a need for this to be used in any of the teamd runners.
>
> Why do you need this then?

Initially, we plan to use a non-teamd userspace component for teaming
control due to several non-standard requirements, such as
synchronization with unrelated software. It is probably worth
converting the teamd lacp runner to independent control at some point,
because according to the spec: "It is recommended that the independent
control state diagram be implemented in preference to the coupled
control state diagram."


* [PATCH net-next 0/7] tcp: restrict rcv_wnd and window_clamp to representable window
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz

Hi,

this series ensures that rcv_wnd and window_clamp do not exceed the
maximum window size representable for the connection's window scale
factor.

This is most visible when TCP window scaling is not used for a
connection. In that case, the advertised window is limited to 65535
bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
large receive buffers are used. The resulting mismatch breaks
calculations that depend on the advertised window, such as the ACK
decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.

Similar effects may also occur when window scaling is in use, e.g. if
the application dynamically adjusts SO_RCVBUF in unusual ways or when
the rmem sysctl parameters change during a connection’s lifetime.
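
As a rough illustration of "representable window", the limit is the
16-bit window field shifted by the receive window scale. The sketch
below is stand-alone user-space C with hypothetical demo_* names; it
is not the kernel code touched by this series.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_U16_MAX	65535u
#define DEMO_MAX_WSCALE	14u	/* largest shift TCP window scaling allows */

/* Largest window that fits the 16-bit header field after scaling. */
static uint32_t demo_max_representable_window(unsigned int rcv_wscale)
{
	assert(rcv_wscale <= DEMO_MAX_WSCALE);
	return DEMO_U16_MAX << rcv_wscale;
}

/* Limit an internal window value to what can be advertised. */
static uint32_t demo_limit_window(uint32_t wnd, unsigned int rcv_wscale)
{
	uint32_t max = demo_max_representable_window(rcv_wscale);

	return wnd < max ? wnd : max;
}
```

With a 256000 byte window and no scaling (shift 0) the result is
65535; with a shift of 7 the same value passes through unchanged.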

Summary:

- Patch 1 keeps rcv_wnd capped by the (window scale-limited)
  window_clamp at connection start.
- Patches 3 and 6 ensure that window_clamp is limited to the
  representable window when it is updated.
- The other patches add packetdrill tests to verify the new behavior.

A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
loopback) shows a noticeable improvement with window scaling disabled:

Fixed receive buffer:

- sysctl net.ipv4.tcp_window_scaling=0
- Server: iperf -l 256K -w 256K -s
- Client: iperf -l 256K -w 256K -c 127.0.0.1 -t 30
- net-next: ~47 Gbit/sec
- with this series: ~62 Gbit/sec

Receive buffer autotuning (net.ipv4.tcp_rmem = 4096 131072 7813888):

- sysctl net.ipv4.tcp_window_scaling=0
- Server: iperf -s
- Client: iperf -c 127.0.0.1 -t 30
- net-next: ~48 Gbit/sec
- with this series: ~60 Gbit/sec

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
Simon Baatz (7):
      tcp: keep rcv_wnd/rcv_ssthresh clamped by window_clamp if no scaling in use
      selftests/net: packetdrill: verify non-scaled rcv_wnd initialization
      tcp: Ensure window_clamp is limited to representable window
      selftests/net: packetdrill: add tcp_rcv_wnd_snd_ack_no_scaling.pkt
      selftests/net: packetdrill: add TCP_WINDOW_CLAMP test
      tcp: use tcp_set_window_clamp() for SO_RCVLOWAT
      selftests/net: packetdrill: add test for SO_RCVLOWAT window clamp

 net/ipv4/tcp.c                                     |  6 +++-
 net/ipv4/tcp_input.c                               | 13 ++++++--
 net/ipv4/tcp_minisocks.c                           |  8 +++--
 net/ipv4/tcp_output.c                              |  5 +--
 .../net/packetdrill/tcp_rcv_sockopt_lowat.pkt      | 24 ++++++++++++++
 .../net/packetdrill/tcp_rcv_sockopt_wnd_clamp.pkt  | 28 ++++++++++++++++
 .../packetdrill/tcp_rcv_wnd_active_no_scaling.pkt  | 27 ++++++++++++++++
 .../tcp_rcv_wnd_active_peer_no_scaling.pkt         | 26 +++++++++++++++
 .../packetdrill/tcp_rcv_wnd_passive_no_scaling.pkt | 30 ++++++++++++++++++
 .../tcp_rcv_wnd_passive_peer_no_scaling.pkt        | 29 +++++++++++++++++
 .../packetdrill/tcp_rcv_wnd_snd_ack_no_scaling.pkt | 37 ++++++++++++++++++++++
 11 files changed, 225 insertions(+), 8 deletions(-)
---
base-commit: b3e69fc3196fc421e26196e7792f17b0463edc6f
change-id: 20260402-tcp_rcv_exact_clamp_and_wnd-427d853e7491

Best regards,
-- 
Simon Baatz <gmbnomis@gmail.com>




* [PATCH net-next 1/7] tcp: keep rcv_wnd/rcv_ssthresh clamped by window_clamp if no scaling in use
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

When window scaling is not used, we already clamp window_clamp to
65535, but rcv_wnd/rcv_ssthresh can still remain larger than this.

Fix this by capping rcv_wnd to 65535 in the non-scaling paths of both
active and passive opens. Since the advertised window in SYN and
SYN/ACK segments is unscaled, this cannot shrink the window
advertised to the peer.

Also ensure that tcp_select_initial_window() always keeps rcv_wnd less
than or equal to the (scale-limited) window_clamp.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 net/ipv4/tcp_input.c     | 7 ++++++-
 net/ipv4/tcp_minisocks.c | 8 ++++++--
 net/ipv4/tcp_output.c    | 5 +++--
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7171442c3ed7a..505884dcb7a2b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6898,7 +6898,6 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 */
 		WRITE_ONCE(tp->rcv_nxt, TCP_SKB_CB(skb)->seq + 1);
 		tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
-		tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
 
 		/* RFC1323: The window in SYN & SYN/ACK segments is
 		 * never scaled.
@@ -6909,7 +6908,13 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 			tp->rx_opt.snd_wscale = tp->rx_opt.rcv_wscale = 0;
 			WRITE_ONCE(tp->window_clamp,
 				   min(tp->window_clamp, 65535U));
+			tp->rcv_ssthresh = min(tp->rcv_ssthresh, 65535U);
+			/* As the window in the SYN was not scaled,
+			 * we did not advertise more than 65535.
+			 */
+			tp->rcv_wnd = min(tp->rcv_wnd, 65535U);
 		}
+		tp->rcv_mwnd_seq = tp->rcv_wup + tp->rcv_wnd;
 
 		if (tp->rx_opt.saw_tstamp) {
 			tp->rx_opt.tstamp_ok	   = 1;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 199f0b579e89c..6496fe3b9e139 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -600,9 +600,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
 	newtp->rx_opt.sack_ok = ireq->sack_ok;
 	newtp->window_clamp = req->rsk_window_clamp;
-	newtp->rcv_ssthresh = req->rsk_rcv_wnd;
 	newtp->rcv_wnd = req->rsk_rcv_wnd;
-	newtp->rcv_mwnd_seq = newtp->rcv_wup + req->rsk_rcv_wnd;
 	newtp->rx_opt.wscale_ok = ireq->wscale_ok;
 	if (newtp->rx_opt.wscale_ok) {
 		newtp->rx_opt.snd_wscale = ireq->snd_wscale;
@@ -610,7 +608,13 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	} else {
 		newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0;
 		newtp->window_clamp = min(newtp->window_clamp, 65535U);
+		/* As the window in the SYN/ACK was not scaled,
+		 * we did not advertise more than 65535.
+		 */
+		newtp->rcv_wnd = min(newtp->rcv_wnd, 65535U);
 	}
+	newtp->rcv_ssthresh = newtp->rcv_wnd;
+	newtp->rcv_mwnd_seq = newtp->rcv_wup + newtp->rcv_wnd;
 	newtp->snd_wnd = ntohs(tcp_hdr(skb)->window) << newtp->rx_opt.snd_wscale;
 	newtp->max_window = newtp->snd_wnd;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8e99687526a64..bf7a12872acb3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -269,8 +269,9 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 				      0, TCP_MAX_WSCALE);
 	}
 	/* Set the clamp no higher than max representable value */
-	WRITE_ONCE(*__window_clamp,
-		   min_t(__u32, U16_MAX << (*rcv_wscale), window_clamp));
+	window_clamp = min_t(u32, U16_MAX << (*rcv_wscale), window_clamp);
+	WRITE_ONCE(*__window_clamp, window_clamp);
+	*rcv_wnd = min(*rcv_wnd, window_clamp);
 }
 
 /* Chose a new window to advertise, update state in tcp_sock for the

-- 
2.53.0




* [PATCH net-next 2/7] selftests/net: packetdrill: verify non-scaled rcv_wnd initialization
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

If we or our peer do not support window scaling in an active or
passive open, we must restrict rcv_wnd to 65535.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 .../packetdrill/tcp_rcv_wnd_active_no_scaling.pkt  | 27 +++++++++++++++++++
 .../tcp_rcv_wnd_active_peer_no_scaling.pkt         | 26 +++++++++++++++++++
 .../packetdrill/tcp_rcv_wnd_passive_no_scaling.pkt | 30 ++++++++++++++++++++++
 .../tcp_rcv_wnd_passive_peer_no_scaling.pkt        | 29 +++++++++++++++++++++
 4 files changed, 112 insertions(+)

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_active_no_scaling.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_active_no_scaling.pkt
new file mode 100644
index 0000000000000..d39f834299fec
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_active_no_scaling.pkt
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// If we do not support window scaling in an active open, we must
+// restrict rcv_wnd to 65535.
+
+`./defaults.sh
+sysctl -q net.ipv4.tcp_window_scaling=0`
+
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [256000], 4) = 0
+
+   +0...0 connect(4, ..., ...) = 0
+   +0 > S 0:0(0) win 65535 <mss 1460,sackOK,TS val 0 ecr 0>
+   +0 < S. 0:0(0) ack 1 win 65535
+   +0 > . 1:1(0) ack 1
+
+   +0 send(4, ..., 1000, 0) = 1000
+   +0 > P. 1:1001(1000) ack 1
+// Beyond window ACK is dropped, causing an immediate ACK
+   +0 < . 65537:65537(0) ack 1001 win 65535
+   +0 > . 1001:1001(0) ack 1
+
+  +.1 %{ assert tcpi_bytes_acked == 1, tcpi_bytes_acked; }%
+
+// In window ACK is accepted
+   +0 < . 65536:65536(0) ack 1001 win 65535
+  +.1 %{ assert tcpi_bytes_acked == 1001, tcpi_bytes_acked; }%
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_active_peer_no_scaling.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_active_peer_no_scaling.pkt
new file mode 100644
index 0000000000000..a212945996e85
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_active_peer_no_scaling.pkt
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// If the peer does not support window scaling in an active open, we must
+// restrict rcv_wnd to 65535.
+
+`./defaults.sh`
+
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [256000], 4) = 0
+
+   +0...0 connect(4, ..., ...) = 0
+   +0 > S 0:0(0) win 65535 <mss 1460,sackOK,TS val 0 ecr 0,nop,wscale 2>
+   +0 < S. 0:0(0) ack 1 win 65535
+   +0 > . 1:1(0) ack 1
+
+   +0 send(4, ..., 1000, 0) = 1000
+   +0 > P. 1:1001(1000) ack 1
+// Beyond window ACK is dropped, causing an immediate ACK
+   +0 < . 65537:65537(0) ack 1001 win 65535
+   +0 > . 1001:1001(0) ack 1
+
+  +.1 %{ assert tcpi_bytes_acked == 1, tcpi_bytes_acked; }%
+
+// In window ACK is accepted
+   +0 < . 65536:65536(0) ack 1001 win 65535
+  +.1 %{ assert tcpi_bytes_acked == 1001, tcpi_bytes_acked; }%
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_passive_no_scaling.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_passive_no_scaling.pkt
new file mode 100644
index 0000000000000..907d452afc311
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_passive_no_scaling.pkt
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// If we do not support window scaling in a passive open, we must
+// restrict rcv_wnd to 65535.
+
+`./defaults.sh
+sysctl -q net.ipv4.tcp_window_scaling=0`
+
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [256000], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460>
+   +0 < . 1:1(0) ack 1 win 32792
+
+   +0 accept(3, ..., ...) = 4
+   +0 send(4, ..., 1000, 0) = 1000
+   +0 > P. 1:1001(1000) ack 1
+// Beyond window ACK is dropped, causing an immediate ACK
+   +0 < . 65537:65537(0) ack 1001 win 32792
+   +0 > . 1001:1001(0) ack 1
+
+  +.1 %{ assert tcpi_bytes_acked == 0, tcpi_bytes_acked; }%
+
+// In window ACK is accepted
+   +0 < . 65536:65536(0) ack 1001 win 32792
+  +.1 %{ assert tcpi_bytes_acked == 1000, tcpi_bytes_acked; }%
diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_passive_peer_no_scaling.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_passive_peer_no_scaling.pkt
new file mode 100644
index 0000000000000..df88dd9b6f83e
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_passive_peer_no_scaling.pkt
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// If the peer does not support window scaling in a passive open, we must
+// restrict rcv_wnd to 65535.
+
+`./defaults.sh`
+
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [256000], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792 <mss 1000>
+   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460>
+   +0 < . 1:1(0) ack 1 win 32792
+
+   +0 accept(3, ..., ...) = 4
+   +0 send(4, ..., 1000, 0) = 1000
+   +0 > P. 1:1001(1000) ack 1
+// Beyond window ACK is dropped, causing an immediate ACK
+   +0 < . 65537:65537(0) ack 1001 win 32792
+   +0 > . 1001:1001(0) ack 1
+
+  +.1 %{ assert tcpi_bytes_acked == 0, tcpi_bytes_acked; }%
+
+// In window ACK is accepted
+   +0 < . 65536:65536(0) ack 1001 win 32792
+  +.1 %{ assert tcpi_bytes_acked == 1000, tcpi_bytes_acked; }%

-- 
2.53.0




* [PATCH net-next 3/7] tcp: Ensure window_clamp is limited to representable window
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

On connection initiation, window_clamp is limited to the maximum value
representable for the connection's window scale factor.

However, window_clamp may be changed later when:
- it needs to be adjusted due to scaling_ratio changes
- the receive buffer grows due to autotuning
- the TCP_WINDOW_CLAMP socket option is set

In all cases, window_clamp must not end up higher than the maximum
representable advertised window.

Thus, if the TCP connection state indicates that we can rely on
rx_opt.rcv_wscale, clamp the new window_clamp to the maximum window
for that scaling factor (including the "no window scaling" case where
rcv_wscale is zero).
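
A simplified user-space model of that state-dependent clamp is shown
below. The demo_* names and the enum are hypothetical stand-ins, and
the lower bound that tcp_set_window_clamp() also applies is left out
to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical connection states for the sketch. */
enum demo_state {
	DEMO_ESTABLISHED = 1,
	DEMO_SYN_SENT,
	DEMO_SYN_RECV,
	DEMO_FIN_WAIT1,
	DEMO_FIN_WAIT2,
	DEMO_TIME_WAIT,
	DEMO_CLOSE,
	DEMO_CLOSE_WAIT,
	DEMO_LAST_ACK,
	DEMO_LISTEN,
};

#define DEMO_F(state)	(1u << (state))

/* States in which the window scale has already been negotiated. */
#define DEMO_WSCALE_KNOWN					\
	(DEMO_F(DEMO_ESTABLISHED) | DEMO_F(DEMO_CLOSE_WAIT) |	\
	 DEMO_F(DEMO_FIN_WAIT1) | DEMO_F(DEMO_FIN_WAIT2))

/* Return the clamp that would be stored for a requested value. */
static uint32_t demo_set_window_clamp(enum demo_state state,
				      unsigned int rcv_wscale,
				      uint32_t val)
{
	uint32_t max;

	assert(rcv_wscale <= 14);
	max = 65535u << rcv_wscale;

	/* Only limit the value once rcv_wscale can be relied upon. */
	if ((DEMO_F(state) & DEMO_WSCALE_KNOWN) && val > max)
		return max;
	return val;
}
```

In this model a listening socket keeps a requested clamp of 200000,
while an established connection without window scaling ends up with
65535.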

This has visible consequences for calculations based on rcv_wnd. For
example, the logic in __tcp_ack_snd_check() uses the advance of the
right edge of the receive window to determine when to send an
immediate ACK. If rcv_wnd does not properly reflect the "on the wire"
advertised window (i.e. it is much higher than the maximum value
representable), this calculation will be wrong and ACKs may be delayed
when they should be sent immediately.

One concrete example is when the TCP receive buffer is much larger
than 64KB, but no window scaling is used. If window_clamp (and thus
rcv_wnd) is not limited to 65535, the "internal" window based on
rcv_wnd can extend far beyond the 16-bit window actually advertised on
the wire.

After receiving a data segment, the right edge of the "on the wire"
window can be moved (as there is plenty of space in rcv_wnd) and an
immediate ACK should be sent. But the stack won't send one if the
calculation based on rcv_wnd does not happen to move the right edge of
the "internal" window.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 net/ipv4/tcp.c       | 4 ++++
 net/ipv4/tcp_input.c | 6 ++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e57eaffc007a0..bd03c99f793ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3802,6 +3802,10 @@ int tcp_set_window_clamp(struct sock *sk, int val)
 	old_window_clamp = tp->window_clamp;
 	new_window_clamp = max_t(int, SOCK_MIN_RCVBUF / 2, val);
 
+	if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT |
+			TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2))
+		new_window_clamp = min_t(u32, U16_MAX << tp->rx_opt.rcv_wscale, new_window_clamp);
+
 	if (new_window_clamp == old_window_clamp)
 		return 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 505884dcb7a2b..6e9123c98152f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -914,6 +914,7 @@ void tcp_rcvbuf_grow(struct sock *sk, u32 newval)
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 rcvwin, rcvbuf, cap, oldval;
 	u32 rtt_threshold, rtt_us;
+	u32 window_clamp;
 	u64 grow;
 
 	oldval = tp->rcvq_space.space;
@@ -949,8 +950,9 @@ void tcp_rcvbuf_grow(struct sock *sk, u32 newval)
 	if (rcvbuf > sk->sk_rcvbuf) {
 		WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
 		/* Make the window clamp follow along.  */
-		WRITE_ONCE(tp->window_clamp,
-			   tcp_win_from_space(sk, rcvbuf));
+		window_clamp = tcp_win_from_space(sk, rcvbuf);
+		window_clamp = min_t(u32, U16_MAX << tp->rx_opt.rcv_wscale, window_clamp);
+		WRITE_ONCE(tp->window_clamp, window_clamp);
 	}
 }
 /*

-- 
2.53.0




* [PATCH net-next 5/7] selftests/net: packetdrill: add TCP_WINDOW_CLAMP test
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

Add a packetdrill test to verify that the socket option
TCP_WINDOW_CLAMP can be set to a large value on a listening socket,
but is clamped on an established socket to the maximum representable
advertised window.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 .../net/packetdrill/tcp_rcv_sockopt_wnd_clamp.pkt  | 28 ++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_sockopt_wnd_clamp.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_sockopt_wnd_clamp.pkt
new file mode 100644
index 0000000000000..d7203f2893c3f
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_sockopt_wnd_clamp.pkt
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Verify that TCP_WINDOW_CLAMP can be set to a high value on a LISTEN socket, but,
+// in an established connection, the value is clamped to the maximum representable
+// advertised window.
+--mss=1000
+
+`./defaults.sh`
+
+// Initialize connection
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [256000], 4) = 0
+   +0 setsockopt(3, IPPROTO_TCP, 10, [255999], 4) = 0  // TCP_WINDOW_CLAMP == 10
+   +0 getsockopt(3, IPPROTO_TCP, 10, [255999], [4]) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792
+   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460>
+   +0 < . 1:1(0) ack 1 win 32792
+
+   +0 accept(3, ..., ...) = 4
+
+   +0 getsockopt(4, IPPROTO_TCP, 10, [65535], [4]) = 0
+   +0 setsockopt(4, IPPROTO_TCP, 10, [255999], 4) = 0
+   +0 getsockopt(4, IPPROTO_TCP, 10, [65535], [4]) = 0
+   +0 getsockopt(3, IPPROTO_TCP, 10, [255999], [4]) = 0

-- 
2.53.0




* [PATCH net-next 4/7] selftests/net: packetdrill: add tcp_rcv_wnd_snd_ack_no_scaling.pkt
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

Verify that, when no TCP window scaling is used, each packet that
substantially advances the right edge of the receive window is ACKed
immediately.

Multiple packets are used so that the scaling_ratio receive window
adaptation can settle and does not by itself cause immediate ACKs,
avoiding false positives.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 .../packetdrill/tcp_rcv_wnd_snd_ack_no_scaling.pkt | 37 ++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_snd_ack_no_scaling.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_snd_ack_no_scaling.pkt
new file mode 100644
index 0000000000000..41561b026da85
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_snd_ack_no_scaling.pkt
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Every packet must be ACKed immediately if the right edge of receive window
+// advances substantially. This test verifies that behavior when the connection
+// does not use window scaling.
+--mss=1000
+
+`./defaults.sh`
+
+// Initialize connection
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [256000], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792
+   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460>
+   +0 < . 1:1(0) ack 1 win 32792
+
+   +0 accept(3, ..., ...) = 4
+
+   +0 < P. 1:65001(65000) ack 1 win 32792
+   +0 > . 1:1(0) ack 65001 win 65535
+   +0 < P. 65001:130001(65000) ack 1 win 32792
+   +0 > . 1:1(0) ack 130001 win 65535
+   +0 < P. 130001:195001(65000) ack 1 win 32792
+   +0 > . 1:1(0) ack 195001 win 65535
+   +0 < P. 195001:260001(65000) ack 1 win 32792
+   +0 > . 1:1(0) ack 260001 win 65535
+   +0 < P. 260001:325001(65000) ack 1 win 32792
+   +0 > . 1:1(0) ack 325001 win 65535
+
+// reading all data does not open the window further -> no ACK
+   +0 read(4, ..., 325000) = 325000
+ +0.2 < P. 325001:390001(65000) ack 1 win 32792
+   +0 > . 1:1(0) ack 390001 win 65535

-- 
2.53.0




* [PATCH net-next 6/7] tcp: use tcp_set_window_clamp() for SO_RCVLOWAT
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

Setting the SO_RCVLOWAT socket option may raise the receive window
clamp. Currently this is done by assigning to window_clamp directly.

Use the tcp_set_window_clamp() helper instead, so that raising the
clamp is subject to the same constraints and rcv_ssthresh adjustments
as TCP_WINDOW_CLAMP.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bd03c99f793ae..567027bc86b3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1853,7 +1853,7 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 		WRITE_ONCE(sk->sk_rcvbuf, space);
 
 		if (tp->window_clamp && tp->window_clamp < val)
-			WRITE_ONCE(tp->window_clamp, val);
+			tcp_set_window_clamp(sk, val);
 	}
 	return 0;
 }

-- 
2.53.0




* [PATCH net-next 7/7] selftests/net: packetdrill: add test for SO_RCVLOWAT window clamp
From: Simon Baatz via B4 Relay @ 2026-04-08 21:50 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan
  Cc: netdev, linux-kernel, linux-kselftest, Simon Baatz
In-Reply-To: <20260408-tcp_rcv_exact_clamp_and_wnd-v1-0-76a6f212e153@gmail.com>

From: Simon Baatz <gmbnomis@gmail.com>

Add a packetdrill test to verify that setting SO_RCVLOWAT does not
raise window_clamp beyond the maximum value allowed by window
scaling.

Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
 .../net/packetdrill/tcp_rcv_sockopt_lowat.pkt      | 24 ++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_sockopt_lowat.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_sockopt_lowat.pkt
new file mode 100644
index 0000000000000..c024f3953f5a4
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_sockopt_lowat.pkt
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Verify that setting SO_RCVLOWAT does not set window_clamp higher than the
+// maximum value allowed by window scaling.
+--mss=1000
+
+`./defaults.sh`
+
+// Initialize connection
+    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792
+   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460>
+   +0 < . 1:1(0) ack 1 win 32792
+
+   +0 accept(3, ..., ...) = 4
+
+   +0 getsockopt(4, IPPROTO_TCP, 10, [65535], [4]) = 0 // TCP_WINDOW_CLAMP == 10
+   +0 setsockopt(4, SOL_SOCKET, SO_RCVLOWAT, [1024000], 4) = 0
+   +0 getsockopt(4, SOL_SOCKET, SO_RCVLOWAT, [1024000], [4]) = 0
+   +0 getsockopt(4, IPPROTO_TCP, 10, [65535], [4]) = 0

-- 
2.53.0




* Re: [PATCH net-next v2 3/3] net: mdio: treat PSE EPROBE_DEFER as non-fatal during PHY registration
From: Andrew Lunn @ 2026-04-08 21:56 UTC (permalink / raw)
  To: Carlo Szelinsky
  Cc: o.rempel, kory.maincent, andrew+netdev, hkallweit1, linux, kuba,
	davem, edumazet, pabeni, horms, netdev, linux-kernel
In-Reply-To: <20260408210711.439068-1-github@szelinsky.de>

On Wed, Apr 08, 2026 at 11:07:11PM +0200, Carlo Szelinsky wrote:
> So I went ahead and tested the phy_probe() approach on my setup (RTL930x
> DSA switch with an I2C Hasivo HS104 PSE controller as module).
> 
> PoE itself works fine, but phydev->psec never gets set - ethtool just
> says "No PSE is attached" on all ports.
> 
> Took me a while to figure out what's going on. The problem is how DSA
> handles PHYs: when phy_probe() returns -EPROBE_DEFER because the PSE
> controller hasn't probed yet, the PHY device is registered but sits
> there unprobed. Then the DSA switch comes along, sets up its ports, and
> phy_attach_direct() force-binds the generic PHY driver with
> device_bind_driver().

Yes, this is a known issue with phylib.

> What I'm seeing timing-wise:
>   - MDIO scan registers PHYs, phy_probe() defers (no PSE yet)
>   - DSA probes, phy_attach_direct() binds genphy
>   - t=17s: HS104 finally probes

That is a long time. Does it actually start probing much earlier, but
is busy downloading firmware, so the probe only completes after 17
seconds?

>   - deferred retry: nope, driver already bound
>   - t=35s: regulator_late_cleanup (caught by admin_state_synced)
> 
> Not sure what the best path forward is here. Should we look at fixing
> phy_attach_direct() to handle this case.

It is not easy to fix, because generally drivers call
phy_attach_direct() in their open() function, not probe().  It is too
late to return EPROBE_DEFER, you can only do that in probe. phylib
knows the device exists, but it sees there is no driver, so it does
not have much choice. It can either use genphy, or it can error out
of phy_attach_direct().

DSA is however atypical, and does phylink_connect() early. So there
might be a way out. In dsa_user_phy_connect() once you have the
phydev, you could look at phydev->drv and return EPROBE_DEFER if it
is NULL. Ugly. And a bit of a layering violation. Maybe a helper in
phylib, phy_is_driver_bound()?
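
Very roughly, and only as a userspace sketch with mock types (none of
the mock_* names exist in the kernel, and a real helper would need to
live in phylib and respect its locking), the idea could look like
this. The only kernel fact assumed is that EPROBE_DEFER is 517.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517	/* kernel-internal errno, absent from userspace headers */
#endif

/* Stand-ins for the kernel structures; only the field used here is modeled. */
struct mock_phy_driver {
	const char *name;
};

struct mock_phy_device {
	const struct mock_phy_driver *drv;	/* NULL until a driver has bound */
};

/* Hypothetical helper along the lines suggested above. */
static bool mock_phy_is_driver_bound(const struct mock_phy_device *phydev)
{
	return phydev && phydev->drv != NULL;
}

/* Hypothetical connect path: defer instead of falling back to genphy. */
static int mock_user_phy_connect(const struct mock_phy_device *phydev)
{
	if (!phydev)
		return -ENODEV;
	if (!mock_phy_is_driver_bound(phydev))
		return -EPROBE_DEFER;
	return 0;
}
```

This only works where the connect happens in a probe path, since
that is the only place a deferral can be returned.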

   Andrew


* Re: [PATCH net-next v11 03/14] net: Add lease info to queue-get response
From: Jakub Kicinski @ 2026-04-08 22:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6
In-Reply-To: <b9a046f8-cb02-4d54-9a7e-e8213339d720@iogearbox.net>

On Wed, 8 Apr 2026 11:09:34 +0200 Daniel Borkmann wrote:
> >> +void netif_put_rx_queue_lease_locked(struct net_device *orig_dev,
> >> +				     struct net_device *dev)
> >> +{
> >> +	if (orig_dev != dev)
> >> +		netdev_unlock(dev);
> >> +}  
> > 
> > Pretty sure I already complained about these ugly helpers.
> > I'll try to find the time tomorrow to come up with something better.   
> 
> Ok, sounds good. Happy to adapt if you find something better and then I'll
> work this into the series, and also integrate the things mentioned in my
> cover letter reply (netkit nl dump + additional tests).

Hi! How would you feel about something like the following on top?

--->8----------

net: remove the netif_get_rx_queue_lease_locked() helpers

The netif_get_rx_queue_lease_locked() API hides the locking
and the descent onto the leased queue, making the code
harder to follow (at least to me). Remove the API and open
code the descent a bit. Most of the code now looks like:

 if (!leased)
     return __helper(x);

 hw_rxq = ..
 netdev_lock(hw_rxq->dev);
 ret = __helper(x);
 netdev_unlock(hw_rxq->dev);

 return ret;

Of course if we have more code paths that need the wrapping
we may need to revisit. For now, IMHO, having to know what
netif_get_rx_queue_lease_locked() does is not worth the 20LoC
it saves.
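
For illustration only, the same shape as a self-contained userspace
sketch. All mock_* types and functions are hypothetical stand-ins for
the kernel structures and for netdev_lock()/netdev_unlock(); the mock
lock merely records that it was taken.

```c
#include <stddef.h>

struct mock_dev {
	int locked;		/* 1 while the mock instance lock is held */
	int lock_acquisitions;	/* number of times the lock was taken */
};

struct mock_rxq {
	struct mock_dev *dev;
	struct mock_rxq *lease;	/* physical queue backing this one, or NULL */
	int value;
};

static void mock_lock(struct mock_dev *dev)
{
	dev->locked = 1;
	dev->lock_acquisitions++;
}

static void mock_unlock(struct mock_dev *dev)
{
	dev->locked = 0;
}

/* The unlocked worker, operating on whichever queue it is handed. */
static int mock_helper_unlocked(struct mock_rxq *rxq)
{
	return rxq->value;
}

/* Open-coded descent: no lease means call the worker directly,
 * otherwise lock the physical device around the call.
 */
static int mock_query(struct mock_rxq *rxq)
{
	struct mock_rxq *hw_rxq = rxq->lease;
	int ret;

	if (!hw_rxq)
		return mock_helper_unlocked(rxq);

	mock_lock(hw_rxq->dev);
	ret = mock_helper_unlocked(hw_rxq);
	mock_unlock(hw_rxq->dev);

	return ret;
}
```

The lock and unlock calls sit next to each other in the caller, so
the reader does not need to know what a get/put helper pair does.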

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/netdev_rx_queue.h |  5 ---
 net/core/dev.h                |  1 +
 net/core/netdev-genl.c        | 59 +++++++++++++++++----------
 net/core/netdev_queues.c      | 14 ++++---
 net/core/netdev_rx_queue.c    | 48 +++++++---------------
 net/xdp/xsk.c                 | 77 ++++++++++++++++++++++-------------
 6 files changed, 111 insertions(+), 93 deletions(-)

diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 7e98c679ea84..9415a94d333d 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -76,11 +76,6 @@ struct netdev_rx_queue *
 __netif_get_rx_queue_lease(struct net_device **dev, unsigned int *rxq,
 			   enum netif_lease_dir dir);
 
-struct netdev_rx_queue *
-netif_get_rx_queue_lease_locked(struct net_device **dev, unsigned int *rxq);
-void netif_put_rx_queue_lease_locked(struct net_device *orig_dev,
-				     struct net_device *dev);
-
 int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
 void netdev_rx_queue_lease(struct netdev_rx_queue *rxq_dst,
 			   struct netdev_rx_queue *rxq_src);
diff --git a/net/core/dev.h b/net/core/dev.h
index 95edb2d4eff8..376bac4a82da 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -101,6 +101,7 @@ int netdev_queue_config_validate(struct net_device *dev, int rxq_idx,
 
 bool netif_rxq_has_mp(struct net_device *dev, unsigned int rxq_idx);
 bool netif_rxq_is_leased(struct net_device *dev, unsigned int rxq_idx);
+bool netif_is_queue_leasee(const struct net_device *dev);
 
 void __netif_mp_uninstall_rxq(struct netdev_rx_queue *rxq,
 			      const struct pp_memory_provider_params *p);
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 056460d01940..b8f6076d8007 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -395,8 +395,7 @@ netdev_nl_queue_fill_lease(struct sk_buff *rsp, struct net_device *netdev,
 	struct netdev_rx_queue *rxq;
 	struct net *net, *peer_net;
 
-	rxq = __netif_get_rx_queue_lease(&netdev, &q_idx,
-					 NETIF_PHYS_TO_VIRT);
+	rxq = __netif_get_rx_queue_lease(&netdev, &q_idx, NETIF_PHYS_TO_VIRT);
 	if (!rxq || orig_netdev == netdev)
 		return 0;
 
@@ -436,13 +435,45 @@ netdev_nl_queue_fill_lease(struct sk_buff *rsp, struct net_device *netdev,
 	return -ENOMEM;
 }
 
+static int
+__netdev_nl_queue_fill_mp(struct sk_buff *rsp, struct netdev_rx_queue *rxq)
+{
+	struct pp_memory_provider_params *params = &rxq->mp_params;
+
+	if (params->mp_ops &&
+	    params->mp_ops->nl_fill(params->mp_priv, rsp, rxq))
+		return -EMSGSIZE;
+
+#ifdef CONFIG_XDP_SOCKETS
+	if (rxq->pool)
+		if (nla_put_empty_nest(rsp, NETDEV_A_QUEUE_XSK))
+			return -EMSGSIZE;
+#endif
+	return 0;
+}
+
+static int
+netdev_nl_queue_fill_mp(struct sk_buff *rsp, struct net_device *netdev,
+			struct netdev_rx_queue *rxq)
+{
+	struct netdev_rx_queue *hw_rxq;
+	int ret;
+
+	hw_rxq = rxq->lease;
+	if (!hw_rxq || !netif_is_queue_leasee(netdev))
+		return __netdev_nl_queue_fill_mp(rsp, rxq);
+
+	netdev_lock(hw_rxq->dev);
+	ret = __netdev_nl_queue_fill_mp(rsp, hw_rxq);
+	netdev_unlock(hw_rxq->dev);
+	return ret;
+}
+
 static int
 netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			 u32 q_idx, u32 q_type, const struct genl_info *info)
 {
-	struct pp_memory_provider_params *params;
-	struct net_device *orig_netdev = netdev;
-	struct netdev_rx_queue *rxq, *rxq_lease;
+	struct netdev_rx_queue *rxq;
 	struct netdev_queue *txq;
 	void *hdr;
 
@@ -462,20 +493,8 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			goto nla_put_failure;
 		if (netdev_nl_queue_fill_lease(rsp, netdev, q_idx, q_type))
 			goto nla_put_failure;
-
-		rxq_lease = netif_get_rx_queue_lease_locked(&netdev, &q_idx);
-		if (rxq_lease)
-			rxq = rxq_lease;
-		params = &rxq->mp_params;
-		if (params->mp_ops &&
-		    params->mp_ops->nl_fill(params->mp_priv, rsp, rxq))
-			goto nla_put_failure_lease;
-#ifdef CONFIG_XDP_SOCKETS
-		if (rxq->pool)
-			if (nla_put_empty_nest(rsp, NETDEV_A_QUEUE_XSK))
-				goto nla_put_failure_lease;
-#endif
-		netif_put_rx_queue_lease_locked(orig_netdev, netdev);
+		if (netdev_nl_queue_fill_mp(rsp, netdev, rxq))
+			goto nla_put_failure;
 		break;
 	case NETDEV_QUEUE_TYPE_TX:
 		txq = netdev_get_tx_queue(netdev, q_idx);
@@ -493,8 +512,6 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 
 	return 0;
 
-nla_put_failure_lease:
-	netif_put_rx_queue_lease_locked(orig_netdev, netdev);
 nla_put_failure:
 	genlmsg_cancel(rsp, hdr);
 	return -EMSGSIZE;
diff --git a/net/core/netdev_queues.c b/net/core/netdev_queues.c
index 265161e12a9c..5597af86591b 100644
--- a/net/core/netdev_queues.c
+++ b/net/core/netdev_queues.c
@@ -37,18 +37,22 @@ struct device *netdev_queue_get_dma_dev(struct net_device *dev,
 					unsigned int idx,
 					enum netdev_queue_type type)
 {
-	struct net_device *orig_dev = dev;
+	struct netdev_rx_queue *hw_rxq;
 	struct device *dma_dev;
 
 	/* Only RX side supports queue leasing today. */
 	if (type != NETDEV_QUEUE_TYPE_RX || !netif_rxq_is_leased(dev, idx))
 		return __netdev_queue_get_dma_dev(dev, idx);
-
-	if (!netif_get_rx_queue_lease_locked(&dev, &idx))
+	if (!netif_is_queue_leasee(dev))
 		return NULL;
 
-	dma_dev = __netdev_queue_get_dma_dev(dev, idx);
-	netif_put_rx_queue_lease_locked(orig_dev, dev);
+	hw_rxq = __netif_get_rx_queue(dev, idx)->lease;
+
+	netdev_lock(hw_rxq->dev);
+	idx = get_netdev_rx_queue_index(hw_rxq);
+	dma_dev = __netdev_queue_get_dma_dev(hw_rxq->dev, idx);
+	netdev_unlock(hw_rxq->dev);
+
 	return dma_dev;
 }
 
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 1d6e7e47bf0a..53cea4460768 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -57,6 +57,11 @@ static bool netif_lease_dir_ok(const struct net_device *dev,
 	return false;
 }
 
+bool netif_is_queue_leasee(const struct net_device *dev)
+{
+	return netif_lease_dir_ok(dev, NETIF_VIRT_TO_PHYS);
+}
+
 struct netdev_rx_queue *
 __netif_get_rx_queue_lease(struct net_device **dev, unsigned int *rxq_idx,
 			   enum netif_lease_dir dir)
@@ -74,29 +79,6 @@ __netif_get_rx_queue_lease(struct net_device **dev, unsigned int *rxq_idx,
 	return rxq;
 }
 
-struct netdev_rx_queue *
-netif_get_rx_queue_lease_locked(struct net_device **dev, unsigned int *rxq_idx)
-{
-	struct net_device *orig_dev = *dev;
-	struct netdev_rx_queue *rxq;
-
-	/* Locking order is always from the virtual to the physical device
-	 * see netdev_nl_queue_create_doit().
-	 */
-	netdev_ops_assert_locked(orig_dev);
-	rxq = __netif_get_rx_queue_lease(dev, rxq_idx, NETIF_VIRT_TO_PHYS);
-	if (rxq && orig_dev != *dev)
-		netdev_lock(*dev);
-	return rxq;
-}
-
-void netif_put_rx_queue_lease_locked(struct net_device *orig_dev,
-				     struct net_device *dev)
-{
-	if (orig_dev != dev)
-		netdev_unlock(dev);
-}
-
 /* See also page_pool_is_unreadable() */
 bool netif_rxq_has_unreadable_mp(struct net_device *dev, unsigned int rxq_idx)
 {
@@ -261,7 +243,6 @@ int netif_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 		      const struct pp_memory_provider_params *p,
 		      struct netlink_ext_ack *extack)
 {
-	struct net_device *orig_dev = dev;
 	int ret;
 
 	if (!netdev_need_ops_lock(dev))
@@ -276,19 +257,18 @@ int netif_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 	if (!netif_rxq_is_leased(dev, rxq_idx))
 		return __netif_mp_open_rxq(dev, rxq_idx, p, extack);
 
-	if (!netif_get_rx_queue_lease_locked(&dev, &rxq_idx)) {
+	if (!__netif_get_rx_queue_lease(&dev, &rxq_idx, NETIF_VIRT_TO_PHYS)) {
 		NL_SET_ERR_MSG(extack, "rx queue leased to a virtual netdev");
 		return -EBUSY;
 	}
 	if (!dev->dev.parent) {
 		NL_SET_ERR_MSG(extack, "rx queue belongs to a virtual netdev");
-		ret = -EOPNOTSUPP;
-		goto out;
+		return -EOPNOTSUPP;
 	}
 
+	netdev_lock(dev);
 	ret = __netif_mp_open_rxq(dev, rxq_idx, p, extack);
-out:
-	netif_put_rx_queue_lease_locked(orig_dev, dev);
+	netdev_unlock(dev);
 	return ret;
 }
 
@@ -323,18 +303,18 @@ static void __netif_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
 void netif_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
 			const struct pp_memory_provider_params *old_p)
 {
-	struct net_device *orig_dev = dev;
-
 	if (WARN_ON_ONCE(ifq_idx >= dev->real_num_rx_queues))
 		return;
 	if (!netif_rxq_is_leased(dev, ifq_idx))
 		return __netif_mp_close_rxq(dev, ifq_idx, old_p);
 
-	if (WARN_ON_ONCE(!netif_get_rx_queue_lease_locked(&dev, &ifq_idx)))
+	if (!__netif_get_rx_queue_lease(&dev, &ifq_idx, NETIF_VIRT_TO_PHYS)) {
+		WARN_ON_ONCE(1);
 		return;
-
+	}
+	netdev_lock(dev);
 	__netif_mp_close_rxq(dev, ifq_idx, old_p);
-	netif_put_rx_queue_lease_locked(orig_dev, dev);
+	netdev_unlock(dev);
 }
 
 void __netif_mp_uninstall_rxq(struct netdev_rx_queue *rxq,
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index fe1c7899455e..616cd7b42502 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -31,6 +31,8 @@
 #include <net/netdev_rx_queue.h>
 #include <net/xdp.h>
 
+#include "../core/dev.h"
+
 #include "xsk_queue.h"
 #include "xdp_umem.h"
 #include "xsk.h"
@@ -117,20 +119,42 @@ struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
 }
 EXPORT_SYMBOL(xsk_get_pool_from_qid);
 
+static void __xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
+{
+	if (queue_id < dev->num_rx_queues)
+		dev->_rx[queue_id].pool = NULL;
+	if (queue_id < dev->num_tx_queues)
+		dev->_tx[queue_id].pool = NULL;
+}
+
 void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
 {
-	struct net_device *orig_dev = dev;
-	unsigned int id = queue_id;
+	struct netdev_rx_queue *hw_rxq;
 
-	if (id < dev->real_num_rx_queues)
-		WARN_ON_ONCE(!netif_get_rx_queue_lease_locked(&dev, &id));
+	if (!netif_rxq_is_leased(dev, queue_id))
+		return __xsk_clear_pool_at_qid(dev, queue_id);
+	WARN_ON_ONCE(!netif_is_queue_leasee(dev));
 
-	if (id < dev->num_rx_queues)
-		dev->_rx[id].pool = NULL;
-	if (id < dev->num_tx_queues)
-		dev->_tx[id].pool = NULL;
+	hw_rxq = __netif_get_rx_queue(dev, queue_id)->lease;
 
-	netif_put_rx_queue_lease_locked(orig_dev, dev);
+	netdev_lock(hw_rxq->dev);
+	queue_id = get_netdev_rx_queue_index(hw_rxq);
+	__xsk_clear_pool_at_qid(hw_rxq->dev, queue_id);
+	netdev_unlock(hw_rxq->dev);
+}
+
+static int __xsk_reg_pool_at_qid(struct net_device *dev,
+				 struct xsk_buff_pool *pool, u16 queue_id)
+{
+	if (xsk_get_pool_from_qid(dev, queue_id))
+		return -EBUSY;
+
+	if (queue_id < dev->real_num_rx_queues)
+		dev->_rx[queue_id].pool = pool;
+	if (queue_id < dev->real_num_tx_queues)
+		dev->_tx[queue_id].pool = pool;
+
+	return 0;
 }
 
 /* The buffer pool is stored both in the _rx struct and the _tx struct as we do
@@ -140,29 +164,26 @@ void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
 			u16 queue_id)
 {
-	struct net_device *orig_dev = dev;
-	unsigned int id = queue_id;
-	int ret = 0;
+	struct netdev_rx_queue *hw_rxq;
+	int ret;
 
-	if (id >= max(dev->real_num_rx_queues,
-		      dev->real_num_tx_queues))
+	if (queue_id >= max(dev->real_num_rx_queues,
+			    dev->real_num_tx_queues))
 		return -EINVAL;
 
-	if (id < dev->real_num_rx_queues) {
-		if (!netif_get_rx_queue_lease_locked(&dev, &id))
-			return -EBUSY;
-		if (xsk_get_pool_from_qid(dev, id)) {
-			ret = -EBUSY;
-			goto out;
-		}
-	}
+	if (queue_id >= dev->real_num_rx_queues ||
+	    !netif_rxq_is_leased(dev, queue_id))
+		return __xsk_reg_pool_at_qid(dev, pool, queue_id);
+	if (!netif_is_queue_leasee(dev))
+		return -EBUSY;
+
+	hw_rxq = __netif_get_rx_queue(dev, queue_id)->lease;
+
+	netdev_lock(hw_rxq->dev);
+	queue_id = get_netdev_rx_queue_index(hw_rxq);
+	ret = __xsk_reg_pool_at_qid(hw_rxq->dev, pool, queue_id);
+	netdev_unlock(hw_rxq->dev);
 
-	if (id < dev->real_num_rx_queues)
-		dev->_rx[id].pool = pool;
-	if (id < dev->real_num_tx_queues)
-		dev->_tx[id].pool = pool;
-out:
-	netif_put_rx_queue_lease_locked(orig_dev, dev);
 	return ret;
 }
 
-- 
2.53.0



* Re: [PATCH net v8 4/4] macsec: Support VLAN-filtering lower devices
From: Sabrina Dubroca @ 2026-04-08 22:16 UTC (permalink / raw)
  To: Cosmin Ratiu
  Cc: netdev, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Stanislav Fomichev,
	David Wei, Shuah Khan, linux-kselftest, Dragos Tatulea
In-Reply-To: <20260408115240.1636047-5-cratiu@nvidia.com>

2026-04-08, 14:52:40 +0300, Cosmin Ratiu wrote:
> VLAN-filtering is done through two netdev features
> (NETIF_F_HW_VLAN_CTAG_FILTER and NETIF_F_HW_VLAN_STAG_FILTER) and two
> netdev ops (ndo_vlan_rx_add_vid and ndo_vlan_rx_kill_vid).
> 
> Implement these and advertise the features if the lower device supports
> them. This allows proper VLAN filtering to work on top of MACsec
> devices, when the lower device is capable of VLAN filtering.
> As a concrete example, having this chain of interfaces now works:
> vlan_filtering_capable_dev(1) -> macsec_dev(2) -> macsec_vlan_dev(3)
> 
> Before the mentioned commit this used to accidentally work because the
> MACsec device (and thus the lower device) was put in promiscuous mode
> and the VLAN filter was not used. But after commit [1] correctly made
> the macsec driver expose the IFF_UNICAST_FLT flag, promiscuous mode was
> no longer used and VLAN filters on dev 1 kicked in. Without support in
> dev 2 for propagating VLAN filters down, the register_vlan_dev ->
> vlan_vid_add -> __vlan_vid_add -> vlan_add_rx_filter_info call from dev
> 3 is silently eaten (because vlan_hw_filter_capable returns false and
> vlan_add_rx_filter_info silently succeeds).
> 
> For MACsec, VLAN filters are only relevant for offload, otherwise
> the VLANs are encrypted and the lower devices don't care about them. So
> VLAN filters are only passed on to lower devices in offload mode.
> Flipping between offload modes now needs to offload/unoffload the
> filters with vlan_{get,drop}_rx_*_filter_info().
> 
> To avoid the back-and-forth filter updating during rollback, the setting
> of macsec->offload is moved after the add/del secy ops. This is safe
> since none of the code called from those requires macsec->offload.
> 
> In case adding the filters fails, the added ones are rolled back and an
> error is returned to the operation toggling the offload state.
> 
> Fixes: 0349659fd72f ("macsec: set IFF_UNICAST_FLT priv flag")
> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
> ---
>  drivers/net/macsec.c | 71 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 63 insertions(+), 8 deletions(-)

Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>

Thanks Cosmin.

-- 
Sabrina


