Netdev List
* Re: [net-next v39] mctp pcc: Implement MCTP over PCC Transport
From: Adam Young @ 2026-05-07 16:00 UTC (permalink / raw)
  To: Jeremy Kerr, Adam Young, Matt Johnston, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-kernel, Sudeep Holla, Jonathan Cameron, Huisong Li
In-Reply-To: <04015745fb7223f6d7dc262ae505daf0272586fb.camel@codeconstruct.com.au>

All the other comments will be taken in as changes.

On 5/6/26 23:02, Jeremy Kerr wrote:
>> +static void mctp_pcc_tx_done(struct mbox_client *c, void *mssg, int r)
>> +{
>> +       struct mctp_pcc_ndev *mctp_pcc_ndev;
>> +       struct sk_buff *skb = mssg;
>> +
>> +       /*
>> +        * If there is a packet in flight during driver cleanup
>> +        * It may have been freed already.
>> +        */
>> +       if (!mssg)
>> +               return;
>> +       /*
>> +        * If the return code is non-zero, we should not report the packet
>> +        * as transmitted.  However, we are in IRQ context right now, and we
>> +        * cannot safely write transmission statistics.
>> +        */
> This reads as if you're not updating stats at all, but you do so in
> mctp_pcc_tx_prepare(). I don't think this comment is necessary - if
> you really want to mention this, add a comment on the
> dev_dstats_tx_add() to indicate why you're calling it early.

This comment is in preparation for a fairly large change in the PCC layer
to address it.

This statistic should be reported in tx_done, but that cannot be done
safely yet.  The fix is to get tx_done out of the hard-IRQ handler. I will
submit that as follow-on changes to mailbox/pcc.c and mctp-pcc.c.




* Re: [PATCH net v2] eth: fbnic: fix double-free of PCS on phylink creation failure
From: Jakub Kicinski @ 2026-05-07 16:01 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Paolo Abeni, Alexander Duyck, kernel-team, Andrew Lunn,
	David S. Miller, Eric Dumazet, Russell King, netdev, linux-kernel,
	Bobby Eshleman
In-Reply-To: <afyxOaFhFzEKDK//@devvm29614.prn0.facebook.com>

On Thu, 7 May 2026 08:35:21 -0700 Bobby Eshleman wrote:
> > > fbd is a devlink priv, not netdev priv, touching it after free_netdev()
> > > is perfectly fine. I wish Gemini tried a *little* harder instead of
> > > guessing :| Sorry for not commenting earlier.  
> > 
> > Ugh, not enough coffee. It's complaining about MDIO reads, I think
> > that's valid.  
> 
> It is, but I think the race pre-exists.
> 
> static int
> fbnic_mdio_read_pmd(struct fbnic_dev *fbd, int addr, int regnum)
> [...]
> 	if (fbd->netdev) {
> 		fbn = netdev_priv(fbd->netdev);
> 		if (fbn->aui < FBNIC_AUI_UNKNOWN)
> 			aui = fbn->aui;
> 	}
> 
> 
> Definitely possible that ->netdev gets freed concurrently with
> fbd->netdev evaluating to true... but fbnic_netdev_free() faces the same
> race.
> 
> I'm open to fixing this all at once, if preferred. Probably need to look
> at some of the other fbnic_net ptr guards too.

I agree with Paolo, seems separate.

FWIW I think the fix may be to move the single aui field that mdio
cares about to fbd instead of fbn ? Feels like the problem is due
to a layering violation, mdio should not be touching fbn fields.
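
As an illustration only, here is a minimal userspace sketch of that idea,
assuming the value mdio needs is kept in the device-level structure so the
mdio path never dereferences netdev private data. All demo_* names and enum
values are hypothetical stand-ins, not the real fbnic definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical AUI values; the real driver has its own enum. */
enum demo_aui { DEMO_AUI_A = 0, DEMO_AUI_B = 1, DEMO_AUI_UNKNOWN = 2 };

/* Stand-in for netdev private data, which goes away with the netdev. */
struct demo_net {
	int unused;
};

/* Stand-in for the device-level structure, which outlives the netdev. */
struct demo_dev {
	struct demo_net *net;	/* may be NULL or stale during teardown */
	enum demo_aui aui;	/* the one field mdio cares about lives here */
};

/* The mdio path reads only device-level state, never d->net. */
static enum demo_aui demo_mdio_pick_aui(const struct demo_dev *d,
					enum demo_aui fallback)
{
	assert(d != NULL);
	return d->aui < DEMO_AUI_UNKNOWN ? d->aui : fallback;
}
```

With this layout the lifetime of the netdev no longer matters to the mdio
read, because nothing in that path follows the netdev pointer.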


* Re: [PATCH net-next 10/12] net: stmmac: tc956x: add TC956x/QPS615 support
From: Daniel Thompson @ 2026-05-07 16:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alex Elder, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, rmk+kernel, andersson, konradybcio, robh,
	krzk+dt, conor+dt, linusw, brgl, arnd, gregkh, mohd.anwar,
	a0987203069, alexandre.torgue, ast, boon.khai.ng, chenchuangyu,
	chenhuacai, daniel, hawk, hkallweit1, inochiama, john.fastabend,
	julianbraha, livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
	prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
	siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
	devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <2ce5897d-5bbb-486a-b0f0-0e30e54b451a@lunn.ch>

On Fri, May 01, 2026 at 09:04:58PM +0200, Andrew Lunn wrote:
> > +static struct tc956x_mac_speed mac_speed[] = {
> > +	{ PHY_INTERFACE_MODE_2500BASEX,	SPEED_2500,  SP_SEL_SGMII_2500M, },
> > +	{ PHY_INTERFACE_MODE_SGMII,	SPEED_2500,  SP_SEL_SGMII_2500M, },
> > +	{ PHY_INTERFACE_MODE_SGMII,	SPEED_1000,  SP_SEL_SGMII_1000M, },
>
> That looks odd. Some vendors implemented 2500BaseX using SGMII
> overclocked. But that is not strictly 2500BaseX. Having the 2500BASEX
> entry suggests you have real 2500BASEX, so why have an SGMII entry
> with SPEED_2500?

This is a consequence of the code that uses this lookup table being
called both during initialization and from the fix_mac_speed() callback.

During initialization we only have the value in plat->phy_interface to
go on so we run the lookup table using plat->phy_interface (which is
typically PHY_INTERFACE_MODE_SGMII) and with the maximum permitted
speed.

I haven't got detailed enough notes to allow me to double check but I
think there were problems completing the initial MAC reset if we didn't
write something sensible to the hardware during initialization.

During fix_mac_speed() we get told to adopt 2500base-x. Reviewing the
code I can see we don't propagate that and just use
plat->phy_interface for fix_mac_speed(). I will fix the code so that
the requested interface propagates properly to the lookup table but I
think we would still rely on the SGMII entry to get sane initial values
to write to the hardware.
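
To make the dependency on the SGMII/2500 entry concrete, here is a minimal
userspace sketch of a lookup keyed on (interface, speed). The demo_* enums
and values are hypothetical stand-ins for PHY_INTERFACE_MODE_* and SP_SEL_*;
only the shape of the table follows the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's interface and selector values. */
enum demo_iface { DEMO_IFACE_SGMII, DEMO_IFACE_2500BASEX };
enum demo_sp_sel { DEMO_SP_SEL_1000M = 1, DEMO_SP_SEL_2500M = 2 };

struct demo_mac_speed {
	enum demo_iface iface;
	int speed;			/* Mb/s */
	enum demo_sp_sel sp_sel;
};

static const struct demo_mac_speed demo_table[] = {
	{ DEMO_IFACE_2500BASEX, 2500, DEMO_SP_SEL_2500M },
	{ DEMO_IFACE_SGMII,     2500, DEMO_SP_SEL_2500M },
	{ DEMO_IFACE_SGMII,     1000, DEMO_SP_SEL_1000M },
};

/* Return the matching entry, or NULL when the pair is not in the table. */
static const struct demo_mac_speed *
demo_lookup(enum demo_iface iface, int speed)
{
	size_t i;

	assert(speed > 0);
	for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
		if (demo_table[i].iface == iface &&
		    demo_table[i].speed == speed)
			return &demo_table[i];
	}
	return NULL;
}
```

An init-time call with the platform interface (SGMII) and the maximum speed
(2500) only finds a match because of the second row; without it the lookup
would return NULL and nothing sensible would be written to the hardware.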


> > +/* We have one IRQ chip instance with 25 IRQs in its domain */
>
> One per MAC, or one overall?

One per MAC.


> > +static struct irq_domain *
> > +tc956x_msigen_irq_domain_instantiate(struct tc956x_data *td)
> > +{
> > +	struct irq_domain_chip_generic_info dgc_info;
> > +	struct irq_domain_info info;
> > +
> > +	dgc_info.name = "tc956x-msigen";
>
> If it is one per MAC, maybe this name should indicate which instance
> of the MAC this is.

Will do.


> > +static int tc956x_mac_setup(void *apriv, struct mac_device_info *mac)
> > +{
> > +	struct stmmac_priv *priv = apriv;
> > +	struct stmmac_desc_ops *desc;
> > +	struct stmmac_dma_ops *dma;
> > +	struct tc956x_data *td;
> > +
> > +	td = priv->plat->bsp_priv;
> > +
> > +	/* dwxgmac301_dma_ops needs extending to provide DMA address translation */
> > +	dma = &td->dma;
> > +	*dma = dwxgmac301_dma_ops;
> > +	dma->init_rx_chan = tc956x_dma_init_rx_chan;
> > +	dma->init_tx_chan = tc956x_dma_init_tx_chan;
> > +	mac->dma = dma;
>
> I could be reading this wrong....
>
> dma points to the global dwxgmac301_dma_ops, which you added a few
> patches back.
>
> You then modify it, changing two values in it.
>
> Doesn't that break any other dwxgmac301 in the system? Shouldn't you
> be making a copy of the global structure, and then making
> modifications to your copy? mac->dma then points to your copy?

That's exactly what this code does.

`*dma = dwxgmac301_dma_ops` is a structure copy, we never take a pointer
to dwxgmac301_dma_ops (and if we did, dwxgmac301_dma_ops is const so I
think we'd get a kernel oops if we tried to write to rodata).

Would the code be easier to read if we dropped the local `dma` variable
since that would make it clearer that td->dma is not a pointer? More
like:

+       /* dwxgmac301_dma_ops needs extending to provide DMA address translation */
+       td->dma = dwxgmac301_dma_ops;
+       td->dma.init_rx_chan = tc956x_dma_init_rx_chan;
+       td->dma.init_tx_chan = tc956x_dma_init_tx_chan;
+       mac->dma = &td->dma;
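
For anyone who wants to convince themselves of the copy semantics outside
the kernel, here is a small userspace sketch. The demo_* types and callbacks
are hypothetical and much smaller than the real stmmac ops structures; the
point is only that assigning a struct by value leaves the const template
untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down ops table; not the real struct stmmac_dma_ops. */
struct demo_dma_ops {
	int (*init_rx_chan)(int chan);
	int (*init_tx_chan)(int chan);
};

static int generic_init_rx(int chan) { return chan; }
static int generic_init_tx(int chan) { return chan; }
static int custom_init_rx(int chan) { return chan + 100; }
static int custom_init_tx(int chan) { return chan + 200; }

/* Shared read-only template, analogous to a const global ops struct. */
static const struct demo_dma_ops generic_ops = {
	.init_rx_chan = generic_init_rx,
	.init_tx_chan = generic_init_tx,
};

/* Per-device data embeds its own ops struct rather than a pointer. */
struct demo_priv {
	struct demo_dma_ops dma;
};

/* Copy the template by value, then override two hooks in the copy. */
static const struct demo_dma_ops *demo_setup(struct demo_priv *priv)
{
	assert(priv != NULL);
	priv->dma = generic_ops;	/* structure copy, no aliasing */
	priv->dma.init_rx_chan = custom_init_rx;
	priv->dma.init_tx_chan = custom_init_tx;
	return &priv->dma;
}
```

After demo_setup() the returned pointer refers to the per-device copy, and
calls through generic_ops still reach the generic callbacks.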


>
> > +	/* dwxgmac210_desc_ops also needs extending for the same reason */
> > +	desc = &td->desc;
> > +	*desc = dwxgmac210_desc_ops;
> > +	desc->set_addr = tc956x_desc_set_addr;
> > +	desc->set_sec_addr = tc956x_desc_set_sec_addr;
> > +	mac->desc = desc;
>
> And the same problem here?
>
> > +/* Called by tc956x_dwmac_probe(); return errors with dev_err_probe() */
> > +static int tc956x_dwmac_parse_dt(struct tc956x_data *td)
> > +{
> > +	struct device_node *mdio_node;
> > +	struct device *dev = td->dev;
> > +	struct device_node *np;
> > +
> > +	np = dev_of_node(dev);
> > +	if (!np)
> > +		return dev_err_probe(dev, -EINVAL, "no devicetree node\n");
> > +
> > +	/* Find the MDIO bus */
> > +	for_each_child_of_node(np, mdio_node) {
> > +		if (of_device_is_compatible(mdio_node,
> > +					    "snps,dwmac-mdio"))
> > +			break;
> > +	}
>
> It looks like if you put the ethernet properties into an ethernet node
> in DT, this might go away? Or at least allow you to use
> stmmac_of_get_mdio().

Alex has started looking into adding an ethernet node.


Daniel.


* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Stefano Garzarella @ 2026-05-07 16:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Michael S. Tsirkin, Eric Dumazet, Arseniy Krasnov, Bobby Eshleman,
	Stefan Hajnoczi, David S . Miller, Paolo Abeni, Simon Horman,
	netdev, eric.dumazet, Arseniy Krasnov, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, kvm, virtualization
In-Reply-To: <20260507073340.0604667d@kernel.org>

On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote:
>On Thu, 7 May 2026 14:59:13 +0200 Stefano Garzarella wrote:
>> >well if you want to support pathological cases such as 1 byte messages
>> >that would mean like 100x reduction no?
>>
>> Yep, but since this patch is already merged, IMHO that is better than
>> losing data in those pathological cases.
>
>We can revert if you think that the risk of regression is high..
>Please LMK soon, we can do it before patch reaches Linus.
>

Some tests in tools/testing/vsock/vsock_test.c are failing with this 
patch applied.

Test 18 fails sometimes in this way (I guess because we are
dropping packets):

18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch

Test 22 is failing 100% in this way:

22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed: Resource temporarily unavailable


With my follow-up patch, which also adds advertisement to the other peer
(still a local draft, waiting for Michael's proposal), I saw 22 failing
because the tests expect that they can use the entire buf_alloc, but now
we are reducing it.  So IMO we should do as `__sock_set_rcvbuf()` does and
double the buffer size, or at least allow for an overhead equal to the
buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah,
AF_VSOCK has had its own sockopt since the beginning :-().
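
A rough userspace sketch of that accounting, purely as an illustration: the
payload credit stays at the user's buf_alloc, while the memory charged for
queued skbs may grow to twice that. The demo_* names and the exact policy
are assumptions for this example and not the draft patch itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumed policy: allow skb memory up to twice the user-set buffer size. */
static unsigned int demo_mem_limit(unsigned int buf_alloc)
{
	return buf_alloc > UINT_MAX / 2 ? UINT_MAX : buf_alloc * 2;
}

/*
 * Decide whether a packet may be queued.
 * len is the payload size, truesize the memory actually charged for it.
 */
static bool demo_can_queue(unsigned int buf_alloc,
			   unsigned long long queued_payload,
			   unsigned long long queued_truesize,
			   unsigned int len, unsigned int truesize)
{
	assert(truesize >= len);
	/* Payload credit: the peer may still fill the whole buf_alloc. */
	if (queued_payload + len > buf_alloc)
		return false;
	/* Memory bound: per-packet overhead is capped by the doubled limit. */
	return queued_truesize + truesize <= demo_mem_limit(buf_alloc);
}
```

With a 1000-byte buffer, a single 1000-byte payload charged at 1500 bytes is
accepted, while a stream of tiny packets is cut off once their accounted
memory passes 2000 bytes.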

With that approach the tests pass, but I'd like to stress that patch a bit
more. I'll send it tomorrow as a fixup of this patch, or, if you prefer to
revert, I'll send it as a standalone patch.

Thanks,
Stefano



* Re: [PATCH net v2 2/4] net: sparx5: fix sleep in atomic context in MAC table access
From: Jakub Kicinski @ 2026-05-07 16:05 UTC (permalink / raw)
  To: Daniel Machon
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Steen Hegelund, UNGLinuxDriver, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt, Bjarni Jonasson, Lars Povlsen,
	Philipp Zabel, kees, linux-kernel, netdev, linux-arm-kernel,
	linux-rt-devel
In-Reply-To: <20260506-misc-fixes-sparx5-lan969x-v2-2-fb236aa96908@microchip.com>

On Wed, 6 May 2026 09:25:37 +0200 Daniel Machon wrote:
> sparx5_set_rx_mode() runs with netif_addr_lock_bh held and iterates
> dev->mc via __dev_mc_sync(), which per address calls sparx5_mc_sync() /
> sparx5_mc_unsync() -> sparx5_mact_learn() / sparx5_mact_forget().  These
> take sparx5->lock, a mutex, and then poll the MAC access command
> register with readx_poll_timeout(). A mutex may block, which is not
> allowed from atomic context.
> 
> Convert the driver to the new .ndo_set_rx_mode_async callback introduced
> in commit 3554b4345d85 ("net: introduce ndo_set_rx_mode_async and
> netdev_rx_mode_work"). The async callback is invoked from process
> context, so the mutex and sleeping completion poll can remain.

Sashiko points out that the switchdev handlers are currently racy,
but I think that's orthogonal.


* Re: [PATCH net v2] eth: fbnic: fix double-free of PCS on phylink creation failure
From: Bobby Eshleman @ 2026-05-07 16:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paolo Abeni, Alexander Duyck, kernel-team, Andrew Lunn,
	David S. Miller, Eric Dumazet, Russell King, netdev, linux-kernel,
	Bobby Eshleman
In-Reply-To: <20260507090127.285a5087@kernel.org>

On Thu, May 07, 2026 at 09:01:27AM -0700, Jakub Kicinski wrote:
> On Thu, 7 May 2026 08:35:21 -0700 Bobby Eshleman wrote:
> > > > fbd is a devlink priv, not netdev priv, touching it after free_netdev()
> > > > is perfectly fine. I wish Gemini tried a *little* harder instead of
> > > > guessing :| Sorry for not commenting earlier.  
> > > 
> > > Ugh, not enough coffee. It's complaining about MDIO reads, I think
> > > that's valid.  
> > 
> > It is, but I think the race pre-exists.
> > 
> > static int
> > fbnic_mdio_read_pmd(struct fbnic_dev *fbd, int addr, int regnum)
> > [...]
> > 	if (fbd->netdev) {
> > 		fbn = netdev_priv(fbd->netdev);
> > 		if (fbn->aui < FBNIC_AUI_UNKNOWN)
> > 			aui = fbn->aui;
> > 	}
> > 
> > 
> > Definitely possible that ->netdev gets freed concurrently with
> > fbd->netdev evaluating to true... but fbnic_netdev_free() faces the same
> > race.
> > 
> > I'm open to fixing this all at once, if preferred. Probably need to look
> > at some of the other fbnic_net ptr guards too.
> 
> I agree with Paolo, seems separate.
> 
> FWIW I think the fix may be to move the single aui field that mdio
> cares about to fbd instead of fbn ? Feels like the problem is due
> to a layering violation, mdio should not be touching fbn fields.

SGTM. And I'll take a look at moving aui.

Thanks,
Bobby


* Re: [PATCH net v2 1/4] net: sparx5: defer VCAP debugfs creation until after netdev registration
From: Jakub Kicinski @ 2026-05-07 16:08 UTC (permalink / raw)
  To: Daniel Machon
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Steen Hegelund, UNGLinuxDriver, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt, Bjarni Jonasson, Lars Povlsen,
	Philipp Zabel, kees, linux-kernel, netdev, linux-arm-kernel,
	linux-rt-devel
In-Reply-To: <20260506-misc-fixes-sparx5-lan969x-v2-1-fb236aa96908@microchip.com>

On Wed, 6 May 2026 09:25:36 +0200 Daniel Machon wrote:
> Move the debugfs setup into a new sparx5_debugfs() helper in
> sparx5_debugfs.c, invoked after sparx5_register_notifier_blocks()
> succeeds so the netdev names are finalized. sparx5_vcap_init() now
> only deals with VCAP state. The sparx5/ debugfs root is created in
> the new helper as well.

Netdev names are never final :( Users can change them at any time.
The best practice is to name the debugfs file by some stable hw-related
property, bus, port number etc.
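
A minimal userspace sketch of such naming, assuming the name is built from a
bus identifier and a port number. demo_debugfs_name() and the identifier
format are hypothetical; a real driver would take these from its device and
port structures.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a name from properties that do not change when the user renames
 * the netdev. Returns 0 on success, -1 on bad input or truncation.
 */
static int demo_debugfs_name(char *buf, size_t len,
			     const char *bus_id, unsigned int portno)
{
	int n;

	assert(buf != NULL && bus_id != NULL);
	if (strlen(bus_id) == 0)
		return -1;

	n = snprintf(buf, len, "%s-port%u", bus_id, portno);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

For a bus identifier of "600000000.switch" and port 3 this produces
"600000000.switch-port3", which stays valid across interface renames.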


* Re: [PATCH v2] xprtrdma: Move long delayed work on system_dfl_long_wq
From: Chuck Lever @ 2026-05-07 16:08 UTC (permalink / raw)
  To: Marco Crivellari, linux-kernel, linux-nfs, netdev
  Cc: Tejun Heo, Lai Jiangshan, Frederic Weisbecker,
	Sebastian Andrzej Siewior, Michal Hocko, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
In-Reply-To: <20260507130117.252825-1-marco.crivellari@suse.com>


On Thu, May 7, 2026, at 3:01 PM, Marco Crivellari wrote:
> Currently the code enqueue work items using {queue|mod}_delayed_work(),
> using system_long_wq. This workqueue should be used when long works are
> expected and it is a per-cpu workqueue.
>
> The function(s) end up calling __queue_delayed_work(), which set a global
> timer that could fire anywhere, enqueuing the work where the timer fired.
>
> Unbound works could benefit from scheduler task placement, to optimize
> performance and power consumption. Long work shouldn't stick to a single
> CPU.
>
> Recently, a new unbound workqueue specific for long running work has
> been added:
>
>     c116737e972e ("workqueue: Add system_dfl_long_wq for long unbound works")
>
> Since the workqueue work doesn't rely on per-cpu variables, there is no
> obvious reason that justify the use of a per-cpu workqueue. So change
> system_long_wq with system_dfl_long_wq so that the work may benefit from
> scheduler task placement.
>
> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
> ---
> Changes in v2:
> - Commit log improvements
>
> - Rebase on v7.1-rc2
>
> Link to v1: 
> https://lore.kernel.org/all/20260430085412.96961-1-marco.crivellari@suse.com/

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>


-- 
Chuck Lever


* Re: [PATCH net v2 0/4] net: sparx5: misc fixes for sparx5 and lan969x
From: Jakub Kicinski @ 2026-05-07 16:10 UTC (permalink / raw)
  To: Daniel Machon
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	Steen Hegelund, UNGLinuxDriver, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt, Bjarni Jonasson, Lars Povlsen,
	Philipp Zabel, kees, linux-kernel, netdev, linux-arm-kernel,
	linux-rt-devel, Andrew Lunn
In-Reply-To: <20260506-misc-fixes-sparx5-lan969x-v2-0-fb236aa96908@microchip.com>

On Wed, 6 May 2026 09:25:35 +0200 Daniel Machon wrote:
>       net: sparx5: fix wrong chip ids for TSN SKUs
>       net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init()

Let me grab these two already. Patch 1, I think, needs more work;
patch 2 is probably fine, but maybe we should address the Sashiko
issue in the same series.


* Re: [PATCH net-next 08/12] dt-bindings: net: toshiba,tc965x-dwmac: add TC956x Ethernet bridge
From: Bjorn Andersson @ 2026-05-07 16:12 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alex Elder, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, rmk+kernel, konradybcio, robh, krzk+dt,
	conor+dt, linusw, brgl, arnd, gregkh, Daniel Thompson, mohd.anwar,
	a0987203069, alexandre.torgue, ast, boon.khai.ng, chenchuangyu,
	chenhuacai, daniel, hawk, hkallweit1, inochiama, john.fastabend,
	julianbraha, livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
	prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
	siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
	devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <0de89be0-d842-46f4-8a23-d7e6cc62bcc4@lunn.ch>

On Thu, May 07, 2026 at 04:19:49PM +0200, Andrew Lunn wrote:
> > Are there other consumers of these TC956x gpios which would result in a
> > board designer (and hence dts author) to ever reference this
> > gpio-controller in a different way?
> 
> This Ethernet device could driver an SFP cage. Such cages typically
> have a number of pins connect to GPIOs, so you can tell when there is
> a module in the cage, enable the transmit laser, know if light is
> entering the module from the link peer, etc.
> 

Okay, so the consumer of the gpio is actually an external component, and
not just another part of the TC956x?

Then this seems reasonable.

Thank you,
Bjorn

>     sfp2: sfp {
>       compatible = "sff,sfp";
>       i2c-bus = <&sfp_i2c>;
>       los-gpios = <&cps_gpio1 28 GPIO_ACTIVE_HIGH>;
>       mod-def0-gpios = <&cps_gpio1 27 GPIO_ACTIVE_LOW>;
>       pinctrl-names = "default";
>       pinctrl-0 = <&cps_sfpp0_pins>;
>       tx-disable-gpios = <&cps_gpio1 29 GPIO_ACTIVE_HIGH>;
>       tx-fault-gpios = <&cps_gpio1 26 GPIO_ACTIVE_HIGH>;
>     };
> 
> 	Andrew


* Re: [PATCH] mptcp: serialize subflow->closing with RX path
From: Matthieu Baerts @ 2026-05-07 16:12 UTC (permalink / raw)
  To: Kalpan Jani, martineau, mptcp, netdev, linux-kernel
  Cc: shardul.b, janak, kalpanjani009, shardulsb08
In-Reply-To: <20260507072802.612125-1-kalpan.jani@mpiricsoftware.com>

Hi Kalpan,

On 07/05/2026 09:28, Kalpan Jani wrote:
> There is a race between mptcp_data_ready() (RX path) and
> mptcp_close_ssk() (teardown path) when accessing subflow->closing.

Thank you for sharing this patch!

Sadly, this patch doesn't apply and looks corrupted:

  Applying: mptcp: serialize subflow->closing with RX path
  error: corrupt patch at line 44
  error: could not build fake ancestor

Did you manually edit it without changing the line references?

While at it, please follow the rules from:

  https://docs.kernel.org/process/maintainer-netdev.html

=> designate your patch to a tree: [PATCH net]

It might be easier if you send new versions only to the MPTCP ML (not
ccing netdev).

> Currently, mptcp_data_ready() checks subflow->closing before acquiring
> mptcp_data_lock(), while mptcp_close_ssk() may concurrently set
> subflow->closing and purge backlog entries. This creates a classic
> time-of-check vs time-of-use (TOCTOU) race:
> 
>   CPU A (close path)              CPU B (RX path)
>   ----------------------         -------------------------
>   set closing = 1
>                                  read closing == 0
>   purge backlog
>                                  enqueue skb to backlog
> 
> As a result, skb entries referencing the subflow socket (ssk) may be
> enqueued after the subflow is marked closing and scheduled for cleanup.
> This can lead to:
> 
>   - WARN in inet_sock_destruct() due to non-zero sk_rmem_alloc
>   - potential use-after-free via stale skb->sk references

By chance, do you have (decoded) call traces to share in the commit
message? Or, even better, a reproducer? Otherwise, please explain how you
found this issue, and possibly which tool helped you find it.

> Fix this by serializing both the closing check and backlog enqueue
> under mptcp_data_lock(). This ensures that subflow->closing state and
> backlog operations are observed atomically, preventing new skb from
> being enqueued once teardown begins.
> 
> Also protect backlog cleanup in mptcp_close_ssk() with the same lock
> to guarantee mutual exclusion with the RX path.
> 
> This restores proper synchronization between RX and teardown paths
> and prevents stale skb references to closing subflows.

Also, for fixes, the "Fixes:" tag is required.

> Signed-off-by: Kalpan Jani <kalpan.jani@mpiricsoftware.com>
> ---
>  net/mptcp/protocol.c | 31 ++++++++++++++++++++++++++++---
>  1 file changed, 28 insertions(+), 3 deletions(-)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 718e910ff..295f8e1c0 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -910,14 +910,34 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
>  	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
>  	struct mptcp_sock *msk = mptcp_sk(sk);
>  
> +	/*
> +	 * The close path can set subflow->closing while we are racing
> +	 * from BH context here. The old check was done before taking
> +	 * mptcp_data_lock(), leaving a TOCTOU window:
> +	 *
> +	 *   CPU A: close path sets closing = 1 and purges backlog
> +	 *   CPU B: already observed closing == 0 and later enqueues skb
> +	 *
> +	 * That skb keeps skb->sk == ssk and can later trigger:
> +	 * - WARN in inet_sock_destruct() (ssk->sk_rmem_alloc != 0)
> +	 * - UAF in backlog purge via stale skb->sk
> +	 */

I don't think it's useful to add a comment referring to an old behaviour.

> +
>  	/* The peer can send data while we are shutting down this
>  	 * subflow at subflow destruction time, but we must avoid enqueuing
>  	 * more data to the msk receive queue
>  	 */

Instead, I suggest moving this comment below as well, and merging it with
the new one you added.

> -	if (unlikely(subflow->closing))
> -		return;
>  
>  	mptcp_data_lock(sk);
> +
> +	/* Serialize closing check with backlog enqueue */
> +	if (unlikely(subflow->closing)) {
> +		mptcp_data_unlock(sk);

When locks are used, we usually prefer having one exit path: please add
a new label above mptcp_data_unlock() below, and a goto here.
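
To illustrate the single-exit pattern, here is a minimal userspace sketch.
The lock is modelled by a counter so the example is self-contained; the
demo_* names are hypothetical and this is not the MPTCP code.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_sock {
	int lock_depth;		/* stand-in for a lock: 1 while held */
	bool closing;
	int queued;
};

static void demo_lock(struct demo_sock *s)
{
	assert(s->lock_depth == 0);
	s->lock_depth++;
}

static void demo_unlock(struct demo_sock *s)
{
	assert(s->lock_depth == 1);
	s->lock_depth--;
}

/* RX side: the closing check and the enqueue happen under the same lock. */
static void demo_data_ready(struct demo_sock *s)
{
	demo_lock(s);

	if (s->closing)
		goto unlock;

	s->queued++;

unlock:
	demo_unlock(s);		/* the only place the lock is released */
}

/* Close side: mark closing and purge under the same lock. */
static void demo_close(struct demo_sock *s)
{
	demo_lock(s);
	s->closing = true;
	s->queued = 0;
	demo_unlock(s);
}
```

Every path through demo_data_ready() leaves via the unlock label, so adding
another early exit later cannot forget to release the lock.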

> +		return;
> +	}
> +
>  	mptcp_rcv_rtt_update(msk, subflow);
>  	if (!sock_owned_by_user(sk)) {
>  		/* Wake-up the reader only for in-sequence data */
> @@ -2653,9 +2673,12 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
>  	if (sk->sk_state == TCP_ESTABLISHED)
>  		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
>  
> -	/* Remove any reference from the backlog to this ssk; backlog skbs consume
> +	/* Remove any reference from the backlog to this ssk.
> +	 * Serialize cleanup with RX-side enqueue using mptcp_data_lock().

Easier to add this new line at the end of the comment to reduce the diff.

> +	 * Backlog skbs consume
>  	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
>  	 */
> +	mptcp_data_lock(sk);
>  	list_for_each_entry(skb, &msk->backlog_list, list) {
>  		if (skb->sk != ssk)
>  			continue;
> @@ -2663,6 +2686,8 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
>  		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
>  		skb->sk = NULL;
>  	}
> +	mptcp_data_unlock(sk);
> +
>  

No double empty lines. I think 'checkpatch' would tell you that.

>  	/* subflow aborted before reaching the fully_established status
>  	 * attempt the creation of the next subflow

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH 0/6] SUNRPC: Address remaining cache_check_rcu() UAF in cache content files
From: Chuck Lever @ 2026-05-07 16:12 UTC (permalink / raw)
  To: yangerkun, Misbah Anjum N, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman
  Cc: linux-nfs, linux-kernel, netdev, Chuck Lever
In-Reply-To: <c45779f6-fe6c-4037-bb1c-01cfbbaa8aac@huawei.com>

Hello Erkun -

On Thu, May 7, 2026, at 11:09 AM, yangerkun wrote:
> Hi,
>
> On 2026/5/1 22:51, Chuck Lever wrote:
>> Misbah Anjum reported a use-after-free in cache_check_rcu()
>> reached through e_show() while sosreport was reading
>> /proc/fs/nfsd/exports on ppc64le.  Two fixes for that report
>> landed in v7.0:
>> 
>>    48db892356d6 ("NFSD: Defer sub-object cleanup in export put callbacks")
>>    e7fcf179b82d ("NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd")
>
> Back to the problem fixed by this patches, I'm a little confused why
> this UAF can be trigged.
>
> Before this patches, svc_export_put show as follow:
>
>   static void svc_export_put(struct kref *ref)
>   {
>           struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
>
>           path_put(&exp->ex_path);
>           auth_domain_put(exp->ex_client);
>           call_rcu(&exp->ex_rcu, svc_export_release);
>   }
>
> The auth_domain_put function releases ->name using call_rcu, and
> path_put may release the dentry also via call_rcu. All of this seems to
> prevent e_show from causing a UAF. Could you point out which line in
> d_path triggers the issue?

The dentry, the mount, and the auth_domain ->name buffer all
end up RCU-freed (dentry_free() and delayed_free_vfsmnt in
fs/, svcauth_unix_domain_release_rcu() in svcauth_unix.c).
The eventual kfree isn't the problem.

The problem is the synchronous teardown inside path_put(),
which runs before svc_export_put() ever reaches its own
call_rcu():

  path_put(&exp->ex_path)
    -> dput(dentry)
       -> __dentry_kill()              [if last ref]
          -> __d_drop()                /* unhashes */
          -> dentry_unlink_inode()     /* d_inode = NULL */
          -> d_op->d_release() if set
          -> drops parent d_lockref    /* may cascade up */
          -> dentry_free()             /* call_rcu deferred */
    -> mntput(mnt)                     /* deferred via task_work */

The dentry pointer itself is RCU-safe, so prepend_path()'s walk
of d_parent and d_name doesn't read freed memory.  But by the
time the reader gets there, __d_clear_type_and_inode() has
already stored NULL into d_inode, __d_drop() has broken the
hash linkage, and the parent's d_lockref has been decremented
-- which can in turn fire __dentry_kill() on the parent, and
on up the tree.  An e_show() that's still inside its cache RCU
read section walks into that half-dismantled state through
seq_path(), and that's the NULL deref Misbah reported.

The earlier fix (2530766492ec, "nfsd: fix UAF when access
ex_uuid or ex_stats") moved the kfree of ex_uuid and ex_stats
into svc_export_release() so those are RCU-safe now.
path_put() and auth_domain_put() couldn't go in there because
both may sleep, and call_rcu callbacks run in softirq context.
This series uses queue_rcu_work() instead: it defers past the
grace period AND runs the callback in process context, so the
sleeping puts move into the deferred path and the window
closes.


-- 
Chuck Lever


* [PATCH net] net/sched: dualpi2: initialize timer earlier in dualpi2_init()
From: Davide Caratti @ 2026-05-07 16:14 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev

'pi2_timer' needs to be initialized in all error paths of dualpi2_init():
otherwise, a failure in qdisc_create_dflt() causes the following crash in
dualpi2_destroy():

  # tc qdisc add dev crash0 handle 1: root dualpi2
  BUG: kernel NULL pointer dereference, address: 0000000000000010
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP PTI
  CPU: 4 UID: 0 PID: 471 Comm: tc Tainted: G            E       7.1.0-rc1-virtme #2 PREEMPT(full)
  Tainted: [E]=UNSIGNED_MODULE
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: 0010:hrtimer_active+0x39/0x60
  Code: f9 eb 23 0f b6 41 38 3c 01 0f 87 87 64 c0 ff 83 e0 01 75 33 48 39 4a 50 74 28 44 3b 42 10 75 06 48 3b 51 30 74 21 48 8b 51 30 <44> 8b 42 10 41 f6 c0 01 74 cf f3 90 44 8b 42 10 41 f6 c0 01 74 c3
  RSP: 0018:ffffd0db80b93620 EFLAGS: 00010282
  RAX: ffffffffc0400320 RBX: ffff8cf24a4c86b8 RCX: ffff8cf24a4c86b8
  RDX: 0000000000000000 RSI: ffff8cf2429c2ab0 RDI: ffff8cf24a4c86b8
  RBP: 00000000fffffff4 R08: 0000000000000003 R09: 0000000000000000
  R10: 0000000000000001 R11: ffff8cf24a39c500 R12: ffff8cf24822c000
  R13: ffffd0db80b936c0 R14: ffffffffc02cf360 R15: 00000000ffffffff
  FS:  00007fbc01706580(0000) GS:ffff8cf2dc759000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000010 CR3: 0000000008e02003 CR4: 0000000000172ef0
  Call Trace:
   <TASK>
   hrtimer_cancel+0x15/0x40
   dualpi2_destroy+0x20/0x40 [sch_dualpi2]
   qdisc_create+0x230/0x570
   tc_modify_qdisc+0x716/0xc10
   rtnetlink_rcv_msg+0x188/0x780
   netlink_rcv_skb+0xcd/0x150
   netlink_unicast+0x1ba/0x290
   netlink_sendmsg+0x242/0x4d0
   ____sys_sendmsg+0x39e/0x3e0
   ___sys_sendmsg+0xe1/0x130
   __sys_sendmsg+0xad/0x110
   do_syscall_64+0x14f/0xf80
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7fbc0188b08e
  Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa
  RSP: 002b:00007fff593260e0 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc0188b08e
  RDX: 0000000000000000 RSI: 00007fff59326190 RDI: 0000000000000003
  RBP: 00007fff593260f0 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000202 R12: 000055f06124f260
  R13: 0000000069fca043 R14: 000055f061255640 R15: 000055f06124d3f8
   </TASK>
  Modules linked in: sch_dualpi2(E)
  CR2: 0000000000000010

[1] https://lore.kernel.org/netdev/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com/
[2] https://lore.kernel.org/netdev/20200725201707.16909-1-xiyou.wangcong@gmail.com/

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/sch_dualpi2.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
index 6d7e6389758d..dbe4b99955ab 100644
--- a/net/sched/sch_dualpi2.c
+++ b/net/sched/sch_dualpi2.c
@@ -917,6 +917,9 @@ static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
 	struct dualpi2_sched_data *q = qdisc_priv(sch);
 	int err;
 
+	hrtimer_setup(&q->pi2_timer, dualpi2_timer, CLOCK_MONOTONIC,
+		      HRTIMER_MODE_ABS_PINNED_SOFT);
+
 	q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
 				       TC_H_MAKE(sch->handle, 1), extack);
 	if (!q->l_queue)
@@ -928,8 +931,6 @@ static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
 
 	q->sch = sch;
 	dualpi2_reset_default(sch);
-	hrtimer_setup(&q->pi2_timer, dualpi2_timer, CLOCK_MONOTONIC,
-		      HRTIMER_MODE_ABS_PINNED_SOFT);
 
 	if (opt && nla_len(opt)) {
 		err = dualpi2_change(sch, opt, extack);
-- 
2.52.0


^ permalink raw reply related

* Re: [PATCH v2 iproute2-next 1/4] rdma: Update headers
From: David Ahern @ 2026-05-07 16:20 UTC (permalink / raw)
  To: Chiara Meiohas, Stephen Hemminger
  Cc: leon, michaelgur, jgg, linux-rdma, netdev, Patrisious Haddad
In-Reply-To: <98c17c60-6747-4c2b-bfe4-ce9bbe560f6d@nvidia.com>

On 4/28/26 4:05 AM, Chiara Meiohas wrote:
> We will prepare a sync patch to align the names with the kernel and send
> it shortly.

What happened to this request? I see that Stephen had to post a patch
(not yet applied) to address this problem:

https://patchwork.kernel.org/project/netdevbpf/patch/20260505181045.748088-1-stephen@networkplumber.org/

We allow rdma to have separate uapi headers for convenience. Responses
to mistakes need to be timely.

^ permalink raw reply

* Re: [PATCH net v2 0/4] net: sparx5: misc fixes for sparx5 and lan969x
From: patchwork-bot+netdevbpf @ 2026-05-07 16:20 UTC (permalink / raw)
  To: Daniel Machon
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, Steen.Hegelund,
	UNGLinuxDriver, bigeasy, clrkwllms, rostedt, bjarni.jonasson,
	lars.povlsen, p.zabel, kees, linux-kernel, netdev,
	linux-arm-kernel, steen.hegelund, linux-rt-devel, andrew
In-Reply-To: <20260506-misc-fixes-sparx5-lan969x-v2-0-fb236aa96908@microchip.com>

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 6 May 2026 09:25:35 +0200 you wrote:
> This series fixes various issues in the sparx5 driver, which also
> serves lan969x.
> 
> Details are in the individual commit descriptions.
> 
> Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
> 
> [...]

Here is the summary with links:
  - [net,v2,1/4] net: sparx5: defer VCAP debugfs creation until after netdev registration
    (no matching commit)
  - [net,v2,2/4] net: sparx5: fix sleep in atomic context in MAC table access
    (no matching commit)
  - [net,v2,3/4] net: sparx5: fix wrong chip ids for TSN SKUs
    https://git.kernel.org/netdev/net/c/b131dc93f7bf
  - [net,v2,4/4] net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init()
    https://git.kernel.org/netdev/net/c/41ae14071cd7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net-next 10/12] net: stmmac: tc956x: add TC956x/QPS615 support
From: Andrew Lunn @ 2026-05-07 16:29 UTC (permalink / raw)
  To: Daniel Thompson
  Cc: Alex Elder, andrew+netdev, davem, edumazet, kuba, pabeni,
	maxime.chevallier, rmk+kernel, andersson, konradybcio, robh,
	krzk+dt, conor+dt, linusw, brgl, arnd, gregkh, mohd.anwar,
	a0987203069, alexandre.torgue, ast, boon.khai.ng, chenchuangyu,
	chenhuacai, daniel, hawk, hkallweit1, inochiama, john.fastabend,
	julianbraha, livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
	prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
	siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
	devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <afy34kj2hPxIlArO@aspen.lan>

On Thu, May 07, 2026 at 05:03:46PM +0100, Daniel Thompson wrote:
> On Fri, May 01, 2026 at 09:04:58PM +0200, Andrew Lunn wrote:
> > > +static struct tc956x_mac_speed mac_speed[] = {
> > > +	{ PHY_INTERFACE_MODE_2500BASEX,	SPEED_2500,  SP_SEL_SGMII_2500M, },
> > > +	{ PHY_INTERFACE_MODE_SGMII,	SPEED_2500,  SP_SEL_SGMII_2500M, },
> > > +	{ PHY_INTERFACE_MODE_SGMII,	SPEED_1000,  SP_SEL_SGMII_1000M, },
> >
> > That looks odd. Some vendors implemented 2500BaseX using SGMII
> > overclocked. But that is not strictly 2500BaseX. Having the 2500BASEX
> > entry suggests you have real 2500BASEX, so why have an SGMII entry
> > with SPEED_2500?
> 
> This is a consequence of the code that uses this lookup table being
> called both during initialization and from the fix_mac_speed() callback.
> 
> During initialization we only have the value in plat->phy_interface to
> go on so we run the lookup table using plat->phy_interface (which is
> typically PHY_INTERFACE_MODE_SGMII) and with the maximum permitted
> speed.

Something sounds wrong here. SGMII only supports 10/100/1G. You should
never be asked to do SGMII at 2500. It should ask for 2500BaseX.

> I haven't got detailed enough notes to allow me to double check but I
> think there were problems completing the initial MAC reset if we didn't
> write something sensible to the hardware during initialization.

> During fix_mac_speed() we get told to adopt 2500base-x. Reviewing the
> code I can see we don't propagate that and just use
> plat->phy_interface for fix_mac_speed(). I will fix the code so that
> the requested interface propagates properly to the lookup table but I
> think we would still rely on the SGMII entry to get sane initial values
> to write to the hardware.

Getting sane values into the hardware is good, but 2500 SGMII is not
sane :-(

> > Doesn't that break any other dwxgmac301 in the system? Shouldn't you
> > be making a copy of the global structure, and then making
> > modifications to your copy? mac->dma then points to your copy?
> 
> That's exactly what this code does.
> 
> `*dma = dwxgmac301_dma_ops` is a structure copy, we never take a pointer
> to dwxgmac301_dma_ops (and if we did, dwxgmac301_dma_ops is const so I
> think we'd get a kernel oops if we tried to write to rodata).
> 
> Would the code be easier to read if we dropped the local `dma` variable
> since that would make it clearer that td->dma is not a pointer? More
> like:

> +       /* dwxgmac301_dma_ops needs extending to provide DMA address translation */
> +       td->dma = dwxgmac301_dma_ops;
> +       td->dma.init_rx_chan = tc956x_dma_init_rx_chan;
> +       td->dma.init_tx_chan = tc956x_dma_init_tx_chan;
> +       mac->dma = &td->dma;

Yes, that is better. I also think it is partially my problem. You
don't often see structure assignments, just pointer assignments. So
I'm somewhat blind to them.

	Andrew
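
The structure-copy idiom discussed above can be shown in a minimal
user-space sketch. Every name here (struct dma_ops, generic_dma_ops,
struct priv, priv_setup) is a hypothetical stand-in, not the stmmac API.
It only illustrates that assigning a struct copies it, so overriding a
member of the copy leaves the shared const template untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver ops table; not the real stmmac type. */
struct dma_ops {
	int (*init_rx_chan)(int chan);
	int (*init_tx_chan)(int chan);
};

static int generic_init_rx(int chan) { return chan; }
static int generic_init_tx(int chan) { return chan; }
static int custom_init_rx(int chan) { return chan + 100; }

/* Shared read-only template, playing the role of dwxgmac301_dma_ops. */
static const struct dma_ops generic_dma_ops = {
	.init_rx_chan = generic_init_rx,
	.init_tx_chan = generic_init_tx,
};

struct priv {
	struct dma_ops dma;           /* embedded copy, not a pointer */
	const struct dma_ops *active; /* what the rest of the code calls through */
};

static void priv_setup(struct priv *p)
{
	assert(p != NULL);
	p->dma = generic_dma_ops;              /* structure copy of the template */
	p->dma.init_rx_chan = custom_init_rx;  /* modifies only the private copy */
	p->active = &p->dma;                   /* point at the copy, not the template */
}
```

Writing through the embedded member directly, with no intermediate local
pointer, makes it visible at the assignment site that a copy is made.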

^ permalink raw reply

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Eric Dumazet @ 2026-05-07 16:32 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Jakub Kicinski, Michael S. Tsirkin, Arseniy Krasnov,
	Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, kvm, virtualization
In-Reply-To: <afyy9CeniTBF3o2I@sgarzare-redhat>

On Thu, May 7, 2026 at 9:05 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
>
> On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote:
> >On Thu, 7 May 2026 14:59:13 +0200 Stefano Garzarella wrote:
> >> >well if you want to support pathological cases such as 1 byte messages
> >> >that would mean like 100x reduction no?
> >>
> >> Yep, but since this patch is already merged, IMHO that is better than
> >> losing data in those pathological cases.
> >
> >We can revert if you think that the risk of regression is high..
> >Please LMK soon, we can do it before patch reaches Linus.
> >
>
> Some tests in tools/testing/vsock/vsock_test.c are failing with this
> patch applied.
>
> Test 18 is failing sometimes in this way (I guess because we are
> dropping packets):
>
> 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch
>
> Test 22 is failing 100% in this way:
>
> 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed:
> Resource temporarily unavailable
>
>
> With my followup patch also adding advertisement to the other peer
> (still draft locally, waiting for Michael's proposal) I saw 22 failing,
> because the tests expect that they can use the entire buf_alloc, but now
> we are reducing it.  So IMO we should do like in `__sock_set_rcvbuf()` and
> double the buffer size, or at least digest an overhead equal to the
> buffer size set by the user via SO_VM_SOCKETS_BUFFER_SIZE (yeah,
> AF_VSOCK has had its own sockopt since the beginning :-().
>
> With that approach tests are passing, but I'd like to stress a bit more
> that patch. I'll send it tomorrow as fixup of this patch, or if you
> prefer to revert, I'll send as standalone.
>

A plain revert is a big issue, now users know how to crash hypervisors.

This vulnerability allows a compromised guest (controlling
virtio_vsock_hdr fields) to continuously flood the host's vsock receive
queue without triggering any memory accounting limits or reader wakeups,
resulting in unbounded host kernel memory consumption (Host DoS via OOM).

A vulnerability where a KVM guest can crash or deadlock its host is
classified as a KVM DoS.

Am I missing something?

^ permalink raw reply

* Re: [PATCH net 09/13] ice: fix setting RSS VSI hash for E830
From: Marcin Szycik @ 2026-05-07 16:59 UTC (permalink / raw)
  To: Jacob Keller, Przemek Kitszel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Piotr Kwapulinski,
	Aleksandr Loktionov, Arkadiusz Kubalewski, Maciej Fijalkowski,
	Joshua Hay, Madhu Chittim, Willem de Bruijn, Dave Ertman,
	Ivan Vecera, Grzegorz Nitka
  Cc: netdev, stable
In-Reply-To: <56e52628-d029-4919-95dd-aa6da13f3b08@linux.intel.com>



On 07.05.2026 13:47, Marcin Szycik wrote:
> 
> 
> On 06.05.2026 23:06, Jacob Keller wrote:
>> On 5/4/2026 10:14 PM, Jacob Keller wrote:
>>> From: Marcin Szycik <marcin.szycik@linux.intel.com>
>>>
>>> ice_set_rss_hfunc() performs a VSI update, in which it sets hashing
>>> function, leaving other VSI options unchanged. However, ::q_opt_flags is
>>> mistakenly set to the value of another field, instead of its original
>>> value, probably due to a typo. What happens next is hardware-dependent:
>>>
>>> On E810, only the first bit is meaningful (see
>>> ICE_AQ_VSI_Q_OPT_PE_FLTR_EN) and can potentially end up in a different
>>> state than before VSI update.
>>>
>>> On E830, some of the remaining bits are not reserved. Setting them
>>> to some unrelated values can cause the firmware to reject the update
>>> because of invalid settings, or worse - succeed.
>>>
>>> Reproducer:
>>>   sudo ethtool -X $PF1 equal 8
>>>
>>> Output in dmesg:
>>>   Failed to configure RSS hash for VSI 6, error -5
>>>
>>> Fixes: 352e9bf23813 ("ice: enable symmetric-xor RSS for Toeplitz hash function")
>>> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>>> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
>>> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
>>> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
>>> ---
>>>  drivers/net/ethernet/intel/ice/ice_main.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
>>> index 1d1947a7fe11..c52c465280f7 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_main.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
>>> @@ -8046,7 +8046,7 @@ int ice_set_rss_hfunc(struct ice_vsi *vsi, u8 hfunc)
>>>  	ctx->info.q_opt_rss |=
>>>  		FIELD_PREP(ICE_AQ_VSI_Q_OPT_RSS_HASH_M, hfunc);
>>>  	ctx->info.q_opt_tc = vsi->info.q_opt_tc;
>>> -	ctx->info.q_opt_flags = vsi->info.q_opt_rss;
>>> +	ctx->info.q_opt_flags = vsi->info.q_opt_flags;
>>>  
>>>  	err = ice_update_vsi(hw, vsi->idx, ctx, NULL);
>>>  	if (err) {
>>>
>>
>> Sashiko complains about ice_set_rss_hfunc() but it is unrelated to this fix:
>>
>>> While looking at this function, I noticed a pre-existing issue regarding the
>>> hardware cache. Does calling ice_update_vsi() with a local context leave the
>>> global hw->vsi_ctx[vsi->idx] out of sync?
>>> If ice_update_vsi() succeeds, vsi->info.q_opt_rss is updated, but
>>> hw->vsi_ctx[vsi->idx]->info.q_opt_rss is not.
>>> When an unrelated feature such as RDMA filtering is subsequently toggled via
>>> ice_cfg_rdma_fltr(), could it retrieve this stale cached context via
>>> ice_get_vsi_ctx() and copy the stale q_opt_rss value back into its command
>>> buffer?
> 
> Yes.
> 
>>> Could this cause the firmware to silently revert the RSS hash function to its
>>> previous configuration during the next ice_update_vsi() call?
> 
> No, because the context object passed to ice_update_vsi() only sets
> ctx->info.valid_sections for the sections it wants to update, so unrelated
> values are not updated in HW.
> 
> Looking at other ice_update_vsi() calls, most of the time the context object is
> being allocated, not taken from cache. It's not immediately clear to me what
> purpose hw->vsi_ctx[] serves - it only appears to be used in
> ice_cfg_rdma_fltr() (correct me if I'm wrong), where options from the cached
> context are being read to fill the unchanged fields in the updated section.
> This seems to be the equivalent of keeping track of context values in vsi->info,
> which is what almost all ice_update_vsi() callers do.
> If I had to guess, I'd say hw->vsi_ctx[] could probably be removed and vsi->info
> used instead, but maybe I'm missing something.
> 
> TLDR I think this is just old, inconsistent code that could be improved, but it
> needs some investigation. Until we have a clear signal that there's a bug,
> I wouldn't touch it.
> 
>>> There also appears to be a pre-existing issue with state desynchronization
>>> when handling failures. After ice_update_vsi() succeeds, vsi->rss_hfunc
>>> is updated, but ice_set_rss_hfunc() then calls ice_set_rss_cfg_symm() at the
>>> end of the function:
>>> ice_set_rss_hfunc() {
>>>         ...
>>>         err = ice_update_vsi(hw, vsi->idx, ctx, NULL);
>>>         if (err) {
>>>                 ...
>>>         } else {
>>>                 vsi->info.q_opt_rss = ctx->info.q_opt_rss;
>>>                 vsi->rss_hfunc = hfunc;
>>>                 ...
>>>         }
>>>         ...
>>>         return ice_set_rss_cfg_symm(hw, vsi, symm);
>>> }
>>> If ice_set_rss_cfg_symm() fails, the error is returned but vsi->rss_hfunc
>>> remains updated. If the user retries the command, the early check:
>>> if (hfunc == vsi->rss_hfunc)
>>>         return 0;
>>> evaluates to true and returns success immediately.
>>> Could this skip the ice_set_rss_cfg_symm() retry entirely, leaving the
>>> hardware's flow director rules in a non-symmetric state permanently while
>>> falsely reporting success?
> 
> This looks valid.

On second thought, if we decide to roll back changes to the VSI when
ice_set_rss_cfg_symm() fails, we must call ice_update_vsi(), which can then also
fail, still leaving us with hfunc programmed and symmetry not set. I'm not sure
it's worth adding a rollback that can fail and still leave us with the original
problem. The user would just see 2 errors instead of 1.
> Thanks,
> Marcin
> 
>> Someone from the ice team should look into this and determine whether or
>> not it's valid.
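
The retry problem described above can be modelled in a few lines of
user-space C. This is a simplified sketch, not the ice driver code:
struct vsi_model, set_symm() and set_hfunc() are hypothetical names, the
fail_symm knob exists only to inject a failure, and -5 stands in for the
error code seen in the dmesg output.

```c
#include <stdbool.h>

/* Hypothetical model of the cached state; not the real struct ice_vsi. */
struct vsi_model {
	int rss_hfunc;   /* software copy of the programmed hash function */
	bool symm_ok;    /* whether the symmetry step ever succeeded */
	bool fail_symm;  /* knob to inject a failure in the symmetry step */
};

/* Stand-in for the symmetry configuration step. */
static int set_symm(struct vsi_model *v)
{
	if (v->fail_symm)
		return -5;
	v->symm_ok = true;
	return 0;
}

/* Stand-in for the hash-function update with the early-exit check. */
static int set_hfunc(struct vsi_model *v, int hfunc)
{
	if (hfunc == v->rss_hfunc)
		return 0;         /* early exit: symmetry step is not retried */

	v->rss_hfunc = hfunc;     /* cached once the "VSI update" succeeded */
	return set_symm(v);       /* may fail after the cache was updated */
}
```

In this model a first call that fails in set_symm() leaves rss_hfunc
updated, so a retry with the same hfunc returns 0 without ever running
set_symm() again.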



^ permalink raw reply

* Re: [PATCH] mptcp: serialize subflow->closing with RX path
From: Paolo Abeni @ 2026-05-07 17:08 UTC (permalink / raw)
  To: Kalpan Jani, matttbe, martineau, mptcp, netdev, linux-kernel
  Cc: shardul.b, janak, kalpanjani009, shardulsb08
In-Reply-To: <20260507072802.612125-1-kalpan.jani@mpiricsoftware.com>

On 5/7/26 9:28 AM, Kalpan Jani wrote:
> There is a race between mptcp_data_ready() (RX path) and
> mptcp_close_ssk() (teardown path) when accessing subflow->closing.
> 
> Currently, mptcp_data_ready() checks subflow->closing before acquiring
> mptcp_data_lock(), while mptcp_close_ssk() may concurrently set
> subflow->closing and 

Are you sure this race can really happen? Both the relevant part of
__mptcp_close_ssk() and mptcp_data_ready() run under the ssk socket
lock.

> @@ -2653,9 +2673,12 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
>  	if (sk->sk_state == TCP_ESTABLISHED)
>  		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
>  
> -	/* Remove any reference from the backlog to this ssk; backlog skbs consume
> +	/* Remove any reference from the backlog to this ssk.
> +	 * Serialize cleanup with RX-side enqueue using mptcp_data_lock().
> +	 * Backlog skbs consume
>  	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
>  	 */
> +	mptcp_data_lock(sk);
>  	list_for_each_entry(skb, &msk->backlog_list, list) {
>  		if (skb->sk != ssk)
>  			continue;

The real problem is here: the backlog is currently traversed without the
data lock (wrong: the mptcp_data_lock() protects backlog updates), while
the ssk is still possibly open, unlocked and can keep receiving packets
and adding them to the BL.

A better solution would be something like the following patch (completely
untested):
---
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 718e910ff23f..68d97926cb81 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -2550,6 +2550,21 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	lock_sock_nested(ssk, SINGLE_DEPTH_NESTING);
 	subflow->closing = 1;
 
+	/* Remove any reference from the backlog to this ssk; backlog skbs consume
+	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
+	 */
+	if (flags & MPTCP_CF_PUSH) {
+		mptcp_data_lock(sk);
+		list_for_each_entry(skb, &msk->backlog_list, list) {
+			if (skb->sk != ssk)
+				continue;
+
+			atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
+			skb->sk = NULL;
+		}
+		mptcp_data_unlock(sk);
+	}
+
 	/* Borrow the fwd allocated page left-over; fwd memory for the subflow
 	 * could be negative at this point, but will be reach zero soon - when
 	 * the data allocated using such fragment will be freed.
@@ -2653,17 +2668,6 @@ void mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	if (sk->sk_state == TCP_ESTABLISHED)
 		mptcp_event(MPTCP_EVENT_SUB_CLOSED, mptcp_sk(sk), ssk, GFP_KERNEL);
 
-	/* Remove any reference from the backlog to this ssk; backlog skbs consume
-	 * space in the msk receive queue, no need to touch sk->sk_rmem_alloc
-	 */
-	list_for_each_entry(skb, &msk->backlog_list, list) {
-		if (skb->sk != ssk)
-			continue;
-
-		atomic_sub(skb->truesize, &skb->sk->sk_rmem_alloc);
-		skb->sk = NULL;
-	}
-
 	/* subflow aborted before reaching the fully_established status
 	 * attempt the creation of the next subflow
 	 */


^ permalink raw reply related

* [PATCH net-next v4 0/2] tcp: rehash onto different local ECMP path on retransmit timeout
From: Neil Spring @ 2026-05-07 17:13 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, ncardwell, kuniyu, davem, kuba, dsahern, pabeni, horms,
	shuah, linux-kselftest, ntspring

Make TCP retransmission timeouts select a different ECMP path for IPv6.

Currently sk_rethink_txhash() changes the socket's txhash on RTO, but the
cached route is reused and the new hash is not propagated into the ECMP
path selection logic.  This series adds __sk_dst_reset() alongside
sk_rethink_txhash() to force a fresh route lookup, and sets fl6->mp_hash
from sk_txhash so fib6_select_path() picks a path based on the new hash.
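
As a rough illustration of why a new txhash can land on a different
path, here is a simplified user-space model of hash-threshold selection.
It is a sketch under assumptions, not the kernel code: set_upper_bounds()
and select_path() are hypothetical helpers, and the bound computation is
an approximation of cumulative weights scaled into the 31-bit range. The
caller must pass n > 0 and a non-zero total weight.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: give each path a cumulative upper bound in
 * [0, 2^31 - 1], proportional to its weight (rounded to nearest).
 */
static void set_upper_bounds(const uint32_t *weights, int32_t *bounds, size_t n)
{
	uint64_t total = 0, acc = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];

	for (i = 0; i < n; i++) {
		acc += weights[i];
		bounds[i] = (int32_t)(((acc << 31) + total / 2) / total - 1);
	}
}

/*
 * Pick the first path whose upper bound covers the hash.  The 32-bit
 * txhash is shifted right by one so it fits the 31-bit bound range.
 */
static size_t select_path(uint32_t txhash, const int32_t *bounds, size_t n)
{
	int32_t mp_hash = (int32_t)(txhash >> 1);
	size_t i;

	for (i = 0; i < n; i++)
		if (mp_hash <= bounds[i])
			return i;

	return n - 1;
}
```

With two equal-weight paths, hashes in the lower half of the 32-bit
space map to the first path and the rest to the second, so a randomly
re-rolled txhash has about an even chance of switching paths in this
model.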

Five selftest scenarios verify the behavior across connection setup and
established flows, forward and reverse path failures, and PLB:

  - SYN retransmission (forward path blocked during setup)
  - SYN/ACK retransmission (reverse path blocked during setup)
  - Midstream RTO (forward path blocked on established connection)
  - Midstream ACK rehash (reverse path blocked on established connection)
  - PLB rehash (ECN-driven congestion on established connection)

Changes since v3: https://lore.kernel.org/netdev/20260505193824.2791642-1-ntspring@meta.com/
- Use __sk_dst_reset() instead of sk_dst_reset() since the socket lock
  is held in all three call sites (Eric Dumazet)
- Guard __sk_dst_reset() with sk->sk_family == AF_INET6 since IPv4 ECMP
  does not use sk_txhash for path selection
- Guard __sk_dst_reset() in tcp_plb_check_rehash() with the return value
  of sk_rethink_txhash()
- Move tcp_rsk(req)->txhash initialization before route_req() in
  tcp_conn_request() to avoid reading uninitialized memory
- Add CONFIG_TCP_CONG_DCTCP=m to selftests/net/config for PLB test
- Skip PLB test gracefully if DCTCP is not available
- Save and restore original congestion control algorithm in PLB test
- Default get_netstat_counter() to 0 when counter is not found
- Skip all tests if tcp_syn_linear_timeouts is not available
- Replace bash/pipe data sources with socat OPEN:/dev/zero for
  cleaner process cleanup
- Fix shellcheck warnings

Changes since v2: https://lore.kernel.org/netdev/20260408070514.1840227-1-ntspring@meta.com/
- Retitle "ECMP" to "local ECMP" to distinguish from remote ECMP
  (Neal Cardwell)
- Add fl6->mp_hash propagation in inet6_sk_rebuild_header() (af_inet6.c),
  covering the dst rebuild path used on established sockets
- Remove incorrect ir_iif update from tcp_check_req() in tcp_minisocks.c;
  the SYN/ACK rehash is already handled by tcp_rtx_synack() re-rolling
  txhash which feeds into inet6_csk_route_req()'s mp_hash
  (Eric Dumazet)
- Add ACK rehash and PLB rehash selftests
- Improve selftest reliability

Changes since v1: https://lore.kernel.org/netdev/20260408002802.2448424-1-ntspring@meta.com/
- Use tcp_rsk(req)->txhash instead of jhash_1word(req->num_retrans, ...)
  for ECMP path selection in inet6_csk_route_req(), making the request
  socket path consistent with the established socket path (Eric Dumazet)
- Add comments explaining the >> 1 shift for 31-bit mp_hash range
- Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
- Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
  better rehash coverage

Neil Spring (2):
  tcp: rehash onto different local ECMP path on retransmit timeout
  selftests: net: add local ECMP rehash test

 net/ipv4/tcp_input.c                       |   6 +-
 net/ipv4/tcp_plb.c                         |   7 +-
 net/ipv4/tcp_timer.c                       |   4 +
 net/ipv6/af_inet6.c                        |   3 +
 net/ipv6/inet6_connection_sock.c           |   6 +
 tools/testing/selftests/net/Makefile       |   1 +
 tools/testing/selftests/net/config         |   1 +
 tools/testing/selftests/net/ecmp_rehash.sh | 582 +++++++++++++++++++++
 8 files changed, 607 insertions(+), 3 deletions(-)
 create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh

-- 
2.53.0-Meta


^ permalink raw reply

* [PATCH net-next v4 1/2] tcp: rehash onto different local ECMP path on retransmit timeout
From: Neil Spring @ 2026-05-07 17:13 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, ncardwell, kuniyu, davem, kuba, dsahern, pabeni, horms,
	shuah, linux-kselftest, ntspring
In-Reply-To: <20260507171319.1259115-1-ntspring@meta.com>

Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB,
and spurious-retransmission events, but the cached route is reused and
the new hash is not propagated into the ECMP path selection logic.  Two
changes are needed to make rehash select a different local ECMP path:

1. Add __sk_dst_reset() alongside sk_rethink_txhash() in
   tcp_write_timeout(), tcp_rcv_spurious_retrans(), and
   tcp_plb_check_rehash() so the cached dst is invalidated and the
   next transmit triggers a fresh route lookup.

2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for
   SYN/ACK retransmits) in inet6_sk_rebuild_header(),
   inet6_csk_route_req(), and inet6_csk_route_socket() so
   fib6_select_path() picks a path based on the new hash.

   It is necessary to update mp_hash explicitly because the
   default ECMP hash derives from fl6->flowlabel via
   np->flow_label, which is not updated from sk_txhash
   (REPFLOW is off by default).  ip6_make_flowlabel() cannot
   help either, as it runs after the route lookup.

The dst reset is guarded by sk->sk_family == AF_INET6 since IPv4
ECMP does not currently use sk_txhash for path selection.

tcp_rsk(req)->txhash initialization is moved before route_req() in
tcp_conn_request() so that inet6_csk_route_req() reads a valid hash
on the initial SYN/ACK.

Signed-off-by: Neil Spring <ntspring@meta.com>
---
 net/ipv4/tcp_input.c             | 6 ++++--
 net/ipv4/tcp_plb.c               | 7 ++++++-
 net/ipv4/tcp_timer.c             | 4 ++++
 net/ipv6/af_inet6.c              | 3 +++
 net/ipv6/inet6_connection_sock.c | 6 ++++++
 5 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7995a89bafc9..8f602a665b71 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5020,8 +5020,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
 	    skb->protocol == htons(ETH_P_IPV6) &&
 	    (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
 	     ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
-	    sk_rethink_txhash(sk))
+	    sk_rethink_txhash(sk)) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
+		__sk_dst_reset(sk);
+	}
 
 	/* Save last flowlabel after a spurious retrans. */
 	tcp_save_lrcv_flowlabel(sk, skb);
@@ -7636,6 +7638,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->af_specific = af_ops;
 	tcp_rsk(req)->ts_off = 0;
 	tcp_rsk(req)->req_usec_ts = false;
+	tcp_rsk(req)->txhash = net_tx_rndhash();
 #if IS_ENABLED(CONFIG_MPTCP)
 	tcp_rsk(req)->is_mptcp = 0;
 #endif
@@ -7717,7 +7720,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	}
 #endif
 	tcp_rsk(req)->snt_isn = isn;
-	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_rsk(req)->syn_tos = TCP_SKB_CB(skb)->ip_dsfield;
 	tcp_openreq_init_rwin(req, sk, dst);
 	sk_rx_queue_set(req_to_sk(req), skb);
diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
index c11a0cd3f8fe..accdd83dfc3d 100644
--- a/net/ipv4/tcp_plb.c
+++ b/net/ipv4/tcp_plb.c
@@ -78,7 +78,12 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
 	if (plb->pause_until)
 		return;
 
-	sk_rethink_txhash(sk);
+	if (sk_rethink_txhash(sk)) {
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6)
+			__sk_dst_reset(sk);
+#endif
+	}
 	plb->consec_cong_rounds = 0;
 	WRITE_ONCE(tcp_sk(sk)->plb_rehash, tcp_sk(sk)->plb_rehash + 1);
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 322db13333c7..24c1c19eda6e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -300,6 +300,10 @@ static int tcp_write_timeout(struct sock *sk)
 	if (sk_rethink_txhash(sk)) {
 		WRITE_ONCE(tp->timeout_rehash, tp->timeout_rehash + 1);
 		__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6)
+			__sk_dst_reset(sk);
+#endif
 	}
 
 	return 0;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 0a88b376141d..90ff4448aa56 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -823,6 +823,9 @@ int inet6_sk_rebuild_header(struct sock *sk)
 	fl6->flowi6_uid = sk_uid(sk);
 	security_sk_classify_flow(sk, flowi6_to_flowi_common(fl6));
 
+	/* >> 1 for 31-bit mp_hash range matching nhc_upper_bound. */
+	fl6->mp_hash = sk->sk_txhash >> 1;
+
 	rcu_read_lock();
 	final_p = fl6_update_dst(fl6, rcu_dereference(np->opt), &np->final);
 	rcu_read_unlock();
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 37534e116899..fc4b75de6af8 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -48,6 +48,9 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
 	fl6->flowi6_uid = sk_uid(sk);
 	security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
 
+	/* >> 1 for 31-bit mp_hash range matching nhc_upper_bound. */
+	fl6->mp_hash = tcp_rsk(req)->txhash >> 1;
+
 	if (!dst) {
 		dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
 		if (IS_ERR(dst))
@@ -70,6 +73,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
 	fl6->saddr = np->saddr;
 	fl6->flowlabel = np->flow_label;
 	IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+
+	/* >> 1 for 31-bit mp_hash range matching nhc_upper_bound. */
+	fl6->mp_hash = sk->sk_txhash >> 1;
 	fl6->flowi6_oif = sk->sk_bound_dev_if;
 	fl6->flowi6_mark = sk->sk_mark;
 	fl6->fl6_sport = inet->inet_sport;
-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v4 2/2] selftests: net: add local ECMP rehash test
From: Neil Spring @ 2026-05-07 17:13 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, ncardwell, kuniyu, davem, kuba, dsahern, pabeni, horms,
	shuah, linux-kselftest, ntspring
In-Reply-To: <20260507171319.1259115-1-ntspring@meta.com>

Add ecmp_rehash.sh with five scenarios verifying that TCP rehash
selects a different local ECMP path for IPv6:

  - SYN retransmission (forward path blocked during setup)
  - SYN/ACK retransmission (reverse path blocked during setup)
  - Midstream RTO (forward path blocked on established connection)
  - Midstream ACK rehash (reverse path blocked on established connection)
  - PLB rehash (ECN-driven congestion on established connection)

Signed-off-by: Neil Spring <ntspring@meta.com>
---
 tools/testing/selftests/net/Makefile       |   1 +
 tools/testing/selftests/net/config         |   1 +
 tools/testing/selftests/net/ecmp_rehash.sh | 582 +++++++++++++++++++++
 3 files changed, 584 insertions(+)
 create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index baa30287cf22..6ec1b24218ad 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -26,6 +26,7 @@ TEST_PROGS := \
 	cmsg_time.sh \
 	double_udp_encap.sh \
 	drop_monitor_tests.sh \
+	ecmp_rehash.sh \
 	fcnal-ipv4.sh \
 	fcnal-ipv6.sh \
 	fcnal-other.sh \
diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index 94d722770420..20fce6e4500b 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -122,6 +122,7 @@ CONFIG_PSAMPLE=m
 CONFIG_RPS=y
 CONFIG_SYSFS=y
 CONFIG_TAP=m
+CONFIG_TCP_CONG_DCTCP=m
 CONFIG_TCP_MD5SIG=y
 CONFIG_TEST_BLACKHOLE_DEV=m
 CONFIG_TEST_BPF=m
diff --git a/tools/testing/selftests/net/ecmp_rehash.sh b/tools/testing/selftests/net/ecmp_rehash.sh
new file mode 100755
index 000000000000..c0603f50abf2
--- /dev/null
+++ b/tools/testing/selftests/net/ecmp_rehash.sh
@@ -0,0 +1,582 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test local ECMP path re-selection on TCP retransmission timeout and PLB.
+#
+# Two namespaces connected by two parallel veth pairs with a 2-way ECMP
+# route.  When a TCP path is blocked (via tc drop) or congested (via
+# netem ECN marking), the kernel rehashes the connection via
+# sk_rethink_txhash() + __sk_dst_reset(), causing the next route lookup
+# to select the other ECMP path.
+#
+# Each rehash re-rolls sk_txhash randomly, giving a 1/2 chance of
+# selecting the alternate path per attempt.  With tcp_syn_retries=25
+# and tcp_syn_linear_timeouts=25 there are 26 attempts, so the
+# probability of never switching is ~(1/2)^25 ~ 3e-8.
+
+source lib.sh
+
+SUBNETS=(a b)
+PORT=9900
+
+ALL_TESTS="
+	test_ecmp_syn_rehash
+	test_ecmp_synack_rehash
+	test_ecmp_midstream_rehash
+	test_ecmp_midstream_ack_rehash
+	test_ecmp_plb_rehash
+"
+
+link_tx_packets_get()
+{
+	local ns=$1; shift
+	local dev=$1; shift
+
+	ip netns exec "$ns" cat "/sys/class/net/$dev/statistics/tx_packets"
+}
+
+# Return the number of packets matched by the tc filter action on a device.
+# When tc drops packets via "action drop", the device's tx_packets is not
+# incremented (packet never reaches veth_xmit), but the tc action maintains
+# its own counter.
+tc_filter_pkt_count()
+{
+	local ns=$1; shift
+	local dev=$1; shift
+
+	ip netns exec "$ns" tc -s filter show dev "$dev" parent 1: 2>/dev/null |
+		awk '/Sent .* pkt/ {
+			for (i=1; i<=NF; i++)
+				if ($i == "pkt") { print $(i-1); exit }
+		}'
+}
+
+# Read a TcpExt counter from /proc/net/netstat in a namespace.
+# Returns 0 if the counter is not found.
+get_netstat_counter()
+{
+	local ns=$1; shift
+	local field=$1; shift
+	local val
+
+	# shellcheck disable=SC2016
+	val=$(ip netns exec "$ns" awk -v key="$field" '
+		/^TcpExt:/ {
+			if (!h) { split($0, n); h=1 }
+			else {
+				split($0, v)
+				for (i in n)
+					if (n[i] == key) print v[i]
+			}
+		}
+	' /proc/net/netstat)
+	echo "${val:-0}"
+}
+
+# Apply netem ECN marking: CE-mark all ECT packets instead of dropping them.
+mark_ecn()
+{
+	local ns=$1; shift
+	local dev=$1; shift
+
+	ip netns exec "$ns" tc qdisc add dev "$dev" root netem loss 100% ecn
+}
+
+# Block TCP (IPv6 next-header = 6) egress, allowing ICMPv6 through.
+block_tcp()
+{
+	local ns=$1; shift
+	local dev=$1; shift
+
+	ip netns exec "$ns" tc qdisc add dev "$dev" root handle 1: prio
+	ip netns exec "$ns" tc filter add dev "$dev" parent 1: \
+		protocol ipv6 prio 1 u32 match u8 0x06 0xff at 6 action drop
+}
+
+unblock_tcp()
+{
+	local ns=$1; shift
+	local dev=$1; shift
+
+	ip netns exec "$ns" tc qdisc del dev "$dev" root 2>/dev/null
+}
+
+# Return success when a device's TX counter exceeds a baseline value.
+dev_tx_packets_above()
+{
+	local ns=$1; shift
+	local dev=$1; shift
+	local baseline=$1; shift
+
+	local cur
+	cur=$(link_tx_packets_get "$ns" "$dev")
+	[ "$cur" -gt "$baseline" ]
+}
+
+# Return success when both devices have dropped at least one TCP packet.
+both_devs_attempted()
+{
+	local ns=$1; shift
+	local dev0=$1; shift
+	local dev1=$1; shift
+
+	local c0 c1
+	c0=$(tc_filter_pkt_count "$ns" "$dev0")
+	c1=$(tc_filter_pkt_count "$ns" "$dev1")
+	[ "${c0:-0}" -ge 1 ] && [ "${c1:-0}" -ge 1 ]
+}
+
+link_tx_packets_total()
+{
+	local ns=$1; shift
+
+	echo $(( $(link_tx_packets_get "$ns" veth0a) +
+		 $(link_tx_packets_get "$ns" veth1a) ))
+}
+
+setup()
+{
+	setup_ns NS1 NS2
+
+	local ns
+	for ns in "$NS1" "$NS2"; do
+		ip netns exec "$ns" sysctl -qw net.ipv6.conf.all.accept_dad=0
+		ip netns exec "$ns" sysctl -qw net.ipv6.conf.default.accept_dad=0
+		ip netns exec "$ns" sysctl -qw net.ipv6.conf.all.forwarding=1
+		ip netns exec "$ns" sysctl -qw net.core.txrehash=1
+	done
+
+	local i sub
+	for i in 0 1; do
+		sub=${SUBNETS[$i]}
+		ip link add "veth${i}a" type veth peer name "veth${i}b"
+		ip link set "veth${i}a" netns "$NS1"
+		ip link set "veth${i}b" netns "$NS2"
+		ip -n "$NS1" addr add "fd00:${sub}::1/64" dev "veth${i}a"
+		ip -n "$NS2" addr add "fd00:${sub}::2/64" dev "veth${i}b"
+		ip -n "$NS1" link set "veth${i}a" up
+		ip -n "$NS2" link set "veth${i}b" up
+	done
+
+	ip -n "$NS1" addr add fd00:ff::1/128 dev lo
+	ip -n "$NS2" addr add fd00:ff::2/128 dev lo
+
+	# Allow many SYN retries at 1-second intervals (linear, no
+	# exponential backoff) so the rehash test has enough attempts
+	# to exercise both ECMP paths.
+	if ! ip netns exec "$NS1" sysctl -qw \
+	     net.ipv4.tcp_syn_linear_timeouts=25; then
+		echo "SKIP: tcp_syn_linear_timeouts not supported"
+		exit "$ksft_skip"
+	fi
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_syn_retries=25
+
+	# Keep the server's request socket alive during the blocking
+	# period so SYN/ACK retransmits continue.
+	ip netns exec "$NS2" sysctl -qw net.ipv4.tcp_synack_retries=25
+
+	ip -n "$NS1" -6 route add fd00:ff::2/128 \
+		nexthop via fd00:a::2 dev veth0a \
+		nexthop via fd00:b::2 dev veth1a
+
+	ip -n "$NS2" -6 route add fd00:ff::1/128 \
+		nexthop via fd00:a::1 dev veth0b \
+		nexthop via fd00:b::1 dev veth1b
+
+	for i in 0 1; do
+		sub=${SUBNETS[$i]}
+		ip netns exec "$NS1" \
+			ping -6 -c1 -W5 "fd00:${sub}::2" &>/dev/null
+		ip netns exec "$NS2" \
+			ping -6 -c1 -W5 "fd00:${sub}::1" &>/dev/null
+	done
+
+	if ! ip netns exec "$NS1" ping -6 -c1 -W5 fd00:ff::2 &>/dev/null; then
+		echo "Basic connectivity check failed"
+		return "$ksft_skip"
+	fi
+}
+
+# Block ALL paths, start a connection, wait until SYNs have been dropped
+# on both interfaces (proving rehash steered the SYN to a new path), then
+# unblock so the connection completes.
+test_ecmp_syn_rehash()
+{
+	RET=0
+
+	block_tcp "$NS1" veth0a
+	defer unblock_tcp "$NS1" veth0a
+	block_tcp "$NS1" veth1a
+	defer unblock_tcp "$NS1" veth1a
+
+	ip netns exec "$NS2" socat \
+		"TCP6-LISTEN:$PORT,bind=[fd00:ff::2],reuseaddr,fork" \
+		EXEC:"echo ESTABLISH_OK" &
+	defer kill_process $!
+
+	wait_local_port_listen "$NS2" "$PORT" tcp
+
+	local rehash_before
+	rehash_before=$(get_netstat_counter "$NS1" TcpTimeoutRehash)
+
+	# Start the connection in the background; it will retry SYNs at
+	# 1-second intervals until an unblocked path is found.
+	# Use -u (unidirectional) to only receive from the server;
+	# sending data back would risk SIGPIPE if the server's EXEC
+	# child has already exited.
+	local tmpfile
+	tmpfile=$(mktemp)
+	defer rm -f "$tmpfile"
+
+	ip netns exec "$NS1" socat -u \
+		"TCP6:[fd00:ff::2]:$PORT,bind=[fd00:ff::1],connect-timeout=60" \
+		STDOUT >"$tmpfile" 2>&1 &
+	local client_pid=$!
+	defer kill_process "$client_pid"
+
+	# Wait until both paths have seen at least one dropped SYN.
+	# This proves sk_rethink_txhash() rehashed the connection from
+	# one ECMP path to the other.
+	slowwait 30 both_devs_attempted "$NS1" veth0a veth1a
+	check_err $? "SYNs did not appear on both paths (rehash not working)"
+	if [ "$RET" -ne 0 ]; then
+		log_test "Local ECMP SYN rehash: establish with blocked paths"
+		return
+	fi
+
+	# Unblock both paths and let the next SYN retransmit succeed.
+	unblock_tcp "$NS1" veth0a
+	unblock_tcp "$NS1" veth1a
+
+	local rc=0
+	wait "$client_pid" || rc=$?
+
+	local result
+	result=$(cat "$tmpfile" 2>/dev/null)
+
+	if [[ "$result" != *"ESTABLISH_OK"* ]]; then
+		check_err 1 "connection failed after unblocking (rc=$rc): $result"
+	fi
+
+	local rehash_after
+	rehash_after=$(get_netstat_counter "$NS1" TcpTimeoutRehash)
+	if [ "$rehash_after" -le "$rehash_before" ]; then
+		check_err 1 "TcpTimeoutRehash counter did not increment"
+	fi
+
+	log_test "Local ECMP SYN rehash: establish with blocked paths"
+}
+
+# Block the server's return paths so SYN/ACKs are dropped.  The client
+# retransmits SYNs at 1-second intervals; each duplicate SYN arriving at
+# the server triggers tcp_rtx_synack() which re-rolls txhash, so the
+# retransmitted SYN/ACK selects a different ECMP return path.
+test_ecmp_synack_rehash()
+{
+	RET=0
+	local port=$((PORT + 2))
+
+	block_tcp "$NS2" veth0b
+	defer unblock_tcp "$NS2" veth0b
+	block_tcp "$NS2" veth1b
+	defer unblock_tcp "$NS2" veth1b
+
+	ip netns exec "$NS2" socat \
+		"TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr,fork" \
+		EXEC:"echo SYNACK_OK" &
+	defer kill_process $!
+
+	wait_local_port_listen "$NS2" "$port" tcp
+
+	# Start the connection; SYNs reach the server (client egress is
+	# open) but SYN/ACKs are dropped on the server's return path.
+	local tmpfile
+	tmpfile=$(mktemp)
+	defer rm -f "$tmpfile"
+
+	ip netns exec "$NS1" socat -u \
+		"TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1],connect-timeout=60" \
+		STDOUT >"$tmpfile" 2>&1 &
+	local client_pid=$!
+	defer kill_process "$client_pid"
+
+	# Wait until both server-side interfaces have dropped at least
+	# one SYN/ACK, proving the server rehashed its return path.
+	slowwait 30 both_devs_attempted "$NS2" veth0b veth1b
+	check_err $? "SYN/ACKs did not appear on both return paths"
+	if [ "$RET" -ne 0 ]; then
+		log_test "Local ECMP SYN/ACK rehash: blocked return path"
+		return
+	fi
+
+	# Unblock and let the connection complete.
+	unblock_tcp "$NS2" veth0b
+	unblock_tcp "$NS2" veth1b
+
+	local rc=0
+	wait "$client_pid" || rc=$?
+
+	local result
+	result=$(cat "$tmpfile" 2>/dev/null)
+
+	if [[ "$result" != *"SYNACK_OK"* ]]; then
+		check_err 1 "connection failed after unblocking (rc=$rc): $result"
+	fi
+
+	log_test "Local ECMP SYN/ACK rehash: blocked return path"
+}
+
+# Establish a data transfer with both paths open, then block the
+# active path.  Verify that data appears on the previously inactive
+# path (proving RTO triggered a rehash) and that TcpTimeoutRehash
+# incremented.
+test_ecmp_midstream_rehash()
+{
+	RET=0
+	local port=$((PORT + 1))
+
+	ip netns exec "$NS2" socat -u \
+		"TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr" - >/dev/null &
+	defer kill_process $!
+
+	wait_local_port_listen "$NS2" "$port" tcp
+
+	local base_tx0 base_tx1
+	base_tx0=$(link_tx_packets_get "$NS1" veth0a)
+	base_tx1=$(link_tx_packets_get "$NS1" veth1a)
+
+	# Continuous data source; timeout caps overall test duration and
+	# must exceed the slowwait below so data keeps flowing.
+	ip netns exec "$NS1" timeout 90 socat -u \
+		OPEN:/dev/zero \
+		"TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1]" &>/dev/null &
+	local client_pid=$!
+	defer kill_process "$client_pid"
+
+	# Wait for enough packets to identify the active path.
+	busywait "$BUSYWAIT_TIMEOUT" until_counter_is \
+			">= $((base_tx0 + base_tx1 + 10))" \
+		link_tx_packets_total "$NS1" > /dev/null
+	check_err $? "no TX activity detected"
+	if [ "$RET" -ne 0 ]; then
+		log_test "Local ECMP midstream rehash: block active path"
+		return
+	fi
+
+	# Find the active path and block it.
+	local current_tx0 current_tx1 active_idx inactive_idx
+	current_tx0=$(link_tx_packets_get "$NS1" veth0a)
+	current_tx1=$(link_tx_packets_get "$NS1" veth1a)
+	if [ $((current_tx0 - base_tx0)) -ge $((current_tx1 - base_tx1)) ]; then
+		active_idx=0; inactive_idx=1
+	else
+		active_idx=1; inactive_idx=0
+	fi
+	local inactive_before
+	inactive_before=$(link_tx_packets_get "$NS1" "veth${inactive_idx}a")
+
+	local rehash_before
+	rehash_before=$(get_netstat_counter "$NS1" TcpTimeoutRehash)
+	# Suppress the existing __dst_negative_advice() in
+	# tcp_write_timeout() so that the patch's sk_dst_reset()
+	# is the only dst-invalidation mechanism on the RTO path.
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_retries1=255
+	defer ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_retries1=3
+
+	block_tcp "$NS1" "veth${active_idx}a"
+	defer unblock_tcp "$NS1" "veth${active_idx}a"
+
+	# Wait for meaningful data on the previously inactive path,
+	# proving RTO triggered a rehash and data actually moved.
+	# Require 100 packets beyond baseline to rule out stray
+	# control packets (ND, etc.).  Allow 60s for multiple RTO
+	# cycles with exponential backoff.
+	slowwait 60 dev_tx_packets_above \
+		"$NS1" "veth${inactive_idx}a" "$((inactive_before + 100))"
+	check_err $? "data did not appear on alternate path after blocking"
+
+	local rehash_after
+	rehash_after=$(get_netstat_counter "$NS1" TcpTimeoutRehash)
+	if [ "$rehash_after" -le "$rehash_before" ]; then
+		check_err 1 "TcpTimeoutRehash counter did not increment"
+	fi
+
+	log_test "Local ECMP midstream rehash: block active path"
+}
+
+# Block the receiver's (NS2) ACK return paths while data flows from
+# NS1 to NS2.  The sender (NS1) times out and retransmits with a new
+# flowlabel; the receiver detects the changed flowlabel via
+# tcp_rcv_spurious_retrans() and rehashes its own txhash so that its
+# ACKs try a different ECMP return path.
+test_ecmp_midstream_ack_rehash()
+{
+	RET=0
+	local port=$((PORT + 3))
+
+	ip netns exec "$NS2" socat -u \
+		"TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr" - >/dev/null &
+	defer kill_process $!
+
+	wait_local_port_listen "$NS2" "$port" tcp
+
+	local base_tx0 base_tx1
+	base_tx0=$(link_tx_packets_get "$NS1" veth0a)
+	base_tx1=$(link_tx_packets_get "$NS1" veth1a)
+
+	# Continuous data source from NS1 to NS2.
+	ip netns exec "$NS1" timeout 120 socat -u \
+		OPEN:/dev/zero \
+		"TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1]" &>/dev/null &
+	defer kill_process $!
+
+	# Wait for data to start flowing.
+	busywait "$BUSYWAIT_TIMEOUT" until_counter_is \
+			">= $((base_tx0 + base_tx1 + 10))" \
+		link_tx_packets_total "$NS1" > /dev/null
+	check_err $? "no TX activity detected"
+	if [ "$RET" -ne 0 ]; then
+		log_test "Local ECMP midstream ACK rehash: blocked return path"
+		return
+	fi
+
+	local rehash_before
+	rehash_before=$(get_netstat_counter "$NS2" TcpDuplicateDataRehash)
+
+	# Block both return paths from NS2 so ACKs are dropped.
+	# Data from NS1 still arrives (tc filter is on egress).
+	block_tcp "$NS2" veth0b
+	defer unblock_tcp "$NS2" veth0b
+	block_tcp "$NS2" veth1b
+	defer unblock_tcp "$NS2" veth1b
+
+	# NS1 will RTO (no ACKs), retransmit with new flowlabel.
+	# NS2 detects the flowlabel change via tcp_rcv_spurious_retrans(),
+	# rehashes, and NS2's ACKs try a different ECMP return path.
+	# Wait until both NS2 interfaces have dropped at least one ACK.
+	slowwait 60 both_devs_attempted "$NS2" veth0b veth1b
+	check_err $? "ACKs did not appear on both return paths"
+
+	local rehash_after
+	rehash_after=$(get_netstat_counter "$NS2" TcpDuplicateDataRehash)
+	if [ "$rehash_after" -le "$rehash_before" ]; then
+		check_err 1 "TcpDuplicateDataRehash counter did not increment"
+	fi
+
+	log_test "Local ECMP midstream ACK rehash: blocked return path"
+}
+
+# Establish a DCTCP data transfer with PLB enabled, then ECN-mark both
+# paths.  Sustained CE marking triggers PLB to call sk_rethink_txhash()
+# + sk_dst_reset(), bouncing the connection between ECMP paths.  Verify
+# data appears on both paths and that TCPPLBRehash incremented.
+test_ecmp_plb_rehash()
+{
+	RET=0
+	local port=$((PORT + 4))
+
+	# DCTCP is a restricted congestion control algorithm.  Setting it
+	# as the default in the init namespace makes it globally
+	# non-restricted (TCP_CONG_NON_RESTRICTED), allowing child
+	# namespaces to use it.
+	local saved_cc
+	saved_cc=$(sysctl -n net.ipv4.tcp_congestion_control)
+	modprobe tcp_dctcp 2>/dev/null
+	if ! sysctl -qw net.ipv4.tcp_congestion_control=dctcp; then
+		log_test_skip "Local ECMP PLB rehash: DCTCP not available"
+		return "$ksft_skip"
+	fi
+	defer sysctl -qw net.ipv4.tcp_congestion_control="$saved_cc"
+
+	# Enable ECN and DCTCP with PLB on the sender.
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_ecn=1
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_congestion_control=dctcp
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_plb_enabled=1
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_plb_rehash_rounds=3
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_plb_cong_thresh=1
+	ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_plb_suspend_rto_sec=0
+	defer ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_ecn=0
+	defer ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_congestion_control=cubic
+	defer ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_plb_enabled=0
+
+	# DCTCP sets ECT on the SYN; the receiver must also use DCTCP
+	# so that tcp_ca_needs_ecn(listen_sk) accepts the ECN
+	# negotiation.
+	ip netns exec "$NS2" sysctl -qw net.ipv4.tcp_ecn=1
+	ip netns exec "$NS2" sysctl -qw net.ipv4.tcp_congestion_control=dctcp
+	defer ip netns exec "$NS2" sysctl -qw net.ipv4.tcp_ecn=0
+	defer ip netns exec "$NS2" sysctl -qw net.ipv4.tcp_congestion_control=cubic
+
+	ip netns exec "$NS2" socat -u \
+		"TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr" - >/dev/null &
+	defer kill_process $!
+
+	wait_local_port_listen "$NS2" "$port" tcp
+
+	local base_tx0 base_tx1
+	base_tx0=$(link_tx_packets_get "$NS1" veth0a)
+	base_tx1=$(link_tx_packets_get "$NS1" veth1a)
+
+	ip netns exec "$NS1" timeout 90 socat -u \
+		OPEN:/dev/zero \
+		"TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1]" &>/dev/null &
+	local client_pid=$!
+	defer kill_process "$client_pid"
+
+	# Wait for data to start flowing before applying ECN marking.
+	busywait "$BUSYWAIT_TIMEOUT" until_counter_is \
+			">= $((base_tx0 + base_tx1 + 10))" \
+		link_tx_packets_total "$NS1" > /dev/null
+	check_err $? "no TX activity detected"
+	if [ "$RET" -ne 0 ]; then
+		log_test "Local ECMP PLB rehash: ECN-marked path"
+		return
+	fi
+
+	# Snapshot TX counters and rehash stats before ECN marking.
+	local pre_ecn_tx0 pre_ecn_tx1
+	pre_ecn_tx0=$(link_tx_packets_get "$NS1" veth0a)
+	pre_ecn_tx1=$(link_tx_packets_get "$NS1" veth1a)
+
+	local plb_before rto_before
+	plb_before=$(get_netstat_counter "$NS1" TCPPLBRehash)
+	rto_before=$(get_netstat_counter "$NS1" TcpTimeoutRehash)
+
+	# CE-mark all data on both paths.  PLB detects sustained
+	# congestion and rehashes, bouncing traffic between paths.
+	mark_ecn "$NS1" veth0a
+	defer unblock_tcp "$NS1" veth0a	# removes the marking rule
+	mark_ecn "$NS1" veth1a
+	defer unblock_tcp "$NS1" veth1a	# removes the marking rule
+
+	# Wait for meaningful data on both paths, proving PLB rehashed
+	# the connection and traffic actually moved.  Require at least
+	# 100 packets beyond the baseline to rule out stray control
+	# packets (ND, etc.) satisfying the check.
+	slowwait 60 dev_tx_packets_above \
+		"$NS1" veth0a "$((pre_ecn_tx0 + 100))"
+	check_err $? "no data on veth0a after ECN marking"
+
+	slowwait 60 dev_tx_packets_above \
+		"$NS1" veth1a "$((pre_ecn_tx1 + 100))"
+	check_err $? "no data on veth1a after ECN marking"
+
+	local plb_after rto_after
+	plb_after=$(get_netstat_counter "$NS1" TCPPLBRehash)
+	rto_after=$(get_netstat_counter "$NS1" TcpTimeoutRehash)
+	if [ "$plb_after" -le "$plb_before" ]; then
+		check_err 1 "TCPPLBRehash counter did not increment"
+	fi
+	if [ "$rto_after" -gt "$rto_before" ]; then
+		check_err 1 "TcpTimeoutRehash incremented; rehash was RTO-driven, not PLB"
+	fi
+
+	log_test "Local ECMP PLB rehash: ECN-marked path"
+}
+
+require_command socat
+
+trap cleanup_all_ns EXIT
+setup || exit $?
+tests_run
+exit "$EXIT_STATUS"
-- 
2.53.0-Meta
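
The two-line header/value layout that get_netstat_counter() parses in the patch above can be tried outside a namespace. This is a minimal sketch, not part of the patch: the helper name netstat_field and the sample counter values are hypothetical, and it reads netstat-formatted text from stdin instead of /proc/net/netstat.

```shell
#!/bin/bash
# Hypothetical helper (not in the patch): extract one TcpExt counter from
# netstat-formatted text on stdin.  The first "TcpExt:" line carries the
# field names; the second carries the values in the same column order.
netstat_field()
{
	local field=$1
	local val

	val=$(awk -v key="$field" '
		/^TcpExt:/ {
			if (!h) { split($0, n); h = 1 }
			else {
				split($0, v)
				for (i in n)
					if (n[i] == key) print v[i]
			}
		}')
	# Fall back to 0 when the counter is absent, as the patch does.
	echo "${val:-0}"
}

# Sample input with invented values, for illustration only.
sample='TcpExt: SyncookiesSent TcpTimeoutRehash TCPPLBRehash
TcpExt: 0 7 2'

printf '%s\n' "$sample" | netstat_field TcpTimeoutRehash
printf '%s\n' "$sample" | netstat_field NoSuchCounter
```

Falling back to 0 for an unknown field keeps the later "-le" comparisons in the tests well-formed even on kernels that lack a given counter.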


^ permalink raw reply related

* Re: [PATCH net] vsock/virtio: fix potential unbounded skb queue
From: Jakub Kicinski @ 2026-05-07 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stefano Garzarella, Michael S. Tsirkin, Arseniy Krasnov,
	Bobby Eshleman, Stefan Hajnoczi, David S . Miller, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, Arseniy Krasnov, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, kvm, virtualization
In-Reply-To: <CANn89iJ+qOFPSUACvda7djOVKGM8t+FfwdA5Ymjxe+g_tJtmnA@mail.gmail.com>

On Thu, 7 May 2026 09:32:24 -0700 Eric Dumazet wrote:
> On Thu, May 7, 2026 at 9:05 AM Stefano Garzarella <sgarzare@redhat.com> wrote:
> > On Thu, May 07, 2026 at 07:33:40AM -0700, Jakub Kicinski wrote:  
> > >We can revert if you think that the risk of regression is high..
> > >Please LMK soon, we can do it before patch reaches Linus.
> >
> > Some tests in tools/testing/vsock/vsock_test.c are failing with this
> > patch applied.
> >
> > Test 18 sometimes fails in this way (I guess because we are
> > dropping packets):
> >
> > 18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch
> >
> > Test 22 is failing 100% in this way:
> >
> > 22 - SOCK_STREAM virtio credit update + SO_RCVLOWAT...send failed:
> > Resource temporarily unavailable
> >
> >
> > With my follow-up patch, which also adds advertisement to the other
> > peer (still a local draft, waiting for Michael's proposal), I saw 22
> > failing because the test expects to be able to use the entire
> > buf_alloc, but now we are reducing it.  So IMO we should do as
> > `__sock_set_rcvbuf()` does and double the buffer size, or at least
> > absorb an overhead equal to the buffer size set by the user via
> > SO_VM_SOCKETS_BUFFER_SIZE (yeah, AF_VSOCK has had its own sockopt
> > since the beginning :-().
> >
> > With that approach tests are passing, but I'd like to stress that
> > patch a bit more. I'll send it tomorrow as a fixup of this patch or,
> > if you prefer to revert, as a standalone patch.
> 
> A plain revert is a big issue: users now know how to crash hypervisors.
> 
> This vulnerability allows a compromised guest (controlling
> virtio_vsock_hdr fields) to continuously flood the host's vsock
> receive queue without triggering any memory accounting limits or
> reader wakeups, resulting in unbounded host kernel memory consumption
> (Host DoS via OOM).
> 
> A vulnerability where a KVM guest can crash or deadlock its host is
> classified as a KVM DoS.
> 
> Am I missing something?

Alright, let's leave it.

^ permalink raw reply

* RE: [PATCH ethtool-next 0/5] ethtool: Add 'pages on|off' option for module EEROM hex dump
From: Danielle Ratson @ 2026-05-07 17:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev@vger.kernel.org, mkubecek@suse.cz, Ido Schimmel,
	Petr Machata
In-Reply-To: <5114542a-7c7f-4759-b13d-6873494944c6@lunn.ch>

> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: Thursday, 7 May 2026 15:23
> To: Danielle Ratson <danieller@nvidia.com>
> Cc: netdev@vger.kernel.org; mkubecek@suse.cz; Ido Schimmel
> <idosch@nvidia.com>; Petr Machata <petrm@nvidia.com>
> Subject: Re: [PATCH ethtool-next 0/5] ethtool: Add 'pages on|off' option for
> module EEROM hex dump
> 
> > In practice, both outputs are often needed together for offline debugging:
> 
> > Output example (values zeroed to omit vendor-specific identifiers):
> >
> > $ ethtool -m swp61 hex on pages on
> > Page: 0x0
> >
> > Offset          Values
> > ------          ------
> > 0x0000:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0010:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0020:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0030:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0040:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0050:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0060:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > 0x0070:         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > Page: 0x0
> 
> Maybe also add JSON? That would make machine parsing of this easier.
> 
> 	Andrew

Hi, 

Thanks for the feedback.
I can add JSON, but I’ll first check that it doesn’t expand the scope of this set too much. If it does, I think a separate series would be better.

Thanks,
Danielle

^ permalink raw reply

* [GIT PULL] Networking for v7.1-rc3
From: Jakub Kicinski @ 2026-05-07 17:21 UTC (permalink / raw)
  To: torvalds; +Cc: kuba, davem, netdev, linux-kernel, pabeni

Hi Linus!

The following changes since commit 08d0d3466664000ba0670e0ef0d447f23459e0d4:

  Merge tag 'net-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2026-04-30 08:45:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-7.1-rc3

for you to fetch changes up to 41ae14071cd7f6a7770e2fe1f8a0859d4c2c6ba4:

  net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init() (2026-05-07 09:08:47 -0700)

----------------------------------------------------------------
Including fixes from Netfilter, IPsec, Bluetooth and WiFi.

Current release - fix to a fix:

 - ipmr: add __rcu to netns_ipv4.mrt, make sure we hold the RCU lock
   in all relevant places

Current release - new code bugs:

 - fixes for the recently added resizable hash tables

 - ipv6: make sure we default IPv6 tunnel drivers to =m now that
   IPv6 itself is built in

 - drv: octeontx2-af: parser/CAM fixes

Previous releases - regressions:

 - phy: micrel: fix LAN8814 QSGMII soft reset

 - wifi: cw1200: revert "Fix locking in error paths"

 - wifi: ath12k: fix crash on WCN7850, due to adding the same queue
   buffer to a list multiple times

Previous releases - always broken:

 - a number of info leak fixes

 - ipv6: implement limits on extension header parsing

 - wifi: a number of fixes for missing bounds checks in the drivers

 - Bluetooth: fixes for races and locking issues

 - af_unix: fix an issue between garbage collection and PEEK

 - af_unix: fix yet another issue with OOB data

 - xfrm: esp: avoid in-place decrypt on shared skb frags

 - netfilter: replace skb_try_make_writable() by skb_ensure_writable()

 - openvswitch: vport: fix race between tunnel creation and linking
   leading to invalid memory accesses (type confusion)

 - drv: amd-xgbe: fix PTP addend overflow causing frozen clock

Misc:

 - sched/isolation: make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
   (for relevant IPVS change)

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

----------------------------------------------------------------
Aaradhana Sahu (1):
      wifi: ath12k: fix OF node refcount imbalance in WSI graph traversal

Aleksander Jan Bajkowski (1):
      net: usb: r8152: add TRENDnet TUC-ET2G v2.0

Alex Cheema (1):
      net: usb: cdc_ncm: add Apple Mac USB-C direct networking quirk

Alyssa Ross (1):
      ipv6: default IPV6_SIT to m

Amir Mohammad Jahangirzad (1):
      wifi: libertas: fix integer underflow in process_cmdrequest()

Andreas Haarmann-Thiemann (1):
      net: ethernet: cortina: Drop half-assembled SKB

Aurelien DESBRIERES (1):
      Bluetooth: hci_uart: Fix NULL deref in recv callbacks when priv is uninitialized

Baochen Qiang (2):
      wifi: ath12k: prepare REO update element only for primary link
      wifi: ath12k: fix peer_id usage in normal RX path

Bart Van Assche (1):
      wifi: cw1200: Revert "Fix locking in error paths"

Benjamin Berg (1):
      wifi: mac80211: use safe list iteration in radar detect work

Bobby Eshleman (1):
      eth: fbnic: fix double-free of PCS on phylink creation failure

Breno Leitao (1):
      netpoll: pass buffer size to egress_dev() to avoid MAC truncation

Catherine (1):
      wifi: mac80211: drop stray 'static' from fast-RX rx_result

Cosmin Ratiu (6):
      tools/selftests: Use a sensible timeout value for iperf3 client
      tools/selftests: Add a VXLAN+IPsec traffic test
      xfrm: Don't clobber inner headers when already set
      net/mlx5e: psp: Fix invalid access on PSP dev registration fail
      net/mlx5e: psp: Expose only a fully initialized priv->psp
      net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable

D. Wythe (1):
      net/smc: fix missing sk_err when TCP handshake fails

Daniel Borkmann (1):
      ipv6: Implement limits on extension header parsing

Daniel Golle (1):
      net: dsa: mt7530: fix .get_stats64 sleeping in atomic context

Daniel Machon (2):
      net: sparx5: fix wrong chip ids for TSN SKUs
      net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init()

Daniel Zahka (3):
      netdevsim: psp: only call nsim_psp_uninit() on PFs
      netdevsim: psp: serialize calls to nsim_psp_uninit()
      netdevsim: psp: rcu protect psp_dev reference

David Carlier (2):
      psp: strip variable-length PSP header in psp_dev_rcv()
      Bluetooth: hci_conn: fix potential UAF in create_big_sync

Dipayaan Roy (4):
      net: mana: check xdp_rxq registration before unreg in mana_destroy_rxq()
      net: mana: Skip WQ object destruction for uninitialized RXQ
      net: mana: remove double CQ cleanup in mana_create_rxq error path
      net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR

Dmitry Baryshkov (1):
      wifi: ath10k: snoc: select POWER_SEQUENCING

Dudu Lu (2):
      Bluetooth: bnep: fix incorrect length parsing in bnep_rx_frame() extension handling
      Bluetooth: l2cap: fix MPS check in l2cap_ecred_reconf_req

Eric Dumazet (12):
      ipmr: prevent info-leak in pmr_cache_report()
      ipv4: igmp: annotate data-races in igmp_heard_query()
      net/sched: sch_pie: annotate more data-races in pie_dump_stats()
      net/sched: sch_cake: annotate data-races in cake_dump_class_stats (I)
      net/sched: sch_cake: annotate data-races in cake_dump_class_stats (II)
      vsock/virtio: fix potential unbounded skb queue
      net: prevent possible UAF in rtnl_prop_list_size()
      net/sched: sch_fq_codel: annotate data-races from fq_codel_dump_class_stats()
      ipv6: fix potential UAF caused by ip6_forward_proxy_check()
      inetpeer: add a missing read_seqretry() in inet_getpeer()
      net/sched: sch_sfq: annotate data-races from sfq_dump_class_stats()
      tcp: tcp_child_process() related UAF

Fernando Fernandez Mancera (3):
      netfilter: nf_socket: skip socket lookup for non-first fragments
      netfilter: nf_tables: skip L4 header parsing for non-first fragments
      netfilter: xtables: fix L4 header parsing for non-first fragments

Florian Westphal (2):
      netfilter: xt_CT: fix usersize for v1 and v2 revision
      netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables

Gregory Fuchedgi (1):
      amd-xgbe: fix PTP addend overflow causing frozen clock

Holger Brunck (2):
      net: wan: fsl_ucc_hdlc: fix uhdlc_memclean
      net: wan: fsl_ucc_hdlc: fix ucc_hdlc_remove

Ilya Maximets (3):
      openvswitch: vport: fix race between tunnel creation and linking
      openvswitch: vport: fix self-deadlock on release of tunnel ports
      selftests: openvswitch: add tests for tunnel vport refcounting

Jakov Novak (1):
      wifi: libertas: notify firmware load wait on disconnect

Jakub Kicinski (21):
      Merge branch 'net-mctp-test-minor-kunit-test-fixes'
      Merge branch 'octeontx2-af-npc-cn20k-mcam-fixes'
      Merge tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
      Merge branch 'ipv6-fix-ecmp-route-failover-on-carrier-loss'
      Merge branch 'replace-direct-dequeue-call-with-qdisc_dequeue_peeked'
      Merge branch 'net-sched-sch_cake-annotate-data-races-in-cake_dump_class_stats-series'
      net: tls: fix silent data drop under pipe back-pressure
      selftests: tls: add test for data loss on small pipe
      Merge branch 'mptcp-misc-fixes-for-v7-1-rc3'
      Merge branch 'bnxt_en-bug-fixes'
      Merge tag 'nf-26-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
      Merge branch 'net-mlx5e-psp-fixes'
      Merge branch 'net-mlx5-fixes-for-socket-direct'
      Merge branch 'xsk-fix-bugs-around-xsk-skb-allocation'
      Merge tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
      Merge tag 'for-net-2026-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
      Merge tag 'ovpn-net-20260504' of https://github.com/OpenVPN/ovpn-net-next
      Merge tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
      selftests: drv-net: fix sort order of makefile and config
      Merge branch 'netdevsim-psp-fix-init-and-uninit-bugs'
      Merge branch 'mptcp-pm-misc-fixes-for-v7-1-rc3'

Jamal Hadi Salim (1):
      net/sched: sch_red: Replace direct dequeue call with peek and qdisc_dequeue_peeked

Jann Horn (1):
      Bluetooth: hci_event: fix memset typo

Jason Xing (8):
      xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
      xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
      xsk: handle NULL dereference of the skb without frags issue
      xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
      xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
      xsk: avoid skb leak in XDP_TX_METADATA case
      xsk: fix xsk_addrs slab leak on multi-buffer error path
      xsk: fix u64 descriptor address truncation on 32-bit architectures

Jeongjun Park (1):
      wifi: rsi: fix kthread lifetime race between self-exit and external-stop

Jeremy Kerr (2):
      net: mctp: test: use a zeroed struct sockaddr_mctp
      net: mctp: test: Use dev_direct_xmit for TX to our test device

Jesper Dangaard Brouer (1):
      veth: fix OOB txq access in veth_poll() with asymmetric queue counts

Jiawen Wu (2):
      net: libwx: fix VF illegal register access
      net: libwx: use request_irq for VF misc interrupt

Jiexun Wang (1):
      af_unix: Reject SIOCATMARK on non-stream sockets

Jiri Slaby (SUSE) (1):
      wifi: ath5k: do not access array OOB

Joey Lu (1):
      net: stmmac: dwmac-nuvoton: fix NULL pointer dereference in nvt_set_phy_intf_sel()

Johannes Berg (5):
      Merge tag 'ath-current-20260427' of git://git.kernel.org/pub/scm/linux/kernel/git/ath/ath
      wifi: mac80211: tests: mark HT check strict
      Merge tag 'ath-current-20260505' of git://git.kernel.org/pub/scm/linux/kernel/git/ath/ath
      wifi: mac80211: remove station if connection prep fails
      wifi: nl80211: fix NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST usage

Julian Anastasov (6):
      ipvs: fixes for the new ip_vs_status info
      ipvs: fix races around the conn_lfactor and svc_lfactor sysctl vars
      ipvs: fix the spin_lock usage for RT build
      ipvs: do not leak dest after get from dest trash
      ipvs: fix races around est_mutex and est_cpulist
      ipvs: fix shift-out-of-bounds in ip_vs_rht_desired_size

Justin Chen (1):
      net: phy: broadcom: Save PHY counters during suspend

Kai Zen (1):
      net: rtnetlink: zero ifla_vf_broadcast to avoid stack infoleak in rtnl_fill_vfinfo

Kalesh AP (1):
      bnxt_en: Check return value of bnxt_hwrm_vnic_cfg

Kuan-Ting Chen (1):
      xfrm: esp: avoid in-place decrypt on shared skb frags

Kuniyuki Iwashima (6):
      selftest: net: Add test for TCP flow failover with ECMP routes.
      af_unix: Set gc_in_progress to true in unix_gc().
      ipmr: Add __rcu to netns_ipv4.mrt.
      ipv6: Fix null-ptr-deref in fib6_mtu().
      ipmr: Call ipmr_fib_lookup() under RCU.
      tcp: Fix dst leak in tcp_v6_connect().

Lorenzo Bianconi (1):
      net: airoha: Move entries to queue head in case of DMA mapping failure in airoha_dev_xmit()

Luiz Augusto von Dentz (1):
      Bluetooth: hci_event: Fix OOB read and infinite loop in hci_le_create_big_complete_evt

Maciej W. Rozycki (1):
      MAINTAINERS: Add self for the DEC LANCE network driver

Maoyi Xie (3):
      ip6_gre: Use cached t->net in ip6erspan_changelink().
      wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS
      wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation

Marek Szyprowski (1):
      wifi: brcmfmac: Fix potential use-after-free issue when stopping watchdog task

Markus Baier (1):
      net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks

Matthieu Baerts (NGI0) (12):
      mptcp: sockopt: increase seq in mptcp_setsockopt_all_sf
      mptcp: pm: kernel: correctly retransmit ADD_ADDR ID 0
      mptcp: pm: ADD_ADDR rtx: allow ID 0
      mptcp: pm: ADD_ADDR rtx: fix potential data-race
      mptcp: pm: ADD_ADDR rtx: always decrease sk refcount
      mptcp: pm: ADD_ADDR rtx: free sk if last
      mptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quicker
      mptcp: pm: ADD_ADDR rtx: skip inactive subflows
      mptcp: pm: ADD_ADDR rtx: return early if no retrans
      mptcp: pm: prio: skip closed subflows
      selftests: mptcp: check output: catch cmd errors
      selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl

Michael Bommarito (6):
      xfrm: ah: account for ESN high bits in async callbacks
      wifi: nl80211: require admin perm on SET_PMK / DEL_PMK
      wifi: mac80211: check ieee80211_rx_data_set_link return in pubsta MLO path
      Bluetooth: virtio_bt: clamp rx length before skb_put
      Bluetooth: virtio_bt: validate rx pkt_type header length
      Bluetooth: HIDP: serialise l2cap_unregister_user via hidp_session_sem

Michael Chan (2):
      bnxt_en: Delay for 5 seconds after AER DPC for all chips
      bnxt_en: Set bp->max_tpa according to what the FW supports

Michal Kosiorek (1):
      xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete

Mikhail Gavrilov (1):
      Bluetooth: l2cap: defer conn param update to avoid conn->lock/hdev->lock inversion

Nan Li (1):
      net/rds: handle zerocopy send cleanup before the message is queued

Nicolas Escande (1):
      wifi: ath12k: fix leak in some ath12k_wmi_xxx() functions

Pablo Neira Ayuso (8):
      netfilter: replace skb_try_make_writable() by skb_ensure_writable()
      netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
      netfilter: x_tables: add .check_hooks to matches and targets
      netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate
      netfilter: flowtable: ensure sufficient headroom in xmit path
      netfilter: flowtable: fix inline vlan encapsulation in xmit path
      netfilter: flowtable: fix inline pppoe encapsulation in xmit path
      netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header

Paolo Abeni (3):
      mptcp: fix rx timestamp corruption on fastopen
      Merge branch 'net-mana-fix-mana_destroy_rxq-cleanup-for-partial-rxq-init'
      Merge branch 'openvswitch-fix-self-deadlock-on-release-of-tunnel-vports'

Pauli Virtanen (2):
      Bluetooth: SCO: fix sleeping under spinlock in sco_conn_ready
      Bluetooth: SCO: hold sk properly in sco_conn_ready

Pavan Chebbi (1):
      bnxt_en: Use absolute target ns from ptp_clock_request

Pavitra Jha (1):
      net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler

Pengpeng Hou (1):
      Bluetooth: RFCOMM: pull credit byte with skb_pull_data()

Qingfang Deng (1):
      ovpn: reset MAC header before passing skb up

Ralf Lici (2):
      ovpn: ensure packet delivery happens with BH disabled
      selftests: ovpn: reduce ping count in test.sh

Rameshkumar Sundaram (1):
      wifi: ath12k: initialize RSSI dBm conversion event state

Ratheesh Kannoth (10):
      octeontx2-af: npc: cn20k: Propagate MCAM key-type errors on cn20k
      octeontx2-af: npc: cn20k: Drop debugfs_create_file() error checks in init
      octeontx2-af: npc: cn20k: Propagate errors in defrag MCAM alloc rollback
      octeontx2-af: npc: cn20k: Fix target map and rule
      octeontx2-af: npc: cn20k: Clear MCAM entries by index and key width
      octeontx2-af: npc: cn20k: Fix bank value
      octeontx2-af: npc: cn20k: Fix MCAM actions read
      octeontx2-af: npc: cn20k: Initialize default-rule index outputs up front
      octeontx2-af: npc: cn20k: Tear down default MCAM rules explicitly on free
      octeontx2-af: npc: cn20k: Reject missing default-rule MCAM indices

Rio Liu (1):
      wifi: mac80211: skip ieee80211_verify_sta_ht_mcs_support check in non-strict mode

Robert Marko (1):
      net: phy: micrel: fix LAN8814 QSGMII soft reset

Ruijie Li (1):
      xfrm: provide message size for XFRM_MSG_MAPPING

Sagarika Sharma (1):
      ipv6: update route serial number on NETDEV_CHANGE

Sai Teja Aluvala (1):
      Bluetooth: btintel_pcie: treat boot stage bit 12 as warning

SeungJu Cheon (2):
      Bluetooth: ISO: Fix data-race on dst in iso_sock_connect()
      Bluetooth: ISO: Fix data-race on iso_pi(sk) in socket and HCI event paths

Shardul Bankar (2):
      mptcp: use MPJoinSynAckHMacFailure for SynAck HMAC failure
      mptcp: use MPTCP_RST_EMPTCP for ACK HMAC validation failure

Shay Drory (4):
      net/mlx5: SD: Serialize init/cleanup
      net/mlx5: SD, Keep multi-pf debugfs entries on primary
      net/mlx5e: SD, Fix missing cleanup on probe error
      net/mlx5e: SD, Fix race condition in secondary device probe/remove

Shitalkumar Gandhi (1):
      net: rtsn: fix mdio_node leak in rtsn_mdio_alloc()

Siwei Zhang (3):
      Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_state_change_cb()
      Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_get_sndtimeo_cb()
      Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_new_connection_cb()

Tristan Madani (3):
      wifi: b43: enforce bounds check on firmware key index in b43_rx()
      wifi: b43legacy: enforce bounds check on firmware key index in RX path
      Bluetooth: btmtk: validate WMT event SKB length before struct access

Victor Nogueira (1):
      selftests/tc-testing: Add tests that force red and sfb to dequeue from child's gso_skb

Victor Nogueria (1):
      net/sched: sch_sfb: Replace direct dequeue call with peek and qdisc_dequeue_peeked

Waiman Long (2):
      ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU
      sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN

Wei Fang (1):
      net: enetc: fix VSI mailbox timeout handling and DMA lifecycle

Weiming Shi (1):
      netfilter: nft_fwd_netdev: use recursion counter in neigh egress path

Yilin Zhu (1):
      ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()

Yu-Hsiang Tseng (1):
      wifi: ath12k: use lockdep_assert_in_rcu_read_lock() for RCU assertions

 MAINTAINERS                                        |   6 +
 drivers/bluetooth/btintel_pcie.c                   |  13 +-
 drivers/bluetooth/btintel_pcie.h                   |   2 +-
 drivers/bluetooth/btmtk.c                          |  15 +-
 drivers/bluetooth/hci_ath.c                        |   3 +
 drivers/bluetooth/hci_bcsp.c                       |   3 +
 drivers/bluetooth/hci_h4.c                         |   3 +
 drivers/bluetooth/hci_h5.c                         |   3 +
 drivers/bluetooth/virtio_bt.c                      |  39 ++-
 drivers/net/dsa/mt7530.c                           |  75 +++-
 drivers/net/dsa/mt7530.h                           |   8 +
 drivers/net/ethernet/airoha/airoha_eth.c           |   6 +-
 drivers/net/ethernet/amd/xgbe/xgbe.h               |   4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |  16 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c      |  29 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c      |  10 +-
 drivers/net/ethernet/cortina/gemini.c              |   5 +
 drivers/net/ethernet/freescale/enetc/enetc.h       |   1 +
 drivers/net/ethernet/freescale/enetc/enetc_vf.c    |  42 ++-
 .../ethernet/marvell/octeontx2/af/cn20k/debugfs.c  |  33 +-
 .../net/ethernet/marvell/octeontx2/af/cn20k/npc.c  | 382 ++++++++++++++-------
 .../net/ethernet/marvell/octeontx2/af/cn20k/npc.h  |  24 +-
 .../net/ethernet/marvell/octeontx2/af/rvu_nix.c    |   3 +
 .../net/ethernet/marvell/octeontx2/af/rvu_npc.c    | 231 +++++++++++--
 .../net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c |  30 +-
 .../net/ethernet/mellanox/mlx5/core/en_accel/psp.c |  36 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  30 +-
 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c   | 114 +++++-
 drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h   |   2 +
 drivers/net/ethernet/meta/fbnic/fbnic_netdev.c     |   3 +-
 .../net/ethernet/microchip/sparx5/sparx5_main.h    |  10 +-
 .../net/ethernet/microchip/sparx5/sparx5_port.c    |   3 +-
 drivers/net/ethernet/microsoft/mana/gdma_main.c    |  40 ++-
 drivers/net/ethernet/microsoft/mana/mana_en.c      |  10 +-
 drivers/net/ethernet/microsoft/mana/shm_channel.c  |   5 -
 drivers/net/ethernet/renesas/rtsn.c                |   6 +-
 .../net/ethernet/stmicro/stmmac/dwmac-nuvoton.c    |   2 +
 drivers/net/ethernet/wangxun/libwx/wx_hw.c         |   7 +-
 drivers/net/ethernet/wangxun/libwx/wx_vf_common.c  |   4 +-
 drivers/net/netdevsim/netdev.c                     |   3 +-
 drivers/net/netdevsim/netdevsim.h                  |   4 +-
 drivers/net/netdevsim/psp.c                        |  65 +++-
 drivers/net/ovpn/io.c                              |   7 +
 drivers/net/phy/bcm-phy-lib.c                      |   9 +
 drivers/net/phy/bcm-phy-lib.h                      |   1 +
 drivers/net/phy/bcm7xxx.c                          |  14 +
 drivers/net/phy/broadcom.c                         |   5 +
 drivers/net/phy/micrel.c                           |  15 +-
 drivers/net/usb/asix_devices.c                     |   2 +
 drivers/net/usb/cdc_ncm.c                          |   8 +
 drivers/net/usb/r8152.c                            |   1 +
 drivers/net/veth.c                                 |   3 +-
 drivers/net/wan/fsl_ucc_hdlc.c                     |   9 +-
 drivers/net/wireless/ath/ath10k/Kconfig            |   1 +
 drivers/net/wireless/ath/ath12k/core.c             |  77 +++--
 drivers/net/wireless/ath/ath12k/dp_rx.c            |   5 +-
 drivers/net/wireless/ath/ath12k/mac.c              |   2 +-
 drivers/net/wireless/ath/ath12k/p2p.c              |   2 +-
 drivers/net/wireless/ath/ath12k/wmi.c              | 105 +++++-
 drivers/net/wireless/ath/ath5k/base.c              |   3 +-
 drivers/net/wireless/broadcom/b43/xmit.c           |   3 +-
 drivers/net/wireless/broadcom/b43legacy/xmit.c     |   3 +-
 .../wireless/broadcom/brcm80211/brcmfmac/sdio.c    |   6 +-
 drivers/net/wireless/marvell/libertas/if_usb.c     |   6 +-
 drivers/net/wireless/rsi/rsi_common.h              |   5 +-
 drivers/net/wireless/st/cw1200/pm.c                |   2 -
 drivers/net/wwan/t7xx/t7xx_modem_ops.c             |  20 +-
 drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c         |  18 +-
 drivers/net/wwan/t7xx/t7xx_port_proxy.h            |   2 +-
 include/linux/netfilter/x_tables.h                 |   8 +
 include/linux/sched/isolation.h                    |   6 +-
 include/net/bluetooth/hci_core.h                   |   2 +-
 include/net/dropreason-core.h                      |   6 +
 include/net/ip_vs.h                                |  31 +-
 include/net/ipv6.h                                 |   3 +
 include/net/mana/shm_channel.h                     |   6 +
 include/net/netfilter/nf_dup_netdev.h              |  13 +
 include/net/netfilter/nf_flow_table.h              |   4 +-
 include/net/netns/ipv4.h                           |   2 +-
 net/bluetooth/bnep/core.c                          |  13 +-
 net/bluetooth/hci_conn.c                           | 124 +++++--
 net/bluetooth/hci_event.c                          |  31 +-
 net/bluetooth/hidp/core.c                          |  27 +-
 net/bluetooth/iso.c                                |  56 +--
 net/bluetooth/l2cap_core.c                         |  14 +-
 net/bluetooth/l2cap_sock.c                         |   9 +
 net/bluetooth/rfcomm/core.c                        |   7 +-
 net/bluetooth/sco.c                                |  62 ++--
 net/core/dev.c                                     |   2 +-
 net/core/netpoll.c                                 |  23 +-
 net/core/rtnetlink.c                               |   1 +
 net/ipv4/ah4.c                                     |  14 +-
 net/ipv4/esp4.c                                    |   3 +-
 net/ipv4/igmp.c                                    |  58 ++--
 net/ipv4/inetpeer.c                                |   3 +-
 net/ipv4/ip_output.c                               |   2 +
 net/ipv4/ipmr.c                                    |  10 +-
 net/ipv4/netfilter/nf_socket_ipv4.c                |   3 +
 net/ipv4/tcp_ipv4.c                                |  14 +-
 net/ipv4/tcp_minisocks.c                           |   2 +-
 net/ipv6/Kconfig                                   |   4 +-
 net/ipv6/ah6.c                                     |  14 +-
 net/ipv6/esp6.c                                    |   3 +-
 net/ipv6/exthdrs_core.c                            |   7 +
 net/ipv6/ip6_gre.c                                 |   5 +-
 net/ipv6/ip6_input.c                               |   5 +
 net/ipv6/ip6_output.c                              |   5 +
 net/ipv6/ip6_tunnel.c                              |   4 +
 net/ipv6/netfilter/nf_socket_ipv6.c                |   5 +-
 net/ipv6/route.c                                   |   5 +
 net/ipv6/tcp_ipv6.c                                |  17 +-
 net/ipv6/xfrm6_protocol.c                          |   4 +-
 net/mac80211/mlme.c                                |  18 +-
 net/mac80211/rx.c                                  |   6 +-
 net/mac80211/tests/chan-mode.c                     |   1 +
 net/mac80211/util.c                                |   4 +-
 net/mctp/test/route-test.c                         |   2 +-
 net/mctp/test/utils.c                              |   2 +-
 net/mptcp/fastopen.c                               |   4 +-
 net/mptcp/pm.c                                     |  62 ++--
 net/mptcp/pm_kernel.c                              |  13 +-
 net/mptcp/sockopt.c                                |   4 +
 net/mptcp/subflow.c                                |   4 +-
 net/netfilter/ipvs/ip_vs_conn.c                    |  74 ++--
 net/netfilter/ipvs/ip_vs_core.c                    |   2 +-
 net/netfilter/ipvs/ip_vs_ctl.c                     | 164 ++++++---
 net/netfilter/ipvs/ip_vs_est.c                     |  83 +++--
 net/netfilter/nf_dup_netdev.c                      |  16 -
 net/netfilter/nf_flow_table_core.c                 |   1 +
 net/netfilter/nf_flow_table_ip.c                   | 151 ++++++--
 net/netfilter/nf_flow_table_path.c                 |   7 +-
 net/netfilter/nf_tables_api.c                      |  35 +-
 net/netfilter/nf_tables_core.c                     |   2 +-
 net/netfilter/nft_compat.c                         |  45 ++-
 net/netfilter/nft_exthdr.c                         |   2 +-
 net/netfilter/nft_fwd_netdev.c                     |  29 +-
 net/netfilter/nft_osf.c                            |   2 +-
 net/netfilter/nft_tproxy.c                         |   8 +-
 net/netfilter/x_tables.c                           |  79 ++++-
 net/netfilter/xt_CT.c                              |   8 +-
 net/netfilter/xt_TCPMSS.c                          |  33 +-
 net/netfilter/xt_TPROXY.c                          |  11 +-
 net/netfilter/xt_addrtype.c                        |  25 +-
 net/netfilter/xt_devgroup.c                        |  18 +-
 net/netfilter/xt_ecn.c                             |   4 +
 net/netfilter/xt_hashlimit.c                       |   4 +-
 net/netfilter/xt_osf.c                             |   3 +
 net/netfilter/xt_physdev.c                         |  24 +-
 net/netfilter/xt_policy.c                          |  24 +-
 net/netfilter/xt_set.c                             |  39 ++-
 net/netfilter/xt_tcpmss.c                          |   4 +
 net/openvswitch/vport-geneve.c                     |   5 +-
 net/openvswitch/vport-gre.c                        |   5 +-
 net/openvswitch/vport-netdev.c                     |  64 ++--
 net/openvswitch/vport-netdev.h                     |   2 +-
 net/openvswitch/vport-vxlan.c                      |   5 +-
 net/psp/psp_main.c                                 |  42 ++-
 net/rds/message.c                                  |  20 +-
 net/sched/sch_cake.c                               | 153 +++++----
 net/sched/sch_fq_codel.c                           |  39 ++-
 net/sched/sch_pie.c                                |  14 +-
 net/sched/sch_red.c                                |   2 +-
 net/sched/sch_sfb.c                                |   2 +-
 net/sched/sch_sfq.c                                |  48 +--
 net/smc/af_smc.c                                   |   8 +-
 net/tls/tls_sw.c                                   |   6 +-
 net/unix/af_unix.c                                 |   3 +
 net/unix/garbage.c                                 |   6 +-
 net/vmw_vsock/virtio_transport_common.c            |   4 +-
 net/wireless/nl80211.c                             |  27 ++
 net/wireless/pmsr.c                                |   2 +-
 net/xdp/xsk.c                                      | 115 ++++---
 net/xdp/xsk_buff_pool.c                            |   3 +
 net/xfrm/xfrm_output.c                             |  20 +-
 net/xfrm/xfrm_state.c                              |  12 +-
 net/xfrm/xfrm_user.c                               |   1 +
 tools/testing/selftests/drivers/net/hw/Makefile    |   1 +
 tools/testing/selftests/drivers/net/hw/config      |   5 +
 .../selftests/drivers/net/hw/ipsec_vxlan.py        | 204 +++++++++++
 tools/testing/selftests/drivers/net/lib/py/load.py |   5 +-
 tools/testing/selftests/net/Makefile               |   1 +
 tools/testing/selftests/net/mptcp/mptcp_lib.sh     |  16 +-
 tools/testing/selftests/net/mptcp/pm_netlink.sh    |  20 +-
 .../selftests/net/openvswitch/openvswitch.sh       |  37 ++
 .../testing/selftests/net/openvswitch/ovs-dpctl.py |  19 +-
 tools/testing/selftests/net/ovpn/test.sh           |   4 +-
 tools/testing/selftests/net/tcp_ecmp_failover.sh   | 216 ++++++++++++
 tools/testing/selftests/net/tls.c                  |  43 +++
 .../tc-testing/tc-tests/infra/qdiscs.json          | 148 ++++++++
 189 files changed, 3485 insertions(+), 1160 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
 create mode 100755 tools/testing/selftests/net/tcp_ecmp_failover.sh
