Netdev List
* Re: [PATCH net-next v7 2/2] net: ti: icssg-prueth: Add ethtool ops for Frame Preemption MAC Merge
From: Jakub Kicinski @ 2026-06-16 15:07 UTC
  To: Meghana Malladi
  Cc: elfring, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, arnd, basharath, afd, parvathi, vladimir.oltean,
	rogerq, danishanwar, pabeni, edumazet, davem, andrew+netdev,
	linux-arm-kernel, netdev, linux-kernel, srk, vigneshr
In-Reply-To: <d0123269-b1e8-4fba-94b0-b94d3d9a5405@ti.com>

On Tue, 16 Jun 2026 18:24:22 +0530 Meghana Malladi wrote:
> >> Could the firmware-register lookup table used by emac_get_stat_by_name()
> >> be separated from the ethtool -S string table, so the new preemption
> >> counters feed get_mm_stats without also showing up under ethtool -S?  
> > 
> > This -- not sure about the other complaints but this one looks legit.  
> 
> I agree that this is legit, but right now there is no placeholder other
> than pa stats to put the MAC Merge firmware counters. I believe the
> effort needs to go into restructuring the hardware and firmware stats
> implementation to address this issue.

icssg_all_miig_stats has a true / false that looks like it's supposed
to serve the same purpose? Maybe I don't understand what you're trying
to say.


* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Jakub Kicinski @ 2026-06-16 15:11 UTC
  To: Sebastian Andrzej Siewior
  Cc: Petr Mladek, John Ogness, Sergey Senozhatsky, Peter Zijlstra,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616103529.Yh9Dxsjp@linutronix.de>

On Tue, 16 Jun 2026 12:35:29 +0200 Sebastian Andrzej Siewior wrote:
> On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> > On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:  
> > > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > > +	local_bh_disable();
> > > + 	poll_napi(dev);
> > > +	_local_bh_enable();  
> > 
> > tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> > softirq into not waking ksoftirqd? The problematic path is:
> > 
> >   scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)
> > 
> > so the softirq may never get serviced.
> > 
> > In netcons we try to avoid touching the network driver if the Tx path
> > locks are already held. Ideally we'd do something similar with the
> > scheduler. Try to do bare minimum if we may be in the scheduler.
> > Failing that - don't poll the driver if we were called with irqs
> > already disabled.
> > 
> > Or maybe we only poll from console->write_thread ?  
> 
> So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> to NBCON console infrastructure"), because from then on writes are
> deferred to the nbcon thread. So this is purely about -stable in this case.
> 
> Looking at the patch, the amount of comments vs code changes looks
> somewhat hackish. That ifdef for PREEMPT_RT is not needed because on
> PREEMPT_RT we have either nbcon or the legacy console (including
> netconsole before the mentioned commit) wrapped in a dedicated thread
> (via force_legacy_kthread()).
> That means in both cases the flow never ends there and the problem is
> limited to !PREEMPT_RT.
> 
> Now. The scheduler usually does printk_deferred() because of the rq lock
> so it does not deadlock for various reasons. It is kind of a pity that
> the various WARN macros don't do that.
> I don't think that patch is enough. It works around the problem in this
> scenario but should the NIC driver invoke schedule_work() then we are
> back here again.
> Should the network driver acquire a lock then lockdep might observe
> rq -> driver-lock and then driver-lock -> rq and yell deadlock (CPU1
> doing AB and CPU2 doing BA). This also includes other console drivers, so
> it is not limited to netconsole.
> 
> Point being made is that we should avoid the callchain:
> 
> |  console_unlock
> |  vprintk_emit
> |  __warn
> |  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
> |  put_prev_entity
> |  put_prev_task_fair
> |  __schedule
> 
> basically a printk under the rq lock.
> 
> We could add printk_deferred_enter/exit() to all the rq_lock() variants.
> I think PeterZ loves this the most. And Greg will appreciate it too
> while backporting because of all the context changes.
> 
> We could also introduce WARN_ON_DEFERRED + variants which do the
> printk_deferred_enter/exit() thingy around the printk and replace
> all the WARNs in kernel/sched/.
> I *think* the tty/console layer has also a deadlock problem where it
> holds locks and then the WARN(), that never triggers, asks for the same
> locks again so we might have a second user…
> 
> Adding sched and printk folks for opinions while eyeballing
> WARN_ON_DEFERRED().

Thanks a lot for looking into this! To be clear - the printk_deferred /
WARN_DEFERRED would be just for stable? Or there's still some
sensitivity even with nbcon?


* Re: [PATCH v2] net: macb: add TX stall timeout callback to recover from lost TSTART write
From: Théo Lebrun @ 2026-06-16 15:07 UTC
  To: Andrea della Porta, netdev, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel,
	Nicolai Buchwitz
  Cc: Lukasz Raczylo, Steffen Jaeckel
In-Reply-To: <468f480454a314303bac6a54780b153f689f2267.1781598350.git.andrea.porta@suse.com>

Hello Andrea,

On Tue Jun 16, 2026 at 3:23 PM CEST, Andrea della Porta wrote:
> From: Lukasz Raczylo <lukasz@raczylo.com>
>
> The MACB found in the Raspberry Pi RP1 suffers from sporadic stalls on
> the TX queue.
> While the exact root cause is not yet fully understood, it is likely
> related to a hardware issue where a TSTART write to the NCR register
> is missed, preventing the transmission from being kicked off.
>
> Implement a timeout callback to handle TX queue stalls, triggering the
> existing restart mechanism to recover.
>
> Link: https://lore.kernel.org/all/20260514215459.36109-1-lukasz@raczylo.com/
> Fixes: dc110d1b23564 ("net: cadence: macb: Add support for Raspberry Pi RP1 ethernet controller")
> Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
> Co-developed-by: Steffen Jaeckel <sjaeckel@suse.de>
> Signed-off-by: Steffen Jaeckel <sjaeckel@suse.de>
> Co-developed-by: Andrea della Porta <andrea.porta@suse.com>
> Signed-off-by: Andrea della Porta <andrea.porta@suse.com>

Thanks for this V2.

Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>

Any news from the Raspberry Pi community about this bug investigation?

Thanks,

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com



* Re: [PATCH] net: ethtool: mm: Increase FPE verification retry count
From: Vladimir Oltean @ 2026-06-16 15:16 UTC
  To: Simon Horman
  Cc: muhammad.nazim.amirul.nazle.asmade, netdev, andrew, kuba, davem,
	edumazet, pabeni, faizal.abdul.rahim, linux-kernel
In-Reply-To: <20260616071925.GA800687@horms.kernel.org>

On Tue, Jun 16, 2026 at 08:19:25AM +0100, Simon Horman wrote:
> + Vladimir
> 
> On Mon, Jun 15, 2026 at 12:24:36AM -0700, muhammad.nazim.amirul.nazle.asmade@altera.com wrote:
> > From: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
> > 
> > The current FPE verification retry count is set to 3. However,
> > the IEEE 802.3br standard does not specify a fixed value for this.
> > A retry count of 3 may be insufficient when the remote device is
> > slow to respond during link-up. Increase the retry count to 20 to
> > improve robustness.
> > 
> > Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
> > Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
> 
> Vladimir, I'm wondering if you could take a look at this one.

IEEE 802.3br is an obsolete standard, I don't even have access to it.

IEEE 802.3-2022 is the current one for the MAC Merge layer. Clause
99.4.7.2 Constants states:

verifyLimit: the integer 3, the number of verification attempts

I have nothing against, in principle, making verifyLimit configurable
beyond what IEEE 802.3 specifies, for debugging purposes or non-standard
applications, but keep the default at 3.
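
To make the "configurable, but default 3" idea concrete, here is a minimal
userspace sketch. It is not the ethtool MM code: mm_verify(),
send_verify_fn, slow_peer() and MM_VERIFY_LIMIT_DEFAULT are hypothetical
names invented for illustration, and passing 0 to select the default is an
assumption of this sketch.

```c
#include <stdbool.h>

#define MM_VERIFY_LIMIT_DEFAULT 3	/* verifyLimit per IEEE 802.3-2022, 99.4.7.2 */

/* Hypothetical callback: sends one verify frame, returns true on a response. */
typedef bool (*send_verify_fn)(void *ctx);

/*
 * Try verification up to `limit` times; limit == 0 selects the default.
 * Returns the number of attempts used on success, or -1 on failure.
 */
static int mm_verify(send_verify_fn send, void *ctx, unsigned int limit)
{
	unsigned int attempt;

	if (limit == 0)
		limit = MM_VERIFY_LIMIT_DEFAULT;

	for (attempt = 1; attempt <= limit; attempt++) {
		if (send(ctx))
			return (int)attempt;
	}
	return -1;
}

/* Example peer that responds only after a given number of frames. */
static bool slow_peer(void *ctx)
{
	int *frames_left = ctx;

	return --(*frames_left) <= 0;
}
```

With this shape a slow link partner can be accommodated by an explicit
override, while callers that pass 0 keep the standard value.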


* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Sabrina Dubroca @ 2026-06-16 15:19 UTC
  To: syzbot
  Cc: davem, edumazet, horms, john.fastabend, kuba, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <6a315d48.b0403584.28d0ff.0002.GAE@google.com>

2026-06-16, 07:27:20 -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    f6033078a9e6 ip6_tunnel: annotate data-races around t->err..
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=122a98ae580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=8697a140486f5628
> dashboard link: https://syzkaller.appspot.com/bug?extid=cca46a9d1276f38af2ae
> compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/7af9eb2b9b5a/disk-f6033078.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/4b7e03b76e68/vmlinux-f6033078.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/38042dd09caa/bzImage-f6033078.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+cca46a9d1276f38af2ae@syzkaller.appspotmail.com
> 
> ------------[ cut here ]------------
> err >= 0
> WARNING: net/tls/tls_sw.c:73 at tls_err_abort+0x5d/0x80 net/tls/tls_sw.c:73, CPU#0: kworker/0:11/6099
> Modules linked in:
> CPU: 0 UID: 0 PID: 6099 Comm: kworker/0:11 Not tainted syzkaller #0 PREEMPT(full) 
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
> Workqueue: pencrypt_serial padata_serial_worker
> RIP: 0010:tls_err_abort+0x5d/0x80 net/tls/tls_sw.c:73
> Code: e8 03 48 b9 00 00 00 00 00 fc ff df 0f b6 04 08 84 c0 75 1b 89 ab 9c 01 00 00 48 89 df 5b 5d e9 c9 a2 32 ff e8 a4 60 8a f7 90 <0f> 0b 90 eb c3 89 f9 80 e1 07 80 c1 03 38 c1 7c d9 e8 1d 9f f5 f7
> RSP: 0018:ffffc900069379e0 EFLAGS: 00010293
> RAX: ffffffff8a3adf8c RBX: ffff88807d1e0d80 RCX: ffff888058bfdd00
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
> RBP: 0000000000000000 R08: ffffe8ffffc513e3 R09: 1ffffd1ffff8a27c
> R10: dffffc0000000000 R11: ffffffff8a3c4d70 R12: ffff888028eaf400
> R13: ffff88804441030c R14: dffffc0000000000 R15: ffff888028eaf460
> FS:  0000000000000000(0000) GS:ffff8881252a0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f521f503ff8 CR3: 0000000086fc2000 CR4: 00000000003526f0
> Call Trace:
>  <TASK>
>  tls_encrypt_done+0x223/0x480 net/tls/tls_sw.c:500


	/* Check if error is previously set on socket */
	if (err || sk->sk_err) {
		rec = NULL;

		/* If err is already set on socket, return the same code */
		if (sk->sk_err) {
			ctx->async_wait.err = -sk->sk_err;
		} else {
			ctx->async_wait.err = err;
			tls_err_abort(sk, err);
		}
	}

I suspect err==0, and sock_error() consumed sk_err in between (the
alternative would be err > 0).

Something like this?

-------- 8< --------
@@ -473,6 +473,7 @@ static void tls_encrypt_done(void *data, int err)
 	struct scatterlist *sge;
 	struct sk_msg *msg_en;
 	struct sock *sk;
+	int sk_err;
 
 	if (err == -EINPROGRESS) /* see the comment in tls_decrypt_done() */
 		return;
@@ -489,12 +490,13 @@ static void tls_encrypt_done(void *data, int err)
 	sge->length += prot->prepend_size;
 
 	/* Check if error is previously set on socket */
-	if (err || sk->sk_err) {
+	sk_err = READ_ONCE(sk->sk_err);
+	if (err || sk_err) {
 		rec = NULL;
 
 		/* If err is already set on socket, return the same code */
-		if (sk->sk_err) {
-			ctx->async_wait.err = -sk->sk_err;
+		if (sk_err) {
+			ctx->async_wait.err = -sk_err;
 		} else {
 			ctx->async_wait.err = err;
 			tls_err_abort(sk, err);
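
The effect of reading sk_err once can be shown with a small userspace
sketch. This is not kernel code: struct fake_sock and encrypt_done_result()
are hypothetical, and a relaxed C11 atomic load stands in for READ_ONCE().
Assuming the suspicion above is right, it shows that with a single snapshot
the abort branch can only be reached with a non-zero err.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct sock; only the error field is modelled. */
struct fake_sock {
	_Atomic int sk_err;
};

/*
 * Returns the value that would be stored in async_wait.err.
 * *aborted is set to 1 when the tls_err_abort() branch would be taken.
 */
static int encrypt_done_result(struct fake_sock *sk, int err, int *aborted)
{
	/* One load, used for both the test and the assignment. */
	int sk_err = atomic_load_explicit(&sk->sk_err, memory_order_relaxed);

	*aborted = 0;
	if (!err && !sk_err)
		return 0;	/* no error anywhere */

	if (sk_err)
		return -sk_err;	/* error already recorded on the socket */

	*aborted = 1;		/* only reachable with err != 0 */
	return err;
}
```

With two separate loads, another context clearing the field between them
could send err == 0 down the abort branch, which is what the WARN checks for.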

-- 
Sabrina


* Re: [PATCH net 4/4] net: ti: icssg: Fix XSK zero copy TX during application wakeup
From: Jakub Kicinski @ 2026-06-16 15:19 UTC
  To: Meghana Malladi
  Cc: diogo.ivo, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, sdf, john.fastabend, hawk, daniel, ast, pabeni,
	edumazet, davem, andrew+netdev, bpf, linux-kernel, netdev,
	linux-arm-kernel, srk, Vignesh Raghavendra, Roger Quadros,
	danishanwar
In-Reply-To: <ed0bc332-0196-4613-8066-9b94f8ed0013@ti.com>

On Tue, 16 Jun 2026 16:41:00 +0530 Meghana Malladi wrote:
> On 6/16/26 04:51, Jakub Kicinski wrote:
> > On Fri, 12 Jun 2026 00:27:44 +0530 Meghana Malladi wrote:  
> >> @@ -169,9 +169,6 @@ static int emac_xsk_xmit_zc(struct prueth_emac *emac,
> >>   
> >>   		num_tx++;
> >>   	}
> >> -
> >> -	xsk_tx_release(tx_chn->xsk_pool);
> >> -	return num_tx;  
> > 
> > Why are you deleting this?
> >   
> 
> xsk_sendmsg() also calls this without an rcu-lock when transmitting the 
> packets if the xmit was successful, so I was assuming it is not required 
> and I removed this.

I think you still need it. Besides, seems like a separate cleanup.

> >>   void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
> >> @@ -279,9 +276,6 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
> >>   		num_tx++;
> >>   	}
> >>   
> >> -	if (!num_tx)
> >> -		return 0;  
> > 
> > Does something prevent us from running all this code if budget is 0?
> > If budget is 0 we can complete normal Tx with skbs but we must
> > not touch any AF-XDP related state.
> 
> Can you elaborate more? I couldn't interpret your comment here.

netpoll may call NAPI from any context, including from IRQ.
It uses a budget of 0 to indicate that it's trying to only reap Tx
completions, without doing any Rx or XDP work. XDP can't be run
from IRQ context.
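
A minimal userspace sketch of that rule, not the icssg driver code: struct
fake_chan, reap_skb_completions(), xsk_tx_work() and tx_poll() are
hypothetical names. It only illustrates gating the AF_XDP part of a Tx poll
on a non-zero budget.

```c
#include <stdbool.h>

/* Hypothetical per-channel state; not the driver's struct. */
struct fake_chan {
	bool xsk_bound;		/* an XSK pool is attached */
	int pending_skb;	/* completed skb descriptors waiting to be reaped */
	int xsk_calls;		/* how often the XSK path ran */
};

/* Reap ordinary skb completions; allowed in any context. */
static int reap_skb_completions(struct fake_chan *ch)
{
	int done = ch->pending_skb;

	ch->pending_skb = 0;
	return done;
}

/* Placeholder for zero-copy Tx submission and completion handling. */
static void xsk_tx_work(struct fake_chan *ch)
{
	ch->xsk_calls++;
}

/* Poll handler: with budget == 0 only skb completions are reaped. */
static int tx_poll(struct fake_chan *ch, int budget)
{
	int done = reap_skb_completions(ch);

	if (budget > 0 && ch->xsk_bound)
		xsk_tx_work(ch);

	return done;
}
```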

> >>   	netif_txq = netdev_get_tx_queue(ndev, chn);
> >>   	netdev_tx_completed_queue(netif_txq, num_tx, total_bytes);
> >>   
> >> @@ -306,7 +300,9 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
> >>   
> >>   		netif_txq = netdev_get_tx_queue(ndev, chn);
> >>   		txq_trans_cond_update(netif_txq);  
> > 
> > This looks misplaced, now we will hit it even if we didn't complete
> > or submit any Tx.
> 
> This code needs to be hit for packet transmission in zero copy mode.
> emac_xsk_xmit_zc() submits the packets to the DMA in NAPI context,
> when the application wakes up the driver and triggers NAPI. Once the DMA
> transfer is done, the irq gets triggered and NAPI gets called, which will
> handle the tx packet completion + submit the next Tx batch of packets to
> the DMA.
> 
> The if (tx_chn->xsk_pool) check ensures this hits and runs for zero copy
> only. Also the above check (!num_tx) returns early during the application
> wakeup (where budget is zero), hence it is removed.

I'm commenting on txq_trans_cond_update(), you're calling it
effectively on every NAPI call when XSK is bound, whether
Tx is making progress or not.


* Re: [PATCH v2 net-next 1/1] tcp: Replace min_tso_segs() with tso_segs() CC callback for TCP Prague
From: Jakub Kicinski @ 2026-06-16 15:23 UTC
  To: Chia-Yu Chang (Nokia)
  Cc: edumazet@google.com, ncardwell@google.com, jolsa@kernel.org,
	yonghong.song@linux.dev, song@kernel.org,
	linux-kselftest@vger.kernel.org, memxor@gmail.com,
	shuah@kernel.org, martin.lau@linux.dev, ast@kernel.org,
	daniel@iogearbox.net, andrii@kernel.org, eddyz87@gmail.com,
	horms@kernel.org, dsahern@kernel.org, bpf@vger.kernel.org,
	netdev@vger.kernel.org, pabeni@redhat.com, jhs@mojatatu.com,
	stephen@networkplumber.org, davem@davemloft.net,
	andrew+netdev@lunn.ch, donald.hunter@gmail.com, kuniyu@google.com,
	ij@kernel.org, Koen De Schepper (Nokia), g.white@cablelabs.com,
	ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
	cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
	vidhi_goel@apple.com
In-Reply-To: <PAXPR07MB7984585F9348BB3F000EB330A3E52@PAXPR07MB7984.eurprd07.prod.outlook.com>

On Tue, 16 Jun 2026 12:23:11 +0000 Chia-Yu Chang (Nokia) wrote:
> > On Mon, 15 Jun 2026 18:51:02 -0700 Jakub Kicinski wrote:  
> > > On Sun, 14 Jun 2026 09:17:56 +0200 chia-yu.chang@nokia-bell-labs.com
> > > Eric, Neal, looks good?
> > >
> > > The min rtt thing in tcp_tso_autosize() helps a bit but if the sender 
> > > gets congested for a longer stretch min_rtts on new connections are 
> > > high and we're back to sending small TSO, keeping the sender overloaded.
> > > Which is to say - I _hope_ this also solves some of Meta's problems :)  
> > 
> > Ugh, I didn't see the Sashiko report, it's only CCed to the author and bpf@, not to netdev :/
> > 
> > The zero-check sounds legit. Let's revisit this after the merge window.  
> 
> Thanks for the comment, I will take action after the merge window.
> 
> And, please correct me if I am wrong, the next eligible submission is expected from 30-June, right?

It usually opens Monday morning (PST) so Jun 29th


* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Jakub Kicinski @ 2026-06-16 15:28 UTC
  To: Sabrina Dubroca
  Cc: syzbot, davem, edumazet, horms, john.fastabend, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <ajFpejsl1ukTbG96@krikkit>

On Tue, 16 Jun 2026 17:19:22 +0200 Sabrina Dubroca wrote:
> I suspect err==0, and sock_error() consumed sk_err in between (the
> alternative would be err > 0).
> 
> Something like this?

Makes sense, but what's eating sk_err? Don't we depend on it being set
to avoid further state transitions once we hit a crypto error?
I thought that's why we don't consume sk_err in recvmsg and sendmsg in
the first place (we are not calling sock_error() anywhere)


* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Sebastian Andrzej Siewior @ 2026-06-16 15:31 UTC
  To: Jakub Kicinski
  Cc: Petr Mladek, John Ogness, Sergey Senozhatsky, Peter Zijlstra,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616081128.04e2c8dd@kernel.org>

On 2026-06-16 08:11:28 [-0700], Jakub Kicinski wrote:
> > 
> > Adding sched and printk folks for opinions while eyeballing
> > WARN_ON_DEFERRED().
> 
> Thanks a lot for looking into this! To be clear - the printk_deferred /
> WARN_DEFERRED would be just for stable? Or there's still some
> sensitivity even with nbcon?

We already have printk_deferred(). WARN_DEFERRED() would be new. I
*think* this is not limited to netpoll/netconsole but affects all console
drivers not using CON_NBCON if the printk (via WARN) occurs with the rq
lock held.
I don't remember all the details but printk_deferred() was introduced to
circumvent this until printk is fixed.

Once we get rid of those legacy drivers and NBCON is the default we can
get rid of printk_deferred() :)

Sebastian


* Re: [syzbot] [net?] KASAN: slab-use-after-free Read in fib_rules_lookup
From: Ido Schimmel @ 2026-06-16 15:31 UTC
  To: syzbot, kuniyu
  Cc: davem, dsahern, edumazet, horms, kuba, linux-kernel, netdev,
	pabeni, syzkaller-bugs
In-Reply-To: <6a315824.b0403584.28d0ff.0000.GAE@google.com>

On Tue, Jun 16, 2026 at 07:05:24AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    72dfa4700f78 net: dsa: sja1105: fix lastused timestamp in ..

This includes commit 759923cf03b0 ("ipv4: fib: Convert
fib_net_exit_batch() to ->exit_rtnl().") that moved ip_fib_net_exit()
(and therefore fib4_rules_exit()) earlier in the netns dismantle path.

Kuniyuki, can you please take a look?

You can use this to reproduce:

#!/bin/bash

while true; do
	ip netns add ns1
	ip -n ns1 link set dev lo up
	ip -n ns1 address add 192.0.2.1/24 dev lo
	ip -n ns1 link add name dummy1 up type dummy
	ip -n ns1 address add 198.51.100.1/24 dev dummy1
	ip -n ns1 rule add ipproto tcp sport 12345 table 12345
	ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321
	ip netns del ns1
done

Thanks

> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15794bd2580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a0842261b62cdea8
> dashboard link: https://syzkaller.appspot.com/bug?extid=965506b59a2de0b6905c
> compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/d4e16f50a97c/disk-72dfa470.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/6cd4a736e796/vmlinux-72dfa470.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/548b0011c8e8/bzImage-72dfa470.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
> 
> bond0 (unregistering): Released all slaves
> bond1 (unregistering): Released all slaves
> bond2 (unregistering): (slave dummy0): Releasing active interface
> bond2 (unregistering): Released all slaves
> ==================================================================
> BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
> Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641
> 
> CPU: 0 UID: 0 PID: 12641 Comm: kworker/u8:21 Not tainted syzkaller #0 PREEMPT(full) 
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
> Workqueue: netns cleanup_net
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
>  print_address_description+0x55/0x1e0 mm/kasan/report.c:378
>  print_report+0x58/0x70 mm/kasan/report.c:482
>  kasan_report+0x117/0x150 mm/kasan/report.c:595
>  fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
>  __fib_lookup+0x106/0x210 net/ipv4/fib_rules.c:96
>  ip_route_output_key_hash_rcu+0x294/0x2720 net/ipv4/route.c:2811
>  ip_route_output_key_hash+0x18d/0x2a0 net/ipv4/route.c:2702
>  __ip_route_output_key include/net/route.h:169 [inline]
>  ip_route_output_flow+0x2a/0x150 net/ipv4/route.c:2929
>  ip4_datagram_release_cb+0x89d/0xbe0 net/ipv4/datagram.c:118
>  release_sock+0x206/0x260 net/core/sock.c:3861
>  inet_shutdown+0x2b1/0x390 net/ipv4/af_inet.c:950
>  udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
>  fou_release net/ipv4/fou_core.c:562 [inline]
>  fou_exit_net+0x17d/0x1f0 net/ipv4/fou_core.c:1230
>  ops_exit_list net/core/net_namespace.c:199 [inline]
>  ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252
>  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
>  process_one_work kernel/workqueue.c:3314 [inline]
>  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
>  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
>  kthread+0x389/0x470 kernel/kthread.c:436
>  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>  </TASK>
> 
> Allocated by task 19121:
>  kasan_save_stack mm/kasan/common.c:57 [inline]
>  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
>  poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
>  __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
>  kasan_kmalloc include/linux/kasan.h:263 [inline]
>  __do_kmalloc_node mm/slub.c:5296 [inline]
>  __kmalloc_node_track_caller_noprof+0x4d7/0x7b0 mm/slub.c:5408
>  kmemdup_noprof+0x2b/0x70 mm/util.c:138
>  kmemdup_noprof include/linux/fortify-string.h:763 [inline]
>  fib_rules_register+0x2f/0x400 net/core/fib_rules.c:170
>  fib4_rules_init+0x21/0x160 net/ipv4/fib_rules.c:508
>  ip_fib_net_init net/ipv4/fib_frontend.c:1578 [inline]
>  fib_net_init+0x17a/0x3e0 net/ipv4/fib_frontend.c:1628
>  ops_init+0x35d/0x5d0 net/core/net_namespace.c:137
>  setup_net+0x118/0x350 net/core/net_namespace.c:446
>  copy_net_ns+0x4f9/0x720 net/core/net_namespace.c:579
>  create_new_namespaces+0x3f0/0x6b0 kernel/nsproxy.c:132
>  unshare_nsproxy_namespaces+0x149/0x190 kernel/nsproxy.c:234
>  ksys_unshare+0x57d/0xa00 kernel/fork.c:3242
>  __do_sys_unshare kernel/fork.c:3316 [inline]
>  __se_sys_unshare kernel/fork.c:3314 [inline]
>  __x64_sys_unshare+0x38/0x50 kernel/fork.c:3314
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Freed by task 12641:
>  kasan_save_stack mm/kasan/common.c:57 [inline]
>  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
>  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:584
>  poison_slab_object mm/kasan/common.c:253 [inline]
>  __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
>  kasan_slab_free include/linux/kasan.h:235 [inline]
>  slab_free_hook mm/slub.c:2689 [inline]
>  __rcu_free_sheaf_prepare+0x12d/0x2a0 mm/slub.c:2940
>  rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
>  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
>  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
>  handle_softirqs+0x225/0x840 kernel/softirq.c:622
>  do_softirq+0x76/0xd0 kernel/softirq.c:523
>  __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
>  unregister_netdevice_many_notify+0x1874/0x2150 net/core/dev.c:12445
>  ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
>  ops_undo_list+0x391/0x8d0 net/core/net_namespace.c:248
>  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
>  process_one_work kernel/workqueue.c:3314 [inline]
>  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
>  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
>  kthread+0x389/0x470 kernel/kthread.c:436
>  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> 
> The buggy address belongs to the object at ffff88804ec4c600
>  which belongs to the cache kmalloc-192 of size 192
> The buggy address is located 128 bytes inside of
>  freed 192-byte region [ffff88804ec4c600, ffff88804ec4c6c0)
> 
> The buggy address belongs to the physical page:
> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4ec4c
> flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
> page_type: f5(slab)
> raw: 00fff00000000000 ffff88813fe163c0 dead000000000100 dead000000000122
> raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 13856, tgid 13853 (syz.3.2144), ts 351172300879, free_ts 351133053454
>  set_page_owner include/linux/page_owner.h:32 [inline]
>  post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
>  prep_new_page mm/page_alloc.c:1861 [inline]
>  get_page_from_freelist+0x24ae/0x2530 mm/page_alloc.c:3941
>  __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
>  alloc_slab_page mm/slub.c:3278 [inline]
>  allocate_slab+0x77/0x660 mm/slub.c:3467
>  new_slab mm/slub.c:3525 [inline]
>  refill_objects+0x336/0x3d0 mm/slub.c:7272
>  refill_sheaf mm/slub.c:2816 [inline]
>  __pcs_replace_empty_main+0x320/0x720 mm/slub.c:4652
>  alloc_from_pcs mm/slub.c:4750 [inline]
>  slab_alloc_node mm/slub.c:4884 [inline]
>  __do_kmalloc_node mm/slub.c:5295 [inline]
>  __kmalloc_noprof+0x464/0x750 mm/slub.c:5308
>  kmalloc_noprof include/linux/slab.h:954 [inline]
>  kzalloc_noprof include/linux/slab.h:1188 [inline]
>  new_dir fs/proc/proc_sysctl.c:966 [inline]
>  get_subdir fs/proc/proc_sysctl.c:1010 [inline]
>  sysctl_mkdir_p fs/proc/proc_sysctl.c:1320 [inline]
>  __register_sysctl_table+0xc02/0x1370 fs/proc/proc_sysctl.c:1395
>  neigh_sysctl_register+0x9b1/0xa90 net/core/neighbour.c:3915
>  addrconf_sysctl_register+0xb3/0x1c0 net/ipv6/addrconf.c:7396
>  ipv6_add_dev+0xd26/0x13a0 net/ipv6/addrconf.c:460
>  addrconf_notify+0x771/0x1050 net/ipv6/addrconf.c:3679
>  notifier_call_chain+0x1a5/0x3d0 kernel/notifier.c:85
>  call_netdevice_notifiers_extack net/core/dev.c:2288 [inline]
>  call_netdevice_notifiers net/core/dev.c:2302 [inline]
>  register_netdevice+0x18db/0x1f00 net/core/dev.c:11474
>  macsec_newlink+0x706/0x1200 drivers/net/macsec.c:4218
>  rtnl_newlink_create+0x310/0xb00 net/core/rtnetlink.c:3905
> page last free pid 12657 tgid 12657 stack trace:
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  __free_pages_prepare mm/page_alloc.c:1397 [inline]
>  __free_frozen_pages+0xc0d/0xd20 mm/page_alloc.c:2938
>  __tlb_remove_table_free mm/mmu_gather.c:228 [inline]
>  tlb_remove_table_rcu+0x85/0x100 mm/mmu_gather.c:291
>  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
>  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
>  handle_softirqs+0x225/0x840 kernel/softirq.c:622
>  __do_softirq kernel/softirq.c:656 [inline]
>  invoke_softirq kernel/softirq.c:496 [inline]
>  __irq_exit_rcu+0xca/0x220 kernel/softirq.c:735
>  irq_exit_rcu+0x9/0x30 kernel/softirq.c:752
>  instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1061 [inline]
>  sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1061
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
> 
> Memory state around the buggy address:
>  ffff88804ec4c580: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88804ec4c600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88804ec4c680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>                    ^
>  ffff88804ec4c700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff88804ec4c780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
> ==================================================================
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> 
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
> 
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
> 
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report
> 
> If you want to undo deduplication, reply with:
> #syz undup


* [PATCH 7.0 317/378] rxrpc: Fix the ACK parser to extract the SACK table for parsing
From: Greg Kroah-Hartman @ 2026-06-16 14:59 UTC
  To: stable
  Cc: Greg Kroah-Hartman, patches, Michael Bommarito, David Howells,
	Marc Dionne, Jeffrey Altman, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, linux-afs, netdev,
	stable
In-Reply-To: <20260616145109.744539446@linuxfoundation.org>

7.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: David Howells <dhowells@redhat.com>

commit 333b6d5bb9f87827ac2639c737bf9613dbae7253 upstream.

Fix modification of the received skbuff in rxrpc_input_soft_acks() and a
potential incorrect access of the buffer in a fragmented UDP packet (the
packet would probably have to be deliberately pre-generated as fragmented)
when AF_RXRPC tries to extract the contents of the SACK table by copying
out the contents of the SACK table into a buffer before attempting to parse
it.

AF_RXRPC assumes that it can just call skb_condense() and then validly
access the SACK table from skb->data and that it will be a flat buffer -
but skb_condense() can silently fail to do anything under some
circumstances.

Note that whilst rxrpc_input_soft_acks() should be able to parse extended
ACKs, the rest of AF_RXRPC doesn't currently support that.

Further, there's then no need to call skb_condense() in rxrpc_input_ack(),
so don't.

Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs")
Reported-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://lore.kernel.org/r/20260513180907.2061972-1-michael.bommarito@gmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: stable@kernel.org
Link: https://patch.msgid.link/105362.1780573560@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/rxrpc/input.c |   26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -963,23 +963,34 @@ static void rxrpc_input_soft_acks(struct
 	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 	struct rxrpc_txqueue *tq = call->tx_queue;
 	unsigned long extracted = ~0UL;
-	unsigned int nr = 0;
+	unsigned int nr = 0, nsack;
 	rxrpc_seq_t seq = call->acks_hard_ack + 1;
 	rxrpc_seq_t lowest_nak = seq + sp->ack.nr_acks;
-	u8 *acks = skb->data + sizeof(struct rxrpc_wire_header) + sizeof(struct rxrpc_ackpacket);
+	u8 sack[256] __aligned(sizeof(unsigned long));
+	u8 *acks = sack;
 
 	_enter("%x,%x,%u", tq->qbase, seq, sp->ack.nr_acks);
 
 	while (after(seq, tq->qbase + RXRPC_NR_TXQUEUE - 1))
 		tq = tq->next;
 
+	/* Extract an individual SACK table.  A normal SACK table is up to 255
+	 * bytes with 1 ACK flag per byte, but an extended SACK table can be up
+	 * to 256 bytes with up to 8 ACK/NACK flags per byte.  The ACK flags go
+	 * across all bit 0's then all bit 1's, then all bit 2's, ...
+	 */
+	memset(sack, 0, sizeof(sack));
+	nsack = umin(sp->ack.nr_acks, 256);
+	if (skb_copy_bits(skb,
+			  sizeof(struct rxrpc_wire_header) + sizeof(struct rxrpc_ackpacket),
+			  sack, nsack) < 0)
+		return;
+
 	for (unsigned int i = 0; i < sp->ack.nr_acks; i++) {
 		/* Decant ACKs until we hit a txqueue boundary. */
+		if ((i & 255) == 0)
+			acks = sack;
 		shiftr_adv_rotr(acks, extracted);
-		if (i == 256) {
-			acks -= i;
-			i = 0;
-		}
 		seq++;
 		nr++;
 		if ((seq & RXRPC_TXQ_MASK) != 0)
@@ -1117,9 +1128,6 @@ static void rxrpc_input_ack(struct rxrpc
 	    skb_copy_bits(skb, ioffset, &trailer, sizeof(trailer)) < 0)
 		return rxrpc_proto_abort(call, 0, rxrpc_badmsg_short_ack_trailer);
 
-	if (nr_acks > 0)
-		skb_condense(skb);
-
 	call->acks_latest_ts = ktime_get_real();
 	call->acks_hard_ack = hard_ack;
 	call->acks_prev_seq = prev_pkt;
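
Reading aid, not part of the patch: a minimal userspace sketch of the
copy-out approach described in the commit message. struct frag,
copy_bits() and count_acks() are hypothetical names standing in for a
non-linear skb, skb_copy_bits() and the parser loop. The bit layout
follows the comment the patch adds (all bit 0's, then all bit 1's), as
far as it can be read from the hunk, and a plain shift is assumed here in
place of the kernel's shiftr_adv_rotr() helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SACK_MAX 256	/* same bound the patch uses for its on-stack table */

/* Hypothetical stand-in for a non-linear skb: payload split into fragments. */
struct frag {
	const uint8_t *data;
	size_t len;
};

/*
 * Userspace analogue of skb_copy_bits(): copy 'len' bytes starting at
 * 'offset' out of a fragment list into a flat buffer.  Returns 0 on
 * success, -1 if the fragments do not hold enough data.
 */
static int copy_bits(const struct frag *frags, size_t nfrags,
		     size_t offset, uint8_t *to, size_t len)
{
	for (size_t i = 0; i < nfrags && len > 0; i++) {
		size_t flen = frags[i].len;

		if (offset >= flen) {
			offset -= flen;		/* still skipping the header */
			continue;
		}
		size_t n = flen - offset;
		if (n > len)
			n = len;
		memcpy(to, frags[i].data + offset, n);
		to += n;
		len -= n;
		offset = 0;
	}
	return len == 0 ? 0 : -1;
}

/*
 * Count the positive ACKs among 'nr_acks' slots.  The table is copied out
 * first, so the source fragments are never modified.  Slot i is assumed to
 * live in byte (i % 256), bit (i / 256).
 */
static int count_acks(const struct frag *frags, size_t nfrags,
		      size_t offset, unsigned int nr_acks)
{
	uint8_t sack[SACK_MAX];
	unsigned int nsack = nr_acks < SACK_MAX ? nr_acks : SACK_MAX;
	const uint8_t *acks = sack;
	int count = 0;

	assert(nr_acks <= SACK_MAX * 8);	/* at most 8 flags per byte */

	memset(sack, 0, sizeof(sack));
	if (copy_bits(frags, nfrags, offset, sack, nsack) < 0)
		return -1;

	for (unsigned int i = 0; i < nr_acks; i++) {
		if ((i & 255) == 0)
			acks = sack;	/* wrap back to the table start */
		count += (*acks++ >> (i / 256)) & 1;
	}
	return count;
}
```

The point of the sketch is that the parser only ever touches the local
copy, so it works the same whether the packet data is linear or spread
over several fragments, and a short packet is reported as an error
instead of being read past its end.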



^ permalink raw reply

* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Jakub Kicinski @ 2026-06-16 15:49 UTC (permalink / raw)
  To: Carsten Strotmann
  Cc: John Paul Adrian Glaubitz, davem, netdev, edumazet, pabeni,
	andrew+netdev, horms, geert, chleroy, npiggin, mpe, maddy,
	linux-mips, linux-m68k, linuxppc-dev
In-Reply-To: <A3590144-073C-46D6-8425-90EE0C4D48E8@strotmann.de>

On Tue, 16 Jun 2026 09:13:46 +0200 Carsten Strotmann wrote:
> I'm a user of AppleTalk and other "Retro"-Features in the Linux Kernel.
> 
> On 16 Jun 2026, at 2:55, Jakub Kicinski wrote:
> 
> > We can complain about the AI slop til the cows come home.
> > I don't like it, you don't like it. What difference does it make?
> >
> > If y'all have real solutions please share. Complaining about
> > "commercial interests" and "nuk[ing] everything in a panic reaction"
> > is not helpful.  
> 
> the solution, as Adrian pointed out, is to leave these features in
> the Linux kernel but have them disabled by default.

I think y'all need to internalize that "just leave it in" means work.
_Someone_ has to handle the reports and patches. And since nobody is
doing that the code is going to GitHub, where it can continue to "just
be left" or whatever, without racking up CVEs for the Linux kernel
and leading to maintainer burn out :/

> Maybe put a warning message in the kernel config tools that people
> should only enable these if they know what they are doing.
> 
> These "retro"-features should not pose any security risk if they are
> not compiled into a kernel.

Nobody is stopping you from using this code! It's perfectly suitable 
to be an out of tree module. Maybe it'd be harder if someone wanted to
remove a CPU architecture you want to use, but protocols are perfectly
fine as loadable modules. You can continue to use the code from:
 https://github.com/linux-netdev/mod-orphan

Presumably you could get Debian to package that and you wouldn't even
know the sources no longer live in the kernel tree.

^ permalink raw reply

* [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Tushar Vyavahare @ 2026-06-16 15:49 UTC (permalink / raw)
  To: netdev, magnus.karlsson, maciej.fijalkowski, stfomichev,
	kernelxing, davem, kuba, pabeni, ast, daniel, tirthendu.sarkar,
	tushar.vyavahare
  Cc: bpf

This series improves AF_XDP selftests by making timeout handling
explicit and fixing sources of non-determinism in xsk timeout tests.

Patch 1 introduces test_spec::poll_tmout and removes implicit
dependence on RX UMEM setup state for timeout behavior.

Patch 2 fixes thread harness sequencing by attaching XDP programs
before worker startup, removing signal-based termination, and using
barrier synchronization only for dual-thread runs.

Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
configuration does not leak into subsequent cases on shared-netdev
runs.

Together these changes make timeout handling easier to follow and
improve selftest stability, especially on real NIC runs.

Tushar Vyavahare (3):
  selftests/xsk: make poll timeout mode explicit
  selftests/xsk: fix timeout thread harness sequencing
  selftests/xsk: restore shared_umem after POLL_TXQ_FULL

 .../selftests/bpf/prog_tests/test_xsk.c       | 96 +++++++++++--------
 .../selftests/bpf/prog_tests/test_xsk.h       |  2 +
 2 files changed, 56 insertions(+), 42 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH net-next 1/3] selftests/xsk: make poll timeout mode explicit
From: Tushar Vyavahare @ 2026-06-16 15:49 UTC (permalink / raw)
  To: netdev, magnus.karlsson, maciej.fijalkowski, stfomichev,
	kernelxing, davem, kuba, pabeni, ast, daniel, tirthendu.sarkar,
	tushar.vyavahare
  Cc: bpf
In-Reply-To: <20260616154955.1492560-1-tushar.vyavahare@intel.com>

Stop inferring timeout behavior from RX UMEM initialization state.
That ties timeout semantics to setup internals and obscures intent.

Use test_spec::poll_tmout as the explicit timeout-mode selector in
TX and RX paths.

In RX, treat poll timeout as expected only in timeout mode.
In TX, let send_pkts() own loop completion in non-timeout mode
and use __send_pkts() only for progress and timeout detection.

This makes timeout logic explicit and keeps control flow predictable.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 44 +++++++++----------
 .../selftests/bpf/prog_tests/test_xsk.h       |  1 +
 2 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 72875071d4f1..ca47a16ceb1a 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -65,11 +65,6 @@ static void gen_eth_hdr(struct xsk_socket_info *xsk, struct ethhdr *eth_hdr)
 	eth_hdr->h_proto = htons(ETH_P_LOOPBACK);
 }
 
-static bool is_umem_valid(struct xsk_socket_info *xsk)
-{
-	return !!xsk->umem->umem;
-}
-
 static u32 mode_to_xdp_flags(enum test_mode mode)
 {
 	return (mode == TEST_MODE_SKB) ? XDP_FLAGS_SKB_MODE : XDP_FLAGS_DRV_MODE;
@@ -1010,7 +1005,7 @@ static int __receive_pkts(struct test_spec *test, struct xsk_socket_info *xsk)
 			return TEST_FAILURE;
 
 		if (!ret) {
-			if (!is_umem_valid(test->ifobj_tx->xsk))
+			if (test->poll_tmout)
 				return TEST_PASS;
 
 			ksft_print_msg("ERROR: [%s] Poll timed out\n", __func__);
@@ -1149,7 +1144,7 @@ static int receive_pkts(struct test_spec *test)
 			break;
 
 		res = __receive_pkts(test, xsk);
-		if (!(res == TEST_PASS || res == TEST_CONTINUE))
+		if (res != TEST_CONTINUE)
 			return res;
 
 		ret = gettimeofday(&tv_now, NULL);
@@ -1166,7 +1161,8 @@ static int receive_pkts(struct test_spec *test)
 	return TEST_PASS;
 }
 
-static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, bool timeout)
+static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk,
+		       bool test_timeout)
 {
 	u32 i, idx = 0, valid_pkts = 0, valid_frags = 0, buffer_len;
 	struct pkt_stream *pkt_stream = xsk->pkt_stream;
@@ -1178,7 +1174,7 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 	buffer_len = pkt_get_buffer_len(umem, pkt_stream->max_pkt_len);
 	/* pkts_in_flight might be negative if many invalid packets are sent */
 	if (pkts_in_flight >= (int)((umem_size(umem) - xsk->batch_size * buffer_len) /
-	    buffer_len)) {
+	    buffer_len) && !test_timeout) {
 		ret = kick_tx(xsk);
 		if (ret)
 			return TEST_FAILURE;
@@ -1191,7 +1187,7 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 	while (xsk_ring_prod__reserve(&xsk->tx, xsk->batch_size, &idx) < xsk->batch_size) {
 		if (use_poll) {
 			ret = poll(&fds, 1, POLL_TMOUT);
-			if (timeout) {
+			if (test_timeout) {
 				if (ret < 0) {
 					ksft_print_msg("ERROR: [%s] Poll error %d\n",
 						       __func__, errno);
@@ -1271,7 +1267,7 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 	if (use_poll) {
 		ret = poll(&fds, 1, POLL_TMOUT);
 		if (ret <= 0) {
-			if (ret == 0 && timeout)
+			if (ret == 0 && test_timeout)
 				return TEST_PASS;
 
 			ksft_print_msg("ERROR: [%s] Poll error %d\n", __func__, ret);
@@ -1279,14 +1275,14 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 		}
 	}
 
-	if (!timeout) {
+	if (!test_timeout) {
 		if (complete_pkts(xsk, i))
 			return TEST_FAILURE;
 
 		usleep(10);
-		return TEST_PASS;
 	}
 
+	/* Loop completion is driven by send_pkts() stream progress checks. */
 	return TEST_CONTINUE;
 }
 
@@ -1322,7 +1318,6 @@ bool all_packets_sent(struct test_spec *test, unsigned long *bitmap)
 
 static int send_pkts(struct test_spec *test, struct ifobject *ifobject)
 {
-	bool timeout = !is_umem_valid(test->ifobj_rx->xsk);
 	DECLARE_BITMAP(bitmap, test->nb_sockets);
 	u32 i, ret;
 
@@ -1337,19 +1332,18 @@ static int send_pkts(struct test_spec *test, struct ifobject *ifobject)
 				__set_bit(i, bitmap);
 				continue;
 			}
-			ret = __send_pkts(ifobject, &ifobject->xsk_arr[i], timeout);
-			if (ret == TEST_CONTINUE && !test->fail)
-				continue;
-
-			if ((ret || test->fail) && !timeout)
-				return TEST_FAILURE;
-
-			if (ret == TEST_PASS && timeout)
+			ret = __send_pkts(ifobject, &ifobject->xsk_arr[i], test->poll_tmout);
+			if (ret != TEST_CONTINUE)
 				return ret;
 
-			ret = wait_for_tx_completion(&ifobject->xsk_arr[i]);
-			if (ret)
+			if (test->fail)
 				return TEST_FAILURE;
+
+			if (!test->poll_tmout) {
+				ret = wait_for_tx_completion(&ifobject->xsk_arr[i]);
+				if (ret)
+					return TEST_FAILURE;
+			}
 		}
 	}
 
@@ -2231,6 +2225,7 @@ int testapp_xdp_shared_umem(struct test_spec *test)
 
 int testapp_poll_txq_tmout(struct test_spec *test)
 {
+	test->poll_tmout = true;
 	test->ifobj_tx->use_poll = true;
 	/* create invalid frame by set umem frame_size and pkt length equal to 2048 */
 	test->ifobj_tx->xsk->umem->frame_size = 2048;
@@ -2241,6 +2236,7 @@ int testapp_poll_txq_tmout(struct test_spec *test)
 
 int testapp_poll_rxq_tmout(struct test_spec *test)
 {
+	test->poll_tmout = true;
 	test->ifobj_rx->use_poll = true;
 	return testapp_validate_traffic_single_thread(test, test->ifobj_rx);
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.h b/tools/testing/selftests/bpf/prog_tests/test_xsk.h
index 4313d0d87235..20eaaa254998 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.h
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.h
@@ -207,6 +207,7 @@ struct test_spec {
 	bool set_ring;
 	bool adjust_tail;
 	bool adjust_tail_support;
+	bool poll_tmout;
 	enum test_mode mode;
 	char name[MAX_TEST_NAME_SIZE];
 };
-- 
2.43.0
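
Reading aid, not part of the patch: a minimal sketch of the idea of
passing the timeout expectation explicitly instead of inferring it from
unrelated setup state. enum verdict, poll_once() and demo_pipe() are
hypothetical names, and the pipe is only an assumption made to get a file
descriptor that is idle or readable on demand; the selftest itself polls
AF_XDP sockets.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical result codes, loosely modelled on TEST_PASS/FAILURE/CONTINUE. */
enum verdict { VERDICT_PASS, VERDICT_FAIL, VERDICT_CONTINUE };

/* Classify one poll() round; the caller states whether a timeout is expected. */
static enum verdict poll_once(int fd, int timeout_ms, bool expect_timeout)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	assert(timeout_ms >= 0);	/* a negative timeout would block forever */

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return VERDICT_FAIL;	/* poll() itself failed */
	if (ret == 0)			/* timed out */
		return expect_timeout ? VERDICT_PASS : VERDICT_FAIL;
	return VERDICT_CONTINUE;	/* fd is readable, keep going */
}

/* Demo: poll the read end of a pipe, optionally making it readable first. */
static enum verdict demo_pipe(bool expect_timeout, bool make_readable)
{
	int fds[2];
	enum verdict v;

	if (pipe(fds))
		return VERDICT_FAIL;
	if (make_readable && write(fds[1], "x", 1) != 1)
		v = VERDICT_FAIL;
	else
		v = poll_once(fds[0], 10, expect_timeout);
	close(fds[0]);
	close(fds[1]);
	return v;
}
```

With the flag passed in, an idle descriptor passes only when the caller
asked for a timeout, and the same idle descriptor is a failure otherwise.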


^ permalink raw reply related

* [PATCH net-next 2/3] selftests/xsk: fix timeout thread harness sequencing
From: Tushar Vyavahare @ 2026-06-16 15:49 UTC (permalink / raw)
  To: netdev, magnus.karlsson, maciej.fijalkowski, stfomichev,
	kernelxing, davem, kuba, pabeni, ast, daniel, tirthendu.sarkar,
	tushar.vyavahare
  Cc: bpf
In-Reply-To: <20260616154955.1492560-1-tushar.vyavahare@intel.com>

Prevent workers from running before XDP program attachment completes.
The previous ordering allowed races between worker startup and setup.

Attach XDP programs before entering traffic validation.

Remove SIGUSR1-based worker termination and use pthread_join() for
thread shutdown so blocking syscalls are not interrupted.

Use barriers only for dual-thread runs so participants match and
teardown ordering stays deterministic.

This removes setup/startup races and stabilizes harness sequencing.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 33 ++++++++++---------
 .../selftests/bpf/prog_tests/test_xsk.h       |  1 +
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index ca47a16ceb1a..d4702d2aac5e 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -7,7 +7,6 @@
 #include <linux/netdev.h>
 #include <poll.h>
 #include <pthread.h>
-#include <signal.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <sys/socket.h>
@@ -1671,7 +1670,8 @@ void *worker_testapp_validate_rx(void *arg)
 				       strerror(-err));
 	}
 
-	pthread_barrier_wait(&barr);
+	if (test->use_barrier)
+		pthread_barrier_wait(&barr);
 
 	/* We leave only now in case of error to avoid getting stuck in the barrier */
 	if (err) {
@@ -1710,11 +1710,6 @@ static void testapp_clean_xsk_umem(struct ifobject *ifobj)
 	munmap(umem->buffer, umem->mmap_size);
 }
 
-static void handler(int signum)
-{
-	pthread_exit(NULL);
-}
-
 static bool xdp_prog_changed_rx(struct test_spec *test)
 {
 	struct ifobject *ifobj = test->ifobj_rx;
@@ -1819,9 +1814,18 @@ static int __testapp_validate_traffic(struct test_spec *test, struct ifobject *i
 		return TEST_FAILURE;
 	}
 
-	if (ifobj2) {
+	err = xsk_attach_xdp_progs(test, ifobj1, ifobj2);
+	if (err) {
+		ksft_print_msg("Error: failed to attach XDP programs: %d (%s)\n",
+			       err, strerror(-err));
+		return TEST_FAILURE;
+	}
+	test->use_barrier = !!ifobj2;
+
+	if (test->use_barrier) {
 		if (pthread_barrier_init(&barr, NULL, 2))
 			return TEST_FAILURE;
+
 		pkt_stream_reset(ifobj2->xsk->pkt_stream);
 	}
 
@@ -1829,27 +1833,26 @@ static int __testapp_validate_traffic(struct test_spec *test, struct ifobject *i
 	pkt_stream_reset(ifobj1->xsk->pkt_stream);
 	pkts_in_flight = 0;
 
-	signal(SIGUSR1, handler);
 	/*Spawn RX thread */
 	pthread_create(&t0, NULL, ifobj1->func_ptr, test);
 
-	if (ifobj2) {
+	if (test->use_barrier) {
 		pthread_barrier_wait(&barr);
 		if (pthread_barrier_destroy(&barr)) {
-			pthread_kill(t0, SIGUSR1);
+			test->use_barrier = false;
+			pthread_join(t0, NULL);
 			clean_sockets(test, ifobj1);
 			clean_umem(test, ifobj1, NULL);
 			return TEST_FAILURE;
 		}
+	}
 
+	if (ifobj2) {
 		/*Spawn TX thread */
 		pthread_create(&t1, NULL, ifobj2->func_ptr, test);
-
 		pthread_join(t1, NULL);
 	}
 
-	if (!ifobj2)
-		pthread_kill(t0, SIGUSR1);
 	pthread_join(t0, NULL);
 
 	if (test->total_steps == test->current_step || test->fail) {
@@ -1887,8 +1890,6 @@ static int testapp_validate_traffic(struct test_spec *test)
 		}
 	}
 
-	if (xsk_attach_xdp_progs(test, ifobj_rx, ifobj_tx))
-		return TEST_FAILURE;
 	return __testapp_validate_traffic(test, ifobj_rx, ifobj_tx);
 }
 
diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.h b/tools/testing/selftests/bpf/prog_tests/test_xsk.h
index 20eaaa254998..03753ddc5dcd 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.h
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.h
@@ -208,6 +208,7 @@ struct test_spec {
 	bool adjust_tail;
 	bool adjust_tail_support;
 	bool poll_tmout;
+	bool use_barrier;
 	enum test_mode mode;
 	char name[MAX_TEST_NAME_SIZE];
 };
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next 3/3] selftests/xsk: restore shared_umem after POLL_TXQ_FULL
From: Tushar Vyavahare @ 2026-06-16 15:49 UTC (permalink / raw)
  To: netdev, magnus.karlsson, maciej.fijalkowski, stfomichev,
	kernelxing, davem, kuba, pabeni, ast, daniel, tirthendu.sarkar,
	tushar.vyavahare
  Cc: bpf
In-Reply-To: <20260616154955.1492560-1-tushar.vyavahare@intel.com>

POLL_TXQ_FULL temporarily disables shared_umem on TX to exercise the
TX timeout path in isolation.

With shared_umem enabled, TX setup expects RX UMEM to be initialized
first and fails with: "RX UMEM is not initialized before shared-UMEM TX
setup".

Save and restore shared_umem around POLL_TXQ_FULL execution, and restore
it on both success and pkt_stream_replace() failure paths.

Also add an in-code comment explaining why shared_umem is temporarily
disabled in this test.

This keeps timeout setup local and prevents cross-test state leakage.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index d4702d2aac5e..6eb9096d084c 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2226,13 +2226,28 @@ int testapp_xdp_shared_umem(struct test_spec *test)
 
 int testapp_poll_txq_tmout(struct test_spec *test)
 {
+	bool shared_umem = test->ifobj_tx->shared_umem;
+	int ret;
+
 	test->poll_tmout = true;
+	/*
+	 * POLL_TXQ_FULL exercises TX timeout setup in isolation.
+	 * Keep TX out of shared-UMEM mode here so TX setup does not require
+	 * RX UMEM to be initialized first.
+	 */
+	test->ifobj_tx->shared_umem = false;
 	test->ifobj_tx->use_poll = true;
 	/* create invalid frame by set umem frame_size and pkt length equal to 2048 */
 	test->ifobj_tx->xsk->umem->frame_size = 2048;
-	if (pkt_stream_replace(test, 2 * DEFAULT_PKT_CNT, 2048))
+	if (pkt_stream_replace(test, 2 * DEFAULT_PKT_CNT, 2048)) {
+		test->ifobj_tx->shared_umem = shared_umem;
 		return TEST_FAILURE;
-	return testapp_validate_traffic_single_thread(test, test->ifobj_tx);
+	}
+
+	ret = testapp_validate_traffic_single_thread(test, test->ifobj_tx);
+	test->ifobj_tx->shared_umem = shared_umem;
+
+	return ret;
 }
 
 int testapp_poll_rxq_tmout(struct test_spec *test)
-- 
2.43.0
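
Reading aid, not part of the patch: a minimal sketch of the save/restore
pattern, so that a temporary override does not leak into later test
cases. struct ifobj, run_with_shared_umem_off() and the two bodies are
hypothetical. The patch restores the flag on two separate return paths;
the sketch funnels both through a single restore point, which is an
alternative layout and not the author's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the selftest's struct ifobject. */
struct ifobj {
	bool shared_umem;
};

typedef int (*test_body_fn)(struct ifobj *tx);

/* Run 'body' with shared_umem forced off, then put the old value back. */
static int run_with_shared_umem_off(struct ifobj *tx, test_body_fn body)
{
	bool saved = tx->shared_umem;
	int ret;

	assert(body);

	tx->shared_umem = false;
	ret = body(tx);
	tx->shared_umem = saved;	/* restored on success and on failure */

	return ret;
}

/* Example bodies: one that checks the override, one that always fails. */
static int body_ok(struct ifobj *tx)
{
	return tx->shared_umem ? -1 : 0;
}

static int body_fail(struct ifobj *tx)
{
	(void)tx;
	return -1;
}
```

Whatever the body returns, the caller sees the flag exactly as it was
before the call.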


^ permalink raw reply related

* Re: [PATCH net-next v5 0/6] pds_core: Add PLDM firmware update and host backed memory support
From: Jakub Kicinski @ 2026-06-16 15:54 UTC (permalink / raw)
  To: Nikhil P. Rao
  Cc: netdev, brett.creeley, eric.joyner, andrew+netdev, davem,
	edumazet, pabeni, jacob.e.keller
In-Reply-To: <20260616023554.258764-1-nikhil.rao@amd.com>

On Tue, 16 Jun 2026 02:35:48 +0000 Nikhil P. Rao wrote:
> This series adds PLDM-based firmware update support to the pds_core
> driver. PLDM (Platform Level Data Model) is a DMTF standard for firmware
> management that provides a vendor-neutral interface for firmware updates.

net-next is closed for the duration of the merge window.
Please see: https://netdev.bots.linux.dev/net-next.html
And of course:
https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html#development-cycle
-- 
pw-bot: defer

^ permalink raw reply

* Re: [syzbot] [net?] KASAN: slab-use-after-free Read in fib_rules_lookup
From: Eric Dumazet @ 2026-06-16 15:55 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: syzbot, kuniyu, davem, dsahern, horms, kuba, linux-kernel, netdev,
	pabeni, syzkaller-bugs
In-Reply-To: <20260616153110.GA876739@shredder>

On Tue, Jun 16, 2026 at 8:31 AM Ido Schimmel <idosch@nvidia.com> wrote:
>
> On Tue, Jun 16, 2026 at 07:05:24AM -0700, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    72dfa4700f78 net: dsa: sja1105: fix lastused timestamp in ..
>
> This includes commit 759923cf03b0 ("ipv4: fib: Convert
> fib_net_exit_batch() to ->exit_rtnl().") that moved ip_fib_net_exit()
> (and therefore fib4_rules_exit()) earlier in the netns dismantle path.
>
> Kuniyuki, can you please take a look?
>
> You can use this to reproduce:
>
> #!/bin/bash
>
> while true; do
>         ip netns add ns1
>         ip -n ns1 link set dev lo up
>         ip -n ns1 address add 192.0.2.1/24 dev lo
>         ip -n ns1 link add name dummy1 up type dummy
>         ip -n ns1 address add 198.51.100.1/24 dev dummy1
>         ip -n ns1 rule add ipproto tcp sport 12345 table 12345
>         ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321
>         ip netns del ns1
> done
>

Oh right.

While looking at this syzbot report I also found an old issue.

https://lore.kernel.org/netdev/20260616141317.407791-1-edumazet@google.com/T/#u

I guess adding some delays in enqueue_to_backlog() could trigger a
similar bug even if we revert Kuniyuki's patch.




> Thanks
>
> > git tree:       net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=15794bd2580000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=a0842261b62cdea8
> > dashboard link: https://syzkaller.appspot.com/bug?extid=965506b59a2de0b6905c
> > compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/d4e16f50a97c/disk-72dfa470.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/6cd4a736e796/vmlinux-72dfa470.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/548b0011c8e8/bzImage-72dfa470.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
> >
> > bond0 (unregistering): Released all slaves
> > bond1 (unregistering): Released all slaves
> > bond2 (unregistering): (slave dummy0): Releasing active interface
> > bond2 (unregistering): Released all slaves
> > ==================================================================
> > BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
> > Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641
> >
> > CPU: 0 UID: 0 PID: 12641 Comm: kworker/u8:21 Not tainted syzkaller #0 PREEMPT(full)
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
> > Workqueue: netns cleanup_net
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
> >  print_address_description+0x55/0x1e0 mm/kasan/report.c:378
> >  print_report+0x58/0x70 mm/kasan/report.c:482
> >  kasan_report+0x117/0x150 mm/kasan/report.c:595
> >  fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
> >  __fib_lookup+0x106/0x210 net/ipv4/fib_rules.c:96
> >  ip_route_output_key_hash_rcu+0x294/0x2720 net/ipv4/route.c:2811
> >  ip_route_output_key_hash+0x18d/0x2a0 net/ipv4/route.c:2702
> >  __ip_route_output_key include/net/route.h:169 [inline]
> >  ip_route_output_flow+0x2a/0x150 net/ipv4/route.c:2929
> >  ip4_datagram_release_cb+0x89d/0xbe0 net/ipv4/datagram.c:118
> >  release_sock+0x206/0x260 net/core/sock.c:3861
> >  inet_shutdown+0x2b1/0x390 net/ipv4/af_inet.c:950
> >  udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
> >  fou_release net/ipv4/fou_core.c:562 [inline]
> >  fou_exit_net+0x17d/0x1f0 net/ipv4/fou_core.c:1230
> >  ops_exit_list net/core/net_namespace.c:199 [inline]
> >  ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252
> >  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
> >  process_one_work kernel/workqueue.c:3314 [inline]
> >  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
> >  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
> >  kthread+0x389/0x470 kernel/kthread.c:436
> >  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >  </TASK>
> >
> > Allocated by task 19121:
> >  kasan_save_stack mm/kasan/common.c:57 [inline]
> >  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
> >  poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
> >  __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
> >  kasan_kmalloc include/linux/kasan.h:263 [inline]
> >  __do_kmalloc_node mm/slub.c:5296 [inline]
> >  __kmalloc_node_track_caller_noprof+0x4d7/0x7b0 mm/slub.c:5408
> >  kmemdup_noprof+0x2b/0x70 mm/util.c:138
> >  kmemdup_noprof include/linux/fortify-string.h:763 [inline]
> >  fib_rules_register+0x2f/0x400 net/core/fib_rules.c:170
> >  fib4_rules_init+0x21/0x160 net/ipv4/fib_rules.c:508
> >  ip_fib_net_init net/ipv4/fib_frontend.c:1578 [inline]
> >  fib_net_init+0x17a/0x3e0 net/ipv4/fib_frontend.c:1628
> >  ops_init+0x35d/0x5d0 net/core/net_namespace.c:137
> >  setup_net+0x118/0x350 net/core/net_namespace.c:446
> >  copy_net_ns+0x4f9/0x720 net/core/net_namespace.c:579
> >  create_new_namespaces+0x3f0/0x6b0 kernel/nsproxy.c:132
> >  unshare_nsproxy_namespaces+0x149/0x190 kernel/nsproxy.c:234
> >  ksys_unshare+0x57d/0xa00 kernel/fork.c:3242
> >  __do_sys_unshare kernel/fork.c:3316 [inline]
> >  __se_sys_unshare kernel/fork.c:3314 [inline]
> >  __x64_sys_unshare+0x38/0x50 kernel/fork.c:3314
> >  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> >  do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Freed by task 12641:
> >  kasan_save_stack mm/kasan/common.c:57 [inline]
> >  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
> >  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:584
> >  poison_slab_object mm/kasan/common.c:253 [inline]
> >  __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
> >  kasan_slab_free include/linux/kasan.h:235 [inline]
> >  slab_free_hook mm/slub.c:2689 [inline]
> >  __rcu_free_sheaf_prepare+0x12d/0x2a0 mm/slub.c:2940
> >  rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
> >  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
> >  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
> >  handle_softirqs+0x225/0x840 kernel/softirq.c:622
> >  do_softirq+0x76/0xd0 kernel/softirq.c:523
> >  __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
> >  unregister_netdevice_many_notify+0x1874/0x2150 net/core/dev.c:12445
> >  ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
> >  ops_undo_list+0x391/0x8d0 net/core/net_namespace.c:248
> >  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
> >  process_one_work kernel/workqueue.c:3314 [inline]
> >  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
> >  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
> >  kthread+0x389/0x470 kernel/kthread.c:436
> >  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >
> > The buggy address belongs to the object at ffff88804ec4c600
> >  which belongs to the cache kmalloc-192 of size 192
> > The buggy address is located 128 bytes inside of
> >  freed 192-byte region [ffff88804ec4c600, ffff88804ec4c6c0)
> >
> > The buggy address belongs to the physical page:
> > page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4ec4c
> > flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
> > page_type: f5(slab)
> > raw: 00fff00000000000 ffff88813fe163c0 dead000000000100 dead000000000122
> > raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
> > page dumped because: kasan: bad access detected
> > page_owner tracks the page as allocated
> > page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 13856, tgid 13853 (syz.3.2144), ts 351172300879, free_ts 351133053454
> >  set_page_owner include/linux/page_owner.h:32 [inline]
> >  post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
> >  prep_new_page mm/page_alloc.c:1861 [inline]
> >  get_page_from_freelist+0x24ae/0x2530 mm/page_alloc.c:3941
> >  __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
> >  alloc_slab_page mm/slub.c:3278 [inline]
> >  allocate_slab+0x77/0x660 mm/slub.c:3467
> >  new_slab mm/slub.c:3525 [inline]
> >  refill_objects+0x336/0x3d0 mm/slub.c:7272
> >  refill_sheaf mm/slub.c:2816 [inline]
> >  __pcs_replace_empty_main+0x320/0x720 mm/slub.c:4652
> >  alloc_from_pcs mm/slub.c:4750 [inline]
> >  slab_alloc_node mm/slub.c:4884 [inline]
> >  __do_kmalloc_node mm/slub.c:5295 [inline]
> >  __kmalloc_noprof+0x464/0x750 mm/slub.c:5308
> >  kmalloc_noprof include/linux/slab.h:954 [inline]
> >  kzalloc_noprof include/linux/slab.h:1188 [inline]
> >  new_dir fs/proc/proc_sysctl.c:966 [inline]
> >  get_subdir fs/proc/proc_sysctl.c:1010 [inline]
> >  sysctl_mkdir_p fs/proc/proc_sysctl.c:1320 [inline]
> >  __register_sysctl_table+0xc02/0x1370 fs/proc/proc_sysctl.c:1395
> >  neigh_sysctl_register+0x9b1/0xa90 net/core/neighbour.c:3915
> >  addrconf_sysctl_register+0xb3/0x1c0 net/ipv6/addrconf.c:7396
> >  ipv6_add_dev+0xd26/0x13a0 net/ipv6/addrconf.c:460
> >  addrconf_notify+0x771/0x1050 net/ipv6/addrconf.c:3679
> >  notifier_call_chain+0x1a5/0x3d0 kernel/notifier.c:85
> >  call_netdevice_notifiers_extack net/core/dev.c:2288 [inline]
> >  call_netdevice_notifiers net/core/dev.c:2302 [inline]
> >  register_netdevice+0x18db/0x1f00 net/core/dev.c:11474
> >  macsec_newlink+0x706/0x1200 drivers/net/macsec.c:4218
> >  rtnl_newlink_create+0x310/0xb00 net/core/rtnetlink.c:3905
> > page last free pid 12657 tgid 12657 stack trace:
> >  reset_page_owner include/linux/page_owner.h:25 [inline]
> >  __free_pages_prepare mm/page_alloc.c:1397 [inline]
> >  __free_frozen_pages+0xc0d/0xd20 mm/page_alloc.c:2938
> >  __tlb_remove_table_free mm/mmu_gather.c:228 [inline]
> >  tlb_remove_table_rcu+0x85/0x100 mm/mmu_gather.c:291
> >  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
> >  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
> >  handle_softirqs+0x225/0x840 kernel/softirq.c:622
> >  __do_softirq kernel/softirq.c:656 [inline]
> >  invoke_softirq kernel/softirq.c:496 [inline]
> >  __irq_exit_rcu+0xca/0x220 kernel/softirq.c:735
> >  irq_exit_rcu+0x9/0x30 kernel/softirq.c:752
> >  instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1061 [inline]
> >  sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1061
> >  asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
> >
> > Memory state around the buggy address:
> >  ffff88804ec4c580: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
> >  ffff88804ec4c600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > >ffff88804ec4c680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> >                    ^
> >  ffff88804ec4c700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >  ffff88804ec4c780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
> > ==================================================================
> >
> >
> > ---
> > This report is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at syzkaller@googlegroups.com.
> >
> > syzbot will keep track of this issue. See:
> > https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> >
> > If the report is already addressed, let syzbot know by replying with:
> > #syz fix: exact-commit-title
> >
> > If you want to overwrite report's subsystems, reply with:
> > #syz set subsystems: new-subsystem
> > (See the list of subsystem names on the web dashboard)
> >
> > If the report is a duplicate of another one, reply with:
> > #syz dup: exact-subject-of-another-report
> >
> > If you want to undo deduplication, reply with:
> > #syz undup

^ permalink raw reply

* [PATCH net 0/5] rxrpc: Miscellaneous fixes
From: David Howells @ 2026-06-16 15:57 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel

Here are some miscellaneous AF_RXRPC fixes for more stuff found by Sashiko[1]:

 (1) Reject ACKALL packets for calls not in Tx or immediate post-Tx state.

 (2) Fix connection leak from AF_RXRPC recvmsg userspace OOB handling.

 (3) Fix double unlock in AF_RXRPC recvmsg userspace OOB handling.

 (4) Fix AFS preallocate charge to flush the workqueue after unlistening
     the socket so that any charging thread that does manage to get started
     will be waited for before socket destruction.

 (5) Fix AFS OOB notify handling to cancel in-progress OOB notification
     handling and then to flush the workqueue it's on.

David

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-fixes

[1] https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com

David Howells (4):
  rxrpc: Fix leak of connection from OOB challenge
  rxrpc: Fix double unlock in rxrpc_recvmsg()
  afs: Fix further netns teardown to cancel the preallocation charger
  afs: Fix uncancelled rxrpc OOB message handler

Wyatt Feng (1):
  rxrpc: input: reject ACKALL outside transmit phase

 fs/afs/cm_security.c |  3 ++-
 fs/afs/rxrpc.c       |  5 ++++-
 net/rxrpc/input.c    | 16 +++++++++++++++-
 net/rxrpc/oob.c      |  5 +++++
 net/rxrpc/recvmsg.c  |  2 +-
 5 files changed, 27 insertions(+), 4 deletions(-)


^ permalink raw reply

* [PATCH net 1/5] rxrpc: input: reject ACKALL outside transmit phase
From: David Howells @ 2026-06-16 15:57 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Wyatt Feng, stable, Yuan Tan, Yifan Wu, Juefei Pu,
	Zhengchuan Liang, Xin Liu, Ren Wei
In-Reply-To: <20260616155749.2125907-1-dhowells@redhat.com>

From: Wyatt Feng <bronzed_45_vested@icloud.com>

rxrpc_input_ackall() accepts ACKALL packets without checking whether
the call is in a state that can legitimately have outstanding transmit
buffers.  A forged ACKALL can therefore reach a new service call in
RXRPC_CALL_SERVER_RECV_REQUEST before any reply packets have been
queued.

In that state call->tx_top is zero and call->tx_queue is NULL, so
rxrpc_rotate_tx_window() dereferences a NULL txqueue and triggers a
null-pointer dereference.

Fix rxrpc_input_ackall() to mirror the transmit-state gating already
used for normal ACK processing, and ignore ACKALL when there is no
outstanding transmit window to rotate.

Fixes: b341a0263b1b ("rxrpc: Implement progressive transmission queue struct")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
---
 net/rxrpc/input.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index ce761466b02d..37881dffa898 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -1214,8 +1214,22 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb)
 static void rxrpc_input_ackall(struct rxrpc_call *call, struct sk_buff *skb)
 {
 	struct rxrpc_ack_summary summary = { 0 };
+	rxrpc_seq_t top = READ_ONCE(call->tx_top);
+
+	switch (__rxrpc_call_state(call)) {
+	case RXRPC_CALL_CLIENT_SEND_REQUEST:
+	case RXRPC_CALL_CLIENT_AWAIT_REPLY:
+	case RXRPC_CALL_SERVER_SEND_REPLY:
+	case RXRPC_CALL_SERVER_AWAIT_ACK:
+		break;
+	default:
+		return;
+	}
+
+	if (call->tx_bottom == top)
+		return;
 
-	if (rxrpc_rotate_tx_window(call, call->tx_top, &summary))
+	if (rxrpc_rotate_tx_window(call, top, &summary))
 		rxrpc_end_tx_phase(call, false, rxrpc_eproto_unexpected_ackall);
 }
 


^ permalink raw reply related
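
The gating that patch 1/5 adds to rxrpc_input_ackall() can be illustrated
with a small user-space model.  This is only a sketch: the MODEL_* states,
struct model_call and model_ackall_may_rotate() are hypothetical stand-ins
invented for illustration, not kernel definitions, and the READ_ONCE()
snapshot of tx_top used in the real patch is left out because the model is
single-threaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the rxrpc call states. */
enum model_call_state {
	MODEL_CLIENT_SEND_REQUEST,
	MODEL_CLIENT_AWAIT_REPLY,
	MODEL_CLIENT_RECV_REPLY,
	MODEL_SERVER_RECV_REQUEST,
	MODEL_SERVER_SEND_REPLY,
	MODEL_SERVER_AWAIT_ACK,
};

/* Hypothetical model of the two Tx window edges the patch compares. */
struct model_call {
	enum model_call_state state;
	uint32_t tx_bottom;	/* lower edge of the Tx window (model) */
	uint32_t tx_top;	/* upper edge of the Tx window (model) */
};

/*
 * Return true only if an ACKALL would be allowed to rotate the Tx window:
 * the call must be in a transmit or immediate post-transmit state, and the
 * window must not be empty.
 */
static inline bool model_ackall_may_rotate(const struct model_call *call)
{
	switch (call->state) {
	case MODEL_CLIENT_SEND_REQUEST:
	case MODEL_CLIENT_AWAIT_REPLY:
	case MODEL_SERVER_SEND_REPLY:
	case MODEL_SERVER_AWAIT_ACK:
		break;
	default:
		/* e.g. a new service call still receiving its request */
		return false;
	}

	/* Nothing outstanding, so there is no window to rotate. */
	return call->tx_bottom != call->tx_top;
}
```

In this model a forged ACKALL hitting a call in MODEL_SERVER_RECV_REQUEST
with tx_bottom == tx_top == 0 is ignored by the state check alone; the
empty-window check is a second guard for calls that are in a transmit
state but have nothing queued.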

* [PATCH net 2/5] rxrpc: Fix leak of connection from OOB challenge
From: David Howells @ 2026-06-16 15:57 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	stable
In-Reply-To: <20260616155749.2125907-1-dhowells@redhat.com>

Fix leak of connection object from OOB challenge queue when response is
provided by userspace.

Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 net/rxrpc/oob.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/rxrpc/oob.c b/net/rxrpc/oob.c
index 05ca9c1faa57..3318c8bd82ad 100644
--- a/net/rxrpc/oob.c
+++ b/net/rxrpc/oob.c
@@ -210,6 +210,11 @@ static int rxrpc_respond_to_oob(struct rxrpc_sock *rx,
 		break;
 	}
 
+	switch (skb->mark) {
+	case RXRPC_OOB_CHALLENGE:
+		rxrpc_put_connection(sp->chall.conn, rxrpc_conn_put_oob);
+		break;
+	}
 	rxrpc_free_skb(skb, rxrpc_skb_put_oob);
 	return ret;
 }


^ permalink raw reply related

* [PATCH net 3/5] rxrpc: Fix double unlock in rxrpc_recvmsg()
From: David Howells @ 2026-06-16 15:57 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	stable
In-Reply-To: <20260616155749.2125907-1-dhowells@redhat.com>

Fix a double unlock in rxrpc_recvmsg() when dealing with OOB messages.

Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 net/rxrpc/recvmsg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 82614cbdb60f..39a03684432d 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -471,7 +471,7 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 		release_sock(&rx->sk);
 		if (ret == -EAGAIN)
 			goto try_again;
-		goto error_no_call;
+		goto error_trace;
 	}
 
 	/* Find the next call and dequeue it if we're not just peeking.  If we


^ permalink raw reply related

* [PATCH net 4/5] afs: Fix further netns teardown to cancel the preallocation charger
From: David Howells @ 2026-06-16 15:57 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Li Daming, Ren Wei, Jeffrey Altman, stable
In-Reply-To: <20260616155749.2125907-1-dhowells@redhat.com>

When an afs network namespace is torn down, it cancels and waits for the
work item that keeps the preallocated rxrpc call/conn/peer queue charged
before disabling incoming (i.e. listen 0), but there's a small window in
which it can be requeued by an incoming call wending through the I/O
thread.

Fix this by flushing the workqueue on which the charger runs after reducing
the listen backlog to zero.

Fixes: 47694fbc9d24 ("afs: Fix netns teardown to cancel the preallocation charger")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 fs/afs/rxrpc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index d5cfd24e815b..fd2d260fb25f 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -130,6 +130,7 @@ void afs_close_socket(struct afs_net *net)
 	cancel_work_sync(&net->charge_preallocation_work);
 	kernel_listen(net->socket, 0);
 	flush_workqueue(afs_async_calls);
+	flush_workqueue(afs_wq);
 
 	if (net->spare_incoming_call) {
 		afs_put_call(net->spare_incoming_call);


^ permalink raw reply related

* [PATCH net 5/5] afs: Fix uncancelled rxrpc OOB message handler
From: David Howells @ 2026-06-16 15:57 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, Marc Dionne, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, linux-afs, linux-kernel,
	Li Daming, Ren Wei, Jeffrey Altman, stable
In-Reply-To: <20260616155749.2125907-1-dhowells@redhat.com>

Fix AFS to cancel its OOB message processing (typically to respond to
security challenges).  Also move OOB message processing to afs_wq so that
it's also waited for and make the OOB handler just return if the net
namespace is no longer live.

Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE")
Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org
---
 fs/afs/cm_security.c | 3 ++-
 fs/afs/rxrpc.c       | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/afs/cm_security.c b/fs/afs/cm_security.c
index edcbd249d202..103168c70dd4 100644
--- a/fs/afs/cm_security.c
+++ b/fs/afs/cm_security.c
@@ -101,7 +101,8 @@ void afs_process_oob_queue(struct work_struct *work)
 	struct sk_buff *oob;
 	enum rxrpc_oob_type type;
 
-	while ((oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
+	while (READ_ONCE(net->live) &&
+	       (oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
 		switch (type) {
 		case RXRPC_OOB_CHALLENGE:
 			afs_respond_to_challenge(oob);
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index fd2d260fb25f..6241f9349f6b 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -128,6 +128,7 @@ void afs_close_socket(struct afs_net *net)
 	_enter("");
 
 	cancel_work_sync(&net->charge_preallocation_work);
+	cancel_work_sync(&net->rx_oob_work);
 	kernel_listen(net->socket, 0);
 	flush_workqueue(afs_async_calls);
 	flush_workqueue(afs_wq);
@@ -985,5 +986,6 @@ static void afs_rx_notify_oob(struct sock *sk, struct sk_buff *oob)
 {
 	struct afs_net *net = sk->sk_user_data;
 
-	schedule_work(&net->rx_oob_work);
+	if (net->live)
+		queue_work(afs_wq, &net->rx_oob_work);
 }


^ permalink raw reply related

* Re: [PATCH 3/4] vhost/vsock: suppress EHOSTUNREACH fast-fail during CPR pause
From: Andrey Drobyshev @ 2026-06-16 15:58 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <ajFUk7quPhbI7Te-@sgarzare-redhat>

On 6/16/26 5:18 PM, Stefano Garzarella wrote:
> On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Earlier commit ("ms/vhost/vsock: Refuse the connection immediately when
> 
> Please follow 
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes 
> on how to refer to a commit.
>

I omitted the hash on purpose as the commit is not yet in the mainline
tree, although our series is based and depends on it, as I mentioned:

https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/commit/?id=bb26ed5f3a8b

So it's a different (Michael's) repo and the commit is about to get
merged (but not yet there).  But maybe usual reference style + repo link
would be better.

>> guest isn't ready") added a fast-fail in vhost_transport_send_pkt().  It
>> rejects every host send with -EHOSTUNREACH until the destination calls
>> SET_RUNNING(1).  The fast-fail condition checks whether device's backends
>> are dropped, and if they're, the guest is considered to be not ready.
> 
> Okay, so it's not a regression, I mean without this series that patch is 
> not adding any regression, no?
> 
> If it's the case, I'll change the wording in the cover letter.
>

Agreed.

>>
>> However, there might be other reasons for backends to be nulled.  In
>> particular, when QEMU is performing CPR (checkpoint-restore) migration,
>> device ownership is being RESET and SET again, which leads to backends
>> drop and reattach.  If we end up connecting during this window, an
>> AF_VSOCK client gets -EHOSTUNREACH, which is wrong.
> 
> Please add this change before starting to support VHOST_RESET_OWNER 
> ioctl in vhost-vsock, otherwise we are breaking the bisectability.
>

Agreed.

>>
>> Add a cpr_paused flag set inside vhost_vsock_drop_backends() when the
>> backend was previously live, cleared by vhost_vsock_start(). When set,
>> vhost_transport_send_pkt() queues the skb instead of fast-failing; the
>> existing kick of send_pkt_work in vhost_vsock_start() drains it on
>> resume. A device that has never run keeps cpr_paused == false and the
>> boot-time fast-fail behaviour is preserved.
>>
>> Pair the cpr_paused store with the backend store using an
>> smp_wmb()/smp_rmb() pair so a concurrent sender on a weakly-ordered
>> architecture never observes (NULL backend, !paused):
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>> drivers/vhost/vsock.c | 22 +++++++++++++++++++---
>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index e629886e5cf8..bcaba36becd7 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -61,6 +61,7 @@ struct vhost_vsock {
>>
>> 	u32 guest_cid;
>> 	bool seqpacket_allow;
>> +	bool cpr_paused;	/* between stop and next start */
>> };
>>
>> static u32 vhost_transport_get_local_cid(void)
>> @@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
>> 	 * the mutex would be too expensive in this hot path, and we already have
>> 	 * all the outcomes covered: if the backend becomes NULL right after the check,
>> 	 * vhost_transport_do_send_pkt() will check it under the mutex anyway.
>> +	 *
>> +	 * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
>> +	 * The kick in vhost_vsock_start() will drain them on resume.
>> 	 */
>> 	if (unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
>> -		rcu_read_unlock();
>> -		kfree_skb(skb);
>> -		return -EHOSTUNREACH;
>> +		smp_rmb();	/* pairs with smp_wmb() in start/drop_backends */
>> +		if (!READ_ONCE(vsock->cpr_paused)) {
> 
> Can we avoid this which is not really readable and maybe add a single 
> variable to control the fast-fail at all?
> 
> I mean replacing both cpr_paused + backend-pointer with a single 
> `started` flag: set it to false at open, true on start via 
> smp_store_release(), back to false on normal stop, and leave it true 
> during CPR pause.
> 
> The reader in send_pkt can do just:
> 
>      if (!smp_load_acquire(&vsock->started))
>          return -EHOSTUNREACH;
> 
> WDYT?
>

I don't think it's gonna work as suggested.  As I understand, the order
during CPR migration is:

1) SET_RUNNING(0)
       -> vhost_vsock_stop()
           -> vhost_vsock_drop_backends()
2) RESET_OWNER
       -> vhost_vsock_drop_backends()
3) SET_OWNER
4) SET_RUNNING(1)
       -> vhost_vsock_start
           -> for (...) vhost_vq_set_backend()

(Btw, I just noticed that the backends are already NULL at step 2), but
that's just our CPR case; for other potential RESET_OWNER users it might
not be the case.)

So the race window starts at 1), not at 2).  We have no way of
differentiating whether device is actually being stopped for good, or
we're in the middle of CPR.  If we set the flag to false on stop as you
suggested, we'll still hit the -EHOSTUNREACH case eventually, and
avoiding it is the whole purpose of this patch.

The fast-fail with -EHOSTUNREACH relies on the presence of backends.
IIUC the backend will only become set after the initial SET_RUNNING(1),
which will only happen once the guest driver writes something to the virtio
config register, QEMU catches it and calls SET_RUNNING(1).  So we have
ordering with the guest's actions here, which is logical.  But for our
issue that means that the only true marker of paused/not paused is the
presence of backends - and that's why the flag is set in
vhost_vsock_drop_backends().
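
To make the sequence above concrete, here is a minimal single-threaded
user-space model of the fast-fail decision.  It is only a sketch under
stated assumptions: struct model_vsock and the model_*() helpers are
hypothetical names invented for illustration, the backend pointer is
reduced to a bool, and the smp_wmb()/smp_rmb() pairing from the patch is
omitted because nothing here runs concurrently.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state consulted by the send path. */
struct model_vsock {
	bool backend;		/* models a non-NULL RX virtqueue backend */
	bool cpr_paused;	/* set when a live backend is dropped */
	int queued;		/* packets queued for the send worker */
};

/* Models SET_RUNNING(1): backends attached, pause flag cleared. */
static inline void model_start(struct model_vsock *v)
{
	v->backend = true;
	v->cpr_paused = false;
}

/*
 * Models the backend drop reached from both SET_RUNNING(0) and
 * RESET_OWNER: only a previously live backend marks the device paused.
 */
static inline void model_drop_backends(struct model_vsock *v)
{
	if (v->backend)
		v->cpr_paused = true;
	v->backend = false;
}

/* Models the send path: fast-fail only if never started. */
static inline int model_send_pkt(struct model_vsock *v)
{
	if (!v->backend && !v->cpr_paused)
		return -EHOSTUNREACH;

	v->queued++;	/* left for the send worker to drain later */
	return 0;
}
```

Walking the model through steps 1) to 4): a send before the first start
fails with -EHOSTUNREACH, a send after step 1) or step 2) is queued
because cpr_paused was set on the first drop, and the second drop leaves
the flag untouched since the backend is already gone.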

>> +			rcu_read_unlock();
>> +			kfree_skb(skb);
>> +			return -EHOSTUNREACH;
>> +		}
> 
> 
> That said claude here is reporting a potential issue that I think we 
> should consider:
>      After VHOST_RESET_OWNER, the guest CID stays in the hash, so 
>      vhost_transport_send_pkt() can still find the vsock, skip the 
>      fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while 
>      vhost_workers_free() is freeing workers without a synchronize_rcu() 
>      — risking a use-after-free. Also, any send_pkt_work queued between 
>      the last flush and worker teardown gets its VHOST_WORK_QUEUED bit 
>      stuck (the vhost task exits without draining), deadlocking 
>      host→guest traffic after restart.
> 
>      A synchronize_rcu() in vhost_workers_free() between the 
>      rcu_assign_pointer(NULL) loop and the destroy loop would close the 
>      use-after-free, and reinitializing send_pkt_work via 
>      vhost_work_init() after vhost_dev_reset_owner() returns would clear 
>      the stuck QUEUED bit.
> 
> 

Yes, this looks real indeed.  Though I couldn't hit the UAF issue while
testing host->guest transfer under KASAN.

>> 	}
>>
>> 	if (virtio_vsock_skb_reply(skb))
>> @@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> 		mutex_unlock(&vq->mutex);
>> 	}
>>
>> +	smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>> +	WRITE_ONCE(vsock->cpr_paused, false);
>> +
>> 	/* Some packets may have been queued before the device was started,
>> 	 * let's kick the send worker to send them.
>> 	 */
>> @@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock *vsock)
>>
>> 	lockdep_assert_held(&vsock->dev.mutex);
>>
>> +	if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
>> +		WRITE_ONCE(vsock->cpr_paused, true);
>> +		smp_wmb();	/* pairs with smp_rmb() in send_pkt */
>> +	}
> 
> Why here and not in vhost_vsock_reset_owner()?
> 
> Also having this here will set it to true also with 
> VHOST_VSOCK_SET_RUNNING(0), is that right?
>

That was added here precisely to cover the vhost_vsock_stop() case (see
above).

> Thanks,
> Stefano
> 
>> +
>> 	for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
>> 		vq = &vsock->vqs[i];
>>
>> @@ -728,6 +743,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
>>
>> 	vsock->guest_cid = 0; /* no CID assigned yet */
>> 	vsock->seqpacket_allow = false;
>> +	vsock->cpr_paused = false;
>>
>> 	atomic_set(&vsock->queued_replies, 0);
>>
>> -- 
>> 2.47.1
>>
> 


^ permalink raw reply

