Netdev List


* Re: [PATCH net-next 0/2] appletalk: move the protocol out of tree
From: Jakub Kicinski @ 2026-06-16 15:49 UTC
  To: Carsten Strotmann
  Cc: John Paul Adrian Glaubitz, davem, netdev, edumazet, pabeni,
	andrew+netdev, horms, geert, chleroy, npiggin, mpe, maddy,
	linux-mips, linux-m68k, linuxppc-dev
In-Reply-To: <A3590144-073C-46D6-8425-90EE0C4D48E8@strotmann.de>

On Tue, 16 Jun 2026 09:13:46 +0200 Carsten Strotmann wrote:
> I'm a user of AppleTalk and other "Retro"-Features in the Linux Kernel.
> 
> On 16 Jun 2026, at 2:55, Jakub Kicinski wrote:
> 
> > We can complain about the AI slop till the cows come home.
> > I don't like it, you don't like it. What difference does it make?
> >
> > If y'all have real solutions please share. Complaining about
> > "commercial interests" and "nuk[ing] everything in a panic reaction"
> > is not helpful.  
> 
> the solution, as Adrian pointed out, is to leave these features in
> the Linux kernel but have them disabled by default.

I think y'all need to internalize that "just leave it in" means work.
_Someone_ has to handle the reports and patches. And since nobody is
doing that the code is going to GitHub, where it can continue to "just
be left" or whatever, without racking up CVEs for the Linux kernel
and leading to maintainer burn out :/

> Maybe put a warning message in the kernel config tools that people
> should only enable these if they know what they are doing.
> 
> These "retro"-features should not pose any security risk if they are
> not compiled into a kernel.

Nobody is stopping you from using this code! It's perfectly suitable 
to be an out of tree module. Maybe it'd be harder if someone wanted to
remove a CPU architecture you want to use, but protocols are perfectly
fine as loadable modules. You can continue to use the code from:
 https://github.com/linux-netdev/mod-orphan

Presumably you could get Debian to package that and you wouldn't even
know the sources no longer live in the kernel tree.


* [PATCH 7.0 317/378] rxrpc: Fix the ACK parser to extract the SACK table for parsing
From: Greg Kroah-Hartman @ 2026-06-16 14:59 UTC
  To: stable
  Cc: Greg Kroah-Hartman, patches, Michael Bommarito, David Howells,
	Marc Dionne, Jeffrey Altman, Eric Dumazet, David S. Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, linux-afs, netdev,
	stable
In-Reply-To: <20260616145109.744539446@linuxfoundation.org>

7.0-stable review patch.  If anyone has any objections, please let me know.

------------------

From: David Howells <dhowells@redhat.com>

commit 333b6d5bb9f87827ac2639c737bf9613dbae7253 upstream.

Fix modification of the received skbuff in rxrpc_input_soft_acks() and a
potential incorrect access of the buffer in a fragmented UDP packet (the
packet would probably have to be deliberately pre-generated as fragmented)
when AF_RXRPC tries to extract the contents of the SACK table by copying
out the contents of the SACK table into a buffer before attempting to parse
it.

AF_RXRPC assumes that it can just call skb_condense() and then validly
access the SACK table from skb->data and that it will be a flat buffer -
but skb_condense() can silently fail to do anything under some
circumstances.

Note that whilst rxrpc_input_soft_acks() should be able to parse extended
ACKs, the rest of AF_RXRPC doesn't currently support that.

Further, there's then no need to call skb_condense() in rxrpc_input_ack(),
so don't.

Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs")
Reported-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://lore.kernel.org/r/20260513180907.2061972-1-michael.bommarito@gmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: stable@kernel.org
Link: https://patch.msgid.link/105362.1780573560@warthog.procyon.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/rxrpc/input.c |   26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -963,23 +963,34 @@ static void rxrpc_input_soft_acks(struct
 	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 	struct rxrpc_txqueue *tq = call->tx_queue;
 	unsigned long extracted = ~0UL;
-	unsigned int nr = 0;
+	unsigned int nr = 0, nsack;
 	rxrpc_seq_t seq = call->acks_hard_ack + 1;
 	rxrpc_seq_t lowest_nak = seq + sp->ack.nr_acks;
-	u8 *acks = skb->data + sizeof(struct rxrpc_wire_header) + sizeof(struct rxrpc_ackpacket);
+	u8 sack[256] __aligned(sizeof(unsigned long));
+	u8 *acks = sack;
 
 	_enter("%x,%x,%u", tq->qbase, seq, sp->ack.nr_acks);
 
 	while (after(seq, tq->qbase + RXRPC_NR_TXQUEUE - 1))
 		tq = tq->next;
 
+	/* Extract an individual SACK table.  A normal SACK table is up to 255
+	 * bytes with 1 ACK flag per byte, but an extended SACK table can be up
+	 * to 256 bytes with up to 8 ACK/NACK flags per byte.  The ACK flags go
+	 * across all bit 0's then all bit 1's, then all bit 2's, ...
+	 */
+	memset(sack, 0, sizeof(sack));
+	nsack = umin(sp->ack.nr_acks, 256);
+	if (skb_copy_bits(skb,
+			  sizeof(struct rxrpc_wire_header) + sizeof(struct rxrpc_ackpacket),
+			  sack, nsack) < 0)
+		return;
+
 	for (unsigned int i = 0; i < sp->ack.nr_acks; i++) {
 		/* Decant ACKs until we hit a txqueue boundary. */
+		if ((i & 255) == 0)
+			acks = sack;
 		shiftr_adv_rotr(acks, extracted);
-		if (i == 256) {
-			acks -= i;
-			i = 0;
-		}
 		seq++;
 		nr++;
 		if ((seq & RXRPC_TXQ_MASK) != 0)
@@ -1117,9 +1128,6 @@ static void rxrpc_input_ack(struct rxrpc
 	    skb_copy_bits(skb, ioffset, &trailer, sizeof(trailer)) < 0)
 		return rxrpc_proto_abort(call, 0, rxrpc_badmsg_short_ack_trailer);
 
-	if (nr_acks > 0)
-		skb_condense(skb);
-
 	call->acks_latest_ts = ktime_get_real();
 	call->acks_hard_ack = hard_ack;
 	call->acks_prev_seq = prev_pkt;
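
The layout described in the comment added by this patch can be illustrated with a minimal userspace sketch. It is not the kernel code: sack_extract() and sack_flag() are hypothetical helpers, memcpy() stands in for skb_copy_bits(), and the bit ordering is only a reading of the comment ("all bit 0's then all bit 1's"). The point it shows is that the table is first copied into a zero-filled 256-byte flat buffer, so later reads never depend on how the packet was laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SACK_BYTES 256u

/* Copy at most SACK_BYTES bytes of the received table into a zero-filled
 * flat buffer.  Bytes beyond the copied length stay 0.
 */
static size_t sack_extract(uint8_t dst[SACK_BYTES], const uint8_t *src,
			   size_t nr_acks)
{
	size_t n = nr_acks < SACK_BYTES ? nr_acks : SACK_BYTES;

	memset(dst, 0, SACK_BYTES);
	memcpy(dst, src, n);
	return n;
}

/* Assumed extended layout: flag i lives in byte (i mod 256), bit (i div 256),
 * i.e. the index wraps back to the start of the buffer every 256 flags.
 */
static unsigned int sack_flag(const uint8_t sack[SACK_BYTES], unsigned int i)
{
	assert(i < SACK_BYTES * 8);
	return (sack[i % SACK_BYTES] >> (i / SACK_BYTES)) & 1u;
}
```

The wrap in sack_flag() plays the same role as the "(i & 255) == 0" reset of the read pointer in the patch: the index returns to the start of the flat buffer every 256 entries instead of walking backwards through packet data.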




* Re: [syzbot] [net?] KASAN: slab-use-after-free Read in fib_rules_lookup
From: Ido Schimmel @ 2026-06-16 15:31 UTC
  To: syzbot, kuniyu
  Cc: davem, dsahern, edumazet, horms, kuba, linux-kernel, netdev,
	pabeni, syzkaller-bugs
In-Reply-To: <6a315824.b0403584.28d0ff.0000.GAE@google.com>

On Tue, Jun 16, 2026 at 07:05:24AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    72dfa4700f78 net: dsa: sja1105: fix lastused timestamp in ..

This includes commit 759923cf03b0 ("ipv4: fib: Convert
fib_net_exit_batch() to ->exit_rtnl().") that moved ip_fib_net_exit()
(and therefore fib4_rules_exit()) earlier in the netns dismantle path.

Kuniyuki, can you please take a look?

You can use this to reproduce:

#!/bin/bash

while true; do
	ip netns add ns1
	ip -n ns1 link set dev lo up
	ip -n ns1 address add 192.0.2.1/24 dev lo
	ip -n ns1 link add name dummy1 up type dummy
	ip -n ns1 address add 198.51.100.1/24 dev dummy1
	ip -n ns1 rule add ipproto tcp sport 12345 table 12345
	ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321
	ip netns del ns1
done

Thanks

> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=15794bd2580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a0842261b62cdea8
> dashboard link: https://syzkaller.appspot.com/bug?extid=965506b59a2de0b6905c
> compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/d4e16f50a97c/disk-72dfa470.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/6cd4a736e796/vmlinux-72dfa470.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/548b0011c8e8/bzImage-72dfa470.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
> 
> bond0 (unregistering): Released all slaves
> bond1 (unregistering): Released all slaves
> bond2 (unregistering): (slave dummy0): Releasing active interface
> bond2 (unregistering): Released all slaves
> ==================================================================
> BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
> Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641
> 
> CPU: 0 UID: 0 PID: 12641 Comm: kworker/u8:21 Not tainted syzkaller #0 PREEMPT(full) 
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
> Workqueue: netns cleanup_net
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
>  print_address_description+0x55/0x1e0 mm/kasan/report.c:378
>  print_report+0x58/0x70 mm/kasan/report.c:482
>  kasan_report+0x117/0x150 mm/kasan/report.c:595
>  fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
>  __fib_lookup+0x106/0x210 net/ipv4/fib_rules.c:96
>  ip_route_output_key_hash_rcu+0x294/0x2720 net/ipv4/route.c:2811
>  ip_route_output_key_hash+0x18d/0x2a0 net/ipv4/route.c:2702
>  __ip_route_output_key include/net/route.h:169 [inline]
>  ip_route_output_flow+0x2a/0x150 net/ipv4/route.c:2929
>  ip4_datagram_release_cb+0x89d/0xbe0 net/ipv4/datagram.c:118
>  release_sock+0x206/0x260 net/core/sock.c:3861
>  inet_shutdown+0x2b1/0x390 net/ipv4/af_inet.c:950
>  udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
>  fou_release net/ipv4/fou_core.c:562 [inline]
>  fou_exit_net+0x17d/0x1f0 net/ipv4/fou_core.c:1230
>  ops_exit_list net/core/net_namespace.c:199 [inline]
>  ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252
>  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
>  process_one_work kernel/workqueue.c:3314 [inline]
>  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
>  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
>  kthread+0x389/0x470 kernel/kthread.c:436
>  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>  </TASK>
> 
> Allocated by task 19121:
>  kasan_save_stack mm/kasan/common.c:57 [inline]
>  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
>  poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
>  __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
>  kasan_kmalloc include/linux/kasan.h:263 [inline]
>  __do_kmalloc_node mm/slub.c:5296 [inline]
>  __kmalloc_node_track_caller_noprof+0x4d7/0x7b0 mm/slub.c:5408
>  kmemdup_noprof+0x2b/0x70 mm/util.c:138
>  kmemdup_noprof include/linux/fortify-string.h:763 [inline]
>  fib_rules_register+0x2f/0x400 net/core/fib_rules.c:170
>  fib4_rules_init+0x21/0x160 net/ipv4/fib_rules.c:508
>  ip_fib_net_init net/ipv4/fib_frontend.c:1578 [inline]
>  fib_net_init+0x17a/0x3e0 net/ipv4/fib_frontend.c:1628
>  ops_init+0x35d/0x5d0 net/core/net_namespace.c:137
>  setup_net+0x118/0x350 net/core/net_namespace.c:446
>  copy_net_ns+0x4f9/0x720 net/core/net_namespace.c:579
>  create_new_namespaces+0x3f0/0x6b0 kernel/nsproxy.c:132
>  unshare_nsproxy_namespaces+0x149/0x190 kernel/nsproxy.c:234
>  ksys_unshare+0x57d/0xa00 kernel/fork.c:3242
>  __do_sys_unshare kernel/fork.c:3316 [inline]
>  __se_sys_unshare kernel/fork.c:3314 [inline]
>  __x64_sys_unshare+0x38/0x50 kernel/fork.c:3314
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Freed by task 12641:
>  kasan_save_stack mm/kasan/common.c:57 [inline]
>  kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
>  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:584
>  poison_slab_object mm/kasan/common.c:253 [inline]
>  __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
>  kasan_slab_free include/linux/kasan.h:235 [inline]
>  slab_free_hook mm/slub.c:2689 [inline]
>  __rcu_free_sheaf_prepare+0x12d/0x2a0 mm/slub.c:2940
>  rcu_free_sheaf+0x31/0x200 mm/slub.c:5850
>  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
>  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
>  handle_softirqs+0x225/0x840 kernel/softirq.c:622
>  do_softirq+0x76/0xd0 kernel/softirq.c:523
>  __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450
>  unregister_netdevice_many_notify+0x1874/0x2150 net/core/dev.c:12445
>  ops_exit_rtnl_list net/core/net_namespace.c:187 [inline]
>  ops_undo_list+0x391/0x8d0 net/core/net_namespace.c:248
>  cleanup_net+0x572/0x810 net/core/net_namespace.c:702
>  process_one_work kernel/workqueue.c:3314 [inline]
>  process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
>  worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
>  kthread+0x389/0x470 kernel/kthread.c:436
>  ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> 
> The buggy address belongs to the object at ffff88804ec4c600
>  which belongs to the cache kmalloc-192 of size 192
> The buggy address is located 128 bytes inside of
>  freed 192-byte region [ffff88804ec4c600, ffff88804ec4c6c0)
> 
> The buggy address belongs to the physical page:
> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4ec4c
> flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff)
> page_type: f5(slab)
> raw: 00fff00000000000 ffff88813fe163c0 dead000000000100 dead000000000122
> raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 13856, tgid 13853 (syz.3.2144), ts 351172300879, free_ts 351133053454
>  set_page_owner include/linux/page_owner.h:32 [inline]
>  post_alloc_hook+0x22d/0x280 mm/page_alloc.c:1853
>  prep_new_page mm/page_alloc.c:1861 [inline]
>  get_page_from_freelist+0x24ae/0x2530 mm/page_alloc.c:3941
>  __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5221
>  alloc_slab_page mm/slub.c:3278 [inline]
>  allocate_slab+0x77/0x660 mm/slub.c:3467
>  new_slab mm/slub.c:3525 [inline]
>  refill_objects+0x336/0x3d0 mm/slub.c:7272
>  refill_sheaf mm/slub.c:2816 [inline]
>  __pcs_replace_empty_main+0x320/0x720 mm/slub.c:4652
>  alloc_from_pcs mm/slub.c:4750 [inline]
>  slab_alloc_node mm/slub.c:4884 [inline]
>  __do_kmalloc_node mm/slub.c:5295 [inline]
>  __kmalloc_noprof+0x464/0x750 mm/slub.c:5308
>  kmalloc_noprof include/linux/slab.h:954 [inline]
>  kzalloc_noprof include/linux/slab.h:1188 [inline]
>  new_dir fs/proc/proc_sysctl.c:966 [inline]
>  get_subdir fs/proc/proc_sysctl.c:1010 [inline]
>  sysctl_mkdir_p fs/proc/proc_sysctl.c:1320 [inline]
>  __register_sysctl_table+0xc02/0x1370 fs/proc/proc_sysctl.c:1395
>  neigh_sysctl_register+0x9b1/0xa90 net/core/neighbour.c:3915
>  addrconf_sysctl_register+0xb3/0x1c0 net/ipv6/addrconf.c:7396
>  ipv6_add_dev+0xd26/0x13a0 net/ipv6/addrconf.c:460
>  addrconf_notify+0x771/0x1050 net/ipv6/addrconf.c:3679
>  notifier_call_chain+0x1a5/0x3d0 kernel/notifier.c:85
>  call_netdevice_notifiers_extack net/core/dev.c:2288 [inline]
>  call_netdevice_notifiers net/core/dev.c:2302 [inline]
>  register_netdevice+0x18db/0x1f00 net/core/dev.c:11474
>  macsec_newlink+0x706/0x1200 drivers/net/macsec.c:4218
>  rtnl_newlink_create+0x310/0xb00 net/core/rtnetlink.c:3905
> page last free pid 12657 tgid 12657 stack trace:
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  __free_pages_prepare mm/page_alloc.c:1397 [inline]
>  __free_frozen_pages+0xc0d/0xd20 mm/page_alloc.c:2938
>  __tlb_remove_table_free mm/mmu_gather.c:228 [inline]
>  tlb_remove_table_rcu+0x85/0x100 mm/mmu_gather.c:291
>  rcu_do_batch kernel/rcu/tree.c:2617 [inline]
>  rcu_core+0x78b/0x10a0 kernel/rcu/tree.c:2869
>  handle_softirqs+0x225/0x840 kernel/softirq.c:622
>  __do_softirq kernel/softirq.c:656 [inline]
>  invoke_softirq kernel/softirq.c:496 [inline]
>  __irq_exit_rcu+0xca/0x220 kernel/softirq.c:735
>  irq_exit_rcu+0x9/0x30 kernel/softirq.c:752
>  instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1061 [inline]
>  sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1061
>  asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
> 
> Memory state around the buggy address:
>  ffff88804ec4c580: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88804ec4c600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88804ec4c680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>                    ^
>  ffff88804ec4c700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff88804ec4c780: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
> ==================================================================
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> 
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
> 
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
> 
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report
> 
> If you want to undo deduplication, reply with:
> #syz undup


* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Sebastian Andrzej Siewior @ 2026-06-16 15:31 UTC
  To: Jakub Kicinski
  Cc: Petr Mladek, John Ogness, Sergey Senozhatsky, Peter Zijlstra,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616081128.04e2c8dd@kernel.org>

On 2026-06-16 08:11:28 [-0700], Jakub Kicinski wrote:
> > 
> > Adding sched and printk folks for opinions while eyeballing
> > WARN_ON_DEFERRED().
> 
> Thanks a lot for looking into this! To be clear - the printk_deferred /
> WARN_DEFERRED would be just for stable? Or there's still some
> sensitivity even with nbcon?

We already have printk_deferred(). WARN_DEFERRED() would be new. I
*think* this is not limited to netpoll/netconsole but applies to all console
drivers not using CON_NBCON if the printk (via WARN) occurs with the rq lock held.
I don't remember all the details but printk_deferred() was introduced to
circumvent this until printk is fixed.

Once we get rid of those legacy drivers and NBCON is the default we can
get rid of printk_deferred() :)

Sebastian


* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Jakub Kicinski @ 2026-06-16 15:28 UTC
  To: Sabrina Dubroca
  Cc: syzbot, davem, edumazet, horms, john.fastabend, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <ajFpejsl1ukTbG96@krikkit>

On Tue, 16 Jun 2026 17:19:22 +0200 Sabrina Dubroca wrote:
> I suspect err==0, and sock_error() consumed sk_err in between (the
> alternative would be err > 0).
> 
> Something like this?

Makes sense, but what's eating sk_err? Don't we depend on it being set
to avoid further state transitions once we hit a crypto error?
I thought that's why we don't consume sk_err in recvmsg and sendmsg in
the first place (we are not calling sock_error() anywhere)


* Re: [PATCH v2 net-next 1/1] tcp: Replace min_tso_segs() with tso_segs() CC callback for TCP Prague
From: Jakub Kicinski @ 2026-06-16 15:23 UTC
  To: Chia-Yu Chang (Nokia)
  Cc: edumazet@google.com, ncardwell@google.com, jolsa@kernel.org,
	yonghong.song@linux.dev, song@kernel.org,
	linux-kselftest@vger.kernel.org, memxor@gmail.com,
	shuah@kernel.org, martin.lau@linux.dev, ast@kernel.org,
	daniel@iogearbox.net, andrii@kernel.org, eddyz87@gmail.com,
	horms@kernel.org, dsahern@kernel.org, bpf@vger.kernel.org,
	netdev@vger.kernel.org, pabeni@redhat.com, jhs@mojatatu.com,
	stephen@networkplumber.org, davem@davemloft.net,
	andrew+netdev@lunn.ch, donald.hunter@gmail.com, kuniyu@google.com,
	ij@kernel.org, Koen De Schepper (Nokia), g.white@cablelabs.com,
	ingemar.s.johansson@ericsson.com, mirja.kuehlewind@ericsson.com,
	cheshire@apple.com, rs.ietf@gmx.at, Jason_Livingood@comcast.com,
	vidhi_goel@apple.com
In-Reply-To: <PAXPR07MB7984585F9348BB3F000EB330A3E52@PAXPR07MB7984.eurprd07.prod.outlook.com>

On Tue, 16 Jun 2026 12:23:11 +0000 Chia-Yu Chang (Nokia) wrote:
> > On Mon, 15 Jun 2026 18:51:02 -0700 Jakub Kicinski wrote:  
> > > On Sun, 14 Jun 2026 09:17:56 +0200 chia-yu.chang@nokia-bell-labs.com
> > > Eric, Neal, looks good?
> > >
> > > The min rtt thing in tcp_tso_autosize() helps a bit but if the sender 
> > > gets congested for a longer stretch min_rtts on new connections are 
> > > high and we're back to sending small TSO, keeping the sender overloaded.
> > > Which is to say - I _hope_ this also solves some of Meta's problems :)  
> > 
> > Ugh, I didn't see the Sashiko report, it's only CCed to the author and bpf@, not to netdev :/
> > 
> > The zero-check sounds legit. Let's revisit this after the merge window.  
> 
> Thanks for the comment, I will take action after the merge window.
> 
> And, please correct me if I am wrong, the next eligible submission is expected from 30-June, right?

It usually opens Monday morning (PST) so Jun 29th


* Re: [PATCH net 4/4] net: ti: icssg: Fix XSK zero copy TX during application wakeup
From: Jakub Kicinski @ 2026-06-16 15:19 UTC
  To: Meghana Malladi
  Cc: diogo.ivo, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, sdf, john.fastabend, hawk, daniel, ast, pabeni,
	edumazet, davem, andrew+netdev, bpf, linux-kernel, netdev,
	linux-arm-kernel, srk, Vignesh Raghavendra, Roger Quadros,
	danishanwar
In-Reply-To: <ed0bc332-0196-4613-8066-9b94f8ed0013@ti.com>

On Tue, 16 Jun 2026 16:41:00 +0530 Meghana Malladi wrote:
> On 6/16/26 04:51, Jakub Kicinski wrote:
> > On Fri, 12 Jun 2026 00:27:44 +0530 Meghana Malladi wrote:  
> >> @@ -169,9 +169,6 @@ static int emac_xsk_xmit_zc(struct prueth_emac *emac,
> >>   
> >>   		num_tx++;
> >>   	}
> >> -
> >> -	xsk_tx_release(tx_chn->xsk_pool);
> >> -	return num_tx;  
> > 
> > Why are you deleting this?
> >   
> 
> xsk_sendmsg() also calls this without an rcu-lock when transmitting the 
> packets if the xmit was successful, so I was assuming it is not required 
> and I removed this.

I think you still need it. Besides, seems like a separate cleanup.

> >>   void prueth_xmit_free(struct prueth_tx_chn *tx_chn,
> >> @@ -279,9 +276,6 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
> >>   		num_tx++;
> >>   	}
> >>   
> >> -	if (!num_tx)
> >> -		return 0;  
> > 
> > Does something prevent us from running all this code if budget is 0?
> > If budget is 0 we can complete normal Tx with skbs but we must
> > not touch any AF-XDP related state.
> 
> Can you elaborate more, I couldn't interpret your comment here

netpoll may call napi from any context, including from IRQ.
It uses budget of 0 to indicate that it's trying to only reap tx
completions, without doing any Rx or XDP work. XDPs can't be called
from IRQ context.

> >>   	netif_txq = netdev_get_tx_queue(ndev, chn);
> >>   	netdev_tx_completed_queue(netif_txq, num_tx, total_bytes);
> >>   
> >> @@ -306,7 +300,9 @@ int emac_tx_complete_packets(struct prueth_emac *emac, int chn,
> >>   
> >>   		netif_txq = netdev_get_tx_queue(ndev, chn);
> >>   		txq_trans_cond_update(netif_txq);  
> > 
> > This looks misplaced, now we will hit it even if we didn't complete
> > or submit any Tx.
> 
> This code needs to be hit for packet transmission in zero copy mode.
> emac_xsk_xmit_zc() submits the packets to the DMA in NAPI context,
> when the application wakes up the driver and triggers NAPI. Once the DMA
> transfer is done, the irq is triggered and NAPI is called, which handles
> the Tx packet completion and submits the next batch of Tx packets to the DMA.
> 
> The if (tx_chn->xsk_pool) check ensures this runs for zero copy
> only. Also, the (!num_tx) check above returns early during the application
> wakeup (where budget is zero), hence it is removed.

I'm commenting on txq_trans_cond_update(), you're calling it
effectively on every NAPI call when XSK is bound, whether
Tx is making progress or not.
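
The two review points in this message (a budget of 0 means "reap skb Tx completions only", and the queue's trans timestamp should be refreshed only when Tx made progress) can be sketched in userspace C. Everything below is hypothetical: struct fake_chan and fake_tx_poll() are stand-ins, not the icssg driver's code, and trans_updates merely counts the places where txq_trans_cond_update() would be called.

```c
#include <stdbool.h>

/* Hypothetical per-channel state, for illustration only. */
struct fake_chan {
	int skb_done;		/* completed skb transmissions waiting to be reaped */
	int xsk_done;		/* completed AF_XDP zero-copy transmissions */
	int xsk_queued;		/* AF_XDP descriptors waiting to be submitted */
	bool xsk_bound;		/* true when an XSK pool is bound to the channel */
	int trans_updates;	/* stand-in for txq_trans_cond_update() calls */
};

/* Returns the number of Tx events handled in this poll. */
static int fake_tx_poll(struct fake_chan *ch, int budget)
{
	/* skb completions may be reaped even when budget is 0 (netpoll). */
	int progress = ch->skb_done;

	ch->skb_done = 0;

	/* With budget 0 the caller may be in IRQ context, so all AF_XDP
	 * state is left untouched.
	 */
	if (budget > 0 && ch->xsk_bound) {
		progress += ch->xsk_done;
		ch->xsk_done = 0;
		progress += ch->xsk_queued;	/* submit the next batch */
		ch->xsk_queued = 0;
	}

	/* Refresh the timestamp only if something completed or was sent. */
	if (progress)
		ch->trans_updates++;

	return progress;
}
```

With this shape, an idle poll on a channel that has an XSK pool bound returns 0 and leaves the timestamp alone, which is the behaviour the last comment asks for.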


* Re: [syzbot] [net?] WARNING in tls_err_abort
From: Sabrina Dubroca @ 2026-06-16 15:19 UTC
  To: syzbot
  Cc: davem, edumazet, horms, john.fastabend, kuba, linux-kernel,
	netdev, pabeni, syzkaller-bugs
In-Reply-To: <6a315d48.b0403584.28d0ff.0002.GAE@google.com>

2026-06-16, 07:27:20 -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    f6033078a9e6 ip6_tunnel: annotate data-races around t->err..
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=122a98ae580000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=8697a140486f5628
> dashboard link: https://syzkaller.appspot.com/bug?extid=cca46a9d1276f38af2ae
> compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/7af9eb2b9b5a/disk-f6033078.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/4b7e03b76e68/vmlinux-f6033078.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/38042dd09caa/bzImage-f6033078.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+cca46a9d1276f38af2ae@syzkaller.appspotmail.com
> 
> ------------[ cut here ]------------
> err >= 0
> WARNING: net/tls/tls_sw.c:73 at tls_err_abort+0x5d/0x80 net/tls/tls_sw.c:73, CPU#0: kworker/0:11/6099
> Modules linked in:
> CPU: 0 UID: 0 PID: 6099 Comm: kworker/0:11 Not tainted syzkaller #0 PREEMPT(full) 
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
> Workqueue: pencrypt_serial padata_serial_worker
> RIP: 0010:tls_err_abort+0x5d/0x80 net/tls/tls_sw.c:73
> Code: e8 03 48 b9 00 00 00 00 00 fc ff df 0f b6 04 08 84 c0 75 1b 89 ab 9c 01 00 00 48 89 df 5b 5d e9 c9 a2 32 ff e8 a4 60 8a f7 90 <0f> 0b 90 eb c3 89 f9 80 e1 07 80 c1 03 38 c1 7c d9 e8 1d 9f f5 f7
> RSP: 0018:ffffc900069379e0 EFLAGS: 00010293
> RAX: ffffffff8a3adf8c RBX: ffff88807d1e0d80 RCX: ffff888058bfdd00
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
> RBP: 0000000000000000 R08: ffffe8ffffc513e3 R09: 1ffffd1ffff8a27c
> R10: dffffc0000000000 R11: ffffffff8a3c4d70 R12: ffff888028eaf400
> R13: ffff88804441030c R14: dffffc0000000000 R15: ffff888028eaf460
> FS:  0000000000000000(0000) GS:ffff8881252a0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f521f503ff8 CR3: 0000000086fc2000 CR4: 00000000003526f0
> Call Trace:
>  <TASK>
>  tls_encrypt_done+0x223/0x480 net/tls/tls_sw.c:500


	/* Check if error is previously set on socket */
	if (err || sk->sk_err) {
		rec = NULL;

		/* If err is already set on socket, return the same code */
		if (sk->sk_err) {
			ctx->async_wait.err = -sk->sk_err;
		} else {
			ctx->async_wait.err = err;
			tls_err_abort(sk, err);
		}
	}

I suspect err==0, and sock_error() consumed sk_err in between (the
alternative would be err > 0).

Something like this?

-------- 8< --------
@@ -473,6 +473,7 @@ static void tls_encrypt_done(void *data, int err)
 	struct scatterlist *sge;
 	struct sk_msg *msg_en;
 	struct sock *sk;
+	int sk_err;
 
 	if (err == -EINPROGRESS) /* see the comment in tls_decrypt_done() */
 		return;
@@ -489,12 +490,13 @@ static void tls_encrypt_done(void *data, int err)
 	sge->length += prot->prepend_size;
 
 	/* Check if error is previously set on socket */
-	if (err || sk->sk_err) {
+	sk_err = READ_ONCE(sk->sk_err);
+	if (err || sk_err) {
 		rec = NULL;
 
 		/* If err is already set on socket, return the same code */
-		if (sk->sk_err) {
-			ctx->async_wait.err = -sk->sk_err;
+		if (sk_err) {
+			ctx->async_wait.err = -sk_err;
 		} else {
 			ctx->async_wait.err = err;
 			tls_err_abort(sk, err);

-- 
Sabrina
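
The idea behind the proposed diff, reading sk_err once and making every decision from that snapshot, can be shown with a small userspace sketch. It assumes the hypothesis stated above (err == 0 and sk_err cleared between the two reads) and uses C11 atomics in place of READ_ONCE(). struct fake_sock, struct async_result and encrypt_done_decision() are hypothetical names, not kernel APIs.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fake_sock {
	_Atomic int sk_err;	/* positive errno, 0 when no error is pending */
};

struct async_result {
	int err;		/* value that would land in ctx->async_wait.err */
	bool aborted;		/* true where tls_err_abort() would be called */
};

static struct async_result encrypt_done_decision(struct fake_sock *sk, int err)
{
	struct async_result res = { 0, false };
	/* Single snapshot: both tests below see the same value, so a
	 * concurrent clear of sk_err cannot steer us into the abort
	 * branch with err == 0.
	 */
	int sk_err = atomic_load_explicit(&sk->sk_err, memory_order_relaxed);

	if (err || sk_err) {
		if (sk_err) {
			res.err = -sk_err;
		} else {
			res.err = err;
			res.aborted = true;
		}
	}
	return res;
}
```

In the original code the outer and inner conditions each read sk->sk_err separately, so the two reads can disagree. The sketch only removes that window; it does not answer who clears sk_err in the first place.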


* Re: [PATCH] net: ethtool: mm: Increase FPE verification retry count
From: Vladimir Oltean @ 2026-06-16 15:16 UTC
  To: Simon Horman
  Cc: muhammad.nazim.amirul.nazle.asmade, netdev, andrew, kuba, davem,
	edumazet, pabeni, faizal.abdul.rahim, linux-kernel
In-Reply-To: <20260616071925.GA800687@horms.kernel.org>

On Tue, Jun 16, 2026 at 08:19:25AM +0100, Simon Horman wrote:
> + Vladimir
> 
> On Mon, Jun 15, 2026 at 12:24:36AM -0700, muhammad.nazim.amirul.nazle.asmade@altera.com wrote:
> > From: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
> > 
> > The current FPE verification retry count is set to 3. However,
> > the IEEE 802.3br standard does not specify a fixed value for this.
> > A retry count of 3 may be insufficient when the remote device is
> > slow to respond during link-up. Increase the retry count to 20 to
> > improve robustness.
> > 
> > Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com>
> > Signed-off-by: Nazim Amirul <muhammad.nazim.amirul.nazle.asmade@altera.com>
> 
> Vladimir, I'm wondering if you could take a look at this one.

IEEE 802.3br is an obsolete standard, I don't even have access to it.

IEEE 802.3-2022 is the current one for the MAC Merge layer. Clause
99.4.7.2 Constants states:

verifyLimit: the integer 3, the number of verification attempts

I have nothing in principle against making verifyLimit configurable
beyond what IEEE 802.3 specifies, for debugging purposes or non-standard
applications, but keep the default at 3.
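
A minimal sketch of what "configurable, but default 3" could look like, in userspace C. All names are hypothetical (mm_verify(), struct slow_peer, VERIFY_LIMIT_DEFAULT); this is not the ethtool MAC Merge code, and treating a limit of 0 as "use the default" is an assumption made for the example.

```c
#include <stdbool.h>

#define VERIFY_LIMIT_DEFAULT 3u	/* verifyLimit constant quoted above */

typedef bool (*verify_probe_fn)(void *ctx);

/* Simulated link partner that answers only on its Nth probe. */
struct slow_peer {
	unsigned int answers_on;
	unsigned int seen;
};

static bool slow_peer_probe(void *ctx)
{
	struct slow_peer *p = ctx;

	return ++p->seen >= p->answers_on;
}

/* Try verification up to verify_limit times.  A limit of 0 selects the
 * default.  Returns the attempt number that succeeded, or 0 on failure.
 */
static unsigned int mm_verify(verify_probe_fn probe, void *ctx,
			      unsigned int verify_limit)
{
	if (verify_limit == 0)
		verify_limit = VERIFY_LIMIT_DEFAULT;

	for (unsigned int attempt = 1; attempt <= verify_limit; attempt++) {
		if (probe(ctx))
			return attempt;
	}
	return 0;
}
```

A peer that answers on the fifth probe fails with the default and succeeds only if the caller explicitly raises the limit, which keeps standard behaviour for everyone who does not opt in.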


* Re: [PATCH v2] net: macb: add TX stall timeout callback to recover from lost TSTART write
From: Théo Lebrun @ 2026-06-16 15:07 UTC
  To: Andrea della Porta, netdev, Nicolas Ferre, Claudiu Beznea,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-kernel, linux-arm-kernel, linux-rpi-kernel,
	Nicolai Buchwitz
  Cc: Lukasz Raczylo, Steffen Jaeckel
In-Reply-To: <468f480454a314303bac6a54780b153f689f2267.1781598350.git.andrea.porta@suse.com>

Hello Andrea,

On Tue Jun 16, 2026 at 3:23 PM CEST, Andrea della Porta wrote:
> From: Lukasz Raczylo <lukasz@raczylo.com>
>
> The MACB found in the Raspberry Pi RP1 suffers from sporadic stalls on
> the TX queue.
> While the exact root cause is not yet fully understood, it is likely
> related to a hardware issue where a TSTART write to the NCR register
> is missed, preventing the transmission from being kicked off.
>
> Implement a timeout callback to handle TX queue stalls, triggering the
> existing restart mechanism to recover.
>
> Link: https://lore.kernel.org/all/20260514215459.36109-1-lukasz@raczylo.com/
> Fixes: dc110d1b23564 ("net: cadence: macb: Add support for Raspberry Pi RP1 ethernet controller")
> Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
> Co-developed-by: Steffen Jaeckel <sjaeckel@suse.de>
> Signed-off-by: Steffen Jaeckel <sjaeckel@suse.de>
> Co-developed-by: Andrea della Porta <andrea.porta@suse.com>
> Signed-off-by: Andrea della Porta <andrea.porta@suse.com>

Thanks for this V2.

Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>

Any news from the Raspberry Pi community about this bug investigation?

Thanks,

--
Théo Lebrun, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com



* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Jakub Kicinski @ 2026-06-16 15:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Petr Mladek, John Ogness, Sergey Senozhatsky, Peter Zijlstra,
	Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260616103529.Yh9Dxsjp@linutronix.de>

On Tue, 16 Jun 2026 12:35:29 +0200 Sebastian Andrzej Siewior wrote:
> On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> > On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:  
> > > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > > +	local_bh_disable();
> > > + 	poll_napi(dev);
> > > +	_local_bh_enable();  
> > 
> > tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> > softirq into not waking ksoftirqd? The problematic path is:
> > 
> >   scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)
> > 
> > so the softirq may never get serviced.
> > 
> > In netcons we try to avoid touching the network driver if the Tx path
> > locks are already held. Ideally we'd do something similar with the
> > scheduler. Try to do bare minimum if we may be in the scheduler.
> > Failing that - don't poll the driver if we were called with irqs
> > already disabled.
> > 
> > Or maybe we only poll from console->write_thread ?  
> 
> So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> to NBCON console infrastructure"), because from that commit on writes are
> deferred to the nbcon thread. So this is purely about -stable in this case.
> 
> Looking at the patch, the amount of comments vs code changes looks
> somewhat hackish. That ifdef for PREEMPT_RT is not needed because on
> PREEMPT_RT we have either nbcon or the legacy console (including
> netconsole before the mentioned commit) wrapped in a dedicated thread
> (via force_legacy_kthread()).
> That means in both cases the flow never ends there and the problem is
> limited to !PREEMPT_RT.
> 
> Now. The scheduler usually does printk_deferred() because of the rq lock
> so it does not deadlock for various reasons. It is kind of a pity that
> the various WARN macros don't do that.
> I don't think that patch is enough. It works around the problem in this
> scenario but should the NIC driver invoke schedule_work() then we are
> back here again.
> Should the network driver acquire a lock then lockdep might observe
> rq -> driver-lock and then driver-lock -> rq and yell deadlock (CPU1
> doing AB and CPU2 doing BA). This also includes other console drivers,
> so it is not limited to netconsole.
> 
> Point being made is that we should avoid the callchain:
> 
> |  console_unlock
> |  vprintk_emit
> |  __warn
> |  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
> |  put_prev_entity
> |  put_prev_task_fair
> |  __schedule
> 
> basically a printk under the rq lock.
> 
> We could add printk_deferred_enter/exit() to all the rq_lock() variants.
> I think PeterZ loves this the most. And Greg will appreciate it too
> while backporting because of all the context changes.
> 
> We could also introduce WARN_ON_DEFERRED + variants which do the
> printk_deferred_enter/exit() thingy around the printk and replace
> all the WARNs in kernel/sched/.
> I *think* the tty/console layer has also a deadlock problem where it
> holds locks and then the WARN(), that never triggers, asks for the same
> locks again so we might have a second user…
> 
> Adding sched and printk folks for opinions while eyeballing
> WARN_ON_DEFERRED().

Thanks a lot for looking into this! To be clear - the printk_deferred /
WARN_DEFERRED would be just for stable? Or there's still some
sensitivity even with nbcon?
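
For readers following along, the WARN_ON_DEFERRED idea quoted above can
be pictured with a small userspace model. This is only a sketch: every
name below is hypothetical, and the real printk_deferred_enter/exit()
are per-CPU and hand the console work to a later context, which a
single-threaded model cannot reproduce. The point it shows is that the
message is queued while deferral is active and only written out later,
from a context where touching the console path is safe.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_QUEUE_SLOTS 8
#define MODEL_MSG_LEN     64

/* All names here are hypothetical; this models the idea, not the kernel API. */
static int  model_defer_depth;  /* > 0 while printing must be deferred */
static char model_queue[MODEL_QUEUE_SLOTS][MODEL_MSG_LEN];
static int  model_queued;       /* messages waiting for a later flush */
static int  model_direct;       /* messages written synchronously */

void model_deferred_enter(void)
{
	model_defer_depth++;
}

void model_deferred_exit(void)
{
	assert(model_defer_depth > 0);
	model_defer_depth--;
}

/* Emit one message: synchronously when safe, queued while deferral is active. */
void model_emit(const char *msg)
{
	if (model_defer_depth > 0) {
		if (model_queued < MODEL_QUEUE_SLOTS) {
			snprintf(model_queue[model_queued], MODEL_MSG_LEN, "%s", msg);
			model_queued++;
		}
		return; /* messages beyond the queue size are dropped in this sketch */
	}
	fputs(msg, stdout);
	model_direct++;
}

/* Hypothetical WARN_ON_DEFERRED-style helper: defer only around the print. */
int model_warn_on_deferred(int cond, const char *what)
{
	if (cond) {
		model_deferred_enter();
		model_emit(what);
		model_deferred_exit();
	}
	return cond;
}

/* Called later, from a safe context: write out everything that was queued. */
int model_flush(void)
{
	int flushed = model_queued;

	for (int i = 0; i < model_queued; i++)
		fputs(model_queue[i], stdout);
	model_queued = 0;
	return flushed;
}
```

A caller holding the (modelled) rq lock would use
model_warn_on_deferred(), and the queue would be drained by
model_flush() only after the lock has been dropped.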


* Re: [PATCH net-next v7 2/2] net: ti: icssg-prueth: Add ethtool ops for Frame Preemption MAC Merge
From: Jakub Kicinski @ 2026-06-16 15:07 UTC (permalink / raw)
  To: Meghana Malladi
  Cc: elfring, haokexin, vadim.fedorenko, devnexen, horms,
	jacob.e.keller, arnd, basharath, afd, parvathi, vladimir.oltean,
	rogerq, danishanwar, pabeni, edumazet, davem, andrew+netdev,
	linux-arm-kernel, netdev, linux-kernel, srk, vigneshr
In-Reply-To: <d0123269-b1e8-4fba-94b0-b94d3d9a5405@ti.com>

On Tue, 16 Jun 2026 18:24:22 +0530 Meghana Malladi wrote:
> >> Could the firmware-register lookup table used by emac_get_stat_by_name()
> >> be separated from the ethtool -S string table, so the new preemption
> >> counters feed get_mm_stats without also showing up under ethtool -S?  
> > 
> > This -- not sure about the other complaints but this one looks legit.  
> 
> I agree that this is legit, but right now there is no other place holder 
> other than pa stats to put the mac merge firmware counters. I believe
> the effort needs to go in re-structuring the hardware and firmware stats 
> implementation to address this issue.

icssg_all_miig_stats has a true / false that looks like it's supposed
to serve the same purpose? Maybe I don't understand what you're trying
to say


* Re: [PATCH v2] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-16 15:02 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: Stefan Hajnoczi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michael S. Tsirkin, Jason Wang, Bobby Eshleman,
	Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm, virtualization,
	netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <ajFFOJSdqpNWgohB@sgarzare-redhat>


On 16/06/2026 16:09, Stefano Garzarella wrote:
> On Sun, Jun 14, 2026 at 08:47:56PM +0300, Arseniy Krasnov wrote:
>> Logically it was based on the TCP implementation, so to make further
>> support easier, rewrite it in the TCP way.
>
> Hi Arseniy, and thank you so much for the patch!
>
> I’d like to ask you to expand on the message a bit, especially to explain why we’re making this change.
>
> In particular, I’d like to better understand whether this is just a cosmetic change or if we’re fixing any issues (and if so, which ones), so we can determine whether this patch should be backported to the stable branches.

This is a cosmetic change. I'll update the commit message in v3.

>
>>
>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>> ---
>> Changelog v1->v2:
>> * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>>   already added.
>
> Ah, okay, this is net-next material, please use the net-next tag (i.e. [PATCH net-next v2]).

Sure!

>
>>
>> net/vmw_vsock/virtio_transport_common.c | 48 ++++++++++++-------------
>> 1 file changed, 23 insertions(+), 25 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 09475007165b..787524b8cb44 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -328,38 +328,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>     if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>         return pkt_len;
>>
>> -    if (info->msg) {
>> -        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>> -         * there is no MSG_ZEROCOPY flag set.
>> +    if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>> +        /* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>> +         * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>> +         * handling from 'tcp_sendmsg_locked()'.
>>          */
>> -        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> -            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>> +        if (info->msg->msg_ubuf) {
>> +            uarg = info->msg->msg_ubuf;
>> +            can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>> +        } else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>> +            uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>> +                            NULL, false);
>> +            if (!uarg) {
>> +                virtio_transport_put_credit(vvs, pkt_len);
>> +                return -ENOMEM;
>> +            }
>>
>> -        if (info->msg->msg_flags & MSG_ZEROCOPY)
>>             can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>
>
> nit: we can remove this extra blank line.
>
> For the rest I can't see anything wrong, but a bit more context in the commit would help me in the review.

Ack, I'll update the commit message in v3. I also need to check some reports about pre-existing issues from sashiko, triggered by this patch.

Thanks!

>
> Thanks,
> Stefano
>


* Re: [PATCH net v4] virtio-net: fix len check in receive_big()
From: Bui Quang Minh @ 2026-06-16 14:43 UTC (permalink / raw)
  To: Xiang Mei
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev,
	virtualization, linux-kernel, bestswngs, mst, jasowang, xuanzhuo,
	eperezma
In-Reply-To: <20260616042837.2249468-1-xmei5@asu.edu>

On 6/16/26 11:28, Xiang Mei wrote:
> receive_big() bounds the device-announced length by
> (big_packets_num_skbfrags + 1) * PAGE_SIZE.  That is still too loose:
> add_recvbuf_big() sets sg[1] to start at offset
> sizeof(struct padded_vnet_hdr) into the first page, so the chain
> actually carries hdr_len + (PAGE_SIZE - sizeof(padded_vnet_hdr)) +
> big_packets_num_skbfrags * PAGE_SIZE bytes -- 20 bytes less than the
> check allows for the common hdr_len == 12 case.
>
> A malicious virtio backend can announce a len in that gap.  page_to_skb()
> then walks one frag past the page chain, storing a NULL page->private
> into skb_shinfo()->frags[MAX_SKB_FRAGS], which is both an out-of-bounds
> write past the static frag array and a NULL frag handed up the rx path.
>
> Bound len by the size add_recvbuf_big() actually advertised.
>
> Fixes: 0c716703965f ("virtio-net: fix received length check in big packets")

I think we should Cc: stable@vger.kernel.org

> Reported-by: Weiming Shi <bestswngs@gmail.com>
> Signed-off-by: Xiang Mei <xmei5@asu.edu>
> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
> v4: use easy to understand math to compute the max_len
> v3: revoke 2/2 and add Xuan Zhuo's Reviewed-by tag
> v2: add additional check as 2/2
>
>   drivers/net/virtio_net.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f4adcfee7a80..8f4562316aaa 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1999,15 +1999,16 @@ static struct sk_buff *receive_big(struct net_device *dev,
>   				   struct virtnet_rq_stats *stats)
>   {
>   	struct page *page = buf;
> +	unsigned long max_len = (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE -
> +				sizeof(struct padded_vnet_hdr) + vi->hdr_len;
>   	struct sk_buff *skb;
>   
>   	/* Make sure that len does not exceed the size allocated in
>   	 * add_recvbuf_big.
>   	 */
> -	if (unlikely(len > (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE)) {
> +	if (unlikely(len > max_len)) {
>   		pr_debug("%s: rx error: len %u exceeds allocated size %lu\n",
> -			 dev->name, len,
> -			 (vi->big_packets_num_skbfrags + 1) * PAGE_SIZE);
> +			 dev->name, len, max_len);
>   		goto err;
>   	}
>   

Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>

Thanks,
Quang Minh.
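
For anyone checking the arithmetic in the commit message, here is a
small worked sketch of the old and new bounds. The constants are
assumptions for illustration only: a 4096-byte page, and 32 bytes as a
stand-in for sizeof(struct padded_vnet_hdr), inferred from the quoted
"20 bytes less" figure with hdr_len == 12. The function names are
hypothetical and the real values depend on architecture and
configuration.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SIZE       4096UL
#define SKETCH_PADDED_HDR_SIZE 32UL /* stand-in for sizeof(struct padded_vnet_hdr) */

/* Bound used by the old check: whole pages only. */
unsigned long sketch_old_bound(unsigned long num_skbfrags)
{
	return (num_skbfrags + 1) * SKETCH_PAGE_SIZE;
}

/* Bound from the patch: drop the padded header area, add back hdr_len. */
unsigned long sketch_new_bound(unsigned long num_skbfrags, unsigned long hdr_len)
{
	assert(hdr_len <= SKETCH_PADDED_HDR_SIZE);
	return (num_skbfrags + 1) * SKETCH_PAGE_SIZE
		- SKETCH_PADDED_HDR_SIZE + hdr_len;
}

/* True when len passes the old check but exceeds what the buffers can hold. */
int sketch_len_in_gap(unsigned long len, unsigned long num_skbfrags,
		      unsigned long hdr_len)
{
	return len > sketch_new_bound(num_skbfrags, hdr_len) &&
	       len <= sketch_old_bound(num_skbfrags);
}
```

Taking 17 frags as an example value, the old bound is 18 * 4096 = 73728
and the new bound is 73728 - 32 + 12 = 73708, so the gap the patch
closes is the 20 lengths from 73709 through 73728.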



* [PATCH net] net: ethernet: ti: icssg: guard PA stat lookups
From: Philippe Schenker @ 2026-06-16 14:35 UTC (permalink / raw)
  To: netdev
  Cc: Philippe Schenker, danishanwar, rogerq, linux-arm-kernel, stable,
	Andrew Lunn, David Carlier, David S. Miller, Eric Dumazet,
	Jacob Keller, Jakub Kicinski, Kevin Hao, Meghana Malladi,
	Paolo Abeni, Simon Horman, Vadim Fedorenko, linux-kernel

From: Philippe Schenker <philippe.schenker@impulsing.ch>

icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name()
with FW PA stat names regardless of whether the PA stats block is
present on the hardware.  emac_get_stat_by_name() already guards the
PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer
is NULL the lookup falls through to netdev_err() and returns -EINVAL.
Because ndo_get_stats64 is polled regularly by the networking stack
this produces thousands of log entries of the form:

  icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR

A secondary consequence is that the int(-EINVAL) return value is
implicitly widened to a near-ULLONG_MAX unsigned value when accumulated
into the __u64 fields of rtnl_link_stats64, silently corrupting the
rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`.

Every other PA-aware code path in the driver is already guarded with
the same `if (emac->prueth->pa_stats)` check.  Apply the same guard
here.

Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats")

Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch>

Cc: danishanwar@ti.com
Cc: rogerq@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
---

 drivers/net/ethernet/ti/icssg/icssg_common.c | 48 +++++++++++---------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/ti/icssg/icssg_common.c b/drivers/net/ethernet/ti/icssg/icssg_common.c
index a28a608f9bf4..2dbb8f717de0 100644
--- a/drivers/net/ethernet/ti/icssg/icssg_common.c
+++ b/drivers/net/ethernet/ti/icssg/icssg_common.c
@@ -1628,28 +1628,34 @@ void icssg_ndo_get_stats64(struct net_device *ndev,
 	stats->rx_over_errors = emac_get_stat_by_name(emac, "rx_over_errors");
 	stats->multicast      = emac_get_stat_by_name(emac, "rx_multicast_frames");
 
-	stats->rx_errors  = ndev->stats.rx_errors +
-			    emac_get_stat_by_name(emac, "FW_RX_ERROR") +
-			    emac_get_stat_by_name(emac, "FW_RX_EOF_SHORT_FRMERR") +
-			    emac_get_stat_by_name(emac, "FW_RX_B0_DROP_EARLY_EOF") +
-			    emac_get_stat_by_name(emac, "FW_RX_EXP_FRAG_Q_DROP") +
-			    emac_get_stat_by_name(emac, "FW_RX_FIFO_OVERRUN");
-	stats->rx_dropped = ndev->stats.rx_dropped +
-			    emac_get_stat_by_name(emac, "FW_DROPPED_PKT") +
-			    emac_get_stat_by_name(emac, "FW_INF_PORT_DISABLED") +
-			    emac_get_stat_by_name(emac, "FW_INF_SAV") +
-			    emac_get_stat_by_name(emac, "FW_INF_SA_DL") +
-			    emac_get_stat_by_name(emac, "FW_INF_PORT_BLOCKED") +
-			    emac_get_stat_by_name(emac, "FW_INF_DROP_TAGGED") +
-			    emac_get_stat_by_name(emac, "FW_INF_DROP_PRIOTAGGED") +
-			    emac_get_stat_by_name(emac, "FW_INF_DROP_NOTAG") +
-			    emac_get_stat_by_name(emac, "FW_INF_DROP_NOTMEMBER");
+	stats->rx_errors  = ndev->stats.rx_errors;
+	stats->rx_dropped = ndev->stats.rx_dropped;
 	stats->tx_errors  = ndev->stats.tx_errors;
-	stats->tx_dropped = ndev->stats.tx_dropped +
-			    emac_get_stat_by_name(emac, "FW_RTU_PKT_DROP") +
-			    emac_get_stat_by_name(emac, "FW_TX_DROPPED_PACKET") +
-			    emac_get_stat_by_name(emac, "FW_TX_TS_DROPPED_PACKET") +
-			    emac_get_stat_by_name(emac, "FW_TX_JUMBO_FRM_CUTOFF");
+	stats->tx_dropped = ndev->stats.tx_dropped;
+
+	if (emac->prueth->pa_stats) {
+		stats->rx_errors  +=
+				emac_get_stat_by_name(emac, "FW_RX_ERROR") +
+				emac_get_stat_by_name(emac, "FW_RX_EOF_SHORT_FRMERR") +
+				emac_get_stat_by_name(emac, "FW_RX_B0_DROP_EARLY_EOF") +
+				emac_get_stat_by_name(emac, "FW_RX_EXP_FRAG_Q_DROP") +
+				emac_get_stat_by_name(emac, "FW_RX_FIFO_OVERRUN");
+		stats->rx_dropped +=
+				emac_get_stat_by_name(emac, "FW_DROPPED_PKT") +
+				emac_get_stat_by_name(emac, "FW_INF_PORT_DISABLED") +
+				emac_get_stat_by_name(emac, "FW_INF_SAV") +
+				emac_get_stat_by_name(emac, "FW_INF_SA_DL") +
+				emac_get_stat_by_name(emac, "FW_INF_PORT_BLOCKED") +
+				emac_get_stat_by_name(emac, "FW_INF_DROP_TAGGED") +
+				emac_get_stat_by_name(emac, "FW_INF_DROP_PRIOTAGGED") +
+				emac_get_stat_by_name(emac, "FW_INF_DROP_NOTAG") +
+				emac_get_stat_by_name(emac, "FW_INF_DROP_NOTMEMBER");
+		stats->tx_dropped +=
+				emac_get_stat_by_name(emac, "FW_RTU_PKT_DROP") +
+				emac_get_stat_by_name(emac, "FW_TX_DROPPED_PACKET") +
+				emac_get_stat_by_name(emac, "FW_TX_TS_DROPPED_PACKET") +
+				emac_get_stat_by_name(emac, "FW_TX_JUMBO_FRM_CUTOFF");
+	}
 }
 EXPORT_SYMBOL_GPL(icssg_ndo_get_stats64);
 
-- 
2.54.0

base-commit: 8cd9520d35a6c38db6567e97dd93b1f11f185dc6
branch: fix-icssg_common-pa-stats-errors__master-7-1
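
The counter corruption described in the commit message follows from C's
usual arithmetic conversions, and can be reproduced in userspace. The
sketch below is an illustration only: sketch_lookup() is a hypothetical
stand-in for a stat lookup that returns -EINVAL when the PA stats block
is absent, and the value 7 is an arbitrary counter reading.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a lookup: a counter value, or -EINVAL if absent. */
int sketch_lookup(int pa_stats_present)
{
	return pa_stats_present ? 7 : -EINVAL;
}

/* Unguarded: the int result is converted to uint64_t before the addition,
 * so -EINVAL becomes 2^64 - EINVAL and lands in the counter. */
uint64_t sketch_accumulate_unguarded(uint64_t base, int pa_stats_present)
{
	return base + sketch_lookup(pa_stats_present);
}

/* Guarded: only add firmware counters when the stats block exists. */
uint64_t sketch_accumulate_guarded(uint64_t base, int pa_stats_present)
{
	if (pa_stats_present)
		base += (uint64_t)sketch_lookup(pa_stats_present);
	return base;
}
```

Starting from a counter of 0 with no PA stats, the unguarded version
yields UINT64_MAX - EINVAL + 1, while the guarded version leaves the
counter at 0. When the stats block is present both versions agree.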


* [PATCH v2 4/4] selftest: Add tests for useful handling of LSM denials on SCM_RIGHTS
From: Jori Koolstra @ 2026-06-16 14:30 UTC (permalink / raw)
  To: brauner, cyphar, Shuah Khan, Kuniyuki Iwashima, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, Jori Koolstra, open list,
	open list:KERNEL SELFTEST FRAMEWORK,
	open list:NETWORKING [GENERAL]
In-Reply-To: <20260616143020.3458085-1-jkoolstra@xs4all.nl>

Tests SCM_RIGHTS fd passing on a socket with the new socket option
SO_RIGHTS_NOTRUNC turned on.

The test uses the following Smack labels:

   "Sender"   - label for the sending process
   "Receiver" - label for the receiving process
   "SecretX"   - labels for the files being passed

Socket communication (Sender <-> Receiver) is always allowed.
The tests control whether Receiver can access "SecretX"-labeled fds.
When the LSM blocks an fd, we should see a sentinel that corresponds to
the error returned by the LSM, such as -EACCES.

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 tools/testing/selftests/Makefile              |   1 +
 .../net/af_unix/scm_rights_denial/.gitignore  |   2 +
 .../net/af_unix/scm_rights_denial/Makefile    |  13 ++
 .../net/af_unix/scm_rights_denial/helper.h    |  38 ++++
 .../net/af_unix/scm_rights_denial/receiver.c  | 195 ++++++++++++++++++
 .../scm_rights_denial/scm_rights_denial.sh    | 171 +++++++++++++++
 .../net/af_unix/scm_rights_denial/sender.c    | 126 +++++++++++
 7 files changed, 546 insertions(+)
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights_denial/.gitignore
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights_denial/Makefile
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights_denial/helper.h
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights_denial/receiver.c
 create mode 100755 tools/testing/selftests/net/af_unix/scm_rights_denial/scm_rights_denial.sh
 create mode 100644 tools/testing/selftests/net/af_unix/scm_rights_denial/sender.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 8d4db2241cc2..7ff876692267 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -74,6 +74,7 @@ TARGETS += mseal_system_mappings
 TARGETS += nci
 TARGETS += net
 TARGETS += net/af_unix
+TARGETS += net/af_unix/scm_rights_denial
 TARGETS += net/can
 TARGETS += net/forwarding
 TARGETS += net/hsr
diff --git a/tools/testing/selftests/net/af_unix/scm_rights_denial/.gitignore b/tools/testing/selftests/net/af_unix/scm_rights_denial/.gitignore
new file mode 100644
index 000000000000..5a1c58ff005f
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights_denial/.gitignore
@@ -0,0 +1,2 @@
+sender
+receiver
diff --git a/tools/testing/selftests/net/af_unix/scm_rights_denial/Makefile b/tools/testing/selftests/net/af_unix/scm_rights_denial/Makefile
new file mode 100644
index 000000000000..03eb6d1427d7
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights_denial/Makefile
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0
+top_srcdir := ../../../../../..
+include $(top_srcdir)/scripts/Makefile.compiler
+
+cc-option = $(call __cc-option, $(CC),,$(1),$(2))
+
+CFLAGS += $(KHDR_INCLUDES) -Wall $(call cc-option,-Wflex-array-member-not-at-end)
+
+TEST_PROGS := scm_rights_denial.sh
+
+TEST_GEN_FILES := sender receiver
+
+include ../../../lib.mk
diff --git a/tools/testing/selftests/net/af_unix/scm_rights_denial/helper.h b/tools/testing/selftests/net/af_unix/scm_rights_denial/helper.h
new file mode 100644
index 000000000000..2ecdf2b8b973
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights_denial/helper.h
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <unistd.h>
+#include <fcntl.h>
+
+#ifndef SO_RIGHTS_NOTRUNC
+#define SO_RIGHTS_NOTRUNC 85
+#endif
+
+#define CMSG_IS_SCM_RIGHTS(cmsg) ({		\
+	typeof(cmsg) _cmsg = (cmsg);		\
+	_cmsg &&				\
+	_cmsg->cmsg_level == SOL_SOCKET &&	\
+	_cmsg->cmsg_type == SCM_RIGHTS;		\
+})
+
+#define MIN(a, b) ({ \
+	typeof(a) _a = (a); \
+	typeof(b) _b = (b); \
+	_a < _b ? _a : _b; \
+})
+
+#define MAX_FDS 10
+
+static inline int read_current_label(char *label, size_t size)
+{
+	int fd = open("/proc/self/attr/current", O_RDONLY);
+	if (fd < 0)
+		return fd;
+
+	ssize_t r = read(fd, label, size - 1);
+	close(fd);
+	if (r < 0)
+		return r;
+
+	label[r] = '\0';
+
+	return 0;
+}
diff --git a/tools/testing/selftests/net/af_unix/scm_rights_denial/receiver.c b/tools/testing/selftests/net/af_unix/scm_rights_denial/receiver.c
new file mode 100644
index 000000000000..a9bd49a6e214
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights_denial/receiver.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * receiver.c - Receive a file descriptor over a Unix domain socket via SCM_RIGHTS
+ *
+ * Usage: ./receiver <socket_path>
+ *
+ * Listens on the given Unix socket path, accepts a connection, and
+ * attempts to receive file descriptors via SCM_RIGHTS. Reports
+ * whether the fds were delivered or blocked.
+ *
+ * Used for testing LSM (Smack) blocking of fd passing.
+ */
+
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <sys/xattr.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <fcntl.h>
+
+#include "helper.h"
+
+#define RECV_LOG(fmt, ...) printf("receiver: " fmt, ##__VA_ARGS__)
+#define RECV_ERR(fmt, ...) fprintf(stderr, "receiver: " fmt, ##__VA_ARGS__)
+
+static int recv_fds(int sock, int *fds)
+{
+	char buf[1];
+	char ctrl[CMSG_SPACE(MAX_FDS * sizeof(int))];
+
+	struct iovec iov = {
+		.iov_base = buf,
+		.iov_len  = sizeof(buf),
+	};
+	struct msghdr msg = {
+		.msg_iov        = &iov,
+		.msg_iovlen     = 1,
+		.msg_control    = ctrl,
+		.msg_controllen = sizeof(ctrl),
+	};
+
+	ssize_t bytes_read = recvmsg(sock, &msg, 0);
+	if (bytes_read < 0) {
+		perror("receiver: recvmsg");
+		return -1;
+	}
+	if (bytes_read == 0) {
+		RECV_ERR("connection closed, no data received\n");
+		return -1;
+	}
+
+	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+	if (!CMSG_IS_SCM_RIGHTS(cmsg)) {
+		RECV_ERR("no SCM_RIGHTS in control message\n");
+		return -1;
+	}
+
+	int num_fd_slots = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
+	memcpy(fds, CMSG_DATA(cmsg), num_fd_slots  * sizeof(int));
+
+	RECV_LOG("got %d fd slot(s):", num_fd_slots);
+	for (int i = 0; i < num_fd_slots ; i++) {
+		if (fds[i] < 0)
+			printf(" %s", strerrorname_np(-fds[i]));
+		else
+			printf(" %d", fds[i]);
+	}
+	putchar('\n');
+
+	return num_fd_slots;
+}
+
+static inline int print_current_label(void)
+{
+	char label[256];
+	if (!read_current_label(label, sizeof(label))) {
+		RECV_LOG("running with Smack label '%s'\n", label);
+		return 0;
+	}
+	return -1;
+}
+
+int main(int argc, char *argv[])
+{
+	if (argc != 2) {
+		fprintf(stderr, "Usage: %s <socket_path>\n", argv[0]);
+		return -1;
+	}
+
+	if (print_current_label()) {
+		RECV_ERR("cannot read process Smack label\n");
+		return -1;
+	}
+
+	int listen_sock = socket(AF_UNIX, SOCK_STREAM, 0);
+	if (listen_sock < 0) {
+		perror("receiver: socket");
+		return -1;
+	}
+
+	struct sockaddr_un addr = {};
+	addr.sun_family = AF_UNIX;
+	strncpy(addr.sun_path, argv[1], sizeof(addr.sun_path) - 1);
+
+	/* Remove any stale socket file */
+	unlink(argv[1]);
+
+	if (bind(listen_sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		perror("receiver: bind");
+		return -1;
+	}
+
+	if (listen(listen_sock, 1) < 0) {
+		perror("receiver: listen");
+		return -1;
+	}
+
+	RECV_LOG("listening on '%s'\n", argv[1]);
+
+	int conn_sock = accept(listen_sock, NULL, NULL);
+	if (conn_sock < 0) {
+		perror("receiver: accept");
+		return -1;
+	}
+
+	RECV_LOG("connection accepted\n");
+
+	int one = 1;
+	if (setsockopt(conn_sock, SOL_SOCKET, SO_RIGHTS_NOTRUNC,
+		       &one, sizeof(one)) < 0) {
+		perror("receiver: setsockopt(SO_RIGHTS_NOTRUNC)");
+		goto out_sock;
+	}
+
+	/* Try to receive the fds */
+	int fds[MAX_FDS];
+	int num_fds = recv_fds(conn_sock, fds);
+	if (num_fds < 0)
+		goto out_sock;
+
+	/* Try to use the received fds -- read and print their contents */
+	RECV_LOG("attempting to read from received fds...\n");
+	int i;
+	for (i = 0; i < num_fds; ++i) {
+		char readbuf[256];
+
+		if (fds[i] < 0) {
+			RECV_LOG("fd in position %i blocked\n", i);
+			continue;
+		} else if (fds[i] == 0) {
+			RECV_LOG("bad fd in position %i\n", i);
+			goto out_recv;
+		}
+
+		ssize_t n = read(fds[i], readbuf, sizeof(readbuf) - 1);
+		if (n < 0) {
+			perror("receiver: read from received fd");
+			goto out_recv;
+		}
+
+		readbuf[n] = '\0';
+		RECV_LOG("read %zd bytes from fd at position %i: '%s'\n", n, i, readbuf);
+	}
+
+	RECV_LOG("final result:\n");
+	for (int j = 0; j < num_fds; ++j) {
+		if (fds[j] < 0) {
+			printf("BLOCKED");
+		} else {
+			printf("PASSED");
+			close(fds[j]);
+		}
+		putchar(' ');
+	}
+
+	close(conn_sock);
+	close(listen_sock);
+	unlink(argv[1]);
+	return 0;
+
+out_recv:
+	for (int j = 0; j < num_fds; ++j) {
+		if (fds[j] > 0)
+			close(fds[j]);
+	}
+
+out_sock:
+	close(conn_sock);
+	close(listen_sock);
+	unlink(argv[1]);
+	return -1;
+}
diff --git a/tools/testing/selftests/net/af_unix/scm_rights_denial/scm_rights_denial.sh b/tools/testing/selftests/net/af_unix/scm_rights_denial/scm_rights_denial.sh
new file mode 100755
index 000000000000..9d7d4530cadd
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights_denial/scm_rights_denial.sh
@@ -0,0 +1,171 @@
+# SPDX-License-Identifier: GPL-2.0
+
+#
+# scm_rights_denial.sh - Test SCM_RIGHTS fd passing using Smack LSM blocking
+#
+# Must be run as root on a kernel with Smack enabled (security=smack).
+# Requires: capsh (libcap), setfattr/getfattr (attr)
+#
+# We use the following Smack labels:
+#   "Sender"   - label for the sending process
+#   "Receiver" - label for the receiving process
+#   "SecretX"   - labels for the files being passed
+#
+# Socket communication (Sender <-> Receiver) is always allowed.
+# The tests control whether Receiver can access "SecretX"-labeled fds.
+#
+
+set -e
+
+readonly SENDER="./sender"
+readonly RECEIVER="./receiver"
+
+readonly TESTDIR="$(mktemp -d)"
+readonly SOCK="$TESTDIR/scm_test.sock"
+readonly TESTFILE1="$TESTDIR/secret_1"
+readonly TESTFILE2="$TESTDIR/secret_2"
+
+trap 'rm -rf "$TESTDIR"' EXIT
+
+run_tests() {
+
+	preflight
+	setup
+
+	run_test "TEST 1" \
+		"Receiver should NOT have access to Secret1." \
+		"Receiver Secret1 ---
+Receiver Secret2 ---" \
+		"$TESTFILE1" \
+		"BLOCKED"
+
+	run_test "TEST 2" \
+		"Receiver should have access to Secret1." \
+		"Receiver Secret1 r--
+Receiver Secret2 ---" \
+		"$TESTFILE1" \
+		"PASSED"
+
+	run_test "TEST 3" \
+		"Receiver should have access to Secret2, but NOT Secret1." \
+		"Receiver Secret1 ---
+Receiver Secret2 r--" \
+		"$TESTFILE1 $TESTFILE2" \
+		"BLOCKED PASSED"
+}
+
+run_test() {
+	local name="$1"
+	local description="$2"
+	local rules="$3"
+	local files="$4"
+	local expected="$5"
+
+	echo ""
+	echo "$name: $description"
+	echo "Rules:"
+	echo "$rules"
+	echo "Expected: $expected"
+	echo ""
+
+	while IFS= read -r rule; do
+		[ -n "$rule" ] && echo "$rule" > /sys/fs/smackfs/load2
+	done <<< "$rules"
+
+	local output status last_line
+	output=$(send_fds "$SOCK" $files)
+	status=$?
+	echo "$output"
+	last_line=$(echo "$output" | tail -n 1 | xargs)
+
+	if [ "$status" -ne 0 ]; then
+		echo "TEST FAILED: receiver returned $status"
+		return 1
+	fi
+
+	if [[ "$last_line" == "$expected" ]]; then
+		echo "TEST PASSED: outcome was $expected as expected"
+		return 0
+	else
+		echo "TEST FAILED: expected $expected, got '$last_line'"
+		return 1
+	fi
+}
+
+setup() {
+
+	printf "Secret 1" > "$TESTFILE1"
+	printf "Secret 2" > "$TESTFILE2"
+
+	setfattr -n security.SMACK64 -v "Secret1" "$TESTFILE1"
+	setfattr -n security.SMACK64 -v "Secret2" "$TESTFILE2"
+	setfattr -n security.SMACK64 -v "Tmp" /tmp
+	setfattr -n security.SMACK64 -v "Tmp" "$TESTDIR"
+
+	echo "Sender	Receiver	-w-" > /sys/fs/smackfs/load2
+	echo "Receiver	Sender		-w-" > /sys/fs/smackfs/load2
+	echo "Sender	Tmp 		rwx" > /sys/fs/smackfs/load2
+	echo "Receiver	Tmp		rwx" > /sys/fs/smackfs/load2
+	echo "Sender	Secret1		r--" > /sys/fs/smackfs/load2
+	echo "Sender	Secret2		r--" > /sys/fs/smackfs/load2
+}
+
+send_fds() {
+
+	local sk="$1"
+	shift
+	local files="$*"
+
+	(
+	    echo "Receiver" > /proc/self/attr/current
+	    exec capsh --drop=cap_mac_override,cap_mac_admin -- -c "$RECEIVER $sk"
+	) &
+	local recv_pid=$!
+	sleep 1
+
+	(
+	    echo "Sender" > /proc/self/attr/current
+	    exec capsh --drop=cap_mac_override,cap_mac_admin -- -c "$SENDER $sk $files"
+	) || true
+
+	local recv_status=0
+	wait "$recv_pid" || recv_status=$?
+
+	if [ "$recv_status" -ne 0 ]; then
+	    echo "receiver exited with $recv_status"
+	fi
+	return "$recv_status"
+}
+
+preflight() {
+
+	if [ "$(id -u)" -ne 0 ]; then
+	    echo "ERROR: must be run as root"
+	    exit 1
+	fi
+
+	if ! grep -q smack /sys/kernel/security/lsm 2>/dev/null; then
+	    echo "ERROR: Smack is not active"
+	    echo "  Check: cat /sys/kernel/security/lsm"
+	    echo "  Boot with: security=smack"
+	    exit 1
+	fi
+
+	if ! mountpoint -q /sys/fs/smackfs 2>/dev/null; then
+	    echo "Mounting smackfs..."
+	    mount -t smackfs smackfs /sys/fs/smackfs
+	fi
+
+	if ! command -v capsh &>/dev/null; then
+	    echo "ERROR: capsh not found (install libcap)"
+	    exit 1
+	fi
+
+	if [ ! -x "$SENDER" ] || [ ! -x "$RECEIVER" ]; then
+	    echo "ERROR: $SENDER / $RECEIVER not built (run 'make' first)"
+	    exit 1
+	fi
+
+}
+
+run_tests
diff --git a/tools/testing/selftests/net/af_unix/scm_rights_denial/sender.c b/tools/testing/selftests/net/af_unix/scm_rights_denial/sender.c
new file mode 100644
index 000000000000..b1c76d23b8bd
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/scm_rights_denial/sender.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * sender.c - Send file descriptors over a Unix domain socket via SCM_RIGHTS
+ *
+ * Usage: ./sender <socket_path> <file_to_send> [<file_to_send>...]
+ *
+ * Opens the specified files and sends their fds to a receiver connected
+ * on the given Unix socket path. Used for testing LSM blocking of fd
+ * passing.
+ */
+
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdio.h>
+#include <fcntl.h>
+
+#include "helper.h"
+
+#define SEND_LOG(fmt, ...) fprintf(stdout, "sender: " fmt, ##__VA_ARGS__)
+#define SEND_ERR(fmt, ...) fprintf(stderr, "sender: " fmt, ##__VA_ARGS__)
+
+static int send_fds(int sock, int *fds, int num_fds)
+{
+	if (num_fds > MAX_FDS)
+		return -1;
+
+	char buf[1] = { 'X' };
+	char ctrl[CMSG_SPACE(MAX_FDS * sizeof(int))] = { 0 };
+
+	struct iovec iov = {
+		.iov_base = buf,
+		.iov_len  = sizeof(buf),
+	};
+	struct msghdr msg = {
+		.msg_iov        = &iov,
+		.msg_iovlen     = 1,
+		.msg_control    = ctrl,
+		.msg_controllen = CMSG_SPACE(num_fds * sizeof(int)),
+	};
+
+	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type  = SCM_RIGHTS;
+	cmsg->cmsg_len   = CMSG_LEN(num_fds * sizeof(int));
+	memcpy(CMSG_DATA(cmsg), fds, num_fds * sizeof(int));
+
+	ssize_t bytes_send = sendmsg(sock, &msg, 0);
+	if (bytes_send < 0) {
+		perror("sender: sendmsg");
+		return -1;
+	}
+
+	return 0;
+}
+
+static inline int print_current_label(void)
+{
+	char label[256];
+	if (!read_current_label(label, sizeof(label))) {
+		SEND_LOG("running with Smack label '%s'\n", label);
+		return 0;
+	}
+	return -1;
+}
+
+int main(int argc, char *argv[])
+{
+	if (argc < 3 || argc > 2 + MAX_FDS) {
+		fprintf(stderr, "Usage: %s <socket_path> <file_to_send> [<file_to_send>...]\n",
+			argv[0]);
+		fprintf(stderr, "Up to a maximum of %d files\n", MAX_FDS);
+		return -1;
+	}
+
+	if (print_current_label()) {
+		SEND_ERR("cannot read process Smack label\n");
+		return -1;
+	}
+
+	int sock = socket(AF_UNIX, SOCK_STREAM, 0);
+	if (sock < 0) {
+		perror("sender: socket");
+		return -1;
+	}
+
+	struct sockaddr_un addr = {};
+	addr.sun_family = AF_UNIX;
+	strncpy(addr.sun_path, argv[1], sizeof(addr.sun_path) - 1);
+
+	if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
+		perror("sender: connect");
+		goto out_sock;
+	}
+
+	SEND_LOG("connected to '%s'\n", argv[1]);
+
+	int num_files = argc - 2;
+	int fds[MAX_FDS];
+	int i;
+	for (i = 0; i < num_files; i++) {
+		fds[i] = open(argv[2 + i], O_RDONLY);
+		if (fds[i] < 0) {
+			perror("sender: open file");
+			goto out_opened;
+		}
+		SEND_LOG("opened '%s' as fd %d\n", argv[2 + i], fds[i]);
+	}
+
+	if (send_fds(sock, fds, num_files) < 0)
+		goto out_opened;
+
+	SEND_LOG("fds successfully sent:");
+	for (int j = 0; j < num_files; j++)
+		printf(" %d", fds[j]);
+	putchar('\n');
+
+out_opened:
+	for (int j = 0; j < i; j++)
+		close(fds[j]);
+out_sock:
+	close(sock);
+	return -1;
+}
-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 3/4] net: af_unix: replace copy_from_sockptr() with copy_safe_from_sockptr()
From: Jori Koolstra @ 2026-06-16 14:30 UTC (permalink / raw)
  To: brauner, cyphar, Kuniyuki Iwashima, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: linux-fsdevel, Jori Koolstra, open list:NETWORKING [GENERAL],
	open list
In-Reply-To: <20260616143020.3458085-1-jkoolstra@xs4all.nl>

Replace the deprecated call to copy_from_sockptr() with
copy_safe_from_sockptr().

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
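A note for reviewers, not part of the patch: as far as I can tell from
include/linux/sockptr.h, copy_safe_from_sockptr() rejects only an optlen
that is smaller than the kernel-side size, whereas the check removed here
rejected any optlen other than sizeof(int). The userspace model below
mimics the two checks so the difference is easy to see. old_check() and
new_check() are hypothetical names, and memcpy() stands in for the user
copy that can fail with -EFAULT in the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Model of the check removed by this patch: an exact size is required. */
static int old_check(int *val, const void *optval, size_t optlen)
{
	if (optlen != sizeof(*val))
		return -EINVAL;
	memcpy(val, optval, sizeof(*val));
	return 0;
}

/*
 * Model of copy_safe_from_sockptr() as understood here (an assumption):
 * only a short buffer is rejected, and exactly sizeof(*val) bytes are
 * copied even when the caller passes a larger optlen.
 */
static int new_check(int *val, const void *optval, size_t optlen)
{
	if (optlen < sizeof(*val))
		return -EINVAL;
	memcpy(val, optval, sizeof(*val));
	return 0;
}
```

Whether accepting a larger optlen is fine for these options is a question
for the maintainers; the sketch only makes the difference visible.
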
 net/unix/af_unix.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4e1463ee2815..eb4051f3aae7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -933,6 +933,7 @@ static int unix_setsockopt(struct socket *sock, int level, int optname,
 {
 	struct unix_sock *u = unix_sk(sock->sk);
 	struct sock *sk = sock->sk;
+	int error;
 	int val;
 
 	if (level != SOL_SOCKET)
@@ -941,11 +942,9 @@ static int unix_setsockopt(struct socket *sock, int level, int optname,
 	if (!unix_custom_sockopt(optname))
 		return sock_setsockopt(sock, level, optname, optval, optlen);
 
-	if (optlen != sizeof(int))
-		return -EINVAL;
-
-	if (copy_from_sockptr(&val, optval, sizeof(val)))
-		return -EFAULT;
+	error = copy_safe_from_sockptr(&val, sizeof(val), optval, optlen);
+	if (error)
+		return error;
 
 	switch (optname) {
 	case SO_INQ:
-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 2/4] net: af_unix: Useful handling of LSM denials on SCM_RIGHTS
From: Jori Koolstra @ 2026-06-16 14:30 UTC (permalink / raw)
  To: brauner, cyphar, Alexander Viro, Jan Kara, Kuniyuki Iwashima,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Arnd Bergmann, Willem de Bruijn
  Cc: linux-fsdevel, Jori Koolstra, Jeff Layton, open list,
	open list:NETWORKING [GENERAL],
	open list:GENERIC INCLUDE/ASM HEADER FILES
In-Reply-To: <20260616143020.3458085-1-jkoolstra@xs4all.nl>

Right now, if an LSM such as Smack prevents an AF_UNIX socket peer from
receiving an SCM_RIGHTS fd, the SCM_RIGHTS fd array is cut short at
that point and MSG_CTRUNC is set on return from recvmsg(). This is
highly problematic behaviour because it leaves the receiver
wondering what happened. According to the man page, MSG_CTRUNC is
supposed to indicate that the control buffer was too small, but a
permission error results in the exact same flag being set.
Moreover, the receiver has no way to determine how many fds were
originally sent and how many were suppressed.[1]

Add a SO_RIGHTS_NOTRUNC option to UNIX sockets to enable more useful
handling of LSM denials when receiving SCM_RIGHTS messages: instead of
truncating the message at the first blocked fd, keep every fd slot
and store the LSM errno in the blocked slot.

[1]: https://github.com/uapi-group/kernel-features#useful-handling-of-lsm-denials-on-scm_rights

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
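For reviewers who want to see the receive side, here is a minimal
userspace sketch; it is not part of the patch. It assumes the semantics
described in the commit message: with SO_RIGHTS_NOTRUNC enabled, every
slot of the SCM_RIGHTS array is filled, and a blocked slot holds the
negative errno returned by the LSM instead of an fd. The value 85 is the
one proposed in the uapi hunk of this patch and may change before merge.
struct scm_rights_summary, scm_rights_notrunc_enable() and
scm_rights_classify() are names made up for this example.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Value proposed by this series; an assumption until it lands in uapi. */
#ifndef SO_RIGHTS_NOTRUNC
#define SO_RIGHTS_NOTRUNC 85
#endif

/* Hypothetical summary type, for illustration only. */
struct scm_rights_summary {
	size_t received;	/* slots holding a usable fd (>= 0) */
	size_t denied;		/* slots holding a negative errno */
	int first_error;	/* first negative value seen, or 0 */
};

/*
 * Opt in on an AF_UNIX socket. Returns -1 with errno set on failure;
 * kernels without the option are expected to report ENOPROTOOPT.
 */
static int scm_rights_notrunc_enable(int sock)
{
	int one = 1;

	return setsockopt(sock, SOL_SOCKET, SO_RIGHTS_NOTRUNC,
			  &one, sizeof(one));
}

/* Walk the int array taken from CMSG_DATA() of an SCM_RIGHTS cmsg. */
static struct scm_rights_summary
scm_rights_classify(const int *slots, size_t n)
{
	struct scm_rights_summary s = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (slots[i] >= 0) {
			s.received++;
			continue;
		}
		s.denied++;
		if (!s.first_error)
			s.first_error = slots[i];
	}
	return s;
}
```

A receiver would call scm_rights_notrunc_enable() once after connecting
and run scm_rights_classify() over each SCM_RIGHTS payload, closing only
the slots that are >= 0.
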
 fs/file.c                         | 48 ++++++++++++++++++++-----------
 include/linux/file.h              |  2 ++
 include/net/af_unix.h             |  1 +
 include/net/scm.h                 | 15 +++++++---
 include/uapi/asm-generic/socket.h |  3 ++
 net/compat.c                      |  4 +--
 net/core/scm.c                    | 13 +++++----
 net/unix/af_unix.c                |  9 ++++++
 8 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 628ca07dc4b1..2bc22cc69e84 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1367,6 +1367,25 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
 	return err;
 }
 
+static int __receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
+{
+	int error;
+
+	FD_PREPARE(fdf, o_flags, file);
+	if (fdf.err)
+		return fdf.err;
+	get_file(file);
+
+	if (ufd) {
+		error = put_user(fd_prepare_fd(fdf), ufd);
+		if (error)
+			return error;
+	}
+
+	__receive_sock(fd_prepare_file(fdf));
+	return fd_publish(fdf);
+}
+
 /**
  * receive_fd() - Install received file into file descriptor table
  * @file: struct file that was received from another process
@@ -1384,27 +1403,24 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
  */
 int receive_fd(struct file *file, int __user *ufd, unsigned int o_flags)
 {
-	int error;
-
-	error = security_file_receive(file);
+	int error = security_file_receive(file);
 	if (error)
 		return error;
+	return __receive_fd(file, ufd, o_flags);
+}
+EXPORT_SYMBOL_GPL(receive_fd);
 
-	FD_PREPARE(fdf, o_flags, file);
-	if (fdf.err)
-		return fdf.err;
-	get_file(file);
-
-	if (ufd) {
-		error = put_user(fd_prepare_fd(fdf), ufd);
-		if (error)
-			return error;
+int receive_fd_filtered(struct file *file, int __user *ufd, unsigned int o_flags,
+		bool *filtered)
+{
+	int error = security_file_receive(file);
+	if (error) {
+		*filtered = true;
+		return error;
 	}
-
-	__receive_sock(fd_prepare_file(fdf));
-	return fd_publish(fdf);
+	*filtered = false;
+	return __receive_fd(file, ufd, o_flags);
 }
-EXPORT_SYMBOL_GPL(receive_fd);
 
 int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
 {
diff --git a/include/linux/file.h b/include/linux/file.h
index 27484b444d31..748f08470bb4 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -119,6 +119,8 @@ DEFINE_FREE(fput, struct file *, if (!IS_ERR_OR_NULL(_T)) fput(_T))
 extern void fd_install(unsigned int fd, struct file *file);
 
 int receive_fd(struct file *file, int __user *ufd, unsigned int o_flags);
+int receive_fd_filtered(struct file *file, int __user *ufd, unsigned int o_flags,
+		bool *filtered);
 
 int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
 
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 34f53dde65ce..bb1b3dee02e8 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -49,6 +49,7 @@ struct unix_sock {
 	struct scm_stat		scm_stat;
 	int			inq_len;
 	bool			recvmsg_inq;
+	bool			scm_rights_notrunc;
 #if IS_ENABLED(CONFIG_AF_UNIX_OOB)
 	struct sk_buff		*oob_skb;
 #endif
diff --git a/include/net/scm.h b/include/net/scm.h
index c52519669349..761cda0803fb 100644
--- a/include/net/scm.h
+++ b/include/net/scm.h
@@ -50,8 +50,8 @@ struct scm_cookie {
 #endif
 };
 
-void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm);
-void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm);
+void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm, bool notrunc);
+void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm, bool notrunc);
 int __scm_send(struct socket *sock, struct msghdr *msg, struct scm_cookie *scm);
 void __scm_destroy(struct scm_cookie *scm);
 struct scm_fp_list *scm_fp_dup(struct scm_fp_list *fpl);
@@ -108,11 +108,18 @@ void scm_recv_unix(struct socket *sock, struct msghdr *msg,
 		   struct scm_cookie *scm, int flags);
 
 static inline int scm_recv_one_fd(struct file *f, int __user *ufd,
-				  unsigned int flags)
+				  unsigned int flags, bool notrunc)
 {
+	bool filtered;
+	int error;
+
 	if (!ufd)
 		return -EFAULT;
-	return receive_fd(f, ufd, flags);
+
+	error = receive_fd_filtered(f, ufd, flags, &filtered);
+	if (filtered && notrunc)
+		return put_user(error, ufd);
+	return error;
 }
 
 #endif /* __LINUX_NET_SCM_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 53b5a8c002b1..c5fb2ee96830 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -150,6 +150,9 @@
 #define SO_INQ			84
 #define SCM_INQ			SO_INQ
 
+#define SO_RIGHTS_NOTRUNC	85
+#define SCM_RIGHTS_NOTRUNC	SO_RIGHTS_NOTRUNC
+
 #if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/net/compat.c b/net/compat.c
index d68cf9c3aad5..6bdf4a2c9077 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -286,7 +286,7 @@ static int scm_max_fds_compat(struct msghdr *msg)
 	return (msg->msg_controllen - sizeof(struct compat_cmsghdr)) / sizeof(int);
 }
 
-void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm)
+void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm, bool notrunc)
 {
 	struct compat_cmsghdr __user *cm =
 		(struct compat_cmsghdr __user *)msg->msg_control_user;
@@ -296,7 +296,7 @@ void scm_detach_fds_compat(struct msghdr *msg, struct scm_cookie *scm)
 	int err = 0, i;
 
 	for (i = 0; i < fdmax; i++) {
-		err = scm_recv_one_fd(scm->fp->fp[i], cmsg_data + i, o_flags);
+		err = scm_recv_one_fd(scm->fp->fp[i], cmsg_data + i, o_flags, notrunc);
 		if (err < 0)
 			break;
 	}
diff --git a/net/core/scm.c b/net/core/scm.c
index a73b1eb30fd2..1ef4e9431661 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -351,7 +351,7 @@ static int scm_max_fds(struct msghdr *msg)
 	return (msg->msg_controllen - sizeof(struct cmsghdr)) / sizeof(int);
 }
 
-void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
+void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm, bool notrunc)
 {
 	struct cmsghdr __user *cm =
 		(__force struct cmsghdr __user *)msg->msg_control_user;
@@ -365,12 +365,12 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		return;
 
 	if (msg->msg_flags & MSG_CMSG_COMPAT) {
-		scm_detach_fds_compat(msg, scm);
+		scm_detach_fds_compat(msg, scm, notrunc);
 		return;
 	}
 
 	for (i = 0; i < fdmax; i++) {
-		err = scm_recv_one_fd(scm->fp->fp[i], cmsg_data + i, o_flags);
+		err = scm_recv_one_fd(scm->fp->fp[i], cmsg_data + i, o_flags, notrunc);
 		if (err < 0)
 			break;
 	}
@@ -542,8 +542,11 @@ void scm_recv_unix(struct socket *sock, struct msghdr *msg,
 	if (!__scm_recv_common(sock->sk, msg, scm, flags))
 		return;
 
-	if (scm->fp)
-		scm_detach_fds(msg, scm);
+	if (scm->fp) {
+		struct unix_sock *u = unix_sk(sock->sk);
+		bool notrunc = READ_ONCE(u->scm_rights_notrunc);
+		scm_detach_fds(msg, scm, notrunc);
+	}
 
 	if (sock->sk->sk_scm_pidfd)
 		scm_pidfd_recv(msg, scm);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0d9cd977c7b7..4e1463ee2815 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -921,6 +921,7 @@ static bool unix_custom_sockopt(int optname)
 {
 	switch (optname) {
 	case SO_INQ:
+	case SO_RIGHTS_NOTRUNC:
 		return true;
 	default:
 		return false;
@@ -956,6 +957,14 @@ static int unix_setsockopt(struct socket *sock, int level, int optname,
 
 		WRITE_ONCE(u->recvmsg_inq, val);
 		break;
+
+	case SO_RIGHTS_NOTRUNC:
+		if (val > 1 || val < 0)
+			return -EINVAL;
+
+		WRITE_ONCE(u->scm_rights_notrunc, val);
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 1/4] net: scm: move scm_detach_fds() from common path to scm_recv_unix()
From: Jori Koolstra @ 2026-06-16 14:30 UTC (permalink / raw)
  To: brauner, cyphar, Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni,
	Willem de Bruijn, David S. Miller, Jakub Kicinski, Simon Horman
  Cc: linux-fsdevel, Jori Koolstra, open list:NETWORKING [GENERAL],
	open list

scm->fp can only be set when using UNIX sockets, so move the
scm_detach_fds() call out of the common path __scm_recv_common() and
into scm_recv_unix().

Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
---
 net/core/scm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/scm.c b/net/core/scm.c
index eec13f50ecaf..a73b1eb30fd2 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -523,9 +523,6 @@ static bool __scm_recv_common(struct sock *sk, struct msghdr *msg,
 
 	scm_passec(sk, msg, scm);
 
-	if (scm->fp)
-		scm_detach_fds(msg, scm);
-
 	return true;
 }
 
@@ -545,6 +542,9 @@ void scm_recv_unix(struct socket *sock, struct msghdr *msg,
 	if (!__scm_recv_common(sock->sk, msg, scm, flags))
 		return;
 
+	if (scm->fp)
+		scm_detach_fds(msg, scm);
+
 	if (sock->sk->sk_scm_pidfd)
 		scm_pidfd_recv(msg, scm);
 

base-commit: 6b5a2b7d9bc156e505f09e698d85d6a1547c1206
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH 0/4] vhost/vsock: add support for VHOST_RESET_OWNER and CPR migration
From: Stefano Garzarella @ 2026-06-16 14:28 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <4fa88fa6-a188-4c63-876c-ed748809bf0b@virtuozzo.com>

On Tue, Jun 16, 2026 at 05:01:34PM +0300, Andrey Drobyshev wrote:
>Hello Stefano,
>
>On 6/16/26 4:35 PM, Stefano Garzarella wrote:
>> Hi Andrey,
>> thanks for the series!
>>
>> On Fri, Jun 12, 2026 at 07:57:14PM +0300, Andrey Drobyshev wrote:
>>> Host<-->guest connections via AF_VSOCK sockets aren't supposed to
>>> outlive VM migration, since VM is moving to another host.  However
>>> there's a special case, which is QEMU live-update, or CPR
>>> (checkpoint-restore) migration.  In this case, VM remains on the same
>>> host, and we'd like such connections to persist.
>>
>> In the spec we have VIRTIO_VSOCK_EVENT_TRANSPORT_RESET which is usually
>> sent by the device after a migration.
>>
>> IIUC the specs don't say this has to be done all the time, so we don't
>> need to change anything in the specs, right?
>>
>> We just need to avoid sending it (which I think is what we're doing
>> here... I still need to look at the patches).
>>
>
>Sending this exact ioctl is guarded by one of my patches in the QEMU
>counterpart series:
>
>https://lore.kernel.org/qemu-devel/20260612165110.431376-6-andrey.drobyshev@virtuozzo.com/
>
>So we indeed avoid sending it on migration target in case of CPR migration.

Great, so we are aligned :-)

>
>>>
>>> For this to work, we need to be able to transfer device ownership from
>>> source QEMU to dest QEMU.  Namely, source needs to reset ownership by
>>> issuing VHOST_RESET_OWNER ioctl, and then target has to claim it by
>>> calling VHOST_SET_OWNER.
>>>
>>> Since VHOST_RESET_OWNER isn't yet implemented for vhost-vsock, let's add
>>> such implementation (patches 1-2).  Also fix regression introduced by
>>> the earlier commit [1] (patch 3), and fix the deadlock bug (commit 4).
>>
>> If it's a regression, should we fix it separately?
>>
>> Or is it related to this series?
>>
>
>Probably my wording wasn't quite correct.  I posted this patch here
>because we found the problem during testing this particular
>functionality, i.e. vsock data transfer + CPR migration.  And the
>problem was introduced by a recent commit, which is fine on its own, but
>breaks the CPR case.

Yeah, I figured that out while reviewing the patch.
I'd avoid "regression" here and just use "issue", because in the end it
only affects this work, which is not yet merged, so it can't be a
regression.

Thanks,
Stefano


^ permalink raw reply

* Re: [PATCH net 1/1] net: smc: fix splice entry lifetime imbalance in smc_rx_splice
From: Sidraya Jayagond @ 2026-06-16 14:27 UTC (permalink / raw)
  To: Ren Wei, linux-rdma, linux-s390, netdev
  Cc: alibuda, dust.li, wenjia, mjambigi, tonylu, guwen, ubraun,
	stefan.raspl, davem, yuantan098, zcliangcn, bird, lx24,
	d4n.for.sec
In-Reply-To: <192d1b44ed358ca143f44ef167d14153bccc51e9.1781097957.git.d4n.for.sec@gmail.com>



On 10/06/26 11:24 pm, Ren Wei wrote:
> From: Daming Li <d4n.for.sec@gmail.com>
> 
> smc_rx_splice() hands candidate pages to splice_to_pipe() without taking
> references for the lifetime of each splice entry first. That breaks the
> splice ownership contract in the VM-backed RMB path.
> 
> splice_to_pipe() drops unqueued entries through spd_release(), while
> queued entries are later dropped through the pipe buffer release
> callback. The current code only tries to take page references after the
> splice succeeds, and it derives the number of queued VM pages from a
> mutated offset value. This can underflow page refcounts and trigger a
> use-after-free. It also leaves the socket lifetime imbalanced in the
> multi-page VM case, where one sock_hold() can be followed by multiple
> sock_put() calls.
> 
> Fix this by taking the page and socket references for every candidate
> splice entry before calling splice_to_pipe(), and by releasing the
> matching private state, page reference, and socket reference from
> smc_rx_spd_release() for entries that never get queued. This makes the
> SMC splice path follow the normal splice lifetime rules and removes the
> broken post-splice VM page counting entirely.
> 
> Fixes: 9014db202cb7 ("smc: add support for splice()")
> Cc: stable@vger.kernel.org
> Reported-by: Yuan Tan <yuantan098@gmail.com>
> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
> Reported-by: Xin Liu <bird@lzu.edu.cn>
> Assisted-by: Codex:GPT-5.4
> Co-developed-by: Liu Xiao <lx24@stu.ynu.edu.cn>
> Signed-off-by: Liu Xiao <lx24@stu.ynu.edu.cn>
> Signed-off-by: Daming Li <d4n.for.sec@gmail.com>
> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
> ---
>  net/smc/smc_rx.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
> index c1d9b923938d..88aee0d93597 100644
> --- a/net/smc/smc_rx.c
> +++ b/net/smc/smc_rx.c
> @@ -150,18 +150,23 @@ static const struct pipe_buf_operations smc_pipe_ops = {
>  static void smc_rx_spd_release(struct splice_pipe_desc *spd,
>  			       unsigned int i)
>  {
> +	struct smc_spd_priv *priv = (struct smc_spd_priv *)spd->partial[i].private;
> +	struct sock *sk = &priv->smc->sk;
> +
> +	kfree(priv);
>  	put_page(spd->pages[i]);
> +	sock_put(sk);
>  }
>  
>  static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
>  			 struct smc_sock *smc)
>  {
>  	struct smc_link_group *lgr = smc->conn.lgr;
> -	int offset = offset_in_page(src);
>  	struct partial_page *partial;
>  	struct splice_pipe_desc spd;
>  	struct smc_spd_priv **priv;
>  	struct page **pages;
> +	int offset = offset_in_page(src);
>  	int bytes, nr_pages;
>  	int i;
>  
> @@ -209,6 +214,10 @@ static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
>  			offset = 0;
>  		}
>  	}
> +	for (i = 0; i < nr_pages; i++) {
> +		get_page(pages[i]);
> +		sock_hold(&smc->sk);
> +	}
>  	spd.nr_pages_max = nr_pages;
>  	spd.nr_pages = nr_pages;
>  	spd.pages = pages;
> @@ -217,16 +226,8 @@ static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
>  	spd.spd_release = smc_rx_spd_release;
>  
>  	bytes = splice_to_pipe(pipe, &spd);
> -	if (bytes > 0) {
> -		sock_hold(&smc->sk);
> -		if (!lgr->is_smcd && smc->conn.rmb_desc->is_vm) {
> -			for (i = 0; i < PAGE_ALIGN(bytes + offset) / PAGE_SIZE; i++)
> -				get_page(pages[i]);
> -		} else {
> -			get_page(smc->conn.rmb_desc->pages);
> -		}
> +	if (bytes > 0)
>  		atomic_add(bytes, &smc->conn.splice_pending);
> -	}
>  	kfree(priv);
>  	kfree(partial);
>  	kfree(pages);
The code changes look good to me.
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
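
To illustrate the lifetime rule the commit message describes, here is a
toy userspace model; it is not kernel code. It assumes, as the
description states, that splice_to_pipe() calls spd_release() for entries
it does not queue and that queued entries are released later through the
pipe buffer release callback. All names (struct model, model_splice(),
model_pipe_drain()) are invented for this sketch.

```c
#include <stddef.h>

/* Toy counters standing in for page and socket reference counts. */
struct model {
	int page_refs;
	int sock_refs;
	size_t queued;	/* entries currently owned by the pipe */
};

/* Drop the references held by one splice entry. */
static void model_entry_release(struct model *m)
{
	m->page_refs--;
	m->sock_refs--;
}

/*
 * Take one page ref and one sock ref per candidate entry up front,
 * then let the "pipe" accept only the first 'accepted' entries. The
 * remaining entries are released at once, as spd_release() would do.
 */
static void model_splice(struct model *m, size_t candidates, size_t accepted)
{
	size_t i;

	if (accepted > candidates)
		accepted = candidates;
	for (i = 0; i < candidates; i++) {
		m->page_refs++;
		m->sock_refs++;
	}
	m->queued += accepted;
	for (i = accepted; i < candidates; i++)
		model_entry_release(m);
}

/* The reader consumes the pipe: each queued entry is released once. */
static void model_pipe_drain(struct model *m)
{
	while (m->queued) {
		m->queued--;
		model_entry_release(m);
	}
}
```

With references taken per entry before the handoff, both counters return
to their starting value no matter how many entries the pipe accepts.
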

^ permalink raw reply

* [syzbot] [net?] WARNING in tls_err_abort
From: syzbot @ 2026-06-16 14:27 UTC (permalink / raw)
  To: davem, edumazet, horms, john.fastabend, kuba, linux-kernel,
	netdev, pabeni, sd, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    f6033078a9e6 ip6_tunnel: annotate data-races around t->err..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=122a98ae580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8697a140486f5628
dashboard link: https://syzkaller.appspot.com/bug?extid=cca46a9d1276f38af2ae
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/7af9eb2b9b5a/disk-f6033078.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/4b7e03b76e68/vmlinux-f6033078.xz
kernel image: https://storage.googleapis.com/syzbot-assets/38042dd09caa/bzImage-f6033078.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+cca46a9d1276f38af2ae@syzkaller.appspotmail.com

------------[ cut here ]------------
err >= 0
WARNING: net/tls/tls_sw.c:73 at tls_err_abort+0x5d/0x80 net/tls/tls_sw.c:73, CPU#0: kworker/0:11/6099
Modules linked in:
CPU: 0 UID: 0 PID: 6099 Comm: kworker/0:11 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026
Workqueue: pencrypt_serial padata_serial_worker
RIP: 0010:tls_err_abort+0x5d/0x80 net/tls/tls_sw.c:73
Code: e8 03 48 b9 00 00 00 00 00 fc ff df 0f b6 04 08 84 c0 75 1b 89 ab 9c 01 00 00 48 89 df 5b 5d e9 c9 a2 32 ff e8 a4 60 8a f7 90 <0f> 0b 90 eb c3 89 f9 80 e1 07 80 c1 03 38 c1 7c d9 e8 1d 9f f5 f7
RSP: 0018:ffffc900069379e0 EFLAGS: 00010293
RAX: ffffffff8a3adf8c RBX: ffff88807d1e0d80 RCX: ffff888058bfdd00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
RBP: 0000000000000000 R08: ffffe8ffffc513e3 R09: 1ffffd1ffff8a27c
R10: dffffc0000000000 R11: ffffffff8a3c4d70 R12: ffff888028eaf400
R13: ffff88804441030c R14: dffffc0000000000 R15: ffff888028eaf460
FS:  0000000000000000(0000) GS:ffff8881252a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f521f503ff8 CR3: 0000000086fc2000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 tls_encrypt_done+0x223/0x480 net/tls/tls_sw.c:500
 padata_serial_worker+0x2b9/0x430 kernel/padata.c:343
 process_one_work kernel/workqueue.c:3314 [inline]
 process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397
 worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478
 kthread+0x389/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* Re: [PATCH 2/4] vhost/vsock: add VHOST_RESET_OWNER ioctl
From: Stefano Garzarella @ 2026-06-16 14:26 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <129f5833-3a7f-4b2d-a965-20903e4e2fb5@virtuozzo.com>

On Tue, Jun 16, 2026 at 05:10:38PM +0300, Andrey Drobyshev wrote:
>On 6/16/26 4:48 PM, Stefano Garzarella wrote:
>> On Fri, Jun 12, 2026 at 07:57:16PM +0300, Andrey Drobyshev wrote:
>>> From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>>
>>> This ioctl is needed for QEMU's CPR (checkpoint-restore) migration of
>>> the guest with vhost-vsock device.  For this to work, we need to reset
>>> the device ownership on the source side by calling RESET_OWNER, and then
>>> claim it on the dest side by calling SET_OWNER.  We expect not to lose any
>>> AF_VSOCK connection while this happens.
>>>
>>> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
>>> ---
>>> drivers/vhost/vsock.c | 28 ++++++++++++++++++++++++++++
>>> 1 file changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index b12221ce6faf..e629886e5cf8 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -894,6 +894,32 @@ static int vhost_vsock_set_features(struct vhost_vsock *vsock, u64 features)
>>> 	return -EFAULT;
>>> }
>>>
>>> +static int vhost_vsock_reset_owner(struct vhost_vsock *vsock)
>>> +{
>>> +	struct vhost_iotlb *umem;
>>> +	long err;
>>> +
>>> +	mutex_lock(&vsock->dev.mutex);
>>> +	err = vhost_dev_check_owner(&vsock->dev);
>>> +	if (err)
>>> +		goto done;
>>> +	umem = vhost_dev_reset_owner_prepare();
>>> +	if (!umem) {
>>> +		err = -ENOMEM;
>>> +		goto done;
>>> +	}
>>> +	/* Follows vhost_vsock_dev_release closely except for guest_cid drop */
>>> +	vsock_for_each_connected_socket(&vhost_transport.transport,
>>> +					vhost_vsock_reset_orphans);
>>
>> In vhost_vsock_reset_orphans() we have:
>>
>> 	rcu_read_lock();
>>
>> 	/* If the peer is still valid, no need to reset connection */
>> 	if (vhost_vsock_get(vsk->remote_addr.svm_cid, sock_net(sk))) {
>> 		rcu_read_unlock();
>> 		return;
>> 	}
>>
>> IIUC we are not removing the guest cid from the hash table, so this
>> check will be always true, and nothing is done.
>>
>> So, is this call really useful?
>>
>
>You're right, and it's most probably an artifact from mimicking the
>vhost_vsock_dev_release() implementation, as mentioned in the comment.
>In our case this whole iteration is a no-op, we better remove it.
>
>BTW earlier I received some feedback from Sashiko AI reviewer, which
>also spotted that same issue (and some more interesting races):
>
>https://sashiko.dev/#/patchset/20260612165718.433546-1-andrey.drobyshev@virtuozzo.com

Oh, they seem similar to the Claude comments I included in my comment on
patch 3.

Yeah, we should take a look; they seem to be real issues.

>
>Apparently it only CC's its reviews to kvm@vger.kernel.org so you can't
>see them right away.  Just wanted to let you know to save your time
>here.  I'll send a v2 with respect to Sashiko remarks.  But of course
>would be great if you spot some more issues here.
>

Thanks for pointing that out, but in general I try to do my reviews
before looking at AI reviews (both Sashiko and Claude locally) to avoid
being too biased.

Thanks,
Stefano


^ permalink raw reply

* [PATCH net] net: ena: clean up XDP TX queues when regular TX setup fails
From: Dawei Feng @ 2026-06-16 14:24 UTC (permalink / raw)
  To: akiyano
  Cc: darinzon, andrew+netdev, davem, edumazet, kuba, pabeni, ast,
	daniel, hawk, john.fastabend, sdf, sameehj, netdev, linux-kernel,
	bpf, jianhao.xu, Dawei Feng, stable

create_queues_with_size_backoff() creates XDP TX queues before setting
up the regular TX path. If the subsequent allocation or creation of
regular TX queues fails, the error handling paths omit the teardown of the
XDP TX queues, leading to a resource leak.

Fix this by explicitly destroying the XDP TX queue subset at the two
missing failure points.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc7.

An x86_64 allyesconfig build showed no new warnings. As we do not have
an ENA device to test with, no runtime testing could be performed.

Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action")
Cc: stable@vger.kernel.org
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 23 ++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 92d149d4f091..5d05020a6d05 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -752,6 +752,18 @@ static void ena_destroy_all_tx_queues(struct ena_adapter *adapter)
 	}
 }
 
+static void ena_destroy_xdp_tx_queues(struct ena_adapter *adapter)
+{
+	u16 ena_qid;
+	int i;
+
+	for (i = adapter->xdp_first_ring;
+	     i < adapter->xdp_first_ring + adapter->xdp_num_queues; i++) {
+		ena_qid = ENA_IO_TXQ_IDX(i);
+		ena_com_destroy_io_queue(adapter->ena_dev, ena_qid);
+	}
+}
+
 static void ena_destroy_all_rx_queues(struct ena_adapter *adapter)
 {
 	u16 ena_qid;
@@ -2078,14 +2090,21 @@ static int create_queues_with_size_backoff(struct ena_adapter *adapter)
 		rc = ena_setup_tx_resources_in_range(adapter,
 						     0,
 						     adapter->num_io_queues);
-		if (rc)
+		if (rc) {
+			ena_destroy_xdp_tx_queues(adapter);
+			ena_free_all_io_tx_resources_in_range(adapter,
+							      adapter->xdp_first_ring,
+							      adapter->xdp_num_queues);
 			goto err_setup_tx;
+		}
 
 		rc = ena_create_io_tx_queues_in_range(adapter,
 						      0,
 						      adapter->num_io_queues);
-		if (rc)
+		if (rc) {
+			ena_destroy_xdp_tx_queues(adapter);
 			goto err_create_tx_queues;
+		}
 
 		rc = ena_setup_all_rx_resources(adapter);
 		if (rc)
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH 4/4] vhost/vsock: re-scan TX virtqueue on device start
From: Stefano Garzarella @ 2026-06-16 14:23 UTC (permalink / raw)
  To: Andrey Drobyshev
  Cc: linux-kernel, kvm, virtualization, netdev, mst, stefanha,
	maciej.szmigiero, bchaney, mark.kanda, ptikhomirov, den
In-Reply-To: <20260612165718.433546-5-andrey.drobyshev@virtuozzo.com>

On Fri, Jun 12, 2026 at 07:57:18PM +0300, Andrey Drobyshev wrote:
>During QEMU CPR live-update (and VHOST_RESET_OWNER in general) the guest
>keeps running while the host drops and later re-attaches vhost backends.
>If the guest adds a buffer to the TX virtqueue (guest->host) and kicks
>while the backend is temporarily NULL (between vhost_vsock_drop_backends()
>and the next vhost_vsock_start()), then the kick is delivered to the
>vhost worker, handle_tx_kick() sees a NULL backend and returns, and the
>kick signal is consumed.  The buffer is then left in the ring.
>
>Then upon device start vhost_vsock_start() only re-kicks the RX send
>worker, never the TX VQ, so the buffer is processed only if the guest
>happens to kick again.  But if the guest itself is now waiting for data
>from the host, it will never kick TX VQ again, and we end up in a
>deadlock.
>
>The deadlock is reproduced during active host->guest socat data transfer
>under multiple consecutive CPR live-update's.
>
>To fix this, in vhost_vsock_start(), after kicking the RX send worker, also
>queue the TX vq poll so any buffers the guest enqueued while we were paused
>get scanned.

Again, it seems like we're fixing an issue that existed before this
series, but IIUC without support for VHOST_RESET_OWNER, this could never
have happened, so the wording should be changed to make it clear that
this can happen only with the new VHOST_RESET_OWNER support.

In addition, this patch must also be applied before the 
VHOST_RESET_OWNER support or merged into it.

>
>Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
>---
> drivers/vhost/vsock.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index bcaba36becd7..1fcfe71d18be 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -655,6 +655,12 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
> 	 */
> 	vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
>
>+	/*
>+	 * Some packets might've also been queued in TX VQ.  Re-scan it here,
>+	 * mirroring the RX send-worker kick above.
>+	 */

Can we also mention that this is related to VHOST_RESET_OWNER?

Thanks,
Stefano

>+	vhost_poll_queue(&vsock->vqs[VSOCK_VQ_TX].poll);
>+
> 	mutex_unlock(&vsock->dev.mutex);
> 	return 0;
>
>-- 
>2.47.1
>


^ permalink raw reply
