Netdev List
* [PATCH bpf v2] bpf, sockmap: disallow update and delete from tc, xdp and flow_dissector
From: Sechang Lim @ 2026-06-20  3:46 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Stanislav Fomichev, Lorenz Bauer, Jiayuan Chen, bpf, linux-kernel,
	netdev

sock_map_update_common() and __sock_map_delete() hold stab->lock and call
sock_map_unref() -> sock_map_del_link(), which takes sk_callback_lock for
write. That gives the order stab->lock -> sk_callback_lock.

The reverse order comes from the SK_SKB stream parser.
sk_psock_strp_data_ready() holds sk_callback_lock for read and, after the
verdict, tcp_bpf_strp_read_sock() ACKs the consumed data inline via
__tcp_cleanup_rbuf(). The ACK goes out on the egress path, where a sched_cls
program deletes from the sockmap and takes stab->lock:

  WARNING: possible circular locking dependency detected
  7.1.0-rc6 Not tainted
  ------------------------------------------------------
  syz.9.8824 is trying to acquire lock:
  (&stab->lock){+.-.}-{3:3}, at: __sock_map_delete net/core/sock_map.c:421
  but task is already holding lock:
  (clock-AF_INET){++.-}-{3:3}, at: sk_psock_strp_data_ready net/core/skmsg.c:1173

  -> #1 (clock-AF_INET){++.-}-{3:3}:
         _raw_write_lock_bh
         sock_map_del_link net/core/sock_map.c:167
         sock_map_unref net/core/sock_map.c:184
         sock_map_update_common net/core/sock_map.c:509
         sock_map_update_elem_sys net/core/sock_map.c:588
         map_update_elem kernel/bpf/syscall.c:1805

  -> #0 (&stab->lock){+.-.}-{3:3}:
         _raw_spin_lock_bh
         __sock_map_delete net/core/sock_map.c:421
         sock_map_delete_elem net/core/sock_map.c:452
         bpf_prog_06044d24140080b6
         tcx_run net/core/dev.c:4451
         sch_handle_egress net/core/dev.c:4541
         __dev_queue_xmit net/core/dev.c:4808
         ...
         tcp_bpf_strp_read_sock net/ipv4/tcp_bpf.c:701
         strp_data_ready net/strparser/strparser.c:402
         sk_psock_strp_data_ready net/core/skmsg.c:1174
         tcp_data_queue net/ipv4/tcp_input.c:5661

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    rlock(clock-AF_INET);
                                 lock(&stab->lock);
                                 lock(clock-AF_INET);
    lock(&stab->lock);

   *** DEADLOCK ***

A tc, xdp or flow_dissector program has no reason to update or delete a
sockmap, and redirect does not go through here. Drop them from
may_update_sockmap() so the verifier rejects such programs. This also
closes the matching sockhash inversion.

Fixes: 0126240f448d ("bpf: sockmap: Allow update from BPF")
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
---
v2:
 - reject sockmap update/delete from tc, xdp and flow_dissector (John
   Fastabend)
 - fix the changelog (Jiayuan Chen)

v1:
 - https://lore.kernel.org/all/20260616091153.2966617-1-rhkrqnwk98@gmail.com/

 kernel/bpf/verifier.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7fb88e1cd7c4..94d225521b5a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8766,11 +8766,7 @@ static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id)
 			return true;
 		break;
 	case BPF_PROG_TYPE_SOCKET_FILTER:
-	case BPF_PROG_TYPE_SCHED_CLS:
-	case BPF_PROG_TYPE_SCHED_ACT:
-	case BPF_PROG_TYPE_XDP:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
-	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return true;
 	default:
-- 
2.43.0
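
The lock inversion described in the changelog above can be illustrated with a
small userspace sketch. It is not lockdep and not kernel code: the enum names,
struct order_graph and order_graph_acquire() are hypothetical, and the check
only catches a direct two-lock (ABBA) inversion, which is the shape of the
report above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { STAB_LOCK, SK_CALLBACK_LOCK, NR_LOCK_CLASSES };

struct order_graph {
	/* after[a][b] is true once lock b was taken while lock a was held. */
	bool after[NR_LOCK_CLASSES][NR_LOCK_CLASSES];
};

static void order_graph_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Record the dependency "held -> next". Returns false when the reverse
 * dependency was recorded earlier, meaning two code paths take the same
 * pair of locks in opposite order and can deadlock against each other.
 */
static bool order_graph_acquire(struct order_graph *g,
				enum lock_class held, enum lock_class next)
{
	if (g->after[next][held])
		return false;
	g->after[held][next] = true;
	return true;
}
```

Recording STAB_LOCK then SK_CALLBACK_LOCK (the update/delete path) succeeds,
and a later attempt to record SK_CALLBACK_LOCK then STAB_LOCK (the strparser
path reaching a tc program) is reported as an inversion.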



* [syzbot] [fs?] [mm?] INFO: rcu detected stall in dentry_kill
From: syzbot @ 2026-06-20  3:58 UTC
  To: brauner, jack, linux-fsdevel, linux-kernel, linux-mm, netdev,
	syzkaller-bugs, viro

Hello,

syzbot found the following issue on:

HEAD commit:    b85966adbf5d Merge tag 'net-next-7.2' of git://git.kernel...
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15ffe3a1580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9a9f723a32776544
dashboard link: https://syzkaller.appspot.com/bug?extid=0635dc2e2c3c21a6aa04
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1192ccfe580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10dec2ae580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/d65306d96573/disk-b85966ad.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/ef43139aab0e/vmlinux-b85966ad.xz
kernel image: https://storage.googleapis.com/syzbot-assets/26d4d1ab67c3/bzImage-b85966ad.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0635dc2e2c3c21a6aa04@syzkaller.appspotmail.com

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: 	0-...!: (1 GPs behind) idle=8aec/1/0x4000000000000000 softirq=15232/15238 fqs=0
rcu: 	(detected by 1, t=10502 jiffies, g=12001, q=779 ncpus=2)
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 UID: 0 PID: 5691 Comm: udevd Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:lock_release+0x2d3/0x3c0 kernel/locking/lockdep.c:5893
Code: 65 c7 05 2c 91 98 11 00 00 00 00 eb b5 e8 45 d1 05 0a f7 c3 00 02 00 00 74 b9 65 48 8b 05 45 4c 98 11 48 3b 44 24 28 75 44 fb <48> 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 48 8d
RSP: 0018:ffffc90000007c98 EFLAGS: 00000046
RAX: 2f357cb7f4202a00 RBX: ffff88803147f2a8 RCX: 0000000000010002
RDX: 0000000000010000 RSI: ffffffff8c291100 RDI: ffffffff8c2910c0
RBP: dffffc0000000000 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff52000000f90 R12: ffff8880611c6000
R13: ffffffff89b61a3a R14: ffff88803147f2c0 R15: ffff88803147f300
FS:  0000000000000000(0000) GS:ffff88812527c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564961a89a38 CR3: 000000000e746000 CR4: 00000000003526f0
Call Trace:
 <IRQ>
 __raw_spin_unlock include/linux/spinlock_api_smp.h:167 [inline]
 _raw_spin_unlock+0x16/0x50 kernel/locking/spinlock.c:190
 spin_unlock include/linux/spinlock.h:390 [inline]
 advance_sched+0x99a/0xc80 net/sched/sch_taprio.c:988
 __run_hrtimer kernel/time/hrtimer.c:2032 [inline]
 __hrtimer_run_queues+0x3bc/0xa10 kernel/time/hrtimer.c:2096
 hrtimer_interrupt+0x448/0x910 kernel/time/hrtimer.c:2215
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1051 [inline]
 __sysvec_apic_timer_interrupt+0x102/0x430 arch/x86/kernel/apic/apic.c:1068
 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1062 [inline]
 sysvec_apic_timer_interrupt+0xa1/0xc0 arch/x86/kernel/apic/apic.c:1062
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:674
RIP: 0010:__unwind_start+0x514/0x660 arch/x86/kernel/unwind_orc.c:-1
Code: 10 42 80 3c 28 00 4c 8d 7b 38 74 08 4c 89 ff e8 12 7a ba 00 48 8b 44 24 08 49 39 07 0f 87 b6 fb ff ff 48 89 df e8 cc d0 ff ff <48> 8b 04 24 42 0f b6 04 28 84 c0 75 11 83 3b 00 4c 89 f1 0f 85 5b
RSP: 0018:ffffc9000432f590 EFLAGS: 00000282
RAX: 00000000f218b401 RBX: ffffc9000432f5e8 RCX: 0000000080000001
RDX: ffffc9000432f601 RSI: ffffffff8c291100 RDI: ffff888034f03e00
RBP: 1ffff92000865ebf R08: ffffc9000432f5d8 R09: 0000000000000000
R10: ffffc9000432f638 R11: fffff52000865ec9 R12: 1ffff92000865ebe
R13: dffffc0000000000 R14: ffffc9000432f5f8 R15: ffffc9000432f620
 unwind_start arch/x86/include/asm/unwind.h:64 [inline]
 arch_stack_walk+0xe3/0x150 arch/x86/kernel/stacktrace.c:24
 stack_trace_save+0xa9/0x100 kernel/stacktrace.c:122
 kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57
 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556
 __call_rcu_common kernel/rcu/tree.c:3159 [inline]
 call_rcu+0xee/0x8b0 kernel/rcu/tree.c:3279
 __destroy_inode+0x2a1/0x630 fs/inode.c:365
 destroy_inode fs/inode.c:388 [inline]
 evict+0x8d4/0xb50 fs/inode.c:852
 dentry_kill+0x1b9/0x880 fs/dcache.c:826
 finish_dput+0x1a/0x260 fs/dcache.c:1001
 __fput+0x675/0xa50 fs/file_table.c:520
 task_work_run+0x1d9/0x270 kernel/task_work.c:233
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0x73a/0x2360 kernel/exit.c:1004
 do_group_exit+0x22d/0x2f0 kernel/exit.c:1147
 __do_sys_exit_group kernel/exit.c:1158 [inline]
 __se_sys_exit_group kernel/exit.c:1156 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1156
 x64_sys_call+0x221a/0x2240 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fbd5bcf16c5
Code: Unable to access opcode bytes at 0x7fbd5bcf169b.
RSP: 002b:00007ffe420f4688 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000564961aa4f80 RCX: 00007fbd5bcf16c5
RDX: 00000000000000e7 RSI: fffffffffffffe68 RDI: 0000000000000000
RBP: 0000564961a80910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe420f46d0 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
rcu: rcu_preempt kthread starved for 10502 jiffies! g12001 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu: 	Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt     state:R  running task     stack:28040 pid:16    tgid:16    ppid:2      task_flags:0x208040 flags:0x00080000
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5504 [inline]
 __schedule+0x17d9/0x56c0 kernel/sched/core.c:7228
 __schedule_loop kernel/sched/core.c:7307 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:7322
 schedule_timeout+0x152/0x2c0 kernel/time/sleep_timeout.c:99
 rcu_gp_fqs_loop+0x30c/0x11f0 kernel/rcu/tree.c:2123
 rcu_gp_kthread+0x9e/0x2b0 kernel/rcu/tree.c:2325
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
rcu: Stack dump where RCU GP kthread last ran:
CPU: 1 UID: 0 PID: 5689 Comm: udevd Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:csd_lock_wait kernel/smp.c:342 [inline]
RIP: 0010:smp_call_function_many_cond+0x10b0/0x14b0 kernel/smp.c:892
Code: c0 75 73 41 8b 1e 89 de 83 e6 01 31 ff e8 98 02 0c 00 83 e3 01 48 bb 00 00 00 00 00 fc ff df 75 07 e8 44 fe 0b 00 eb 37 f3 90 <41> 0f b6 04 1c 84 c0 75 10 41 f7 06 01 00 00 00 74 1e e8 29 fe 0b
RSP: 0000:ffffc9000430f840 EFLAGS: 00000293
RAX: ffffffff81b9f7f7 RBX: dffffc0000000000 RCX: ffff88807f020000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffffc9000430f970 R08: ffffffff903116f7 R09: 1ffffffff20622de
R10: dffffc0000000000 R11: fffffbfff20622df R12: 1ffff110170c85c5
R13: ffff8880b873c2c8 R14: ffff8880b8642e28 R15: 0000000000000000
FS:  00007fbd5c388880(0000) GS:ffff88812537c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564961a89a38 CR3: 0000000044280000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 on_each_cpu_cond_mask+0x3f/0x80 kernel/smp.c:1057
 __flush_tlb_multi arch/x86/include/asm/paravirt.h:46 [inline]
 flush_tlb_multi arch/x86/mm/tlb.c:1361 [inline]
 flush_tlb_mm_range+0x5c4/0x1090 arch/x86/mm/tlb.c:1451
 flush_tlb_page arch/x86/include/asm/tlbflush.h:345 [inline]
 ptep_clear_flush+0x120/0x170 mm/pgtable-generic.c:104
 wp_page_copy mm/memory.c:3941 [inline]
 do_wp_page+0x3d52/0x4c70 mm/memory.c:4336
 handle_pte_fault mm/memory.c:6443 [inline]
 __handle_mm_fault mm/memory.c:6565 [inline]
 handle_mm_fault+0x1490/0x3080 mm/memory.c:6734
 do_user_addr_fault+0xa4d/0x1340 arch/x86/mm/fault.c:1339
 handle_page_fault arch/x86/mm/fault.c:1479 [inline]
 exc_page_fault+0x6a/0xc0 arch/x86/mm/fault.c:1532
 asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:595
RIP: 0033:0x7fbd5c3ada9a
Code: 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 53 48 85 ff 74 2f 48 8b 47 08 48 39 c7 74 21 48 8b 1f 48 39 df 74 19 48 89 18 <48> 89 43 08 e8 8d d9 ff ff 48 89 d8 5b c3 0f 1f 84 00 00 00 00 00
RSP: 002b:00007ffe420f4620 EFLAGS: 00010202
RAX: 0000564961a8a0b0 RBX: 0000564961a89a30 RCX: 0000000000000000
RDX: 0000564961a95430 RSI: 0000564961a91f60 RDI: 0000564961a8f4e0
RBP: 0000564961a8f4e0 R08: 0000564961a91f70 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000297 R12: 0000564958c24588
R13: 00007ffe420f46d0 R14: 0000000000000000 R15: 0000000000000000
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* [PATCH net] bnx2x: fix potential memory leak in bnx2x_alloc_mem_bp()
From: Abdun Nihaal @ 2026-06-20  6:23 UTC
  To: skalluru
  Cc: Abdun Nihaal, manishc, andrew+netdev, davem, edumazet, kuba,
	pabeni, netdev, linux-kernel, barak, stable

If the allocation of fp[i].tpa_info fails, the error path will not free
the struct bnx2x_fastpath allocated earlier, as it is not linked to the
bp structure yet. Fix that by linking it immediately after allocation.

Cc: stable@vger.kernel.org
Fixes: 15192a8cf8a8 ("bnx2x: Split the FP structure")
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
---
Compile tested only. Issue found using static analysis.

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 19e078479b0d..5b2640bd31c3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4748,6 +4748,7 @@ int bnx2x_alloc_mem_bp(struct bnx2x *bp)
 	fp = kzalloc_objs(*fp, bp->fp_array_size);
 	if (!fp)
 		goto alloc_err;
+	bp->fp = fp;
 	for (i = 0; i < bp->fp_array_size; i++) {
 		fp[i].tpa_info =
 			kzalloc_objs(struct bnx2x_agg_info,
@@ -4756,8 +4757,6 @@ int bnx2x_alloc_mem_bp(struct bnx2x *bp)
 			goto alloc_err;
 	}
 
-	bp->fp = fp;
-
 	/* allocate sp objs */
 	bp->sp_objs = kzalloc_objs(struct bnx2x_sp_objs, bp->fp_array_size);
 	if (!bp->sp_objs)
-- 
2.43.0
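
The leak pattern described in the changelog above can be sketched in
userspace. Everything here is a hypothetical stand-in (struct adapter, struct
fastpath, the counting allocator and the link_early switch); it only assumes,
as the changelog states, that the error path frees what is reachable from the
owning structure.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct bnx2x and struct bnx2x_fastpath. */
struct fastpath { void *tpa_info; };
struct adapter { struct fastpath *fp; size_t nr_fp; };

static int live_allocs;   /* allocations not yet freed */
static int alloc_calls;   /* number of allocation attempts so far */
static int fail_at = -1;  /* attempt index that fails, -1 for never */

static void *test_zalloc(size_t n, size_t size)
{
	void *p;

	if (alloc_calls++ == fail_at)
		return NULL;
	p = calloc(n, size);
	if (p)
		live_allocs++;
	return p;
}

static void test_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* The error path can only free what is reachable from the adapter. */
static void adapter_free(struct adapter *a)
{
	size_t i;

	if (a->fp)
		for (i = 0; i < a->nr_fp; i++)
			test_free(a->fp[i].tpa_info);
	test_free(a->fp);
	a->fp = NULL;
}

static int adapter_alloc(struct adapter *a, int link_early)
{
	struct fastpath *fp;
	size_t i;

	fp = test_zalloc(a->nr_fp, sizeof(*fp));
	if (!fp)
		goto err;
	if (link_early)
		a->fp = fp;	/* link before anything else can fail */
	for (i = 0; i < a->nr_fp; i++) {
		fp[i].tpa_info = test_zalloc(1, 64);
		if (!fp[i].tpa_info)
			goto err;
	}
	a->fp = fp;
	return 0;
err:
	adapter_free(a);
	return -1;
}

/* Fail the third allocation and report how many objects were leaked. */
static int leaked_after_failed_alloc(int link_early)
{
	struct adapter a = { .fp = NULL, .nr_fp = 4 };

	live_allocs = 0;
	alloc_calls = 0;
	fail_at = 2;
	adapter_alloc(&a, link_early);
	return live_allocs;
}
```

With late linking, the array and the one per-entry buffer allocated before the
failure stay unreachable from the error path; with early linking, the same
error path releases both.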



* Re: [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
From: Markus Stockhausen @ 2026-06-20  6:45 UTC
  To: 'Jan Klos', 'Heiner Kallweit',
	'Andrew Lunn', 'Russell King', netdev
  Cc: 'Maxime Chevallier', 'David S. Miller',
	'Eric Dumazet', 'Jakub Kicinski',
	'Paolo Abeni', 'Daniel Golle',
	'Vladimir Oltean', 'Aleksander Jan Bajkowski',
	'Jan Hoffmann', 'Issam Hamdi',
	'Chukun Pan', 'Russell King (Oracle)',
	'ChunHao Lin', linux-kernel
In-Reply-To: <20260620011956.37181-1-honza.klos@gmail.com>

> From: Jan Klos <honza.klos@gmail.com>
> Sent: Saturday, 20 June 2026 03:20
> Subject: [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
>
> On an RTL8127A connected to a link partner that advertises 10000baseT,
> the speed cannot be changed to anything other than 10000baseT, as 10GbE
> is always advertised regardless of any setting. Fix this by
> clearing the MDIO_AN_10GBT_CTRL_ADV10G bit in rtl822x_config_aneg()'s
> call to phy_modify_mmd_changed().

As you are enhancing the mask, shouldn't this be "... by respecting ..."?

Markus



* [PATCH] netdevsim: fix use-after-free in __nsim_dev_port_del
From: Hrushiraj Gandhi @ 2026-06-20  6:49 UTC
  To: Jakub Kicinski
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Paolo Abeni,
	Jiri Pirko, netdev, linux-kernel, Hrushiraj Gandhi,
	syzbot+6c25f4750230faf70be9

debugfs files created under a port's ddir (ethtool/get_err,
ethtool/set_err, ring params, bpf_offloaded_id, udp_ports/inject_error,
etc.) store raw pointers directly into the netdevsim struct, which lives
in the net_device private data kmalloc slab.

If these files outlive the netdevsim struct, a concurrent reader can
trigger a slab-use-after-free by passing debugfs_file_get() (which only
checks dentry lifetime) and then dereferencing the freed data pointer
in debugfs_u32_get().

In __nsim_dev_port_del(), nsim_destroy() is called before
nsim_dev_port_debugfs_exit(). However, nsim_destroy() calls free_netdev()
at its end, while nsim_dev_port_debugfs_exit() removes the port's
debugfs directory. This means the slab is freed before the debugfs
files are removed.

Fix by calling debugfs_remove_recursive(ns->nsim_dev_port->ddir) in
nsim_destroy() right before free_netdev(). This ensures all per-port
debugfs files are destroyed synchronously before the backing memory is
freed. The subsequent call to nsim_dev_port_debugfs_exit() in
__nsim_dev_port_del() becomes a harmless no-op.

Reported-by: syzbot+6c25f4750230faf70be9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6c25f4750230faf70be9
Fixes: e05b2d141fef ("netdevsim: move netdev creation/destruction to dev probe")
Signed-off-by: Hrushiraj Gandhi <hrushirajg23@gmail.com>
---
 drivers/net/netdevsim/netdev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 27e5f109f933..08136e7990cb 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -1214,6 +1214,13 @@ void nsim_destroy(struct netdevsim *ns)
 		ns->page = NULL;
 	}
 
+	/*
+	 * Remove per-port debugfs files before free_netdev() releases the
+	 * netdevsim struct to prevent use-after-free in concurrent readers.
+	 */
+	debugfs_remove_recursive(ns->nsim_dev_port->ddir);
+	ns->nsim_dev_port->ddir = NULL;
+
 	free_netdev(dev);
 }
 
-- 
2.47.3



* Re: [RFC net-next 3/4] net: dsa: motorcomm: Dynamically allocate port structures
From: Andrew Lunn @ 2026-06-20  8:03 UTC
  To: David Yang
  Cc: netdev, Vladimir Oltean, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <CAAXyoMN9a6nncr-C1UDuUQMf4i5FR-9s-sfhSpYLYs5nmh9Uhg@mail.gmail.com>

On Fri, Jun 19, 2026 at 02:46:24PM +0800, David Yang wrote:
> On Fri, Jun 19, 2026 at 2:06 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Fri, Jun 19, 2026 at 04:26:31AM +0800, David Yang wrote:
> > > With support for LED introduced later, struct yt921x_priv will be 17k
> > > which is not very good for a single kmalloc(). Convert the ports array
> > > to an array of pointers to stop bloating the priv struct.
> > >
> > > Signed-off-by: David Yang <mmyangfl@gmail.com>
> > > ---
> > >  drivers/net/dsa/motorcomm/chip.c | 95 ++++++++++++++++++++++++--------
> > >  drivers/net/dsa/motorcomm/chip.h |  3 +-
> > >  2 files changed, 75 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/net/dsa/motorcomm/chip.c b/drivers/net/dsa/motorcomm/chip.c
> > > index 6dee25b6754a..d44f7749de02 100644
> > > --- a/drivers/net/dsa/motorcomm/chip.c
> > > +++ b/drivers/net/dsa/motorcomm/chip.c
> > > @@ -548,11 +548,14 @@ yt921x_mbus_ext_init(struct yt921x_priv *priv, struct device_node *mnp)
> > >  /* Read and handle overflow of 32bit MIBs. MIB buffer must be zeroed before. */
> > >  static int yt921x_read_mib(struct yt921x_priv *priv, int port)
> > >  {
> > > -     struct yt921x_port *pp = &priv->ports[port];
> > > +     struct yt921x_port *pp = priv->ports[port];
> > >       struct device *dev = to_device(priv);
> > >       struct yt921x_mib *mib = &pp->mib;
> > >       int res = 0;
> > >
> > > +     if (!pp)
> > > +             return -ENODEV;
> > > +
> >
> > Are all these tests actually needed? If you cannot allocate the
> > memory, I would expect the probe to fail, so you can never get here.
> >
> >         Andrew
> 
> Dummy ports are no longer assigned control blocks (in yt921x_dsa_setup).

This seems pretty error prone. A missing check will result in an
oops. At least it will be obvious. How big is each port structure? Is
the memory saving worth it?

    Andrew


* [PATCH net v5] net: airoha: Fix skb->priority underflow in airoha_dev_select_queue()
From: Wayen Yan @ 2026-06-20  8:17 UTC
  To: netdev
  Cc: lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
	angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
	linux-mediatek, Joe Damato

In airoha_dev_select_queue(), the expression:

  queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES;

implicitly converts to unsigned arithmetic: when skb->priority is 0
(the default for unclassified traffic), (0u - 1u) wraps to UINT_MAX,
and UINT_MAX % 8 = 7, routing default best-effort packets to the
highest-priority QoS queue. This causes QoS inversion where the
majority of traffic on a PON gateway starves actual high-priority
flows (VoIP, gaming, etc.).

The "- 1" offset was a leftover from the ETS offload implementation
that has since been removed. The correct mapping is a direct modulo:

  queue = skb->priority % AIROHA_NUM_QOS_QUEUES;

This maps priority 0 → queue 0 (lowest), priority 7 → queue 7
(highest), with higher priorities wrapping around. This is the
standard Linux sk_prio → HW queue mapping used by other drivers.

Fixes: 2b288b81560b ("net: airoha: Introduce ndo_select_queue callback")
Link: https://lore.kernel.org/netdev/178185573207.2378135.3729126358670287878@gmail.com/
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Joe Damato <joe@dama.to>
---
Changes in v5:
- Rebase on net/main (previous version was incorrectly based on
  net-next/origin/master, causing Patchwork CI apply failure).

Signed-off-by: Wayen Yan <win847@gmail.com>
---
 drivers/net/ethernet/airoha/airoha_eth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
index 64dde6464f3f..3370c3df7c10 100644
--- a/drivers/net/ethernet/airoha/airoha_eth.c
+++ b/drivers/net/ethernet/airoha/airoha_eth.c
@@ -2110,7 +2110,7 @@ static u16 airoha_dev_select_queue(struct net_device *netdev,
 	 */
 	channel = netdev_uses_dsa(netdev) ? skb_get_queue_mapping(skb) : port->id;
 	channel = channel % AIROHA_NUM_QOS_CHANNELS;
-	queue = (skb->priority - 1) % AIROHA_NUM_QOS_QUEUES; /* QoS queue */
+	queue = skb->priority % AIROHA_NUM_QOS_QUEUES;
 	queue = channel * AIROHA_NUM_QOS_QUEUES + queue;
 
 	return queue < netdev->num_tx_queues ? queue : 0;
-- 
2.51.0
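
The wraparound described in the changelog above can be reproduced with plain
unsigned arithmetic. This is a minimal sketch, not driver code: NUM_QOS_QUEUES
is assumed to be 8 (matching the "% 8" in the changelog) and both function
names are hypothetical.

```c
#include <stdint.h>

/* Assumed value, matching the "UINT_MAX % 8" arithmetic in the changelog. */
#define NUM_QOS_QUEUES 8u

/* Old mapping: priority 0 wraps to UINT32_MAX before the modulo. */
static uint32_t queue_with_offset(uint32_t priority)
{
	return (priority - 1u) % NUM_QOS_QUEUES;
}

/* New mapping: a direct modulo, so priority 0 stays on queue 0. */
static uint32_t queue_direct(uint32_t priority)
{
	return priority % NUM_QOS_QUEUES;
}
```

Since UINT32_MAX is 4294967295 and 4294967295 % 8 is 7, the old mapping sends
priority 0 to queue 7, while the direct modulo keeps it on queue 0 and wraps
priority 8 back to queue 0.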




* [PATCH] net: liquidio: fix BAR resource leak on PF number failure
From: Haoxiang Li @ 2026-06-20  8:37 UTC
  To: andrew+netdev, davem, kuba, pabeni, felix.manlunas,
	ricardo.farrington
  Cc: netdev, linux-kernel, Haoxiang Li, stable

If cn23xx_get_pf_num() fails, the function returns without
unmapping either BAR. Unmap both BARs before returning from
the error path.

Fixes: 0c45d7fe12c7 ("liquidio: fix use of pf in pass-through mode in a virtual machine")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
 drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 75f22f74774c..a1548ca81ecd 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1167,8 +1167,11 @@ int setup_cn23xx_octeon_pf_device(struct octeon_device *oct)
 		return 1;
 	}
 
-	if (cn23xx_get_pf_num(oct) != 0)
+	if (cn23xx_get_pf_num(oct) != 0) {
+		octeon_unmap_pci_barx(oct, 0);
+		octeon_unmap_pci_barx(oct, 1);
 		return 1;
+	}
 
 	if (cn23xx_sriov_config(oct)) {
 		octeon_unmap_pci_barx(oct, 0);
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH iwl-next v2] ixgbe: Implement PCI reset handler
From: Simon Horman @ 2026-06-20  8:53 UTC
  To: sergey.temerkhanov; +Cc: Simon Horman, intel-wired-lan, netdev, pmenzel
In-Reply-To: <20260618142212.310475-1-sergey.temerkhanov@intel.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ixgbe: Implement PCI reset handler

Implement PCI device reset handler to allow the network device to
get re-initialized and function after a PCI-level reset.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
[ ... ]
> @@ -12338,6 +12338,86 @@ static pci_ers_result_t ixgbe_io_slot_reset(struct pci_dev *pdev)
> +static void ixgbe_reset_prep(struct pci_dev *pdev)
> +{
> +	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
> +
> +	if (!adapter)
> +		return;
> +
> +	if (poll_timeout_us(test_and_set_bit(__IXGBE_RESETTING, &adapter->state),
> +			    test_bit(__IXGBE_RESETTING, &adapter->state),

[Severity: Medium]
Does this introduce a new instance of an ad-hoc bit lock?

According to the networking drivers guidelines, using bit operations like
test_and_set_bit() to guard critical sections instead of proper
synchronization primitives like a mutex or spinlock can hide issues from
lockdep and lacks deadlock detection.

[Severity: Critical]
Does this misuse the poll_timeout_us() macro?

The macro executes the operation before evaluating the condition. Since
test_and_set_bit() unconditionally sets the bit to 1, test_bit() will always
evaluate to true on the very first loop iteration, regardless of whether the
lock was already held.

Doesn't this completely break mutual exclusion?

> +			    IXGBE_RESET_PREP_POLL_INTERVAL_US,
> +			    IXGBE_RESET_PREP_TIMEOUT_US, false)) {
> +		/* ixgbe_reset_done() will exit early if this happens.
> +		 * A retry will be needed
> +		 */
> +		e_err(drv, "Timed out waiting for __IXGBE_RESETTING to be released. Reset is needed\n");
> +		return;

[Severity: High]
If this times out and returns early, won't it abandon the software
reinitialization process while the PCI subsystem proceeds with the physical
reset?

The .reset_prepare callback returns void, so the reset continues. By
returning early without setting __IXGBE_PCIE_RESET_IN_PROGRESS, critical
teardowns are skipped, and ixgbe_reset_done() will also bail out early:

ixgbe_reset_done() {
    if (!test_and_clear_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state)) {
        e_err(drv, "Reset done called without PCIe reset in progress\n");
        return;
    }
}

Could this permanently break the network interface until the module is
reloaded?

> +	}
> +
> +	/* Sync __IXGBE_RESETTING */
> +	smp_mb__after_atomic();
> +
> +	if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
> +		/* Prevent the service task from being requeued in the timer callback */
> +		timer_delete_sync(&adapter->service_timer);
> +		/* Cancel any possibly queued service task */
> +		cancel_work_sync(&adapter->service_task);

[Severity: Critical]
Can this sequence cause a deadlock with the service task?

Here in ixgbe_reset_prep(), __IXGBE_RESETTING has just been acquired,
and then cancel_work_sync(&adapter->service_task) blocks waiting for
the service task to complete.

However, if the active service task reaches ixgbe_reinit_locked():

ixgbe_reinit_locked() {
    while (test_and_set_bit(__IXGBE_RESETTING, &adapter->state))
        usleep_range(1000, 2000);
    ...
}

The service task will spin infinitely waiting for __IXGBE_RESETTING to be
cleared. Since ixgbe_reset_prep() holds the bit lock and is blocked waiting
for the service task, neither thread can make progress.

> +	}
> +
> +	pci_clear_master(pdev);
> +
> +	set_bit(__IXGBE_PCIE_RESET_IN_PROGRESS, &adapter->state);
> +}
[ ... ]
> +static void ixgbe_reset_done(struct pci_dev *pdev)
> +{
> +	struct ixgbe_adapter *adapter = pci_get_drvdata(pdev);
[ ... ]
> +	/* Allow the service task to run */
> +	if (!test_bit(__IXGBE_REMOVING, &adapter->state)) {
> +		clear_bit(__IXGBE_RESETTING, &adapter->state);
> +		/* Sync __IXGBE_RESETTING */
> +		smp_mb__after_atomic();

[Severity: Medium]
Does this provide the correct release semantics for the ad-hoc lock?

The clear_bit() operation does not imply a memory barrier, and
smp_mb__after_atomic() only orders memory operations that occur after the
atomic operation.

Could memory accesses from within the critical section (like the
pci_set_master() call earlier) bleed past the lock release, allowing other
threads to acquire the lock too early?

Is it possible to use clear_bit_unlock() or place an smp_mb__before_atomic()
prior to the clear_bit() instead?

> +	}
[ ... ]


* [PATCH net] nfc: st-nci: use unaligned accessors for frame length
From: Runyu Xiao @ 2026-06-20  9:05 UTC
  To: Krzysztof Kozlowski, netdev
  Cc: Samuel Ortiz, Christophe Ricard, linux-kernel, Runyu Xiao,
	Jianhao Xu, stable

The ST NCI I2C and SPI transports parse a frame length from bytes
received from the controller. Both paths first read the frame header into
a local u8 buffer and then cast buf + 2 to __be16 * before converting it
from big endian.

These are transport byte buffers, not __be16 objects. Use
get_unaligned_be16() for the NCI frame length field in both the I2C and
SPI transports.

This issue was detected by our static analysis tool and confirmed by
manual audit. A focused UBSAN alignment validation kept the original
access shape, be16_to_cpu(*(__be16 *)(buf + 2)), and ran it on an NCI
frame byte buffer with buf + 2 at an odd address. UBSAN reported a
misaligned-access load of type '__be16', and the trace contained
st_nci_i2c_read().

The driver has the same source-level issue: the transport helpers fill
u8 buffers, and the length checks only prove that the bytes are present.
They do not establish a __be16 object at buf + 2 or a 2-byte alignment
guarantee before the typed load.

Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci")
Fixes: 2bc4d4f8c8f3 ("nfc: st-nci: Add spi phy support for st21nfcb")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 drivers/nfc/st-nci/i2c.c | 3 ++-
 drivers/nfc/st-nci/spi.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nfc/st-nci/i2c.c b/drivers/nfc/st-nci/i2c.c
index 9ae839a6f5cc..29fdb4ae56e0 100644
--- a/drivers/nfc/st-nci/i2c.c
+++ b/drivers/nfc/st-nci/i2c.c
@@ -14,6 +14,7 @@
 #include <linux/delay.h>
 #include <linux/nfc.h>
 #include <linux/of.h>
+#include <linux/unaligned.h>
 
 #include "st-nci.h"
 
@@ -120,7 +121,7 @@ static int st_nci_i2c_read(struct st_nci_i2c_phy *phy,
 	if (r != ST_NCI_I2C_MIN_SIZE)
 		return -EREMOTEIO;
 
-	len = be16_to_cpu(*(__be16 *) (buf + 2));
+	len = get_unaligned_be16(buf + 2);
 	if (len > ST_NCI_I2C_MAX_SIZE) {
 		nfc_err(&client->dev, "invalid frame len\n");
 		return -EBADMSG;
diff --git a/drivers/nfc/st-nci/spi.c b/drivers/nfc/st-nci/spi.c
index 169eacc0a32a..1326c20e43fc 100644
--- a/drivers/nfc/st-nci/spi.c
+++ b/drivers/nfc/st-nci/spi.c
@@ -14,6 +14,7 @@
 #include <linux/delay.h>
 #include <linux/nfc.h>
 #include <linux/of.h>
+#include <linux/unaligned.h>
 #include <net/nfc/nci.h>
 
 #include "st-nci.h"
@@ -130,7 +131,7 @@ static int st_nci_spi_read(struct st_nci_spi_phy *phy,
 	if (r < 0)
 		return -EREMOTEIO;
 
-	len = be16_to_cpu(*(__be16 *) (buf + 2));
+	len = get_unaligned_be16(buf + 2);
 	if (len > ST_NCI_SPI_MAX_SIZE) {
 		nfc_err(&dev->dev, "invalid frame len\n");
 		phy->ndlc->hard_fault = 1;
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net v2 05/10] afs: Fix uncancelled rxrpc OOB message handler
From: Simon Horman @ 2026-06-20  9:13 UTC (permalink / raw)
  To: dhowells
  Cc: Simon Horman, netdev, marc.dionne, kuba, davem, edumazet, pabeni,
	linux-afs, linux-kernel, jaltman, stable
In-Reply-To: <20260618134802.2477777-6-dhowells@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
afs: Fix uncancelled rxrpc OOB message handler

This patch modifies AFS to cancel its OOB message processing during
shutdown. It moves OOB message processing to afs_wq and adds checks
for the net namespace's live state to prevent processing when no
longer live.

Link: https://sashiko.dev/#/patchset/20260609140911.838677-1-dhowells%40redhat.com

> diff --git a/fs/afs/cm_security.c b/fs/afs/cm_security.c
> index edcbd249d2024..103168c70dd4d 100644
> --- a/fs/afs/cm_security.c
> +++ b/fs/afs/cm_security.c

[ ... ]

> @@ -101,7 +101,8 @@ void afs_process_oob_queue(struct work_struct *work)
>  	struct sk_buff *oob;
>  	enum rxrpc_oob_type type;
>  
> -	while ((oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
> +	while (READ_ONCE(net->live) &&
> +	       (oob = rxrpc_kernel_dequeue_oob(net->socket, &type))) {
>  		switch (type) {
>  		case RXRPC_OOB_CHALLENGE:
>  			afs_respond_to_challenge(oob);

[Severity: Critical]
This isn't a bug introduced by this patch, but can afs_respond_to_challenge()
cause a null pointer dereference or use-after-free here?

When afs_respond_to_challenge() queries peer_data and casts it to an
afs_server pointer, it unconditionally dereferences it and locks its mutex:

fs/afs/cm_security.c:afs_respond_to_challenge() {
    ...
    server = (struct afs_server *)peer_data;
    if (!server->cm_rxgk_appdata.data) {
        mutex_lock(&server->cm_token_lock);
    ...
}

If a challenge is received for an unknown peer, or if afs_server_destroyer()
concurrently clears the peer app_data to null during teardown, could 
peer_data be null?

If the pointer was read just before being cleared, could the server object
be freed concurrently since no RCU read lock or refcount is held?

> diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
> index 6714a189d58fb..e8af2a6614405 100644
> --- a/fs/afs/rxrpc.c
> +++ b/fs/afs/rxrpc.c

[ ... ]

> @@ -989,5 +991,6 @@ static void afs_rx_notify_oob(struct sock *sk, struct sk_buff *oob)
>  {
>  	struct afs_net *net = sk->sk_user_data;
>  
> -	schedule_work(&net->rx_oob_work);
> +	if (net->live)
> +		queue_work(afs_wq, &net->rx_oob_work);
>  }

[Severity: Medium]
Should net->live be read with READ_ONCE() here?

Since net->live can be modified concurrently by namespace teardown code
such as afs_close_socket() or afs_net_exit(), reading it without READ_ONCE()
might allow the compiler to optimize the read, leading to a stale evaluation.

The updated afs_process_oob_queue() function correctly uses READ_ONCE().
Should afs_rx_notify_oob() do the same to prevent data races?

[Severity: High]
This isn't a bug introduced by this patch, but there appears to be a similar
issue in fs/afs/server.c when retrieving the server pointer.

Can afs_find_server() increment the refcount of a freed object?

afs_find_server() fetches the pointer and increments its refcount:

fs/afs/server.c:afs_find_server() {
    ...
    struct afs_server *server = (struct afs_server *)rxrpc_kernel_get_peer_data(peer);
    
    if (!server)
        return NULL;
    return afs_use_server(server, false, afs_server_trace_use_cm_call);
    ...
}

If afs_server_destroyer() runs concurrently, unbinds the peer, and frees
the server, could afs_find_server() fetch the pointer just before it is
cleared?

If preempted, it seems this operation lacks RCU read lock protection and
could increment the refcount of a freed object without using
refcount_inc_not_zero().

^ permalink raw reply

* Re: [PATCH net v2 06/10] rxrpc: Fix the reception of a reply packet before data transmission
From: Simon Horman @ 2026-06-20  9:17 UTC (permalink / raw)
  To: dhowells
  Cc: Simon Horman, netdev, marc.dionne, kuba, davem, edumazet, pabeni,
	linux-afs, linux-kernel, jaltman, stable
In-Reply-To: <20260618134802.2477777-7-dhowells@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
rxrpc: Fix the reception of a reply packet before data transmission

This commit fixes rxrpc_receiving_reply() to handle the reception of an
apparent reply DATA packet before rxrpc has had a chance to send any request
DATA packets. It also adjusts rxrpc_rotate_tx_window() to handle an empty
Tx queue by changing the do...while loop into a while loop.

Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com

> diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
> index 37881dffa898b..01ccd2d2fe92c 100644
> --- a/net/rxrpc/input.c
> +++ b/net/rxrpc/input.c
> @@ -247,7 +247,7 @@ static bool rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
>  		tq = call->tx_queue;
>  	}
>  
> -	do {
> +	while (before_eq(seq, to)) {

[Severity: Critical]
This is a pre-existing issue, but can this trigger a use-after-free write
during a concurrent sendmsg() if the remote peer sends an early reply?

Right before this loop in rxrpc_rotate_tx_window(), the code unconditionally
frees the active transmission queue buffer:

	if (seq == call->tx_qbase + RXRPC_NR_TXQUEUE) {
		call->tx_qbase += RXRPC_NR_TXQUEUE;
		call->tx_queue = tq->next;
		...
		kfree(tq);
		tq = call->tx_queue;
	}

If a full block of 63 packets is transmitted and ACKed, call->tx_bottom
reaches 63. If the remote peer then sends an early reply,
rxrpc_receiving_reply() calls rxrpc_rotate_tx_window() with to = 63 and seq
starting at 64.

This triggers the condition above and frees the active block (tq) without
verifying if it is still being written to. If it is still the tail
(tq == call->tx_qtail), call->send_queue and call->tx_qtail are left as
dangling pointers.

Concurrently, if the client is still sending data via sendmsg(), it might call
rxrpc_alloc_txqueue() for the 64th packet:

net/rxrpc/sendmsg.c:rxrpc_alloc_txqueue() {
	...
	if (call->send_queue) {
		tq->qbase = call->send_top + 1;
		call->send_queue->next = tq;
		call->send_queue = tq;
	}
	...
}

Will call->send_queue->next = tq write a newly allocated pointer into the
recently freed memory block?

[ ... ]

^ permalink raw reply

* Re: [PATCH net v2 07/10] rxrpc: Fix oob challenge leak in cleanup after notification failure
From: Simon Horman @ 2026-06-20  9:17 UTC (permalink / raw)
  To: dhowells
  Cc: Simon Horman, netdev, marc.dionne, kuba, davem, edumazet, pabeni,
	linux-afs, linux-kernel, jaltman, stable
In-Reply-To: <20260618134802.2477777-8-dhowells@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
rxrpc: Fix oob challenge leak in cleanup after notification failure

Fix rxrpc_notify_socket_oob() to return an indication of failure in the
event that it failed to queue a packet and fix rxrpc_post_challenge() to
clean up the connection ref in such an event.

Link: https://sashiko.dev/#/patchset/20260616155749.2125907-1-dhowells%40redhat.com

> diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
> index c96ca615b787c..611c790bc6d0c 100644
> --- a/net/rxrpc/conn_event.c
> +++ b/net/rxrpc/conn_event.c
> @@ -436,7 +436,7 @@ static bool rxrpc_post_challenge(struct rxrpc_connection *conn,
>  	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
>  	struct rxrpc_call *call = NULL;
>  	struct rxrpc_sock *rx;
> -	bool respond = false;
> +	bool respond = false, queued = false;

[Severity: Critical]
This is a pre-existing issue, but can the lockless read of
conn->channels[i].call in net/rxrpc/conn_event.c lead to a NULL pointer
dereference?

Looking at rxrpc_post_challenge(), the code does:

    if (conn->channels[i].call) {
        call = conn->channels[i].call;
        rx = rcu_dereference(call->socket);

Without READ_ONCE(), the compiler could issue two separate loads. If a
concurrent thread clears the channel's call pointer after the if-check but
before the assignment, call becomes NULL.

Does this cause the subsequent rcu_dereference(call->socket) to panic by
dereferencing NULL->socket?
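
The single-load pattern the question alludes to can be sketched in
userspace C as follows. This is only an illustration under stated
assumptions: struct call, struct channel and channel_socket_id() are
hypothetical names, and a relaxed C11 atomic load stands in for the
kernel's READ_ONCE(). It does not address object lifetime, which would
still need RCU or a reference.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the rxrpc structures. */
struct call {
	int socket_id;
};

struct channel {
	_Atomic(struct call *) call;
};

/*
 * Load the pointer exactly once into a local, then test and use only
 * the local copy. A concurrent writer clearing chan->call after the
 * load cannot turn the local into NULL.
 */
static int channel_socket_id(struct channel *chan)
{
	struct call *call = atomic_load_explicit(&chan->call,
						 memory_order_relaxed);

	if (!call)
		return -1;
	return call->socket_id;
}
```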

[Severity: High]
This isn't a bug introduced by this patch, but does taking a connection
reference here in rxrpc_post_challenge() lead to a leak if the packet is
later discarded due to connection abort or teardown?

If rxrpc_process_event() in net/rxrpc/conn_event.c encounters 
conn->state == RXRPC_CONN_ABORTED, it returns -ECONNABORTED early:

    if (conn->state == RXRPC_CONN_ABORTED)
        return -ECONNABORTED;

This bypasses the cleanup, and the skb is consumed by
rxrpc_do_process_connection() without dropping the sp->chall.conn reference.

Similarly, during teardown, rxrpc_purge_queue() in net/rxrpc/skbuff.c
consumes skbs via:

    while ((skb = skb_dequeue((list))) != NULL) {
        ...
        consume_skb(skb);
    }

Since it doesn't inspect the cb scratch space, doesn't this permanently leak
the connection references stored in sp->chall.conn?

[ ... ]

>  	if (call)
> -		rxrpc_notify_socket_oob(call, skb);
> +		queued = rxrpc_notify_socket_oob(call, skb);

[Severity: High]
This is a pre-existing issue, but does failing to set skb->mark to
RXRPC_OOB_CHALLENGE here in rxrpc_post_challenge() break OOB challenge
processing?

The skb->mark remains RXRPC_SKB_MARK_PACKET as set during ingress.
When the application attempts to respond via rxrpc_respond_to_oob() or read
via rxrpc_recvmsg_oob(), won't the operation fail with -EPROTO because it
expects skb->mark == RXRPC_OOB_CHALLENGE?

Additionally, when the OOB message is freed via rxrpc_kernel_free_oob(),
doesn't the switch statement on oob->mark fail to match, bypassing
rxrpc_put_connection() and leaking the connection reference stored in
sp->chall.conn?

>  	rcu_read_unlock();
> +	if (call && !queued) {
> +		rxrpc_put_connection(conn, rxrpc_conn_put_challenge_input);
> +		sp->chall.conn = NULL;
> +		return false;
> +	}
>  
>  	if (!call)
>  		rxrpc_post_packet_to_conn(conn, skb);
> diff --git a/net/rxrpc/oob.c b/net/rxrpc/oob.c
> index 3318c8bd82ad5..c80ee2487d094 100644
> --- a/net/rxrpc/oob.c
> +++ b/net/rxrpc/oob.c

[ ... ]

> @@ -49,6 +50,7 @@ void rxrpc_notify_socket_oob(struct rxrpc_call *call, struct sk_buff *skb)
>  			skb->skb_mstamp_ns = rx->oob_id_counter++;
>  			rxrpc_get_skb(skb, rxrpc_skb_get_post_oob);
>  			skb_queue_tail(&rx->recvmsg_oobq, skb);
> +			queued = true;

[Severity: High]
This isn't a bug introduced by this patch, but does unbounded queuing of
challenge packets onto recvmsg_oobq risk a kernel OOM?

In net/rxrpc/oob.c, rxrpc_notify_socket_oob() uses skb_queue_tail() without
checking memory limits (like sk_rmem_alloc vs sk_rcvbuf) or restricting the
total number of challenges.

Since rxrpc_encap_rcv() intercepts these packets before standard UDP memory
accounting occurs, could a remote attacker flood the host with valid
CHALLENGE packets?

If the application is slow to process them or ignores them, will these
accumulate infinitely in recvmsg_oobq and cause a Denial of Service?

^ permalink raw reply

* Re: [PATCH net] nfc: st-nci: use unaligned accessors for frame length
From: David Laight @ 2026-06-20  9:29 UTC (permalink / raw)
  To: Runyu Xiao
  Cc: Krzysztof Kozlowski, netdev, Samuel Ortiz, Christophe Ricard,
	linux-kernel, Jianhao Xu, stable
In-Reply-To: <20260620090536.1701282-1-runyu.xiao@seu.edu.cn>

On Sat, 20 Jun 2026 17:05:36 +0800
Runyu Xiao <runyu.xiao@seu.edu.cn> wrote:

> The ST NCI I2C and SPI transports parse a frame length from bytes
> received from the controller. Both paths first read the frame header into
> a local u8 buffer and then cast buf + 2 to __be16 * before converting it
> from big endian.

Then align the local buffer.

	David

> 
> These are transport byte buffers, not __be16 objects. Use
> get_unaligned_be16() for the NCI frame length field in both the I2C and
> SPI transports.
> 
> This issue was detected by our static analysis tool and confirmed by
> manual audit. A focused UBSAN alignment validation kept the original
> access shape, be16_to_cpu(*(__be16 *)(buf + 2)), and ran it on an NCI
> frame byte buffer with buf + 2 at an odd address. UBSAN reported a
> misaligned-access load of type '__be16', and the trace contained
> st_nci_i2c_read().
> 
> The driver has the same source-level issue: the transport helpers fill
> u8 buffers, and the length checks only prove that the bytes are present.
> They do not establish a __be16 object at buf + 2 or a 2-byte alignment
> guarantee before the typed load.
> 
> Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci")
> Fixes: 2bc4d4f8c8f3 ("nfc: st-nci: Add spi phy support for st21nfcb")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
>  drivers/nfc/st-nci/i2c.c | 3 ++-
>  drivers/nfc/st-nci/spi.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nfc/st-nci/i2c.c b/drivers/nfc/st-nci/i2c.c
> index 9ae839a6f5cc..29fdb4ae56e0 100644
> --- a/drivers/nfc/st-nci/i2c.c
> +++ b/drivers/nfc/st-nci/i2c.c
> @@ -14,6 +14,7 @@
>  #include <linux/delay.h>
>  #include <linux/nfc.h>
>  #include <linux/of.h>
> +#include <linux/unaligned.h>
>  
>  #include "st-nci.h"
>  
> @@ -120,7 +121,7 @@ static int st_nci_i2c_read(struct st_nci_i2c_phy *phy,
>  	if (r != ST_NCI_I2C_MIN_SIZE)
>  		return -EREMOTEIO;
>  
> -	len = be16_to_cpu(*(__be16 *) (buf + 2));
> +	len = get_unaligned_be16(buf + 2);
>  	if (len > ST_NCI_I2C_MAX_SIZE) {
>  		nfc_err(&client->dev, "invalid frame len\n");
>  		return -EBADMSG;
> diff --git a/drivers/nfc/st-nci/spi.c b/drivers/nfc/st-nci/spi.c
> index 169eacc0a32a..1326c20e43fc 100644
> --- a/drivers/nfc/st-nci/spi.c
> +++ b/drivers/nfc/st-nci/spi.c
> @@ -14,6 +14,7 @@
>  #include <linux/delay.h>
>  #include <linux/nfc.h>
>  #include <linux/of.h>
> +#include <linux/unaligned.h>
>  #include <net/nfc/nci.h>
>  
>  #include "st-nci.h"
> @@ -130,7 +131,7 @@ static int st_nci_spi_read(struct st_nci_spi_phy *phy,
>  	if (r < 0)
>  		return -EREMOTEIO;
>  
> -	len = be16_to_cpu(*(__be16 *) (buf + 2));
> +	len = get_unaligned_be16(buf + 2);
>  	if (len > ST_NCI_SPI_MAX_SIZE) {
>  		nfc_err(&dev->dev, "invalid frame len\n");
>  		phy->ndlc->hard_fault = 1;


^ permalink raw reply

* "ip help" output is an error
From: Dmitri Seletski @ 2026-06-20  9:36 UTC (permalink / raw)
  To: netdev


Hello iproute2 maintainers,

I am reporting an inconsistency regarding the exit status of the ip help 
command.

Current Behavior:
When running ip help, the command prints the help documentation to 
stdout, but exits with a non-zero status (error). This causes issues in 
shell scripts that rely on exit codes for control flow.

Steps to reproduce:
bash

# This returns "FAIL" because the exit code is non-zero
if ip help > /dev/null; then
     echo "SUCCESS"
else
     echo "FAIL"
fi

Expected Behavior:
Since the command successfully performs the requested task (displaying 
help information) and does not encounter a system error, it should 
return an exit code of 0.

Context:
This behavior breaks standard Bash logic for automation. For example:
ip help && echo "This will not execute"

"ip help |grep br" - this will bring no result.

Current version tested: iproute2-6.19.0

Thank you for your time and for maintaining this tool.

Regards,
Dmitri Seletski


^ permalink raw reply

* [PATCH net] net: marvell: prestera: use unaligned accessors for DSA tag
From: Runyu Xiao @ 2026-06-20  9:37 UTC (permalink / raw)
  To: Taras Chornyi, netdev
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Oleksandr Mazur, Andrii Savka, Vadym Kochan,
	Volodymyr Mytnyk, linux-kernel, Runyu Xiao, Jianhao Xu, stable

Prestera parses and builds its 16-byte DSA tag from an skb byte buffer.
The current code casts the tag pointer to __be32 * and then reads or
writes the four tag words through that typed pointer.

The tag pointer is derived from skb data, but that only identifies the
protocol tag location inside the packet buffer. It does not make the tag
a naturally aligned __be32 array. Use the unaligned big-endian helpers
for both parsing and building the tag.

This issue was detected by our static analysis tool and confirmed by
manual audit. The same access pattern was validated with UBSAN alignment
instrumentation by keeping the original cast from a u8 DSA tag buffer to
__be32 * and reading dsa_words[i] from a deliberately misaligned tag
buffer. UBSAN reported misaligned-access loads of type '__be32' in
prestera_dsa_parse().

The driver has the same source-level issue: the RX path parses bytes at
skb->data - ETH_TLEN, and the TX path writes the tag at skb->data +
2 * ETH_ALEN. Those offsets identify the DSA tag bytes, but they do not
establish a __be32 object or a 4-byte alignment guarantee for typed loads
and stores.

Fixes: 501ef3066c89 ("net: marvell: prestera: Add driver for Prestera family ASIC devices")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 .../ethernet/marvell/prestera/prestera_dsa.c  | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/prestera/prestera_dsa.c b/drivers/net/ethernet/marvell/prestera/prestera_dsa.c
index b7e89c0ca5c0..276f98cbd50e 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_dsa.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_dsa.c
@@ -4,6 +4,7 @@
 #include <linux/bitfield.h>
 #include <linux/bitops.h>
 #include <linux/errno.h>
+#include <linux/unaligned.h>
 #include <linux/string.h>
 
 #include "prestera_dsa.h"
@@ -33,15 +34,14 @@
 
 int prestera_dsa_parse(struct prestera_dsa *dsa, const u8 *dsa_buf)
 {
-	__be32 *dsa_words = (__be32 *)dsa_buf;
 	enum prestera_dsa_cmd cmd;
 	u32 words[4];
 	u32 field;
 
-	words[0] = ntohl(dsa_words[0]);
-	words[1] = ntohl(dsa_words[1]);
-	words[2] = ntohl(dsa_words[2]);
-	words[3] = ntohl(dsa_words[3]);
+	words[0] = get_unaligned_be32(dsa_buf);
+	words[1] = get_unaligned_be32(dsa_buf + 4);
+	words[2] = get_unaligned_be32(dsa_buf + 8);
+	words[3] = get_unaligned_be32(dsa_buf + 12);
 
 	/* set the common parameters */
 	cmd = (enum prestera_dsa_cmd)FIELD_GET(PRESTERA_DSA_W0_CMD, words[0]);
@@ -82,7 +82,6 @@ int prestera_dsa_parse(struct prestera_dsa *dsa, const u8 *dsa_buf)
 
 int prestera_dsa_build(const struct prestera_dsa *dsa, u8 *dsa_buf)
 {
-	__be32 *dsa_words = (__be32 *)dsa_buf;
 	u32 dev_num = dsa->hw_dev_num;
 	u32 words[4] = { 0 };
 
@@ -98,10 +97,10 @@ int prestera_dsa_build(const struct prestera_dsa *dsa, u8 *dsa_buf)
 	words[1] |= FIELD_PREP(PRESTERA_DSA_W1_EXT_BIT, 1);
 	words[2] |= FIELD_PREP(PRESTERA_DSA_W2_EXT_BIT, 1);
 
-	dsa_words[0] = htonl(words[0]);
-	dsa_words[1] = htonl(words[1]);
-	dsa_words[2] = htonl(words[2]);
-	dsa_words[3] = htonl(words[3]);
+	put_unaligned_be32(words[0], dsa_buf);
+	put_unaligned_be32(words[1], dsa_buf + 4);
+	put_unaligned_be32(words[2], dsa_buf + 8);
+	put_unaligned_be32(words[3], dsa_buf + 12);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
From: Jan Klos @ 2026-06-20  9:43 UTC (permalink / raw)
  To: Markus Stockhausen
  Cc: Heiner Kallweit, Andrew Lunn, Russell King, netdev,
	Maxime Chevallier, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Daniel Golle, Vladimir Oltean,
	Aleksander Jan Bajkowski, Jan Hoffmann, Issam Hamdi, Chukun Pan,
	Russell King (Oracle), ChunHao Lin, linux-kernel
In-Reply-To: <008101dd0080$5feb6860$1fc23920$@gmx.de>

On Sat, 20 Jun 2026 at 08:45, Markus Stockhausen
<markus.stockhausen@gmx.de> wrote:
>
> > From: Jan Klos <honza.klos@gmail.com>
> > Sent: Saturday, 20 June 2026 03:20
> > Subject: [PATCH net v2] net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
> >
> > On RTL8127A connected to a link partner that advertises 10000baseT
> > speed cannot be changed to anything other than 10000baseT as 10GbE
> > is always advertised regardless of any setting. Fix this by
> > clearing MDIO_AN_10GBT_CTRL_ADV10G bit in rtl822x_config_aneg()'s
> > call to phy_modify_mmd_changed().
>
> As you are enhancing the mask, shouldn't this be "... by respecting ..."?
>
> Markus
>

I don't think so, in (__)phy_modify_mmd_changed() the mask is really used to
clear MMD register bits from old register value before setting new bits in set:
* @mask: bit mask of bits to clear
* @set: new value of bits set in mask to write to @regnum
*
* Unlocked helper function which allows a MMD register to be modified as
* new register value = (old register value & ~mask) | set
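
As a small illustration of that formula, here is a userspace sketch.
modify_bits() is a hypothetical helper that only mirrors the arithmetic
quoted above, not the kernel function, and the bit values are example
constants picked for the demonstration.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the documented arithmetic:
 * new register value = (old register value & ~mask) | set
 */
static uint16_t modify_bits(uint16_t old, uint16_t mask, uint16_t set)
{
	return (uint16_t)((old & ~mask) | set);
}
```

With an example bit of 0x1000, modify_bits(0x1081, 0x1000, 0) clears that
bit, while modify_bits(0x1081, 0x0080, 0) leaves it set because the bit
is not part of the mask. A bit can therefore only be cleared by this
operation if it is included in the mask.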

^ permalink raw reply

* Re: [PATCH net] net: marvell: prestera: use unaligned accessors for DSA tag
From: David Laight @ 2026-06-20  9:47 UTC (permalink / raw)
  To: Runyu Xiao
  Cc: Taras Chornyi, netdev, Andrew Lunn, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Oleksandr Mazur,
	Andrii Savka, Vadym Kochan, Volodymyr Mytnyk, linux-kernel,
	Jianhao Xu, stable
In-Reply-To: <20260620093739.2164921-1-runyu.xiao@seu.edu.cn>

On Sat, 20 Jun 2026 17:37:39 +0800
Runyu Xiao <runyu.xiao@seu.edu.cn> wrote:

> Prestera parses and builds its 16-byte DSA tag from an skb byte buffer.
> The current code casts the tag pointer to __be32 * and then reads or
> writes the four tag words through that typed pointer.
> 
> The tag pointer is derived from skb data, but that only identifies the
> protocol tag location inside the packet buffer. It does not make the tag
> a naturally aligned __be32 array. Use the unaligned big-endian helpers
> for both parsing and building the tag.
> 
> This issue was detected by our static analysis tool and confirmed by
> manual audit. The same access pattern was validated with UBSAN alignment
> instrumentation by keeping the original cast from a u8 DSA tag buffer to
> __be32 * and reading dsa_words[i] from a deliberately misaligned tag
> buffer. UBSAN reported misaligned-access loads of type '__be32' in
> prestera_dsa_parse().
> 
> The driver has the same source-level issue: the RX path parses bytes at
> skb->data - ETH_TLEN, and the TX path writes the tag at skb->data +
> 2 * ETH_ALEN. Those offsets identify the DSA tag bytes, but they do not
> establish a __be32 object or a 4-byte alignment guarantee for typed loads
> and stores.

Stop sending these 'fixes' unless you can do proper analysis.
skb data is guaranteed to be aligned so that these reads (and ones of
the IP/TCP/UDP headers) are aligned.

	David


> 
> Fixes: 501ef3066c89 ("net: marvell: prestera: Add driver for Prestera family ASIC devices")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
>  .../ethernet/marvell/prestera/prestera_dsa.c  | 19 +++++++++----------
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/prestera/prestera_dsa.c b/drivers/net/ethernet/marvell/prestera/prestera_dsa.c
> index b7e89c0ca5c0..276f98cbd50e 100644
> --- a/drivers/net/ethernet/marvell/prestera/prestera_dsa.c
> +++ b/drivers/net/ethernet/marvell/prestera/prestera_dsa.c
> @@ -4,6 +4,7 @@
>  #include <linux/bitfield.h>
>  #include <linux/bitops.h>
>  #include <linux/errno.h>
> +#include <linux/unaligned.h>
>  #include <linux/string.h>
>  
>  #include "prestera_dsa.h"
> @@ -33,15 +34,14 @@
>  
>  int prestera_dsa_parse(struct prestera_dsa *dsa, const u8 *dsa_buf)
>  {
> -	__be32 *dsa_words = (__be32 *)dsa_buf;
>  	enum prestera_dsa_cmd cmd;
>  	u32 words[4];
>  	u32 field;
>  
> -	words[0] = ntohl(dsa_words[0]);
> -	words[1] = ntohl(dsa_words[1]);
> -	words[2] = ntohl(dsa_words[2]);
> -	words[3] = ntohl(dsa_words[3]);
> +	words[0] = get_unaligned_be32(dsa_buf);
> +	words[1] = get_unaligned_be32(dsa_buf + 4);
> +	words[2] = get_unaligned_be32(dsa_buf + 8);
> +	words[3] = get_unaligned_be32(dsa_buf + 12);
>  
>  	/* set the common parameters */
>  	cmd = (enum prestera_dsa_cmd)FIELD_GET(PRESTERA_DSA_W0_CMD, words[0]);
> @@ -82,7 +82,6 @@ int prestera_dsa_parse(struct prestera_dsa *dsa, const u8 *dsa_buf)
>  
>  int prestera_dsa_build(const struct prestera_dsa *dsa, u8 *dsa_buf)
>  {
> -	__be32 *dsa_words = (__be32 *)dsa_buf;
>  	u32 dev_num = dsa->hw_dev_num;
>  	u32 words[4] = { 0 };
>  
> @@ -98,10 +97,10 @@ int prestera_dsa_build(const struct prestera_dsa *dsa, u8 *dsa_buf)
>  	words[1] |= FIELD_PREP(PRESTERA_DSA_W1_EXT_BIT, 1);
>  	words[2] |= FIELD_PREP(PRESTERA_DSA_W2_EXT_BIT, 1);
>  
> -	dsa_words[0] = htonl(words[0]);
> -	dsa_words[1] = htonl(words[1]);
> -	dsa_words[2] = htonl(words[2]);
> -	dsa_words[3] = htonl(words[3]);
> +	put_unaligned_be32(words[0], dsa_buf);
> +	put_unaligned_be32(words[1], dsa_buf + 4);
> +	put_unaligned_be32(words[2], dsa_buf + 8);
> +	put_unaligned_be32(words[3], dsa_buf + 12);
>  
>  	return 0;
>  }


^ permalink raw reply

* Re: [PATCH net] net: marvell: prestera: use unaligned accessors for DSA tag
From: Runyu Xiao @ 2026-06-20 10:01 UTC (permalink / raw)
  To: David Laight
  Cc: Taras Chornyi, netdev, andrew+netdev, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Oleksandr Mazur,
	Andrii Savka, Vadym Kochan, Volodymyr Mytnyk, linux-kernel,
	Jianhao Xu, stable
In-Reply-To: <20260620104750.5270a11c@pumpkin>

On Sat, 20 Jun 2026 17:47:50 +0800, David Laight wrote:
> Stop sending these 'fixes' unless you can do proper analysis.
> skb data is guaranteed to be aligned so that these reads (and ones of
> the IP/TCP/UDP headers) are aligned.

You are right. I treated the DSA tag buffer as a generic byte buffer and
did not account for the skb data alignment guarantees in this path.

Please drop this patch. I will re-check the remaining reports against
the relevant subsystem alignment contracts before sending anything else.

pw-bot: changes-requested

Thanks,
Runyu

^ permalink raw reply

* [PATCH net-next v2 0/4] net: pse-pd: decouple controller lookup from MDIO probe
From: Carlo Szelinsky @ 2026-06-20 11:24 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260423-pse-notifier-decouple-v1-0-86ed750a9d62@leavitt.info>

This is v2 of Corey's RFC [1]. Corey is busy at the moment, so I'm picking
it up to unblock everyone. The design is unchanged. The main thing v2
fixes is the SFP deadlock Jonas reported, plus a couple of smaller points
from the review.

The problem:

When a PSE controller driver is built as a module and a DT PHY node has a
"pses = <&...>" phandle, fwnode_mdiobus_register_phy() tries to resolve the
PSE handle before the controller has probed. It gets -EPROBE_DEFER, the
MDIO/DSA probe fails, and driver-core keeps retrying until the PSE module
loads. Since fa2f0454174c each retry does a full phy_device_register() /
phy_device_remove() cycle, so on a board with a tight watchdog the retry
loop can reset the box before userspace is up.

Rather than make the retry cheaper, this takes the PSE lookup out of the
MDIO probe path completely. pse_core gets a notifier chain (REGISTERED /
UNREGISTERED), the phy layer subscribes, owns phydev->psec, and attaches the
PSE handle when the controller actually shows up instead of during probe.
fwnode_mdio no longer knows about PSE, so no -EPROBE_DEFER crosses that
boundary and the retry loop is gone.

What changed since v1:

 - v1 made phy_device_register() hold rtnl across the whole registration,
   including device_add(). That deadlocks a PHY that drives its own SFP cage:
   device_add() -> phy_probe() -> phy_sfp_probe() -> sfp_bus_add_upstream(),
   and sfp_bus_add_upstream() takes rtnl again. Jonas hit this with
   RTL8214FC. v2 keeps device_add() out of rtnl and only takes rtnl around
   the psec attach, which now runs after device_add(). Doing the attach
   after the phy is on the bus keeps the PSE_REGISTERED race closed: either
   the notifier walk finds the phy and attaches it, or our own attach does,
   and the phydev->psec check makes that idempotent.

 - A broken "pses" binding now gets a phydev_warn() instead of being
   swallowed. -ENOENT (no phandle) and -EPROBE_DEFER stay quiet.

Tested on a Realtek rtl93xx PoE switch with two HS104 PSE controllers on
i2c:

 - clean boot, no probe-retry loop, no watchdog reset
 - 10G SFP+ port: module hotplug works, no deadlock (this is the path that
   hung with v1)
 - ethtool --set-pse enable/disable cuts and restores power to a connected PD
 - full i2c unbind -> rmmod -> modprobe cycle: PSE detaches on unbind (module
   refcount drops to 0 so rmmod works), and re-attaches on reload with power
   restored, no reboot. No lockdep splats.

Tested-by: Carlo Szelinsky <github@szelinsky.de>

One thing I'd like input on: the Fixes: tags. Patch 1 is a standalone
regulator lifetime fix and carries its own Fixes:. The boot-hang itself is
fixed by patches 2-4 together. Should those three carry
Fixes: fa2f0454174c so the fix can be backported, or should the series stay
net-next only? I'm fine either way.

[1] https://lore.kernel.org/netdev/20260423-pse-notifier-decouple-v1-0-86ed750a9d62@leavitt.info/

Corey Leavitt (4):
  net: pse-pd: scope pse_control regulator handle to kref lifetime
  net: pse-pd: add notifier chain for controller lifecycle events
  net: pse-pd: fire lifecycle events on controller register/unregister
  net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio
    hook

 drivers/net/mdio/fwnode_mdio.c |  34 -------
 drivers/net/phy/phy_device.c   | 168 +++++++++++++++++++++++++++++++--
 drivers/net/phy/sfp.c          |   2 +-
 drivers/net/pse-pd/pse_core.c  |  60 +++++++++++-
 include/linux/phy.h            |   2 +
 include/linux/pse-pd/pse.h     |  41 ++++++++
 6 files changed, 261 insertions(+), 46 deletions(-)


base-commit: b85966adbf5de0668a815c6e3527f87e0c387fb4
-- 
2.43.0


^ permalink raw reply

* [PATCH net-next v2 1/4] net: pse-pd: scope pse_control regulator handle to kref lifetime
From: Carlo Szelinsky @ 2026-06-20 11:24 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260620112440.1734404-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

__pse_control_release() drops psec->ps via devm_regulator_put(), which
only succeeds if the devres entry added by the matching
devm_regulator_get_exclusive() is still present on pcdev->dev at the
time the pse_control's kref hits zero.

In practice that assumption does not hold when the controller is
unbound while any pse_control still has consumers: pcdev->dev's
devres list is released LIFO, so every per-attach regulator-GET
devres runs (and regulator_put()s the underlying regulator) before
pse_controller_unregister() itself is invoked. Any later
pse_control_put() from that unbind path then reads psec->ps as a
dangling pointer inside devm_regulator_put() and WARNs at
drivers/regulator/devres.c:232 (devres_release() fails to find the
already-released match).

The pse_control's consumer handle is logically scoped to the
pse_control's refcount, not to pcdev->dev's devres lifetime. Switch
to the plain regulator_get_exclusive() / regulator_put() pair so
__pse_control_release() does the right put regardless of whether
the controller's devres has already been unwound.

No change to the regulator-framework-visible refcount or lifetime of
the underlying regulator: a single get paired with a single put. The
existing devm_regulator_register() for the per-PI rails is unchanged
(those ARE correctly scoped to the controller's lifetime).
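
The lifetime mismatch can be reproduced in a small userspace model. This
is only a sketch: every toy_* name below is hypothetical and none of it
is the kernel's devres or regulator code. It assumes a devres list that
is unwound newest-first at unbind, and a consumer whose final put may
arrive after that unwind.

```c
#include <stddef.h>

#define TOY_MAX_ENTRIES 8

/* Toy regulator consumer handle: counts how many times it has been put. */
struct toy_regulator {
	int puts;
};

/* Toy devres list for one device; the newest entry is stored last. */
struct toy_devres {
	struct toy_regulator *entry[TOY_MAX_ENTRIES];
	size_t count;
};

/* Models a devm-managed get: the handle is recorded on the devres list. */
static int toy_devm_get(struct toy_devres *dr, struct toy_regulator *reg)
{
	if (dr->count == TOY_MAX_ENTRIES)
		return -1;
	dr->entry[dr->count++] = reg;
	return 0;
}

/* Models the unwind at unbind: entries are released newest-first (LIFO). */
static void toy_devres_release_all(struct toy_devres *dr)
{
	while (dr->count > 0)
		dr->entry[--dr->count]->puts++;
}

/*
 * Models a devm-managed put: returns 0 if the entry was found and released,
 * -1 if it is already gone (the situation in which the kernel WARNs).
 */
static int toy_devm_put(struct toy_devres *dr, struct toy_regulator *reg)
{
	for (size_t i = 0; i < dr->count; i++) {
		if (dr->entry[i] != reg)
			continue;
		for (size_t j = i + 1; j < dr->count; j++)
			dr->entry[j - 1] = dr->entry[j];
		dr->count--;
		reg->puts++;
		return 0;
	}
	return -1;
}

/* Old scheme: unbind unwinds devres first, the final put arrives later. */
int toy_old_scheme_late_put(void)
{
	struct toy_devres dr = { .count = 0 };
	struct toy_regulator reg = { .puts = 0 };

	if (toy_devm_get(&dr, &reg))
		return 1;
	toy_devres_release_all(&dr);	/* controller unbind */
	return toy_devm_put(&dr, &reg);	/* refcount reaches zero afterwards */
}

/* New scheme: the handle never sits on the devres list; one get, one put. */
int toy_new_scheme_puts(void)
{
	struct toy_devres dr = { .count = 0 };
	struct toy_regulator reg = { .puts = 0 };

	toy_devres_release_all(&dr);	/* unbind has nothing to unwind for reg */
	reg.puts++;			/* plain put when the refcount reaches zero */
	return reg.puts;
}
```

In this model toy_old_scheme_late_put() fails to find its entry, while
toy_new_scheme_puts() performs exactly one put regardless of when the
unwind happened.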

Fixes: d83e13761d5b ("net: pse-pd: Use regulator framework within PSE framework")
Signed-off-by: Corey Leavitt <corey@leavitt.info>
Acked-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/pse-pd/pse_core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 69dbdbde9d71..a5e6d7b26b9f 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -1367,7 +1367,7 @@ static void __pse_control_release(struct kref *kref)
 
 	if (psec->pcdev->pi[psec->id].admin_state_enabled)
 		regulator_disable(psec->ps);
-	devm_regulator_put(psec->ps);
+	regulator_put(psec->ps);
 
 	module_put(psec->pcdev->owner);
 
@@ -1436,8 +1436,8 @@ pse_control_get_internal(struct pse_controller_dev *pcdev, unsigned int index,
 		goto free_psec;
 
 	pcdev->pi[index].admin_state_enabled = ret;
-	psec->ps = devm_regulator_get_exclusive(pcdev->dev,
-						rdev_get_name(pcdev->pi[index].rdev));
+	psec->ps = regulator_get_exclusive(pcdev->dev,
+					   rdev_get_name(pcdev->pi[index].rdev));
 	if (IS_ERR(psec->ps)) {
 		ret = PTR_ERR(psec->ps);
 		goto put_module;
-- 
2.43.0



* [PATCH net-next v2 2/4] net: pse-pd: add notifier chain for controller lifecycle events
From: Carlo Szelinsky @ 2026-06-20 11:24 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260620112440.1734404-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

Introduce a blocking notifier chain that allows other subsystems to be
informed when a PSE controller is registered or unregistered, and
provide pse_register_notifier() / pse_unregister_notifier() as the
subscriber interface.

Subsequent patches will use this to let the phy subsystem own the
phydev->psec lifecycle directly, decoupling PSE lookup from
fwnode_mdiobus_register_phy() and removing the probe-time
-EPROBE_DEFER coupling that currently exists between mdio, phy and
pse-pd when the PSE controller driver is modular.

A blocking chain (rather than atomic) is used because callbacks will
take rtnl_lock and call back into pse_core via of_pse_control_get().

The enum pse_controller_event is placed outside the
IS_ENABLED(CONFIG_PSE_CONTROLLER) guard so that subscribers compiled
into a kernel without PSE support can still reference the event
values in dead-code paths without breaking the build.

This patch is pure infrastructure: nothing fires events yet, and
nothing subscribes. No observable behavior change.
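
For readers unfamiliar with notifier chains, the subscriber contract can
be sketched in userspace roughly as follows. This is an illustration
under simplifying assumptions, not the kernel implementation: the toy_*
names are hypothetical, there is no locking, and new subscribers are
simply prepended instead of being ordered by priority.

```c
#include <stddef.h>

enum toy_pse_event {
	TOY_PSE_REGISTERED,
	TOY_PSE_UNREGISTERED,
};

struct toy_notifier_block {
	int (*notifier_call)(struct toy_notifier_block *nb,
			     unsigned long event, void *data);
	struct toy_notifier_block *next;
};

static struct toy_notifier_block *toy_chain;

int toy_register_notifier(struct toy_notifier_block *nb)
{
	nb->next = toy_chain;
	toy_chain = nb;
	return 0;
}

/* Returns 0 on success, -1 if @nb was not on the chain. */
int toy_unregister_notifier(struct toy_notifier_block *nb)
{
	struct toy_notifier_block **pp;

	for (pp = &toy_chain; *pp; pp = &(*pp)->next) {
		if (*pp == nb) {
			*pp = nb->next;
			return 0;
		}
	}
	return -1;
}

/* Synchronously invoke every subscriber with the event and its data. */
void toy_call_chain(unsigned long event, void *data)
{
	struct toy_notifier_block *nb;

	for (nb = toy_chain; nb; nb = nb->next)
		nb->notifier_call(nb, event, data);
}

struct toy_subscriber {
	struct toy_notifier_block nb;	/* first member, so the cast below is valid */
	int registered_seen;
	int unregistered_seen;
};

static int toy_subscriber_event(struct toy_notifier_block *nb,
				unsigned long event, void *data)
{
	struct toy_subscriber *sub = (struct toy_subscriber *)nb;

	(void)data;	/* would point at the controller firing the event */
	if (event == TOY_PSE_REGISTERED)
		sub->registered_seen++;
	else if (event == TOY_PSE_UNREGISTERED)
		sub->unregistered_seen++;
	return 0;
}

/* Returns registered_seen * 10 + unregistered_seen for one subscriber. */
int toy_notifier_demo(void)
{
	struct toy_subscriber sub = {
		.nb = { .notifier_call = toy_subscriber_event },
	};
	int controller = 0;

	toy_register_notifier(&sub.nb);
	toy_call_chain(TOY_PSE_REGISTERED, &controller);
	toy_call_chain(TOY_PSE_UNREGISTERED, &controller);
	toy_unregister_notifier(&sub.nb);
	toy_call_chain(TOY_PSE_REGISTERED, &controller);	/* not delivered */
	return sub.registered_seen * 10 + sub.unregistered_seen;
}
```

The subscriber sees each event once while registered and nothing after
it unregisters, which is the behavior the two new helpers are meant to
expose.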

Signed-off-by: Corey Leavitt <corey@leavitt.info>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/pse-pd/pse_core.c | 34 ++++++++++++++++++++++++++++++++++
 include/linux/pse-pd/pse.h    | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index a5e6d7b26b9f..84c734ed4553 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -8,6 +8,7 @@
 #include <linux/device.h>
 #include <linux/ethtool.h>
 #include <linux/ethtool_netlink.h>
+#include <linux/notifier.h>
 #include <linux/of.h>
 #include <linux/phy.h>
 #include <linux/pse-pd/pse.h>
@@ -23,6 +24,39 @@ static LIST_HEAD(pse_controller_list);
 static DEFINE_XARRAY_ALLOC(pse_pw_d_map);
 static DEFINE_MUTEX(pse_pw_d_mutex);
 
+static BLOCKING_NOTIFIER_HEAD(pse_controller_notifier);
+
+/**
+ * pse_register_notifier - register a callback for PSE controller events
+ * @nb: notifier block to register
+ *
+ * See enum pse_controller_event for events fired and their subscriber
+ * contract. Callbacks run in process context; they may sleep, take
+ * rtnl, and call of_pse_control_get(). The chain fires synchronously,
+ * so a PSE controller driver's probe/unbind path must not hold any
+ * such lock when calling pse_controller_register() or
+ * pse_controller_unregister().
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int pse_register_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&pse_controller_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(pse_register_notifier);
+
+/**
+ * pse_unregister_notifier - unregister a previously registered callback
+ * @nb: notifier block previously passed to pse_register_notifier()
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int pse_unregister_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&pse_controller_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(pse_unregister_notifier);
+
 /**
  * struct pse_control - a PSE control
  * @pcdev: a pointer to the PSE controller device
diff --git a/include/linux/pse-pd/pse.h b/include/linux/pse-pd/pse.h
index 4e5696cfade7..78fe3a2b1ea8 100644
--- a/include/linux/pse-pd/pse.h
+++ b/include/linux/pse-pd/pse.h
@@ -21,6 +21,7 @@ struct net_device;
 struct phy_device;
 struct pse_controller_dev;
 struct netlink_ext_ack;
+struct notifier_block;
 
 /* C33 PSE extended state and substate. */
 struct ethtool_c33_pse_ext_state_info {
@@ -337,6 +338,24 @@ enum pse_budget_eval_strategies {
 	PSE_BUDGET_EVAL_STRAT_DYNAMIC	= 1 << 2,
 };
 
+/**
+ * enum pse_controller_event - PSE controller lifecycle events
+ *
+ * Event data in callbacks is always a pointer to the struct
+ * pse_controller_dev firing the event.
+ *
+ * @PSE_REGISTERED: controller added to pse_controller_list and
+ *	resolvable by of_pse_control_get().
+ * @PSE_UNREGISTERED: controller about to be removed from
+ *	pse_controller_list. Subscribers holding pse_control references
+ *	targeting it must drop them before returning and must not
+ *	acquire new references for it.
+ */
+enum pse_controller_event {
+	PSE_REGISTERED,
+	PSE_UNREGISTERED,
+};
+
 #if IS_ENABLED(CONFIG_PSE_CONTROLLER)
 int pse_controller_register(struct pse_controller_dev *pcdev);
 void pse_controller_unregister(struct pse_controller_dev *pcdev);
@@ -366,6 +385,9 @@ int pse_ethtool_set_prio(struct pse_control *psec,
 bool pse_has_podl(struct pse_control *psec);
 bool pse_has_c33(struct pse_control *psec);
 
+int pse_register_notifier(struct notifier_block *nb);
+int pse_unregister_notifier(struct notifier_block *nb);
+
 #else
 
 static inline struct pse_control *of_pse_control_get(struct device_node *node,
@@ -416,6 +438,16 @@ static inline bool pse_has_c33(struct pse_control *psec)
 	return false;
 }
 
+static inline int pse_register_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int pse_unregister_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+
 #endif
 
 #endif
-- 
2.43.0



* [PATCH net-next v2 3/4] net: pse-pd: fire lifecycle events on controller register/unregister
From: Carlo Szelinsky @ 2026-06-20 11:24 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260620112440.1734404-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

Hook the newly-introduced pse_controller_notifier chain so that
pse_controller_register() fires PSE_REGISTERED after the controller
has been added to pse_controller_list (i.e. is now resolvable by
of_pse_control_get()), and pse_controller_unregister() fires
PSE_UNREGISTERED before the controller is removed from the list
(while it is still valid to dereference from a subscriber's
pse_control pointer targeting it).

With no subscribers yet, this is observably a no-op. A later change
wires the phy subsystem in as the first subscriber.
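
The ordering guarantee can be pictured with a tiny userspace model. It
is a sketch only: the toy_* names are hypothetical and the "list" is
reduced to a single flag. The point is that a subscriber finds the
controller still listed at both events.

```c
#include <stdbool.h>

struct toy_controller {
	bool listed;	/* true while the controller is on the lookup list */
};

/* What the subscriber observed at each event; -1 means "not fired". */
static int toy_seen_at_registered = -1;
static int toy_seen_at_unregistered = -1;

/* Stand-in for the notifier call: record whether @c is still listed. */
static void toy_fire(bool registered, const struct toy_controller *c)
{
	if (registered)
		toy_seen_at_registered = c->listed;
	else
		toy_seen_at_unregistered = c->listed;
}

void toy_controller_register(struct toy_controller *c)
{
	c->listed = true;	/* add to the list first ... */
	toy_fire(true, c);	/* ... then announce it */
}

void toy_controller_unregister(struct toy_controller *c)
{
	toy_fire(false, c);	/* announce first ... */
	c->listed = false;	/* ... then tear down */
}

/* Returns 1 if the controller was listed at both events. */
int toy_ordering_demo(void)
{
	struct toy_controller c = { .listed = false };

	toy_controller_register(&c);
	toy_controller_unregister(&c);
	return toy_seen_at_registered == 1 &&
	       toy_seen_at_unregistered == 1 && !c.listed;
}
```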

Signed-off-by: Corey Leavitt <corey@leavitt.info>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/pse-pd/pse_core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 84c734ed4553..37ba4ab778af 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -1138,6 +1138,9 @@ int pse_controller_register(struct pse_controller_dev *pcdev)
 	list_add(&pcdev->list, &pse_controller_list);
 	mutex_unlock(&pse_list_mutex);
 
+	blocking_notifier_call_chain(&pse_controller_notifier,
+				     PSE_REGISTERED, pcdev);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(pse_controller_register);
@@ -1148,6 +1151,9 @@ EXPORT_SYMBOL_GPL(pse_controller_register);
  */
 void pse_controller_unregister(struct pse_controller_dev *pcdev)
 {
+	blocking_notifier_call_chain(&pse_controller_notifier,
+				     PSE_UNREGISTERED, pcdev);
+
 	pse_flush_pw_ds(pcdev);
 	pse_release_pis(pcdev);
 	if (pcdev->irq)
-- 
2.43.0



* [PATCH net-next v2 4/4] net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio hook
From: Carlo Szelinsky @ 2026-06-20 11:24 UTC (permalink / raw)
  To: Oleksij Rempel, Kory Maincent, Andrew Lunn, Heiner Kallweit,
	Russell King, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Corey Leavitt, Jonas Jelonek, netdev, linux-kernel,
	Carlo Szelinsky
In-Reply-To: <20260620112440.1734404-1-github@szelinsky.de>

From: Corey Leavitt <corey@leavitt.info>

Transfer ownership of phydev->psec from fwnode_mdio to the phy
subsystem itself. The phy subsystem now subscribes to the pse-pd
notifier chain and manages psec attach/detach in response to PSE
controller lifecycle events, while fwnode_mdio loses its PSE awareness
entirely.

phydev->psec is attached after device_add() has made the phy visible
on mdio_bus_type, under a narrow rtnl_lock() that covers only
phy_try_attach_pse(). Ordering the attach after registration closes
the race that would otherwise leave a phy unattached: a PSE_REGISTERED
event firing during registration walks mdio_bus_type and either finds
the phy already added (and attaches it) or runs before device_add(),
in which case the post-add attach resolves it. The phydev->psec check
in phy_try_attach_pse() makes the two paths idempotent. Holding rtnl
across of_pse_control_get() is safe because pse_list_mutex is never
taken in the opposite order.
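
The idempotency argument can be checked with a small userspace model.
This is a sketch with hypothetical toy_* names, not the code in this
patch; locking is left out and the two orderings are simply run back to
back. In either order the phy ends up attached exactly once.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_phy {
	bool on_bus;		/* set by the device_add() stand-in */
	const char *psec;	/* NULL until a control is attached */
	int attach_calls;	/* times a control was actually acquired */
};

static bool toy_controller_ready;

/* No-op if already attached or if the controller is not registered yet. */
static void toy_try_attach(struct toy_phy *phy)
{
	if (phy->psec)
		return;
	if (!toy_controller_ready)
		return;		/* the deferred case: stay silent, retry later */
	phy->psec = "psec";
	phy->attach_calls++;
}

/* Stand-in for the PSE_REGISTERED walk: only sees phys already on the bus. */
static void toy_pse_registered(struct toy_phy *phy)
{
	toy_controller_ready = true;
	if (phy->on_bus)
		toy_try_attach(phy);
}

/* Stand-in for registration: make the phy visible first, then attach. */
static void toy_phy_register(struct toy_phy *phy)
{
	phy->on_bus = true;
	toy_try_attach(phy);
}

/* Returns how many times a control was acquired, or -1 if never attached. */
int toy_attach_race(bool event_first)
{
	struct toy_phy phy = { .on_bus = false };

	toy_controller_ready = false;
	if (event_first) {
		toy_pse_registered(&phy);	/* walk misses the phy */
		toy_phy_register(&phy);		/* post-add attach resolves it */
	} else {
		toy_phy_register(&phy);		/* controller not there yet */
		toy_pse_registered(&phy);	/* walk attaches it */
	}
	return phy.psec ? phy.attach_calls : -1;
}
```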

device_add() is deliberately left outside rtnl. Binding a phy that
itself provides an SFP cage reaches sfp_bus_add_upstream() through
phy_probe() -> phy_setup_ports() -> phy_sfp_probe(), and
sfp_bus_add_upstream() takes rtnl_lock(); holding rtnl across
device_add() would deadlock such phys (reported on RTL8214FC).

phy_device_register() is split into the public form, which takes the
narrow rtnl_lock() around the attach, and a phy_device_register_locked()
form for callers that already hold rtnl (the SFP module state machine
via __sfp_sm_event). This pair mirrors the register_netdevice() /
register_netdev() split convention already established in the core
networking stack. The _locked form runs device_add() under the
caller's rtnl, which is safe because a phy resident on an SFP module
does not itself provide a downstream cage, so phy_sfp_probe() is a
no-op there.
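
The reason for the split can be shown with a toy non-recursive lock in
userspace. This is a sketch only: the toy_* names are hypothetical, and
a counter stands in for the hang a real mutex would produce when taken
twice by the same caller.

```c
#include <stdbool.h>

static bool toy_lock_held;	/* toy stand-in for rtnl being held */
static int toy_self_deadlocks;	/* times the lock was taken while held */

static void toy_lock(void)
{
	if (toy_lock_held)
		toy_self_deadlocks++;	/* a real mutex would hang here */
	toy_lock_held = true;
}

static void toy_unlock(void)
{
	toy_lock_held = false;
}

/* Work that must run under the lock; returns -1 if the lock is not held. */
static int toy_attach(void)
{
	return toy_lock_held ? 0 : -1;
}

/* _locked form: the caller already holds the lock. */
int toy_register_locked(void)
{
	return toy_attach();
}

/* Public form: takes the lock only around the attach step. */
int toy_register(void)
{
	int err;

	toy_lock();
	err = toy_attach();
	toy_unlock();
	return err;
}

/*
 * A caller that already holds the lock. Returns the number of
 * self-deadlocks it ran into, or -1 if the attach itself failed.
 */
int toy_caller_holding_lock(bool use_locked_form)
{
	int err;

	toy_self_deadlocks = 0;
	toy_lock();
	err = use_locked_form ? toy_register_locked() : toy_register();
	toy_unlock();
	return err ? -1 : toy_self_deadlocks;
}
```

A caller that already holds the lock gets through the _locked form
cleanly, while the public form would try to take the lock a second time.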

  - On PSE_REGISTERED: an rtnl-guarded bus walk retries the attach for
    every registered phy whose psec is still NULL. This is the "phy
    was enumerated before the PSE controller loaded" case, the root
    cause of the boot-time probe-retry storm on systems with a modular
    PSE controller driver.

  - On PSE_UNREGISTERED: an rtnl-guarded bus walk releases every
    phydev->psec that targets the departing controller before
    pse_release_pis() frees pcdev->pi. Without this, a phy still
    holding a pse_control reference would cause a use-after-free in
    __pse_control_release()'s pcdev->pi[psec->id] access, and the PSE
    driver module could not finish unloading while any phy still held a
    reference.
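
The PSE_UNREGISTERED walk can be sketched in userspace as below. The
toy_* types and functions are hypothetical stand-ins, the bus is an
array, and the reference count is a plain integer; the sketch only shows
that phys targeting the departing controller are detached and all
others are left alone.

```c
#include <stddef.h>

struct toy_pcdev {
	int id;
};

struct toy_psec {
	struct toy_pcdev *pcdev;	/* controller this control targets */
	int refs;
};

struct toy_phydev {
	struct toy_psec *psec;		/* NULL if no control is attached */
};

/* Toy counterpart of a "does this control target that controller" test. */
static int toy_matches(const struct toy_psec *psec,
		       const struct toy_pcdev *pcdev)
{
	return psec->pcdev == pcdev;
}

/* Walk every phy and drop controls targeting @leaving; returns the count. */
int toy_detach_all(struct toy_phydev *phys, size_t n,
		   struct toy_pcdev *leaving)
{
	int detached = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_psec *psec = phys[i].psec;

		if (!psec || !toy_matches(psec, leaving))
			continue;
		phys[i].psec = NULL;	/* clear the pointer, then drop the ref */
		psec->refs--;
		detached++;
	}
	return detached;
}

/* Returns 1 if only the phy on controller "a" was detached. */
int toy_detach_demo(void)
{
	struct toy_pcdev a = { 1 }, b = { 2 };
	struct toy_psec pa = { &a, 1 }, pb = { &b, 1 };
	struct toy_phydev phys[3] = { { &pa }, { &pb }, { NULL } };
	int n = toy_detach_all(phys, 3, &a);

	return n == 1 && phys[0].psec == NULL && phys[1].psec == &pb &&
	       pa.refs == 0 && pb.refs == 1;
}
```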

A bad `pses` binding -- an error from of_pse_control_get() other than
-ENOENT (no phandle) or -EPROBE_DEFER (controller not yet registered)
-- is reported with phydev_warn() rather than silently dropped,
preserving the diagnostic that the removed fwnode_mdio lookup used to
provide.
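
The classification above boils down to a three-way test on the error
code. A userspace sketch, with the hypothetical helper
toy_should_warn(): EPROBE_DEFER is a kernel-internal errno (517) that
userspace <errno.h> does not define, so it is defined locally here.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517	/* kernel-internal value, absent from userspace */
#endif

/*
 * Returns true for errors that indicate a broken binding. No error, a
 * missing phandle (-ENOENT) and a not-yet-registered controller
 * (-EPROBE_DEFER) are all treated as silent cases.
 */
bool toy_should_warn(int err)
{
	return err != 0 && err != -ENOENT && err != -EPROBE_DEFER;
}
```

For example, toy_should_warn(-EINVAL) is true, while -ENOENT and
-EPROBE_DEFER are both silent.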

The final pse_control_put() of phydev->psec moves from
phy_device_remove() to phy_device_release(), so it runs only after
every reference on the device -- including the bus-iterator references
taken by bus_for_each_dev() in the notifier walk -- has been dropped.

Finally, delete fwnode_find_pse_control() and its call site in
fwnode_mdiobus_register_phy(), and drop the PSE header from
fwnode_mdio.c. The MDIO/DSA probe no longer sees any PSE-originated
-EPROBE_DEFER, so the probe-retry storm is gone and fwnode_mdio is
now PSE-agnostic.

Reported-by: Jonas Jelonek <jelonek.jonas@gmail.com>
Closes: https://lore.kernel.org/netdev/e00048dd-1ed3-40c3-9912-59bccf015ad5@gmail.com/
Signed-off-by: Corey Leavitt <corey@leavitt.info>
Co-developed-by: Carlo Szelinsky <github@szelinsky.de>
Signed-off-by: Carlo Szelinsky <github@szelinsky.de>
---
 drivers/net/mdio/fwnode_mdio.c |  34 -------
 drivers/net/phy/phy_device.c   | 168 +++++++++++++++++++++++++++++++--
 drivers/net/phy/sfp.c          |   2 +-
 drivers/net/pse-pd/pse_core.c  |  14 +++
 include/linux/phy.h            |   2 +
 include/linux/pse-pd/pse.h     |   9 ++
 6 files changed, 186 insertions(+), 43 deletions(-)

diff --git a/drivers/net/mdio/fwnode_mdio.c b/drivers/net/mdio/fwnode_mdio.c
index ba7091518265..7bd979b59f49 100644
--- a/drivers/net/mdio/fwnode_mdio.c
+++ b/drivers/net/mdio/fwnode_mdio.c
@@ -11,33 +11,11 @@
 #include <linux/fwnode_mdio.h>
 #include <linux/of.h>
 #include <linux/phy.h>
-#include <linux/pse-pd/pse.h>
 
 MODULE_AUTHOR("Calvin Johnson <calvin.johnson@oss.nxp.com>");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("FWNODE MDIO bus (Ethernet PHY) accessors");
 
-static struct pse_control *
-fwnode_find_pse_control(struct fwnode_handle *fwnode,
-			struct phy_device *phydev)
-{
-	struct pse_control *psec;
-	struct device_node *np;
-
-	if (!IS_ENABLED(CONFIG_PSE_CONTROLLER))
-		return NULL;
-
-	np = to_of_node(fwnode);
-	if (!np)
-		return NULL;
-
-	psec = of_pse_control_get(np, phydev);
-	if (PTR_ERR(psec) == -ENOENT)
-		return NULL;
-
-	return psec;
-}
-
 static struct mii_timestamper *
 fwnode_find_mii_timestamper(struct fwnode_handle *fwnode)
 {
@@ -118,7 +96,6 @@ int fwnode_mdiobus_register_phy(struct mii_bus *bus,
 				struct fwnode_handle *child, u32 addr)
 {
 	struct mii_timestamper *mii_ts = NULL;
-	struct pse_control *psec = NULL;
 	struct phy_device *phy;
 	bool is_c45;
 	u32 phy_id;
@@ -159,14 +136,6 @@ int fwnode_mdiobus_register_phy(struct mii_bus *bus,
 			goto clean_phy;
 	}
 
-	psec = fwnode_find_pse_control(child, phy);
-	if (IS_ERR(psec)) {
-		rc = PTR_ERR(psec);
-		goto unregister_phy;
-	}
-
-	phy->psec = psec;
-
 	/* phy->mii_ts may already be defined by the PHY driver. A
 	 * mii_timestamper probed via the device tree will still have
 	 * precedence.
@@ -176,9 +145,6 @@ int fwnode_mdiobus_register_phy(struct mii_bus *bus,
 
 	return 0;
 
-unregister_phy:
-	if (is_acpi_node(child) || is_of_node(child))
-		phy_device_remove(phy);
 clean_phy:
 	phy_device_free(phy);
 clean_mii_ts:
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 0615228459ef..f5febff4b00b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -223,8 +223,19 @@ static void phy_mdio_device_free(struct mdio_device *mdiodev)
 
 static void phy_device_release(struct device *dev)
 {
+	struct phy_device *phydev = to_phy_device(dev);
+
+	/* bus_for_each_dev() holds get_device() across each iteration
+	 * step, deferring this release callback until any in-flight PSE
+	 * notifier walk has advanced past this phy. pse_control_put()
+	 * takes pse_list_mutex, so this path must run in sleepable
+	 * context.
+	 */
+	might_sleep();
+	pse_control_put(phydev->psec);
+
 	fwnode_handle_put(dev->fwnode);
-	kfree(to_phy_device(dev));
+	kfree(phydev);
 }
 
 static void phy_mdio_device_remove(struct mdio_device *mdiodev)
@@ -1102,11 +1113,103 @@ struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45)
 }
 EXPORT_SYMBOL(get_phy_device);
 
-/**
- * phy_device_register - Register the phy device on the MDIO bus
- * @phydev: phy_device structure to be added to the MDIO bus
+/* Best-effort attach of phydev->psec from a DT `pses = <&...>` phandle.
+ * Caller must hold rtnl. A missing phandle (-ENOENT) or a not-yet-registered
+ * controller (-EPROBE_DEFER) is silent; the notifier retries the latter at
+ * PSE_REGISTERED time. Any other error means a broken binding and is warned
+ * about, but left non-fatal so the phy still registers.
  */
-int phy_device_register(struct phy_device *phydev)
+static void phy_try_attach_pse(struct phy_device *phydev)
+{
+	struct pse_control *psec;
+	struct device_node *np;
+
+	ASSERT_RTNL();
+
+	np = phydev->mdio.dev.of_node;
+	if (!np)
+		return;
+
+	if (phydev->psec)
+		return;
+
+	psec = of_pse_control_get(np, phydev);
+	if (IS_ERR(psec)) {
+		if (PTR_ERR(psec) != -EPROBE_DEFER && PTR_ERR(psec) != -ENOENT)
+			phydev_warn(phydev, "failed to get PSE control: %pe\n",
+				    psec);
+		return;
+	}
+
+	phydev->psec = psec;
+}
+
+static int phy_pse_attach_one(struct device *dev, void *data __maybe_unused)
+{
+	ASSERT_RTNL();
+
+	if (dev->type != &mdio_bus_phy_type)
+		return 0;
+
+	phy_try_attach_pse(to_phy_device(dev));
+	return 0;
+}
+
+static int phy_pse_detach_one(struct device *dev, void *data)
+{
+	struct pse_controller_dev *pcdev = data;
+	struct phy_device *phydev;
+	struct pse_control *psec;
+
+	ASSERT_RTNL();
+
+	if (dev->type != &mdio_bus_phy_type)
+		return 0;
+
+	phydev = to_phy_device(dev);
+	psec = phydev->psec;
+	if (!psec || !pse_control_matches_pcdev(psec, pcdev))
+		return 0;
+
+	phydev->psec = NULL;
+	pse_control_put(psec);
+	return 0;
+}
+
+static int phy_pse_notifier_event(struct notifier_block *nb,
+				  unsigned long event, void *data)
+{
+	switch (event) {
+	case PSE_REGISTERED:
+		rtnl_lock();
+		bus_for_each_dev(&mdio_bus_type, NULL, NULL,
+				 phy_pse_attach_one);
+		rtnl_unlock();
+		return NOTIFY_OK;
+	case PSE_UNREGISTERED:
+		rtnl_lock();
+		bus_for_each_dev(&mdio_bus_type, NULL, data,
+				 phy_pse_detach_one);
+		rtnl_unlock();
+		return NOTIFY_OK;
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block phy_pse_notifier __read_mostly = {
+	.notifier_call = phy_pse_notifier_event,
+};
+
+/* Core registration: add the phy to the MDIO bus. Does not touch rtnl or
+ * PSE. phydev->psec is attached by the callers below, after device_add()
+ * has made the phy visible on mdio_bus_type, so that a concurrent PSE
+ * notifier walk and the attach can never leave the phy unattached. Keeping
+ * device_add() out of rtnl also avoids deadlocking when binding a phy that
+ * itself provides an SFP cage (phy_probe() -> phy_sfp_probe() ->
+ * sfp_bus_add_upstream() takes rtnl).
+ */
+static int __phy_device_register(struct phy_device *phydev)
 {
 	int err;
 
@@ -1135,10 +1238,54 @@ int phy_device_register(struct phy_device *phydev)
  out:
 	/* Assert the reset signal */
 	phy_device_reset(phydev, 1);
-
 	mdiobus_unregister_device(&phydev->mdio);
 	return err;
 }
+
+/**
+ * phy_device_register_locked - Register the phy device on the MDIO bus
+ * @phydev: phy_device structure to be added to the MDIO bus
+ *
+ * Same as phy_device_register() but caller must already hold rtnl_lock().
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int phy_device_register_locked(struct phy_device *phydev)
+{
+	int err;
+
+	ASSERT_RTNL();
+
+	err = __phy_device_register(phydev);
+	if (err)
+		return err;
+
+	phy_try_attach_pse(phydev);
+
+	return 0;
+}
+EXPORT_SYMBOL(phy_device_register_locked);
+
+/**
+ * phy_device_register - Register the phy device on the MDIO bus
+ * @phydev: phy_device structure to be added to the MDIO bus
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int phy_device_register(struct phy_device *phydev)
+{
+	int err;
+
+	err = __phy_device_register(phydev);
+	if (err)
+		return err;
+
+	rtnl_lock();
+	phy_try_attach_pse(phydev);
+	rtnl_unlock();
+
+	return 0;
+}
 EXPORT_SYMBOL(phy_device_register);
 
 /**
@@ -1152,8 +1299,6 @@ EXPORT_SYMBOL(phy_device_register);
 void phy_device_remove(struct phy_device *phydev)
 {
 	unregister_mii_timestamper(phydev->mii_ts);
-	pse_control_put(phydev->psec);
-
 	device_del(&phydev->mdio.dev);
 
 	/* Assert the reset signal */
@@ -3981,8 +4126,14 @@ static int __init phy_init(void)
 	if (rc)
 		goto err_c45;
 
+	rc = pse_register_notifier(&phy_pse_notifier);
+	if (rc)
+		goto err_genphy;
+
 	return 0;
 
+err_genphy:
+	phy_driver_unregister(&genphy_driver);
 err_c45:
 	phy_driver_unregister(&genphy_c45_driver);
 err_ethtool_phy_ops:
@@ -3999,6 +4150,7 @@ static int __init phy_init(void)
 
 static void __exit phy_exit(void)
 {
+	pse_unregister_notifier(&phy_pse_notifier);
 	phy_driver_unregister(&genphy_c45_driver);
 	phy_driver_unregister(&genphy_driver);
 	rtnl_lock();
diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
index 03bfd8640db9..18868bdd6485 100644
--- a/drivers/net/phy/sfp.c
+++ b/drivers/net/phy/sfp.c
@@ -2083,7 +2083,7 @@ static int sfp_sm_probe_phy(struct sfp *sfp, int addr, bool is_c45)
 	/* Mark this PHY as being on a SFP module */
 	phy->is_on_sfp_module = true;
 
-	err = phy_device_register(phy);
+	err = phy_device_register_locked(phy);
 	if (err) {
 		phy_device_free(phy);
 		dev_err(sfp->dev, "phy_device_register failed: %pe\n",
diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 37ba4ab778af..432ca2ee5402 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -2021,3 +2021,17 @@ bool pse_has_c33(struct pse_control *psec)
 	return psec->pcdev->types & ETHTOOL_PSE_C33;
 }
 EXPORT_SYMBOL_GPL(pse_has_c33);
+
+/**
+ * pse_control_matches_pcdev - Test whether a pse_control targets a controller
+ * @psec: pse_control obtained from of_pse_control_get()
+ * @pcdev: PSE controller to compare against
+ *
+ * Return: %true if @psec was obtained from @pcdev, %false otherwise.
+ */
+bool pse_control_matches_pcdev(struct pse_control *psec,
+			       struct pse_controller_dev *pcdev)
+{
+	return psec->pcdev == pcdev;
+}
+EXPORT_SYMBOL_GPL(pse_control_matches_pcdev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 199a7aaa341b..865b9baddb85 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -2158,6 +2158,8 @@ struct phy_device *fwnode_phy_find_device(struct fwnode_handle *phy_fwnode);
 struct fwnode_handle *fwnode_get_phy_node(const struct fwnode_handle *fwnode);
 struct phy_device *get_phy_device(struct mii_bus *bus, int addr, bool is_c45);
 int phy_device_register(struct phy_device *phy);
+/* Caller must hold rtnl_lock(); see phy_device_register() for the public form. */
+int phy_device_register_locked(struct phy_device *phy);
 void phy_device_free(struct phy_device *phydev);
 void phy_device_remove(struct phy_device *phydev);
 int phy_get_c45_ids(struct phy_device *phydev);
diff --git a/include/linux/pse-pd/pse.h b/include/linux/pse-pd/pse.h
index 78fe3a2b1ea8..d4310ca71a3e 100644
--- a/include/linux/pse-pd/pse.h
+++ b/include/linux/pse-pd/pse.h
@@ -385,6 +385,9 @@ int pse_ethtool_set_prio(struct pse_control *psec,
 bool pse_has_podl(struct pse_control *psec);
 bool pse_has_c33(struct pse_control *psec);
 
+bool pse_control_matches_pcdev(struct pse_control *psec,
+			       struct pse_controller_dev *pcdev);
+
 int pse_register_notifier(struct notifier_block *nb);
 int pse_unregister_notifier(struct notifier_block *nb);
 
@@ -438,6 +441,12 @@ static inline bool pse_has_c33(struct pse_control *psec)
 	return false;
 }
 
+static inline bool pse_control_matches_pcdev(struct pse_control *psec,
+					     struct pse_controller_dev *pcdev)
+{
+	return false;
+}
+
 static inline int pse_register_notifier(struct notifier_block *nb)
 {
 	return 0;
-- 
2.43.0



* Re: [PATCH v5.10 0/2] Fix CVE-2026-23204
From: Sasha Levin @ 2026-06-20 11:54 UTC (permalink / raw)
  To: stable, gregkh
  Cc: Sasha Levin, davem, edumazet, kuba, pabeni, horms, netdev,
	linux-kernel, xiaosuo, iri, jhs, ajay.kaher, alexey.makhalov,
	vamsi-krishna.brahmajosyula, yin.ding, tapas.kundu,
	Shivani Agarwal
In-Reply-To: <20260618080807.1269070-1-shivani.agarwal@broadcom.com>

> [PATCH v5.10 0/2] Fix CVE-2026-23204

Queued the series for 5.10, thanks.

-- 
Thanks,
Sasha


