Netdev List
* [syzbot] [net?] UBSAN: shift-out-of-bounds in xfrm_selector_match (3)
From: syzbot @ 2026-06-15  8:56 UTC
  To: davem, edumazet, herbert, horms, kuba, linux-kernel, netdev,
	pabeni, steffen.klassert, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    2b07ea76fd28 Merge tag 'core-urgent-2026-06-13' of git://g..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11cb4986580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4e828c596d7aa593
dashboard link: https://syzkaller.appspot.com/bug?extid=9383b1ff0df4b29ca5e6
compiler:       Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/a90e4670c989/disk-2b07ea76.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/e1ad8b2531b9/vmlinux-2b07ea76.xz
kernel image: https://storage.googleapis.com/syzbot-assets/9a927176c534/bzImage-2b07ea76.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+9383b1ff0df4b29ca5e6@syzkaller.appspotmail.com

------------[ cut here ]------------
UBSAN: shift-out-of-bounds in ./include/net/xfrm.h:970:23
shift exponent -96 is negative
CPU: 1 UID: 0 PID: 12115 Comm: syz.5.2221 Tainted: G             L      syzkaller #0 PREEMPT(full) 
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
 __ubsan_handle_shift_out_of_bounds+0x36d/0x400 lib/ubsan.c:494
 addr4_match include/net/xfrm.h:970 [inline]
 __xfrm4_selector_match net/xfrm/xfrm_policy.c:222 [inline]
 xfrm_selector_match+0xd5c/0x1140 net/xfrm/xfrm_policy.c:247
 __xfrm_policy_check+0x5d1/0x37b0 net/xfrm/xfrm_policy.c:3715
 __xfrm_policy_check2 include/net/xfrm.h:1302 [inline]
 xfrm_policy_check+0x475/0x880 include/net/xfrm.h:1307
 vti_rcv_cb+0x3b6/0x770 net/ipv4/ip_vti.c:135
 tunnel4_rcv_cb+0xd6/0x230 net/ipv4/tunnel4.c:124
 xfrm_rcv_cb+0x1c7/0x310 net/xfrm/xfrm_input.c:117
 xfrm_input+0x4738/0x7760 net/xfrm/xfrm_input.c:729
 vti_input+0x21f/0x330 net/ipv4/ip_vti.c:69
 tunnel4_rcv+0xdd/0x2d0 net/ipv4/tunnel4.c:103
 ip_protocol_deliver_rcu+0x2dc/0x440 net/ipv4/ip_input.c:207
 ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241
 NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318
 NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318
 __netif_receive_skb_one_core net/core/dev.c:6202 [inline]
 __netif_receive_skb net/core/dev.c:6315 [inline]
 netif_receive_skb_internal net/core/dev.c:6401 [inline]
 netif_receive_skb+0x45b/0xbf0 net/core/dev.c:6460
 tun_rx_batched+0x1de/0x790 drivers/net/tun.c:1487
 tun_get_user+0x2b04/0x4350 drivers/net/tun.c:1955
 tun_chr_write_iter+0x113/0x200 drivers/net/tun.c:2001
 new_sync_write fs/read_write.c:595 [inline]
 vfs_write+0x612/0xba0 fs/read_write.c:688
 ksys_write+0x150/0x270 fs/read_write.c:740
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fca8819ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fca890f6028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fca88416090 RCX: 00007fca8819ce59
RDX: 0000000000000fce RSI: 0000200000000840 RDI: 0000000000000003
RBP: 00007fca88232d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fca88416128 R14: 00007fca88416090 R15: 00007fca8853fa48
 </TASK>
---[ end trace ]---


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite the report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup


* [PATCH net] xfrm: validate selector family and prefixlen during match
From: Eric Dumazet @ 2026-06-15  9:02 UTC
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, netdev, eric.dumazet, Eric Dumazet,
	syzbot+9383b1ff0df4b29ca5e6, Sabrina Dubroca, Steffen Klassert

syzbot reported a shift-out-of-bounds in xfrm_selector_match(),
caused by an AF_UNSPEC selector with a large prefixlen (e.g. 128) being
matched against an IPv4 flow (when XFRM_STATE_AF_UNSPEC is set).

Fix this by:

- Rejecting mismatched families in xfrm_selector_match.
- Returning false in addr4_match if prefixlen > 32.
- Returning false in addr_match if prefixlen > 128 (prevents overflow).

Fixes: 3f0ab59e6537 ("xfrm: validate new SA's prefixlen using SA family when sel.family is unset")
Reported-by: syzbot+9383b1ff0df4b29ca5e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a2fbe35.be3f099c.2836ae.0018.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/net/xfrm.h     | 7 +++++++
 net/xfrm/xfrm_policy.c | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 874409127e292197c17dbac4686efdd5ff56c185..baa7454a0b7b8d1faffa7e8375510082b811e903 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -943,6 +943,9 @@ static inline bool addr_match(const void *token1, const void *token2,
 	unsigned int pdw;
 	unsigned int pbi;
 
+	if (prefixlen > 128)
+		return false;
+
 	pdw = prefixlen >> 5;	  /* num of whole u32 in prefix */
 	pbi = prefixlen &  0x1f;  /* num of bits in incomplete u32 in prefix */
 
@@ -967,6 +970,10 @@ static inline bool addr4_match(__be32 a1, __be32 a2, u8 prefixlen)
 	/* C99 6.5.7 (3): u32 << 32 is undefined behaviour */
 	if (sizeof(long) == 4 && prefixlen == 0)
 		return true;
+
+	if (prefixlen > 32)
+		return false;
+
 	return !((a1 ^ a2) & htonl(~0UL << (32 - prefixlen)));
 }
 
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 95954442569290719b9fdb7b0f9462d70b5d755e..bcc6ab6b0c183bfa90a94800c68dd0d029c2497c 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -242,6 +242,9 @@ __xfrm6_selector_match(const struct xfrm_selector *sel, const struct flowi *fl)
 bool xfrm_selector_match(const struct xfrm_selector *sel, const struct flowi *fl,
 			 unsigned short family)
 {
+	if (family != sel->family && sel->family != AF_UNSPEC)
+		return false;
+
 	switch (family) {
 	case AF_INET:
 		return __xfrm4_selector_match(sel, fl);
-- 
2.54.0.1136.gdb2ca164c4-goog
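
The arithmetic behind the UBSAN report can be sketched in userspace C.
With prefixlen = 128 the shift count 32 - prefixlen evaluates to -96,
which is the "shift exponent -96" in the trace. The function name
addr4_match_model is hypothetical, and the unconditional early return
for prefixlen == 0 is a simplification (the kernel helper only takes it
when sizeof(long) == 4); this is a model of the guarded logic, not the
kernel code itself.

```c
#include <arpa/inet.h>	/* htonl */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of a guarded IPv4 prefix match.
 * a1 and a2 are addresses in network byte order.
 */
static bool addr4_match_model(uint32_t a1, uint32_t a2, uint8_t prefixlen)
{
	/* A zero-length prefix matches everything and avoids a shift by 32. */
	if (prefixlen == 0)
		return true;

	/* Without this guard, prefixlen = 128 would shift by 32 - 128 = -96. */
	if (prefixlen > 32)
		return false;

	/* Shift count is now in the range 0..31, which is well defined. */
	return !((a1 ^ a2) & htonl(UINT32_MAX << (32 - prefixlen)));
}
```

For example, 10.0.0.1 and 10.0.0.255 match under a /24 prefix, while any
pair of addresses is rejected when prefixlen is 128.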



* [PATCH net-next v2] selftests/net/openvswitch: add ICMPv6 echo type match test
From: Minxi Hou @ 2026-06-15  9:05 UTC
  To: netdev
  Cc: aconole, echaudro, i.maximets, davem, edumazet, kuba, pabeni,
	horms, shuah, dev, linux-kselftest, Minxi Hou
In-Reply-To: <20260613141429.3084962-1-houminxi@gmail.com>

Register OVS_KEY_ATTR_ICMPV6 in the flow key parser so that
icmpv6(type=...) can be used in flow specifications. Without this
registration the parser silently drops the token and the kernel
rejects the flow with EINVAL because the expected ICMPv6 key
attribute is missing.

While here, add convert_int() to the ovs_key_ipv6 and ovs_key_icmp
fields_map entries so that specifying a field value produces the
correct wildcard mask. The IPv6 flow label uses convert_int(20) to
produce a 20-bit mask (0x000FFFFF), matching the kernel constraint in
flow_netlink.c that rejects masks with bits 20-31 set; byte-wide
fields use convert_int(8). The ipv4 counterpart already does this via
convert_int(); the ipv6 and icmp classes were simply missing the fifth
tuple element. Existing callers that pass empty parentheses are
unaffected because convert_int("") returns (0, 0).

Add test_icmpv6 exercising the ICMPv6 echo flow key. The test uses
static neighbour entries to bypass NDP, then verifies in three steps:
install icmpv6(type=128) and icmpv6(type=129) flows and confirm ping
works, remove the flows and confirm ping fails, reinstall and confirm
recovery.

Signed-off-by: Minxi Hou <houminxi@gmail.com>
---
v2: fix ovs_key_ipv6 label mask to use convert_int(20) instead of
    convert_int(32); the IPv6 flow label is 20 bits and the kernel
    rejects masks with bits 20-31 set in validate_set().

 .../selftests/net/openvswitch/openvswitch.sh  | 63 +++++++++++++++++++
 .../selftests/net/openvswitch/ovs-dpctl.py    | 26 +++++---
 2 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/openvswitch/openvswitch.sh b/tools/testing/selftests/net/openvswitch/openvswitch.sh
index d533decca5c1..8923224fa88e 100755
--- a/tools/testing/selftests/net/openvswitch/openvswitch.sh
+++ b/tools/testing/selftests/net/openvswitch/openvswitch.sh
@@ -31,6 +31,7 @@ tests="
 	pop_vlan				vlan: POP_VLAN action strips tag
 	dec_ttl					ttl: dec_ttl decrements IP TTL
 	flow_set				flow-set: Flow modify
+	icmpv6					icmpv6: ICMPv6 echo type match
 	psample					psample: Sampling packets with psample"
 
 info() {
@@ -377,6 +378,68 @@ test_flow_set() {
 	return 0
 }
 
+test_icmpv6() {
+	sbx_add "test_icmpv6" || return $?
+	ovs_add_dp "test_icmpv6" icmpv6 || return 1
+
+	info "create namespaces"
+	for ns in client server; do
+		ovs_add_netns_and_veths "test_icmpv6" "icmpv6" \
+			"$ns" "${ns:0:1}0" "${ns:0:1}1" || return 1
+	done
+
+	ip netns exec client ip addr add fd00::1/64 dev c1 nodad
+	ip netns exec client ip link set c1 up
+	ip netns exec server ip addr add fd00::2/64 dev s1 nodad
+	ip netns exec server ip link set s1 up
+
+	local cl_mac sl_mac
+	cl_mac=$(ip netns exec client \
+		ip link show c1 | awk '/link\/ether/ {print $2}')
+	[ -z "$cl_mac" ] && \
+		{ info "failed to get c1 hwaddr"; return 1; }
+	sl_mac=$(ip netns exec server \
+		ip link show s1 | awk '/link\/ether/ {print $2}')
+	[ -z "$sl_mac" ] && \
+		{ info "failed to get s1 hwaddr"; return 1; }
+	ip netns exec client \
+		ip -6 neigh add fd00::2 lladdr "$sl_mac" dev c1
+	ip netns exec server \
+		ip -6 neigh add fd00::1 lladdr "$cl_mac" dev s1
+
+	ovs_add_flow "test_icmpv6" icmpv6 \
+	  'in_port(1),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=128)' \
+	  '2' || return 1
+	ovs_add_flow "test_icmpv6" icmpv6 \
+	  'in_port(2),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=129)' \
+	  '1' || return 1
+
+	info "verify ICMPv6 echo with type-specific flows"
+	ovs_sbx "test_icmpv6" ip netns exec client \
+		ping -6 -c 1 -W 2 fd00::2 || return 1
+
+	ovs_del_flows "test_icmpv6" icmpv6
+
+	info "verify ping fails without echo flows"
+	ovs_sbx "test_icmpv6" ip netns exec client \
+		ping -6 -c 1 -W 2 fd00::2 >/dev/null 2>&1 \
+		&& { info "FAIL: ping should fail without flows"
+		     return 1; }
+
+	ovs_add_flow "test_icmpv6" icmpv6 \
+	  'in_port(1),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=128)' \
+	  '2' || return 1
+	ovs_add_flow "test_icmpv6" icmpv6 \
+	  'in_port(2),eth(),eth_type(0x86dd),ipv6(proto=58),icmpv6(type=129)' \
+	  '1' || return 1
+
+	info "verify connectivity restored"
+	ovs_sbx "test_icmpv6" ip netns exec client \
+		ping -6 -c 1 -W 2 fd00::2 || return 1
+
+	return 0
+}
+
 # psample test
 # - use psample to observe packets
 test_psample() {
diff --git a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
index e1ecfad2c03e..f3edd198223f 100644
--- a/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
+++ b/tools/testing/selftests/net/openvswitch/ovs-dpctl.py
@@ -1255,11 +1255,16 @@ class ovskey(nla):
                 lambda x: ipaddress.IPv6Address(x).packed if x else 0,
                 convert_ipv6,
             ),
-            ("label", "label", "%d", lambda x: int(x) if x else 0),
-            ("proto", "proto", "%d", lambda x: int(x) if x else 0),
-            ("tclass", "tclass", "%d", lambda x: int(x) if x else 0),
-            ("hlimit", "hlimit", "%d", lambda x: int(x) if x else 0),
-            ("frag", "frag", "%d", lambda x: int(x) if x else 0),
+            ("label", "label", "%d", lambda x: int(x) if x else 0,
+                convert_int(20)),
+            ("proto", "proto", "%d", lambda x: int(x) if x else 0,
+                convert_int(8)),
+            ("tclass", "tclass", "%d", lambda x: int(x) if x else 0,
+                convert_int(8)),
+            ("hlimit", "hlimit", "%d", lambda x: int(x) if x else 0,
+                convert_int(8)),
+            ("frag", "frag", "%d", lambda x: int(x) if x else 0,
+                convert_int(8)),
         )
 
         def __init__(
@@ -1344,8 +1349,10 @@ class ovskey(nla):
         )
 
         fields_map = (
-            ("type", "type", "%d", lambda x: int(x) if x else 0),
-            ("code", "code", "%d", lambda x: int(x) if x else 0),
+            ("type", "type", "%d", lambda x: int(x) if x else 0,
+                convert_int(8)),
+            ("code", "code", "%d", lambda x: int(x) if x else 0,
+                convert_int(8)),
         )
 
         def __init__(
@@ -1982,6 +1989,11 @@ class ovskey(nla):
                 "icmp",
                 ovskey.ovs_key_icmp,
             ),
+            (
+                "OVS_KEY_ATTR_ICMPV6",
+                "icmpv6",
+                ovskey.ovs_key_icmpv6,
+            ),
             (
                 "OVS_KEY_ATTR_TCP_FLAGS",
                 "tcp_flags",
-- 
2.54.0
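
The mask arithmetic described in the commit message can be sketched in
C. This is not the selftest's convert_int(), which is Python; the helper
names low_bits_mask and flow_label_mask_ok are hypothetical and only
illustrate why a 20-bit field yields 0x000FFFFF and why a 32-bit mask
would have bits 20-31 set.

```c
#include <stdint.h>

/* Hypothetical helper: all-ones mask covering the low `bits` bits of a u32. */
static uint32_t low_bits_mask(unsigned int bits)
{
	/* Shifting a 32-bit value by 32 is undefined, so handle full width here. */
	if (bits >= 32)
		return UINT32_MAX;
	return (UINT32_C(1) << bits) - 1;
}

/* Nonzero when the mask has none of bits 20-31 set (fits the flow label). */
static int flow_label_mask_ok(uint32_t mask)
{
	return (mask & ~low_bits_mask(20)) == 0;
}
```

Under this sketch, low_bits_mask(20) passes flow_label_mask_ok() and
low_bits_mask(32) does not, which mirrors the v1 to v2 change.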



* [PATCH net v2] appletalk: fix TOCTOU race in atalk_sendmsg
From: Yizhou Zhao @ 2026-06-15  9:06 UTC
  To: netdev
  Cc: Yizhou Zhao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Kito Xu, linux-kernel,
	Yuxiang Yang, Ao Wang, Xuewei Feng, Qi Li, Ke Xu, stable

atalk_sendmsg() looks up an AppleTalk route, stores the returned
atalk_route and net_device pointers, and then drops the socket lock
around sock_alloc_send_skb().  The route pointer returned by
atrtr_find() is only protected while atalk_routes_lock is held; after
that lock is dropped, a concurrent SIOCDELRT or device-down path can
unlink the route, drop the device reference, and free the route.

When sendmsg resumes, it can still dereference the stale route and
device pointers while building or transmitting the packet.  A KASAN
reproducer using AF_APPLETALK sockets and SIOCADDRT/SIOCDELRT reports
slab-use-after-free reads in atalk_sendmsg(), with the object allocated
by atrtr_create() and freed by atrtr_delete().

Fix this by splitting the route lookup into a helper that is called with
atalk_routes_lock already held.  atalk_sendmsg() now performs route
lookup, copies the route fields it needs, and takes references to the
selected devices with netdev_hold() while still holding
atalk_routes_lock.  After the lock is dropped and skb allocation sleeps,
the send path uses only the copied route data and the held net_device
references, which are released with netdev_put() before returning.

This preserves the existing route selection behaviour, including the
separate loopback route used for broadcast loopback, while removing the
dangling route/device window.

Fixes: 60d9f461a20b ("appletalk: remove the BKL")
Cc: stable@vger.kernel.org
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Assisted-by: GLM:GLM-5.1
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
---
Changes in v2:
- Use netdev_hold()/netdev_put() instead of dev_hold()/dev_put().
- Drop explicit NULL checks before releasing temporary device refs.
- Link to v1: https://lore.kernel.org/netdev/20260610052315.64504-1-zhaoyz24@mails.tsinghua.edu.cn/
---
 net/appletalk/ddp.c | 67 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 46 insertions(+), 21 deletions(-)

diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 30a6dc06291c..9b95dd06f600 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -434,7 +434,7 @@ static struct atalk_iface *atalk_find_interface(__be16 net, int node)
  * the socket (later on...). We know about host routes and the fact
  * that a route must be direct to broadcast.
  */
-static struct atalk_route *atrtr_find(struct atalk_addr *target)
+static struct atalk_route *atrtr_find_locked(struct atalk_addr *target)
 {
 	/*
 	 * we must search through all routes unless we find a
@@ -444,7 +444,6 @@ static struct atalk_route *atrtr_find(struct atalk_addr *target)
 	struct atalk_route *net_route = NULL;
 	struct atalk_route *r;
 
-	read_lock_bh(&atalk_routes_lock);
 	for (r = atalk_routes; r; r = r->next) {
 		if (!(r->flags & RTF_UP))
 			continue;
@@ -477,6 +476,15 @@ static struct atalk_route *atrtr_find(struct atalk_addr *target)
 	else /* No route can be found */
 		r = NULL;
 out:
+	return r;
+}
+
+static struct atalk_route *atrtr_find(struct atalk_addr *target)
+{
+	struct atalk_route *r;
+
+	read_lock_bh(&atalk_routes_lock);
+	r = atrtr_find_locked(target);
 	read_unlock_bh(&atalk_routes_lock);
 	return r;
 }
@@ -1553,10 +1561,13 @@ static int atalk_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	int loopback = 0;
 	struct sockaddr_at local_satalk, gsat;
 	struct sk_buff *skb;
-	struct net_device *dev;
+	struct net_device *dev = NULL, *dev_lo = NULL;
+	netdevice_tracker dev_tracker, dev_lo_tracker;
 	struct ddpehdr *ddp;
 	int size, hard_header_len;
 	struct atalk_route *rt, *rt_lo = NULL;
+	int rt_flags;
+	struct atalk_addr rt_gateway;
 	int err;
 
 	if (flags & ~(MSG_DONTWAIT|MSG_CMSG_COMPAT))
@@ -1600,39 +1611,50 @@ static int atalk_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	/* For headers */
 	size = sizeof(struct ddpehdr) + len + ddp_dl->header_length;
 
+	read_lock_bh(&atalk_routes_lock);
 	if (usat->sat_addr.s_net || usat->sat_addr.s_node == ATADDR_ANYNODE) {
-		rt = atrtr_find(&usat->sat_addr);
+		rt = atrtr_find_locked(&usat->sat_addr);
 	} else {
 		struct atalk_addr at_hint;
 
 		at_hint.s_node = 0;
 		at_hint.s_net  = at->src_net;
 
-		rt = atrtr_find(&at_hint);
+		rt = atrtr_find_locked(&at_hint);
 	}
 	err = -ENETUNREACH;
-	if (!rt)
+	if (!rt) {
+		read_unlock_bh(&atalk_routes_lock);
 		goto out;
+	}
 
 	dev = rt->dev;
-
-	net_dbg_ratelimited("SK %p: Size needed %d, device %s\n",
-			sk, size, dev->name);
+	netdev_hold(dev, &dev_tracker, GFP_ATOMIC);
+	rt_flags = rt->flags;
+	rt_gateway = rt->gateway;
 
 	hard_header_len = dev->hard_header_len;
 	/* Leave room for loopback hardware header if necessary */
 	if (usat->sat_addr.s_node == ATADDR_BCAST &&
-	    (dev->flags & IFF_LOOPBACK || !(rt->flags & RTF_GATEWAY))) {
+	    (dev->flags & IFF_LOOPBACK || !(rt_flags & RTF_GATEWAY))) {
 		struct atalk_addr at_lo;
 
 		at_lo.s_node = 0;
 		at_lo.s_net  = 0;
 
-		rt_lo = atrtr_find(&at_lo);
+		rt_lo = atrtr_find_locked(&at_lo);
 
-		if (rt_lo && rt_lo->dev->hard_header_len > hard_header_len)
-			hard_header_len = rt_lo->dev->hard_header_len;
+		if (rt_lo) {
+			dev_lo = rt_lo->dev;
+			netdev_hold(dev_lo, &dev_lo_tracker, GFP_ATOMIC);
+			if (dev_lo->hard_header_len > hard_header_len)
+				hard_header_len = dev_lo->hard_header_len;
+		}
 	}
+	read_unlock_bh(&atalk_routes_lock);
+
+	net_dbg_ratelimited("SK %p: Size needed %d, device %s\n",
+			    sk, size, dev->name);
 
 	size += hard_header_len;
 	release_sock(sk);
@@ -1675,7 +1697,7 @@ static int atalk_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	 * to group we are in)
 	 */
 	if (ddp->deh_dnode == ATADDR_BCAST &&
-	    !(rt->flags & RTF_GATEWAY) && !(dev->flags & IFF_LOOPBACK)) {
+	    !(rt_flags & RTF_GATEWAY) && !(dev->flags & IFF_LOOPBACK)) {
 		struct sk_buff *skb2 = skb_copy(skb, GFP_KERNEL);
 
 		if (skb2) {
@@ -1693,20 +1715,21 @@ static int atalk_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		/* loop back */
 		skb_orphan(skb);
 		if (ddp->deh_dnode == ATADDR_BCAST) {
-			if (!rt_lo) {
+			if (!dev_lo) {
 				kfree_skb(skb);
 				err = -ENETUNREACH;
 				goto out;
 			}
-			dev = rt_lo->dev;
-			skb->dev = dev;
+			skb->dev = dev_lo;
+			ddp_dl->request(ddp_dl, skb, dev_lo->dev_addr);
+		} else {
+			ddp_dl->request(ddp_dl, skb, dev->dev_addr);
 		}
-		ddp_dl->request(ddp_dl, skb, dev->dev_addr);
 	} else {
 		net_dbg_ratelimited("SK %p: send out.\n", sk);
-		if (rt->flags & RTF_GATEWAY) {
-		    gsat.sat_addr = rt->gateway;
-		    usat = &gsat;
+		if (rt_flags & RTF_GATEWAY) {
+			gsat.sat_addr = rt_gateway;
+			usat = &gsat;
 		}
 
 		/*
@@ -1717,6 +1740,8 @@ static int atalk_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	net_dbg_ratelimited("SK %p: Done write (%zd).\n", sk, len);
 
 out:
+	netdev_put(dev, &dev_tracker);
+	netdev_put(dev_lo, &dev_lo_tracker);
 	release_sock(sk);
 	return err ? : len;
 }
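
The pattern the patch applies (copy the needed route fields and pin the
device while the lock is held, then use only the copies after unlocking)
can be sketched as a small userspace model. Everything here is a
stand-in: a pthread mutex replaces atalk_routes_lock, a plain counter
replaces netdev_hold()/netdev_put(), and the model_* and route_* names
are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for net_device and atalk_route. */
struct model_dev {
	int refcnt;		/* protected by table_lock in this model */
	int hard_header_len;
};

struct model_route {
	struct model_dev *dev;
	int flags;
};

/* What the sender keeps once the lock is dropped. */
struct route_snapshot {
	struct model_dev *dev;	/* holds one reference */
	int flags;		/* copied, so the route itself is not needed */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct model_route *current_route;	/* protected by table_lock */

/* Copy the route fields and pin the device while the lock is held. */
static bool route_snapshot_get(struct route_snapshot *snap)
{
	bool found = false;

	pthread_mutex_lock(&table_lock);
	if (current_route) {
		snap->dev = current_route->dev;
		snap->dev->refcnt++;
		snap->flags = current_route->flags;
		found = true;
	}
	pthread_mutex_unlock(&table_lock);
	return found;
}

/* Drop the reference taken by route_snapshot_get(). */
static void route_snapshot_put(struct route_snapshot *snap)
{
	pthread_mutex_lock(&table_lock);
	snap->dev->refcnt--;
	pthread_mutex_unlock(&table_lock);
	snap->dev = NULL;
}

/* Models route deletion: unlink and drop the table's device reference. */
static void route_delete(void)
{
	pthread_mutex_lock(&table_lock);
	if (current_route) {
		current_route->dev->refcnt--;
		current_route = NULL;
	}
	pthread_mutex_unlock(&table_lock);
}
```

In this model a route_delete() that runs between route_snapshot_get()
and route_snapshot_put() leaves the snapshot's flags and device usable,
because the sender never touches the route again and still holds its own
device reference.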



* Re: [PATCH net-next v2 2/2] net: pse-pd: add Realtek/Broadcom PSE MCU driver
From: Simon Horman @ 2026-06-15  9:07 UTC
  To: jelonek.jonas
  Cc: Simon Horman, o.rempel, kory.maincent, andrew+netdev, davem,
	edumazet, kuba, pabeni, robh, krzk+dt, conor+dt, netdev,
	devicetree, linux-kernel, daniel, bjorn
In-Reply-To: <20260612132944.460646-3-jelonek.jonas@gmail.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net: pse-pd: add Realtek/Broadcom PSE MCU driver

This patch adds a driver for a microcontroller that fronts Realtek and
Broadcom PSE silicon on PoE switches, exposing the MCU over I2C/SMBus or
UART using a 12-byte fixed-frame protocol. The driver is split into a
shared core plus two transport modules, with a per-dialect opcode table
and parser hooks for the Realtek vs Broadcom variants.

> diff --git a/drivers/net/pse-pd/Kconfig b/drivers/net/pse-pd/Kconfig
> index 7ef29657ee5d..b065b19db126 100644
> --- a/drivers/net/pse-pd/Kconfig
> +++ b/drivers/net/pse-pd/Kconfig
> @@ -13,6 +13,34 @@ menuconfig PSE_CONTROLLER
>  
>  if PSE_CONTROLLER
>  
> +config PSE_REALTEK
> +	tristate
> +	help
> +	  Shared core for the Realtek/Broadcom PSE MCU driver. This is
> +	  selected automatically by the transport options below.
> +
> +config PSE_REALTEK_I2C
> +	tristate "Realtek/Broadcom PSE MCU driver (I2C transport)"
> +	depends on I2C
> +	select PSE_REALTEK
> +	help
> +	  Driver for the microcontroller (MCU) that fronts the PSE
> +	  hardware on switches with Realtek or Broadcom PSE chips, attached
> +	  via I2C/SMBus. The MCU exposes a message-based protocol; the actual
> +	  PSE silicon is not accessed directly. To compile this driver as a
> +	  module, choose M here: the module will be called realtek-pse-i2c.
> +
> +config PSE_REALTEK_UART
> +	tristate "Realtek/Broadcom PSE MCU driver (UART transport)"
> +	depends on SERIAL_DEV_BUS
> +	select PSE_REALTEK
> +	help
> +	  Driver for the microcontroller (MCU) that fronts the PSE
> +	  hardware on switches with Realtek or Broadcom PSE chips, attached
> +	  via UART. The MCU exposes a message-based protocol; the actual PSE
> +	  silicon is not accessed directly. To compile this driver as a
> +	  module, choose M here: the module will be called realtek-pse-uart.
> +

[Medium]
The realtek, vendor prefix is used for both the Realtek dialect and the
Broadcom dialect compatibles (realtek,pse-mcu-rtk and
realtek,pse-mcu-bcm), and the binding documentation explicitly notes
that the prefix "reflects the platform" rather than the silicon vendor.
DT compatibles are stable ABI once accepted.

Would a brcm, prefix (or per-MCU-part compatibles) for the Broadcom
dialect be more appropriate, given that the BCM variant uses Broadcom
PSE silicon driven by an MCU built by Nuvoton or STMicro and has no
Realtek involvement?

Similarly, would describing the I2C wire framing (native vs SMBus) via
separate compatibles be preferable to encoding it in the
realtek,i2c-protocol DT property, and would describing the actual PSE
chip in DT (rather than relying on runtime device-id detection) match
the usual binding patterns?

>  config PSE_REGULATOR
>  	tristate "Regulator based PSE controller"
>  	help

[ ... ]

> diff --git a/drivers/net/pse-pd/realtek-pse-core.c b/drivers/net/pse-pd/realtek-pse-core.c
> new file mode 100644
> index 000000000000..3a601cfb6280
> --- /dev/null
> +++ b/drivers/net/pse-pd/realtek-pse-core.c

[ ... ]

> +static int rtpse_port_cmd(struct rtpse_ctrl *pse, unsigned int port, u8 opcode, u8 arg)
> +{
> +	struct rtpse_mcu_msg req, resp;
> +	int ret;
> +
> +	rtpse_mcu_msg_init(&req, opcode);
> +	req.payload[0] = port;
> +	req.payload[1] = arg;
> +
> +	ret = rtpse_do_xfer(pse, &req, &resp);
> +	if (ret)
> +		return ret;
> +
> +	if (resp.payload[0] != port || resp.payload[1] != 0)
> +		return -EIO;
> +
> +	return 0;
> +}

[Medium]
Can an all-zero 12-byte buffer be silently accepted as a successful
PORT_ENABLE/PORT_DISABLE on the BCM dialect for port 0?

For the BCM dialect, RTPSE_CMD_PORT_ENABLE maps to opcode 0x00, and the
checksum of eleven zero bytes is 0, so an all-zero response passes the
opcode and checksum checks in rtpse_do_xfer():

	if (resp->opcode != req->opcode ||
	    resp->checksum != rtpse_checksum((u8 *)resp, RTPSE_MCU_MSG_SIZE - 1))
		return -EBADMSG;

rtpse_port_cmd() then verifies payload[0] == port and payload[1] == 0.
For port 0 both checks pass on an all-zero buffer regardless of arg,
because payload[1] is compared against 0 rather than against arg. So an
enable (arg 1) that the MCU never observed is reported as success just
like a disable (arg 0).

The seq_num field exists in struct rtpse_mcu_msg and is filled with 0xff
by rtpse_mcu_msg_init() but is never checked on the response side. Could
a seq_num round-trip check (or a non-zero opcode sentinel for the BCM
dialect) reject stale or zeroed-out frames here?
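
A minimal userspace sketch of the concern, under stated assumptions: the
frame layout (opcode, seq_num, nine payload bytes, checksum) and the
8-bit additive checksum are guesses that are merely consistent with "the
checksum of eleven zero bytes is 0", and whether the MCU echoes seq_num
is not stated in this thread. All model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MSG_SIZE 12

/* Assumed layout: opcode, seq_num, 9 payload bytes, checksum. */
struct model_msg {
	uint8_t opcode;
	uint8_t seq_num;
	uint8_t payload[MODEL_MSG_SIZE - 3];
	uint8_t checksum;
};

/* Assumed checksum: 8-bit sum of the first `len` bytes. */
static uint8_t model_checksum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/*
 * Validate a response. With check_seq == false this mirrors the opcode
 * and checksum test quoted above; with check_seq == true it also
 * requires the response to carry the request's seq_num.
 */
static bool model_resp_ok(const struct model_msg *req,
			  const struct model_msg *resp, bool check_seq)
{
	if (resp->opcode != req->opcode)
		return false;
	if (resp->checksum !=
	    model_checksum((const uint8_t *)resp, MODEL_MSG_SIZE - 1))
		return false;
	if (check_seq && resp->seq_num != req->seq_num)
		return false;
	return true;
}
```

With opcode 0x00 and seq_num 0xff in the request, an all-zero response
passes the first two checks and is rejected only by the seq_num
comparison.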

[ ... ]

> +static int rtpse_discover(struct rtpse_ctrl *pse, struct rtpse_mcu_info *info)
> +{
> +	struct rtpse_mcu_ext_config ext_config;
> +	unsigned long deadline;
> +	int ret;
> +
> +	/*
> +	 * The MCU may not answer on the bus yet right after power-up or
> +	 * enable-gpios assertion: depending on the transport it either stays
> +	 * silent (-ETIMEDOUT) or does not ACK its address at all (-ENXIO /
> +	 * -EREMOTEIO). Retry within a bounded wall-time window so a slow boot
> +	 * still probes, while a genuinely unresponsive MCU fails with its real
> +	 * error instead of deferring forever and masking it.
> +	 */
> +	deadline = jiffies + msecs_to_jiffies(RTPSE_MCU_BOOT_TIMEOUT_MS);
> +	do {
> +		ret = rtpse_mcu_get_info(pse, info);
> +		if (ret != -ETIMEDOUT && ret != -ENXIO &&
> +		    ret != -EREMOTEIO && ret != -EAGAIN)
> +			break;
> +		msleep(RTPSE_MCU_BOOT_RETRY_MS);
> +	} while (time_before(jiffies, deadline));

[Low]
The comment says the loop is meant to tolerate a slow-boot MCU, but the
retry whitelist excludes -EBADMSG and -EIO.

rtpse_do_xfer() returns -EBADMSG when the opcode or checksum check
fails, and rtpse_port_query()/rtpse_port_cmd() return -EIO when the
echoed port byte does not match the request. On a UART-attached MCU
that emits a single garbled frame while its serial peripheral
stabilises, or on an I2C-attached MCU that returns stale boot-ROM bytes
right after enable-gpios assertion, the first attempt would surface as
one of those errors rather than -ETIMEDOUT or -ENXIO, and the loop
would break out immediately.

Should -EBADMSG (and arguably -EIO) be added to the retry set so the
documented intent matches the actual behaviour?
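
One way to picture the suggestion is to factor the retry set into a
predicate. This is a sketch only: boot_error_is_retryable is a
hypothetical name, the set shown is the existing whitelist plus
-EBADMSG, and -EIO is left out here because the review only calls it
arguable. It assumes Linux errno values (EREMOTEIO is Linux-specific).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical predicate: true for errors that may just mean the MCU
 * has not finished booting, so the probe loop should try again.
 */
static bool boot_error_is_retryable(int err)
{
	switch (err) {
	case -ETIMEDOUT:	/* MCU stayed silent */
	case -ENXIO:		/* address not ACKed */
	case -EREMOTEIO:	/* address not ACKed (other I2C adapters) */
	case -EAGAIN:
	case -EBADMSG:		/* garbled or stale frame during start-up */
		return true;
	default:
		return false;
	}
}
```

The loop condition would then read as a single call, and widening or
narrowing the retry set becomes a one-line change.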

[ ... ]

> +int rtpse_register(struct rtpse_ctrl *pse)
> +{

[ ... ]

> +	if (!info.system_enable) {
> +		ret = rtpse_set_global_state(pse, true);
> +		/* Dialects without a global-state concept (e.g. BCM) return
> +		 * -EOPNOTSUPP; treat that as "no separate enable required".
> +		 */
> +		if (ret && ret != -EOPNOTSUPP)
> +			return dev_err_probe(pse->dev, ret,
> +					     "failed to enable PSE system\n");
> +	}
> +
> +	ret = regulator_enable(pse->poe_supply);
> +	if (ret)
> +		return dev_err_probe(pse->dev, ret, "failed to enable PoE supply\n");
> +
> +	ret = devm_add_action_or_reset(pse->dev, rtpse_regulator_disable, pse->poe_supply);
> +	if (ret)
> +		return ret;

[Low]
If rtpse_set_global_state(pse, true) succeeds but a later step in
rtpse_register() fails (regulator_enable, devm_add_action_or_reset, or
devm_pse_controller_register), is the MCU left with the PSE subsystem
enabled while the kernel detaches?

The regulator is rolled back via rtpse_regulator_disable through
devm_add_action_or_reset, but there does not seem to be a matching
devm_add_action_or_reset to call rtpse_set_global_state(false) on
probe-failure or driver-removal. Is that asymmetry intentional?

[ ... ]

> diff --git a/drivers/net/pse-pd/realtek-pse-uart.c b/drivers/net/pse-pd/realtek-pse-uart.c
> new file mode 100644
> index 000000000000..785407b379a4
> --- /dev/null
> +++ b/drivers/net/pse-pd/realtek-pse-uart.c

[ ... ]

> +static size_t rtpse_uart_receive(struct serdev_device *serdev,
> +				 const u8 *buf, size_t count)
> +{
> +	struct rtpse_uart *ctx = serdev_device_get_drvdata(serdev);
> +	bool done = false;
> +	size_t take;
> +
> +	scoped_guard(spinlock_irqsave, &ctx->rx_lock) {
> +		take = min(count, sizeof(ctx->rx_buf) - ctx->rx_len);
> +		if (take) {
> +			memcpy(ctx->rx_buf + ctx->rx_len, buf, take);
> +			ctx->rx_len += take;
> +			done = (ctx->rx_len == sizeof(ctx->rx_buf));
> +		}
> +	}
> +	if (done)
> +		complete(&ctx->rx_done);
> +
> +	/* consume all to avoid desync/misalignment */
> +	return count;
> +}

[ ... ]

> +static int rtpse_uart_send(struct rtpse_ctrl *pse, const struct rtpse_mcu_msg *req)
> +{
> +	struct rtpse_uart *ctx = to_rtpse_uart(pse);
> +	int written;
> +
> +	/* clear any leftover rx state before transmitting */
> +	reinit_completion(&ctx->rx_done);
> +	scoped_guard(spinlock_irqsave, &ctx->rx_lock)
> +		ctx->rx_len = 0;

[High]
Is there a race between reinit_completion() here and the asynchronous
serdev receive_buf callback when a previous transaction has timed out
with bytes still in flight?

reinit_completion() is an unlocked x->done = 0 write, and the matching
complete() in rtpse_uart_receive() is called outside rx_lock. The
following interleave seems possible:

  rtpse_uart_send() runs reinit_completion()       /* done = 0 */
  rtpse_uart_receive() takes rx_lock, fills rx_buf
    to 12 bytes, computes done=true, releases lock
  rtpse_uart_send() takes rx_lock, sets rx_len = 0,
    releases lock
  rtpse_uart_receive() calls complete(&ctx->rx_done) /* done = 1 */
  rtpse_uart_send() writes the new request
  rtpse_uart_recv() wait_for_completion_timeout()
    returns immediately (done was 1), then sees
    rx_len == 0 and returns -EIO

rtpse_discover()'s retry whitelist covers -ETIMEDOUT, -ENXIO,
-EREMOTEIO, and -EAGAIN but not -EIO, so a single occurrence of this
race during the boot retry window would break out of the loop and turn
into a permanent dev_err_probe() failure.

Would moving complete() inside rx_lock in rtpse_uart_receive(), or
performing reinit_completion() and rx_len = 0 together inside rx_lock
while excluding the completer, close this window?
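
The second option can be modelled in userspace as follows. This is a
sketch under assumptions, not driver code: a pthread mutex stands in for
rx_lock, a bool stands in for the completion (a real waiter would need a
condition variable or the kernel completion API), and the model_rx names
are hypothetical. The point is only that the length and the "done" state
change inside the same critical section.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FRAME_SIZE 12

struct model_rx {
	pthread_mutex_t lock;
	uint8_t buf[MODEL_FRAME_SIZE];
	size_t len;
	bool done;	/* stands in for the completion; only touched under lock */
};

/* Sender side: clear the byte count and the completion together. */
static void model_rx_reset(struct model_rx *rx)
{
	pthread_mutex_lock(&rx->lock);
	rx->len = 0;
	rx->done = false;
	pthread_mutex_unlock(&rx->lock);
}

/* Receiver side: append bytes and signal before releasing the lock. */
static size_t model_rx_feed(struct model_rx *rx, const uint8_t *data,
			    size_t count)
{
	size_t take;

	pthread_mutex_lock(&rx->lock);
	take = sizeof(rx->buf) - rx->len;
	if (take > count)
		take = count;
	if (take) {
		memcpy(rx->buf + rx->len, data, take);
		rx->len += take;
		if (rx->len == sizeof(rx->buf))
			rx->done = true;
	}
	pthread_mutex_unlock(&rx->lock);

	return count;	/* consume everything, like the receive callback */
}

/* Invariant the interleaving above violates: done implies a full frame. */
static bool model_rx_consistent(struct model_rx *rx)
{
	bool ok;

	pthread_mutex_lock(&rx->lock);
	ok = !rx->done || rx->len == sizeof(rx->buf);
	pthread_mutex_unlock(&rx->lock);
	return ok;
}
```

Because a reset can no longer slip between the receiver filling the
buffer and setting done, a waiter in this model never sees done set
together with a zero length.
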

> +
> +	written = serdev_device_write(ctx->serdev, (const u8 *)req, sizeof(*req),
> +				      RTPSE_UART_TX_TIMEOUT);
> +	if (written < 0)
> +		return written;
> +	if (written != sizeof(*req))
> +		return -EIO;
> +
> +	return 0;
> +}

[ ... ]

^ permalink raw reply

* [PATCH net-next v5 0/4] Add support for RTL8261C_CG
From: javen @ 2026-06-15  9:08 UTC (permalink / raw)
  To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni,
	freddy_gu, nb
  Cc: netdev, linux-kernel, daniel, vladimir.oltean, Javen Xu

From: Javen Xu <javen_xu@realsil.com.cn>

Add support for RTL8261C_CG and add support for loading firmware.

Javen Xu (4):
  net: phy: c45: add genphy_c45_soft_reset()
  net: phy: c45: add genphy_c45_config_master_slave()
  net: phy: realtek: add support for RTL8261C_CG
  net: phy: realtek: load firmware for RTL8261C_CG

 drivers/net/phy/phy-c45.c              | 118 +++++++
 drivers/net/phy/realtek/realtek_main.c | 407 +++++++++++++++++++++++++
 include/linux/phy.h                    |   1 +
 include/uapi/linux/mdio.h              |   5 +
 4 files changed, 531 insertions(+)

-- 
2.43.0


^ permalink raw reply

* [PATCH net-next v5 3/4] net: phy: realtek: add support for RTL8261C_CG
From: javen @ 2026-06-15  9:08 UTC (permalink / raw)
  To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni,
	freddy_gu, nb
  Cc: netdev, linux-kernel, daniel, vladimir.oltean, Javen Xu
In-Reply-To: <20260615090817.429-1-javen_xu@realsil.com.cn>

From: Javen Xu <javen_xu@realsil.com.cn>

This patch adds support for the Realtek RTL8261C_CG PHY chip. Its PHY ID is
0x001cc898.

Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
---
Changes in v2:
 - no changes, new file

Changes in v3:
 - re-order function according to the order in phy-c45.c
 - add kernel-doc about return value
 - add MASTER_SLAVE_CFG_MASTER_PREFERRED,
   MASTER_SLAVE_CFG_SLAVE_PREFERRED, MASTER_SLAVE_CFG_UNKNOWN and
   MASTER_SLAVE_CFG_UNSUPPORTED cfg

Changes in v4:
 - no changes

Changes in v5:
 - remove genphy_c45_pma_setup_forced(), since this is already done when
   calling genphy_c45_config_aneg()

Changes in v6:
 - when PHY_INTERRUPT_DISABLE, clear IMR and ISR
 - if AUTONEG_DISABLE, nothing needs to be done in rtl8261x_config_aneg
 - add rtl8261x_read_status, support 1G speed
---
 drivers/net/phy/realtek/realtek_main.c | 187 +++++++++++++++++++++++++
 1 file changed, 187 insertions(+)

diff --git a/drivers/net/phy/realtek/realtek_main.c b/drivers/net/phy/realtek/realtek_main.c
index 27268811f564..bef476ddbe3d 100644
--- a/drivers/net/phy/realtek/realtek_main.c
+++ b/drivers/net/phy/realtek/realtek_main.c
@@ -141,6 +141,10 @@
 #define RTL8211F_PHYSICAL_ADDR_WORD1		17
 #define RTL8211F_PHYSICAL_ADDR_WORD2		18
 
+#define RTL8261X_EXT_ADDR_REG			0xa436
+#define RTL8261X_EXT_DATA_REG			0xa438
+#define RTL_8261X_SUB_PHY_ID_ADDR		0x801d
+
 #define RTL822X_VND1_SERDES_OPTION			0x697a
 #define RTL822X_VND1_SERDES_OPTION_MODE_MASK		GENMASK(5, 0)
 #define RTL822X_VND1_SERDES_OPTION_MODE_2500BASEX_SGMII		0
@@ -251,6 +255,32 @@
 #define RTL_8221B_VM_CG				0x001cc84a
 #define RTL_8251B				0x001cc862
 #define RTL_8261C				0x001cc890
+#define RTL_8261C_CG				0x001cc898
+
+#define RTL8261C_CE_MODEL		0x00
+#define RTL8261X_IMR			0xa4d2
+#define RTL8261X_ISR			0xa4d4
+#define RTL8261X_INT_AUTONEG_ERROR	BIT(0)
+#define RTL8261X_INT_PAGE_RECV		BIT(2)
+#define RTL8261X_INT_AUTONEG_DONE	BIT(3)
+#define RTL8261X_INT_LINK_CHG		BIT(4)
+#define RTL8261X_INT_PHY_REG_ACCESS	BIT(5)
+#define RTL8261X_INT_PME		BIT(7)
+#define RTL8261X_INT_ALDPS_CHG		BIT(9)
+#define RTL8261X_INT_JABBER		BIT(10)
+
+#define RTL8261X_INT_MASK_DEFAULT	(RTL8261X_INT_AUTONEG_DONE | \
+					 RTL8261X_INT_LINK_CHG)
+
+#define RTL8261X_INT_MASK_ALL		(RTL8261X_INT_AUTONEG_ERROR | \
+					 RTL8261X_INT_PAGE_RECV | \
+					 RTL8261X_INT_AUTONEG_DONE | \
+					 RTL8261X_INT_LINK_CHG | \
+					 RTL8261X_INT_PHY_REG_ACCESS | \
+					 RTL8261X_INT_PME | \
+					 RTL8261X_INT_ALDPS_CHG | \
+					 RTL8261X_INT_JABBER)
+
 
 /* RTL8211E and RTL8211F support up to three LEDs */
 #define RTL8211x_LED_COUNT			3
@@ -310,6 +340,151 @@ static int rtl821x_modify_ext_page(struct phy_device *phydev, u16 ext_page,
 	return phy_restore_page(phydev, oldpage, ret);
 }
 
+static int rtl8261x_probe(struct phy_device *phydev)
+{
+	int sub_phy_id, ret;
+
+	ret = phy_write_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_EXT_ADDR_REG,
+			    RTL_8261X_SUB_PHY_ID_ADDR);
+	if (ret < 0)
+		return ret;
+
+	ret = phy_read_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_EXT_DATA_REG);
+	if (ret < 0)
+		return ret;
+
+	sub_phy_id = (ret >> 8) & 0xff;
+
+	switch (sub_phy_id) {
+	case RTL8261C_CE_MODEL:
+		phydev_info(phydev, "RTL8261C detected (sub_id 0x%02x)\n", sub_phy_id);
+		break;
+
+	default:
+		phydev_err(phydev, "Unknown sub_id 0x%02x\n", sub_phy_id);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static int rtl8261x_get_features(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_c45_pma_read_abilities(phydev);
+	if (ret)
+		return ret;
+	/*
+	 * Supplement multi-gig speeds that may not be detected automatically.
+	 * RTL8261X supports 2.5G/5G in addition to standard 10G.
+	 */
+	linkmode_set_bit(ETHTOOL_LINK_MODE_2500baseT_Full_BIT,
+			 phydev->supported);
+	linkmode_set_bit(ETHTOOL_LINK_MODE_5000baseT_Full_BIT,
+			 phydev->supported);
+
+	return 0;
+}
+
+static int rtl8261x_read_status(struct phy_device *phydev)
+{
+	int ret;
+
+	ret = genphy_c45_read_status(phydev);
+	if (ret < 0)
+		return ret;
+
+	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
+		int lp_status;
+
+		lp_status = phy_read_mmd(phydev, MDIO_MMD_VEND2,
+					 RTL822X_VND2_C22_REG(MII_STAT1000));
+		if (lp_status < 0)
+			return lp_status;
+
+		if (lp_status & (LPA_1000FULL | LPA_1000HALF)) {
+			mii_stat1000_mod_linkmode_lpa_t(phydev->lp_advertising, lp_status);
+			phy_resolve_aneg_linkmode(phydev);
+		}
+	}
+
+	return 0;
+}
+
+static int rtl8261x_config_intr(struct phy_device *phydev)
+{
+	int ret;
+
+	if (phydev->interrupts == PHY_INTERRUPT_ENABLED) {
+		ret = phy_read_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_ISR);
+		if (ret < 0)
+			return ret;
+
+		ret = phy_write_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_IMR,
+				    RTL8261X_INT_MASK_DEFAULT);
+		if (ret < 0)
+			return ret;
+	} else {
+		ret = phy_write_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_IMR, 0);
+		if (ret < 0)
+			return ret;
+
+		ret = phy_read_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_ISR);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+static irqreturn_t rtl8261x_handle_interrupt(struct phy_device *phydev)
+{
+	int irq_status;
+
+	irq_status = phy_read_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_ISR);
+	if (irq_status < 0) {
+		phy_error(phydev);
+		return IRQ_NONE;
+	}
+
+	if (!(irq_status & RTL8261X_INT_MASK_ALL))
+		return IRQ_NONE;
+
+	if (irq_status & (RTL8261X_INT_LINK_CHG | RTL8261X_INT_AUTONEG_DONE |
+	    RTL8261X_INT_AUTONEG_ERROR | RTL8261X_INT_JABBER))
+		phy_trigger_machine(phydev);
+
+	return IRQ_HANDLED;
+}
+
+static int rtl8261x_config_aneg(struct phy_device *phydev)
+{
+	u16 adv_1g = 0;
+	int ret;
+
+	ret = genphy_c45_config_aneg(phydev);
+	if (ret < 0)
+		return ret;
+
+	if (phydev->autoneg == AUTONEG_DISABLE)
+		return 0;
+
+	if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+			      phydev->advertising))
+		adv_1g = ADVERTISE_1000FULL;
+
+	ret = phy_modify_mmd_changed(phydev, MDIO_MMD_VEND2,
+				     RTL822X_VND2_C22_REG(MII_CTRL1000),
+				     ADVERTISE_1000FULL, adv_1g);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		return genphy_c45_restart_aneg(phydev);
+
+	return 0;
+}
+
 static int rtl821x_probe(struct phy_device *phydev)
 {
 	struct device *dev = &phydev->mdio.dev;
@@ -3001,6 +3176,18 @@ static struct phy_driver realtek_drvs[] = {
 		.resume		= genphy_resume,
 		.read_mmd	= genphy_read_mmd_unsupported,
 		.write_mmd	= genphy_write_mmd_unsupported,
+	}, {
+		PHY_ID_MATCH_EXACT(RTL_8261C_CG),
+		.name			= "Realtek RTL8261C 10Gbps PHY",
+		.probe			= rtl8261x_probe,
+		.get_features		= rtl8261x_get_features,
+		.config_aneg		= rtl8261x_config_aneg,
+		.read_status		= rtl8261x_read_status,
+		.config_intr		= rtl8261x_config_intr,
+		.handle_interrupt	= rtl8261x_handle_interrupt,
+		.soft_reset		= genphy_c45_soft_reset,
+		.suspend		= genphy_c45_pma_suspend,
+		.resume			= genphy_c45_pma_resume,
 	},
 };
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v5 4/4] net: phy: realtek: load firmware for RTL8261C_CG
From: javen @ 2026-06-15  9:08 UTC (permalink / raw)
  To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni,
	freddy_gu, nb
  Cc: netdev, linux-kernel, daniel, vladimir.oltean, Javen Xu
In-Reply-To: <20260615090817.429-1-javen_xu@realsil.com.cn>

From: Javen Xu <javen_xu@realsil.com.cn>

This patch adds support for loading firmware, which downloads a set of
parameters to the RTL8261C_CG.

Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
---
Changes in v2:
 - remove __pack; struct rtl8261x_fw_header and rtl8261x_fw_entry need no padding
 - reverse xmas tree for some definition
 - add explanation on rtl_phy_write_mmd_bits()

Changes in v3:
 - add struct rtl8261x_priv

Changes in v4:
 - add struct device *dev

Changes in v5:
 - no changes

Changes in v6:
 - replace rtl_phy_write_mmd_bits with phy_modify_mmd, keep mdio lock
 - check msb and lsb at the beginning of rtl8261x_fw_execute_entry()
 - add comments on rtl8261x_config_init()
---
 drivers/net/phy/realtek/realtek_main.c | 220 +++++++++++++++++++++++++
 1 file changed, 220 insertions(+)

diff --git a/drivers/net/phy/realtek/realtek_main.c b/drivers/net/phy/realtek/realtek_main.c
index bef476ddbe3d..d1a07d1101b6 100644
--- a/drivers/net/phy/realtek/realtek_main.c
+++ b/drivers/net/phy/realtek/realtek_main.c
@@ -8,7 +8,9 @@
  * Copyright (c) 2004 Freescale Semiconductor, Inc.
  */
 #include <linux/bitops.h>
+#include <linux/crc32.h>
 #include <linux/ethtool_netlink.h>
+#include <linux/firmware.h>
 #include <linux/of.h>
 #include <linux/phy.h>
 #include <linux/pm_wakeirq.h>
@@ -281,6 +283,42 @@
 					 RTL8261X_INT_ALDPS_CHG | \
 					 RTL8261X_INT_JABBER)
 
+#define FW_MAIN_MAGIC			0x52544C38
+#define FW_SUB_MAGIC_8261C		0x32363143
+#define RTL8261X_POLL_TIMEOUT_MS	100
+
+#define RTL8261C_CE_FW_NAME	"rtl_nic/rtl8261c.bin"
+MODULE_FIRMWARE(RTL8261C_CE_FW_NAME);
+
+enum rtl8261x_fw_op {
+	OP_WRITE = 0x00,	/* Write */
+	OP_POLL  = 0x02,	/* Polling */
+};
+
+struct rtl8261x_fw_header {
+	__le32 main_magic;	/* Main magic number 0x52544C38 ("RTL8") */
+	__le32 sub_magic;	/* Sub magic number */
+	__le16 version_major;	/* Major version */
+	__le16 version_minor;	/* Minor version */
+	__le16 num_entries;	/* Number of entries */
+	__le16 reserved;	/* Reserved */
+	__le32 crc32;		/* CRC32 checksum */
+};
+
+struct rtl8261x_fw_entry {
+	__u8  type;		/* Operation type (OP_*) */
+	__u8  dev;		/* MMD device */
+	__le16 addr;		/* Register address */
+	__u8  msb;		/* MSB bit position */
+	__u8  lsb;		/* LSB bit position */
+	__le16 value;		/* Value to write/compare */
+	__le16 timeout_ms;	/* Poll timeout in milliseconds */
+	__u8  poll_set;		/* Poll for set (1) or clear (0) */
+	__u8  reserved;		/* Reserved */
+};
+
+#define FW_HEADER_SIZE		sizeof(struct rtl8261x_fw_header)
+#define FW_ENTRY_SIZE		sizeof(struct rtl8261x_fw_entry)
 
 /* RTL8211E and RTL8211F support up to three LEDs */
 #define RTL8211x_LED_COUNT			3
@@ -300,6 +338,11 @@ struct rtl821x_priv {
 	u16 iner;
 };
 
+struct rtl8261x_priv {
+	const char *fw_name;
+	bool fw_loaded;
+};
+
 static int rtl821x_read_page(struct phy_device *phydev)
 {
 	return __phy_read(phydev, RTL821x_PAGE_SELECT);
@@ -342,8 +385,16 @@ static int rtl821x_modify_ext_page(struct phy_device *phydev, u16 ext_page,
 
 static int rtl8261x_probe(struct phy_device *phydev)
 {
+	struct device *dev = &phydev->mdio.dev;
+	struct rtl8261x_priv *priv;
 	int sub_phy_id, ret;
 
+	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	phydev->priv = priv;
+
 	ret = phy_write_mmd(phydev, MDIO_MMD_VEND2, RTL8261X_EXT_ADDR_REG,
 			    RTL_8261X_SUB_PHY_ID_ADDR);
 	if (ret < 0)
@@ -357,6 +408,7 @@ static int rtl8261x_probe(struct phy_device *phydev)
 
 	switch (sub_phy_id) {
 	case RTL8261C_CE_MODEL:
+		priv->fw_name = RTL8261C_CE_FW_NAME;
 		phydev_info(phydev, "RTL8261C detected (sub_id 0x%02x)\n", sub_phy_id);
 		break;
 
@@ -412,6 +464,153 @@ static int rtl8261x_read_status(struct phy_device *phydev)
 	return 0;
 }
 
+static int rtl8261x_verify_firmware(struct phy_device *phydev, const struct firmware *fw)
+{
+	const struct rtl8261x_fw_header *hdr;
+	u32 main_magic, sub_magic;
+	u32 calc_crc, file_crc;
+	size_t data_len;
+	u16 num_entries;
+
+	if (fw->size < FW_HEADER_SIZE) {
+		phydev_err(phydev, "Firmware too small: %zu bytes\n", fw->size);
+		return -EINVAL;
+	}
+
+	hdr = (const struct rtl8261x_fw_header *)fw->data;
+
+	main_magic = le32_to_cpu(hdr->main_magic);
+	if (main_magic != FW_MAIN_MAGIC) {
+		phydev_err(phydev, "Invalid firmware magic: 0x%08x\n", main_magic);
+		return -EINVAL;
+	}
+
+	sub_magic = le32_to_cpu(hdr->sub_magic);
+	if (sub_magic != FW_SUB_MAGIC_8261C) {
+		phydev_err(phydev, "Invalid sub magic: 0x%08x\n", sub_magic);
+		return -EINVAL;
+	}
+
+	num_entries = le16_to_cpu(hdr->num_entries);
+	data_len = num_entries * FW_ENTRY_SIZE;
+
+	if (fw->size != sizeof(*hdr) + data_len) {
+		phydev_err(phydev, "Firmware size mismatch\n");
+		return -EINVAL;
+	}
+
+	calc_crc = crc32(~0, fw->data + FW_HEADER_SIZE, data_len) ^ ~0;
+	file_crc = le32_to_cpu(hdr->crc32);
+
+	if (calc_crc != file_crc) {
+		phydev_err(phydev, "CRC32 mismatch: calculated=0x%08x file=0x%08x\n",
+			   calc_crc, file_crc);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int rtl8261x_fw_execute_entry(struct phy_device *phydev,
+				     const struct rtl8261x_fw_entry *entry)
+{
+	u16 addr, value, timeout_ms;
+	u8 dev, msb, lsb, poll_set;
+	u32 bits, expect_val;
+	int ret = 0;
+	int val;
+
+	dev = entry->dev;
+	addr = le16_to_cpu(entry->addr);
+	msb = entry->msb;
+	lsb = entry->lsb;
+	value = le16_to_cpu(entry->value);
+	timeout_ms = le16_to_cpu(entry->timeout_ms);
+	poll_set = entry->poll_set;
+
+	if (timeout_ms == 0)
+		timeout_ms = RTL8261X_POLL_TIMEOUT_MS;
+
+	if (msb > 15 || lsb > msb) {
+		phydev_err(phydev, "Invalid firmware bits: msb=%d, lsb=%d\n", msb, lsb);
+		return -EINVAL;
+	}
+
+	switch (entry->type) {
+	case OP_WRITE:
+		ret = phy_modify_mmd(phydev, dev, addr,
+				     GENMASK(msb, lsb), (value << lsb) & GENMASK(msb, lsb));
+		if (ret) {
+			phydev_err(phydev, "WRITE failed: dev=%d addr=0x%04x\n", dev, addr);
+			return ret;
+		}
+		break;
+
+	case OP_POLL: {
+		bits = GENMASK(msb, lsb);
+		expect_val = (value << lsb) & bits;
+
+		if (poll_set)
+			ret = phy_read_mmd_poll_timeout(phydev, dev, addr, val,
+							(val & bits) == expect_val,
+							1000, timeout_ms * 1000, false);
+		else
+			ret = phy_read_mmd_poll_timeout(phydev, dev, addr, val,
+							(val & bits) != expect_val,
+							1000, timeout_ms * 1000, false);
+		if (ret)
+			phydev_err(phydev, "POLL timeout: dev=%d addr=0x%04x\n",
+				   dev, addr);
+		break;
+	}
+	default:
+		phydev_err(phydev, "Unknown firmware operation: %d\n", entry->type);
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int rtl8261x_fw_load(struct phy_device *phydev)
+{
+	struct rtl8261x_priv *priv = phydev->priv;
+	const struct rtl8261x_fw_entry *entry;
+	const struct rtl8261x_fw_header *hdr;
+	const struct firmware *fw;
+	int ret, i;
+
+	if (!priv->fw_name)
+		return 0;
+
+	ret = request_firmware(&fw, priv->fw_name, &phydev->mdio.dev);
+	if (ret) {
+		phydev_err(phydev, "Failed to load firmware %s: %d\n", priv->fw_name, ret);
+		return ret;
+	}
+
+	ret = rtl8261x_verify_firmware(phydev, fw);
+	if (ret)
+		goto release_fw;
+
+	hdr = (const struct rtl8261x_fw_header *)fw->data;
+
+	entry = (const struct rtl8261x_fw_entry *)(fw->data + FW_HEADER_SIZE);
+	for (i = 0; i < le16_to_cpu(hdr->num_entries); i++, entry++) {
+		ret = rtl8261x_fw_execute_entry(phydev, entry);
+		if (ret) {
+			phydev_err(phydev, "Entry %d failed: %d\n", i, ret);
+			goto release_fw;
+		}
+	}
+
+	priv->fw_loaded = true;
+
+release_fw:
+	release_firmware(fw);
+	return ret;
+}
+
 static int rtl8261x_config_intr(struct phy_device *phydev)
 {
 	int ret;
@@ -485,6 +684,26 @@ static int rtl8261x_config_aneg(struct phy_device *phydev)
 	return 0;
 }
 
+static int rtl8261x_config_init(struct phy_device *phydev)
+{
+	struct rtl8261x_priv *priv = phydev->priv;
+	int ret = 0;
+
+	/* The firmware parameters are preserved across IEEE soft resets and
+	 * suspend/resume cycles. Reloading is only necessary after a power
+	 * cycle or hard reset.
+	 */
+	if (priv->fw_name && !priv->fw_loaded) {
+		ret = rtl8261x_fw_load(phydev);
+		if (ret) {
+			phydev_err(phydev, "Firmware loading failed: %d\n", ret);
+			return ret;
+		}
+	}
+
+	return ret;
+}
+
 static int rtl821x_probe(struct phy_device *phydev)
 {
 	struct device *dev = &phydev->mdio.dev;
@@ -3180,6 +3399,7 @@ static struct phy_driver realtek_drvs[] = {
 		PHY_ID_MATCH_EXACT(RTL_8261C_CG),
 		.name			= "Realtek RTL8261C 10Gbps PHY",
 		.probe			= rtl8261x_probe,
+		.config_init		= rtl8261x_config_init,
 		.get_features		= rtl8261x_get_features,
 		.config_aneg		= rtl8261x_config_aneg,
 		.read_status		= rtl8261x_read_status,
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v5 1/4] net: phy: c45: add genphy_c45_soft_reset()
From: javen @ 2026-06-15  9:08 UTC (permalink / raw)
  To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni,
	freddy_gu, nb
  Cc: netdev, linux-kernel, daniel, vladimir.oltean, Javen Xu
In-Reply-To: <20260615090817.429-1-javen_xu@realsil.com.cn>

From: Javen Xu <javen_xu@realsil.com.cn>

Add a generic Clause 45 software reset helper. The helper sets the reset
bit in the PMA/PMD control register and waits until the bit is cleared by
hardware.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
---
Changes in v2:
 - no changes, new file

Changes in v3:
 - re-order function according to the order in phy-c45.c

Changes in v4:
 - no changes

Changes in v5:
 - no changes

Changes in v6:
 - increase timeout to 600ms
---
 drivers/net/phy/phy-c45.c | 22 ++++++++++++++++++++++
 include/linux/phy.h       |  1 +
 2 files changed, 23 insertions(+)

diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
index 126951741428..60d044156a83 100644
--- a/drivers/net/phy/phy-c45.c
+++ b/drivers/net/phy/phy-c45.c
@@ -384,6 +384,28 @@ int genphy_c45_check_and_restart_aneg(struct phy_device *phydev, bool restart)
 }
 EXPORT_SYMBOL_GPL(genphy_c45_check_and_restart_aneg);
 
+/**
+ * genphy_c45_soft_reset - software reset the PHY via Clause 45 PMA/PMD control register
+ * @phydev: target phy_device struct
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int genphy_c45_soft_reset(struct phy_device *phydev)
+{
+	int ret, val;
+
+	ret = phy_set_bits_mmd(phydev, MDIO_MMD_PMAPMD, MDIO_CTRL1,
+			       MDIO_CTRL1_RESET);
+	if (ret < 0)
+		return ret;
+
+	return phy_read_mmd_poll_timeout(phydev, MDIO_MMD_PMAPMD,
+					 MDIO_CTRL1, val,
+					 !(val & MDIO_CTRL1_RESET),
+					 5000, 600000, true);
+}
+EXPORT_SYMBOL_GPL(genphy_c45_soft_reset);
+
 /**
  * genphy_c45_aneg_done - return auto-negotiation complete status
  * @phydev: target phy_device struct
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 199a7aaa341b..25a66320df56 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -2309,6 +2309,7 @@ int genphy_c37_read_status(struct phy_device *phydev, bool *changed);
 /* Clause 45 PHY */
 int genphy_c45_restart_aneg(struct phy_device *phydev);
 int genphy_c45_check_and_restart_aneg(struct phy_device *phydev, bool restart);
+int genphy_c45_soft_reset(struct phy_device *phydev);
 int genphy_c45_aneg_done(struct phy_device *phydev);
 int genphy_c45_read_link(struct phy_device *phydev);
 int genphy_c45_read_lpa(struct phy_device *phydev);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v5 2/4] net: phy: c45: add genphy_c45_config_master_slave()
From: javen @ 2026-06-15  9:08 UTC (permalink / raw)
  To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni,
	freddy_gu, nb
  Cc: netdev, linux-kernel, daniel, vladimir.oltean, Javen Xu
In-Reply-To: <20260615090817.429-1-javen_xu@realsil.com.cn>

From: Javen Xu <javen_xu@realsil.com.cn>

Add a generic helper to configure forced master/slave mode for Clause 45
PHYs using the 10GBASE-T AN control register.

Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
---
Changes in v2:
 - no changes, new file

Changes in v3:
 - re-order function according to the order in phy-c45.c
 - add kernel-doc about return value
 - add MASTER_SLAVE_CFG_MASTER_PREFERRED,
   MASTER_SLAVE_CFG_SLAVE_PREFERRED, MASTER_SLAVE_CFG_UNKNOWN and
   MASTER_SLAVE_CFG_UNSUPPORTED cfg

Changes in v4:
 - no changes

Changes in v5:
 - move the genphy_c45_an_setup_master_slave() call into
   genphy_c45_config_aneg(), as the C22 code does.

Changes in v6:
 - add colon in the function description
 - add genphy_c45_read_master_slave in read function
---
 drivers/net/phy/phy-c45.c | 96 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/mdio.h |  5 ++
 2 files changed, 101 insertions(+)

diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
index 60d044156a83..4af532d5080a 100644
--- a/drivers/net/phy/phy-c45.c
+++ b/drivers/net/phy/phy-c45.c
@@ -406,6 +406,90 @@ int genphy_c45_soft_reset(struct phy_device *phydev)
 }
 EXPORT_SYMBOL_GPL(genphy_c45_soft_reset);
 
+/**
+ * genphy_c45_an_setup_master_slave - Configure Master/Slave setting for C45 PHYs
+ * @phydev: target phy_device struct
+ *
+ * Description: Configure the forced or preferred Master/Slave role in the
+ * 10GBASE-T control register (MMD 7, Register 0x0020) according to
+ * IEEE 802.3 standards.
+ *
+ * Return: negative errno code on failure, 0 if Master/Slave didn't change,
+ * or 1 if Master/Slave modes changed.
+ */
+static int genphy_c45_an_setup_master_slave(struct phy_device *phydev)
+{
+	u16 ctl = 0;
+
+	switch (phydev->master_slave_set) {
+	case MASTER_SLAVE_CFG_MASTER_PREFERRED:
+		ctl = MDIO_AN_10GBT_CTRL_MS_PORT_TYPE;
+		break;
+	case MASTER_SLAVE_CFG_SLAVE_PREFERRED:
+		break;
+	case MASTER_SLAVE_CFG_MASTER_FORCE:
+		ctl = MDIO_AN_10GBT_CTRL_MS_ENABLE | MDIO_AN_10GBT_CTRL_MS_VALUE;
+		break;
+	case MASTER_SLAVE_CFG_SLAVE_FORCE:
+		ctl = MDIO_AN_10GBT_CTRL_MS_ENABLE;
+		break;
+	case MASTER_SLAVE_CFG_UNKNOWN:
+	case MASTER_SLAVE_CFG_UNSUPPORTED:
+		return 0;
+	default:
+		phydev_warn(phydev, "Unsupported Master/Slave mode\n");
+		return -EOPNOTSUPP;
+	}
+
+	return phy_modify_mmd_changed(phydev, MDIO_MMD_AN, MDIO_AN_10GBT_CTRL,
+				      MDIO_AN_10GBT_CTRL_MS_ENABLE |
+				      MDIO_AN_10GBT_CTRL_MS_VALUE |
+				      MDIO_AN_10GBT_CTRL_MS_PORT_TYPE, ctl);
+}
+
+/**
+ * genphy_c45_read_master_slave - read master/slave status
+ * @phydev: target phy_device struct
+ *
+ * Description: Read the Master/Slave configuration and status
+ * from 10GBASE-T control/status registers (MMD 7, Reg 0x0020 and 0x0021).
+ *
+ * Return: 0 on success, or a negative error code on failure.
+ */
+static int genphy_c45_read_master_slave(struct phy_device *phydev)
+{
+	int val;
+
+	val = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_10GBT_CTRL);
+	if (val < 0)
+		return val;
+
+	if (val & MDIO_AN_10GBT_CTRL_MS_ENABLE) {
+		if (val & MDIO_AN_10GBT_CTRL_MS_VALUE)
+			phydev->master_slave_get = MASTER_SLAVE_CFG_MASTER_FORCE;
+		else
+			phydev->master_slave_get = MASTER_SLAVE_CFG_SLAVE_FORCE;
+	} else {
+		if (val & MDIO_AN_10GBT_CTRL_MS_PORT_TYPE)
+			phydev->master_slave_get = MASTER_SLAVE_CFG_MASTER_PREFERRED;
+		else
+			phydev->master_slave_get = MASTER_SLAVE_CFG_SLAVE_PREFERRED;
+	}
+
+	val = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_10GBT_STAT);
+	if (val < 0)
+		return val;
+
+	if (val & MDIO_AN_10GBT_STAT_MS_FAULT)
+		phydev->master_slave_state = MASTER_SLAVE_STATE_ERR;
+	else if (val & MDIO_AN_10GBT_STAT_MS_RES)
+		phydev->master_slave_state = MASTER_SLAVE_STATE_MASTER;
+	else
+		phydev->master_slave_state = MASTER_SLAVE_STATE_SLAVE;
+
+	return 0;
+}
+
 /**
  * genphy_c45_aneg_done - return auto-negotiation complete status
  * @phydev: target phy_device struct
@@ -1214,6 +1298,10 @@ int genphy_c45_read_status(struct phy_device *phydev)
 			ret = genphy_c45_baset1_read_status(phydev);
 			if (ret < 0)
 				return ret;
+		} else {
+			ret = genphy_c45_read_master_slave(phydev);
+			if (ret < 0)
+				return ret;
 		}
 
 		phy_resolve_aneg_linkmode(phydev);
@@ -1247,6 +1335,14 @@ int genphy_c45_config_aneg(struct phy_device *phydev)
 	if (ret > 0)
 		changed = true;
 
+	if (!genphy_c45_baset1_able(phydev)) {
+		ret = genphy_c45_an_setup_master_slave(phydev);
+		if (ret < 0)
+			return ret;
+		if (ret > 0)
+			changed = true;
+	}
+
 	return genphy_c45_check_and_restart_aneg(phydev, changed);
 }
 EXPORT_SYMBOL_GPL(genphy_c45_config_aneg);
diff --git a/include/uapi/linux/mdio.h b/include/uapi/linux/mdio.h
index b2541c948fc1..06f4bc3c20c7 100644
--- a/include/uapi/linux/mdio.h
+++ b/include/uapi/linux/mdio.h
@@ -332,8 +332,13 @@
 #define MDIO_AN_10GBT_CTRL_ADV2_5G	0x0080	/* Advertise 2.5GBASE-T */
 #define MDIO_AN_10GBT_CTRL_ADV5G	0x0100	/* Advertise 5GBASE-T */
 #define MDIO_AN_10GBT_CTRL_ADV10G	0x1000	/* Advertise 10GBASE-T */
+#define MDIO_AN_10GBT_CTRL_MS_ENABLE	0x8000	/* Master/slave manual config enable */
+#define MDIO_AN_10GBT_CTRL_MS_VALUE	0x4000	/* Master/slave config value (1=Master) */
+#define MDIO_AN_10GBT_CTRL_MS_PORT_TYPE	0x2000	/* Master Preferred Type */
 
 /* AN 10GBASE-T status register. */
+#define MDIO_AN_10GBT_STAT_MS_FAULT	0x8000	/* Master/slave fault */
+#define MDIO_AN_10GBT_STAT_MS_RES	0x4000	/* Master/slave resolution (1=Master) */
 #define MDIO_AN_10GBT_STAT_LP2_5G	0x0020  /* LP is 2.5GBT capable */
 #define MDIO_AN_10GBT_STAT_LP5G		0x0040  /* LP is 5GBT capable */
 #define MDIO_AN_10GBT_STAT_LPTRR	0x0200	/* LP training reset req. */
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next] docs: exclude driver and netdevsim bugs
From: Leon Romanovsky @ 2026-06-15  9:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, johannes,
	corbet, skhan, workflows, linux-doc
In-Reply-To: <20260603162943.2406080-1-kuba@kernel.org>

On Wed, Jun 03, 2026 at 09:29:43AM -0700, Jakub Kicinski wrote:
> Initial wave of AI-generated fixes was mostly for core and protocols
> we care about. But the number of irrelevant driver fixes is slowly
> increasing. Add a section of explicit exclusions to our maintainer
> profile.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: corbet@lwn.net
> CC: skhan@linuxfoundation.org
> CC: workflows@vger.kernel.org
> CC: linux-doc@vger.kernel.org
> ---
>  Documentation/process/maintainer-netdev.rst | 28 +++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst
> index ec7b9aa2877f..cc4b5fa3b5c1 100644
> --- a/Documentation/process/maintainer-netdev.rst
> +++ b/Documentation/process/maintainer-netdev.rst
> @@ -272,6 +272,34 @@ the case today. Please follow the standard stable rules in
>  :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`,
>  and make sure you include appropriate Fixes tags!
>  
> +Bug fixes
> +~~~~~~~~~
> +
> +Unless explicitly excluded all bug fixes should be targeting the ``net``
> +tree and contain an appropriate Fixes tag.
> +
> +Obvious exclusions:
> +
> + - fixes for bugs which only exist in ``net-next`` should target ``net-next``
> +   (please still include the Fixes tag in the commit message)
> + - bugs which cannot be reached, e.g. in code paths not executed given
> +   current in-tree callers
> + - fixes for compiler warnings and typos

If you decide to resubmit this patch, could you please remove "fixes for
compiler warnings" from the exclusion list?

It is quite frustrating to receive a compiler warning originating from a
different subsystem after the merge window, knowing it will not be
addressed until the next merge window (around eight weeks later).

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] selftests: net: do not detect PPPoX loopback
From: Matthieu Baerts @ 2026-06-15  9:15 UTC (permalink / raw)
  To: Qingfang Deng
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Shuah Khan, linux-ppp, netdev, linux-kselftest,
	linux-kernel
In-Reply-To: <20260603061746.23452-1-qingfang.deng@linux.dev>

Hi Qingfang,

On 03/06/2026 08:17, Qingfang Deng wrote:
> By default, pppd attempts to detect loopbacks on the underlying
> interface using a pseudo-randomly generated magic number and checks if
> the same value is received. The seed for the PRNG is a hash of hostname
> XOR current time XOR pid, which is likely to collide on NIPA, causing
> false positives. Disable magic number generation.

Thank you for the fix!

It looks like the test is no longer flaky [1], so I just unignored it on
NIPA.

[1] https://netdev.bots.linux.dev/contest.html?skip=0&test=pppol2tp-sh

Cheers,
Matt

^ permalink raw reply

* [PATCH nf v2] netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
From: Lorenzo Bianconi @ 2026-06-15  9:18 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	David Ahern, Ido Schimmel, Shuah Khan, Lorenzo Bianconi
  Cc: netfilter-devel, coreteam, netdev, linux-kselftest

Fix nf_flow_ip6_tunnel_proto() to use pskb_may_pull() instead of
skb_header_pointer() to ensure the outer IPv6 header is in the skb
linear area, which is required for subsequent packet processing. Move
ctx->offset update inside the IPPROTO_IPV6 conditional block since it
should only be adjusted when an IP6IP6 tunnel is actually detected.
Simplify the rx path by removing ipv6_skip_exthdr() and checking
ip6h->nexthdr directly, as the flowtable fast path only handles simple
IP6IP6 encapsulation without extension headers.
Drop the tunnel encapsulation limit destination option support from the
tx path to match, since the rx path no longer handles extension headers.
Remove the encap_limit parameter from nf_flow_offload_ipv6_forward(),
nf_flow_tunnel_ip6ip6_push() and nf_flow_tunnel_v6_push(), along with
the ipv6_tel_txoption struct and related headroom/MTU adjustments.

Fixes: d98103575dcdd ("netfilter: flowtable: Add IP6IP6 rx sw acceleration")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
Changes in v2:
- Drop tunnel encapsulation limit destination option support.
- Do not allow IPv6 extension headers in nf_flow_ip6_tunnel_proto().
- Link to v1: https://lore.kernel.org/r/20260608-b4-nf_flow_ip6_tunnel_proto-update-v1-1-782c7052c8fd@kernel.org
---
 net/ipv6/ip6_tunnel.c                              |  7 ++
 net/netfilter/nf_flow_table_ip.c                   | 80 +++++-----------------
 .../selftests/net/netfilter/nft_flowtable.sh       |  8 +--
 3 files changed, 30 insertions(+), 65 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 9d1037ac082f..bf1e77f95f18 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1850,6 +1850,13 @@ static int ip6_tnl_fill_forward_path(struct net_device_path_ctx *ctx,
 	struct dst_entry *dst;
 	int err;
 
+	if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) {
+		/* encaplimit option is currently not supported in the
+		 * sw-acceleration path.
+		 */
+		return -EOPNOTSUPP;
+	}
+
 	dst = ip6_route_output(dev_net(ctx->dev), NULL, &fl6);
 	if (!dst->error) {
 		path->type = DEV_PATH_TUN;
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 9c05a50d6013..e7a3fb2b2d94 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -347,29 +347,23 @@ static bool nf_flow_ip6_tunnel_proto(struct nf_flowtable_ctx *ctx,
 				     struct sk_buff *skb)
 {
 #if IS_ENABLED(CONFIG_IPV6)
-	struct ipv6hdr *ip6h, _ip6h;
-	__be16 frag_off;
-	u8 nexthdr;
-	int hdrlen;
+	struct ipv6hdr *ip6h;
 
-	ip6h = skb_header_pointer(skb, ctx->offset, sizeof(*ip6h), &_ip6h);
-	if (!ip6h)
+	if (!pskb_may_pull(skb, sizeof(*ip6h) + ctx->offset))
 		return false;
 
+	ip6h = (struct ipv6hdr *)(skb_network_header(skb) + ctx->offset);
 	if (ip6h->hop_limit <= 1)
 		return false;
 
-	nexthdr = ip6h->nexthdr;
-	hdrlen = ipv6_skip_exthdr(skb, sizeof(*ip6h) + ctx->offset, &nexthdr,
-				  &frag_off);
-	if (hdrlen < 0)
+	if (ipv6_ext_hdr(ip6h->nexthdr))
 		return false;
 
-	if (nexthdr == IPPROTO_IPV6) {
-		ctx->tun.hdr_size = hdrlen;
-		ctx->tun.proto = IPPROTO_IPV6;
+	if (ip6h->nexthdr == IPPROTO_IPV6) {
+		ctx->tun.proto = ip6h->nexthdr;
+		ctx->tun.hdr_size = sizeof(*ip6h);
+		ctx->offset += ctx->tun.hdr_size;
 	}
-	ctx->offset += ctx->tun.hdr_size;
 
 	return true;
 #else
@@ -648,25 +642,19 @@ static int nf_flow_tunnel_v4_push(struct net *net, struct sk_buff *skb,
 	return 0;
 }
 
-struct ipv6_tel_txoption {
-	struct ipv6_txoptions ops;
-	__u8 dst_opt[8];
-};
-
 static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 				      struct flow_offload_tuple *tuple,
-				      struct in6_addr **ip6_daddr,
-				      int encap_limit)
+				      struct in6_addr **ip6_daddr)
 {
 	struct ipv6hdr *ip6h = (struct ipv6hdr *)skb_network_header(skb);
-	u8 hop_limit = ip6h->hop_limit, proto = IPPROTO_IPV6;
 	struct rtable *rt = dst_rtable(tuple->dst_cache);
 	__u8 dsfield = ipv6_get_dsfield(ip6h);
 	struct flowi6 fl6 = {
 		.daddr = tuple->tun.src_v6,
 		.saddr = tuple->tun.dst_v6,
-		.flowi6_proto = proto,
+		.flowi6_proto = IPPROTO_IPV6,
 	};
+	u8 hop_limit = ip6h->hop_limit;
 	int err, mtu;
 	u32 headroom;
 
@@ -674,41 +662,18 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 	if (err)
 		return err;
 
-	skb_set_inner_ipproto(skb, proto);
+	skb_set_inner_ipproto(skb, IPPROTO_IPV6);
 	headroom = sizeof(*ip6h) + LL_RESERVED_SPACE(rt->dst.dev) +
 		   rt->dst.header_len;
-	if (encap_limit)
-		headroom += 8;
 	err = skb_cow_head(skb, headroom);
 	if (err)
 		return err;
 
 	skb_scrub_packet(skb, true);
 	mtu = dst_mtu(&rt->dst) - sizeof(*ip6h);
-	if (encap_limit)
-		mtu -= 8;
 	mtu = max(mtu, IPV6_MIN_MTU);
 	skb_dst_update_pmtu_no_confirm(skb, mtu);
 
-	if (encap_limit > 0) {
-		struct ipv6_tel_txoption opt = {
-			.dst_opt[2] = IPV6_TLV_TNL_ENCAP_LIMIT,
-			.dst_opt[3] = 1,
-			.dst_opt[4] = encap_limit,
-			.dst_opt[5] = IPV6_TLV_PADN,
-			.dst_opt[6] = 1,
-		};
-		struct ipv6_opt_hdr *hopt;
-
-		opt.ops.dst1opt = (struct ipv6_opt_hdr *)opt.dst_opt;
-		opt.ops.opt_nflen = 8;
-
-		hopt = skb_push(skb, ipv6_optlen(opt.ops.dst1opt));
-		memcpy(hopt, opt.ops.dst1opt, ipv6_optlen(opt.ops.dst1opt));
-		hopt->nexthdr = IPPROTO_IPV6;
-		proto = NEXTHDR_DEST;
-	}
-
 	skb_push(skb, sizeof(*ip6h));
 	skb_reset_network_header(skb);
 
@@ -716,7 +681,7 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 	ip6_flow_hdr(ip6h, dsfield,
 		     ip6_make_flowlabel(net, skb, fl6.flowlabel, true, &fl6));
 	ip6h->hop_limit = hop_limit;
-	ip6h->nexthdr = proto;
+	ip6h->nexthdr = IPPROTO_IPV6;
 	ip6h->daddr = tuple->tun.src_v6;
 	ip6h->saddr = tuple->tun.dst_v6;
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(*ip6h));
@@ -729,12 +694,10 @@ static int nf_flow_tunnel_ip6ip6_push(struct net *net, struct sk_buff *skb,
 
 static int nf_flow_tunnel_v6_push(struct net *net, struct sk_buff *skb,
 				  struct flow_offload_tuple *tuple,
-				  struct in6_addr **ip6_daddr,
-				  int encap_limit)
+				  struct in6_addr **ip6_daddr)
 {
 	if (tuple->tun_num)
-		return nf_flow_tunnel_ip6ip6_push(net, skb, tuple, ip6_daddr,
-						  encap_limit);
+		return nf_flow_tunnel_ip6ip6_push(net, skb, tuple, ip6_daddr);
 
 	return 0;
 }
@@ -1089,7 +1052,7 @@ static int nf_flow_tuple_ipv6(struct nf_flowtable_ctx *ctx, struct sk_buff *skb,
 static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 					struct nf_flowtable *flow_table,
 					struct flow_offload_tuple_rhash *tuplehash,
-					struct sk_buff *skb, int encap_limit)
+					struct sk_buff *skb)
 {
 	enum flow_offload_tuple_dir dir;
 	struct flow_offload *flow;
@@ -1100,11 +1063,8 @@ static int nf_flow_offload_ipv6_forward(struct nf_flowtable_ctx *ctx,
 	flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
 
 	mtu = flow->tuplehash[dir].tuple.mtu + ctx->offset;
-	if (flow->tuplehash[!dir].tuple.tun_num) {
+	if (flow->tuplehash[!dir].tuple.tun_num)
 		mtu -= sizeof(*ip6h);
-		if (encap_limit > 0)
-			mtu -= 8; /* encap limit option */
-	}
 
 	if (unlikely(nf_flow_exceeds_mtu(skb, mtu)))
 		return 0;
@@ -1158,7 +1118,6 @@ unsigned int
 nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 			  const struct nf_hook_state *state)
 {
-	int encap_limit = IPV6_DEFAULT_TNL_ENCAP_LIMIT;
 	struct flow_offload_tuple_rhash *tuplehash;
 	struct nf_flowtable *flow_table = priv;
 	struct flow_offload_tuple *other_tuple;
@@ -1177,8 +1136,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	if (tuplehash == NULL)
 		return NF_ACCEPT;
 
-	ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash, skb,
-					   encap_limit);
+	ret = nf_flow_offload_ipv6_forward(&ctx, flow_table, tuplehash, skb);
 	if (ret < 0)
 		return NF_DROP;
 	else if (ret == 0)
@@ -1198,7 +1156,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
 	ip6_daddr = &other_tuple->src_v6;
 
 	if (nf_flow_tunnel_v6_push(state->net, skb, other_tuple,
-				   &ip6_daddr, encap_limit) < 0)
+				   &ip6_daddr) < 0)
 		return NF_DROP;
 
 	switch (tuplehash->tuple.xmit_type) {
diff --git a/tools/testing/selftests/net/netfilter/nft_flowtable.sh b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
index 7a34ef468975..08ad07500e8a 100755
--- a/tools/testing/selftests/net/netfilter/nft_flowtable.sh
+++ b/tools/testing/selftests/net/netfilter/nft_flowtable.sh
@@ -592,7 +592,7 @@ ip -net "$nsr1" link set tun0 up
 ip -net "$nsr1" addr add 192.168.100.1/24 dev tun0
 ip netns exec "$nsr1" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
 
-ip -net "$nsr1" link add name tun6 type ip6tnl local fee1:2::1 remote fee1:2::2
+ip -net "$nsr1" link add name tun6 type ip6tnl local fee1:2::1 remote fee1:2::2 encaplimit none
 ip -net "$nsr1" link set tun6 up
 ip -net "$nsr1" addr add fee1:3::1/64 dev tun6 nodad
 
@@ -601,7 +601,7 @@ ip -net "$nsr2" link set tun0 up
 ip -net "$nsr2" addr add 192.168.100.2/24 dev tun0
 ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0.forwarding=1 > /dev/null
 
-ip -net "$nsr2" link add name tun6 type ip6tnl local fee1:2::2 remote fee1:2::1 || ret=1
+ip -net "$nsr2" link add name tun6 type ip6tnl local fee1:2::2 remote fee1:2::1 encaplimit none || ret=1
 ip -net "$nsr2" link set tun6 up
 ip -net "$nsr2" addr add fee1:3::2/64 dev tun6 nodad
 
@@ -651,7 +651,7 @@ ip -net "$nsr1" route change default via 192.168.200.2
 ip netns exec "$nsr1" sysctl net.ipv4.conf.tun0/10.forwarding=1 > /dev/null
 ip netns exec "$nsr1" nft -a insert rule inet filter forward 'meta oif tun0.10 accept'
 
-ip -net "$nsr1" link add name tun6.10 type ip6tnl local fee1:4::1 remote fee1:4::2
+ip -net "$nsr1" link add name tun6.10 type ip6tnl local fee1:4::1 remote fee1:4::2 encaplimit none
 ip -net "$nsr1" link set tun6.10 up
 ip -net "$nsr1" addr add fee1:5::1/64 dev tun6.10 nodad
 ip -6 -net "$nsr1" route delete default
@@ -670,7 +670,7 @@ ip -net "$nsr2" addr add 192.168.200.2/24 dev tun0.10
 ip -net "$nsr2" route change default via 192.168.200.1
 ip netns exec "$nsr2" sysctl net.ipv4.conf.tun0/10.forwarding=1 > /dev/null
 
-ip -net "$nsr2" link add name tun6.10 type ip6tnl local fee1:4::2 remote fee1:4::1 || ret=1
+ip -net "$nsr2" link add name tun6.10 type ip6tnl local fee1:4::2 remote fee1:4::1 encaplimit none || ret=1
 ip -net "$nsr2" link set tun6.10 up
 ip -net "$nsr2" addr add fee1:5::2/64 dev tun6.10 nodad
 ip -6 -net "$nsr2" route delete default

---
base-commit: 1fad1796b9411217fa77b6a497ed76b999205487
change-id: 20260608-b4-nf_flow_ip6_tunnel_proto-update-8b64903825b4

Best regards,
-- 
Lorenzo Bianconi <lorenzo@kernel.org>



* Re: [PATCH net-next 09/11] netfilter: flowtable: bail out if forward path cannot be discovered
From: Lorenzo Bianconi @ 2026-06-15  9:27 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260614114605.474783-10-pablo@netfilter.org>

> If forward path discovery fails for any reason or netdevice is not
> registered for this flowtable, then bail out to classic forwarding path
> rather than providing incomplete forwarding path.
> 
> Update the existing forward path parser functions to report an error
> so the flow_offload expression gives up on setting up the flowtable
> entry.
> 
> Link: https://sashiko.dev/#/patchset/20260607094954.48892-15-pablo%40netfilter.org?part=14
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Tested-by: Lorenzo Bianconi <lorenzo@kernel.org>

> ---
>  net/netfilter/nf_flow_table_path.c | 81 +++++++++++++++++-------------
>  1 file changed, 46 insertions(+), 35 deletions(-)
> 
> diff --git a/net/netfilter/nf_flow_table_path.c b/net/netfilter/nf_flow_table_path.c
> index a3e6b82f2f8e..1e7e216b9f89 100644
> --- a/net/netfilter/nf_flow_table_path.c
> +++ b/net/netfilter/nf_flow_table_path.c
> @@ -90,9 +90,9 @@ struct nft_forward_info {
>  	enum flow_offload_xmit_type xmit_type;
>  };
>  
> -static void nft_dev_path_info(const struct net_device_path_stack *stack,
> -			      struct nft_forward_info *info,
> -			      unsigned char *ha, struct nf_flowtable *flowtable)
> +static int nft_dev_path_info(const struct net_device_path_stack *stack,
> +			     struct nft_forward_info *info,
> +			     unsigned char *ha, struct nf_flowtable *flowtable)
>  {
>  	const struct net_device_path *path;
>  	int i;
> @@ -120,19 +120,17 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
>  
>  			/* DEV_PATH_VLAN, DEV_PATH_PPPOE and DEV_PATH_TUN */
>  			if (path->type == DEV_PATH_TUN) {
> -				if (info->num_tuns) {
> -					info->indev = NULL;
> -					break;
> -				}
> +				if (info->num_tuns)
> +					return -1;
> +
>  				info->tun.src_v6 = path->tun.src_v6;
>  				info->tun.dst_v6 = path->tun.dst_v6;
>  				info->tun.l3_proto = path->tun.l3_proto;
>  				info->num_tuns++;
>  			} else {
> -				if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX) {
> -					info->indev = NULL;
> -					break;
> -				}
> +				if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX)
> +					return -1;
> +
>  				info->encap[info->num_encaps].id =
>  					path->encap.id;
>  				info->encap[info->num_encaps].proto =
> @@ -151,22 +149,23 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
>  
>  			switch (path->bridge.vlan_mode) {
>  			case DEV_PATH_BR_VLAN_UNTAG_HW:
> +				if (info->num_encaps == 0)
> +					return -1;
> +
>  				info->ingress_vlans |= BIT(info->num_encaps - 1);
>  				break;
>  			case DEV_PATH_BR_VLAN_TAG:
> -				if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX) {
> -					info->indev = NULL;
> -					break;
> -				}
> +				if (info->num_encaps >= NF_FLOW_TABLE_ENCAP_MAX)
> +					return -1;
> +
>  				info->encap[info->num_encaps].id = path->bridge.vlan_id;
>  				info->encap[info->num_encaps].proto = path->bridge.vlan_proto;
>  				info->num_encaps++;
>  				break;
>  			case DEV_PATH_BR_VLAN_UNTAG:
> -				if (info->num_encaps == 0) {
> -					info->indev = NULL;
> -					break;
> -				}
> +				if (info->num_encaps == 0)
> +					return -1;
> +
>  				info->num_encaps--;
>  				break;
>  			case DEV_PATH_BR_VLAN_KEEP:
> @@ -175,8 +174,7 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
>  			info->xmit_type = FLOW_OFFLOAD_XMIT_DIRECT;
>  			break;
>  		default:
> -			info->indev = NULL;
> -			break;
> +			return -1;
>  		}
>  	}
>  	info->outdev = info->indev;
> @@ -184,6 +182,8 @@ static void nft_dev_path_info(const struct net_device_path_stack *stack,
>  	if (nf_flowtable_hw_offload(flowtable) &&
>  	    nft_is_valid_ether_device(info->indev))
>  		info->xmit_type = FLOW_OFFLOAD_XMIT_DIRECT;
> +
> +	return 0;
>  }
>  
>  static bool nft_flowtable_find_dev(const struct net_device *dev,
> @@ -241,11 +241,11 @@ static int nft_flow_tunnel_update_route(const struct nft_pktinfo *pkt,
>  	return 0;
>  }
>  
> -static void nft_dev_forward_path(const struct nft_pktinfo *pkt,
> -				 struct nf_flow_route *route,
> -				 const struct nf_conn *ct,
> -				 enum ip_conntrack_dir dir,
> -				 struct nft_flowtable *ft)
> +static int nft_dev_forward_path(const struct nft_pktinfo *pkt,
> +				struct nf_flow_route *route,
> +				const struct nf_conn *ct,
> +				enum ip_conntrack_dir dir,
> +				struct nft_flowtable *ft)
>  {
>  	const struct dst_entry *dst = route->tuple[dir].dst;
>  	struct net_device_path_stack stack;
> @@ -253,15 +253,16 @@ static void nft_dev_forward_path(const struct nft_pktinfo *pkt,
>  	unsigned char ha[ETH_ALEN];
>  	int i;
>  
> -	if (nft_dev_fill_forward_path(route, dst, ct, dir, ha, &stack) >= 0)
> -		nft_dev_path_info(&stack, &info, ha, &ft->data);
> +	if (nft_dev_fill_forward_path(route, dst, ct, dir, ha, &stack) < 0 ||
> +	    nft_dev_path_info(&stack, &info, ha, &ft->data) < 0)
> +		return -ENOENT;
> +
> +	if (!nft_flowtable_find_dev(info.indev, ft))
> +		return -ENOENT;
>  
>  	if (info.outdev)
>  		route->tuple[dir].out.ifindex = info.outdev->ifindex;
>  
> -	if (!info.indev || !nft_flowtable_find_dev(info.indev, ft))
> -		return;
> -
>  	route->tuple[!dir].in.ifindex = info.indev->ifindex;
>  	for (i = 0; i < info.num_encaps; i++) {
>  		route->tuple[!dir].in.encap[i].id = info.encap[i].id;
> @@ -285,6 +286,8 @@ static void nft_dev_forward_path(const struct nft_pktinfo *pkt,
>  		route->tuple[dir].xmit_type = info.xmit_type;
>  	}
>  	route->tuple[dir].out.needs_gso_segment = info.needs_gso_segment;
> +
> +	return 0;
>  }
>  
>  int nft_flow_route(const struct nft_pktinfo *pkt, const struct nf_conn *ct,
> @@ -329,11 +332,19 @@ int nft_flow_route(const struct nft_pktinfo *pkt, const struct nf_conn *ct,
>  	nft_default_forward_path(route, this_dst, dir);
>  	nft_default_forward_path(route, other_dst, !dir);
>  
> -	if (route->tuple[dir].xmit_type	== FLOW_OFFLOAD_XMIT_NEIGH)
> -		nft_dev_forward_path(pkt, route, ct, dir, ft);
> -	if (route->tuple[!dir].xmit_type == FLOW_OFFLOAD_XMIT_NEIGH)
> -		nft_dev_forward_path(pkt, route, ct, !dir, ft);
> +	if (route->tuple[dir].xmit_type	== FLOW_OFFLOAD_XMIT_NEIGH &&
> +	    nft_dev_forward_path(pkt, route, ct, dir, ft) < 0)
> +		goto err_dst_release;
> +
> +	if (route->tuple[!dir].xmit_type == FLOW_OFFLOAD_XMIT_NEIGH &&
> +	    nft_dev_forward_path(pkt, route, ct, !dir, ft) < 0)
> +		goto err_dst_release;
>  
>  	return 0;
> +
> +err_dst_release:
> +	dst_release(route->tuple[dir].dst);
> +	dst_release(route->tuple[!dir].dst);
> +	return -ENOENT;
>  }
>  EXPORT_SYMBOL_GPL(nft_flow_route);
> -- 
> 2.47.3
> 
> 


* Re: [PATCH v5 5/9] block: implement NVMEM provider
From: Loic Poulain @ 2026-06-15  9:28 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Marcel Holtmann, Luiz Augusto von Dentz,
	Balakrishna Godavarthi, Rocky Liao, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Srinivas Kandagatla,
	Andrew Lunn, Heiner Kallweit, Russell King, Saravana Kannan
In-Reply-To: <CAMRc=McQkLnz2OS2RREAbcrsp47cL-W3bCduq8LwPBBUcVNyJw@mail.gmail.com>

On Mon, Jun 15, 2026 at 10:53 AM Bartosz Golaszewski <brgl@kernel.org> wrote:
>
> On Fri, 12 Jun 2026 15:20:57 +0200, Loic Poulain
> <loic.poulain@oss.qualcomm.com> said:
> > From: Daniel Golle <daniel@makrotopia.org>
> >
> > On embedded devices using an eMMC it is common that one or more partitions
> > on the eMMC are used to store MAC addresses and Wi-Fi calibration EEPROM
> > data. Allow referencing the partition in device tree for the kernel and
> > Wi-Fi drivers accessing it via the NVMEM layer.
> >
> > For now, NVMEM is only registered for the whole disk block device, as the
> > OF node is currently only associated to it.
> >
> > Signed-off-by: Daniel Golle <daniel@makrotopia.org>
> > Co-developed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > ---
> >  block/Kconfig             |   9 ++++
> >  block/Makefile            |   1 +
> >  block/blk-nvmem.c         | 109 ++++++++++++++++++++++++++++++++++++++++++++++
> >  block/blk.h               |   8 ++++
> >  block/genhd.c             |   4 ++
> >  include/linux/blk_types.h |   3 ++
> >  include/linux/blkdev.h    |   1 +
> >  7 files changed, 135 insertions(+)
> >
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 15027963472d7b40e27b9097a5993c457b5b3054..0b33747e16dc33473683706f75c92bdf8b648f7c 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -209,6 +209,15 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
> >         by falling back to the kernel crypto API when inline
> >         encryption hardware is not present.
> >
> > +config BLK_NVMEM
> > +     bool "Block device NVMEM provider"
> > +     depends on OF
> > +     depends on NVMEM
> > +     help
> > +       Allow block devices (or partitions) to act as NVMEM providers,
> > +       typically used with eMMC to store MAC addresses or Wi-Fi
> > +       calibration data on embedded devices.
> > +
> >  source "block/partitions/Kconfig"
> >
> >  config BLK_PM
> > diff --git a/block/Makefile b/block/Makefile
> > index 7dce2e44276c4274c11a0a61121c83d9c43d6e0c..d7ac389e71902bc091a8800ea266190a43b3e63d 100644
> > --- a/block/Makefile
> > +++ b/block/Makefile
> > @@ -36,3 +36,4 @@ obj-$(CONFIG_BLK_INLINE_ENCRYPTION) += blk-crypto.o blk-crypto-profile.o \
> >                                          blk-crypto-sysfs.o
> >  obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) += blk-crypto-fallback.o
> >  obj-$(CONFIG_BLOCK_HOLDER_DEPRECATED)        += holder.o
> > +obj-$(CONFIG_BLK_NVMEM)                += blk-nvmem.o
> > diff --git a/block/blk-nvmem.c b/block/blk-nvmem.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..c005f059d9fe56242ebaef9905673dff902b5686
> > --- /dev/null
> > +++ b/block/blk-nvmem.c
> > @@ -0,0 +1,109 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * block device NVMEM provider
> > + *
> > + * Copyright (c) 2024 Daniel Golle <daniel@makrotopia.org>
> > + * Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
> > + *
> > + * Useful on devices using a partition on an eMMC for MAC addresses or
> > + * Wi-Fi calibration EEPROM data.
> > + */
> > +
> > +#include <linux/file.h>
> > +#include <linux/nvmem-provider.h>
> > +#include <linux/nvmem-consumer.h>
> > +#include <linux/of.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/property.h>
> > +
> > +#include "blk.h"
> > +
> > +static int blk_nvmem_reg_read(void *priv, unsigned int from, void *val, size_t bytes)
> > +{
> > +     blk_mode_t mode = BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES;
> > +     dev_t devt = (dev_t)(uintptr_t)priv;
> > +     size_t bytes_left = bytes;
> > +     loff_t pos = from;
> > +     int ret = 0;
> > +
> > +     struct file *bdev_file __free(fput) = bdev_file_open_by_dev(devt, mode, priv, NULL);
> > +     if (IS_ERR(bdev_file))
> > +             return PTR_ERR(bdev_file);
> > +
> > +     while (bytes_left) {
> > +             pgoff_t f_index = pos >> PAGE_SHIFT;
> > +             struct folio *folio;
> > +             size_t folio_off;
> > +             size_t to_read;
> > +
> > +             folio = read_mapping_folio(bdev_file->f_mapping, f_index, NULL);
> > +             if (IS_ERR(folio)) {
> > +                     ret = PTR_ERR(folio);
> > +                     break;
> > +             }
> > +
> > +             folio_off = offset_in_folio(folio, pos);
> > +             to_read = min(bytes_left, folio_size(folio) - folio_off);
> > +             memcpy_from_folio(val, folio, folio_off, to_read);
> > +             pos += to_read;
> > +             bytes_left -= to_read;
> > +             val += to_read;
> > +             folio_put(folio);
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +void blk_nvmem_add(struct block_device *bdev)
> > +{
> > +     struct device *dev = &bdev->bd_device;
> > +     struct nvmem_config config = {};
> > +
> > +     /* skip devices which do not have a device tree node */
> > +     if (!dev_of_node(dev))
> > +             return;
> > +
> > +     /* skip devices without an nvmem layout defined */
> > +     struct device_node *child __free(device_node) =
> > +             of_get_child_by_name(dev_of_node(dev), "nvmem-layout");
> > +     if (!child)
> > +             return;
> > +
> > +     /*
> > +      * skip block device too large to be represented as NVMEM devices,
> > +      * the NVMEM reg_read callback uses an unsigned int offset
> > +      */
> > +     if (bdev_nr_bytes(bdev) > UINT_MAX) {
> > +             dev_warn(dev, "block device too large to be an NVMEM provider\n");
> > +             return;
> > +     }
> > +
> > +     config.id = NVMEM_DEVID_NONE;
> > +     config.dev = dev;
> > +     config.name = dev_name(dev);
> > +     config.owner = THIS_MODULE;
> > +     config.priv = (void *)(uintptr_t)dev->devt;
> > +     config.reg_read = blk_nvmem_reg_read;
> > +     config.size = bdev_nr_bytes(bdev);
> > +     config.word_size = 1;
> > +     config.stride = 1;
> > +     config.read_only = true;
> > +     config.root_only = true;
> > +     config.ignore_wp = true;
> > +     config.of_node = to_of_node(dev->fwnode);
> > +
> > +     bdev->bd_nvmem = nvmem_register(&config);
> > +     if (IS_ERR(bdev->bd_nvmem)) {
> > +             dev_err_probe(dev, PTR_ERR(bdev->bd_nvmem),
> > +                           "Failed to register NVMEM device\n");
>
> Using dev_err_probe() only makes sense with a return value. Which makes me
> think: we won't retry this after a probe deferral. I think we should return

Yes. With the nvmem fixed-layout there is no way to get a deferred probe
error here, but it is better to be ready to handle one anyway.

> int from this function just for this use-case. Also: if we *do* have
> a layout, shouldn't we treat a failure to register the nvmem provider as
> an error and propagate it up the stack?

From an API perspective we should indeed return the error. From the block
core's side, though, do we want to fail the entire disk addition just because
the 'companion' NVMEM provider couldn't be registered, or should we only
abort/return in case of EPROBE_DEFER?
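
To make the second option concrete, here is a minimal userspace sketch of
such a policy: only a probe deferral is handed back to the caller, any other
registration failure is logged and swallowed so the disk is still added.
blk_nvmem_filter_err() is a hypothetical helper, not part of this series, and
EPROBE_DEFER is defined locally because it is a kernel-internal errno (517)
that userspace headers do not provide. Whether the call site can actually act
on the returned value is a separate question.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Kernel-internal errno, absent from userspace <errno.h>. */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

/*
 * Hypothetical policy helper: 'err' is 0 or a negative errno from
 * registering the companion NVMEM provider. The result is what the
 * block core caller would see.
 */
static int blk_nvmem_filter_err(int err)
{
	assert(err <= 0);

	/* Success and probe deferral are passed through unchanged. */
	if (err == 0 || err == -EPROBE_DEFER)
		return err;

	/* Any other failure is reported but does not fail the disk. */
	fprintf(stderr,
		"NVMEM registration failed (%d), continuing without it\n",
		err);
	return 0;
}
```

Under this policy -ENOMEM from the registration step becomes 0 for the
caller, while -EPROBE_DEFER is returned as is.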

>
> > +             bdev->bd_nvmem = NULL;
> > +     }
> > +}
> > +
> > +void blk_nvmem_del(struct block_device *bdev)
> > +{
> > +     if (bdev->bd_nvmem)
>
> Nvmem core already performs a NULL check.

Ok, thanks!


>
> > +             nvmem_unregister(bdev->bd_nvmem);
> > +
> > +     bdev->bd_nvmem = NULL;
> > +}
> > diff --git a/block/blk.h b/block/blk.h
> > index ec4674cdf2ead4fd259ff5fc42401f591e684ee9..cd3c7ca723391c40be56f1dd4810e641b7c8a2b3 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -757,4 +757,12 @@ static inline void blk_debugfs_unlock(struct request_queue *q,
> >       memalloc_noio_restore(memflags);
> >  }
> >
> > +#ifdef CONFIG_BLK_NVMEM
> > +void blk_nvmem_add(struct block_device *bdev);
> > +void blk_nvmem_del(struct block_device *bdev);
> > +#else
> > +static inline void blk_nvmem_add(struct block_device *bdev) {}
> > +static inline void blk_nvmem_del(struct block_device *bdev) {}
> > +#endif
> > +
> >  #endif /* BLK_INTERNAL_H */
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 7d6854fd28e95ae9134309679a7c6a937f5b7db8..1b2382de6fb30c1e5f60f45c04dc03ed3bf5d5f2 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -421,6 +421,8 @@ static void add_disk_final(struct gendisk *disk)
> >                */
> >               dev_set_uevent_suppress(ddev, 0);
> >               disk_uevent(disk, KOBJ_ADD);
> > +
> > +             blk_nvmem_add(disk->part0);
> >       }
> >
> >       blk_apply_bdi_limits(disk->bdi, &disk->queue->limits);
> > @@ -704,6 +706,8 @@ static void __del_gendisk(struct gendisk *disk)
> >
> >       disk_del_events(disk);
> >
> > +     blk_nvmem_del(disk->part0);
> > +
> >       /*
> >        * Prevent new openers by unlinked the bdev inode.
> >        */
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 8808ee76e73c09e0ceaac41ba59e86fb0c4efc64..ace6f59b860d0813665b2f62a1c03a1f4be94059 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -73,6 +73,9 @@ struct block_device {
> >       int                     bd_writers;
> >  #ifdef CONFIG_SECURITY
> >       void                    *bd_security;
> > +#endif
> > +#ifdef CONFIG_BLK_NVMEM
> > +     struct nvmem_device     *bd_nvmem;
> >  #endif
> >       /*
> >        * keep this out-of-line as it's both big and not needed in the fast
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 890128cdea1ce66863c5baa36f3b336ec4550807..f15d2b5bf9e4fd2368b8a70416a978e22c0d4333 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -30,6 +30,7 @@
> >
> >  struct module;
> >  struct request_queue;
> > +struct nvmem_device;
> >  struct elevator_queue;
> >  struct blk_trace;
> >  struct request;
> >
> > --
> > 2.34.1
> >
> >
>
> I like this approach better than the previous one.
>
> Thanks,
> Bartosz


* Re: [PATCH v5 5/9] block: implement NVMEM provider
From: Loic Poulain @ 2026-06-15  9:33 UTC (permalink / raw)
  To: Bartosz Golaszewski
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Marcel Holtmann, Luiz Augusto von Dentz,
	Balakrishna Godavarthi, Rocky Liao, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Srinivas Kandagatla,
	Andrew Lunn, Heiner Kallweit, Russell King, Saravana Kannan
In-Reply-To: <CAFEp6-0qsqhcwnSjm3=bG21jsCktzn5-L5sk2pNTZcGuVXaiNA@mail.gmail.com>

On Mon, Jun 15, 2026 at 11:28 AM Loic Poulain
<loic.poulain@oss.qualcomm.com> wrote:
>
> On Mon, Jun 15, 2026 at 10:53 AM Bartosz Golaszewski <brgl@kernel.org> wrote:
> >
> > On Fri, 12 Jun 2026 15:20:57 +0200, Loic Poulain
> > <loic.poulain@oss.qualcomm.com> said:
> > > From: Daniel Golle <daniel@makrotopia.org>
> > >
> > > On embedded devices using an eMMC it is common that one or more partitions
> > > on the eMMC are used to store MAC addresses and Wi-Fi calibration EEPROM
> > > data. Allow referencing the partition in device tree for the kernel and
> > > Wi-Fi drivers accessing it via the NVMEM layer.
> > >
> > > For now, NVMEM is only registered for the whole disk block device, as the
> > > OF node is currently only associated to it.
> > >
> > > Signed-off-by: Daniel Golle <daniel@makrotopia.org>
> > > Co-developed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > > ---
> > >  block/Kconfig             |   9 ++++
> > >  block/Makefile            |   1 +
> > >  block/blk-nvmem.c         | 109 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  block/blk.h               |   8 ++++
> > >  block/genhd.c             |   4 ++
> > >  include/linux/blk_types.h |   3 ++
> > >  include/linux/blkdev.h    |   1 +
> > >  7 files changed, 135 insertions(+)
> > >
> > > diff --git a/block/Kconfig b/block/Kconfig
> > > index 15027963472d7b40e27b9097a5993c457b5b3054..0b33747e16dc33473683706f75c92bdf8b648f7c 100644
> > > --- a/block/Kconfig
> > > +++ b/block/Kconfig
> > > @@ -209,6 +209,15 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
> > >         by falling back to the kernel crypto API when inline
> > >         encryption hardware is not present.
> > >
> > > +config BLK_NVMEM
> > > +     bool "Block device NVMEM provider"
> > > +     depends on OF
> > > +     depends on NVMEM
> > > +     help
> > > +       Allow block devices (or partitions) to act as NVMEM providers,
> > > +       typically used with eMMC to store MAC addresses or Wi-Fi
> > > +       calibration data on embedded devices.
> > > +
> > >  source "block/partitions/Kconfig"
> > >
> > >  config BLK_PM
> > > diff --git a/block/Makefile b/block/Makefile
> > > index 7dce2e44276c4274c11a0a61121c83d9c43d6e0c..d7ac389e71902bc091a8800ea266190a43b3e63d 100644
> > > --- a/block/Makefile
> > > +++ b/block/Makefile
> > > @@ -36,3 +36,4 @@ obj-$(CONFIG_BLK_INLINE_ENCRYPTION) += blk-crypto.o blk-crypto-profile.o \
> > >                                          blk-crypto-sysfs.o
> > >  obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) += blk-crypto-fallback.o
> > >  obj-$(CONFIG_BLOCK_HOLDER_DEPRECATED)        += holder.o
> > > +obj-$(CONFIG_BLK_NVMEM)                += blk-nvmem.o
> > > diff --git a/block/blk-nvmem.c b/block/blk-nvmem.c
> > > new file mode 100644
> > > index 0000000000000000000000000000000000000000..c005f059d9fe56242ebaef9905673dff902b5686
> > > --- /dev/null
> > > +++ b/block/blk-nvmem.c
> > > @@ -0,0 +1,109 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * block device NVMEM provider
> > > + *
> > > + * Copyright (c) 2024 Daniel Golle <daniel@makrotopia.org>
> > > + * Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
> > > + *
> > > + * Useful on devices using a partition on an eMMC for MAC addresses or
> > > + * Wi-Fi calibration EEPROM data.
> > > + */
> > > +
> > > +#include <linux/file.h>
> > > +#include <linux/nvmem-provider.h>
> > > +#include <linux/nvmem-consumer.h>
> > > +#include <linux/of.h>
> > > +#include <linux/pagemap.h>
> > > +#include <linux/property.h>
> > > +
> > > +#include "blk.h"
> > > +
> > > +static int blk_nvmem_reg_read(void *priv, unsigned int from, void *val, size_t bytes)
> > > +{
> > > +     blk_mode_t mode = BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES;
> > > +     dev_t devt = (dev_t)(uintptr_t)priv;
> > > +     size_t bytes_left = bytes;
> > > +     loff_t pos = from;
> > > +     int ret = 0;
> > > +
> > > +     struct file *bdev_file __free(fput) = bdev_file_open_by_dev(devt, mode, priv, NULL);
> > > +     if (IS_ERR(bdev_file))
> > > +             return PTR_ERR(bdev_file);
> > > +
> > > +     while (bytes_left) {
> > > +             pgoff_t f_index = pos >> PAGE_SHIFT;
> > > +             struct folio *folio;
> > > +             size_t folio_off;
> > > +             size_t to_read;
> > > +
> > > +             folio = read_mapping_folio(bdev_file->f_mapping, f_index, NULL);
> > > +             if (IS_ERR(folio)) {
> > > +                     ret = PTR_ERR(folio);
> > > +                     break;
> > > +             }
> > > +
> > > +             folio_off = offset_in_folio(folio, pos);
> > > +             to_read = min(bytes_left, folio_size(folio) - folio_off);
> > > +             memcpy_from_folio(val, folio, folio_off, to_read);
> > > +             pos += to_read;
> > > +             bytes_left -= to_read;
> > > +             val += to_read;
> > > +             folio_put(folio);
> > > +     }
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +void blk_nvmem_add(struct block_device *bdev)
> > > +{
> > > +     struct device *dev = &bdev->bd_device;
> > > +     struct nvmem_config config = {};
> > > +
> > > +     /* skip devices which do not have a device tree node */
> > > +     if (!dev_of_node(dev))
> > > +             return;
> > > +
> > > +     /* skip devices without an nvmem layout defined */
> > > +     struct device_node *child __free(device_node) =
> > > +             of_get_child_by_name(dev_of_node(dev), "nvmem-layout");
> > > +     if (!child)
> > > +             return;
> > > +
> > > +     /*
> > > +      * skip block device too large to be represented as NVMEM devices,
> > > +      * the NVMEM reg_read callback uses an unsigned int offset
> > > +      */
> > > +     if (bdev_nr_bytes(bdev) > UINT_MAX) {
> > > +             dev_warn(dev, "block device too large to be an NVMEM provider\n");
> > > +             return;
> > > +     }
> > > +
> > > +     config.id = NVMEM_DEVID_NONE;
> > > +     config.dev = dev;
> > > +     config.name = dev_name(dev);
> > > +     config.owner = THIS_MODULE;
> > > +     config.priv = (void *)(uintptr_t)dev->devt;
> > > +     config.reg_read = blk_nvmem_reg_read;
> > > +     config.size = bdev_nr_bytes(bdev);
> > > +     config.word_size = 1;
> > > +     config.stride = 1;
> > > +     config.read_only = true;
> > > +     config.root_only = true;
> > > +     config.ignore_wp = true;
> > > +     config.of_node = to_of_node(dev->fwnode);
> > > +
> > > +     bdev->bd_nvmem = nvmem_register(&config);
> > > +     if (IS_ERR(bdev->bd_nvmem)) {
> > > +             dev_err_probe(dev, PTR_ERR(bdev->bd_nvmem),
> > > +                           "Failed to register NVMEM device\n");
> >
> > Using dev_err_probe() only makes sense with a return value. Which makes me
> > think: we won't retry this after a probe deferral. I think we should return
>
> Yes, so here with the nvmem fixed-layout, there is no way to get a
> deferred probe error, but better to be ready to handle this anyway.
>
> > int from this function just for this use-case. Also: if we *do* have
> > a layout, shouldn't we treat a failure to register the nvmem provider as
> > an error and propagate it up the stack?
>
> From an API perspective we should indeed return the error. From the
> block core's perspective, do we want to fail the entire disk addition
> just because the 'companion' NVMEM provider couldn't be registered, or
> should we only abort/return in case of EPROBE_DEFER?

Also, we cannot safely return -EPROBE_DEFER from add_disk_final()
either. The NVMEM registration point is late in the sequence; too much
has already happened to unwind easily. The easiest option is that the
NVMEM provider simply won't be available if registration fails, which
looks acceptable?

>
> >
> > > +             bdev->bd_nvmem = NULL;
> > > +     }
> > > +}
> > > +
> > > +void blk_nvmem_del(struct block_device *bdev)
> > > +{
> > > +     if (bdev->bd_nvmem)
> >
> > Nvmem core already performs a NULL check.
>
> Ok, thanks!
>
>
> >
> > > +             nvmem_unregister(bdev->bd_nvmem);
> > > +
> > > +     bdev->bd_nvmem = NULL;
> > > +}
> > > diff --git a/block/blk.h b/block/blk.h
> > > index ec4674cdf2ead4fd259ff5fc42401f591e684ee9..cd3c7ca723391c40be56f1dd4810e641b7c8a2b3 100644
> > > --- a/block/blk.h
> > > +++ b/block/blk.h
> > > @@ -757,4 +757,12 @@ static inline void blk_debugfs_unlock(struct request_queue *q,
> > >       memalloc_noio_restore(memflags);
> > >  }
> > >
> > > +#ifdef CONFIG_BLK_NVMEM
> > > +void blk_nvmem_add(struct block_device *bdev);
> > > +void blk_nvmem_del(struct block_device *bdev);
> > > +#else
> > > +static inline void blk_nvmem_add(struct block_device *bdev) {}
> > > +static inline void blk_nvmem_del(struct block_device *bdev) {}
> > > +#endif
> > > +
> > >  #endif /* BLK_INTERNAL_H */
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index 7d6854fd28e95ae9134309679a7c6a937f5b7db8..1b2382de6fb30c1e5f60f45c04dc03ed3bf5d5f2 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -421,6 +421,8 @@ static void add_disk_final(struct gendisk *disk)
> > >                */
> > >               dev_set_uevent_suppress(ddev, 0);
> > >               disk_uevent(disk, KOBJ_ADD);
> > > +
> > > +             blk_nvmem_add(disk->part0);
> > >       }
> > >
> > >       blk_apply_bdi_limits(disk->bdi, &disk->queue->limits);
> > > @@ -704,6 +706,8 @@ static void __del_gendisk(struct gendisk *disk)
> > >
> > >       disk_del_events(disk);
> > >
> > > +     blk_nvmem_del(disk->part0);
> > > +
> > >       /*
> > >        * Prevent new openers by unlinked the bdev inode.
> > >        */
> > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > > index 8808ee76e73c09e0ceaac41ba59e86fb0c4efc64..ace6f59b860d0813665b2f62a1c03a1f4be94059 100644
> > > --- a/include/linux/blk_types.h
> > > +++ b/include/linux/blk_types.h
> > > @@ -73,6 +73,9 @@ struct block_device {
> > >       int                     bd_writers;
> > >  #ifdef CONFIG_SECURITY
> > >       void                    *bd_security;
> > > +#endif
> > > +#ifdef CONFIG_BLK_NVMEM
> > > +     struct nvmem_device     *bd_nvmem;
> > >  #endif
> > >       /*
> > >        * keep this out-of-line as it's both big and not needed in the fast
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index 890128cdea1ce66863c5baa36f3b336ec4550807..f15d2b5bf9e4fd2368b8a70416a978e22c0d4333 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -30,6 +30,7 @@
> > >
> > >  struct module;
> > >  struct request_queue;
> > > +struct nvmem_device;
> > >  struct elevator_queue;
> > >  struct blk_trace;
> > >  struct request;
> > >
> > > --
> > > 2.34.1
> > >
> > >
> >
> > I like this approach better than the previous one.
> >
> > Thanks,
> > Bartosz

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] net: pse-pd: add Realtek/Broadcom PSE MCU driver
From: Oleksij Rempel @ 2026-06-15  9:39 UTC (permalink / raw)
  To: Simon Horman
  Cc: jelonek.jonas, kory.maincent, andrew+netdev, davem, edumazet,
	kuba, pabeni, robh, krzk+dt, conor+dt, netdev, devicetree,
	linux-kernel, daniel, bjorn
In-Reply-To: <20260615090733.739423-2-horms@kernel.org>

Hi Jonas,

On Mon, Jun 15, 2026 at 10:07:33AM +0100, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev

It probably sounds scary, but sashiko does not find all issues in one
pass. Bigger patches that touch more frameworks increase the probability
of findings a lot. In other words, it may take more rounds than expected.

You may optimize this, if you have access to some free or paid tokens,
by using sashik-cli with an LLM of your choice as backend.

Best Regards,
Oleksij
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] net: pse-pd: add Realtek/Broadcom PSE MCU driver
From: Jonas Jelonek @ 2026-06-15  9:47 UTC (permalink / raw)
  To: Simon Horman
  Cc: o.rempel, kory.maincent, andrew+netdev, davem, edumazet, kuba,
	pabeni, robh, krzk+dt, conor+dt, netdev, devicetree, linux-kernel,
	daniel, bjorn
In-Reply-To: <20260615090733.739423-2-horms@kernel.org>

Hi,

On 15.06.26 11:07, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> net: pse-pd: add Realtek/Broadcom PSE MCU driver
>
> This patch adds a driver for a microcontroller that fronts Realtek and
> Broadcom PSE silicon on PoE switches, exposing the MCU over I2C/SMBus or
> UART using a 12-byte fixed-frame protocol. The driver is split into a
> shared core plus two transport modules, with a per-dialect opcode table
> and parser hooks for the Realtek vs Broadcom variants.
>
>> diff --git a/drivers/net/pse-pd/Kconfig b/drivers/net/pse-pd/Kconfig
>> index 7ef29657ee5d..b065b19db126 100644
>> --- a/drivers/net/pse-pd/Kconfig
>> +++ b/drivers/net/pse-pd/Kconfig
>> @@ -13,6 +13,34 @@ menuconfig PSE_CONTROLLER
>>  
>>  if PSE_CONTROLLER
>>  
>> +config PSE_REALTEK
>> +	tristate
>> +	help
>> +	  Shared core for the Realtek/Broadcom PSE MCU driver. This is
>> +	  selected automatically by the transport options below.
>> +
>> +config PSE_REALTEK_I2C
>> +	tristate "Realtek/Broadcom PSE MCU driver (I2C transport)"
>> +	depends on I2C
>> +	select PSE_REALTEK
>> +	help
>> +	  Driver for the microcontroller (MCU) that fronts the PSE
>> +	  hardware on switches with Realtek or Broadcom PSE chips, attached
>> +	  via I2C/SMBus. The MCU exposes a message-based protocol; the actual
>> +	  PSE silicon is not accessed directly. To compile this driver as a
>> +	  module, choose M here: the module will be called realtek-pse-i2c.
>> +
>> +config PSE_REALTEK_UART
>> +	tristate "Realtek/Broadcom PSE MCU driver (UART transport)"
>> +	depends on SERIAL_DEV_BUS
>> +	select PSE_REALTEK
>> +	help
>> +	  Driver for the microcontroller (MCU) that fronts the PSE
>> +	  hardware on switches with Realtek or Broadcom PSE chips, attached
>> +	  via UART. The MCU exposes a message-based protocol; the actual PSE
>> +	  silicon is not accessed directly. To compile this driver as a
>> +	  module, choose M here: the module will be called realtek-pse-uart.
>> +
> [Medium]
> The realtek, vendor prefix is used for both the Realtek dialect and the
> Broadcom dialect compatibles (realtek,pse-mcu-rtk and
> realtek,pse-mcu-bcm), and the binding documentation explicitly notes
> that the prefix "reflects the platform" rather than the silicon vendor.
> DT compatibles are stable ABI once accepted.
>
> Would a brcm, prefix (or per-MCU-part compatibles) for the Broadcom
> dialect be more appropriate, given that the BCM variant uses Broadcom
> PSE silicon driven by an MCU built by Nuvoton or STMicro and has no
> Realtek involvement?
>
> Similarly, would describing the I2C wire framing (native vs SMBus) via
> separate compatibles be preferable to encoding it in the
> realtek,i2c-protocol DT property, and would describing the actual PSE
> chip in DT (rather than relying on runtime device-id detection) match
> the usual binding patterns?

This has been the hardest part so far, and the solution I have right now
seems to be the best one available. But this part needs some guidance
from the DT maintainers.

I would push back on the AI's suggestions for the following reasons. Using
specific compatibles for the PSE chips would be very wrong IMO because:

(1) It would claim those PSE chips are what is interfaced on the bus. But
     this is not true, there's only the MCU which mostly hides the PSE chips.
(2) The PSE chips could be used without such an MCU. While this is rather
     theoretical for Realtek PSE chips, it is definitely the case for the
     Broadcom PSE chips. And in that case, they even use a different
     interface for communication, called Broadcom Serial Control.

On the other hand, using specific MCU compatibles would be wrong too:

(1) The MCU silicon is irrelevant to the PSE/PoE system. Those are just
     general purpose MCUs where the supplied firmware on that MCU,
     not the silicon, defines the interface to the host.
(2) The MCUs are from various vendors (here STMicro, GigaDevice,
     Nuvoton), since they are general purpose they're likely used in various
     other applications. Claiming compatibles here for a specific
     application is IMO wrong.

Finally, why use the Realtek prefix/scope: the system that is described
by the bindings and supported by the driver has so far been seen solely
on Realtek-based switches. Some SDK sources already include a driver
implementation for it. Although there is no conclusive evidence, it
seems quite reasonable that Realtek provides this MCU+firmware+PSE
silicon as a kind of package. Thus, Realtek defines and provides the
interface and behavior. The suffix then just denotes the protocol dialect
used, correlating with the PSE silicon behind the MCU.

In older switch generations, Realtek just decided to use BCM59xxx PSE
chips until they developed and shipped their own. Likely because
the capabilities of BCM and Realtek PSE silicon differ slightly, they
decided to alter the protocol with different opcodes and a slightly
different command set.

I've seen a similar review from Sashiko on the bindings patch; my
explanation above applies to it in the same way.

I'll try to improve the rationale a bit in the descriptions for the next
version.

>>  config PSE_REGULATOR
>>  	tristate "Regulator based PSE controller"
>>  	help
> [ ... ]
>
>> diff --git a/drivers/net/pse-pd/realtek-pse-core.c b/drivers/net/pse-pd/realtek-pse-core.c
>> new file mode 100644
>> index 000000000000..3a601cfb6280
>> --- /dev/null
>> +++ b/drivers/net/pse-pd/realtek-pse-core.c
> [ ... ]
>
>> +static int rtpse_port_cmd(struct rtpse_ctrl *pse, unsigned int port, u8 opcode, u8 arg)
>> +{
>> +	struct rtpse_mcu_msg req, resp;
>> +	int ret;
>> +
>> +	rtpse_mcu_msg_init(&req, opcode);
>> +	req.payload[0] = port;
>> +	req.payload[1] = arg;
>> +
>> +	ret = rtpse_do_xfer(pse, &req, &resp);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (resp.payload[0] != port || resp.payload[1] != 0)
>> +		return -EIO;
>> +
>> +	return 0;
>> +}
> [Medium]
> Can an all-zero 12-byte buffer be silently accepted as a successful
> PORT_ENABLE/PORT_DISABLE on the BCM dialect for port 0?
>
> For the BCM dialect, RTPSE_CMD_PORT_ENABLE maps to opcode 0x00, and the
> checksum of eleven zero bytes is 0, so an all-zero response passes the
> opcode and checksum checks in rtpse_do_xfer():
>
> 	if (resp->opcode != req->opcode ||
> 	    resp->checksum != rtpse_checksum((u8 *)resp, RTPSE_MCU_MSG_SIZE - 1))
> 		return -EBADMSG;
>
> rtpse_port_cmd() then verifies payload[0] == port and payload[1] == 0,
> both of which pass when port == 0 and arg is 0 (disable) or when arg is
> 1 but the MCU never observed the command and the buffer is all zero
> anyway since payload[1] is checked against 0.
>
> The seq_num field exists in struct rtpse_mcu_msg and is filled with 0xff
> by rtpse_mcu_msg_init() but is never checked on the response side. Could
> a seq_num round-trip check (or a non-zero opcode sentinel for the BCM
> dialect) reject stale or zeroed-out frames here?
>
> [ ... ]
>

Valid, I will add a check for the seq_num field here.
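
For illustration only, a minimal userspace sketch of why the all-zero
frame slips through and how a seq_num comparison would reject it. The
struct layout (opcode, seq_num, 9 payload bytes, checksum), the 8-bit
additive checksum, and all names ending in _sketch are assumptions made
for this example; the sketch also assumes the MCU echoes seq_num back,
which the review implies but this thread does not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_SIZE 12

/* Assumed layout: opcode, seq_num, 9 payload bytes, checksum. */
struct mcu_msg_sketch {
	uint8_t opcode;
	uint8_t seq_num;
	uint8_t payload[MSG_SIZE - 3];
	uint8_t checksum;
};

/* Assumed checksum: 8-bit sum over the first len bytes. */
uint8_t msg_checksum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Opcode and checksum only, as in the quoted rtpse_do_xfer() check. */
bool resp_ok_without_seq(const struct mcu_msg_sketch *req,
			 const struct mcu_msg_sketch *resp)
{
	assert(req && resp);
	return resp->opcode == req->opcode &&
	       resp->checksum == msg_checksum((const uint8_t *)resp, MSG_SIZE - 1);
}

/* Same check plus a seq_num round trip. */
bool resp_ok_with_seq(const struct mcu_msg_sketch *req,
		      const struct mcu_msg_sketch *resp)
{
	return resp_ok_without_seq(req, resp) && resp->seq_num == req->seq_num;
}
```

With opcode 0x00 and seq_num 0xff in the request, a zeroed response
passes the first helper but fails the second, because 0x00 != 0xff.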

>> +static int rtpse_discover(struct rtpse_ctrl *pse, struct rtpse_mcu_info *info)
>> +{
>> +	struct rtpse_mcu_ext_config ext_config;
>> +	unsigned long deadline;
>> +	int ret;
>> +
>> +	/*
>> +	 * The MCU may not answer on the bus yet right after power-up or
>> +	 * enable-gpios assertion: depending on the transport it either stays
>> +	 * silent (-ETIMEDOUT) or does not ACK its address at all (-ENXIO /
>> +	 * -EREMOTEIO). Retry within a bounded wall-time window so a slow boot
>> +	 * still probes, while a genuinely unresponsive MCU fails with its real
>> +	 * error instead of deferring forever and masking it.
>> +	 */
>> +	deadline = jiffies + msecs_to_jiffies(RTPSE_MCU_BOOT_TIMEOUT_MS);
>> +	do {
>> +		ret = rtpse_mcu_get_info(pse, info);
>> +		if (ret != -ETIMEDOUT && ret != -ENXIO &&
>> +		    ret != -EREMOTEIO && ret != -EAGAIN)
>> +			break;
>> +		msleep(RTPSE_MCU_BOOT_RETRY_MS);
>> +	} while (time_before(jiffies, deadline));
> [Low]
> The comment says the loop is meant to tolerate a slow-boot MCU, but the
> retry whitelist excludes -EBADMSG and -EIO.
>
> rtpse_do_xfer() returns -EBADMSG when the opcode or checksum mismatch
> fires, and rtpse_port_query()/rtpse_port_cmd() return -EIO when the
> echoed port byte does not match the request. On a UART-attached MCU
> that emits a single garbled frame while its serial peripheral
> stabilises, or on an I2C-attached MCU that returns stale boot-ROM bytes
> right after enable-gpios assertion, the first attempt would surface as
> one of those errors rather than -ETIMEDOUT or -ENXIO, and the loop
> would break out immediately.
>
> Should -EBADMSG (and arguably -EIO) be added to the retry set so the
> documented intent matches the actual behaviour?

This would be speculative IMO, because 'no answer' was the only behavior
I could observe so far. It always came up as a timeout when the MCU
wasn't ready yet.

> [ ... ]
>
>> +int rtpse_register(struct rtpse_ctrl *pse)
>> +{
> [ ... ]
>
>> +	if (!info.system_enable) {
>> +		ret = rtpse_set_global_state(pse, true);
>> +		/* Dialects without a global-state concept (e.g. BCM) return
>> +		 * -EOPNOTSUPP; treat that as "no separate enable required".
>> +		 */
>> +		if (ret && ret != -EOPNOTSUPP)
>> +			return dev_err_probe(pse->dev, ret,
>> +					     "failed to enable PSE system\n");
>> +	}
>> +
>> +	ret = regulator_enable(pse->poe_supply);
>> +	if (ret)
>> +		return dev_err_probe(pse->dev, ret, "failed to enable PoE supply\n");
>> +
>> +	ret = devm_add_action_or_reset(pse->dev, rtpse_regulator_disable, pse->poe_supply);
>> +	if (ret)
>> +		return ret;
> [Low]
> If rtpse_set_global_state(pse, true) succeeds but a later step in
> rtpse_register() fails (regulator_enable, devm_add_action_or_reset, or
> devm_pse_controller_register), is the MCU left with the PSE subsystem
> enabled while the kernel detaches?
>
> The regulator is rolled back via rtpse_regulator_disable through
> devm_add_action_or_reset, but there does not seem to be a matching
> devm_add_action_or_reset to call rtpse_set_global_state(false) on
> probe-failure or driver-removal. Is that asymmetry intentional?

No, this is an inconsistency. I'll fix this in the next version.

> [ ... ]
>
>> diff --git a/drivers/net/pse-pd/realtek-pse-uart.c b/drivers/net/pse-pd/realtek-pse-uart.c
>> new file mode 100644
>> index 000000000000..785407b379a4
>> --- /dev/null
>> +++ b/drivers/net/pse-pd/realtek-pse-uart.c
> [ ... ]
>
>> +static size_t rtpse_uart_receive(struct serdev_device *serdev,
>> +				 const u8 *buf, size_t count)
>> +{
>> +	struct rtpse_uart *ctx = serdev_device_get_drvdata(serdev);
>> +	bool done = false;
>> +	size_t take;
>> +
>> +	scoped_guard(spinlock_irqsave, &ctx->rx_lock) {
>> +		take = min(count, sizeof(ctx->rx_buf) - ctx->rx_len);
>> +		if (take) {
>> +			memcpy(ctx->rx_buf + ctx->rx_len, buf, take);
>> +			ctx->rx_len += take;
>> +			done = (ctx->rx_len == sizeof(ctx->rx_buf));
>> +		}
>> +	}
>> +	if (done)
>> +		complete(&ctx->rx_done);
>> +
>> +	/* consume all to avoid desync/misalignment */
>> +	return count;
>> +}
> [ ... ]
>
>> +static int rtpse_uart_send(struct rtpse_ctrl *pse, const struct rtpse_mcu_msg *req)
>> +{
>> +	struct rtpse_uart *ctx = to_rtpse_uart(pse);
>> +	int written;
>> +
>> +	/* clear any leftover rx state before transmitting */
>> +	reinit_completion(&ctx->rx_done);
>> +	scoped_guard(spinlock_irqsave, &ctx->rx_lock)
>> +		ctx->rx_len = 0;
> [High]
> Is there a race between reinit_completion() here and the asynchronous
> serdev receive_buf callback when a previous transaction has timed out
> with bytes still in flight?
>
> reinit_completion() is an unlocked x->done = 0 write, and the matching
> complete() in rtpse_uart_receive() is called outside rx_lock. The
> following interleave seems possible:
>
>   rtpse_uart_send() runs reinit_completion()       /* done = 0 */
>   rtpse_uart_receive() takes rx_lock, fills rx_buf
>     to 12 bytes, computes done=true, releases lock
>   rtpse_uart_send() takes rx_lock, sets rx_len = 0,
>     releases lock
>   rtpse_uart_receive() calls complete(&ctx->rx_done) /* done = 1 */
>   rtpse_uart_send() writes the new request
>   rtpse_uart_recv() wait_for_completion_timeout()
>     returns immediately (done was 1), then sees
>     rx_len == 0 and returns -EIO
>
> rtpse_discover()'s retry whitelist covers -ETIMEDOUT, -ENXIO,
> -EREMOTEIO, and -EAGAIN but not -EIO, so a single occurrence of this
> race during the boot retry window would break out of the loop and turn
> into a permanent dev_err_probe() failure.
>
> Would moving complete() inside rx_lock in rtpse_uart_receive(), or
> performing reinit_completion() and rx_len = 0 together inside rx_lock
> while excluding the completer, close this window?
>

Valid, will fix it as suggested in the next version.
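
A rough userspace analogue of the suggested fix, not the driver code: a
pthread mutex stands in for rx_lock and a plain bool stands in for the
completion. All names here are hypothetical. The point it tries to show
is that the "done" signal and rx_len change together under one lock, so
a late receive can no longer set "done" after the sender has already
cleared rx_len.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FRAME_SIZE 12

struct rx_ctx_sketch {
	pthread_mutex_t lock;	/* stands in for rx_lock */
	unsigned char buf[FRAME_SIZE];
	size_t len;		/* stands in for rx_len */
	bool done;		/* stands in for the completion */
};

void rx_ctx_init(struct rx_ctx_sketch *c)
{
	pthread_mutex_init(&c->lock, NULL);
	c->len = 0;
	c->done = false;
}

/* Receive path: "done" is raised while the lock is still held. */
size_t rx_receive(struct rx_ctx_sketch *c, const unsigned char *data, size_t count)
{
	pthread_mutex_lock(&c->lock);
	assert(c->len <= FRAME_SIZE);
	size_t take = FRAME_SIZE - c->len;

	if (take > count)
		take = count;
	if (take) {
		memcpy(c->buf + c->len, data, take);
		c->len += take;
		if (c->len == FRAME_SIZE)
			c->done = true;
	}
	pthread_mutex_unlock(&c->lock);
	return count;	/* consume everything, like the original */
}

/* Send path: length and "done" are cleared in one critical section. */
void rx_reset(struct rx_ctx_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->len = 0;
	c->done = false;
	pthread_mutex_unlock(&c->lock);
}
```

In the driver this would presumably translate to calling complete() and
reinit_completion() with rx_lock held, but that mapping is my reading
of the suggestion rather than tested code.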

>> +
>> +	written = serdev_device_write(ctx->serdev, (const u8 *)req, sizeof(*req),
>> +				      RTPSE_UART_TX_TIMEOUT);
>> +	if (written < 0)
>> +		return written;
>> +	if (written != sizeof(*req))
>> +		return -EIO;
>> +
>> +	return 0;
>> +}
> [ ... ]

Best regards,
Jonas

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] net: pse-pd: add Realtek/Broadcom PSE MCU driver
From: Jonas Jelonek @ 2026-06-15  9:50 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: kory.maincent, andrew+netdev, davem, edumazet, kuba, pabeni, robh,
	krzk+dt, conor+dt, netdev, devicetree, linux-kernel, daniel,
	bjorn, Simon Horman
In-Reply-To: <ai_ITHd_xxt7an4q@pengutronix.de>

Hi Oleksij,

thanks for your message.

On 15.06.26 11:39, Oleksij Rempel wrote:
> Hi Jonas,
>
> On Mon, Jun 15, 2026 at 10:07:33AM +0100, Simon Horman wrote:
>> This is an AI-generated review of your patch. The human sending this
>> email has considered the AI review valid, or at least plausible.
>> Full review at: https://sashiko.dev
> It probably sounds scary, but sashiko does not find all issues in one
> pass. Bigger patches that touch more frameworks increase the probability
> of findings a lot. In other words, it may take more rounds than expected.

Sure, I've noticed that in another series recently. I'll just be patient and
address/discuss the review/issues accordingly :)

> You may optimize this, if you have access to some free or paid tokens,
> by using sashik-cli with an LLM of your choice as backend.
>
> Best Regards,
> Oleksij

Best regards,
Jonas

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] net: isolate SKB data area allocations
From: Pedro Falcato @ 2026-06-15 10:07 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), Jakub Kicinski
  Cc: Harry Yoo, Andrew Morton, David S. Miller, Eric Dumazet,
	Paolo Abeni, linux-hardening, linux-mm, netdev, linux-kernel,
	Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Simon Horman, Jason Xing, Kuniyuki Iwashima, Kees Cook
In-Reply-To: <ad862f5c-34e4-48d6-8d8b-f02bdc02d7d6@kernel.org>

On Mon, Jun 15, 2026 at 09:28:39AM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/13/26 20:33, Jakub Kicinski wrote:
> > On Thu, 11 Jun 2026 13:46:40 +0100 Pedro Falcato wrote:
> >> Subject: [PATCH net-next v2 0/2] net: isolate SKB data area allocations
> > 
> > This doesn't apply to net-next, does patch 2 not apply to mm?

Ugh, annoying - really should have rebased on net-next precisely
(vs just linux-next).

> > If neither tree can take both - maybe MM can take the first patch by
> 
> OK I'll take the first patch through the slab tree in the planned second
> next week's PR.

Thanks!

> 
> > itself and we will queue patch 2 after the changes propagate during 
> > the merge window?
> 

It's all the same to me, I'll have to resubmit the patches when the merge
window closes. So whatever is easier on the maintainers side sounds good to
me :)

-- 
Pedro

^ permalink raw reply

* [PATCH net v2 1/1] net: ipv4: bound TCP reordering sysctl writes and MTU probe sizes
From: Ren Wei @ 2026-06-15 10:31 UTC (permalink / raw)
  To: netdev, edumazet, kuniyu, david.laight.linux
  Cc: ncardwell, pabeni, chia-yu.chang, ij, yuuchihsu, idosch, fmancera,
	herbert, yuantan098, zcliangcn, bird, bronzed_45_vested, n05ec

From: Wyatt Feng <bronzed_45_vested@icloud.com>

Reject invalid `net.ipv4.tcp_reordering` values before they reach TCP
socket state. The sysctl is stored as an `int` but copied into the
`u32` `tp->reordering` field for new sockets, so negative writes wrap
to large values.

With `tcp_mtu_probing=2`, the wrapped value can overflow the
`tcp_mtu_probe()` size calculation and drive the MTU probing path into
an out-of-bounds read. Route `tcp_reordering` writes through
`proc_dointvec_minmax()` and require it to be at least 1. Also require
`tcp_max_reordering` to be at least 1 so the configured maximum cannot
become negative either.

When registering the table for a non-init network namespace, relocate
`extra2` pointers that refer into `init_net.ipv4` so the
`tcp_reordering` upper bound follows that namespace's
`tcp_max_reordering`.

Harden `tcp_mtu_probe()` itself by computing `size_needed` as `u64`.
This keeps the send queue and window checks from being bypassed through
signed integer overflow.
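
To make the wrap concrete, here is a minimal userspace sketch (not
kernel code). The helper names are hypothetical and the fixed-width
types stand in for the kernel's int/u32/u64; it mirrors only the
arithmetic of the old and new `size_needed` expressions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: an int sysctl value copied into a u32 field. */
uint32_t reordering_from_sysctl(int32_t sysctl_val)
{
	return (uint32_t)sysctl_val;	/* -2 becomes 4294967294 */
}

/* Old expression: all 32-bit, so the product wraps modulo 2^32. */
int32_t size_needed_old(int32_t probe_size, uint32_t reordering, uint32_t mss)
{
	assert(probe_size >= 0);
	uint32_t sum = (uint32_t)probe_size + (reordering + 1u) * mss;

	/* Out-of-range conversion is implementation-defined; gcc and clang wrap. */
	return (int32_t)sum;
}

/* New expression: the multiplication is carried out in 64 bits. */
uint64_t size_needed_new(int32_t probe_size, uint32_t reordering, uint32_t mss)
{
	assert(probe_size >= 0);
	return (uint64_t)probe_size + (reordering + 1u) * (uint64_t)mss;
}
```

With a probe size of 1000, an MSS of 1460 and a sysctl value of -2, the
old form yields -460 while the new form yields 6270652251700. Note that
`reordering + 1` still wraps to 0 in 32 bits when the field holds
0xffffffff, which is why the sysctl lower bound is needed as well.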

Fixes: 91cc17c0e5e5 ("[TCP]: MTUprobe: receiver window & data available checks fixed")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Suggested-by: Eric Dumazet <edumazet@google.com>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
Changes in v2:
- Use proc_dointvec_minmax() directly for tcp_reordering and
  tcp_max_reordering, as suggested in review.
- Relocate ipv4 .extra2 sysctl pointers for non-init network namespaces.
- Harden tcp_mtu_probe() by making size_needed a u64.
- v1 link: https://lore.kernel.org/all/42cd30856907350e1b3834a3338364f9828a307b.1780979031.git.bronzed_45_vested@icloud.com/

 net/ipv4/sysctl_net_ipv4.c | 10 ++++++++--
 net/ipv4/tcp_output.c      |  4 ++--
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index c0e85cc171ae..ca1180dba1de 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1058,7 +1058,9 @@ static struct ctl_table ipv4_net_table[] = {
 		.data		= &init_net.ipv4.sysctl_tcp_reordering,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
+		.extra2		= &init_net.ipv4.sysctl_tcp_max_reordering,
 	},
 	{
 		.procname	= "tcp_retries1",
@@ -1293,7 +1295,8 @@ static struct ctl_table ipv4_net_table[] = {
 		.data		= &init_net.ipv4.sysctl_tcp_max_reordering,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ONE,
 	},
 	{
 		.procname	= "tcp_dsack",
@@ -1676,6 +1679,9 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 				 */
 				table[i].mode &= ~0222;
 			}
+			if (table[i].extra2 >= (void *)&init_net.ipv4 &&
+			    table[i].extra2 < (void *)(&init_net.ipv4 + 1))
+				table[i].extra2 += (void *)net - (void *)&init_net;
 		}
 	}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6e4bb411dc04..193637a58dcc 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2687,7 +2687,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	struct sk_buff *skb, *nskb, *next;
 	struct net *net = sock_net(sk);
 	int probe_size;
-	int size_needed;
+	u64 size_needed;
 	int copy, len;
 	int mss_now;
 	int interval;
@@ -2711,7 +2711,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	mss_now = tcp_current_mss(sk);
 	probe_size = tcp_mtu_to_mss(sk, (icsk->icsk_mtup.search_high +
 				    icsk->icsk_mtup.search_low) >> 1);
-	size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
+	size_needed = probe_size + (tp->reordering + 1) * (u64)tp->mss_cache;
 	interval = icsk->icsk_mtup.search_high - icsk->icsk_mtup.search_low;
 	/* When misfortune happens, we are reprobing actively,
 	 * and then reprobe timer has expired. We stick with current
-- 
2.47.3


^ permalink raw reply related

* Re: [PATCH net-next v2 2/2] net: pse-pd: add Realtek/Broadcom PSE MCU driver
From: Simon Horman @ 2026-06-15 10:34 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: jelonek.jonas, kory.maincent, andrew+netdev, davem, edumazet,
	kuba, pabeni, robh, krzk+dt, conor+dt, netdev, devicetree,
	linux-kernel, daniel, bjorn
In-Reply-To: <ai_ITHd_xxt7an4q@pengutronix.de>

On Mon, Jun 15, 2026 at 11:39:24AM +0200, Oleksij Rempel wrote:
> Hi Jonas,
> 
> On Mon, Jun 15, 2026 at 10:07:33AM +0100, Simon Horman wrote:
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > Full review at: https://sashiko.dev
> 
> It probably sounds scary, but sashiko does not find all issues in one
> pass. Bigger patches that touch more frameworks increase the probability
> of findings a lot. In other words, it may take more rounds than expected.

FWIW, that matches my observations too.

...

^ permalink raw reply

* [PATCH net] nfc: pn533: prevent division by zero in the listen mode timer
From: Yinhao Hu @ 2026-06-15 10:35 UTC (permalink / raw)
  To: netdev
  Cc: David Heidelberg, Krzysztof Kozlowski, Jakub Kicinski,
	Dan Carpenter, dzm91, hust-os-kernel-patches, Yinhao Hu

The listen-mode timer handler advances the polling state machine through
pn533_poll_next_mod(), which computes:

dev->poll_mod_curr = (dev->poll_mod_curr + 1) % dev->poll_mod_count;

pn533_poll_reset_mod_list() clears dev->poll_mod_count without first
stopping that timer: pn533_dep_link_down() deletes no timer at all, and
pn533_stop_poll() uses timer_delete(), which does not wait for a handler
already running on another CPU. When the handler runs after the count
has been zeroed, it divides by zero:

Oops: divide error: 0000 [#1] SMP
RIP: 0010:pn533_listen_mode_timer+0x9b/0x110

Delete the timer synchronously in pn533_poll_reset_mod_list(), the single
place that clears the list, so the handler can no longer run past a reset.
Also return early when poll_mod_count is already zero, covering the window
where pn533_wq_poll() re-arms the timer just before a reset.
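
As a small userspace illustration of the guard (hypothetical names, not
the driver's structures): the modulo is only evaluated when the count
is non-zero. In the patch itself the check sits in the timer handler
rather than in pn533_poll_next_mod(), and it complements the synchronous
timer deletion rather than replacing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields involved. */
struct poll_state_sketch {
	unsigned int mod_curr;
	unsigned int mod_count;
};

/* Advance to the next modulation, or report that polling was reset. */
bool poll_next_mod_guarded(struct poll_state_sketch *s)
{
	assert(s != NULL);
	if (!s->mod_count)
		return false;	/* list was cleared: nothing to advance */

	s->mod_curr = (s->mod_curr + 1) % s->mod_count;
	return true;
}
```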

Fixes: 6fbbdc16be38 ("NFC: Implement pn533 polling loop")
Signed-off-by: Yinhao Hu <dddddd@hust.edu.cn>
---
 drivers/nfc/pn533/pn533.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/nfc/pn533/pn533.c b/drivers/nfc/pn533/pn533.c
index d7bdbc82e2ba..88df99001b4a 100644
--- a/drivers/nfc/pn533/pn533.c
+++ b/drivers/nfc/pn533/pn533.c
@@ -951,6 +951,7 @@ static inline void pn533_poll_next_mod(struct pn533 *dev)
 
 static void pn533_poll_reset_mod_list(struct pn533 *dev)
 {
+	timer_delete_sync(&dev->listen_timer);
 	dev->poll_mod_count = 0;
 }
 
@@ -1235,6 +1236,10 @@ static void pn533_listen_mode_timer(struct timer_list *t)
 {
 	struct pn533 *dev = timer_container_of(dev, t, listen_timer);
 
+	/* Polling may have been stopped while the timer was pending. */
+	if (!dev->poll_mod_count)
+		return;
+
 	dev->cancel_listen = 1;
 
 	pn533_poll_next_mod(dev);
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 1/1] net: sched: ets: avoid deficit wrap and empty-round livelock
From: Ren Wei @ 2026-06-15 10:37 UTC (permalink / raw)
  To: netdev
  Cc: jhs, jiri, petrm, davem, yuantan098, yifanwucs, tomapufckgml,
	zcliangcn, bird, bronzed_45_vested, n05ec
In-Reply-To: <20260615103759.2404228-1-n05ec@lzu.edu.cn>

From: Wyatt Feng <bronzed_45_vested@icloud.com>

ETS keeps each DRR-style deficit in a u32 and replenishes it with
the configured quantum whenever the head packet is too large. Both
the quantum and qdisc_pkt_len() are user-controlled inputs: a large
quantum can wrap the deficit counter, while a tiny quantum combined
with an inflated qdisc_pkt_len() can force billions of iterations in
softirq context before any packet becomes eligible.

Store deficits in u64 so replenishment remains monotonic, and after
one complete pass over the active list compute how many additional
full rounds cannot dequeue from any class. Add that budget in bulk
instead of advancing one quantum at a time. This preserves ETS ordering
while removing the non-productive loop.
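
The round computation can be sketched in userspace as follows. The
struct and function names are hypothetical, and the sketch covers only
the ceiling division and the minimum over the active classes; how the
resulting budget is then added to each deficit is left to the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-class state the patch looks at. */
struct ets_class_sketch {
	uint64_t deficit;	/* deficit after the first replenishment */
	uint32_t quantum;	/* assumed non-zero */
	uint32_t head_len;	/* qdisc_pkt_len() of the head packet */
};

/* Further quanta needed before the head packet fits:
 * ceil((head_len - deficit) / quantum), or 0 if it already fits.
 */
uint64_t rounds_until_eligible(const struct ets_class_sketch *cl)
{
	assert(cl->quantum != 0);
	if (cl->head_len <= cl->deficit)
		return 0;
	return ((uint64_t)cl->head_len - cl->deficit + cl->quantum - 1) /
	       cl->quantum;
}

/* Number of further replenishments after which at least one class
 * becomes eligible: the minimum over all active classes.
 */
uint64_t min_rounds(const struct ets_class_sketch *cls, size_t n)
{
	uint64_t best = UINT64_MAX;

	for (size_t i = 0; i < n; i++) {
		uint64_t r = rounds_until_eligible(&cls[i]);

		if (r < best)
			best = r;
	}
	return best;
}
```

For a quantum of 1 and a 1000-byte head packet the helper returns 1000
in a single division, where stepping one quantum at a time would need
that many loop iterations.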

Fixes: dcc68b4d8084 ("net: sch_ets: Add a new Qdisc")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 net/sched/sch_ets.c | 63 +++++++++++++++++++++++++++++++--------------
 1 file changed, 44 insertions(+), 19 deletions(-)

diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index a4b07b661b77..ed3b191781ee 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -40,7 +40,7 @@ struct ets_class {
 	struct list_head alist; /* In struct ets_sched.active. */
 	struct Qdisc *qdisc;
 	u32 quantum;
-	u32 deficit;
+	u64 deficit;
 	struct gnet_stats_basic_sync bstats;
 	struct gnet_stats_queue qstats;
 };
@@ -465,8 +465,10 @@ ets_qdisc_dequeue_skb(struct Qdisc *sch, struct sk_buff *skb)
 static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
 {
 	struct ets_sched *q = qdisc_priv(sch);
-	struct ets_class *cl;
+	struct ets_class *cl, *first;
 	struct sk_buff *skb;
+	u64 extra_rounds;
+	u64 rounds;
 	unsigned int band;
 	unsigned int len;
 
@@ -481,26 +483,49 @@ static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
 		if (list_empty(&q->active))
 			goto out;
 
-		cl = list_first_entry(&q->active, struct ets_class, alist);
-		skb = cl->qdisc->ops->peek(cl->qdisc);
-		if (!skb) {
-			qdisc_warn_nonwc(__func__, cl->qdisc);
-			goto out;
-		}
+		first = list_first_entry(&q->active, struct ets_class, alist);
+		extra_rounds = U64_MAX;
 
-		len = qdisc_pkt_len(skb);
-		if (len <= cl->deficit) {
-			cl->deficit -= len;
-			skb = qdisc_dequeue_peeked(cl->qdisc);
-			if (unlikely(!skb))
+		do {
+			cl = list_first_entry(&q->active, struct ets_class, alist);
+			skb = cl->qdisc->ops->peek(cl->qdisc);
+			if (!skb) {
+				qdisc_warn_nonwc(__func__, cl->qdisc);
 				goto out;
-			if (cl->qdisc->q.qlen == 0)
-				list_del_init(&cl->alist);
-			return ets_qdisc_dequeue_skb(sch, skb);
-		}
+			}
+
+			len = qdisc_pkt_len(skb);
+			if (len <= cl->deficit) {
+				cl->deficit -= len;
+				skb = qdisc_dequeue_peeked(cl->qdisc);
+				if (unlikely(!skb))
+					goto out;
+				if (cl->qdisc->q.qlen == 0)
+					list_del_init(&cl->alist);
+				return ets_qdisc_dequeue_skb(sch, skb);
+			}
+
+			cl->deficit += cl->quantum;
+			list_move_tail(&cl->alist, &q->active);
+
+			if (len <= cl->deficit) {
+				extra_rounds = 0;
+				continue;
+			}
+
+			rounds = div64_u64((u64)len - cl->deficit + cl->quantum - 1,
+					   cl->quantum);
+			if (rounds < extra_rounds)
+				extra_rounds = rounds;
+		} while (list_first_entry(&q->active, struct ets_class,
+					  alist) != first);
+
+		if (!extra_rounds)
+			continue;
 
-		cl->deficit += cl->quantum;
-		list_move_tail(&cl->alist, &q->active);
+		/* Skip full rounds where no active class can dequeue. */
+		list_for_each_entry(cl, &q->active, alist)
+			cl->deficit += extra_rounds * cl->quantum;
 	}
 out:
 	return NULL;
-- 
2.43.7


* [PATCH net] appletalk: fix use-after-free in atalk_find_primary()
From: Yizhou Zhao @ 2026-06-15 10:39 UTC (permalink / raw)
  To: netdev
  Cc: Yizhou Zhao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, Kito Xu, linux-kernel,
	Yuxiang Yang, Ao Wang, Xuewei Feng, Qi Li, Ke Xu, stable

atalk_find_primary() walks the global AppleTalk interface list under
atalk_interfaces_lock, but returns a pointer to iface->address after
dropping that lock.  Both atalk_autobind() and atalk_bind() then
dereference the returned pointer without any lifetime protection.

The interface can be removed concurrently through the normal AppleTalk
interface ioctl path.  SIOCATALKDIFADDR calls atalk_dev_down(), which
eventually reaches atif_drop_device() and frees the same struct
atalk_iface that owns the returned address field.  A racing bind can
therefore read from freed memory.

This is reachable with a configured AppleTalk interface; reproducing the
race does not require a malicious device or driver.  The configuration
ioctls require CAP_NET_ADMIN in the initial user namespace, and
AF_APPLETALK sockets are limited to init_net.

Fix the lifetime issue without changing the returned address pointer
type.  Rename the helper to atalk_find_primary_locked() and keep
atalk_interfaces_lock held across the return.  The callers now copy
s_net and s_node while the lock is still held, then immediately release
the lock before doing any further work.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Assisted-by: GLM:GLM-5.1
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
---
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 30a6dc06291c..4d6576cd0ae8 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -351,7 +351,7 @@ struct atalk_addr *atalk_find_dev_addr(struct net_device *dev)
 	return iface ? &iface->address : NULL;
 }
 
-static struct atalk_addr *atalk_find_primary(void)
+static struct atalk_addr *atalk_find_primary_locked(void)
 {
 	struct atalk_iface *fiface = NULL;
 	struct atalk_addr *retval;
@@ -378,7 +378,6 @@ static struct atalk_addr *atalk_find_primary(void)
 	else
 		retval = NULL;
 out:
-	read_unlock_bh(&atalk_interfaces_lock);
 	return retval;
 }
 
@@ -1132,20 +1131,24 @@ static int atalk_autobind(struct sock *sk)
 {
 	struct atalk_sock *at = at_sk(sk);
 	struct sockaddr_at sat;
-	struct atalk_addr *ap = atalk_find_primary();
+	struct atalk_addr *ap = atalk_find_primary_locked();
 	int n = -EADDRNOTAVAIL;
 
 	if (!ap || ap->s_net == htons(ATADDR_ANYNET))
-		goto out;
+		goto unlock_and_out;
 
 	at->src_net  = sat.sat_addr.s_net  = ap->s_net;
 	at->src_node = sat.sat_addr.s_node = ap->s_node;
+	read_unlock_bh(&atalk_interfaces_lock);
 
 	n = atalk_pick_and_bind_port(sk, &sat);
 	if (!n)
 		sock_reset_flag(sk, SOCK_ZAPPED);
 out:
 	return n;
+unlock_and_out:
+	read_unlock_bh(&atalk_interfaces_lock);
+	goto out;
 }
 
 /* Set the address 'our end' of the connection */
@@ -1165,14 +1168,15 @@ static int atalk_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int a
 
 	lock_sock(sk);
 	if (addr->sat_addr.s_net == htons(ATADDR_ANYNET)) {
-		struct atalk_addr *ap = atalk_find_primary();
+		struct atalk_addr *ap = atalk_find_primary_locked();
 
 		err = -EADDRNOTAVAIL;
 		if (!ap)
-			goto out;
+			goto unlock_and_out;
 
 		at->src_net  = addr->sat_addr.s_net = ap->s_net;
 		at->src_node = addr->sat_addr.s_node = ap->s_node;
+		read_unlock_bh(&atalk_interfaces_lock);
 	} else {
 		err = -EADDRNOTAVAIL;
 		if (!atalk_find_interface(addr->sat_addr.s_net,
@@ -1201,6 +1205,9 @@ static int atalk_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int a
 out:
 	release_sock(sk);
 	return err;
+unlock_and_out:
+	read_unlock_bh(&atalk_interfaces_lock);
+	goto out;
 }
 
 /* Set the address we talk to */

--
2.43.0
