Netdev List
* [PATCH bpf v3 3/4] selftests/bpf: Adapt sockmap update error handling
From: Michal Luczaj @ 2026-07-01 23:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Cong Wang
  Cc: Michal Luczaj, bpf, linux-kselftest, linux-kernel, netdev
In-Reply-To: <20260702-sockmap-lookup-udp-leak-v3-0-ff8de8782468@rbox.co>

Update sockmap_listen to accommodate the recent change in sockmap that
rejects unbound UDP sockets.

TCP: Reject unbound and bound (unless established or listening).
UDP: Accept only bound sockets.

While at it, migrate to ASSERT_* and enforce reverse xmas tree.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 .../selftests/bpf/prog_tests/sockmap_listen.c       | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index cc0c68bab907..b87118aab7c4 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -53,8 +53,8 @@ static void test_insert_opened(struct test_sockmap_listen *skel __always_unused,
 			       int family, int sotype, int mapfd)
 {
 	u32 key = 0;
-	u64 value;
 	int err, s;
+	u64 value;
 
 	s = xsocket(family, sotype, 0);
 	if (s == -1)
@@ -63,11 +63,8 @@ static void test_insert_opened(struct test_sockmap_listen *skel __always_unused,
 	errno = 0;
 	value = s;
 	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
-	if (sotype == SOCK_STREAM) {
-		if (!err || errno != EOPNOTSUPP)
-			FAIL_ERRNO("map_update: expected EOPNOTSUPP");
-	} else if (err)
-		FAIL_ERRNO("map_update: expected success");
+	ASSERT_ERR(err, "map_update");
+	ASSERT_EQ(errno, EOPNOTSUPP, "errno");
 	xclose(s);
 }
 
@@ -77,8 +74,8 @@ static void test_insert_bound(struct test_sockmap_listen *skel __always_unused,
 	struct sockaddr_storage addr;
 	socklen_t len = 0;
 	u32 key = 0;
-	u64 value;
 	int err, s;
+	u64 value;
 
 	init_addr_loopback(family, &addr, &len);
 
@@ -93,8 +90,12 @@ static void test_insert_bound(struct test_sockmap_listen *skel __always_unused,
 	errno = 0;
 	value = s;
 	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
-	if (!err || errno != EOPNOTSUPP)
-		FAIL_ERRNO("map_update: expected EOPNOTSUPP");
+	if (sotype == SOCK_STREAM) {
+		ASSERT_ERR(err, "map_update");
+		ASSERT_EQ(errno, EOPNOTSUPP, "errno");
+	} else if (err) {
+		ASSERT_OK(err, "map_update");
+	}
 close:
 	xclose(s);
 }
@@ -1289,7 +1290,7 @@ static void test_ops(struct test_sockmap_listen *skel, struct bpf_map *map,
 		/* insert */
 		TEST(test_insert_invalid),
 		TEST(test_insert_opened),
-		TEST(test_insert_bound, SOCK_STREAM),
+		TEST(test_insert_bound),
 		TEST(test_insert),
 		/* delete */
 		TEST(test_delete_after_insert),

-- 
2.54.0



* [PATCH bpf v3 0/4] bpf, sockmap: Fix sockmap leaking UDP socks
From: Michal Luczaj @ 2026-07-01 23:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Cong Wang
  Cc: Michal Luczaj, bpf, linux-kselftest, linux-kernel, netdev

Fix for UDP sockets getting leaked during sockmap lookup/release.
Accompanied by selftests updates.

Two concerns raised by Sashiko are to be addressed separately:
https://lore.kernel.org/bpf/20260626205814.BAC3C1F000E9@smtp.kernel.org/

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v3:
- selftest: better error handling, ASSERT_*() macros [Sashiko]
- selftest: fix grammar, reorder patches [Kuniyuki]
- Link to v2: https://patch.msgid.link/20260626-sockmap-lookup-udp-leak-v2-0-7e7e201c951a@rbox.co

Changes in v2:
- selftest: drop the original, adapt old tests
- fix: change approach to rejecting unbound UDP [Kuniyuki]
- Link to v1: https://patch.msgid.link/20260623-sockmap-lookup-udp-leak-v1-0-05804f9308e4@rbox.co

To: Alexei Starovoitov <ast@kernel.org>
To: Daniel Borkmann <daniel@iogearbox.net>
To: Andrii Nakryiko <andrii@kernel.org>
To: Eduard Zingerman <eddyz87@gmail.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: Martin KaFai Lau <martin.lau@linux.dev>
To: Song Liu <song@kernel.org>
To: Yonghong Song <yonghong.song@linux.dev>
To: Jiri Olsa <jolsa@kernel.org>
To: Emil Tsalapatis <emil@etsalapatis.com>
To: Shuah Khan <shuah@kernel.org>
To: John Fastabend <john.fastabend@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
To: Jiayuan Chen <jiayuan.chen@linux.dev>
To: Eric Dumazet <edumazet@google.com>
To: Kuniyuki Iwashima <kuniyu@google.com>
To: Paolo Abeni <pabeni@redhat.com>
To: Willem de Bruijn <willemb@google.com>
To: "David S. Miller" <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
To: Simon Horman <horms@kernel.org>
To: Cong Wang <cong.wang@bytedance.com>
Cc: bpf@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org

---
Michal Luczaj (4):
      selftests/bpf: Ensure UDP sockets are bound
      bpf, sockmap: Reject unhashed UDP sockets on sockmap update
      selftests/bpf: Adapt sockmap update error handling
      selftests/bpf: Fail unbound UDP on sockmap update

 net/core/sock_map.c                                 |  2 ++
 .../selftests/bpf/prog_tests/sockmap_basic.c        |  6 +++---
 .../selftests/bpf/prog_tests/sockmap_listen.c       | 21 +++++++++++----------
 tools/testing/selftests/bpf/test_maps.c             | 13 ++++++-------
 4 files changed, 22 insertions(+), 20 deletions(-)
---
base-commit: c341792c9c7272cf91c8b17eae929caed7c2a732
change-id: 20260617-sockmap-lookup-udp-leak-bc4e5c5481d7

Best regards,
--  
Michal Luczaj <mhal@rbox.co>



* [PATCH bpf v3 4/4] selftests/bpf: Fail unbound UDP on sockmap update
From: Michal Luczaj @ 2026-07-01 23:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Cong Wang
  Cc: Michal Luczaj, bpf, linux-kselftest, linux-kernel, netdev
In-Reply-To: <20260702-sockmap-lookup-udp-leak-v3-0-ff8de8782468@rbox.co>

sockmap now rejects unbound UDP sockets. Adjust test_maps. While at it,
check socket()'s return value.

This effectively reverts commit c39aa2159974 ("bpf, selftests: Fix
test_maps now that sockmap supports UDP").

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/test_maps.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index c32da7bd8be2..6a2641ee7897 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -759,16 +759,15 @@ static void test_sockmap(unsigned int tasks, void *data)
 		goto out_sockmap;
 	}
 
-	/* Test update with unsupported UDP socket */
+	/* Test update with unsupported unbound UDP socket */
 	udp = socket(AF_INET, SOCK_DGRAM, 0);
-	i = 0;
-	err = bpf_map_update_elem(fd, &i, &udp, BPF_ANY);
-	if (err) {
-		printf("Failed socket update SOCK_DGRAM '%i:%i'\n",
-		       i, udp);
+	CHECK(udp < 0, "socket(AF_INET, SOCK_DGRAM)", "errno:%d\n", errno);
+	err = bpf_map_update_elem(fd, &(int){0}, &udp, BPF_ANY);
+	close(udp);
+	if (!err) {
+		printf("Unexpectedly succeeded unbound UDP update '0:%i'\n", udp);
 		goto out_sockmap;
 	}
-	close(udp);
 
 	/* Test update without programs */
 	for (i = 0; i < 6; i++) {

-- 
2.54.0



* [PATCH bpf v3 1/4] selftests/bpf: Ensure UDP sockets are bound
From: Michal Luczaj @ 2026-07-01 23:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Cong Wang
  Cc: Michal Luczaj, bpf, linux-kselftest, linux-kernel, netdev
In-Reply-To: <20260702-sockmap-lookup-udp-leak-v3-0-ff8de8782468@rbox.co>

Update sockmap_basic tests to bind sockets before they are used. This
accommodates the recent change in sockmap that rejects unbound UDP sockets.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index cb3229711f93..2d22a9058a8e 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -853,7 +853,7 @@ static void test_sockmap_many_socket(void)
 		return;
 	}
 
-	udp = xsocket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
+	udp = socket_loopback(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK);
 	if (udp < 0) {
 		close(dgram);
 		close(tcp);
@@ -922,7 +922,7 @@ static void test_sockmap_many_maps(void)
 		return;
 	}
 
-	udp = xsocket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
+	udp = socket_loopback(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK);
 	if (udp < 0) {
 		close(dgram);
 		close(tcp);
@@ -993,7 +993,7 @@ static void test_sockmap_same_sock(void)
 		return;
 	}
 
-	udp = xsocket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, 0);
+	udp = socket_loopback(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK);
 	if (udp < 0) {
 		close(dgram);
 		close(tcp);

-- 
2.54.0



* [PATCH bpf v3 2/4] bpf, sockmap: Reject unhashed UDP sockets on sockmap update
From: Michal Luczaj @ 2026-07-01 23:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	John Fastabend, Jakub Sitnicki, Jiayuan Chen, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Cong Wang
  Cc: Michal Luczaj, bpf, linux-kselftest, linux-kernel, netdev
In-Reply-To: <20260702-sockmap-lookup-udp-leak-v3-0-ff8de8782468@rbox.co>

UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.

Because sockmap accepts unbound UDP sockets, a BPF program can increment a
socket's refcount via lookup. If the socket is subsequently bound, the
transition from unbound to bound causes bpf_sk_release() to skip the
decrement of the refcount, causing a memory leak.

unreferenced object 0xffff88810bc2eb40 (size 1984):
  comm "test_progs", pid 2451, jiffies 4295320596
  hex dump (first 32 bytes):
    7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
    02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace (crc bdee079d):
    kmem_cache_alloc_noprof+0x557/0x660
    sk_prot_alloc+0x69/0x240
    sk_alloc+0x30/0x460
    inet_create+0x2ce/0xf80
    __sock_create+0x25b/0x5c0
    __sys_socket+0x119/0x1d0
    __x64_sys_socket+0x72/0xd0
    do_syscall_64+0xa1/0x5f0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
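
The unbalanced lookup/release described above can be illustrated with a
small user-space model. This is only a sketch, not kernel code: every
toy_* name is hypothetical, and the model assumes nothing beyond what the
text states (lookup and release each consult the refcounted predicate,
and binding flips that predicate).

```c
#include <stdbool.h>

/* Toy model, not kernel code: only the two fields the argument needs. */
struct toy_sock {
	int refcnt;
	bool rcu_free;	/* stands in for SOCK_RCU_FREE */
};

/* Models the idea of sk_is_refcounted(): RCU-freed sockets are not refcounted. */
static bool toy_is_refcounted(const struct toy_sock *sk)
{
	return !sk->rcu_free;
}

/* Lookup takes a reference only if the socket is refcounted at that moment. */
static void toy_lookup(struct toy_sock *sk)
{
	if (toy_is_refcounted(sk))
		sk->refcnt++;
}

/* Release drops a reference only if the socket is refcounted at that moment. */
static void toy_release(struct toy_sock *sk)
{
	if (toy_is_refcounted(sk))
		sk->refcnt--;
}

/* Binding sets the flag, which flips the answer of toy_is_refcounted(). */
static void toy_bind(struct toy_sock *sk)
{
	sk->rcu_free = true;
}

/* Returns how many references are left over after lookup/(bind)/release. */
static int toy_leaked_refs(bool bind_between)
{
	struct toy_sock sk = { .refcnt = 1, .rcu_free = false };

	toy_lookup(&sk);
	if (bind_between)
		toy_bind(&sk);
	toy_release(&sk);
	return sk.refcnt - 1;
}
```

In this model toy_leaked_refs(false) leaves no reference behind, while
toy_leaked_refs(true) leaves one, which corresponds to the leaked object
in the report above.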

Instead of special-casing for refcounted sockets, reject unhashed UDP
sockets during sockmap updates, as there is no benefit to supporting those.
This effectively reverts the commit under Fixes, with two exceptions:

1. sock_map_sk_state_allowed() maintains a fall-through `return true`.
2. In the spirit of commit b8b8315e39ff ("bpf, sockmap: Remove unhash
   handler for BPF sockmap usage"), the proto::unhash BPF handler is not
   reintroduced.

Historical note: this issue is related to commit 67312adc96b5 ("bpf: reject
unhashed sockets in bpf_sk_assign").

Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index c60ba6d292f9..9efbd8ca7db8 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -542,6 +542,8 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 {
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
+	if (sk_is_udp(sk))
+		return sk_hashed(sk);
 	if (sk_is_stream_unix(sk))
 		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	if (sk_is_vsock(sk) &&

-- 
2.54.0



* Re: [PATCH net] net/packet: avoid fanout hook re-registration after unregister
From: Willem de Bruijn @ 2026-07-01 23:34 UTC
  To: David Lee, david.lee, Willem de Bruijn
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Dominik 'Disconnect3d' Czarnota, Simon Horman, netdev,
	linux-kernel
In-Reply-To: <20260701113947.23180-1-david.lee@trailofbits.com>

David Lee wrote:
> packet_set_ring() temporarily detaches a socket from packet delivery while
> reconfiguring its ring. It records the previous running state, clears
> po->num, unregisters the protocol hook when needed, drops po->bind_lock,
> and later restores po->num and re-registers the hook from the saved
> was_running value.
> 
> That unlocked window can race with NETDEV_UNREGISTER. The notifier can
> observe the socket as not running, skip __unregister_prot_hook(), and
> invalidate the per-socket binding by setting po->ifindex to -1 and clearing
> po->prot_hook.dev. A one-member fanout group can still retain its shared
> fanout hook device pointer. When packet_set_ring() resumes, re-registering
> solely from the stale was_running state can re-add the fanout hook after
> the device has been unregistered.

Thanks for the report with fix.

> Treat po->ifindex == -1 as an invalidated binding after reacquiring
> po->bind_lock. Restore po->num as before, but do not re-register the hook
> if device unregister already detached the socket.

I guess the key here is that po->ifindex == -1 is not the normal
device-unbound state. That would be po->ifindex == 0, and those sockets
do need to be restored.

So this LGTM.
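
The distinction between the two ifindex values can be written down as a
small predicate. This is a user-space sketch under stated assumptions:
the TOY_* macros and toy_should_reregister() are hypothetical names, and
the function only models the decision, not the locking or the hook
registration itself.

```c
#include <stdbool.h>

/* Hypothetical labels for the two ifindex values discussed above. */
#define TOY_IFINDEX_UNBOUND      0	/* not bound to a device: still valid */
#define TOY_IFINDEX_INVALIDATED (-1)	/* device went away while bound */

/*
 * Decide whether the hook should be registered again once the lock has
 * been retaken. was_running is the state saved before the lock was dropped.
 */
static bool toy_should_reregister(bool was_running, int ifindex)
{
	return was_running && ifindex != TOY_IFINDEX_INVALIDATED;
}
```

A socket that was running and is simply unbound (ifindex 0) is restored,
while one whose binding was invalidated in the unlocked window is not.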

The bug here is having a stale fanout->prot_hook.dev.
If this is the only socket in a fanout group, then
__unregister_prot_hook calls __fanout_unlink, which calls
__dev_remove_pack(&f->prot_hook).

Then later __register_prot_hook calls __fanout_link and
dev_add_pack(&f->prot_hook), rather than checking po->prot_hook.dev,
which packet_notifier modified.

An alternative would be for __fanout_link to check

@@ -1518,7 +1518,8 @@ static void __fanout_link(struct sock *sk, struct packet_sock *po)
        rcu_assign_pointer(f->arr[f->num_members], sk);
        smp_wmb();
        f->num_members++;
-       if (f->num_members == 1)
+       if (f->num_members == 1 &&
+           f->prot_hook.dev == po->prot_hook.dev)
                dev_add_pack(&f->prot_hook);
        spin_unlock(&f->lock);
 }

This condition is true when the group is created, and it is checked on
every socket joining the group.

Or even simpler

+       if (f->num_members == 1 && READ_ONCE(po->ifindex) != -1)

But as said the patch as shared should work too.

Patches to net need a Fixes tag:

Fixes: dc99f600698d ("packet: Add fanout support.")
 
> Signed-off-by: Dominik 'Disconnect3d' Czarnota <dominik.czarnota@trailofbits.com>

I'm not entirely sure what the policy on nicknames is. But at best it
is uncommon. Consider dropping.

> Assisted-by: Codex:gpt-5
> ---
> Bug found and triaged by David Lee from Trail of Bits.

Instead, a Reported-by tag?

> Trail of Bits has a PoC that achieves local privilege escalation using this
> bug on a custom kernel config with CONFIG_LIST_HARDENED disabled, which can
> be shared further if needed.
> 
>  net/packet/af_packet.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 8e6f3a734ba0..000000000000 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -4561,7 +4561,11 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
>  
>  	spin_lock(&po->bind_lock);
>  	WRITE_ONCE(po->num, num);
> -	if (was_running)
> +	/*
> +	 * NETDEV_UNREGISTER may have invalidated the binding while bind_lock
> +	 * was dropped above.  Do not re-add a fanout hook to a dead device.
> +	 */
> +	if (was_running && READ_ONCE(po->ifindex) != -1)
>  		register_prot_hook(sk);
>  
>  	spin_unlock(&po->bind_lock);
> -- 
> 2.43.0



* [PATCH net v3] ipv4: igmp: remove multicast group from hash table on device destruction
From: Yuyang Huang @ 2026-07-01 23:50 UTC
  To: Yuyang Huang
  Cc: David S. Miller, Cong Wang, David Ahern, Eric Dumazet,
	Ido Schimmel, Jakub Kicinski, Paolo Abeni, Simon Horman,
	linux-kernel, netdev, stable

When a device is destroyed under RTNL, ip_mc_destroy_dev() iterates through
the multicast list and calls ip_ma_put() on each membership, scheduling
them for RCU reclamation. However, they are not unlinked from the device's
multicast hash table (mc_hash).

Since the device remains published in dev->ip_ptr until after
ip_mc_destroy_dev() completes, concurrent RCU readers traversing mc_hash
can still locate and access the multicast group after its refcount is
decremented. If the RCU callback runs and frees the group while a reader is
accessing it, a use-after-free occurs.

Fix this by unlinking the multicast group from mc_hash using
ip_mc_hash_remove() before scheduling it for reclamation.

BUG: KASAN: slab-use-after-free in ip_check_mc_rcu+0x149/0x3f0
Read of size 4 at addr ffff888009bf1408 by task mausezahn/2276

Call Trace:
 <IRQ>
 dump_stack_lvl+0x67/0x90
 print_report+0x175/0x7c0
 kasan_report+0x147/0x180
 ip_check_mc_rcu+0x149/0x3f0
 udp_v4_early_demux+0x36d/0x12d0
 ip_rcv_finish_core+0xb8b/0x1390
 ip_rcv_finish+0x54/0x120
 NF_HOOK+0x213/0x2b0
 __netif_receive_skb+0x126/0x340
 process_backlog+0x4f2/0xf00
 __napi_poll+0x92/0x2c0
 net_rx_action+0x583/0xc60
 handle_softirqs+0x236/0x7f0
 do_softirq+0x57/0x80
 </IRQ>

Allocated by task 2239:
 kasan_save_track+0x3e/0x80
 __kasan_kmalloc+0x72/0x90
 ____ip_mc_inc_group+0x31a/0xa40
 __ip_mc_join_group+0x334/0x3f0
 do_ip_setsockopt+0x16fa/0x2010
 ip_setsockopt+0x3f/0x90
 do_sock_setsockopt+0x1ad/0x300

Freed by task 0:
 kasan_save_track+0x3e/0x80
 kasan_save_free_info+0x40/0x50
 __kasan_slab_free+0x3a/0x60
 __rcu_free_sheaf_prepare+0xd4/0x220
 rcu_free_sheaf+0x36/0x190
 rcu_core+0x8d9/0x12f0
 handle_softirqs+0x236/0x7f0

Fixes: e9897071350b ("igmp: hash a hash table to speedup ip_check_mc_rcu()")
Cc: stable@vger.kernel.org
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
---
v3:
  - Target 'net' instead of 'net-next'.
  - Add the KASAN Use-After-Free traceback to the commit message.
v2:
  - Add Fixes tag in the commit message.

 net/ipv4/igmp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index b6337a47c141..d520ea4f6d14 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1922,6 +1922,7 @@ void ip_mc_destroy_dev(struct in_device *in_dev)
 #endif
 
 	while ((i = rtnl_dereference(in_dev->mc_list)) != NULL) {
+		ip_mc_hash_remove(in_dev, i);
 		in_dev->mc_list = i->next_rcu;
 		WRITE_ONCE(in_dev->mc_count, in_dev->mc_count - 1);
 		ip_mc_clear_src(i);
-- 
2.55.0.rc0.799.gd6f94ed593-goog



* Re: [PATCH iproute2-next v3 2/2] devlink: support u64-array values in devlink param show/set
From: David Ahern @ 2026-07-01 23:53 UTC
  To: Ratheesh Kannoth, stephen, kuba, linux-kernel, netdev
  Cc: andrew+netdev, edumazet, pabeni, jiri
In-Reply-To: <20260701031359.839221-3-rkannoth@marvell.com>

On 6/30/26 9:13 PM, Ratheesh Kannoth wrote:
> Add support for DEVLINK_VAR_ATTR_TYPE_U64_ARRAY parameters that carry
> multiple DEVLINK_ATTR_PARAM_VALUE_DATA attributes. Parse and display
> u64 array values in param show, and accept space- or comma-separated
> u64 values in devlink and port param set commands.
> 
>   - Show search order
> 
>   devlink dev param show pci/0002:01:00.0 name npc_srch_order
>   pci/0002:01:00.0:
>     name npc_srch_order type driver-specific
>       values:
>         cmode runtime value  value  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

"value value"? Is that correct, or is it a typo in the commit message?

You are dumping it as a string, which is wrong for JSON; it should be
an array.



* [PATCH] net/sched: cake: reject overhead values that underflow length
From: Samuel Moelius @ 2026-07-01 23:56 UTC
  To: Toke Høiland-Jørgensen
  Cc: Samuel Moelius, Jamal Hadi Salim, Jiri Pirko, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	moderated list:CAKE QDISC, open list:TC subsystem, open list

CAKE accepts signed overhead values and stores them in an s16, but the
adjusted packet length calculation uses unsigned arithmetic.  A negative
effective length can therefore wrap to a large value.

Such configurations make rate accounting depend on integer wraparound
rather than on the packet size userspace intended to model.  A static
netlink lower bound is not enough because packets reaching CAKE can be
smaller than any reasonable manual-overhead allowance.

Fold the signed overhead adjustment into the existing datapath MPU clamp
so negative adjusted lengths are clamped before link-layer framing
adjustments.
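
The wraparound and the clamp can be reproduced in plain user-space C.
The sketch below is illustrative only: the function names are
hypothetical, the parameter types (u32 length, s16 overhead, u16 MPU) are
assumptions modelled with <stdint.h> types, and lengths are assumed to be
far below INT32_MAX, as packet lengths are.

```c
#include <stdint.h>

/* Old shape: the signed overhead is added to an unsigned length, then clamped. */
static uint32_t adjust_len_unsigned(uint32_t len, int16_t overhead, uint16_t mpu)
{
	len += overhead;	/* overhead converts to uint32_t; a negative sum wraps */
	if (len < mpu)
		len = mpu;
	return len;
}

/* New shape: do the sum and the MPU clamp in signed arithmetic. */
static uint32_t adjust_len_signed(uint32_t len, int16_t overhead, uint16_t mpu)
{
	int32_t adj = (int32_t)len + overhead;
	int32_t floor = (int32_t)mpu;

	return (uint32_t)(adj > floor ? adj : floor);
}
```

With a 10-byte packet and an overhead of -64, the unsigned variant yields
a value just below 2^32 that the MPU check cannot catch, while the signed
variant clamps to the MPU. For lengths that stay positive after the
adjustment, both variants agree.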

Fixes: a729b7f0bd5b ("sch_cake: Add overhead compensation support to the rate shaper")
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
Changes in v3:
  - Adjust how check is performed
Changes in v2:
  - Add fixes tag

 net/sched/sch_cake.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index a3c185505afc..f78f8e950776 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1389,10 +1389,7 @@ static u32 cake_calc_overhead(struct cake_sched_data *qd, u32 len, u32 off)
 	if (qd->min_netlen > len)
 		WRITE_ONCE(qd->min_netlen, len);
 
-	len += q->rate_overhead;
-
-	if (len < q->rate_mpu)
-		len = q->rate_mpu;
+	len = max((s32)len + q->rate_overhead, (s32)q->rate_mpu);
 
 	if (q->atm_mode == CAKE_ATM_ATM) {
 		len += 47;
-- 
2.43.0



* Re: [PATCH bpf v3 3/4] selftests/bpf: Adapt sockmap update error handling
From: bot+bpf-ci @ 2026-07-02  0:00 UTC
  To: mhal, ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
	yonghong.song, jolsa, emil, shuah, john.fastabend, jakub,
	jiayuan.chen, edumazet, kuniyu, pabeni, willemb, davem, kuba,
	horms, cong.wang
  Cc: mhal, bpf, linux-kselftest, linux-kernel, netdev, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260702-sockmap-lookup-udp-leak-v3-3-ff8de8782468@rbox.co>


> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> index cc0c68bab907..b87118aab7c4 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c

[ ... ]

> @@ -93,8 +90,12 @@ static void test_insert_bound(struct test_sockmap_listen *skel __always_unused,
>  	errno = 0;
>  	value = s;
>  	err = bpf_map_update_elem(mapfd, &key, &value, BPF_NOEXIST);
> -	if (!err || errno != EOPNOTSUPP)
> -		FAIL_ERRNO("map_update: expected EOPNOTSUPP");
> +	if (sotype == SOCK_STREAM) {
> +		ASSERT_ERR(err, "map_update");
> +		ASSERT_EQ(errno, EOPNOTSUPP, "errno");
> +	} else if (err) {
> +		ASSERT_OK(err, "map_update");
> +	}

This isn't a bug, but could the UDP branch just be
`else ASSERT_OK(err, "map_update");`? The `if (err)` guard looks
redundant, since ASSERT_OK is a no-op on success.

>  close:
>  	xclose(s);
>  }

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28554911192


* Re: Ethtool : PRBS feature
From: Lee Trager @ 2026-07-01 23:15 UTC
  To: Das, Shubham, Alexander Duyck, Andrew Lunn
  Cc: Maxime Chevallier, netdev@vger.kernel.org, mkubecek@suse.cz,
	D H, Siddaraju, Chintalapalle, Balaji, Lindberg, Magnus,
	niklas.damberg@ericsson.com
In-Reply-To: <SN7PR11MB810921E7DA70DB3F6FD1C41DFFE82@SN7PR11MB8109.namprd11.prod.outlook.com>

On 6/29/26 9:15 AM, Das, Shubham wrote:

> Hi All,
>
> Below are the proposed modifications to the UAPI, data structures, and Netlink messages to support PRBS/BERT and test pattern configuration.
>
> diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
> index 5e9135e3774f..cb11e139dd81 100644
> --- a/Documentation/netlink/specs/ethtool.yaml
> +++ b/Documentation/netlink/specs/ethtool.yaml
> @@ -30,6 +30,36 @@ definitions:
> +    name: phy-test-pattern
> +    enum-name: phy-test-pattern
> +    type: enum
> +    name-prefix: phy-test-pattern-
> +    doc: PRBS and other PHY test patterns
> +    entries:
> +      - off
> +      - prbs7
> +      - prbs9
> +      - prbs11

fbnic supports a number of other tests as well. Getting the full list of
common PRBS tests codified would be ideal.

- prbs11.0
- prbs11.1
- prbs11.2
- prbs11.3

> +      - prbs13

- prbs13.0
- prbs13.1
- prbs13.2
- prbs13.3

> +      - prbs15
- prbs16
> +      - prbs23
> +      - prbs31
- prbs32
> +      - ssprq
> +      - prbs13q
> +      - prbs31q
> +      - square
> +  -
> +    name: phy-test-action
> +    enum-name: phy-test-action
Each lane and each direction is a completely separate test with its own
set of statistics. The test is actually verified on the Rx side; Tx is
your generator, so you won't have data to collect. So when you run PRBS
testing on a 2-lane NIC you are actually running 4 independent tests.
While it's fine to have a shortcut to run the same test on all lanes, we
absolutely need a way to run tests per lane and the ability to choose
Rx, Tx, or both.
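
To make the counting concrete, here is a minimal sketch that enumerates
one test per (lane, direction) pair. It is not a proposed UAPI: the
toy_* types and toy_enumerate_tests() are hypothetical names used only
to illustrate why a 2-lane NIC ends up with 4 independent tests.

```c
#include <stddef.h>

enum toy_dir { TOY_DIR_RX, TOY_DIR_TX, TOY_DIR_COUNT };

/* One independent test instance: a single lane in a single direction. */
struct toy_lane_test {
	unsigned int lane;
	enum toy_dir dir;
};

/* Fill out[] with one entry per (lane, direction); returns entries written. */
static size_t toy_enumerate_tests(unsigned int lanes,
				  struct toy_lane_test *out, size_t cap)
{
	size_t n = 0;

	for (unsigned int lane = 0; lane < lanes; lane++) {
		for (int dir = TOY_DIR_RX; dir < TOY_DIR_COUNT; dir++) {
			if (n == cap)
				return n;	/* caller's buffer is full */
			out[n].lane = lane;
			out[n].dir = (enum toy_dir)dir;
			n++;
		}
	}
	return n;
}
```

An "all lanes" shortcut would then simply be a loop over these entries,
while a per-lane request selects a single one.
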
> +    type: enum
> +    name-prefix: phy-test-action-
> +    doc: Actions for PHY BERT test control
> +    entries:
> +      - none
> +      - start
> +      - stop
> +      - stats
I wouldn't consider stats a phy-test action. It shouldn't change the
state of the NIC at all. I would just add phy-test-stats as a set of
standard ethtool statistics.
>     -
>       name: header-flags
>       type: flags
> @@ -1818,6 +1848,58 @@ attribute-sets:
>           type: u32
>           enum: loopback-type
>   
> +  -
> +    name: phy-test
> +    attr-cnt-name: __ethtool-a-phy-test-cnt
> +    doc: |
> +      PHY test configuration for pattern generation/checking,
> +      BERT (Bit Error Rate Test), and statistics.
> +    attributes:
> +      -
> +        name: unspec
> +        type: unused
> +        value: 0
> +      -
> +        name: header
> +        type: nest
> +        nested-attributes: header
> +      -
> +        name: tx-pattern
> +        type: u32
> +        doc: TX test pattern type (PRBS or square wave)
> +        enum: phy-test-pattern
> +      -
> +        name: rx-pattern
> +        type: u32
> +        doc: RX checker pattern type (PRBS or square wave)
> +        enum: phy-test-pattern
> +      -
> +        name: bert-action
> +        type: u32
> +        doc: BERT test start/stop/stats
> +        enum: phy-test-action
> +      -
> +        name: inject-error-count
> +        type: u32
> +        doc: |
> +          Number of errors to inject. Each invocation injects the specified
> +          number of bit errors into the data stream.
> +      -
> +        name: ber-lock-status
> +        type: u8
> +        doc: PRBS lock status (1=locked, 0=not locked)
> +      -
> +        name: ber-error-count
> +        type: u64
> +        doc: BERT bit error count
> +      -
> +        name: ber-total-bits-sent
> +        type: u64
> +        doc: BERT total bits tested
> +      -
> +        name: supported-test-patterns
> +        type: u32
> +        doc: Bitmask of supported test patterns
Again, all of this needs to be per lane.
>   
>     -
>       name: phy-tunable
> @@ -2924,6 +3006,53 @@ operations:
>              - header
>              - enabled
>              - type
> +    -
> +      name: phy-test-act
> +      doc: |
> +        Configure PHY test parameters. Each attribute is optional and only
> +        specified attributes are applied. TX/RX patterns are set on the
> +        local port. BERT and error injection operate on the receiver port.
> +        When bert-action is stats, a reply with BERT counters is returned.
> +        Typical workflow:
> +          ethtool --phy-test eth1 tx-pattern prbs7  (TX side)
> +          ethtool --phy-test eth2 rx-pattern prbs7  (RX side)
> +          ethtool --phy-test eth2 bert start        (start BERT on RX)
> +          ethtool --phy-test eth2 bert stats        (read counters and lock status)
> +          ethtool --phy-test eth2 bert stop         (stop BERT)
> +
> +      attribute-set: phy-test
> +
> +      do:
> +        request:
> +          attributes:
> +            - header
> +            - tx-pattern
> +            - rx-pattern
> +            - bert-action
> +            - inject-error-count
> +        reply:
> +          attributes:
> +            - header
> +            - ber-lock-status
> +            - ber-error-count
> +            - ber-total-bits-sent
> +    -
> +      name: phy-test-get
> +      doc: |
> +        Get PHY test configuration status and supported patterns.
> +
> +      attribute-set: phy-test
> +
> +      do:
> +        request:
> +          attributes:
> +            - header
> +        reply:
> +          attributes:
> +            - header
> +            - tx-pattern
> +            - rx-pattern
> +            - supported-test-patterns
>   
>   mcast-groups:
>     list:
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index 1ac85b8aebd7..3bcca506cf7b 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
>
> +/* Bitmask of which ethtool_phy_test fields were explicitly specified */
> +#define PHY_TEST_CMD_TX_PATTERN		    BIT(0)
> +#define PHY_TEST_CMD_RX_PATTERN		    BIT(1)
> +#define PHY_TEST_CMD_BERT_ACTION	    BIT(2)
> +#define PHY_TEST_CMD_INJECT_COUNT	    BIT(3)
> +
> +/**
> + * struct ethtool_phy_test - PHY test configuration and status
> + * @cmd: Bitmask of PHY_TEST_CMD_* indicating which fields to apply (SET)
> + * @tx_pattern: TX test pattern
> + * @rx_pattern: RX checker pattern
> + * @bert_action: BERT start/stop/stats action
> + * @inject_error_count: Number of bit errors to inject (SET only)
> + * @supported_test_patterns: Bitmask of supported patterns (GET only)
> + * @ber_lock_status: BER lock status 1=locked, 0=not locked (GET only)
> + * @ber_error_count: BERT bit error count (GET only)
> + * @ber_total_bits_sent: BERT total bits tested (GET only)
> + */
> +struct ethtool_phy_test {
> +	u32 cmd;
> +	enum phy_test_pattern tx_pattern;
> +	enum phy_test_pattern rx_pattern;
> +	enum phy_test_action bert_action;
> +	u32 inject_error_count;
> +	u32 supported_test_patterns;
> +	u8 ber_lock_status;
> +	u64 ber_error_count;
> +	u64 ber_total_bits_sent;
> +};
> +
>   /**
>    * struct ethtool_ops - optional netdev operations
>    * @supported_input_xfrm: supported types of input xfrm from %RXH_XFRM_*.
> @@ -1091,7 +1121,8 @@ struct ethtool_loopback {
>    * @get_mm: Query the 802.3 MAC Merge layer state.
>    * @set_mm: Set the 802.3 MAC Merge layer parameters.
>    * @get_mm_stats: Query the 802.3 MAC Merge layer statistics.
> - *
> + * @get_phy_test: Get PHY test status, patterns, and BERT counters.
> + * @set_phy_test: Configure PHY test (pattern, BERT, error injection).
> + *
>    * All operations are optional (i.e. the function pointer may be set
>    * to %NULL) and callers must take this into account.  Callers must
>    * hold the RTNL lock.
> @@ -1260,6 +1291,10 @@ struct ethtool_ops {
>   	void	(*get_mm_stats)(struct net_device *dev, struct ethtool_mm_stats *stats);
> +	int	(*get_phy_test)(struct net_device *dev,
> +				struct ethtool_phy_test *test);
> +	int	(*set_phy_test)(struct net_device *dev,
> +				struct ethtool_phy_test *test);
>   };
>
>
> The 'tx_prbs' and 'rx_prbs' command parameters have been renamed to 'tx_pattern' and 'rx_pattern' to allow support
> for additional test patterns defined in the RFC, such as square patterns, in addition to PRBS.
>
> The statistics have been moved to the 'ber' test command.
>
> I also think it would be better to expose 'tx_pattern' and 'rx_pattern' as separate commands,
> since the TX and RX ports can be different. They are only the same when operating in loopback mode.
>
>
>> You need to think about the units for inject errors. There is no floating point support. Also, is this corrupt packets?
>> Or single bit flips in the stream? It needs to be well defined what it actually means. The driver can then convert it to whatever the hardware supports. How does 802.3 specify this?
> I believe it is not mentioned in the IEEE specs, but it would be helpful for debugging in both data and PRBS mode.
> Maybe the command could take the number of errors to inject into the stream, rather than an error rate?
>
>
>> Traditionally, Unix does not offer a way to clear statistic counters back to zero. So i'm not sure about clear-stats.
>> We also need to think about hardware which does not support that. And there is locking issues, can the stats be cleared while a test is active?
> I think we can auto clear in PHY FW or in implementation when we start the test.
>
> Also, as previously suggested, we need a new status to indicate that the net device is under test.
>   
> - Shubham D
>
>> -----Original Message-----
>> From: Alexander Duyck <alexander.duyck@gmail.com>
>> Sent: 24 June 2026 21:06
>> To: Andrew Lunn <andrew@lunn.ch>
>> Cc: Lee Trager <lee@trager.us>; Das, Shubham <shubham.das@intel.com>;
>> Maxime Chevallier <maxime.chevallier@bootlin.com>; netdev@vger.kernel.org;
>> mkubecek@suse.cz; D H, Siddaraju <siddaraju.dh@intel.com>; Chintalapalle,
>> Balaji <balaji.chintalapalle@intel.com>; Lindberg, Magnus
>> <magnus.k.lindberg@ericsson.com>; niklas.damberg@ericsson.com
>> Subject: Re: Ethtool : PRBS feature
>>
>> On Tue, Jun 23, 2026 at 7:30 PM Andrew Lunn <andrew@lunn.ch> wrote:
>>>>      To avoid race conditions, maybe some of these commands need combining.
>>>>      ethtool --phy-test eth1 tx-prbs prbs7 rx-prbs prbs7 bert start
>>>>
>>>>      The configuration is then atomic, with respect to the uAPI, so we
>>>>      don't get two users configuring it at the same time, ending up with a
>>>>      messed up configuration.
>>>>
>>>> Testing consumes the link so you really don't want anything done to
>>>> the netdev while testing is running. fbnic does the following.
>>>>
>>>> 1. Testing cannot start when the link is up
>>> That is not going to work in the generic case. Many MAC drivers don't
>>> bind to their PCS or PHY until open() is called. So there is no way to
>>> pass the uAPI calls onto the PCS or PHY if the interface is down.
>>> There are also some MACs which connect to multiple PCSs, and there can
>>> be multiple PHYs. So you need to somehow indicate which PCS/PHY should
>>> perform the PRBS. There was a discussion about loopback recently,
>>> which has the same issue, you can perform loopback testing in multiple
>>> places. So i expect the same concept will be used for this.
>> I would think something like this would still be usable. You would just need to
>> specify the phy address and possibly device address in the case that you support
>> doing such testing at multiple layers.
>> Basically it would be up to the driver to provide a way to connect the request with
>> the desired interface. I would imagine something similar is the case for the
>> loopback handling since there are so many layers where you can hairpin things
>> back to the port it came in on.
>>
>>>> 2. Once testing starts the driver removes the netdev to prevent use.
>>>> The netdev is only added back when testing stops. The upstream
>>>> solution will need something that can keep the netdev but lock
>>>> everything down while testing is running.
>>> Probably IF_OPER_TESTING would be part of this. If the interface is in
>>> this state, you want many other things blocked. However, probably
>>> ksettings get/set need to work, so you can force the link into a
>>> specific mode.
>> I would imagine it depends on if you want to enforce ordering on this or not. I
>> would say the set would probably need to be blocked as you wouldn't normally
>> want to be changing the setting in the middle of a test as it would cause the error
>> stats to climb quickly.
>>
>>>> 3. Once testing starts you cannot change the test, even on an
>>>> individual lane basis. You must stop testing first.
>>>>
>>>>
>>>>      Traditionally, Unix does not offer a way to clear statistic counters
>>>>      back to zero. So i'm not sure about clear-stats. We also need to think
>>>>      about hardware which does not support that. And there is locking
>>>>      issues, can the stats be cleared while a test is active?
>>>>
>>>> fbnic actually has separate registers for PRBS test results. Results
>>>> do need to be clean between runs but I never created an explicit
>>>> clear interface. Firmware automatically reset the registers when a
>>>> new test was started. This also allows results to be viewed after
>>>> testing has stopped.
>>> We should really take 802.3 as the model, but i've not had time yet to
>>> read what it says about the statistics.
>> I think most of this is all called out in the IEEE 802.3-2022 spec under section
>> 45.2.1.169 - 45.2.1.174. Basically the ability and controls live in the 1500 range,
>> Tx error statistics in the 1600, and Rx statistics in the 1700 range.
>>
>>>> Reading results was a little tricky due to roll over between two
>>>> 32bit registers.
>>> 802.3 makes this even more interesting, since those registers are 16
>>> bits.
>> Yeah, normally to deal with something like that we would likely be looking at
>> having to maintain a fairly high read frequency. Although in theory the error
>> counts shouldn't be climbing that fast anyway. The spec calls out that the registers
>> are clear on read and held at ~0 in the event of overflow which would be a failing
>> case for any reasonable test anyway.
>>
>>>> When I spoke to hardware engineers at Meta they did not want a
>>>> timeout. Testing often occurred over days, so they wanted to be able
>>>> to start it and explicitly stop it. I'm not against a time out but I
>>>> do think it should be optional.
>>>> Since PRBS testing is handled by firmware one safety measure I added
>>>> is if firmware lost contact with the host testing was automatically
>>>> stopped and TX FIR values were reset to factory. This ensured that
>>>> the NIC won't get stuck in testing and on initialization the driver
>>>> doesn't have to worry about testing state.
>>> That will work for firmware, but not when Linux is driving the
>>> hardware. I don't know if netlink will allow it, or if RTNL will get
>>> in the way etc, but it could be we actually don't want a start and
>>> stop commands at all, it is a blocking netlink call, and the test runs
>>> until the user space process closes the socket?
>> What we would probably need to do is look at testing as a state rather than an
>> operation. Basically the NIC would be put into the testing state and as a result it
>> would just be sitting there emitting whatever test pattern it is supposed to emit,
>> and validating it is receiving the pattern it expects to receive.
>>
>> The statistics could probably just be a subset of the PHY statistics that could be
>> collected separately. Actually now that I think about it I wonder if we couldn't
>> look at putting together the interface similar to how we currently handle FEC
>> where you have the --set-fec interface to configure things and the --show-fec
>> interface with the -I option to show the current state and also dump the
>> statistics.
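
The counter handling discussed above (16-bit registers that are clear on read and held at all-ones on overflow) could be accumulated in software roughly as follows. This is only a sketch under those stated assumptions; struct bert_counter and bert_counter_add() are hypothetical names, not part of the RFC or of any existing driver, and the raw value is assumed to come from a single register read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software accumulator for one clear-on-read error counter. */
struct bert_counter {
	uint64_t total;      /* sum of all values read so far */
	bool     overflowed; /* set once a read returned the saturated value */
};

/*
 * Fold one raw register read into the accumulator.
 *
 * Because the register is assumed to clear on read, each raw value is a
 * delta since the previous read and can simply be added.  A reading of
 * 0xFFFF means the hardware counter saturated, so the true delta is
 * unknown and the total is only a lower bound from then on.
 */
static void bert_counter_add(struct bert_counter *c, uint16_t raw)
{
	if (raw == UINT16_MAX)
		c->overflowed = true;
	c->total += raw;
}
```

A caller would poll often enough that saturation is rare, and report the overflowed flag alongside the total so userspace can tell an exact count from a lower bound.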

^ permalink raw reply

* [PATCH v3] net/sched: cake: reject overhead values that underflow length
From: Samuel Moelius @ 2026-07-02  0:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Samuel Moelius, Jamal Hadi Salim, Jiri Pirko, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	moderated list:CAKE QDISC, open list:TC subsystem, open list

CAKE accepts signed overhead values and stores them in an s16, but the
adjusted packet length calculation uses unsigned arithmetic.  A negative
effective length can therefore wrap to a large value.

Such configurations make rate accounting depend on integer wraparound
rather than on the packet size userspace intended to model.  A static
netlink lower bound is not enough because packets reaching CAKE can be
smaller than any reasonable manual-overhead allowance.

Fold the signed overhead adjustment into the existing datapath MPU clamp
so negative adjusted lengths are clamped before link-layer framing
adjustments.

Fixes: a729b7f0bd5b ("sch_cake: Add overhead compensation support to the rate shaper")
Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
Changes in v3:
  - Adjust how check is performed
  - Resend with v3 suffix
Changes in v2:
  - Add fixes tag

 net/sched/sch_cake.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index a3c185505afc..f78f8e950776 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1389,10 +1389,7 @@ static u32 cake_calc_overhead(struct cake_sched_data *qd, u32 len, u32 off)
 	if (qd->min_netlen > len)
 		WRITE_ONCE(qd->min_netlen, len);
 
-	len += q->rate_overhead;
-
-	if (len < q->rate_mpu)
-		len = q->rate_mpu;
+	len = max((s32)len + q->rate_overhead, (s32)q->rate_mpu);
 
 	if (q->atm_mode == CAKE_ATM_ATM) {
 		len += 47;
-- 
2.43.0
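
For readers following along, the wraparound described in the commit message can be reproduced in userspace. The sketch below is not the kernel code: cake_len_old() and cake_len_new() are hypothetical stand-ins that mirror the removed and added lines of the hunk, with the overhead passed as an int16_t (matching the s16 mentioned above) and the MPU as a small unsigned value.

```c
#include <stdint.h>

/* Mirrors the removed lines: the signed overhead is added to an unsigned
 * length, so a negative result wraps to a huge value and the MPU floor
 * never triggers. */
static uint32_t cake_len_old(uint32_t len, int16_t overhead, uint16_t mpu)
{
	len += overhead;
	if (len < mpu)
		len = mpu;
	return len;
}

/* Mirrors the added line: do the addition in signed 32-bit arithmetic and
 * clamp to the MPU, so a negative adjusted length becomes the MPU.
 * Assumes len fits in an int32_t, which holds for packet lengths. */
static uint32_t cake_len_new(uint32_t len, int16_t overhead, uint16_t mpu)
{
	int32_t adjusted = (int32_t)len + overhead;
	int32_t floor = (int32_t)mpu;

	return (uint32_t)(adjusted > floor ? adjusted : floor);
}
```

With a 40-byte packet, an overhead of -64 and an MPU of 0, the old form yields 4294967272 while the new form yields 0; for ordinary positive overheads both forms agree.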


^ permalink raw reply related

* Re: [RFC] connectat()/bindat() or an alternative design
From: Cong Wang @ 2026-07-02  0:32 UTC (permalink / raw)
  To: John Ericson
  Cc: Li Chen, Andy Lutomirski, Christian Brauner, Jens Axboe,
	network dev, linux-fsdevel
In-Reply-To: <e396ce86-ec84-4189-9da2-98af7cfa6c41@app.fastmail.com>

On Tue, Jun 30, 2026 at 04:22:25PM -0400, John Ericson wrote:
> I'm bumping this and adding new recipients again in light of the
> discussion happening elsewhere in
> <https://lore.kernel.org/all/a49ce818-f38d-41b0-bbf7-80b8aad998b1@app.fastmail.com/>.
> I don't want to count my chickens before they are hatched, but it is
> looking to me like a consensus in that thread is building around the
> ability to opt into intentionally empty/unusable root and working
> directories (at least with nullfs, maybe but less likely with other
> mechanisms instead).
> 
> That new functionality concretizes the motivation for what I am
> proposing in this thread: in such a world, there is little to no point
> binding listening sockets in the file system, because the containing
> directory would have to be conveyed by file descriptor anyways --- might
> as well just directly convey the socket to connect to by file
> descriptor. Likewise, abstract sockets are not appealing, because the
> abstract socket namespace is either too coarse-grained (leaking info in
> the same way root/cwd would), or too cumbersome to keep it from leaking.
> 
> To recap (with some slight changes, like renames), my latest proposal (a
> new version, not either of the two variations in the original email) is
> new syscalls `bind_unix_anon` and `connectat`, supporting a workflow
> like this:
> 
>     /* server */
>     int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
>     int addrfd = bind_unix_anon(
>             lfd,
>             /*flags, for the future*/0);
>     listen(lfd, 64);
> 
>     /* client, handed `addrfd` */
>     int cfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
>     connectat(addrfd, cfd, AT_EMPTY_PATH);
> 
> Or, more radically, `bind_unix_anon` and `connectat` could let one skip
> the initial `socket` calls by returning those new sockets directly:

Hm? Why not just setsockopt()? Something like:

struct unix_lookup_ctx {
    int dirfd;
    __u64 resolve_flags;   /* RESOLVE_BENEATH, RESOLVE_NO_SYMLINKS, etc. */
    __u64 op_flags;        /* maybe future */
};

setsockopt(fd, SOL_UNIX, UNIX_NEXT_LOOKUP,
           &ctx, sizeof(ctx));

bind(fd, (struct sockaddr *)&addr, addrlen);
/* or connect(fd, ...) */


No new syscall is needed.

Regards,
Cong

^ permalink raw reply

* Re: [PATCH v2 6.6.y/6.12.y/6.18.y] af_unix: Set gc_in_progress to true in unix_gc().
From: Sasha Levin @ 2026-07-02  0:38 UTC (permalink / raw)
  To: stable
  Cc: Sasha Levin, kuniyu, kuba, pabeni, davem, edumazet, netdev,
	sysroot314
In-Reply-To: <20260701065306.281809-1-sysroot314@gmail.com>

> [ Upstream commit d82ba05263c69fa2437fe93e4e561cc40f4c03af ]
>
> Igor Ushakov reported that unix_gc() could run with gc_in_progress
> being false if the work is scheduled while running:

Queued for 6.18.y, 6.12.y, and 6.6.y, thanks.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 5.10] net: cpsw_new: Fix potential unregister of netdev that has not been registered yet
From: Sasha Levin @ 2026-07-02  0:38 UTC (permalink / raw)
  To: stable, Greg Kroah-Hartman
  Cc: Sasha Levin, Elizaveta Tereshkina, Grygorii Strashko,
	David S. Miller, Jakub Kicinski, Kevin Hao, Alexander Sverdlin,
	Wenshan Lan, Ilias Apalodimas, Murali Karicheri, linux-omap,
	netdev, linux-kernel, lvc-project
In-Reply-To: <20260630200717.1994713-1-etereshkina@astralinux.ru>

> If an error occurs during register_netdev() for the first MAC in
> cpsw_register_ports(), even though cpsw->slaves[0].ndev is set to NULL,
> cpsw->slaves[1].ndev would remain unchanged. This could later cause
> cpsw_unregister_ports() to attempt unregistering the second MAC.

Queued for 5.10.y, thanks.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 5.10.y] net/sched: fix pedit partial COW leading to page cache corruption
From: Sasha Levin @ 2026-07-02  0:38 UTC (permalink / raw)
  To: dominique.martinet, stable
  Cc: Sasha Levin, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Jakub Kicinski, Mat Martineau, Paolo Abeni,
	netdev, linux-kernel, Rajat Gupta, Yiming Qian, Keenan Dong,
	Han Guidong, Zhang Cen, Davide Caratti,
	Toke Høiland-Jørgensen, Victor Nogueira
In-Reply-To: <20260630-cve-2026-46331-v1-2-c1986f356f26@atmark-techno.com>

> [Dominique: plenty of context conflict but the code itself could still
> mostly be used]

Thanks! However, the same upstream fix (899ee91156e5) is already queued for
5.10.y via Wentao Guan's backport, which brings in the act_pedit
RCU/percpu-stats prerequisites first and then applies the fix nearly
verbatim.

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 5.10.y] net/sched: fix pedit partial COW leading to page cache corruption
From: Dominique Martinet @ 2026-07-02  1:00 UTC (permalink / raw)
  To: Sasha Levin
  Cc: stable, Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	Jakub Kicinski, Mat Martineau, Paolo Abeni, netdev, linux-kernel,
	Rajat Gupta, Yiming Qian, Keenan Dong, Han Guidong, Zhang Cen,
	Davide Caratti, Toke Høiland-Jørgensen, Victor Nogueira
In-Reply-To: <stable-reply-pedit-cow-510-20260701193800@kernel.org>

Sasha Levin wrote on Wed, Jul 01, 2026 at 08:38:34PM -0400:
> > [Dominique: plenty of context conflict but the code itself could still
> > mostly be used]
> 
> Thanks! However, the same upstream fix (899ee91156e5) is already queued for
> 5.10.y via Wentao Guan's backport, which brings in the act_pedit
> RCU/percpu-stats prerequisites first and then applies the fix nearly
> verbatim.

Oh, thanks!
Sorry for the noise, I'll remember to check next time.

-- 
Dominique

^ permalink raw reply

* [PATCH ipsec] xfrm6: clear dst.dev on error to avoid double netdev_put in xfrm6_fill_dst()
From: Xiang Mei (Microsoft) @ 2026-07-02  1:05 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu, David S . Miller, netdev
  Cc: Simon Horman, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	AutonomousCodeSecurity, tgopinath, kys, Xiang Mei (Microsoft)

On the error path where in6_dev_get(dev) returns NULL, xfrm6_fill_dst()
releases the device reference with netdev_put() but leaves
xdst->u.dst.dev set. dst_destroy() later calls netdev_put(dst->dev)
again, so the same net_device reference is released twice, underflowing
its refcount (ref_tracker WARNING + "unregister_netdevice: waiting for
<dev> to become free").

Clear xdst->u.dst.dev after the netdev_put(), the same way the XFRM
device-offload paths xfrm_dev_state_add() and xfrm_dev_policy_add() in
net/xfrm/xfrm_device.c NULL ->dev when releasing the reference on error.

  ref_tracker: reference already released.
  ref_tracker: allocated in:
   xfrm6_fill_dst (net/ipv6/xfrm6_policy.c:86)
   ...
   udpv6_sendmsg (net/ipv6/udp.c:1696)
   ...
  ref_tracker: freed in:
   xfrm6_fill_dst (net/ipv6/xfrm6_policy.c:90)
   ...
  WARNING: lib/ref_tracker.c:322 at ref_tracker_free+0x58b/0x780
   dst_destroy (net/core/dst.c:115)
   rcu_core
   handle_softirqs
   ...

Fixes: 84c4a9dfbf43 ("xfrm6: release dev before returning error")
Reported-by: AutonomousCodeSecurity@microsoft.com
Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu>
---
 net/ipv6/xfrm6_policy.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 125ea9a5b8a0..3b749475f6ed 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -88,6 +88,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 	xdst->u.rt6.rt6i_idev = in6_dev_get(dev);
 	if (!xdst->u.rt6.rt6i_idev) {
 		netdev_put(dev, &xdst->u.dst.dev_tracker);
+		xdst->u.dst.dev = NULL;
 		return -ENODEV;
 	}
 
-- 
2.43.0
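
The double release described in the commit message can be modelled outside the kernel. The following is a simplified sketch, not the xfrm code: dev_model, dst_model and the helper functions are hypothetical, the reference count is a plain int, and the clear_dev flag selects between the behaviour before and after the one-line fix.

```c
#include <stddef.h>

struct dev_model { int refcnt; };            /* stand-in for net_device */
struct dst_model { struct dev_model *dev; }; /* stand-in for dst_entry  */

/* Like netdev_put(), a NULL device is ignored. */
static void dev_put_model(struct dev_model *dev)
{
	if (dev)
		dev->refcnt--;
}

/* Models the failing path of the fill function: a reference is taken for
 * dst->dev, the later lookup fails, and the reference is dropped again.
 * With clear_dev == 0 the stale pointer is left behind. */
static int fill_dst_fail(struct dst_model *dst, struct dev_model *dev,
			 int clear_dev)
{
	dst->dev = dev;
	dev->refcnt++;
	/* ... the in6_dev_get() equivalent returned NULL here ... */
	dev_put_model(dev);
	if (clear_dev)
		dst->dev = NULL;
	return -1;
}

/* Models the destroy path, which drops whatever dst->dev still points to. */
static void dst_destroy_model(struct dst_model *dst)
{
	dev_put_model(dst->dev);
	dst->dev = NULL;
}
```

Starting from a device with one outstanding reference, running the failing fill followed by the destroy leaves the count at 0 without the clear (one release too many) and at 1 with it.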


^ permalink raw reply related

* Re: [PATCH] af_unix: mark MSG_SPLICE_PAGES frags shared
From: 钱一铭 @ 2026-07-02  2:05 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: security, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev, linux-kernel, lingchen5202005
In-Reply-To: <CAAVpQUC_EcpFf9GmNnD07KhFAOcPhrCa5xNF-idz0CsB=TJV9A@mail.gmail.com>

Thanks, I agree plain AF_UNIX does not by itself provide the writer side.

  The concern was that MSG_SPLICE_PAGES imports externally owned pages into
  skb frags, while unlike TCP/UDP/KCM this path does not mark them with
  SKBFL_SHARED_FRAG. I checked the AF_UNIX sockmap path as well, but I do
  not currently have a concrete in-tree chain where those frags reach a
  writer that skips COW based on skb_has_shared_frag().

  So this should be treated as a defensive consistency cleanup rather than
  a security fix. I will drop the Fixes tag and the duplicate Reported-by
  tags in v2.

Kuniyuki Iwashima <kuniyu@google.com> 于2026年6月30日周二 23:51写道:
>
> On Tue, Jun 30, 2026 at 12:06 AM Yiming Qian <yimingqian591@gmail.com> wrote:
> >
> > unix_stream_sendmsg() splices external pages directly into skb frags when
> > MSG_SPLICE_PAGES is set, but it does not propagate SKBFL_SHARED_FRAG
> > afterward.
>
> I think it doesn't matter with the plain AF_UNIX.
>
> Please elaborate on the scenario where this could be a problem.
> e.g. sockmap ?
>
>
> >
> > That leaves later writers without the shared-frag marker even though the
> > skb still references externally owned pages.
> >
> > Set the marker after a successful skb_splice_from_iter() call.
> >
> > Fixes: a0dbf5f818f90 ("af_unix: Support MSG_SPLICE_PAGES")
> > Reported-by: Yiming Qian <yimingqian591@gmail.com>
> > Reported-by: Can Liu <lingchen5202005@gmail.com>
>
> Reported-by is not needed when it's identical to SOB tag.
>
>
> > Signed-off-by: Yiming Qian <yimingqian591@gmail.com>
> > Signed-off-by: Can Liu <lingchen5202005@gmail.com>
> > ---
> >  net/unix/af_unix.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > index f7a9d55eee8a1..f2cd0f8ec0914 100644
> > --- a/net/unix/af_unix.c
> > +++ b/net/unix/af_unix.c
> > @@ -2458,6 +2458,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
> >                                 goto out_free;
> >
> >                         size = err;
> > +                       skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> >                         refcount_add(size, &sk->sk_wmem_alloc);
> >                 } else {
> >                         skb_put(skb, size - data_len);
> > --
> > 2.34.1

^ permalink raw reply

* Re: [PATCH] af_unix: mark MSG_SPLICE_PAGES frags shared
From: Kuniyuki Iwashima @ 2026-07-02  2:18 UTC (permalink / raw)
  To: 钱一铭
  Cc: security, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev, linux-kernel, lingchen5202005
In-Reply-To: <CAL_bE8JU8fpwracSRC5ziiT=YMW-XRkYNsUEGGuoSOv0gYZcwQ@mail.gmail.com>

On Wed, Jul 1, 2026 at 7:05 PM 钱一铭 <yimingqian591@gmail.com> wrote:
>
> Thanks, I agree plain AF_UNIX does not by itself provide the writer side.
>
>   The concern was that MSG_SPLICE_PAGES imports externally owned pages into
>   skb frags, while unlike TCP/UDP/KCM this path does not mark them with
>   SKBFL_SHARED_FRAG. I checked the AF_UNIX sockmap path as well, but I do
>   not currently have a concrete in-tree chain where those frags reach a
>   writer that skips COW based on skb_has_shared_frag().
>
>   So this should be treated as a defensive consistency cleanup rather than
>   a security fix.

Hmm, if it's not exploitable, let's not add that.  It's rather confusing
to future readers.


> I will drop the Fixes tag and the duplicate Reported-by
>   tags in v2.
>
> Kuniyuki Iwashima <kuniyu@google.com> 于2026年6月30日周二 23:51写道:
> >
> > On Tue, Jun 30, 2026 at 12:06 AM Yiming Qian <yimingqian591@gmail.com> wrote:
> > >
> > > unix_stream_sendmsg() splices external pages directly into skb frags when
> > > MSG_SPLICE_PAGES is set, but it does not propagate SKBFL_SHARED_FRAG
> > > afterward.
> >
> > I think it doesn't matter with the plain AF_UNIX.
> >
> > Please elaborate on the scenario where this could be a problem.
> > e.g. sockmap ?
> >
> >
> > >
> > > That leaves later writers without the shared-frag marker even though the
> > > skb still references externally owned pages.
> > >
> > > Set the marker after a successful skb_splice_from_iter() call.
> > >
> > > Fixes: a0dbf5f818f90 ("af_unix: Support MSG_SPLICE_PAGES")
> > > Reported-by: Yiming Qian <yimingqian591@gmail.com>
> > > Reported-by: Can Liu <lingchen5202005@gmail.com>
> >
> > Reported-by is not needed when it's identical to SOB tag.
> >
> >
> > > Signed-off-by: Yiming Qian <yimingqian591@gmail.com>
> > > Signed-off-by: Can Liu <lingchen5202005@gmail.com>
> > > ---
> > >  net/unix/af_unix.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > > index f7a9d55eee8a1..f2cd0f8ec0914 100644
> > > --- a/net/unix/af_unix.c
> > > +++ b/net/unix/af_unix.c
> > > @@ -2458,6 +2458,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
> > >                                 goto out_free;
> > >
> > >                         size = err;
> > > +                       skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
> > >                         refcount_add(size, &sk->sk_wmem_alloc);
> > >                 } else {
> > >                         skb_put(skb, size - data_len);
> > > --
> > > 2.34.1

^ permalink raw reply

* RE: [PATCH net-next v9 5/5] net: wangxun: add pcie error handler
From: Jiawen Wu @ 2026-07-02  2:35 UTC (permalink / raw)
  To: 'Breno Leitao'
  Cc: netdev, 'Mengyuan Lou', 'Andrew Lunn',
	'David S. Miller', 'Eric Dumazet',
	'Jakub Kicinski', 'Paolo Abeni',
	'Richard Cochran', 'Russell King',
	'Aleksandr Loktionov', 'Jacob Keller',
	'Michal Swiatkowski', 'Simon Horman',
	'Kees Cook', 'Larysa Zaremba',
	'Greg Kroah-Hartman', 'Thomas Gleixner',
	'Rongguang Wei',
	'Uwe Kleine-König (The Capable Hub)',
	'Fabio Baltieri'
In-Reply-To: <akTvTF7n9P1lrqCN@gmail.com>

On Wed, Jul 1, 2026 6:45 PM, Breno Leitao wrote:
> On Wed, Jul 01, 2026 at 03:23:57PM +0800, Jiawen Wu wrote:
> > +static pci_ers_result_t wx_io_slot_reset(struct pci_dev *pdev)
> > +{
> > +	struct wx *wx = pci_get_drvdata(pdev);
> > +	pci_ers_result_t result;
> > +
> > +	if (pci_enable_device_mem(pdev)) {
> > +		wx_err(wx, "Cannot re-enable PCI device after reset.\n");
> > +		result = PCI_ERS_RESULT_DISCONNECT;
> > +	} else {
> > +		/* make all memory operations done before clearing the flag */
> > +		smp_mb__before_atomic();
> > +		clear_bit(WX_STATE_DISABLED, wx->state);
> > +		clear_bit(WX_FLAG_NEED_PCIE_RECOVERY, wx->flags);
> > +		pci_set_master(pdev);
> > +		pci_restore_state(pdev);
> > +		pci_wake_from_d3(pdev, false);
> > +
> > +		rtnl_lock();
> > +		if (netif_running(wx->netdev) && wx->down_suspend)
> > +			wx->down_suspend(wx);
> > +		if (wx->do_reset)
> > +			wx->do_reset(wx->netdev, false);
> > +		rtnl_unlock();
> > +		result = PCI_ERS_RESULT_RECOVERED;
> > +	}
> > +
> > +	pci_aer_clear_nonfatal_status(pdev);
> 
> After bfcb79fca19d ("PCI/ERR: Run error recovery callbacks for all
> affected devices"), AER errors are always cleared by the PCI core and
> drivers don't need to do it themselves.

Thanks. I'll remove it.


^ permalink raw reply

* Re: [PATCH iproute2-next v2 2/2] devlink: support u64-array values in devlink param show/set
From: Ratheesh Kannoth @ 2026-07-02  2:47 UTC (permalink / raw)
  To: David Ahern
  Cc: stephen, kuba, linux-kernel, netdev, andrew+netdev, edumazet,
	pabeni, jiri
In-Reply-To: <87f24e1e-4167-432b-b73c-0fc0c4b7d532@kernel.org>

On 2026-07-01 at 20:04:56, David Ahern (dsahern@kernel.org) wrote:
> On 6/30/26 8:29 PM, Ratheesh Kannoth wrote:
> > On 2026-06-30 at 20:06:17, David Ahern (dsahern@kernel.org) wrote:
> >> On 6/29/26 7:50 PM, Ratheesh Kannoth wrote:
> >>> diff --git a/devlink/devlink.c b/devlink/devlink.c
> >>> index 9372e92f..3c29601d 100644
> >>> --- a/devlink/devlink.c
> >>> +++ b/devlink/devlink.c
> >>> @@ -3496,13 +3496,115 @@ static const struct param_val_conv param_val_conv[] = {
> >>>  };
> >>>
> >>>  #define PARAM_VAL_CONV_LEN ARRAY_SIZE(param_val_conv)
> >>> +#define DEVLINK_PARAM_MAX_ARRAY_SIZE 32
> >>
> >> Why 32? Is that based on current code?
> > Yes, this aligns with the current kernel-side limits. See:
> > https://lore.kernel.org/all/20260609040453.711932-5-rkannoth@marvell.com/
> >
> >> How does the kernel side handle
> >> the number of parameters? What happens if the kernel sends more than 32
> >> parameters - from a user's perspective, not this code and processing the
> >> output?
> > The kernel strictly validates and restricts the number of parameters. To be safe, this patch
> > adds an explicit bounds check to prevent userspace issues if that threshold is ever crossed.
> >
> > Ideally, since "union devlink_param_value" is omitted from the UAPI, we have to define
> > DEVLINK_PARAM_MAX_ARRAY_SIZE here. Moving the underlying structures to the UAPI in the
> > future would allow us to share a single definition and avoid this hardcoded value in userspace.
>
> iproute2 needs to be backward and forward compatible. As it stands, a
> new kernel can allow more than 32 entries and an older iproute2 will not
> display all of them. That is wrong.
>
> Let's make the limit part of the uapi. If you do not want to do that
> now, then iproute2 code needs to handle a larger size.
ACK.

^ permalink raw reply

* Re: [PATCH net-next 1/2] geneve: convert config to RCU-protected pointer
From: Kuniyuki Iwashima @ 2026-07-02  2:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, netdev, eric.dumazet
In-Reply-To: <20260701120454.3533252-2-edumazet@google.com>

On Wed, Jul 1, 2026 at 5:04 AM Eric Dumazet <edumazet@google.com> wrote:
>
> geneve_changelink() currently updates configuration by copying it over
> the old one using memcpy() under RTNL, forcing data path pause via
> geneve_quiesce() and synchronize_net() to avoid reading torn values.
>
> Convert geneve->cfg to an RCU-protected pointer, allowing lockless
> and safe reads under RCU read lock without synchronization overhead.
>
> Key changes:
> - Introduced geneve_config_alloc/free() helpers for lifecycle.
> - geneve_configure() allocates config and publishes it via RCU.
> - geneve_changelink() performs RCU swap; old config is freed via call_rcu_hurry().
> - Allocates new dst_cache during changelink to prevent pcpu sharing.
> - Removed geneve_quiesce/unquiesce() and synchronize_net() from changelink.
> - Added rcu_barrier() to module exit to wait for pending callbacks.
> - Updated data path to use rcu_dereference().
> - Updated geneve_fill_info() to use rtnl_dereference() for now.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks for this change !
Sashiko always complained about this for every geneve patch :)


[...]
> @@ -1539,28 +1551,36 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
>  static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>         struct geneve_dev *geneve = netdev_priv(dev);
> -       struct ip_tunnel_info *info = NULL;
> +       const struct ip_tunnel_info *info = NULL;
> +       const struct geneve_config *cfg;
>         int err;
>
> -       if (geneve->cfg.collect_md) {
> +       rcu_read_lock();
> +       cfg = rcu_dereference(geneve->cfg);
> +       if (unlikely(!cfg)) {
> +               err = -ENODEV;
> +               goto tx_err;
> +       }

Do we need NULL check for geneve->cfg in the fast paths ?

I think geneve->cfg is cleared only in the error path of geneve_configure()
due to NETDEV_REGISTER notifier.   Although dev is already published
by list_netdevice(), it's not UP and socket is not yet created, so I guess no
one can reach the fast path when geneve->cfg is NULL.

In that sense, the NULL check in the next patch makes sense.

[...]
> @@ -1962,10 +2040,13 @@ static int geneve_configure(struct net *net, struct net_device *dev,
>                 }
>         }
>
> -       dst_cache_reset(&geneve->cfg.info.dst_cache);
> -       memcpy(&geneve->cfg, cfg, sizeof(*cfg));
> +       new_cfg = geneve_config_alloc(cfg);
> +       if (IS_ERR(new_cfg))
> +               return PTR_ERR(new_cfg);
>
> -       if (geneve->cfg.inner_proto_inherit) {
> +       rcu_assign_pointer(geneve->cfg, new_cfg);
> +
> +       if (cfg->inner_proto_inherit) {
>                 dev->header_ops = NULL;
>                 dev->type = ARPHRD_NONE;
>                 dev->hard_header_len = 0;
> @@ -1975,10 +2056,15 @@ static int geneve_configure(struct net *net, struct net_device *dev,
>
>         err = register_netdevice(dev);
>         if (err)
> -               return err;
> +               goto err_free_cfg;
>
>         list_add(&geneve->next, &gn->geneve_list);
>         return 0;
> +
> +err_free_cfg:
> +       geneve_config_free(new_cfg);
> +       RCU_INIT_POINTER(geneve->cfg, NULL);

I think we need to clear geneve->cfg before calling geneve_config_free(),
in case the call_rcu() callback runs before RCU_INIT_POINTER().
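
To illustrate the ordering I mean, here is an untested userspace sketch, not
kernel code.  It assumes geneve_config_free() defers the free via call_rcu(),
which is not shown in the quoted hunk.  The model_* helpers, the struct
layouts and the freed_while_published counter are hypothetical stand-ins, and
C11 atomics replace rcu_assign_pointer() / RCU_INIT_POINTER():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins, not the kernel definitions. */
struct geneve_config {
	int inner_proto_inherit;
};

struct geneve_dev {
	_Atomic(struct geneve_config *) cfg;
};

/* Counts frees issued while the object was still reachable via geneve->cfg. */
static int freed_while_published;

/* Models geneve_config_free(); the real one is assumed to use call_rcu(). */
static void model_config_free(struct geneve_dev *geneve,
			      struct geneve_config *cfg)
{
	assert(cfg != NULL);
	if (atomic_load(&geneve->cfg) == cfg)
		freed_while_published++;
	free(cfg);
}

/* Models the tail of geneve_configure() when registration fails with err. */
static int model_configure_fail(struct geneve_dev *geneve, int err)
{
	struct geneve_config *new_cfg = calloc(1, sizeof(*new_cfg));

	if (!new_cfg)
		return -1;

	/* Stands in for rcu_assign_pointer(geneve->cfg, new_cfg). */
	atomic_store_explicit(&geneve->cfg, new_cfg, memory_order_release);

	/* register_netdevice() is assumed to have failed at this point. */

	/* Stands in for RCU_INIT_POINTER(geneve->cfg, NULL): unpublish first. */
	atomic_store_explicit(&geneve->cfg, (struct geneve_config *)NULL,
			      memory_order_relaxed);
	/* Only then hand the object to the free path. */
	model_config_free(geneve, new_cfg);
	return err;
}
```

With the two steps in this order, the free path never sees an object that is
still published, so freed_while_published stays at zero in the model.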

^ permalink raw reply

* Re: [PATCH net-next 2/2] geneve: make geneve_fill_info() RTNL independent
From: Kuniyuki Iwashima @ 2026-07-02  2:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, netdev, eric.dumazet
In-Reply-To: <20260701120454.3533252-3-edumazet@google.com>

On Wed, Jul 1, 2026 at 5:05 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Now that geneve->cfg is an RCU-protected pointer, we can update
> geneve_fill_info() to read the configuration under RCU read lock
> instead of relying on RTNL.
>
> Also add const qualifiers to the dereferenced pointers where appropriate.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

^ permalink raw reply

* Re: [PATCH net] amt: fix size calculation in amt_get_size()
From: Kuniyuki Iwashima @ 2026-07-02  2:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Andrew Lunn, netdev, eric.dumazet
In-Reply-To: <20260701122329.3562825-1-edumazet@google.com>

On Wed, Jul 1, 2026 at 5:23 AM Eric Dumazet <edumazet@google.com> wrote:
>
> amt_get_size() incorrectly used sizeof(struct iphdr) for the sizes of
> IFLA_AMT_DISCOVERY_IP, IFLA_AMT_REMOTE_IP, and IFLA_AMT_LOCAL_IP.
> These attributes contain IPv4 addresses (__be32), not full IP headers.
>
> Replace sizeof(struct iphdr) with sizeof(__be32) to avoid over-allocating
> netlink message space.
>
> Fixes: b9022b53adad ("amt: add control plane of amt interface")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
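
For reference, a rough userspace sketch of the arithmetic.  It assumes
amt_get_size() sums nla_total_size() per attribute, which is not quoted in
this mail.  The model_* helpers are hypothetical and only mirror the usual
netlink attribute layout (4-byte header, 4-byte alignment); 20 bytes is the
size of an IPv4 header without options:

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_NLA_HDRLEN 4	/* attribute header: u16 len + u16 type */
#define MODEL_IPHDR_LEN  20	/* IPv4 header without options */

/* Round len up to the 4-byte netlink attribute alignment. */
static size_t model_nla_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Mirrors the shape of nla_total_size(): aligned header plus payload. */
static size_t model_nla_total_size(size_t payload)
{
	return model_nla_align(MODEL_NLA_HDRLEN + payload);
}

/* Bytes saved across the three IPv4 address attributes. */
static size_t model_saved_bytes(void)
{
	size_t before = 3 * model_nla_total_size(MODEL_IPHDR_LEN);
	size_t after = 3 * model_nla_total_size(sizeof(uint32_t));

	return before - after;
}
```

Under these assumptions each attribute shrinks from 24 to 8 bytes, so the
estimate drops by 48 bytes for the three attributes.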

^ permalink raw reply

