Netdev List


* [PATCH net v3 3/3] tcp: Decrement tcp_md5_needed static branch
From: Dmitry Safonov via B4 Relay @ 2026-06-25 18:21 UTC
  To: David Ahern, Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Salam Noureddine
  Cc: Michael Bommarito, Qihang, netdev, linux-kernel, Dmitry Safonov,
	stable
In-Reply-To: <20260625-tcp-md5-connect-v3-0-1fd313d6c1e0@gmail.com>

From: Dmitry Safonov <0x7f454c46@gmail.com>

When an unwanted TCP-MD5 key is freed early on TCP-AO connect(),
md5sig_info is freed right away (and set to NULL). Later, at socket
destruction, the static branch counter is therefore never decremented.

Add the missing decrement for the TCP-MD5 static branch.

Reported-by: Qihang <q.h.hack.winter@gmail.com>
Fixes: 0aadc73995d0 ("net/tcp: Prevent TCP-MD5 with TCP-AO being set")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bc03809ca3af..d7c1444b5e30 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4334,8 +4334,8 @@ int tcp_connect(struct sock *sk)
 			tcp_clear_md5_list(sk);
 			md5sig = rcu_replace_pointer(tp->md5sig_info, NULL,
 						     lockdep_sock_is_held(sk));
-			if (md5sig)
-				kfree_rcu(md5sig, rcu);
+			kfree_rcu(md5sig, rcu);
+			static_branch_slow_dec_deferred(&tcp_md5_needed);
 		}
 	}
 #endif

-- 
2.51.2
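
For illustration only, the accounting problem described in this patch
can be modelled in userspace. This is a sketch, not kernel code:
md5_needed, struct toy_sock and the toy_* helpers are hypothetical
stand-ins for the static branch counter, the socket and the
connect()/destructor paths. It only shows why dropping the info pointer
early, without a matching decrement, leaves the counter elevated once
the destructor has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the tcp_md5_needed static branch counter. */
static int md5_needed;

struct toy_md5_info { int nkeys; };
struct toy_sock { struct toy_md5_info *md5sig_info; };

/* First key on a socket: allocate the info and take a counter reference. */
static bool toy_add_key(struct toy_sock *sk)
{
	if (!sk->md5sig_info) {
		sk->md5sig_info = calloc(1, sizeof(*sk->md5sig_info));
		if (!sk->md5sig_info)
			return false;
		md5_needed++;
	}
	sk->md5sig_info->nkeys++;
	return true;
}

/* Early teardown at connect(); with_dec selects fixed vs. unfixed behaviour. */
static void toy_connect_drop_md5(struct toy_sock *sk, bool with_dec)
{
	if (!sk->md5sig_info)
		return;
	free(sk->md5sig_info);
	sk->md5sig_info = NULL;
	if (with_dec)
		md5_needed--;
}

/* Destructor: drops the reference only if the info is still attached. */
static void toy_destruct(struct toy_sock *sk)
{
	if (sk->md5sig_info) {
		free(sk->md5sig_info);
		sk->md5sig_info = NULL;
		md5_needed--;
	}
}
```

Calling toy_add_key(), toy_connect_drop_md5(sk, false) and then
toy_destruct() leaves md5_needed at 1 in this model; passing true
brings it back to 0.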




* [PATCH net v3 1/3] tcp: restore RCU grace period in tcp_ao_destroy_sock
From: Dmitry Safonov via B4 Relay @ 2026-06-25 18:21 UTC
  To: David Ahern, Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Salam Noureddine
  Cc: Michael Bommarito, Qihang, netdev, linux-kernel, Dmitry Safonov,
	stable, Dmitry Safonov
In-Reply-To: <20260625-tcp-md5-connect-v3-0-1fd313d6c1e0@gmail.com>

From: Michael Bommarito <michael.bommarito@gmail.com>

Commit 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
removed the call_rcu() callback from tcp_ao_destroy_sock(), arguing that
"the destruction of info/keys is delayed until the socket destructor"
and therefore "no one can discover it anymore".

That argument does not hold for the call site in tcp_connect()
(net/ipv4/tcp_output.c:4327-4332). At that point the socket is in
TCP_SYN_SENT, has already been inserted into the inet ehash by
inet_hash_connect() in tcp_v4_connect(), and is therefore very much
discoverable: any softirq running tcp_v4_rcv() on another CPU can take
the socket out of the ehash, walk into tcp_inbound_hash(), and load
tp->ao_info via implicit RCU before bh_lock_sock_nested() is taken on
the destroying CPU.

The reader path then enters __tcp_ao_do_lookup() (net/ipv4/tcp_ao.c:208)
which re-loads tp->ao_info via rcu_dereference_check(); the re-load can
still observe the (about-to-be-freed) pointer because there is no
synchronize_rcu() between rcu_assign_pointer(tp->ao_info, NULL) and
tcp_ao_info_free() in tcp_ao_destroy_sock(). The captured pointer is
then walked at line 223:

	hlist_for_each_entry_rcu(key, &ao->head, node, ...)

The writer's synchronous kfree() is free to complete between the line
218 re-fetch and the line 223 hlist iteration. The slab is reused
(or simply LIST_POISON1-stamped if not yet reused) and the iteration
walks attacker-controlled or poison memory in softirq context.

Reproducer (no debug shim, stock x86_64 v7.1-rc2 SMP+KASAN, QEMU+KVM):
an unprivileged uid=1000 process inside CLONE_NEWUSER|CLONE_NEWNET
installs TCP_MD5SIG + TCP_AO_ADD_KEY on a TCP socket, sprays forged
TCP-AO segments toward its eventual 4-tuple via raw sockets, then
calls connect(). The md5-wins reconciliation in tcp_connect() fires
tcp_ao_destroy_sock(); the softirq backlog reader on the loopback
NAPI path crashes on the freed ao->head.first walk:

  Oops: general protection fault, probably for non-canonical
    address 0xfbd59c000000002f
  KASAN: maybe wild-memory-access in range
    [0xdead000000000178-0xdead00000000017f]
  CPU: 0 UID: 1000 PID: 100 Comm: repro_userns
  RIP: 0010:__tcp_ao_do_lookup+0x107/0x1c0
  Call Trace: <IRQ>
    __tcp_ao_do_lookup+0x107/0x1c0
    tcp_ao_inbound_lookup.constprop.0+0x12a/0x200
    tcp_inbound_ao_hash+0x5ea/0x1520
    tcp_inbound_hash+0x7ce/0x1240
    tcp_v4_rcv+0x1e7a/0x3e10
    ...

Restore the RCU grace period: re-add struct rcu_head to tcp_ao_info
and replace the synchronous tcp_ao_info_free() with a call_rcu()
callback. Readers that captured tp->ao_info before rcu_assign_pointer
NULLed it now see the object remain valid until rcu_read_unlock().
With the patch applied the reproducer runs cleanly for 2000 iterations
on the same kernel build.

Fixes: 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
Cc: stable@vger.kernel.org # v6.18+
Reviewed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
 include/net/tcp_ao.h | 1 +
 net/ipv4/tcp_ao.c    | 5 +++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp_ao.h b/include/net/tcp_ao.h
index 29fd7b735afa..9a2333e62e99 100644
--- a/include/net/tcp_ao.h
+++ b/include/net/tcp_ao.h
@@ -145,6 +145,7 @@ struct tcp_ao_info {
 	u32			snd_sne;
 	u32			rcv_sne;
 	refcount_t		refcnt;		/* Protects twsk destruction */
+	struct rcu_head		rcu;
 };

 #ifdef CONFIG_TCP_MD5SIG
diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index a56bb79e15e0..e4ec60a33496 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -371,8 +371,9 @@ static void tcp_ao_key_free_rcu(struct rcu_head *head)
 	kfree_sensitive(key);
 }

-static void tcp_ao_info_free(struct tcp_ao_info *ao)
+static void tcp_ao_info_free_rcu(struct rcu_head *head)
 {
+	struct tcp_ao_info *ao = container_of(head, struct tcp_ao_info, rcu);
 	struct tcp_ao_key *key;
 	struct hlist_node *n;

@@ -411,7 +412,7 @@ void tcp_ao_destroy_sock(struct sock *sk, bool twsk)

 	if (!twsk)
 		tcp_ao_sk_omem_free(sk, ao);
-	tcp_ao_info_free(ao);
+	call_rcu(&ao->rcu, tcp_ao_info_free_rcu);
 }

 void tcp_ao_time_wait(struct tcp_timewait_sock *tcptw, struct tcp_sock *tp)

-- 
2.51.2
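
As a rough userspace illustration of the deferred-free pattern this
patch restores, the sketch below queues a free callback and only runs
it once no reader is active. Everything here is hypothetical (struct
toy_head, toy_call_rcu(), toy_grace_period() and so on are not kernel
APIs), it is single-threaded, and it simplifies a grace period to "no
reader is active at all", whereas real RCU only waits for readers that
started before the callback was queued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

struct toy_ao_info {
	int nkeys;
	struct toy_head rcu;	/* embedded, in the spirit of struct rcu_head */
};

static struct toy_head *pending;	/* callbacks waiting for a grace period */
static int active_readers;		/* toy read-side section counter */
static int freed;			/* number of objects actually freed */

static void toy_read_lock(void)   { active_readers++; }
static void toy_read_unlock(void) { active_readers--; }

/* Queue a callback instead of freeing immediately. */
static void toy_call_rcu(struct toy_head *head,
			 void (*func)(struct toy_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Run queued callbacks only when no reader is inside a read-side section.
 * Returns the number of callbacks that ran. */
static int toy_grace_period(void)
{
	int ran = 0;

	if (active_readers)
		return 0;
	while (pending) {
		struct toy_head *head = pending;

		pending = head->next;
		head->func(head);
		ran++;
	}
	return ran;
}

static void toy_ao_info_free_rcu(struct toy_head *head)
{
	/* Recover the enclosing object from the embedded head. */
	struct toy_ao_info *ao = (struct toy_ao_info *)
		((char *)head - offsetof(struct toy_ao_info, rcu));

	free(ao);
	freed++;
}
```

A reader that captured the pointer before it was unpublished can keep
using the object until it calls toy_read_unlock(); only a later
toy_grace_period() actually frees it.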


* [PATCH net v3 2/3] tcp: defer md5sig_info kfree past RCU grace period in tcp_connect
From: Dmitry Safonov via B4 Relay @ 2026-06-25 18:21 UTC
  To: David Ahern, Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Salam Noureddine
  Cc: Michael Bommarito, Qihang, netdev, linux-kernel, Dmitry Safonov,
	stable, Dmitry Safonov
In-Reply-To: <20260625-tcp-md5-connect-v3-0-1fd313d6c1e0@gmail.com>

From: Michael Bommarito <michael.bommarito@gmail.com>

The md5+ao reconciliation in tcp_connect() (net/ipv4/tcp_output.c)
has two symmetric branches:

	if (needs_md5) {
		tcp_ao_destroy_sock(sk, false);
	} else if (needs_ao) {
		tcp_clear_md5_list(sk);
		kfree(rcu_replace_pointer(tp->md5sig_info, NULL, ...));
	}

Both branches free a per-socket auth-info object while the socket is
in TCP_SYN_SENT and is already on the inet ehash (inserted by
inet_hash_connect() in tcp_v4_connect()). Both branches are reachable
by softirq RX-path readers that load the corresponding info pointer
via implicit RCU before bh_lock_sock_nested() is taken.

The needs_md5 branch is fixed in the prior patch by re-introducing
the call_rcu() free in tcp_ao_destroy_sock(): the equivalent per-key
loop runs inside tcp_ao_info_free_rcu(), the RCU callback, so by the
time it frees each tcp_ao_key all softirq readers that captured the
container have already completed rcu_read_unlock().

The needs_ao branch is not symmetric in the same way. The container
free can be deferred via kfree_rcu(md5sig, rcu) -- struct
tcp_md5sig_info already has the required rcu member
(include/net/tcp.h:1999-2002), and the rest of the tree already does
this in the tcp_md5sig_info_add() rollback paths
(net/ipv4/tcp_ipv4.c:1410, 1436). But the per-key teardown is done
by tcp_clear_md5_list() in process context BEFORE the container's
RCU grace period: it walks &md5sig->head and frees each
tcp_md5sig_key with bare hlist_del + kfree. A concurrent softirq
reader in __tcp_md5_do_lookup() / __tcp_md5_do_lookup_exact()
(tcp_ipv4.c:1253, 1298) walks the same list via
hlist_for_each_entry_rcu() and races with that bare kfree on the
keys themselves -- a per-key slab use-after-free of the same class
as the TCP-AO bug, on the same race window.

Fix this in two halves:

  1. Convert the bare kfree() in tcp_connect() to kfree_rcu() so the
     md5sig_info container joins the rest of the md5sig lifecycle.
     The local-variable lift is mechanical and required because
     kfree_rcu() is a macro that expects an lvalue.

  2. Make tcp_clear_md5_list() RCU-safe by replacing hlist_del +
     kfree(key) with hlist_del_rcu + kfree_rcu(key, rcu). struct
     tcp_md5sig_key already carries the rcu member
     (include/net/tcp.h:1995) and tcp_md5_do_del()
     (net/ipv4/tcp_ipv4.c:1456) already uses kfree_rcu, so this
     restores the lifecycle invariant the rest of the file follows
     rather than introducing a one-off.

The other caller of tcp_clear_md5_list() is tcp_md5_destruct_sock()
(net/ipv4/tcp.c:412), which runs from the sock destructor when the
socket is already unhashed and unreachable; the extra grace period
there is unnecessary but harmless. Making the helper unconditionally
RCU-safe is the cleaner contract.

The needs_ao branch is not reachable by the userns reproducer used
to demonstrate the AO-side splat (the repro installs both keys but
ends up in the needs_md5 branch because the connect peer matches
the MD5 key, not the AO key); however the symmetric race exists
and a maintainer touching this code should not have to think about
which branch escapes RCU and which one does not.

Fixes: 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
Cc: stable@vger.kernel.org # v6.18+
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Reviewed-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
[also credits to Qihang, who found that this races with tcp-diag]
Reported-by: Qihang <q.h.hack.winter@gmail.com>
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
 net/ipv4/tcp_ipv4.c   | 4 ++--
 net/ipv4/tcp_output.c | 8 ++++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ec09f97cc9e6..209ef7522508 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1467,9 +1467,9 @@ void tcp_clear_md5_list(struct sock *sk)
 	md5sig = rcu_dereference_protected(tp->md5sig_info, 1);

 	hlist_for_each_entry_safe(key, n, &md5sig->head, node) {
-		hlist_del(&key->node);
+		hlist_del_rcu(&key->node);
 		atomic_sub(sizeof(*key), &sk->sk_omem_alloc);
-		kfree(key);
+		kfree_rcu(key, rcu);
 	}
 }

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 00ec4b5900f2..bc03809ca3af 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4329,9 +4329,13 @@ int tcp_connect(struct sock *sk)
 		if (needs_md5) {
 			tcp_ao_destroy_sock(sk, false);
 		} else if (needs_ao) {
+			struct tcp_md5sig_info *md5sig;
+
 			tcp_clear_md5_list(sk);
-			kfree(rcu_replace_pointer(tp->md5sig_info, NULL,
-						  lockdep_sock_is_held(sk)));
+			md5sig = rcu_replace_pointer(tp->md5sig_info, NULL,
+						     lockdep_sock_is_held(sk));
+			if (md5sig)
+				kfree_rcu(md5sig, rcu);
 		}
 	}
 #endif

-- 
2.51.2
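
To illustrate why the unlink step matters independently of the deferred
free, here is a minimal userspace sketch, assuming a singly linked list
and a sentinel in place of a poison value. struct toy_node, toy_del()
and toy_del_rcu() are hypothetical names, not the kernel hlist API. The
point it tries to show is that a plain removal overwrites the removed
node's forward pointer, while an RCU-style removal leaves it intact so
a reader already standing on that node can still reach the rest of the
list. The node's memory must additionally stay allocated until readers
are done, which is what kfree_rcu() provides.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *next;
	int id;
};

/* Sentinel playing the role of a poison value; never dereferenced here. */
static struct toy_node toy_poison;

/* Unlink n from the list starting at *head, if it is present. */
static void toy_unlink(struct toy_node **head, struct toy_node *n)
{
	struct toy_node **pp = head;

	while (*pp && *pp != n)
		pp = &(*pp)->next;
	if (*pp)
		*pp = n->next;
}

/* Plain removal: unlink and poison the forward pointer. */
static void toy_del(struct toy_node **head, struct toy_node *n)
{
	toy_unlink(head, n);
	n->next = &toy_poison;
}

/* RCU-style removal: unlink but keep n->next for readers still on n. */
static void toy_del_rcu(struct toy_node **head, struct toy_node *n)
{
	toy_unlink(head, n);
}
```

With a list a -> b -> c and a reader holding b, toy_del_rcu() leaves
b.next pointing at c, whereas toy_del() leaves it pointing at the
sentinel.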


* [PATCH net v3 0/3] tcp: TCP-AO connect() fixes
From: Dmitry Safonov via B4 Relay @ 2026-06-25 18:21 UTC
  To: David Ahern, Eric Dumazet, Neal Cardwell, Kuniyuki Iwashima,
	David S. Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Salam Noureddine
  Cc: Michael Bommarito, Qihang, netdev, linux-kernel, Dmitry Safonov,
	stable, Dmitry Safonov

Resending v3.

I've added credits to Qihang on patch 2, and a third patch/fix
for static key decrement.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
Dmitry Safonov (1):
      tcp: Decrement tcp_md5_needed static branch

Michael Bommarito (2):
      tcp: restore RCU grace period in tcp_ao_destroy_sock
      tcp: defer md5sig_info kfree past RCU grace period in tcp_connect

 include/net/tcp_ao.h  | 1 +
 net/ipv4/tcp_ao.c     | 5 +++--
 net/ipv4/tcp_ipv4.c   | 4 ++--
 net/ipv4/tcp_output.c | 8 ++++++--
 4 files changed, 12 insertions(+), 6 deletions(-)
---
base-commit: 02f144fbb4c86c360495d33debe307cb46a57f95
change-id: 20260625-tcp-md5-connect-dc2369d7f414

Best regards,
--  
Dmitry Safonov <0x7f454c46@gmail.com>




* [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
From: Nabil S. Alramli @ 2026-06-25 17:40 UTC
  To: saeedm, tariqt, mbloch, dtatulea
  Cc: dev, nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel

Hello mlx5 experts,

We have been experiencing frequent WARNINGs in the mlx5 driver on frag page
release and we think it could possibly be caused by a bug in mlx5. Could
you please review the attached patch and provide us your guidance on
whether or not our investigation and assumptions are valid, and if so,
would it be possible to incorporate this fix into your next release?

Best Regards,

Nabil S. Alramli (1):
  net/mlx5: RX, Fix refcount warning on frag page release

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 39 ++++++++++---------
 3 files changed, 22 insertions(+), 21 deletions(-)

-- 
2.43.0


* [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
From: Nabil S. Alramli @ 2026-06-25 17:40 UTC
  To: saeedm, tariqt, mbloch, dtatulea
  Cc: dev, nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel
In-Reply-To: <20260625174059.2879717-1-dev@nalramli.com>

Under memory pressure, the mlx5 driver triggers a WARNING during
fragmented page release. This happens because there is a discrepancy
between what mlx5 thinks the page fragment counter is and what the
page_pool actually says it is.

The cause of the issue is page allocations on concurrent CPUs, which
increment the non-atomic u16 page counter mlx5e_frag_page.frags, while at
the same time the page reference counter net_iov.pp_ref_count is atomically
incremented. That sometimes leads to a difference in the counts and
therefore triggers the warning in page_pool_unref_netmem:

```
	ret = atomic_long_sub_return(nr, pp_ref_count);
	WARN_ON(ret < 0);
```

The actual stack trace looks like this:

```
WARNING: CPU: 37 PID: 447795 at include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
Hardware name: *
RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
RSP: 0018:ffffc90019814d98 EFLAGS: 00010293
RAX: 000000000000003f RBX: ffff88c0993d0a10 RCX: ffffea02424592c0
RDX: 0000000000000001 RSI: ffffea02424592c0 RDI: ffff88c090e20000
RBP: 000000000000000a R08: 0000000000001409 R09: 0000000000000006
R10: 0000000000000000 R11: ffff88c095fbc040 R12: 000000000000141f
R13: 0000000000000009 R14: ffff88c090e20000 R15: 0000000000000001
FS:  00007f34149fa6c0(0000) GS:ffff89200fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ed0265eb000 CR3: 0000005091cbe000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
 mlx5e_free_rx_wqes+0x7b/0xa0 [mlx5_core]
 mlx5e_post_rx_wqes+0x1ac/0x5a0 [mlx5_core]
 mlx5e_napi_poll+0x5e5/0x6f0 [mlx5_core]
 __napi_poll+0x2b/0x1a0
 net_rx_action+0x30e/0x370
 ? sched_clock+0x9/0x10
 ? sched_clock_cpu+0xf/0x170
 handle_softirqs+0xe2/0x2a0
 common_interrupt+0x85/0xa0
 </IRQ>
 <TASK>
 asm_common_interrupt+0x26/0x40
RIP: 0010:page_counter_uncharge+0x34/0x90
RSP: 0018:ffffc900e728bb00 EFLAGS: 00000213
RAX: ffff88aff4762000 RBX: ffff88aff4762100 RCX: 0000000000000304
RDX: 0000000000000001 RSI: 00000000004e9e1a RDI: ffff88aff4762100
RBP: 0000000000000001 R08: ffff891ea0560048 R09: 00007ffffffff000
R10: 0000000000001000 R11: ffff891ae8061b00 R12: ffffffffffffffff
R13: ffff89107fcfd4c0 R14: ffff891ae8061b00 R15: ffff892002fe1400
 uncharge_batch+0x40/0xd0
```

The fix is to use an atomic page fragment counter, so it will always match
the number of references held in the page_pool.

Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Fixes: 6f5742846053 ("net/mlx5e: RX, Enable skb page recycling through the page_pool")
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 39 ++++++++++---------
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2270e2e550dd..c164106eb85d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -568,7 +568,7 @@ struct mlx5e_icosq {
 
 struct mlx5e_frag_page {
 	netmem_ref netmem;
-	u16 frags;
+	atomic_long_t frags;
 };
 
 enum mlx5e_wqe_frag_flag {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a46870c4b74..571a0df9f604 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -400,7 +400,7 @@ static int mlx5e_rq_alloc_mpwqe_linear_info(struct mlx5e_rq *rq, int node,
 	rq->mpwqe.linear_info = li;
 
 	/* Set to max to force allocation on first run. */
-	li->frag_page.frags = li->max_frags;
+	atomic_long_set(&li->frag_page.frags, li->max_frags);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b60aa47c75b..ee360fa0c316 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -284,7 +284,7 @@ static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
 
 	*frag_page = (struct mlx5e_frag_page) {
 		.netmem	= netmem,
-		.frags	= 0,
+		.frags	= ATOMIC_LONG_INIT(0),
 	};
 
 	return 0;
@@ -293,7 +293,7 @@ static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
 static void mlx5e_page_release_fragmented(struct page_pool *pp,
 					  struct mlx5e_frag_page *frag_page)
 {
-	u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
+	u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - atomic_long_read(&frag_page->frags);
 	netmem_ref netmem = frag_page->netmem;
 
 	if (page_pool_unref_netmem(netmem, drain_count) == 0)
@@ -304,7 +304,7 @@ static int mlx5e_mpwqe_linear_page_refill(struct mlx5e_rq *rq)
 {
 	struct mlx5e_mpw_linear_info *li = rq->mpwqe.linear_info;
 
-	if (likely(li->frag_page.frags < li->max_frags))
+	if (likely(atomic_long_read(&li->frag_page.frags) < li->max_frags))
 		return 0;
 
 	if (likely(li->frag_page.netmem)) {
@@ -323,7 +323,8 @@ static void *mlx5e_mpwqe_get_linear_page_frag(struct mlx5e_rq *rq)
 	if (unlikely(mlx5e_mpwqe_linear_page_refill(rq)))
 		return NULL;
 
-	frag_offset = li->frag_page.frags << MLX5E_XDP_LOG_MAX_LINEAR_SZ;
+	frag_offset = atomic_long_read(&li->frag_page.frags) <<
+		      MLX5E_XDP_LOG_MAX_LINEAR_SZ;
 	WARN_ON(frag_offset >= BIT(rq->mpwqe.page_shift));
 
 	return netmem_address(li->frag_page.netmem) + frag_offset;
@@ -568,7 +569,7 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
 		return;
 	}
 
-	frag_page->frags++;
+	atomic_long_inc(&frag_page->frags);
 	skb_add_rx_frag_netmem(skb, next_frag, netmem,
 			       frag_offset, len, truesize);
 }
@@ -744,7 +745,7 @@ void mlx5e_mpwqe_dealloc_linear_page(struct mlx5e_rq *rq)
 	 * things in a good state for re-allocation.
 	 */
 	li->frag_page.netmem = 0;
-	li->frag_page.frags = li->max_frags;
+	atomic_long_set(&li->frag_page.frags, li->max_frags);
 }
 
 INDIRECT_CALLABLE_SCOPE bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
@@ -1615,7 +1616,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	/* queue up for recycling/reuse */
 	skb_mark_for_recycle(skb);
-	frag_page->frags++;
+	atomic_long_inc(&frag_page->frags);
 
 	return skb;
 }
@@ -1683,7 +1684,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 				struct mlx5e_wqe_frag_info *pwi;
 
 				for (pwi = head_wi; pwi < wi; pwi++)
-					pwi->frag_page->frags++;
+					atomic_long_inc(&pwi->frag_page->frags);
 			}
 			return NULL; /* page/packet was consumed by XDP */
 		}
@@ -1702,7 +1703,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL;
 
 	skb_mark_for_recycle(skb);
-	head_wi->frag_page->frags++;
+	atomic_long_inc(&head_wi->frag_page->frags);
 
 	if (xdp_buff_has_frags(&mxbuf->xdp)) {
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
@@ -1711,7 +1712,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 					  xdp_buff_get_skb_flags(&mxbuf->xdp));
 
 		for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
-			pwi->frag_page->frags++;
+			atomic_long_inc(&pwi->frag_page->frags);
 	}
 
 	return skb;
@@ -1760,7 +1761,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
-			wi->frag_page->frags++;
+			atomic_long_inc(&wi->frag_page->frags);
 		goto wq_cyc_pop;
 	}
 
@@ -1808,7 +1809,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
-			wi->frag_page->frags++;
+			atomic_long_inc(&wi->frag_page->frags);
 		goto wq_cyc_pop;
 	}
 
@@ -2011,9 +2012,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 				struct mlx5e_frag_page *pfp;
 
 				for (pfp = head_page; pfp < frag_page; pfp++)
-					pfp->frags++;
+					atomic_long_inc(&pfp->frags);
 
-				linear_page->frags++;
+				atomic_long_inc(&linear_page->frags);
 			}
 			return NULL; /* page/packet was consumed by XDP */
 		}
@@ -2035,7 +2036,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL;
 
 		skb_mark_for_recycle(skb);
-		linear_page->frags++;
+		atomic_long_inc(&linear_page->frags);
 
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
@@ -2048,7 +2049,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 			pagep = head_page;
 			do
-				pagep->frags++;
+				atomic_long_inc(&pagep->frags);
 			while (++pagep < frag_page);
 
 			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
@@ -2068,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 			pagep = frag_page - sinfo->nr_frags;
 			do
-				pagep->frags++;
+				atomic_long_inc(&pagep->frags);
 			while (++pagep < frag_page);
 		}
 		/* copy header */
@@ -2121,7 +2122,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				 cqe_bcnt, mxbuf);
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
-				frag_page->frags++;
+				atomic_long_inc(&frag_page->frags);
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
@@ -2136,7 +2137,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	/* queue up for recycling/reuse */
 	skb_mark_for_recycle(skb);
-	frag_page->frags++;
+	atomic_long_inc(&frag_page->frags);
 
 	return skb;
 }
-- 
2.43.0
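
The bias accounting behind the warning can be sketched in userspace
with C11 atomics. This is a simplified, single-threaded model under
stated assumptions: struct toy_page, TOY_BIAS_MAX (value chosen
arbitrarily) and the toy_* helpers are hypothetical, and the sketch
assumes the mismatch takes the form of an under-counted frags value. It
only shows how a frags count lower than the number of fragments
actually handed out drives the reference count negative at release
time.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bias; stands in for MLX5E_PAGECNT_BIAS_MAX. */
#define TOY_BIAS_MAX 64L

struct toy_page {
	atomic_long pp_ref_count;	/* references held on the page */
	atomic_long frags;		/* fragments the driver handed out */
};

static void toy_page_init(struct toy_page *p)
{
	atomic_init(&p->pp_ref_count, TOY_BIAS_MAX);
	atomic_init(&p->frags, 0);
}

/* Driver hands one fragment to the stack and records it. */
static void toy_hand_out(struct toy_page *p)
{
	atomic_fetch_add(&p->frags, 1);
}

/* Subtract nr references; returns the remaining count. */
static long toy_unref(struct toy_page *p, long nr)
{
	return atomic_fetch_sub(&p->pp_ref_count, nr) - nr;
}

/* Driver release: drop the part of the bias that was never handed out. */
static long toy_release(struct toy_page *p)
{
	long drain = TOY_BIAS_MAX - atomic_load(&p->frags);

	return toy_unref(p, drain);
}
```

If two fragments are handed out and both are later unreferenced,
toy_release() returns 0. If two fragments are unreferenced but frags
only recorded one, toy_release() returns -1, which corresponds to the
ret < 0 condition quoted above.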



* Re: [PATCH net v2] seg6: validate SRH length before reading fixed fields
From: Andrea Mayer @ 2026-06-25 19:49 UTC
  To: Nuoqi Gui
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, bpf, linux-kernel, Mathieu Xhonneux,
	Daniel Borkmann, David Lebrun, stefano.salsano, Paolo Lungaroni,
	Andrea Mayer
In-Reply-To: <20260623-f01-17-seg6-srh-len-v2-1-2edc40e9e3e1@mails.tsinghua.edu.cn>

On Tue, 23 Jun 2026 18:32:31 +0800
Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> wrote:

> seg6_validate_srh() reads fixed SRH fields such as srh->type and
> srh->hdrlen before checking that the supplied length covers the fixed
> struct ipv6_sr_hdr fields.
> 
> The BPF SEG6 encap path reaches this with a BPF program-supplied pointer
> and length: bpf_lwt_push_encap() and the SEG6 local BPF END_B6 and
> END_B6_ENCAP actions call bpf_push_seg6_encap(), which forwards the
> length to seg6_validate_srh() with no minimum-size guard.  A 2-byte SEG6
> encap header can therefore make the validator read srh->type at offset 2
> beyond the caller-supplied buffer.
> 
> Reject lengths shorter than the fixed SRH at the top of
> seg6_validate_srh(), before any field is read.  This fixes the BPF helper
> path and keeps the common validator robust.
> 
> Fixes: fe94cc290f53 ("bpf: Add IPv6 Segment Routing helpers")
> Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
> ---
> Changes in v2:
> - Narrowed the commit message to the BPF encap callers that can supply a
>   too-short SRH length.
> - Dropped the unnecessary cast in the minimum SRH length check.
> - Link to v1: https://patch.msgid.link/20260620-f01-17-seg6-srh-len-v1-1-36cbb29c12f1@mails.tsinghua.edu.cn  
> 
> To: Andrea Mayer <andrea.mayer@uniroma2.it>
> To: "David S. Miller" <davem@davemloft.net>
> To: Eric Dumazet <edumazet@google.com>
> To: Jakub Kicinski <kuba@kernel.org>
> To: Paolo Abeni <pabeni@redhat.com>
> To: Simon Horman <horms@kernel.org>
> To: Mathieu Xhonneux <m.xhonneux@gmail.com>
> To: Daniel Borkmann <daniel@iogearbox.net>
> To: David Lebrun <dlebrun@google.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: bpf@vger.kernel.org
> ---
>  net/ipv6/seg6.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
> index 1c3ad25700c4c..62a7eb7792026 100644
> --- a/net/ipv6/seg6.c
> +++ b/net/ipv6/seg6.c
> @@ -29,6 +29,9 @@ bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len, bool reduced)
>  	int max_last_entry;
>  	int trailing;
>  
> +	if (len < sizeof(*srh))
> +		return false;
> +

Thanks for the patch.

Looks good to me.

Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>

On a separate note: the AI review message seems correct. The reported
issue is a separate, pre-existing bug in the BPF SEG6 encap path, not
introduced by this patch.

Regards,
Andrea

>  	if (srh->type != IPV6_SRCRT_TYPE_4)
>  		return false;
>  
> 
> ---
> base-commit: 96e7f9122aae0ed000ee321f324b812a447906d9
> change-id: 20260619-f01-17-seg6-srh-len-a85f35427e0b
> 
> Best regards,
> --  
> Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
> 
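
For readers following along, the "check the length before reading any
fixed field" pattern can be sketched in userspace as below. struct
toy_hdr, TOY_TYPE and toy_validate() are hypothetical and only loosely
shaped like a routing header; this is not the kernel's struct
ipv6_sr_hdr or seg6_validate_srh(). The explicit negative-length test
is a choice made for this sketch, since comparing a negative int
against sizeof() would convert it to a large unsigned value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte fixed header. */
struct toy_hdr {
	uint8_t nexthdr;
	uint8_t hdrlen;		/* 8-byte units, not counting the first 8 bytes */
	uint8_t type;
	uint8_t segments_left;
	uint8_t first_segment;
	uint8_t flags;
	uint16_t tag;
};

#define TOY_TYPE 4

static bool toy_validate(const void *buf, int len)
{
	const struct toy_hdr *h = buf;

	/* Reject short (or negative) lengths before touching any field. */
	if (len < 0 || (size_t)len < sizeof(*h))
		return false;

	if (h->type != TOY_TYPE)
		return false;

	/* The length advertised by the header must match the buffer. */
	return ((size_t)h->hdrlen + 1) * 8 == (size_t)len;
}
```

With this ordering a 2-byte buffer is rejected by the first test, so
h->type at offset 2 is never read.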


* Re: [PATCH bpf-next v2 1/4] bpf: Initialize the l3mdev field for the fib lookup flow
From: David Ahern @ 2026-06-25 19:51 UTC
  To: Toke Høiland-Jørgensen, Avinash Duduskar,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
	John Fastabend, Stanislav Fomichev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Shuah Khan,
	Jesper Dangaard Brouer, Mykyta Yatsenko, Leon Hwang, KP Singh,
	Anton Protopopov, Amery Hung, Eyal Birger, Rong Tao, bpf, netdev,
	linux-kselftest, linux-kernel
In-Reply-To: <87bjd9h6yh.fsf@toke.dk>

On 6/17/26 3:06 AM, Toke Høiland-Jørgensen wrote:
>> The helper already initializes the other flow fields the rules path
>> consumes (flowi4_mark, flowi4_tun_key.tun_id, flowi4_uid and the v6
>> counterparts); flowi*_l3mdev was added to that set afterwards and this
>> helper was never updated to match. ip_route_input_slow() likewise zeroes
>> the field before its input lookup. Do the same here.
> 
> So how about we explicitly zero-init the whole struct instead of adding
> more fields ad-hoc like this? Otherwise this seems like something that
> is likely to happen again if we ever add another field to the struct?
> 
> -Toke
> 

+1. Piecemeal init of the flow struct has been a known source of bugs.
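
A minimal userspace sketch of the "zero the whole struct, then fill in
what you have" approach suggested above. struct toy_flow and
toy_flow_init() are hypothetical; the field names are illustrative and
do not mirror struct flowi4 or the BPF helper.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical lookup key; field names are illustrative only. */
struct toy_flow {
	int oif;
	int iif;
	int l3mdev;
	unsigned int mark;
	unsigned int uid;
};

/* Zero everything first, then set only the fields the caller provides. */
static void toy_flow_init(struct toy_flow *fl, int iif, unsigned int mark)
{
	memset(fl, 0, sizeof(*fl));
	fl->iif = iif;
	fl->mark = mark;
}
```

The design point is that a field added to the struct later starts out
as zero without the initializer having to be updated.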


* Re: [GIT PULL] Networking for v7.2-rc1
From: pr-tracker-bot @ 2026-06-25 19:57 UTC
  To: Jakub Kicinski; +Cc: torvalds, kuba, davem, netdev, linux-kernel, pabeni
In-Reply-To: <20260625174511.745883-1-kuba@kernel.org>

The pull request you sent on Thu, 25 Jun 2026 10:45:11 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-7.2-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/805185b7c7a1069e407b6f7b3bc98e44d415f484

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


* Re: [PATCH v29 4/5] sfc: obtain and map cxl range using devm_cxl_probe_mem
From: Dan Williams (nvidia) @ 2026-06-25 20:34 UTC
  To: Alejandro Lucero Palau, Dan Williams (nvidia),
	alejandro.lucero-palau, linux-cxl, netdev, dan.j.williams,
	edward.cree, davem, kuba, pabeni, edumazet, dave.jiang
  Cc: Edward Cree
In-Reply-To: <b0a45e85-f42c-4a52-8223-f8318da10649@amd.com>

Alejandro Lucero Palau wrote:
[..]
> >> +{
> > If you are going to have an explicit efx_cxl_exit() then I would also
> > add an explicit unregistration of the memdev.
> 
> 
> This is necessary for undoing the mmap. Nothing else happens there 
> because it is all relying on devm ...
> 
> 
> I could change the ioremap_wc call to devm_ioremap_wc, but
> 
> 
> > This would also fix the
> > Sashiko report about pci_disable_device() running while the cxl_memdev
> > is still registered. Unfortunately, mixing devm and explicit unwind is
> > always fraught.
> 
> 
> I do not think there is a problem here. The cxl core does not need what 
> a type2 driver can do regarding PCI BAR mappings, or at least it is not 
> the case for sfc.
> 
> Any action through sysfs cxl will go through cxl core and the only thing 
> linked to the type device is the CXL registers which are mapped inside 
> cxl_map_component_regs() and those are managed resources.
> 
> 
> So, I can not see why this change is needed. If it is really necessary, 
> please describe the problem with more detail.
> 
> 
> It looks like you need reasons for delaying this further ...

What? Help with Sashiko reports is an act of malice? I assumed you
wanted help with those so that other maintainers would proceed with
these patches. 

I did do another run through to see if there are any paths that the CXL
core can reach if someone tried to fuzz the CXL ABIs or kernel paths
while SFC is unloading. I think Sashiko is hallucinating a sysfs path to
the BAR mapping given there is no mailbox and the EDAC capabilities are
usually not present on a type-2 device. The RAS path looks valid, but
that may also get lucky that most (all?) of the RAS use cases lock the
device before accessing the registers, so devres_release_all() would
become consistent with pci_disable_device() before any access attempt.
That does not seem like a clean design, but it also does not appear
to be immediately exploitable.

If you believe the patches are ready and the Sashiko reports are
invalid, please do say so, no more comments from me on this set from
this point forward.

> > Let me know if this passes your testing, and I can send it out as a
> > standalone patch. You could also use it to unwind if the ioremap()
> > fails.
> 
> 
> You did not read my comments on v28 ...
> 
> 
> I changed efx_cxl_init to make the driver probe to fail if cxl is 
> supported and enabled but the cxl initialization fails, including 
> ioremap_wc(). What you proposed to do, explicitly undo cxl 
> initialization bits, has the same outcome: device detached from the driver.

Right, I did read that and that motivated the devm_cxl_remove_mem()
helper to undo the memdev creation without unloading the driver. You are
free to ignore that helper.

^ permalink raw reply

* Re: [PATCH net] octeontx2-pf: check DMAC extraction support before filtering
From: Harshitha Ramamurthy @ 2026-06-25 21:28 UTC (permalink / raw)
  To: nshettyj
  Cc: netdev, linux-kernel, sgoutham, gakula, sbhatta, hkelam,
	bbhushan2, andrew+netdev, davem, edumazet, kuba, pabeni, naveenm,
	tduszynski, sumang
In-Reply-To: <20260625172552.258631-1-nshettyj@marvell.com>

On Thu, Jun 25, 2026 at 10:30 AM <nshettyj@marvell.com> wrote:
>
> From: Suman Ghosh <sumang@marvell.com>
>
> Currently, configuring a VF MAC address via the PF (e.g., 'ip link
> set <pf> vf 0 mac <mac>') blindly attempts to install a DMAC-based
> hardware filter. However, the hardware parser profile might not
> support DMAC extraction.
>
> Check if the hardware parsing profile supports DMAC extraction
> before adding the filter. Additionally, emit a warning message
> to inform the operator if the MAC filter installation fails due
> to missing DMAC extraction support.
>
> Fixes: f0c2982aaf98 ("octeontx2-pf: Add support for SR-IOV management functions")
> Signed-off-by: Suman Ghosh <sumang@marvell.com>
> Signed-off-by: Nitin Shetty J <nshettyj@marvell.com>
> ---
>  .../ethernet/marvell/octeontx2/nic/otx2_pf.c  | 34 +++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> index b63df5737ff2..8e4435d9e520 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
> @@ -2546,6 +2546,8 @@ static int otx2_do_set_vf_mac(struct otx2_nic *pf, int vf, const u8 *mac)
>  static int otx2_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
>  {
>         struct otx2_nic *pf = netdev_priv(netdev);
> +       struct npc_get_field_status_req *req;
> +       struct npc_get_field_status_rsp *rsp;
>         struct pci_dev *pdev = pf->pdev;
>         struct otx2_vf_config *config;
>         int ret;
> @@ -2559,6 +2561,38 @@ static int otx2_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
>         if (!is_valid_ether_addr(mac))
>                 return -EINVAL;
>
> +       /* Skip installing the DMAC filter if the hardware parser profile
> +        * does not support DMAC extraction.
> +        */
> +       mutex_lock(&pf->mbox.lock);
> +       req = otx2_mbox_alloc_msg_npc_get_field_status(&pf->mbox);
> +       if (!req) {
> +               mutex_unlock(&pf->mbox.lock);
> +               return -ENOMEM;
> +       }
> +
> +       req->field = NPC_DMAC;
> +       if (otx2_sync_mbox_msg(&pf->mbox)) {
> +               mutex_unlock(&pf->mbox.lock);
> +               return -EINVAL;
> +       }
> +
> +       rsp = (struct npc_get_field_status_rsp *)otx2_mbox_get_rsp
> +              (&pf->mbox.mbox, 0, &req->hdr);
> +       if (IS_ERR(rsp)) {
> +               mutex_unlock(&pf->mbox.lock);
> +               return PTR_ERR(rsp);
> +       }
> +
> +       if (!rsp->enable) {
> +               mutex_unlock(&pf->mbox.lock);
> +               netdev_warn(netdev, "VF %d MAC filter not installed: DMAC extraction not supported by parser profile\n",
> +                           vf);
> +               return 0;

Is the intent to return success here even though the MAC address was
not programmed?

> +       }
> +
> +       mutex_unlock(&pf->mbox.lock);
> +

Why not move all these checks into the otx2_do_set_vf_mac() since that
anyway acquires the pf->mbox.lock? That way you could also fold all
the mutex_unlock() calls introduced in the error paths in this patch
into the existing goto-out in that function.
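
To illustrate the shape I mean, here is a minimal userspace sketch, not
driver code. The lock is modelled with a counter, the three boolean
parameters stand in for the mailbox alloc, the sync, and rsp->enable,
and set_vf_mac_model() is a hypothetical name. The -EOPNOTSUPP is only
a placeholder for whatever return value you settle on for the
unsupported case.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for pf->mbox.lock: non-zero while the lock is held. */
static int lock_depth;

static void model_lock(void)   { lock_depth++; }
static void model_unlock(void) { lock_depth--; }

/* Hypothetical model: every failure jumps to one label that unlocks. */
static int set_vf_mac_model(bool alloc_ok, bool sync_ok, bool dmac_supported)
{
	int ret = 0;

	model_lock();

	if (!alloc_ok) {
		ret = -ENOMEM;
		goto out;
	}

	if (!sync_ok) {
		ret = -EINVAL;
		goto out;
	}

	if (!dmac_supported) {
		/* Placeholder error code, an assumption of this sketch. */
		ret = -EOPNOTSUPP;
		goto out;
	}

	/* The MAC filter would be programmed here, still under the lock. */
out:
	model_unlock();
	return ret;
}
```

With a single exit label, no error path can return with the lock still
held, and the check adds no extra lock/unlock round trip.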

>         config = &pf->vf_configs[vf];
>         ether_addr_copy(config->mac, mac);
>
> --
> 2.48.1
>
>

^ permalink raw reply

* Re: [PATCH iproute2-next] "ip help" wrong output, exit code.
From: Stephen Hemminger @ 2026-06-25 21:34 UTC (permalink / raw)
  To: Dmitri Seletski; +Cc: netdev
In-Reply-To: <65f53987-c992-41b9-9603-9e9a448e469d@gmail.com>

On Thu, 25 Jun 2026 16:54:29 +0100
Dmitri Seletski <drjoms@gmail.com> wrote:

> I am confused.
> 
> Whats the next step here?
> 
> Regards
> 
> Dmitri
> 
> On 6/22/26 18:47, Dmitri Seletski wrote:
> > Hello David,
> >
> >
> > Based on change introduced:
> >
> > Two samples of "ip help" with demonstration of exit code and standard 
> > output are below.
> >
> > This is in line with what expect.
> >
> >
> > dimkosPC~/compiled/iproute2-next #if ./ip/ip help a >>/dev/null  ; 
> > then echo help triggered  ; else echo error code triggered  ;fi  #this 
> > redirects standard output  to /dev/null, so text missing is not error,
> > but standard text
> > help triggered
> >
> > dimkosPC~/compiled/iproute2-next #if ./ip/ip help   ; then echo help 
> > triggered  ; else echo error code triggered  ;fi
> > Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }
> >       ip [ -force ] -batch filename
> > where  OBJECT := { address | addrlabel | fou | help | ila | ioam | 
> > l2tp | link |
> >                   macsec | maddress | monitor | mptcp | mroute | mrule |
> >                   neighbor | neighbour | netconf | netns | nexthop | 
> > ntable |
> >                   ntbl | route | rule | sr | stats | tap | tcpmetrics |
> >                   token | tunnel | tuntap | vrf | xfrm }
> >       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |
> >                    -h[uman-readable] | -iec | -j[son] | -p[retty] |
> >                    -f[amily] { inet | inet6 | mpls | bridge | link } |
> >                    -4 | -6 | -M | -B | -0 |
> >                    -l[oops] { maximum-addr-flush-attempts } | -echo | 
> > -br[ief] |
> >                    -o[neline] | -t[imestamp] | -ts[hort] | -b[atch] 
> > [filename] |
> >                    -rc[vbuf] [size] | -n[etns] name | -N[umeric] | 
> > -a[ll] |
> >                    -c[olor]}
> > help triggered
> >
> > Two samples of command that is broken on purpose.
> >
> > dimkosPC~/compiled/iproute2-next #if ./ip/ip idontexist   ; then echo 
> > help triggered  ; else echo error code triggered  ;fi
> > Object "idontexist" is unknown, try "ip help".
> > error code triggered
> >
> > dimkosPC~/compiled/iproute2-next #if ./ip/ip idontexist  >>/dev/null 
> >  ; then echo help triggered  ; else echo error code triggered  ;fi 
> >  #this redirects standard output  to /dev/null, so text missing is not 
> > error, but standard text
> > Object "idontexist" is unknown, try "ip help".
> > error code triggered
> >
> > This works as expected as per my understanding.
> >
> >
> > Not everything is fixed, but chunk of things fixed is better than non 
> > of it.
> >
> > for example:
> >
> > if ip  add help    ; then echo help triggered  ; else echo error code 
> > triggered  ;fi  #this redirects standard output  to /dev/null, so text 
> > missing is not error, but standard text
> > Usage: ip address {add|change|replace} IFADDR dev IFNAME [ LIFETIME ]
> >                                                      [ CONFFLAG-LIST ]
> >       ip address del IFADDR dev IFNAME [mngtmpaddr]
> >       ip address {save|flush} [ dev IFNAME ] [ scope SCOPE-ID ] [ to 
> > PREFIX ]
> >                            [ FLAG-LIST ] [ label LABEL ] [ { up | down 
> > } ]
> >       ip address [ show [ dev IFNAME ] [ scope SCOPE-ID ] [ master 
> > DEVICE ]
> >                         [ nomaster ]
> >                         [ type TYPE ] [ to PREFIX ] [ FLAG-LIST ]
> >                         [ label LABEL ] [ { up | down } ] [ vrf NAME ]
> >                         [ proto ADDRPROTO ] ]
> >       ip address {showdump|restore}
> > IFADDR := PREFIX | ADDR peer PREFIX
> >          [ broadcast ADDR ] [ anycast ADDR ]
> >          [ label IFNAME ] [ scope SCOPE-ID ] [ metric METRIC ]
> >          [ proto ADDRPROTO ]
> > SCOPE-ID := [ host | link | global | NUMBER ]
> > FLAG-LIST := [ FLAG-LIST ] FLAG
> > FLAG  := [ permanent | dynamic | secondary | primary |
> >           [-]tentative | [-]deprecated | [-]dadfailed | temporary |
> >           CONFFLAG-LIST ]
> > CONFFLAG-LIST := [ CONFFLAG-LIST ] CONFFLAG
> > CONFFLAG  := [ home | nodad | mngtmpaddr | noprefixroute | autojoin ]
> > LIFETIME := [ valid_lft LFT ] [ preferred_lft LFT ]
> > LFT := forever | SECONDS
> > ADDRPROTO := [ NAME | NUMBER ]
> > TYPE := { amt | bareudp | bond | bond_slave | bridge | bridge_slave |
> >          dsa | dummy | erspan | geneve | gre | gretap | gtp | hsr |
> >          ifb | ip6erspan | ip6gre | ip6gretap | ip6tnl |
> >          ipip | ipoib | ipvlan | ipvtap |
> >          macsec | macvlan | macvtap | netdevsim |
> >          netkit | nlmon | pfcp | rmnet | sit | team | team_slave |
> >          vcan | veth | vlan | vrf | vti | vxcan | vxlan | wwan |
> >          xfrm | virt_wifi }
> > error code triggered
> >
> > This is still problematic.
> >
> >
> > But so far code leaves "ip help" command/argument in better shape than 
> > it found it in.
> >
> >
> > I may try improve things more, but lets submit what we already have 
> > "better", please.
> >
> > Kind Regards
> >
> > Dmitri Seletski
> >
> >
> > On 6/22/26 17:44, David Laight wrote:  
> >> On Mon, 22 Jun 2026 07:57:00 -0700
> >> Stephen Hemminger <stephen@networkplumber.org> wrote:
> >>  
> >>> On Sun, 21 Jun 2026 22:48:59 +0100
> >>> Dmitri Seletski <drjoms@gmail.com> wrote:
> >>>  
> >>>>  From 0805e07105cd15c5b94271a4706e50e3c65dbde5 Mon Sep 17 00:00:00 
> >>>> 2001
> >>>> From: Dmitri Seletski <drjoms@gmail.com>
> >>>> Date: Sun, 21 Jun 2026 22:12:43 +0100
> >>>> Subject: [PATCH iproute2-next]  "ip help" wrong output, exit code.
> >>>>
> >>>> Changed output of "ip help" from standard error to standard output. 
> >>>> And
> >>>> Exit is now 0 instead of -1. "ip help|grep bridge" - now gives bridge
> >>>> syntax instead of flooding user with everything from "ip help".
> >>>> ---
> >>>> ip/ip.c | 4 ++--
> >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/ip/ip.c b/ip/ip.c
> >>>> index e4b71bde..4627b61c 100644
> >>>> --- a/ip/ip.c
> >>>> +++ b/ip/ip.c
> >>>> @@ -56,7 +56,7 @@ static void usage(void) __attribute__((noreturn));
> >>>>
> >>>> static void usage(void)
> >>>> {
> >>>> -fprintf(stderr,
> >>>> +fprintf(stdout,
> >>>> "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n"
> >>>> "       ip [ -force ] -batch filename\n"
> >>>> "where  OBJECT := { address | addrlabel | fou | help | ila | ioam | 
> >>>> l2tp
> >>>> | link |\n"
> >>>> @@ -72,7 +72,7 @@ static void usage(void)
> >>>> "                    -o[neline] | -t[imestamp] | -ts[hort] | -b[atch]
> >>>> [filename] |\n"
> >>>> "                    -rc[vbuf] [size] | -n[etns] name | -N[umeric] |
> >>>> -a[ll] |\n"
> >>>> "                    -c[olor]}\n");
> >>>> -exit(-1);
> >>>> +exit(0);
> >>>> }  
> >>> Your mailer damages white space.
> >>>  
> >> The output also needs to depend on whether there is a 'usage' error or
> >> if 'help' is requested.
> >> The code is correct for the former - except it should do exit(1).
> >>
> >>     David
> >>
> >>  
> 

We need to have a broad solution that doesn't look ugly.
There are a couple of problems with the current code:
  1. Help should exit with 0 (ok); an invalid argument should exit with non-zero.
     By GNU convention that is 2, but other commands like git use 129.
  2. Help should go to stdout; usage on error should go to stderr.

The solution should work across iproute2 commands: ip, tc, dpll, tipc, bridge, ...
and the sub commands.

So far the mailing list patches were kind of messy and limited.
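
As a rough sketch of the two rules above, assuming a shared helper that
each command could call: show_usage() is a hypothetical name, not an
existing iproute2 function, and the exit status 2 is just one of the
conventions mentioned, picked for illustration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: print the usage text and return the exit status.
 * Requested help goes to stdout with status 0; a usage error goes to
 * stderr with a non-zero status (2 here, an assumption).
 */
static int show_usage(const char *text, bool help_requested)
{
	fputs(text, help_requested ? stdout : stderr);
	return help_requested ? EXIT_SUCCESS : 2;
}
```

A caller would do exit(show_usage(usage_text, true)) for "ip help" and
exit(show_usage(usage_text, false)) on a bad argument, so the policy
lives in one place for every command and sub command.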

^ permalink raw reply

* Re: [PATCH net-next] net: neigh: avoid calling neigh_forced_gc on every alloc when table is full
From: Kuniyuki Iwashima @ 2026-06-25 21:45 UTC (permalink / raw)
  To: avimalin; +Cc: edumazet, kuniyu, netdev, vimal.agrawal, kuba
In-Reply-To: <20260625102020.92814-1-vimal.agrawal@sophos.com>

From: Vimal Agrawal <avimalin@gmail.com>
Date: Thu, 25 Jun 2026 10:20:20 +0000
> Once the neighbour table exceeds gc_thresh3, neigh_forced_gc() is called
> on every allocation attempt with no rate limiting. In workloads with mostly
> active/reachable entries, the GC walk traverses a large portion of the
> neighbour table without reclaiming entries, holding tbl->lock for an
> extended period. This causes severe lock contention and allocation
> latencies exceeding 16ms under sustained neighbour creation.
> 
> Add a pre-lock check in neigh_forced_gc() to skip the GC run if one was
> performed within the last second, avoiding repeated full table scans and
> lock acquisitions on the hot allocation path.
> 
> Profiling of neigh_create() shows ~3 orders of magnitude latency
> improvement with this change.
> 
> Link:https://lore.kernel.org/netdev/CALkUMdSCpx_ywYCx_ePLdm6yioO1nQWx7sSM=AEgsq0kywHxTw@mail.gmail.com/

From the thread, these look misconfigured.

---8<---
net.ipv6.neigh.default.gc_thresh2 = 32768
net.ipv6.neigh.default.gc_thresh3 = 32768
---8<---

If gc_thresh3 is large enough, gc_thresh2 will give you 5s
rate limiting.

If the number of active neigh entries constantly exceeds
gc_thresh3, it will be the correct gc_thresh2 for you.

Also, I guess you want a new kernel param for the initial
neigh_hash_alloc() shift, which is currently fixed at 3 and
is too small for some hosts.

50000 entries require neigh_hash_grow() 13 times.
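
The arithmetic, as a sketch: it assumes the table starts at shift 3
(8 buckets) and doubles whenever the entry count exceeds the bucket
count. grows_needed() is a hypothetical helper for illustration.

```c
/*
 * Count how many doublings are needed before (1 << shift) buckets can
 * hold `entries`, starting from `start_shift`.
 */
static unsigned int grows_needed(unsigned long entries,
				 unsigned int start_shift)
{
	unsigned int shift = start_shift;

	while ((1UL << shift) < entries)
		shift++;

	return shift - start_shift;
}
```

grows_needed(50000, 3) is 13, because 2^15 = 32768 is still too small
and 2^16 = 65536 is the first size that fits; grows_needed(50000, 16)
is 0.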

Can you test this on your real workload, starting from
neigh_hash_shift=16 and appropriate gc_thresh2/3 ?

---8<---
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 1349c0eedb64..a75b3750eec9 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1817,6 +1817,22 @@ EXPORT_SYMBOL(neigh_parms_release);
 static struct lock_class_key neigh_table_proxy_queue_class;
 
 static struct neigh_table __rcu *neigh_tables[NEIGH_NR_TABLES] __read_mostly;
+static __initdata unsigned long neigh_hash_shift = 3;
+
+static int __init neigh_set_hash_shift(char *str)
+{
+	ssize_t ret;
+
+	if (!str)
+		return 0;
+
+	ret = kstrtoul(str, 0, &neigh_hash_shift);
+	if (ret)
+		return 0;
+
+	return 1;
+}
+__setup("neigh_hash_shift=", neigh_set_hash_shift);
 
 void neigh_table_init(int index, struct neigh_table *tbl)
 {
@@ -1843,7 +1859,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 		panic("cannot create neighbour proc dir entry");
 #endif
 
-	RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(3));
+	RCU_INIT_POINTER(tbl->nht, neigh_hash_alloc(neigh_hash_shift));
 
 	phsize = (PNEIGH_HASHMASK + 1) * sizeof(struct pneigh_entry *);
 	tbl->phash_buckets = kzalloc(phsize, GFP_KERNEL);
---8<---



> Signed-off-by: Vimal Agrawal <vimal.agrawal@sophos.com>
> ---
>  net/core/neighbour.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 1349c0eedb64..078842db3c5f 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -260,6 +260,9 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>  	int shrunk = 0;
>  	int loop = 0;
>  
> +	if (!time_after(jiffies, READ_ONCE(tbl->last_flush) + HZ))
> +		return 0;
> +
>  	NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
>  
>  	spin_lock_bh(&tbl->lock);
> -- 
> 2.17.1

^ permalink raw reply related

* Re: [PATCH v14 0/9] tls: Add TLS 1.3 hardware offload support
From: Rishikesh Jethwani @ 2026-06-25 22:57 UTC (permalink / raw)
  To: Nils Juenemann
  Cc: netdev, borisp, davem, edumazet, john.fastabend, kuba, leon,
	mbloch, saeedm, sd, tariqt
In-Reply-To: <CAMPsyauZ+jzG9AysO0FWv6ZY0kvCUpjX_U7o=oOjCuOQ87BCgg@mail.gmail.com>

On Tue, Jun 23, 2026 at 10:53 AM Nils Juenemann
<nils.juenemann@gmail.com> wrote:
>
> Hi Rishikesh, all,
>
> we have been testing the v14 TLS 1.3 HW offload series on a ConnectX-6
> DX and hit a sendfile() final-record loss on the device TX path. We
> reduced it to a self-contained C reproducer and characterized it;
> reporting it here with the analysis and a question on where a fix belongs.
>
> Setup:
>
> NIC: ConnectX-6 DX (crypto enabled), FW 22.47.1026, SR-IOV VF,
> TX offload only
>
> Kernel: net-next + this v14 series
>
> TLS 1.3, AES-128-GCM, kTLS installed via setsockopt(TLS_TX) on the
> sending side with fixed test crypto material and no handshake, like
> tools/testing/selftests/net/tls
>
> a server sends a file with the raw sendfile(2) syscall; a client on
> another host reads the decrypted stream and counts the bytes
>
> Trigger: sendfile(2) with a count larger than the bytes remaining in
> the file (count > EOF). This is what a generic copy loop / Go's
> net.TCPConn.ReadFrom passes for a file of unknown length (~2 GiB). The
> kernel sends up to EOF, but the connection's final TLS record then
> appears not to be put on the wire unless a subsequent write flushes it.
> An abrupt close() appears to drop it, and the peer receives the whole
> body except the last record's bytes.
>
> Reproducer results (two hosts over the ConnectX - a loopback/same-host
> connection stays on TLS_SW and does not show it). Same file, 226965
> bytes (= 13*16384 + 13973):
>
> TLS_HW count>EOF close() -> 212992 short
> TLS_HW count>EOF close(), no zerocopy -> 212992 same
> TLS_HW count==exact close() -> 226965 full
> TLS_HW count>EOF close_notify, then close() -> 226965 full
> TLS_SW count>EOF close(), hw-tx-offload off -> 226965 full
>
> So it is specific to the device-offload TX path: the final record of a
> count > EOF sendfile() appears not to be finalized/flushed at EOF, only
> by a following write. A bounded count, a trailing write (close_notify),
> or software kTLS all avoid it. TLS_TX_ZEROCOPY_RO makes no difference.
> We are currently using the exact-count workaround in a preview environment.
>
> We may be misreading the code, so this is only a pointer: with
> count > EOF tls_push_data() fills the last record without reaching the
> size==0 case; on the device path tls_device_record_close() for that
> pending record appears to run only on the next push, and an abrupt
> teardown appears to discard it. The software path seems to flush
> pending TX records on close (tls_sw_release_resources_tx), which would
> explain why it is unaffected.
>
> Reproducer:
> https://gist.github.com/totallyunknown/a8f0ad3c54e40befde2f5a8d360fa6be
>
> It installs kTLS with fixed test crypto material via
> setsockopt(TLS_TX/TLS_RX), sends a file using the raw sendfile(2)
> syscall, and compares count > EOF against exact-count and close_notify.
> The v14 selftest (patch 9/9) sends via send() only and ends cleanly, so
> it misses this; a sendfile() + count > EOF case reproduces it
> deterministically for us.
>
> Question: should the device offload finalize and flush the connection's
> final record at EOF / on close, the way software kTLS does, or is a
> trailing write required by contract? And should a fix live in net/tls
> (device record close on the final partial record / the close path) or
> on the mlx5 side?
>

close() should be sufficient here.
I will fix this in net/tls/tls_device.c:tls_device_splice_eof().
tls_device_splice_eof() only checked tls_is_partially_sent_record(),
but it also needs to handle tls_is_pending_open_record()
(pending_open_record_frags). On the device TX path, that state occurs
when tls_push_data() exits with MSG_MORE set, which is what
splice/sendfile does while count > EOF still leaves requested bytes
outstanding.
So EOF can be reached with a final open record still pending. The fix
is to close and push that record from tls_device_splice_eof(), so
close() remains sufficient and no trailing write is required.
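
Until that fix is available, the exact-count workaround from your
report can be sketched in userspace roughly as below. This is an
illustration only, not the reproducer: sendfile_exact() is a
hypothetical name, and it assumes a blocking out_fd and a regular file
whose size does not change during the transfer.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Send the whole file while never asking for more bytes than remain,
 * so count never exceeds EOF. Returns the bytes sent, or -1 on error.
 */
static ssize_t sendfile_exact(int out_fd, int in_fd)
{
	struct stat st;
	off_t off = 0;

	if (fstat(in_fd, &st) < 0)
		return -1;

	while (off < st.st_size) {
		/* sendfile() advances off by the number of bytes it sent. */
		ssize_t n = sendfile(out_fd, in_fd, &off,
				     (size_t)(st.st_size - off));

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) /* file shrank underneath us */
			break;
	}

	return (ssize_t)off;
}
```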

^ permalink raw reply

* Re: [PATCH net] net: airoha: dma map xmit frags with skb_frag_dma_map()
From: Harshitha Ramamurthy @ 2026-06-25 22:59 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-arm-kernel, linux-mediatek, netdev
In-Reply-To: <20260625-airoha-eth-skb_frag_dma_map-v1-1-31d9e460aae6@kernel.org>

On Thu, Jun 25, 2026 at 2:43 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> Map xmit skb fragments using skb_frag_dma_map() instead of
> dma_map_single(skb_frag_address()). skb_frag_address() relies on
> page_address() to obtain a kernel virtual address, which is not
> guaranteed to work for all page types (e.g. highmem pages or
> user-pinned pages from MSG_ZEROCOPY).
> skb_frag_dma_map() maps the fragment directly via its struct page and
> offset through dma_map_page(), avoiding the need for a kernel virtual
> address entirely.
> Introduce an enum airoha_dma_map_type to track how each queue entry was
> mapped (single vs page), so that the matching unmap function is called
> on completion and in error paths.
>
> Fixes: 23020f049327 ("net: airoha: Introduce ethernet support for EN7581 SoC")
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>

> ---
>  drivers/net/ethernet/airoha/airoha_eth.c | 61 ++++++++++++++++++++------------
>  drivers/net/ethernet/airoha/airoha_eth.h |  7 ++++
>  2 files changed, 45 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> index 932b3a3df2e5..1caf6766f2c0 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.c
> +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> @@ -944,6 +944,25 @@ static void airoha_qdma_wake_netdev_txqs(struct airoha_queue *q)
>         q->txq_stopped = false;
>  }
>
> +static void airoha_unmap_xmit_buf(struct airoha_eth *eth,
> +                                 struct airoha_queue_entry *e)
> +{
> +       switch (e->dma_type) {
> +       case AIROHA_DMA_MAP_PAGE:
> +               dma_unmap_page(eth->dev, e->dma_addr, e->dma_len,
> +                              DMA_TO_DEVICE);
> +               break;
> +       case AIROHA_DMA_MAP_SINGLE:
> +               dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> +                                DMA_TO_DEVICE);
> +               break;
> +       case AIROHA_DMA_UNMAPPED:
> +       default:
> +               break;
> +       }
> +       e->dma_type = AIROHA_DMA_UNMAPPED;
> +}
> +
>  static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>  {
>         struct airoha_tx_irq_queue *irq_q;
> @@ -1006,9 +1025,7 @@ static int airoha_qdma_tx_napi_poll(struct napi_struct *napi, int budget)
>                 skb = e->skb;
>                 e->skb = NULL;
>
> -               dma_unmap_single(eth->dev, e->dma_addr, e->dma_len,
> -                                DMA_TO_DEVICE);
> -               e->dma_addr = 0;
> +               airoha_unmap_xmit_buf(eth, e);
>                 list_add_tail(&e->list, &q->tx_list);
>
>                 WRITE_ONCE(desc->msg0, 0);
> @@ -1177,12 +1194,10 @@ static void airoha_qdma_tx_cleanup(struct airoha_qdma *qdma)
>                         struct airoha_qdma_desc *desc = &q->desc[j];
>                         struct sk_buff *skb = e->skb;
>
> -                       if (!e->dma_addr)
> +                       if (e->dma_type == AIROHA_DMA_UNMAPPED)
>                                 continue;
>
> -                       dma_unmap_single(qdma->eth->dev, e->dma_addr,
> -                                        e->dma_len, DMA_TO_DEVICE);
> -                       e->dma_addr = 0;
> +                       airoha_unmap_xmit_buf(qdma->eth, e);
>                         list_add_tail(&e->list, &q->tx_list);
>
>                         WRITE_ONCE(desc->ctrl, 0);
> @@ -2193,8 +2208,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>         struct netdev_queue *txq;
>         struct airoha_queue *q;
>         LIST_HEAD(tx_list);
> +       dma_addr_t addr;
>         int i = 0, qid;
> -       void *data;
>         u16 index;
>         u8 fport;
>
> @@ -2250,24 +2265,22 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>                 return NETDEV_TX_BUSY;
>         }
>
> -       len = skb_headlen(skb);
> -       data = skb->data;
> -
>         e = list_first_entry(&q->tx_list, struct airoha_queue_entry,
>                              list);
> +       len = skb_headlen(skb);
> +       addr = dma_map_single(netdev->dev.parent, skb->data, len,
> +                             DMA_TO_DEVICE);
> +       if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> +               goto error_unlock;
> +
> +       e->dma_type = AIROHA_DMA_MAP_SINGLE;
>         index = e - q->entry;
>
>         while (true) {
>                 struct airoha_qdma_desc *desc = &q->desc[index];
>                 skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> -               dma_addr_t addr;
>                 u32 val;
>
> -               addr = dma_map_single(netdev->dev.parent, data, len,
> -                                     DMA_TO_DEVICE);
> -               if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> -                       goto error_unmap;
> -
>                 list_move_tail(&e->list, &tx_list);
>                 e->skb = i == nr_frags - 1 ? skb : NULL;
>                 e->dma_addr = addr;
> @@ -2291,8 +2304,13 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>                 if (++i == nr_frags)
>                         break;
>
> -               data = skb_frag_address(frag);
>                 len = skb_frag_size(frag);
> +               addr = skb_frag_dma_map(netdev->dev.parent, frag, 0, len,
> +                                       DMA_TO_DEVICE);
> +               if (unlikely(dma_mapping_error(netdev->dev.parent, addr)))
> +                       goto error_unmap;
> +
> +               e->dma_type = AIROHA_DMA_MAP_PAGE;
>         }
>         q->queued += i;
>
> @@ -2313,11 +2331,8 @@ static netdev_tx_t airoha_dev_xmit(struct sk_buff *skb,
>         return NETDEV_TX_OK;
>
>  error_unmap:
> -       list_for_each_entry(e, &tx_list, list) {
> -               dma_unmap_single(netdev->dev.parent, e->dma_addr, e->dma_len,
> -                                DMA_TO_DEVICE);
> -               e->dma_addr = 0;
> -       }
> +       list_for_each_entry(e, &tx_list, list)
> +               airoha_unmap_xmit_buf(dev->eth, e);
>         list_splice(&tx_list, &q->tx_list);
>  error_unlock:
>         spin_unlock_bh(&q->lock);
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
> index d7ff8c5200e2..2765244d937c 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.h
> +++ b/drivers/net/ethernet/airoha/airoha_eth.h
> @@ -170,12 +170,19 @@ enum trtcm_param {
>  #define TRTCM_TOKEN_RATE_MASK                  GENMASK(23, 6)
>  #define TRTCM_TOKEN_RATE_FRACTION_MASK         GENMASK(5, 0)
>
> +enum airoha_dma_map_type {
> +       AIROHA_DMA_UNMAPPED,
> +       AIROHA_DMA_MAP_SINGLE,
> +       AIROHA_DMA_MAP_PAGE,
> +};
> +
>  struct airoha_queue_entry {
>         union {
>                 void *buf;
>                 struct {
>                         struct list_head list;
>                         struct sk_buff *skb;
> +                       enum airoha_dma_map_type dma_type;
>                 };
>         };
>         dma_addr_t dma_addr;
>
> ---
> base-commit: 232c4ca2343d1181cbfc061f9856d9591e397579
> change-id: 20260625-airoha-eth-skb_frag_dma_map-bcccd5d6e4b1
>
> Best regards,
> --
> Lorenzo Bianconi <lorenzo@kernel.org>
>
>

^ permalink raw reply
