Netdev List
* [PATCH net v3] tipc: fix use-after-free of the discoverer in tipc_disc_rcv()
From: Weiming Shi @ 2026-06-17 13:57 UTC
  To: Jon Maloy, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Ying Xue, netdev, tipc-discussion, linux-kernel,
	Xiang Mei, Weiming Shi

bearer_disable() frees b->disc with tipc_disc_delete()'s plain kfree(),
but tipc_disc_rcv() still dereferences b->disc in RX softirq under
rcu_read_lock() (tipc_udp_recv -> tipc_rcv -> tipc_disc_rcv).

L2 bearers are safe thanks to the synchronize_net() in
tipc_disable_l2_media(), but the UDP bearer defers that call to the
cleanup_bearer() workqueue, so the discoverer is freed with no grace
period:

 BUG: KASAN: slab-use-after-free in tipc_disc_rcv (net/tipc/discover.c:149)
 Read of size 8 at addr ffff88802348b728 by task poc_tipc/184
 <IRQ>
  tipc_disc_rcv (net/tipc/discover.c:149)
  tipc_rcv (net/tipc/node.c:2126)
  tipc_udp_recv (net/tipc/udp_media.c:391)
  udp_rcv (net/ipv4/udp.c:2643)
  ip_local_deliver_finish (net/ipv4/ip_input.c:241)
 </IRQ>
 Freed by task 181:
  kfree (mm/slub.c:6565)
  bearer_disable (net/tipc/bearer.c:418)
  tipc_nl_bearer_disable (net/tipc/bearer.c:1001)

The bearer is freed with kfree_rcu(); free the discoverer the same way.
Add an rcu_head to struct tipc_discoverer and free it and its skb from an
RCU callback.

Because the RCU callback (tipc_disc_free_rcu) lives in module text, a
call_rcu() that is still pending when the tipc module is unloaded would
invoke a freed function. Add an rcu_barrier() to tipc_exit() after the
bearer subsystem has been torn down, so all pending discoverer callbacks
have run before the module text goes away.
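
The deferred-free pattern described above can be sketched in userspace. This is a minimal mock under stated assumptions, not kernel code: mock_call_rcu() and mock_rcu_barrier() are hypothetical stand-ins that simply queue callbacks on a list and later run them, whereas the real RCU waits for a grace period. All mock_* names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head (hypothetical mock). */
struct mock_rcu_head {
	struct mock_rcu_head *next;
	void (*func)(struct mock_rcu_head *head);
};

/* Same idea as container_of(): member pointer -> enclosing object. */
#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_discoverer {
	int *skb;			/* stands in for the request skb */
	struct mock_rcu_head rcu;	/* embedded head for deferred freeing */
};

static struct mock_rcu_head *pending;	/* callbacks queued but not yet run */
static int freed_count;			/* number of discoverers freed so far */

/* Queue a callback instead of freeing immediately. */
static void mock_call_rcu(struct mock_rcu_head *head,
			  void (*func)(struct mock_rcu_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Run every queued callback; models "all pending callbacks have run". */
static void mock_rcu_barrier(void)
{
	while (pending) {
		struct mock_rcu_head *head = pending;

		pending = head->next;
		head->func(head);
	}
}

/* Callback: recover the enclosing object, then free it and its buffer. */
static void mock_disc_free_rcu(struct mock_rcu_head *rp)
{
	struct mock_discoverer *d =
		mock_container_of(rp, struct mock_discoverer, rcu);

	free(d->skb);
	free(d);
	freed_count++;
}

static struct mock_discoverer *mock_disc_create(void)
{
	struct mock_discoverer *d = calloc(1, sizeof(*d));

	assert(d);
	d->skb = malloc(sizeof(*d->skb));
	assert(d->skb);
	return d;
}

/* Same shape as the patched tipc_disc_delete(): defer, do not free. */
static void mock_disc_delete(struct mock_discoverer *d)
{
	mock_call_rcu(&d->rcu, mock_disc_free_rcu);
}
```

In this mock, an object passed to mock_disc_delete() stays valid until mock_rcu_barrier() runs, which is the property a concurrent reader relies on; it also shows why the callback code must still exist when the barrier drains the queue.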

Reachable from an unprivileged user namespace: the TIPCv2 genl family is
netnsok and its bearer commands have no GENL_ADMIN_PERM. Needs CONFIG_TIPC
and CONFIG_TIPC_MEDIA_UDP.

Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
Reported-by: Xiang Mei <xmei5@asu.edu>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
v3:
 - Reword the rcu_barrier() comment as a TODO (Tung Quang Nguyen).
v2:
 - split the over-80-column container_of() line (Tung Quang Nguyen)
 - add rcu_barrier() to tipc_exit() so a pending call_rcu() cannot fire
   into freed module text after rmmod (Eric Dumazet)

 net/tipc/core.c     |  5 +++++
 net/tipc/discover.c | 14 ++++++++++++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 434e70eabe08..1ddecea1df6e 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -218,6 +218,11 @@ static void __exit tipc_exit(void)
 	unregister_pernet_device(&tipc_net_ops);
 	tipc_unregister_sysctl();
 
+	/* TODO: Wait for all timers that called call_rcu() to finish before
+	 * calling rcu_barrier().
+	 */
+	rcu_barrier();
+
 	pr_info("Deactivated\n");
 }
 
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index 3e54d2df5683..b9d06595b067 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -58,6 +58,7 @@
  * @skb: request message to be (repeatedly) sent
  * @timer: timer governing period between requests
  * @timer_intv: current interval between requests (in ms)
+ * @rcu: RCU head for deferred freeing
  */
 struct tipc_discoverer {
 	u32 bearer_id;
@@ -69,6 +70,7 @@ struct tipc_discoverer {
 	struct sk_buff *skb;
 	struct timer_list timer;
 	unsigned long timer_intv;
+	struct rcu_head rcu;
 };
 
 /**
@@ -382,6 +384,15 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
 	return 0;
 }
 
+static void tipc_disc_free_rcu(struct rcu_head *rp)
+{
+	struct tipc_discoverer *d = container_of(rp, struct tipc_discoverer,
+						 rcu);
+
+	kfree_skb(d->skb);
+	kfree(d);
+}
+
 /**
  * tipc_disc_delete - destroy object sending periodic link setup requests
  * @d: ptr to link dest structure
@@ -389,8 +400,7 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
 void tipc_disc_delete(struct tipc_discoverer *d)
 {
 	timer_shutdown_sync(&d->timer);
-	kfree_skb(d->skb);
-	kfree(d);
+	call_rcu(&d->rcu, tipc_disc_free_rcu);
 }
 
 /**
-- 
2.43.0



* [PATCH net V2 0/3] net/mlx5e: Fix crashes in dynamic per-channel stats and HV VHCA agent
From: Tariq Toukan @ 2026-06-17 14:01 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan

Hi,

Since per-channel stats were converted to be allocated and published
lazily at first channel open in commit fa691d0c9c08 ("net/mlx5e:
Allocate per-channel stats dynamically at first usage"),
priv->channel_stats[] and priv->stats_nch are filled in
incrementally during interface bring-up. This opened a window in
which the various stats readers - most of them reachable from
userspace via netlink/netdev stats queries - can race with
mlx5e_open_channel() on another CPU and observe partially
initialized state. The HV VHCA stats agent, which is created
before the channels are opened, hits related problems of its own.

This series by Feng fixes the resulting crashes.

Regards,
Tariq

V2:
- Drop "Bounds-check stats_nch in mlx5e_get_queue_stats_rx()" (Jakub).

V1:
https://lore.kernel.org/all/20260604135041.455754-1-tariqt@nvidia.com

Feng Liu (3):
  net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
  net/mlx5e: Fix HV VHCA stats agent registration race
  net/mlx5e: Fix publication race for priv->channel_stats[]

 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 12 ++++++
 .../mellanox/mlx5/core/en/hv_vhca_stats.c     | 38 +++++++++++++------
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 14 ++++---
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  9 +++--
 .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  3 +-
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.c |  8 +++-
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.h |  6 ++-
 7 files changed, 63 insertions(+), 27 deletions(-)


base-commit: 406e8a651a7b854c41fecd5117bb282b3a6c2c6b
-- 
2.44.0



* [PATCH net V2 1/3] net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
From: Tariq Toukan @ 2026-06-17 14:01 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan
In-Reply-To: <20260617140127.573117-1-tariqt@nvidia.com>

From: Feng Liu <feliu@nvidia.com>

mlx5e_hv_vhca_stats_create() is called from mlx5e_nic_enable(),
before mlx5e_open(). At that point priv->stats_nch is still zero,
because it is only ever incremented in mlx5e_channel_stats_alloc(),
which is reached only from mlx5e_open_channel().

mlx5e_hv_vhca_stats_buf_size() therefore returns 0, and
kvzalloc(0, GFP_KERNEL) returns ZERO_SIZE_PTR ((void *)16) rather
than NULL. The "if (!buf)" guard does not catch this, and
mlx5e_hv_vhca_stats_create() completes "successfully" with
priv->stats_agent.buf set to ZERO_SIZE_PTR.

Once channels are opened (priv->stats_nch > 0) and the hypervisor
enables stats reporting, mlx5e_hv_vhca_stats_work() recomputes
buf_len using the new non-zero stats_nch and calls
memset(buf, 0, buf_len) on ZERO_SIZE_PTR, faulting at address 0x10.
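
Why the NULL check misses this case can be shown with a small userspace sketch. The sentinel value 16 is taken from the description above; mock_zalloc() and the helper names are hypothetical and only imitate the zero-size behaviour, they are not the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sentinel value of ZERO_SIZE_PTR, reproduced here for illustration. */
#define MOCK_ZERO_SIZE_PTR ((void *)16)

/* Hypothetical allocator: a zero-size request yields the sentinel, not NULL. */
static void *mock_zalloc(size_t size)
{
	if (size == 0)
		return MOCK_ZERO_SIZE_PTR;
	return calloc(1, size);
}

/* The kind of guard used in the create path: it only catches NULL. */
static int null_check_passes(const void *buf)
{
	return buf != NULL;
}

/* A stricter check that also treats the sentinel as "no usable buffer". */
static int is_zero_or_null(const void *buf)
{
	return (uintptr_t)buf <= (uintptr_t)MOCK_ZERO_SIZE_PTR;
}

/* Free only real allocations; the sentinel must never reach free(). */
static void mock_free(void *buf)
{
	if (!is_zero_or_null(buf))
		free(buf);
}
```

A buffer obtained with mock_zalloc(0) passes null_check_passes() even though it must never be written to, which mirrors the failure mode in the commit message.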

Allocate the buffer based on priv->max_nch, which is set in
mlx5e_priv_init() and is the upper bound on stats_nch:

  - Add a separate helper mlx5e_hv_vhca_stats_buf_max_size() that
    returns sizeof(per_ring_stats) * max(max_nch, stats_nch), and
    use it for the kvzalloc() in mlx5e_hv_vhca_stats_create().
  - Keep mlx5e_hv_vhca_stats_buf_size() (which returns based on
    stats_nch) for the worker's active payload size, so the wire
    format (block->rings = stats_nch) and the amount of data filled
    by mlx5e_hv_vhca_fill_stats() are unchanged.

The max(max_nch, stats_nch) guard handles the rare case where
mlx5e_attach_netdev() recomputes max_nch downward across a
detach/resume cycle while priv->stats_nch persists (mlx5e_detach_netdev
does not call mlx5e_priv_cleanup, so stats_nch is only reset when
the netdev is destroyed). Without the guard, the worker could compute
buf_len from stats_nch and overrun the smaller buffer allocated based
on the reduced max_nch.
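
The sizing rule can be illustrated with a short sketch. The struct layout below is a hypothetical placeholder (the real per-ring record is not shown in this patch), and the mock_* helpers are assumptions that only model the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-ring record; the real layout is not part of this patch. */
struct mock_per_ring_stats {
	unsigned long long rx_packets;
	unsigned long long rx_bytes;
	unsigned long long tx_packets;
	unsigned long long tx_bytes;
};

/* Active payload size: what the worker fills and sends (from stats_nch). */
static size_t mock_buf_size(unsigned int stats_nch)
{
	return sizeof(struct mock_per_ring_stats) * stats_nch;
}

/* Allocation size: never smaller than any payload the worker may compute. */
static size_t mock_buf_max_size(unsigned int max_nch, unsigned int stats_nch)
{
	unsigned int n = max_nch > stats_nch ? max_nch : stats_nch;

	return sizeof(struct mock_per_ring_stats) * n;
}
```

With max_nch = 8 and stats_nch = 0 the allocation is non-zero, and with max_nch reduced to 4 while stats_nch is still 8 the allocation still covers the worker's payload.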

This mirrors the existing mlx5e pattern of preallocating arrays of
size max_nch (e.g. priv->channel_stats) and lazily populating
entries up to stats_nch on demand.

Fixes: fa691d0c9c08 ("net/mlx5e: Allocate per-channel stats dynamically at first usage")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c    | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
index 195863b2c013..06cbd49d4e98 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -54,6 +54,12 @@ static int mlx5e_hv_vhca_stats_buf_size(struct mlx5e_priv *priv)
 		priv->stats_nch);
 }
 
+static int mlx5e_hv_vhca_stats_buf_max_size(struct mlx5e_priv *priv)
+{
+	return (sizeof(struct mlx5e_hv_vhca_per_ring_stats) *
+		max(priv->max_nch, priv->stats_nch));
+}
+
 static void mlx5e_hv_vhca_stats_work(struct work_struct *work)
 {
 	struct mlx5e_hv_vhca_stats_agent *sagent;
@@ -122,7 +128,7 @@ static void mlx5e_hv_vhca_stats_cleanup(struct mlx5_hv_vhca_agent *agent)
 
 void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
 {
-	int buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
+	int buf_len = mlx5e_hv_vhca_stats_buf_max_size(priv);
 	struct mlx5_hv_vhca_agent *agent;
 
 	priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
-- 
2.44.0



* [PATCH net V2 2/3] net/mlx5e: Fix HV VHCA stats agent registration race
From: Tariq Toukan @ 2026-06-17 14:01 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan
In-Reply-To: <20260617140127.573117-1-tariqt@nvidia.com>

From: Feng Liu <feliu@nvidia.com>

mlx5e_hv_vhca_stats_create() registers the stats agent through
mlx5_hv_vhca_agent_create(). The helper publishes the agent in
hv_vhca->agents[type] under agents_lock and immediately schedules an
asynchronous control invalidation on the HV VHCA workqueue before
returning to mlx5e.

The asynchronous invalidation invokes the control agent's invalidate
callback, which reads the hypervisor control block and forwards the
command to mlx5e_hv_vhca_stats_control(). That callback may either:

  - call cancel_delayed_work_sync(&priv->stats_agent.work), or
  - call queue_delayed_work(priv->wq, &sagent->work, sagent->delay).

However, the delayed_work and priv->stats_agent.agent are only
initialized after mlx5_hv_vhca_agent_create() returns to mlx5e:

    agent = mlx5_hv_vhca_agent_create(...);   /* publish + invalidate */
    ...
    priv->stats_agent.agent = agent;          /* too late */
    INIT_DELAYED_WORK(&priv->stats_agent.work, ...); /* too late */

If the asynchronous control path runs before the two assignments
above, it can:

  - Operate on an uninitialized delayed_work whose timer.function is
    NULL. queue_delayed_work() calls add_timer() unconditionally, so
    when the timer expires the timer softirq invokes a NULL function
    pointer.
  - Re-initialize the timer later through INIT_DELAYED_WORK() while
    the timer is already enqueued in the timer wheel, corrupting the
    hlist (entry.pprev cleared while the previous bucket node still
    points at this entry).
  - When the worker eventually runs, mlx5e_hv_vhca_stats_work() reads
    sagent->agent (NULL) and dereferences it inside
    mlx5_hv_vhca_agent_write().

Fix this by:

  - Initializing priv->stats_agent.work before invoking
    mlx5_hv_vhca_agent_create(), so the work is always in a valid
    state when the control callback observes it.
  - Adding a struct mlx5_hv_vhca_agent **ctx_update out-parameter
    to mlx5_hv_vhca_agent_create(). The helper writes the agent
    pointer to *ctx_update before publishing into hv_vhca->agents[]
    and triggering the agents_update flow, so any callback
    subsequently invoked from that flow already sees a valid
    priv->stats_agent.agent. This avoids having the control
    callback participate in agent initialization.
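
The ordering in the two fix items above can be sketched in userspace. This is a simplified model under stated assumptions: the control callback is invoked synchronously from the create helper, which stands in for the worst-case timing of the asynchronous invalidation, and every mock_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mock_agent {
	void (*control)(struct mock_agent *agent);
	void *priv;
};

struct mock_stats_ctx {
	struct mock_agent *agent;	/* set through the out-parameter */
	int work_initialized;		/* stands in for INIT_DELAYED_WORK() */
	int saw_valid_state;		/* recorded by the control callback */
};

/* Control callback: may run as soon as the agent is published. */
static void mock_stats_control(struct mock_agent *agent)
{
	struct mock_stats_ctx *ctx = agent->priv;

	ctx->saw_valid_state = ctx->work_initialized && ctx->agent == agent;
}

/*
 * Hypothetical create helper: store the agent in *ctx_update before
 * "publishing" it, so a callback fired right away sees a valid pointer.
 */
static struct mock_agent *
mock_agent_create(void (*control)(struct mock_agent *agent), void *priv,
		  struct mock_agent **ctx_update)
{
	struct mock_agent *agent = calloc(1, sizeof(*agent));

	if (!agent)
		return NULL;
	agent->control = control;
	agent->priv = priv;

	if (ctx_update)
		*ctx_update = agent;	/* caller's pointer becomes valid first */

	if (agent->control)
		agent->control(agent);	/* models publish + invalidate */
	return agent;
}

static int mock_stats_create(struct mock_stats_ctx *ctx)
{
	ctx->work_initialized = 1;	/* initialize before creating the agent */
	if (!mock_agent_create(mock_stats_control, ctx, &ctx->agent)) {
		ctx->agent = NULL;	/* leave no stale pointer on failure */
		return -1;
	}
	return 0;
}
```

If the two steps in mock_stats_create() were swapped, or the pointer were assigned only from the return value, the callback would record saw_valid_state == 0, which corresponds to the window the patch closes.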

While at it, clear priv->stats_agent.{agent,buf} after teardown and
on the agent_create() failure path. Without this, an enable/disable
cycle hitting an early-return in create can lead to a UAF or
double-destroy of stale pointers from the previous cycle.

Fixes: cef35af34d6d ("net/mlx5e: Add mlx5e HV VHCA stats agent")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../mellanox/mlx5/core/en/hv_vhca_stats.c     | 22 ++++++++++++-------
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.c |  8 +++++--
 .../ethernet/mellanox/mlx5/core/lib/hv_vhca.h |  6 +++--
 3 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
index 06cbd49d4e98..2e495442a547 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -73,7 +73,7 @@ static void mlx5e_hv_vhca_stats_work(struct work_struct *work)
 	sagent = container_of(dwork, struct mlx5e_hv_vhca_stats_agent, work);
 	priv = container_of(sagent, struct mlx5e_priv, stats_agent);
 	buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
-	agent = sagent->agent;
+	agent = READ_ONCE(sagent->agent);
 	buf = sagent->buf;
 
 	memset(buf, 0, buf_len);
@@ -135,11 +135,14 @@ void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
 	if (!priv->stats_agent.buf)
 		return;
 
+	INIT_DELAYED_WORK(&priv->stats_agent.work, mlx5e_hv_vhca_stats_work);
+
 	agent = mlx5_hv_vhca_agent_create(priv->mdev->hv_vhca,
 					  MLX5_HV_VHCA_AGENT_STATS,
 					  mlx5e_hv_vhca_stats_control, NULL,
 					  mlx5e_hv_vhca_stats_cleanup,
-					  priv);
+					  priv,
+					  &priv->stats_agent.agent);
 
 	if (IS_ERR_OR_NULL(agent)) {
 		if (IS_ERR(agent))
@@ -148,18 +151,21 @@ void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
 				    agent);
 
 		kvfree(priv->stats_agent.buf);
-		return;
+		priv->stats_agent.buf = NULL;
 	}
-
-	priv->stats_agent.agent = agent;
-	INIT_DELAYED_WORK(&priv->stats_agent.work, mlx5e_hv_vhca_stats_work);
 }
 
 void mlx5e_hv_vhca_stats_destroy(struct mlx5e_priv *priv)
 {
-	if (IS_ERR_OR_NULL(priv->stats_agent.agent))
+	struct mlx5_hv_vhca_agent *agent;
+
+	agent = READ_ONCE(priv->stats_agent.agent);
+	if (IS_ERR_OR_NULL(agent))
 		return;
 
-	mlx5_hv_vhca_agent_destroy(priv->stats_agent.agent);
+	mlx5_hv_vhca_agent_destroy(agent);
 	kvfree(priv->stats_agent.buf);
+
+	WRITE_ONCE(priv->stats_agent.agent, NULL);
+	priv->stats_agent.buf = NULL;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
index d6dc7bce855e..305752dab7bd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.c
@@ -190,7 +190,7 @@ mlx5_hv_vhca_control_agent_create(struct mlx5_hv_vhca *hv_vhca)
 	return mlx5_hv_vhca_agent_create(hv_vhca, MLX5_HV_VHCA_AGENT_CONTROL,
 					 NULL,
 					 mlx5_hv_vhca_control_agent_invalidate,
-					 NULL, NULL);
+					 NULL, NULL, NULL);
 }
 
 static void mlx5_hv_vhca_control_agent_destroy(struct mlx5_hv_vhca_agent *agent)
@@ -256,7 +256,8 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
 					     u64 block_mask),
 			  void (*cleaup)(struct mlx5_hv_vhca_agent *agent),
-			  void *priv)
+			  void *priv,
+			  struct mlx5_hv_vhca_agent **ctx_update)
 {
 	struct mlx5_hv_vhca_agent *agent;
 
@@ -284,6 +285,9 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 	agent->invalidate = invalidate;
 	agent->cleanup   = cleaup;
 
+	if (ctx_update)
+		WRITE_ONCE(*ctx_update, agent);
+
 	mutex_lock(&hv_vhca->agents_lock);
 	hv_vhca->agents[type] = agent;
 	mutex_unlock(&hv_vhca->agents_lock);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
index f240ffe5116c..8b3974cf0ee4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/hv_vhca.h
@@ -43,7 +43,8 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
 					     u64 block_mask),
 			  void (*cleanup)(struct mlx5_hv_vhca_agent *agent),
-			  void *context);
+			  void *context,
+			  struct mlx5_hv_vhca_agent **ctx_update);
 
 void mlx5_hv_vhca_agent_destroy(struct mlx5_hv_vhca_agent *agent);
 int mlx5_hv_vhca_agent_write(struct mlx5_hv_vhca_agent *agent,
@@ -84,7 +85,8 @@ mlx5_hv_vhca_agent_create(struct mlx5_hv_vhca *hv_vhca,
 			  void (*invalidate)(struct mlx5_hv_vhca_agent*,
 					     u64 block_mask),
 			  void (*cleanup)(struct mlx5_hv_vhca_agent *agent),
-			  void *context)
+			  void *context,
+			  struct mlx5_hv_vhca_agent **ctx_update)
 {
 	return NULL;
 }
-- 
2.44.0



* Re: [PATCH net 0/2] devlink: Fix a couple parent ref leaks
From: Simon Horman @ 2026-06-17 14:02 UTC
  To: Cosmin Ratiu
  Cc: netdev, Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Michal Wilczynski, Carolina Jubran, Mark Bloch,
	Tariq Toukan
In-Reply-To: <20260616110633.1449432-1-cratiu@nvidia.com>

On Tue, Jun 16, 2026 at 02:06:31PM +0300, Cosmin Ratiu wrote:
> These two patches fix parent ref leaks on errors.
> 
> Cosmin Ratiu (2):
>   devlink: Fix parent ref leak in devl_rate_node_create()
>   devlink: Fix parent ref leak on tc-bw failure

Thanks Cosmin,

For the series:

Reviewed-by: Simon Horman <horms@kernel.org>



* [PATCH net V2 3/3] net/mlx5e: Fix publication race for priv->channel_stats[]
From: Tariq Toukan @ 2026-06-17 14:01 UTC
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	netdev, Paolo Abeni
  Cc: Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed, Tariq Toukan
In-Reply-To: <20260617140127.573117-1-tariqt@nvidia.com>

From: Feng Liu <feliu@nvidia.com>

mlx5e_channel_stats_alloc() publishes a new entry to
priv->channel_stats[] and then increments priv->stats_nch as a
publication token, but neither store carries any memory barrier:

	priv->channel_stats[ix] = kvzalloc_node(...);
	if (!priv->channel_stats[ix])
		return -ENOMEM;
	priv->stats_nch++;

Concurrent readers compute the loop bound from priv->stats_nch and
then dereference priv->channel_stats[i] using plain accesses, e.g.

	for (i = 0; i < priv->stats_nch; i++) {
		struct mlx5e_channel_stats *cs = priv->channel_stats[i];
		... cs->rq.packets ...
	}

On weakly-ordered architectures (ARM, PowerPC, RISC-V) the writes to
channel_stats[ix] and stats_nch may become visible to other CPUs out
of program order. A reader can observe stats_nch == N while still
seeing channel_stats[N-1] == NULL, leading to a NULL pointer
dereference in the channel_stats loop.

This has been observed in production on BlueField-3 DPUs (arm64),
where ovs-vswitchd queries netdev statistics over netlink during NIC
bringup, racing mlx5e_open_channel() -> mlx5e_channel_stats_alloc()
on another CPU:

  Unable to handle kernel NULL pointer dereference at virtual address 0x840
  Hardware name: BlueField-3 DPU
  pc : mlx5e_fold_sw_stats64+0x30/0x180 [mlx5_core]
  Call trace:
   mlx5e_fold_sw_stats64+0x30/0x180 [mlx5_core]
   dev_get_stats+0x50/0xc0
   ovs_vport_get_stats+0x38/0xac [openvswitch]
   ovs_vport_cmd_fill_info+0x194/0x290 [openvswitch]
   ovs_vport_cmd_get+0xbc/0x10c [openvswitch]
   genl_family_rcv_msg_doit+0xd0/0x160
   genl_rcv_msg+0xec/0x1f0
   netlink_rcv_skb+0x64/0x130
   genl_rcv+0x40/0x60
   netlink_unicast+0x2fc/0x370
   netlink_sendmsg+0x1dc/0x454
   ...
   __arm64_sys_sendmsg+0x2c/0x40

Add mlx5e_stats_nch_write() and mlx5e_stats_nch_read() helpers in en.h
that wrap the smp_store_release()/smp_load_acquire() pair on stats_nch.
The release/acquire pair establishes the contract:

  stats_nch == N  =>  channel_stats[0..N-1] are visible and non-NULL.

Publish the stats_nch increment via mlx5e_stats_nch_write() in the
writer (mlx5e_channel_stats_alloc()), and read stats_nch via
mlx5e_stats_nch_read() in all readers: mlx5e RX/TX queue stats,
mlx5e_get_base_stats(), ethtool channels stats, IPoIB stats, the
sw_stats fold and the HV VHCA stats agent.
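
The release/acquire contract can be sketched with C11 atomics in userspace. This is an analogy rather than the kernel implementation: atomic_store_explicit() with memory_order_release and atomic_load_explicit() with memory_order_acquire play the roles of smp_store_release() and smp_load_acquire(), and all mock_* names and MOCK_MAX_NCH are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MOCK_MAX_NCH 4

struct mock_channel_stats {
	unsigned long packets;
};

struct mock_priv {
	struct mock_channel_stats *channel_stats[MOCK_MAX_NCH];
	_Atomic unsigned short stats_nch;	/* publication token */
};

/* Acquire load: slots below the returned count are visible afterwards. */
static unsigned short mock_stats_nch_read(struct mock_priv *priv)
{
	return atomic_load_explicit(&priv->stats_nch, memory_order_acquire);
}

/* Release store: earlier slot writes are ordered before the new count. */
static void mock_stats_nch_write(struct mock_priv *priv, unsigned short n)
{
	atomic_store_explicit(&priv->stats_nch, n, memory_order_release);
}

/* Writer: fill the slot first, then publish the incremented count. */
static int mock_channel_stats_alloc(struct mock_priv *priv, int ix)
{
	unsigned short nch;

	if (ix < 0 || ix >= MOCK_MAX_NCH)
		return -1;
	priv->channel_stats[ix] = calloc(1, sizeof(*priv->channel_stats[ix]));
	if (!priv->channel_stats[ix])
		return -1;

	/* Single writer assumed, so a relaxed load of the count is enough. */
	nch = atomic_load_explicit(&priv->stats_nch, memory_order_relaxed);
	mock_stats_nch_write(priv, (unsigned short)(nch + 1));
	return 0;
}

/* Reader: snapshot the count once, then walk only published slots. */
static unsigned long mock_fold_packets(struct mock_priv *priv)
{
	unsigned short nch = mock_stats_nch_read(priv);
	unsigned long total = 0;
	unsigned short i;

	for (i = 0; i < nch; i++)
		total += priv->channel_stats[i]->packets;
	return total;
}
```

Reading the count once into a local, as the readers in this patch do, also keeps the loop bound stable if the writer publishes another channel while the loop is running.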

Fixes: fa691d0c9c08 ("net/mlx5e: Allocate per-channel stats dynamically at first usage")
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h       | 12 ++++++++++++
 .../ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c | 10 ++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 14 ++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c |  9 +++++----
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  3 ++-
 5 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2270e2e550dd..d507289096c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -987,6 +987,18 @@ struct mlx5e_priv {
 	struct ethtool_fec_hist_range *fec_ranges;
 };
 
+static inline u16 mlx5e_stats_nch_read(const struct mlx5e_priv *priv)
+{
+	/* Pairs with smp_store_release in mlx5e_stats_nch_write(). */
+	return smp_load_acquire(&priv->stats_nch);
+}
+
+static inline void mlx5e_stats_nch_write(struct mlx5e_priv *priv, u16 n)
+{
+	/* Pairs with smp_load_acquire in mlx5e_stats_nch_read(). */
+	smp_store_release(&priv->stats_nch, n);
+}
+
 struct mlx5e_dev {
 	struct net_device *netdev;
 	struct devlink_port dl_port;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
index 2e495442a547..9747d7736d37 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
@@ -33,9 +33,10 @@ mlx5e_hv_vhca_fill_ring_stats(struct mlx5e_priv *priv, int ch,
 static void mlx5e_hv_vhca_fill_stats(struct mlx5e_priv *priv, void *data,
 				     int buf_len)
 {
+	u16 nch = mlx5e_stats_nch_read(priv);
 	int ch, i = 0;
 
-	for (ch = 0; ch < priv->stats_nch; ch++) {
+	for (ch = 0; ch < nch; ch++) {
 		void *buf = data + i;
 
 		if (WARN_ON_ONCE(buf +
@@ -50,8 +51,9 @@ static void mlx5e_hv_vhca_fill_stats(struct mlx5e_priv *priv, void *data,
 
 static int mlx5e_hv_vhca_stats_buf_size(struct mlx5e_priv *priv)
 {
-	return (sizeof(struct mlx5e_hv_vhca_per_ring_stats) *
-		priv->stats_nch);
+	u16 nch = mlx5e_stats_nch_read(priv);
+
+	return sizeof(struct mlx5e_hv_vhca_per_ring_stats) * nch;
 }
 
 static int mlx5e_hv_vhca_stats_buf_max_size(struct mlx5e_priv *priv)
@@ -106,7 +108,7 @@ static void mlx5e_hv_vhca_stats_control(struct mlx5_hv_vhca_agent *agent,
 	sagent = &priv->stats_agent;
 
 	block->version = MLX5_HV_VHCA_STATS_VERSION;
-	block->rings   = priv->stats_nch;
+	block->rings   = mlx5e_stats_nch_read(priv);
 
 	if (!block->command) {
 		cancel_delayed_work_sync(&priv->stats_agent.work);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8f2b3abe0092..94e5352a246c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -2773,7 +2773,7 @@ static int mlx5e_channel_stats_alloc(struct mlx5e_priv *priv, int ix, int cpu)
 						GFP_KERNEL, cpu_to_node(cpu));
 	if (!priv->channel_stats[ix])
 		return -ENOMEM;
-	priv->stats_nch++;
+	mlx5e_stats_nch_write(priv, priv->stats_nch + 1);
 
 	return 0;
 }
@@ -4043,9 +4043,10 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 
 void mlx5e_fold_sw_stats64(struct mlx5e_priv *priv, struct rtnl_link_stats64 *s)
 {
+	u16 nch = mlx5e_stats_nch_read(priv);
 	int i;
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats = priv->channel_stats[i];
 		struct mlx5e_rq_stats *xskrq_stats = &channel_stats->xskrq;
 		struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
@@ -5489,7 +5490,7 @@ static void mlx5e_get_queue_stats_rx(struct net_device *dev, int i,
 	struct mlx5e_rq_stats *xskrq_stats;
 	struct mlx5e_rq_stats *rq_stats;
 
-	if (mlx5e_is_uplink_rep(priv) || !priv->stats_nch)
+	if (mlx5e_is_uplink_rep(priv) || !mlx5e_stats_nch_read(priv))
 		return;
 
 	channel_stats = priv->channel_stats[i];
@@ -5508,7 +5509,7 @@ static void mlx5e_get_queue_stats_tx(struct net_device *dev, int i,
 	struct mlx5e_priv *priv = netdev_priv(dev);
 	struct mlx5e_sq_stats *sq_stats;
 
-	if (!priv->stats_nch)
+	if (!mlx5e_stats_nch_read(priv))
 		return;
 
 	/* no special case needed for ptp htb etc since txq2sq_stats is kept up
@@ -5525,6 +5526,7 @@ static void mlx5e_get_base_stats(struct net_device *dev,
 				 struct netdev_queue_stats_tx *tx)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
+	u16 nch = mlx5e_stats_nch_read(priv);
 	struct mlx5e_ptp *ptp_channel;
 	int i, tc;
 
@@ -5533,7 +5535,7 @@ static void mlx5e_get_base_stats(struct net_device *dev,
 		rx->bytes = 0;
 		rx->alloc_fail = 0;
 
-		for (i = priv->channels.params.num_channels; i < priv->stats_nch; i++) {
+		for (i = priv->channels.params.num_channels; i < nch; i++) {
 			struct netdev_queue_stats_rx rx_i = {0};
 
 			mlx5e_get_queue_stats_rx(dev, i, &rx_i);
@@ -5558,7 +5560,7 @@ static void mlx5e_get_base_stats(struct net_device *dev,
 	tx->packets = 0;
 	tx->bytes = 0;
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats = priv->channel_stats[i];
 
 		/* handle two cases:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1a3ecf073913..8632b73179cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -516,6 +516,7 @@ static void mlx5e_stats_update_stats_rq_page_pool(struct mlx5e_channel *c)
 static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 {
 	struct mlx5e_sw_stats *s = &priv->stats.sw;
+	u16 nch = mlx5e_stats_nch_read(priv);
 	int i;
 
 	memset(s, 0, sizeof(*s));
@@ -523,7 +524,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(sw)
 	for (i = 0; i < priv->channels.num; i++) /* for active channels only */
 		mlx5e_stats_update_stats_rq_page_pool(priv->channels.c[i]);
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats =
 			priv->channel_stats[i];
 
@@ -2615,7 +2616,7 @@ static MLX5E_DECLARE_STATS_GRP_OP_UPDATE_STATS(ptp) { return; }
 
 static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(channels)
 {
-	int max_nch = priv->stats_nch;
+	int max_nch = mlx5e_stats_nch_read(priv);
 
 	return (NUM_RQ_STATS * max_nch) +
 	       (NUM_CH_STATS * max_nch) +
@@ -2628,8 +2629,8 @@ static MLX5E_DECLARE_STATS_GRP_OP_NUM_STATS(channels)
 
 static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(channels)
 {
+	int max_nch = mlx5e_stats_nch_read(priv);
 	bool is_xsk = priv->xsk.ever_used;
-	int max_nch = priv->stats_nch;
 	int i, j, tc;
 
 	for (i = 0; i < max_nch; i++)
@@ -2661,8 +2662,8 @@ static MLX5E_DECLARE_STATS_GRP_OP_FILL_STRS(channels)
 
 static MLX5E_DECLARE_STATS_GRP_OP_FILL_STATS(channels)
 {
+	int max_nch = mlx5e_stats_nch_read(priv);
 	bool is_xsk = priv->xsk.ever_used;
-	int max_nch = priv->stats_nch;
 	int i, j, tc;
 
 	for (i = 0; i < max_nch; i++)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index 0a6003fe60e9..674bed721e63 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -135,10 +135,11 @@ void mlx5i_cleanup(struct mlx5e_priv *priv)
 
 static void mlx5i_grp_sw_update_stats(struct mlx5e_priv *priv)
 {
+	u16 nch = mlx5e_stats_nch_read(priv);
 	struct rtnl_link_stats64 s = {};
 	int i, j;
 
-	for (i = 0; i < priv->stats_nch; i++) {
+	for (i = 0; i < nch; i++) {
 		struct mlx5e_channel_stats *channel_stats;
 		struct mlx5e_rq_stats *rq_stats;
 
-- 
2.44.0



* Re: [PATCH net v2] tipc: fix use-after-free of the discoverer in tipc_disc_rcv()
From: Weiming Shi @ 2026-06-17 14:05 UTC
  To: Tung Quang Nguyen
  Cc: jmaloy@redhat.com, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, horms@kernel.org, davem@davemloft.net,
	xmei5@asu.edu, netdev@vger.kernel.org,
	tipc-discussion@lists.sourceforge.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <GV1P189MB1988614C09148258A9844D0CC6E42@GV1P189MB1988.EURP189.PROD.OUTLOOK.COM>

Tung Quang Nguyen <tung.quang.nguyen@est.tech> wrote on Wed, Jun 17, 2026 at 16:47:
>
> >Subject: [PATCH net v2] tipc: fix use-after-free of the discoverer in
> >tipc_disc_rcv()
> >
> >bearer_disable() frees b->disc with tipc_disc_delete()'s plain kfree(), but
> >tipc_disc_rcv() still dereferences b->disc in RX softirq under
> >rcu_read_lock() (tipc_udp_recv -> tipc_rcv -> tipc_disc_rcv).
> >
> >L2 bearers are safe thanks to the synchronize_net() in tipc_disable_l2_media(),
> >but the UDP bearer defers that call to the
> >cleanup_bearer() workqueue, so the discoverer is freed with no grace
> >period:
> >
> > BUG: KASAN: slab-use-after-free in tipc_disc_rcv (net/tipc/discover.c:149)
> > Read of size 8 at addr ffff88802348b728 by task poc_tipc/184
> > <IRQ>
> >  tipc_disc_rcv (net/tipc/discover.c:149)
> >  tipc_rcv (net/tipc/node.c:2126)
> >  tipc_udp_recv (net/tipc/udp_media.c:391)
> >  udp_rcv (net/ipv4/udp.c:2643)
> >  ip_local_deliver_finish (net/ipv4/ip_input.c:241)
> > </IRQ>
> > Freed by task 181:
> >  kfree (mm/slub.c:6565)
> >  bearer_disable (net/tipc/bearer.c:418)
> >  tipc_nl_bearer_disable (net/tipc/bearer.c:1001)
> >
> >The bearer is freed with kfree_rcu(); free the discoverer the same way.
> >Add an rcu_head to struct tipc_discoverer and free it and its skb from an RCU
> >callback.
> >
> >Because the RCU callback (tipc_disc_free_rcu) lives in module text, a
> >call_rcu() that is still pending when the tipc module is unloaded would invoke a
> >freed function. Add an rcu_barrier() to tipc_exit() after the bearer subsystem
> >has been torn down, so all pending discoverer callbacks have run before the
> >module text goes away.
> >
> >Reachable from an unprivileged user namespace: the TIPCv2 genl family is
> >netnsok and its bearer commands have no GENL_ADMIN_PERM. Needs
> >CONFIG_TIPC and CONFIG_TIPC_MEDIA_UDP.
> >
> >Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
> >Reported-by: Xiang Mei <xmei5@asu.edu>
> >Assisted-by: Claude:claude-opus-4-8
> >Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> >---
> >v2:
> > - split the over-80-column container_of() line (Tung Quang Nguyen)
> > - add rcu_barrier() to tipc_exit() so a pending call_rcu() cannot fire
> >   into freed module text after rmmod (Eric Dumazet)
> >
> > net/tipc/core.c     |  3 +++
> > net/tipc/discover.c | 14 ++++++++++++--
> > 2 files changed, 15 insertions(+), 2 deletions(-)
> >
> >diff --git a/net/tipc/core.c b/net/tipc/core.c
> >index 434e70eabe08..747328e58d30 100644
> >--- a/net/tipc/core.c
> >+++ b/net/tipc/core.c
> >@@ -218,6 +218,9 @@ static void __exit tipc_exit(void)
> >       unregister_pernet_device(&tipc_net_ops);
> >       tipc_unregister_sysctl();
> >
> >+      /* Wait for tipc_disc_free_rcu() callbacks queued from module text. */
>
> Please change the above comment to: /* TODO: Wait for all timers that called call_rcu() to finish before calling rcu_barrier() */
>
> Note that call_rcu() is used in both discover.c and node.c, so the TODO comment will help us add more checking code later in a separate patch.
>
> >+      rcu_barrier();
> >+
> >       pr_info("Deactivated\n");
> > }
> >
> >diff --git a/net/tipc/discover.c b/net/tipc/discover.c
> >index 3e54d2df5683..696b7a8ed54d 100644
> >--- a/net/tipc/discover.c
> >+++ b/net/tipc/discover.c
> >@@ -58,6 +58,7 @@
> >  * @skb: request message to be (repeatedly) sent
> >  * @timer: timer governing period between requests
> >  * @timer_intv: current interval between requests (in ms)
> >+ * @rcu: RCU head for deferred freeing
> >  */
> > struct tipc_discoverer {
> >       u32 bearer_id;
> >@@ -69,6 +70,7 @@ struct tipc_discoverer {
> >       struct sk_buff *skb;
> >       struct timer_list timer;
> >       unsigned long timer_intv;
> >+      struct rcu_head rcu;
> > };
> >
> > /**
> >@@ -382,6 +384,15 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
> >       return 0;
> > }
> >
> >+static void tipc_disc_free_rcu(struct rcu_head *rp)
> >+{
> >+      struct tipc_discoverer *d =
> >+              container_of(rp, struct tipc_discoverer, rcu);
> >+
> >+      kfree_skb(d->skb);
> >+      kfree(d);
> >+}
> >+
> > /**
> >  * tipc_disc_delete - destroy object sending periodic link setup requests
> >  * @d: ptr to link dest structure
> >@@ -389,8 +400,7 @@ int tipc_disc_create(struct net *net, struct tipc_bearer *b,
> > void tipc_disc_delete(struct tipc_discoverer *d)
> > {
> >       timer_shutdown_sync(&d->timer);
> >-      kfree_skb(d->skb);
> >-      kfree(d);
> >+      call_rcu(&d->rcu, tipc_disc_free_rcu);
> > }
> >
> > /**
> >--
> >2.43.0
>

Hi
Sent v3. Thanks for the review.

Best,
Weiming Shi
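
[Editor's sketch, not part of the thread.] The deferred-free pattern in the patch above can be modelled in userspace roughly as follows. Everything with a _sim suffix is a hypothetical stand-in: this is not the kernel's RCU, there are no readers or real grace periods, and the "skb" is just a heap buffer. It only illustrates that the object stays valid until the queued callback runs, and that a barrier must drain the queue before the callback's code can go away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins only; NOT the kernel's RCU implementation. */
struct rcu_head_sim {
	struct rcu_head_sim *next;
	void (*func)(struct rcu_head_sim *head);
};

#define container_of_sim(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct rcu_head_sim *pending;	/* callbacks not yet run */
static int freed_count;			/* how many objects were freed */

/* Queue a callback instead of freeing the object immediately. */
static void call_rcu_sim(struct rcu_head_sim *head,
			 void (*func)(struct rcu_head_sim *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/*
 * Drain the queue. The real rcu_barrier() waits for already-queued
 * callbacks to finish; this model simply runs them in place.
 */
static void rcu_barrier_sim(void)
{
	while (pending) {
		struct rcu_head_sim *head = pending;

		pending = head->next;
		head->func(head);
	}
}

struct discoverer_sim {
	int bearer_id;
	char *skb;			/* stands in for the request sk_buff */
	struct rcu_head_sim rcu;	/* embedded head, as in the patch */
};

/* Callback: recover the enclosing object from its embedded head. */
static void disc_free_cb_sim(struct rcu_head_sim *rp)
{
	struct discoverer_sim *d =
		container_of_sim(rp, struct discoverer_sim, rcu);

	free(d->skb);
	free(d);
	freed_count++;
}

static struct discoverer_sim *disc_create_sim(int bearer_id)
{
	struct discoverer_sim *d = calloc(1, sizeof(*d));

	assert(d);
	d->bearer_id = bearer_id;
	d->skb = malloc(16);
	assert(d->skb);
	return d;
}

/* Delete path: defer the free rather than calling free() directly. */
static void disc_delete_sim(struct discoverer_sim *d)
{
	call_rcu_sim(&d->rcu, disc_free_cb_sim);
}
```

In this model, an object passed to disc_delete_sim() remains readable until rcu_barrier_sim() is called, which mirrors why the patch pairs call_rcu() in tipc_disc_delete() with rcu_barrier() in tipc_exit().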

^ permalink raw reply

* Re: [PATCH net] net: rnpgbe: fix mailbox endianness handling
From: Yibo Dong @ 2026-06-17 14:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, vadim.fedorenko,
	netdev, linux-kernel, yaojun
In-Reply-To: <26517b8f-33b7-4de7-8fe8-c7dca5fa7a4b@lunn.ch>

On Wed, Jun 17, 2026 at 02:09:00PM +0200, Andrew Lunn wrote:
> > My understanding is as follows:
> > The firmware structures are defined with __le16 / __le32 for wire format,
> > but the original code cast these struct pointers to u32 * before passing
> > them to the mailbox read/write routines:
> > - Send path: (u32 *)&req -> msg buffer -> writel()
> > - Receive path: readl() -> msg buffer -> (u32 *)&reply
> > Sparse only sees pure u32 = u32 assignments here, so no type mismatch is
> > reported.
> 
> Can the code be changed so that it does not need the cast? Casts are
> bad, as you have just shown. This is something i try to push back on,
> it makes you think about types and avoid issues like this.
> 
> 	Andrew
> 
Thinking... Yes. A few possibilities:

1. Make all fields __le32, then extract via shifts:
   struct mbx_fw_cmd_req {
       __le32 word0;  // [15:0]=flags  [31:16]=opcode
       __le32 word1;  // [15:0]=datalen [31:16]=ret_value
       ...
   };
   But that's painful — le32_to_cpu(req.word0) >> 16 vs req.opcode.

2. Use a union to keep named fields while also exposing __le32[] access:
   union mbx_fw_cmd_req_u {
       struct mbx_fw_cmd_req req;
       __le32 dwords[sizeof(struct mbx_fw_cmd_req) / sizeof(__le32)];
   };
   union mbx_fw_cmd_reply_u {
       struct mbx_fw_cmd_reply reply;
       __le32 dwords[sizeof(struct mbx_fw_cmd_reply) / sizeof(__le32)];
   };

   The transport interface becomes:
   int mucse_write_mbx_pf(struct mucse_hw *hw, const __le32 *msg, u16 size);
   int mucse_read_mbx_pf(struct mucse_hw *hw, __le32 *msg, u16 size);

   Callers would use:
   union mbx_fw_cmd_req_u cmd = {};
   cmd.req.opcode = cpu_to_le16(...);
   cmd.req.flags  = cpu_to_le16(...);
   mucse_write_mbx_pf(hw, cmd.dwords, sizeof(cmd.req));

   If the transport layer forgets le32_to_cpu(), sparse would catch it
   because msg is __le32 * and mbx_data_rd32() returns u32.

   The downside is an extra union wrapper and an extra level in field
   access (cmd.req.opcode vs req.opcode) — a minor inconvenience.

Do you have a preference between these, or another approach?

Thanks for the feedback.
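
[Editor's sketch, not part of the thread.] Option 2 can be approximated in portable userspace C as below. The field names follow the comments in option 1 (flags, opcode, datalen, ret_value), but the exact layout is an assumption, as are all *_sim names. Plain uint16_t/uint32_t stand in for __le16/__le32, so the sparse type checking that motivates the design is not reproduced here; the sketch only shows that the dword view, once converted with an le32_to_cpu() equivalent, yields the same register value on any host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire layout: four 16-bit fields stored little-endian. */
struct fw_cmd_req_sim {
	uint16_t flags;
	uint16_t opcode;
	uint16_t datalen;
	uint16_t ret_value;
};

/* Named fields plus a dword view of the same bytes, without casts. */
union fw_cmd_req_u_sim {
	struct fw_cmd_req_sim req;
	uint32_t dwords[sizeof(struct fw_cmd_req_sim) / sizeof(uint32_t)];
};

/* Portable stand-in for cpu_to_le16(): store v in little-endian order. */
static uint16_t cpu_to_le16_sim(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Portable stand-in for le32_to_cpu(): decode little-endian bytes. */
static uint32_t le32_to_cpu_sim(uint32_t v)
{
	uint8_t b[4];

	memcpy(b, &v, sizeof(b));
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Mock transport: regs[] stands in for the mailbox registers. Each
 * little-endian dword is converted to CPU order before the "write",
 * which is the conversion a writel()-based routine would need.
 */
static void write_mbx_sim(uint32_t *regs, const uint32_t *msg_le,
			  size_t size)
{
	size_t i;

	for (i = 0; i < size / sizeof(uint32_t); i++)
		regs[i] = le32_to_cpu_sim(msg_le[i]);
}
```

With flags = 0x0001 and opcode = 0x0502, the first register value comes out as 0x05020001, matching the "[15:0]=flags [31:16]=opcode" packing described in option 1.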

^ permalink raw reply

* Re: [PATCH bpf v3 2/2] selftests/bpf: Cover partial copy of non-linear test_run output
From: sun jian @ 2026-06-17 14:19 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, netdev, linux-kselftest, linux-kernel, ast, daniel, andrii,
	martin.lau, eddyz87, memxor, song, yonghong.song, jolsa, davem,
	edumazet, kuba, pabeni, horms, shuah, hawk, john.fastabend, sdf,
	toke, lorenzo, paul.chaignon, martin.lau, clm, ihor.solodrai
In-Reply-To: <2dad9b5c184fa101d7cffa1d8f5eea5b5df60f53533d98c68175c9e3ec5ee6ac@mail.kernel.org>

On Wed, Jun 17, 2026 at 6:31 PM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > index 01f1d1b6715a..9cc898e6a9f7 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/prog_run_opts.c
> > @@ -4,6 +4,10 @@
> >
> >  #include "test_pkt_access.skel.h"
> >
> > +#define NONLINEAR_PKT_LEN 9000
> > +#define NONLINEAR_LINEAR_DATA_LEN 64
> > +#define SHORT_OUT_LEN 100
> > +
>
> [ ... ]
>
> > @@ -20,6 +24,69 @@ static void check_run_cnt(int prog_fd, __u64 run_cnt)
> >             "incorrect number of repetitions, want %llu have %llu\n", run_cnt, info.run_cnt);
> >  }
> >
> > +static void init_pkt(__u8 *pkt, size_t len)
> > +{
> > +     size_t i;
> > +
> > +     for (i = 0; i < len; i++)
> > +             pkt[i] = i & 0xff;
> > +}
>
> A question was raised on v2 about whether pkt_v4 could be reused by
> reducing the linear area to ETH_HLEN, rather than introducing a custom
> init_pkt() with a 9000-byte stack packet.
>
> Can't we reuse pkt_v4 here by reducing the linear area to ETH_HLEN?
> The v3 still adds init_pkt() and the NONLINEAR_PKT_LEN packet, so this
> doesn't seem to have been picked up.
>
> > +
> > +static void test_skb_nonlinear_data_out_partial(struct test_pkt_access *skel)
> > +{
> > +     LIBBPF_OPTS(bpf_test_run_opts, topts);
> > +     __u8 pkt[NONLINEAR_PKT_LEN];
> > +     __u8 out[SHORT_OUT_LEN];
> > +     struct __sk_buff skb = {};
> > +     int prog_fd, err;
> > +
> > +     init_pkt(pkt, sizeof(pkt));
> > +
> > +     skb.data_end = NONLINEAR_LINEAR_DATA_LEN;
> > +
> > +     topts.data_in = pkt;
> > +     topts.data_size_in = sizeof(pkt);
> > +     topts.data_out = out;
> > +     topts.data_size_out = sizeof(out);
> > +     topts.ctx_in = &skb;
> > +     topts.ctx_size_in = sizeof(skb);
> > +
> > +     prog_fd = bpf_program__fd(skel->progs.tc_pass_prog);
>
> [ ... ]
>
> > diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > index bce7173152c6..cd284401eebd 100644
> > --- a/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > +++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c
> > @@ -150,3 +150,15 @@ int test_pkt_access(struct __sk_buff *skb)
> >
> >       return TC_ACT_UNSPEC;
> >  }
> > +
> > +SEC("tc")
> > +int tc_pass_prog(struct __sk_buff *skb)
> > +{
> > +     return TC_ACT_OK;
> > +}
> > +
> > +SEC("xdp.frags")
> > +int xdp_frags_pass_prog(struct xdp_md *ctx)
> > +{
> > +     return XDP_PASS;
> > +}
>
> A related suggestion on v2 was that, once pkt_v4 is reused, the existing
> BPF program could be reused instead of adding new pass-through programs.
>
> Could tc_pass_prog and xdp_frags_pass_prog be dropped in favour of the
> existing program? The v3 still adds both of these, so this point also
> seems to be open.
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27680511802

Hi,

Thanks for checking this.

I tried reusing pkt_v4 and the existing TC program, but they do not fit
the skb case this test is trying to cover.

For skb test_run, IPv4/IPv6 inputs with a too-short L3 header in the
linear area are rejected before bpf_test_finish(). With pkt_v4 and a
linear area of ETH_HLEN, the test fails with -EINVAL before reaching the
partial copy-out path. If the linear area is increased enough to pass the
IPv4 check, pkt_v4 is too small to both trigger the old
copy_size - frag_size path and verify that the copied prefix spans the
linear data and the first fragment. pkt_v6 has the same issue: after
making the IPv6 header linear, only 20 bytes remain in frags.

The existing test_pkt_access program has its own packet-access coverage
goals and is not just a pass-through carrier. With such a short linear
area or small packet fixture, it can fail before the test reaches
bpf_test_finish()'s partial copy-out path. A pass-through TC program is
therefore a better fit, because it keeps the test focused on the
bpf_test_finish() copy-out semantics.

For XDP, this object does not have an existing xdp.frags pass-through
program, so the small XDP frags program is needed to cover the other
caller of the shared bpf_test_finish() path.

Thanks,
Sun Jian
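
[Editor's sketch, not part of the thread.] The property the test is after, a short output buffer receiving a prefix that spans the linear area and the first fragment, can be modelled in userspace as below. copy_out_prefix_sim() is a hypothetical helper and not the kernel's bpf_test_finish(); the sizes in the usage note simply reuse the constants from the quoted patch (9000-byte packet, 64-byte linear area, 100-byte output).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy at most out_len bytes of a packet that is split into a linear
 * area followed by one fragment. Returns the number of bytes copied.
 */
static size_t copy_out_prefix_sim(const uint8_t *linear, size_t linear_len,
				  const uint8_t *frag, size_t frag_len,
				  uint8_t *out, size_t out_len)
{
	size_t total = linear_len + frag_len;
	size_t copy = out_len < total ? out_len : total;
	size_t from_linear = copy < linear_len ? copy : linear_len;

	assert(linear && frag && out);
	/* First the part that lives in the linear area ... */
	memcpy(out, linear, from_linear);
	/* ... then whatever is still needed from the fragment. */
	memcpy(out + from_linear, frag, copy - from_linear);
	return copy;
}
```

With a packet filled as pkt[i] = i & 0xff, a 64-byte linear area and a 100-byte output buffer, the copied prefix takes 64 bytes from the linear area and 36 from the fragment, so every out[i] should equal i & 0xff.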

^ permalink raw reply

* Re: Landlock: LANDLOCK_ACCESS_NET_CONNECT_TCP bypass via TCP Fast Open
From: Mickaël Salaün @ 2026-06-17 14:22 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Günther Noack, Matthieu Buffet, Paul Moore, Eric Dumazet,
	Neal Cardwell, linux-security-module, netdev, linux-kernel,
	Mikhail Ivanov
In-Reply-To: <20260616201615.275032-1-hexlabsecurity@proton.me>

Hi,

Thanks for the report.  This was previously identified by Mikhail and
Matthieu, see the related issue:
https://github.com/landlock-lsm/linux/issues/41


On Tue, Jun 16, 2026 at 08:16:22PM +0000, Bryam Vargas wrote:
> Hello Mickaël, and Landlock folks,
> 
> A task confined by a Landlock ruleset that handles
> LANDLOCK_ACCESS_NET_CONNECT_TCP and is denied connecting to a given port can
> still establish a TCP connection to that port by using TCP Fast Open, i.e.
> sendto(fd, ..., MSG_FASTOPEN, &dst, dstlen) on a fresh stream socket. The
> network-egress confinement for TCP connect is silently bypassed.
> 
> Affected
> --------
> Any kernel with CONFIG_SECURITY_LANDLOCK=y and Landlock enabled that supports
> the TCP network access rights (Landlock ABI >= 4, since Linux 6.7). Confirmed by
> source inspection on mainline (v7.1-rc7) and reproduced on Linux 7.0.11
> (Landlock ABI 8). No CONFIG beyond Landlock + IPv4/IPv6 TCP; TCP Fast Open client
> is enabled by the per-netns default (net.ipv4.tcp_fastopen has TFO_CLIENT_ENABLE
> set), so no sysctl change and no setsockopt are required.
> 
> Root cause
> ----------
> LANDLOCK_ACCESS_NET_CONNECT_TCP is enforced only by the socket_connect LSM hook
> (hook_socket_connect -> current_check_access_socket). security_socket_connect()
> has exactly one call site in the tree, net/socket.c (the connect(2) syscall).
> 
> TCP Fast Open performs an implicit connect inside sendmsg:
> 
>   tcp_sendmsg_locked()            net/ipv4/tcp.c  (MSG_FASTOPEN branch)
>    -> tcp_sendmsg_fastopen()      net/ipv4/tcp.c
>    -> __inet_stream_connect(..., is_sendmsg=1)  net/ipv4/af_inet.c
>    -> sk->sk_prot->connect()      net/ipv4/af_inet.c  -> tcp_v4_connect()
> 
> This path establishes the connection to the address taken from msg_name but
> never calls security_socket_connect(). The only LSM hook fired on the sendmsg
> path is security_socket_sendmsg(), and Landlock registers no socket_sendmsg
> hook, so LANDLOCK_ACCESS_NET_CONNECT_TCP is never re-checked. __inet_stream_connect()
> itself carries no LSM hook (only the cgroup-BPF pre_connect, a different
> mechanism).
> 
> Notably the kernel already mediates the analogous AF_UNIX implicit-connect on the
> send path via the unix_may_send hook, which Landlock does register
> (hook_unix_may_send) -- so the sendmsg-implies-connect pattern is recognized, but
> the TCP Fast Open case has no equivalent coverage. The MPTCP fast-open path
> (mptcp_sendmsg_fastopen -> __inet_stream_connect) is a second producer of the
> same unmediated connect (by source inspection; not separately reproduced).
> 
> Reproducer
> ----------
> A self-contained, fully unprivileged PoC is available on request. It forks an
> unconfined TFO-capable loopback listener, then in a child applies a Landlock
> ruleset handling LANDLOCK_ACCESS_NET_CONNECT_TCP with no allow rule
> (landlock_create_ruleset() with handled_access_net =
> LANDLOCK_ACCESS_NET_CONNECT_TCP, no landlock_add_rule(), then
> landlock_restrict_self(); every TCP connect is denied) and tries the forbidden
> port two ways:
> 
>   (1) connect(fd, &dst)                 -> -EACCES   (Landlock enforces CONNECT_TCP)
>   (2) sendto(fd2, buf, len, MSG_FASTOPEN, &dst, dstlen)
>                                         -> succeeds; the listener accepts the
>                                            connection and reads the payload.
> 
> Observed on Linux 7.0.11 (Landlock ABI 8):
> 
>   [1] connect(2)            -> ret=-1 errno=13 (Permission denied)
>   [2] sendto(MSG_FASTOPEN)  -> ret=14 errno=0 (OK/queued)
>   [+] listener ACCEPTED the confined child's connection; payload="..."
> 
> connect(2) to the port is denied while sendto(MSG_FASTOPEN) reaches the identical
> port and delivers data.
> 
> Impact
> ------
> A sandbox that uses LANDLOCK_ACCESS_NET_CONNECT_TCP to restrict outbound TCP
> (e.g. to keep a confined component from reaching an internal service or a
> metadata endpoint) can be escaped by an unprivileged, self-confined task with no
> CAP and no namespace transition -- for any destination port, since the
> implicit-connect path never consults the connect hook regardless of address (the
> run above shows one port). It is an integrity
> bypass of the network-confinement property; no memory safety is involved.
> I score it CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:N/I:H/A:N (6.5 Medium) -- the
> confined task escapes the policy authority that defined its sandbox, a scope
> change; 5.5 if you treat the Landlock boundary as the same authority (S:U).
> 
> Note on the in-flight UDP series
> --------------------------------
> The "landlock: Add UDP access control support" series (v5, Matthieu Buffet,
> https://lore.kernel.org/r/20260611162107.49278-3-matthieu@buffet.re) adds a
> socket_sendmsg hook, hook_socket_sendmsg(), but it returns 0 for non-UDP
> sockets:
> 
>     if (sk_is_udp(sock->sk))
>             access_request = LANDLOCK_ACCESS_NET_CONNECT_SEND_UDP;
>     else
>             return 0;
> 
> so a TCP socket using MSG_FASTOPEN still bypasses LANDLOCK_ACCESS_NET_CONNECT_TCP
> even after that series lands. It may be most convenient to fix this there.
> 
> Suggested direction
> -------------------
> Re-check LANDLOCK_ACCESS_NET_CONNECT_TCP on the implicit-connect path: either have
> the socket_sendmsg hook evaluate CONNECT_TCP for stream sockets when the call
> performs an implicit connect (mirroring the AF_UNIX unix_may_send handling), or
> place the check inside __inet_stream_connect() so a single chokepoint covers
> connect(2), TCP Fast Open, and the MPTCP fast-open sibling.
> 
> I am happy to send a patch for this if you would like me to.

Yes please.

> 
> Best regards,
> 
> Bryam Vargas
> Independent security researcher, HEXLAB S.A.S., Cali, Colombia
> hexlabsecurity@proton.me
> 
> 

^ permalink raw reply

* Re: [PATCH] net/sched: dualpi2: fix GSO backlog accounting
From: Jamal Hadi Salim @ 2026-06-17 14:23 UTC (permalink / raw)
  To: Xingquan Liu
  Cc: netdev, Jiri Pirko, Victor Nogueira, stable,
	Chia-Yu Chang (Nokia)
In-Reply-To: <CAM0EoM=o+kBQNND8ViMe8bZQmFAtATav+CFMmtp1udzu+tpTzA@mail.gmail.com>

On Wed, Jun 17, 2026 at 6:23 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Jun 16, 2026 at 6:03 PM Xingquan Liu <b1n@b1n.io> wrote:
> >
> > When DualPI2 splits a GSO skb into N segments, it propagates N
> > additional packets to its parent before returning NET_XMIT_SUCCESS.
> > The parent then accounts for the original skb once more, leaving its
> > qlen one larger than the number of packets actually queued.
> >
> > With QFQ as the parent, after all real packets are dequeued, QFQ still
> > has a non-zero qlen while its in-service aggregate has no active
> > classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
> > the result to qfq_peek_skb(), causing a NULL pointer dereference.
> >
> > Count only successfully queued segments and propagate the difference
> > between the original skb and those segments. Return success whenever
> > at least one segment was queued.
> >
> > Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Xingquan Liu <b1n@b1n.io>
> > ---
> >  net/sched/sch_dualpi2.c | 11 +++++------
> >  1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> > index dfec3c99eb45..37d6a8960310 100644
> > --- a/net/sched/sch_dualpi2.c
> > +++ b/net/sched/sch_dualpi2.c
> > @@ -461,7 +461,7 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >                 if (IS_ERR_OR_NULL(nskb))
> >                         return qdisc_drop(skb, sch, to_free);
> >
> > -               cnt = 1;
> > +               cnt = 0;
> >                 byte_len = 0;
> >                 orig_len = qdisc_pkt_len(skb);
> >                 skb_list_walk_safe(nskb, nskb, next) {
> > @@ -488,16 +488,15 @@ static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >                                 byte_len += nskb->len;
> >                         }
> >                 }
> > -               if (cnt > 1) {
> > +               if (cnt > 0) {
> >                         /* The caller will add the original skb stats to its
> >                          * backlog, compensate this if any nskb is enqueued.
> >                          */
> > -                       --cnt;
> > -                       byte_len -= orig_len;
> > +                       qdisc_tree_reduce_backlog(sch, 1 - cnt,
> > +                                                 orig_len - byte_len);
> >                 }
> > -               qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
> >                 consume_skb(skb);
> > -               return err;
> > +               return cnt > 0 ? NET_XMIT_SUCCESS : err;
> >         }
>
> This looks like a behavior change?
> E.g. if the last segment failed, you will return NET_XMIT_SUCCESS, whereas
> before it could be __NET_XMIT_BYPASS, NET_XMIT_CN, etc.
> I am not sure what the best answer is, and maybe it doesn't matter. Did
> you look at what other qdiscs do? I don't have time right now but will
> later, or you can before I get to it.
> Also, you didn't add the owner of this qdisc to your To: list; maybe he
> has some thoughts.
>

After looking at what other qdiscs do, your patch is fine. But please
fix up the commit message to something like:

---
When DualPI2 splits a GSO skb into N segments, it propagates N
additional packets to its parent before returning NET_XMIT_SUCCESS.
The parent then accounts for the original skb once more, leaving its
qlen one larger than the number of packets actually queued.

With QFQ as the parent, after all real packets are dequeued, QFQ still
has a non-zero qlen while its in-service aggregate has no active
classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
the result to qfq_peek_skb(), causing a NULL pointer dereference.

Follow the same pattern used by tbf_segment() and taprio: count only
successfully queued segments, propagate the difference between the
original skb and those segments, and return NET_XMIT_SUCCESS whenever
at least one segment was queued.

Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Xingquan Liu <b1n@b1n.io>
-----

Do you know how to create a tdc test that will recreate this? If not
either Victor or myself can help you create one.

cheers,
jamal

> cheers,
> jamal
>
>
> >         return dualpi2_enqueue_skb(skb, sch, to_free);
> >  }
> >
> > base-commit: fbc6a80cb5d3fd4ac4b56e8c9d791dd17be890c4
> > --
> > Xingquan Liu
> >
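
[Editor's sketch, not part of the thread.] The accounting in the patch can be checked with a small userspace model. All names ending in _sim are hypothetical; reduce_backlog_sim() only mimics the effect of qdisc_tree_reduce_backlog() on a single parent, and the parent is assumed to add the original skb to its counters exactly once when the child reports success.

```c
#include <assert.h>

struct parent_sim {
	int qlen;	/* packets the parent believes are queued below it */
	int backlog;	/* bytes the parent believes are queued below it */
};

/* Models the effect of qdisc_tree_reduce_backlog() on one parent. */
static void reduce_backlog_sim(struct parent_sim *p, int n, int len)
{
	p->qlen -= n;
	p->backlog -= len;
}

/*
 * Child enqueue of a GSO skb split into nsegs segments. seg_ok[i] is
 * nonzero if segment i was queued. Returns 1 if success is reported.
 */
static int child_enqueue_gso_sim(struct parent_sim *p, int orig_len,
				 const int *seg_len, const int *seg_ok,
				 int nsegs)
{
	int cnt = 0, byte_len = 0, i;

	for (i = 0; i < nsegs; i++) {
		if (seg_ok[i]) {
			cnt++;
			byte_len += seg_len[i];
		}
	}
	/* Pre-compensate for the parent counting the original skb once. */
	if (cnt > 0)
		reduce_backlog_sim(p, 1 - cnt, orig_len - byte_len);
	return cnt > 0;
}

/* Parent side: on success, account for the original skb once. */
static void parent_enqueue_sim(struct parent_sim *p, int orig_len,
			       const int *seg_len, const int *seg_ok,
			       int nsegs)
{
	if (child_enqueue_gso_sim(p, orig_len, seg_len, seg_ok, nsegs)) {
		p->qlen++;
		p->backlog += orig_len;
	}
}
```

For three 100-byte segments from a 280-byte original, the parent ends at qlen 3 and backlog 300 when all are queued, qlen 2 and backlog 200 when one is dropped, and is left untouched when none are queued. In each case the parent's counters equal what is actually queued.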

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH iwl-next v1] ixgbe: Implement PCI reset handler
From: Temerkhanov, Sergey @ 2026-06-17 14:36 UTC (permalink / raw)
  To: Paul Menzel
  Cc: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	Loktionov, Aleksandr, Bjorn Helgaas, linux-pci@vger.kernel.org
In-Reply-To: <bd5ab9e3-ab93-43ad-a2ce-03d56e2d2ecf@molgen.mpg.de>



> -----Original Message-----
> From: Paul Menzel <pmenzel@molgen.mpg.de>
> Sent: Wednesday, June 17, 2026 11:03 AM
> To: Temerkhanov, Sergey <sergey.temerkhanov@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org; netdev@vger.kernel.org; Loktionov,
> Aleksandr <aleksandr.loktionov@intel.com>; Bjorn Helgaas
> <bhelgaas@google.com>; linux-pci@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH iwl-next v1] ixgbe: Implement PCI reset
> handler
> 
> [Cc: +Aleksandr (as in Reviewed-by:), +PCI subsystem]
> 
> Dear Sergey,
> 
> 
> Thank you for your patch.
> 
> Am 17.06.26 um 10:43 schrieb Sergey Temerkhanov:
> > Implement PCI device reset handler to allow the network device to get
> > re-initialized and function after a PCI-level reset.
> 
> Please describe the problem in more detail. When does PCI-level reset occur,
> and what is the current problematic situation?

The scenario is a reset invoked via sysfs while the adapter is in operation: the adapter
does not properly restore its state afterwards, which results in a TX queue timeout.

> 
> Also, what is ixgbe specific compared to a general PCIe implementation?
> 
> Please share details how to test it, and how you tested it.

The test is simple:
- Run traffic on the adapter
- Initiate reset via sysfs
- Observe TX queue WDT timeout w/o this change

> > +#define IXGBE_PCIE_RESET_RETRIES 1000
> 
> Why 1000? Isn’t there a generic PCIe macro? Please extend the commit
> message.

This is going to be replaced in v2

> > +	if (test_bit(__IXGBE_SERVICE_INITED, &adapter->state)) {
> > +		timer_delete_sync(&adapter->service_timer);
> > +		/* Prevent the service task from running while we're resetting. */
> 
> One of the two comments seems redundant.

I am clarifying this in v2. Essentially, the timer callback may queue the work item, and
cancel_work_sync() cancels any instance that may already be pending.

> 
> > +		cancel_work_sync(&adapter->service_task);
> > +	}
> progress\n");
> 
> How can this happen? What should the user reading this error do?

Under normal circumstances we should never get here. I am adding a comment in v2.

> >   static DEFINE_SIMPLE_DEV_PM_OPS(ixgbe_pm_ops, ixgbe_suspend,
> > ixgbe_resume);
> 
> Kind regards,
> 
> Paul

Regards,
Sergey

^ permalink raw reply

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
From: Breno Leitao @ 2026-06-17 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Petr Mladek, Jakub Kicinski, Sebastian Andrzej Siewior,
	John Ogness, Sergey Senozhatsky, Vlad Poenaru, Thomas Gleixner,
	netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak
In-Reply-To: <20260617111958.GL49951@noisy.programming.kicks-ass.net>

On Wed, Jun 17, 2026 at 01:19:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 17, 2026 at 12:37:30PM +0200, Petr Mladek wrote:
> > On Tue 2026-06-16 14:17:19, Jakub Kicinski wrote:
> > > On Tue, 16 Jun 2026 19:02:57 +0200 Peter Zijlstra wrote:
> > > > > So this is not an issue since commit 7eab73b18630e ("netconsole: convert
> > > > > to NBCON console infrastructure"), because from there on writes are
> > > > > deferred to the nbcon thread. So this is purely about -stable in this case.
> > > > 
> > > > Hmm, I thought netconsole had some reserved skbs and could do writes
> > > > 'atomic' like? That said, it was 2.6 era the last time I looked at
> > > > netconsole.
> > > 
> > > Yes, that part is fine. The problem is that netconsole tries
> > > to reap Tx completions if the Tx queue is full. We can't call
> > > skb destructor in irq context so we put the completed skbs on
> > > a queue and try to arm softirq to get to them later.
> > > Arming softirq causes a ksoftirq wake up.
> > > 
> > > We already skip the completion polling if we detect getting called
> > > from the same networking driver. It's best effort, anyway.
> > > Networking-side fix would be to toss another OR condition into
> > > the skip. But we don't have one that'd work cleanly :S
> > 
> > Alternative solution might be to offload the ksoftirq wake up
> > to an irq_work. It might make this part safe for the
> > console->write_atomic() call.
> > 
> > Well, my understanding is that there are more problems.
> > AFAIK, some drivers do not use an IRQ safe locking, see
> > https://lore.kernel.org/all/oth5t27z6acp7qxut7u45ekyil7djirg2ny3bnsvnzeqasavxb@nhwdxahvcosh/
> 
> But anything using locking is not ->write_atomic() and should be driven
> from a kthread, no?

Good point. If that's the case, netconsole might never be able to drop
CON_NBCON_ATOMIC_UNSAFE for any network-based console driver at all.

As far as I can tell, there isn't a network driver today whose transmit
path is completely lockless, so that would hold even if we made netpoll
itself lockless.

It's unlikely any NIC will ever achieve this, given that NIC TX
fundamentally relies on a shared DMA ring and doorbell register, which
inherently cannot be made lockless.

So, is it correct to state that CON_NBCON_ATOMIC_UNSAFE will be part of
netconsole forever-ish?

^ permalink raw reply

* Re: [PATCH iproute2-next v4] ip/bond: add lacp_strict support
From: Stephen Hemminger @ 2026-06-17 15:01 UTC (permalink / raw)
  To: Louis Scalbert
  Cc: netdev, andrew+netdev, jv, edumazet, kuba, pabeni, fbl, andy,
	shemminger, maheshb, jonas.gorski, horms
In-Reply-To: <20260617130314.3893243-1-louis.scalbert@6wind.com>

On Wed, 17 Jun 2026 15:03:14 +0200
Louis Scalbert <louis.scalbert@6wind.com> wrote:

> +		} else if (matches(*argv, "lacp_strict") == 0) {
> +			NEXT_ARG();
> +			if (get_index(lacp_strict_tbl, *argv) < 0)
> +				invarg("invalid lacp_strict", *argv);
> +
> +			lacp_strict = get_index(lacp_strict_tbl, *argv);
> +			addattr8(n, 1024, IFLA_BOND_LACP_STRICT, lacp_strict);
>  		} else if (matches(*argv, "tlb_dynamic_lb") == 0) {
>  			NEXT_ARG();
>  			if (get_u8(&tlb_dynamic_lb, *argv, 0)) {

Why not use parse_on_off like other code in this file?


> @@ -642,6 +658,15 @@ static void bond_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
>  			   "all_slaves_active %u ",
>  			   rta_getattr_u8(tb[IFLA_BOND_ALL_SLAVES_ACTIVE]));
>  
> +	if (tb[IFLA_BOND_LACP_STRICT]) {
> +		__u8 lacp_strict = rta_getattr_u8(tb[IFLA_BOND_LACP_STRICT]);
> +		print_string(PRINT_FP,
> +			     "lacp_strict",
> +			     "lacp_strict %s ",
> +			     get_name(lacp_strict_tbl, lacp_strict));
> +		print_bool(PRINT_JSON, "lacp_strict", NULL, lacp_strict);
> +	}
> +

Why not use print_on_off like the other options?

^ permalink raw reply

* Re: [PATCH net] dpaa2-switch: fix VLAN upper check not rejecting bridge join
From: Ioana Ciornei @ 2026-06-17 15:01 UTC (permalink / raw)
  To: Simon Horman
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, netdev, f.fainelli,
	vladimir.oltean, linux-kernel
In-Reply-To: <20260617114518.857486-1-horms@kernel.org>

On Wed, Jun 17, 2026 at 12:45:18PM +0100, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
> ---
> dpaa2-switch: fix VLAN upper check not rejecting bridge join
> 
> The blamed commit refactored the prechangeupper event handling but
> failed to actually return an error in case
> dpaa2_switch_prevent_bridging_with_8021q_upper() detected a 802.1q upper
> on a port which tries to join a bridge. Fix this by returning err
> instead of 0.
> 
> > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
> > index 52c1cb9cb7e02..46ae81c2fa013 100644
> > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
> > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
> > @@ -2177,7 +2177,7 @@ dpaa2_switch_prechangeupper_sanity_checks(struct net_device *netdev,
> >  	if (err) {
> >  		NL_SET_ERR_MSG_MOD(extack,
> >  				   "Cannot join a bridge while VLAN uppers are present");
> > -		return 0;
> > +		return err;
> 
> [Severity: High]
> If a user adds an 802.1q VLAN upper device to a port after it has already
> successfully joined a bridge, won't this check trigger when the port leaves
> the bridge?
> 
> In drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c,
> dpaa2_switch_prechangeupper_sanity_checks() is called for both bridge joins
> and unlinks from dpaa2_switch_port_prechangeupper():
> 
>     err = dpaa2_switch_prechangeupper_sanity_checks(netdev, upper_dev, extack);
>     if (err)
>         return err;
> 
>     if (!info->linking)
>         dpaa2_switch_port_pre_bridge_leave(netdev);
> 
> Since returning an error aborts the notifier chain, an unlink notification
> would fail this check, preventing dpaa2_switch_port_pre_bridge_leave() from
> running. Does this lead to state inconsistency and skipped switchdev
> unoffloads?
> 
> Could this check be guarded so it only runs when info->linking is true?

Valid point, but the fix should be to prevent adding VLAN uppers on
bridged dpaa2-switch ports, since the driver does not support that use case
at all.


^ permalink raw reply

* [net PATCH v2] octeontx2-af: Validate NIX maximum LFs correctly
From: Subbaraya Sundeep @ 2026-06-17 15:22 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, sgoutham, gakula,
	bbhushan2, rkannoth
  Cc: netdev, linux-kernel, Subbaraya Sundeep

The maximum number of NIX LFs can be set via a devlink command,
but only before any LFs are assigned to a PF/VF.
The condition used to check whether any LFs are assigned is
incorrect: it tests for assigned mcam entries instead of assigned
NIX LFs. This patch fixes that condition.

Fixes: dd7842878633 ("octeontx2-af: Add new devlink param to configure maximum usable NIX block LFs")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
---
 .../marvell/octeontx2/af/rvu_devlink.c        | 27 +++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
index 6494a9ee2f0d..3b47ecb44d51 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -1510,7 +1510,9 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
 	struct rvu_devlink *rvu_dl = devlink_priv(devlink);
 	struct rvu *rvu = rvu_dl->rvu;
 	u16 max_nix0_lf, max_nix1_lf;
-	struct npc_mcam *mcam;
+	struct rvu_block *block;
+	int blkaddr = 0;
+	int free_lfs;
 	u64 cfg;
 
 	cfg = rvu_read64(rvu, BLKADDR_NIX0, NIX_AF_CONST2);
@@ -1518,14 +1520,23 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
 	cfg = rvu_read64(rvu, BLKADDR_NIX1, NIX_AF_CONST2);
 	max_nix1_lf = cfg & 0xFFF;
 
-	/* Do not allow user to modify maximum NIX LFs while mcam entries
-	 * have already been assigned.
+	/* Do not allow user to modify maximum NIX LFs while NIX LFs
+	 * have already been assigned. Note that modifying NIX LFs count
+	 * can be done only before any LF attach requests from PFs and VFs
+	 * and not later or concurrently.
 	 */
-	mcam = &rvu->hw->mcam;
-	if (mcam->bmap_fcnt < mcam->bmap_entries) {
-		NL_SET_ERR_MSG_MOD(extack,
-				   "mcam entries have already been assigned, can't resize");
-		return -EPERM;
+	blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
+	while (blkaddr) {
+		block = &rvu->hw->block[blkaddr];
+
+		free_lfs = rvu_rsrc_free_count(&block->lf);
+		if (free_lfs != block->lf.max) {
+			NL_SET_ERR_MSG_MOD(extack,
+					   "NIX LFs already assigned, can't resize");
+			return -EPERM;
+		}
+
+		blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
 	}
 
 	if (max_nix0_lf && val.vu16 > max_nix0_lf) {
-- 
2.48.1


^ permalink raw reply related

* [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-17 15:28 UTC (permalink / raw)
  To: D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
	linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu,
	runyu.xiao, stable

smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
listen socket's sk_data_ready callback.  smc_clcsock_data_ready() then
immediately takes sk_callback_lock before looking up the SMC listener and
queuing smc_tcp_listen_work().

That is unsafe once the TCP listen socket is leaving TCP_LISTEN.  The TCP
close/flush path can run the installed sk_data_ready callback with
sk_callback_lock already held, so entering smc_clcsock_data_ready() again
tries to take the same rwlock recursively in the same thread.  The nvmet
TCP listener had to make the same state check before taking
sk_callback_lock for this reason.

This issue was found by our static analysis tool and then manually
reviewed against the current tree.

The grounded PoC kept the SMC listen callback installation path:

  smc_listen()
  smc_clcsock_replace_cb()
  sk_data_ready = smc_clcsock_data_ready()

It then modeled the close/flush carrier that invokes the installed
sk_data_ready callback while sk_callback_lock is already held.  Lockdep
reported the same-thread recursive acquisition:

  WARNING: possible recursive locking detected
  smc_clcsock_data_ready+0xa/0x4d [vuln_msv]
  smc_close_flush_work+0x1f/0x30 [vuln_msv]
  *** DEADLOCK ***

Return before taking sk_callback_lock when the underlying TCP socket is no
longer in TCP_LISTEN.  In that state there is no listen accept work to
queue for SMC, and avoiding the callback lock mirrors the fix used by the
TCP nvmet listener.

Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
Cc: stable@vger.kernel.org
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
---
 net/smc/af_smc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 6421c2e1c84d..1af4e3c333ff 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
 {
 	struct smc_sock *lsmc;
 
+	if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)
+		return;
+
 	read_lock_bh(&listen_clcsock->sk_callback_lock);
 	lsmc = smc_clcsock_user_data(listen_clcsock);
 	if (!lsmc)
-- 
2.34.1
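
The recursion described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_sock`, `model_data_ready()` and `model_close_flush()` are hypothetical names, the rwlock is reduced to a "held" flag, and the close path is assumed to have moved the socket out of the listen state before invoking the callback. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for TCP_LISTEN and a non-listening state. */
enum model_state { MODEL_LISTEN, MODEL_CLOSE };

struct model_sock {
	enum model_state state;
	bool callback_lock_held; /* models sk_callback_lock as non-recursive */
	int queued_work;         /* counts queued "listen work" items */
};

/*
 * Model of the patched callback: check the state first, take the lock
 * only while still listening. Returns false if the lock would have been
 * acquired recursively (the deadlock lockdep warns about).
 */
bool model_data_ready(struct model_sock *sk)
{
	assert(sk != NULL);

	if (sk->state != MODEL_LISTEN)
		return true; /* nothing to queue, lock never touched */

	if (sk->callback_lock_held)
		return false; /* same-thread recursive acquisition */

	sk->callback_lock_held = true;
	sk->queued_work++;
	sk->callback_lock_held = false;
	return true;
}

/* Assumed close/flush carrier: runs the callback with the lock held. */
bool model_close_flush(struct model_sock *sk)
{
	bool ok;

	sk->state = MODEL_CLOSE;
	sk->callback_lock_held = true;
	ok = model_data_ready(sk);
	sk->callback_lock_held = false;
	return ok;
}
```

In this model, removing the state check makes `model_close_flush()` report the recursive acquisition, which is the situation the early return is meant to avoid.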


^ permalink raw reply related

* Re: [net PATCH v2] octeontx2-af: Validate NIX maximum LFs correctly
From: Subbaraya Sundeep @ 2026-06-17 15:31 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, sgoutham, gakula,
	bbhushan2, rkannoth
  Cc: netdev, linux-kernel
In-Reply-To: <1781709750-23218-1-git-send-email-sbhatta@marvell.com>

Missed the changelog. Will resend.

Thanks,
Sundeep

pw-bot: changes-requested

On 2026-06-17 at 20:52:30, Subbaraya Sundeep (sbhatta@marvell.com) wrote:
> The maximum number of NIX LFs can be set via a devlink command,
> but only before any LFs have been assigned to a PF/VF.
> The condition used to check whether any LFs are assigned is
> incorrect. This patch fixes that condition.
> 
> Fixes: dd7842878633 ("octeontx2-af: Add new devlink param to configure maximum usable NIX block LFs")
> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
> ---
>  .../marvell/octeontx2/af/rvu_devlink.c        | 27 +++++++++++++------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
> index 6494a9ee2f0d..3b47ecb44d51 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
> @@ -1510,7 +1510,9 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
>  	struct rvu_devlink *rvu_dl = devlink_priv(devlink);
>  	struct rvu *rvu = rvu_dl->rvu;
>  	u16 max_nix0_lf, max_nix1_lf;
> -	struct npc_mcam *mcam;
> +	struct rvu_block *block;
> +	int blkaddr = 0;
> +	int free_lfs;
>  	u64 cfg;
>  
>  	cfg = rvu_read64(rvu, BLKADDR_NIX0, NIX_AF_CONST2);
> @@ -1518,14 +1520,23 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
>  	cfg = rvu_read64(rvu, BLKADDR_NIX1, NIX_AF_CONST2);
>  	max_nix1_lf = cfg & 0xFFF;
>  
> -	/* Do not allow user to modify maximum NIX LFs while mcam entries
> -	 * have already been assigned.
> +	/* Do not allow user to modify maximum NIX LFs while NIX LFs
> +	 * have already been assigned. Note that modifying NIX LFs count
> +	 * can be done only before any LF attach requests from PFs and VFs
> +	 * and not later or concurrently.
>  	 */
> -	mcam = &rvu->hw->mcam;
> -	if (mcam->bmap_fcnt < mcam->bmap_entries) {
> -		NL_SET_ERR_MSG_MOD(extack,
> -				   "mcam entries have already been assigned, can't resize");
> -		return -EPERM;
> +	blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
> +	while (blkaddr) {
> +		block = &rvu->hw->block[blkaddr];
> +
> +		free_lfs = rvu_rsrc_free_count(&block->lf);
> +		if (free_lfs != block->lf.max) {
> +			NL_SET_ERR_MSG_MOD(extack,
> +					   "NIX LFs already assigned, can't resize");
> +			return -EPERM;
> +		}
> +
> +		blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
>  	}
>  
>  	if (max_nix0_lf && val.vu16 > max_nix0_lf) {
> -- 
> 2.48.1
> 

^ permalink raw reply

* Re: [PATCH RESEND 1/6] sock: add sock_kzalloc helper
From: Thorsten Blum @ 2026-06-17 15:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Simon Horman, linux-crypto,
	linux-kernel, netdev
In-Reply-To: <20260615091555.4af017aa@kernel.org>

On Mon, Jun 15, 2026 at 09:15:55AM -0700, Jakub Kicinski wrote:
> On Sun, 14 Jun 2026 17:32:12 +0200 Thorsten Blum wrote:
> > Gentle ping? Patch 1/6 still needs an ack from netdev maintainers.
> 
> Perhaps other maintainers shared my feeling that this is a waste of
> time.

Could you elaborate on why sock_kzfree_s() is okay, but sock_kzalloc()
is not? Both are small, socket-specific zeroing helpers.

sock_kzalloc() has the same number of call sites as sock_kzfree_s(), and
it could also be used in net/ipv6/exthdrs.c in ipv6_renew_options().

Thanks,
Thorsten

^ permalink raw reply

* [net PATCH v3] octeontx2-af: Validate NIX maximum LFs correctly
From: Subbaraya Sundeep @ 2026-06-17 15:40 UTC (permalink / raw)
  To: andrew+netdev, davem, edumazet, kuba, pabeni, sgoutham, gakula,
	bbhushan2, rkannoth
  Cc: netdev, linux-kernel, Subbaraya Sundeep

The maximum number of NIX LFs can be set via a devlink command,
but only before any LFs have been assigned to a PF/VF.
The condition used to check whether any LFs are assigned is
incorrect. This patch fixes that condition.

Fixes: dd7842878633 ("octeontx2-af: Add new devlink param to configure maximum usable NIX block LFs")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
---
v3 changes:
	None, updated changelog
v2 changes:
	Addressed AI review comments by updating the error message
	Updated the comment to mention that modifying NIX LFs has to be
	done before attaching NIX LFs to any PFs/VFs.

 .../marvell/octeontx2/af/rvu_devlink.c        | 27 +++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
index 6494a9ee2f0d..3b47ecb44d51 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_devlink.c
@@ -1510,7 +1510,9 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
 	struct rvu_devlink *rvu_dl = devlink_priv(devlink);
 	struct rvu *rvu = rvu_dl->rvu;
 	u16 max_nix0_lf, max_nix1_lf;
-	struct npc_mcam *mcam;
+	struct rvu_block *block;
+	int blkaddr = 0;
+	int free_lfs;
 	u64 cfg;
 
 	cfg = rvu_read64(rvu, BLKADDR_NIX0, NIX_AF_CONST2);
@@ -1518,14 +1520,23 @@ static int rvu_af_dl_nix_maxlf_validate(struct devlink *devlink, u32 id,
 	cfg = rvu_read64(rvu, BLKADDR_NIX1, NIX_AF_CONST2);
 	max_nix1_lf = cfg & 0xFFF;
 
-	/* Do not allow user to modify maximum NIX LFs while mcam entries
-	 * have already been assigned.
+	/* Do not allow user to modify maximum NIX LFs while NIX LFs
+	 * have already been assigned. Note that modifying NIX LFs count
+	 * can be done only before any LF attach requests from PFs and VFs
+	 * and not later or concurrently.
 	 */
-	mcam = &rvu->hw->mcam;
-	if (mcam->bmap_fcnt < mcam->bmap_entries) {
-		NL_SET_ERR_MSG_MOD(extack,
-				   "mcam entries have already been assigned, can't resize");
-		return -EPERM;
+	blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
+	while (blkaddr) {
+		block = &rvu->hw->block[blkaddr];
+
+		free_lfs = rvu_rsrc_free_count(&block->lf);
+		if (free_lfs != block->lf.max) {
+			NL_SET_ERR_MSG_MOD(extack,
+					   "NIX LFs already assigned, can't resize");
+			return -EPERM;
+		}
+
+		blkaddr = rvu_get_next_nix_blkaddr(rvu, blkaddr);
 	}
 
 	if (max_nix0_lf && val.vu16 > max_nix0_lf) {
-- 
2.48.1
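
The corrected condition can be summarised with a small userspace sketch: resizing is refused as soon as any NIX block has fewer free LFs than its maximum. The names `lf_rsrc`, `lf_free_count()` and `nix_maxlf_resize_allowed()` are hypothetical, and the per-block bitmap is simplified to two counters, so this only illustrates the logic and is not the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block's LF resource map. */
struct lf_rsrc {
	unsigned int max;    /* total LFs the block provides */
	unsigned int in_use; /* LFs currently attached to PFs/VFs */
};

/* Free LFs in one block; plays the role of the free-count helper. */
unsigned int lf_free_count(const struct lf_rsrc *r)
{
	assert(r->in_use <= r->max);
	return r->max - r->in_use;
}

/*
 * Resizing is allowed only while every NIX block still has all of its
 * LFs free, i.e. no LF has been attached anywhere yet.
 */
bool nix_maxlf_resize_allowed(const struct lf_rsrc *blocks, size_t nblocks)
{
	for (size_t i = 0; i < nblocks; i++) {
		if (lf_free_count(&blocks[i]) != blocks[i].max)
			return false;
	}
	return true;
}
```

The previous check looked at MCAM entry usage instead, which says nothing about whether any NIX LF has been attached.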


^ permalink raw reply related

* Re: [Bug] incompatibility between 'e1000e' and Aruba AOS-CX switches (too small inter-packet gap)
From: Philippe Andersson @ 2026-06-17 15:41 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Ludovic Calmant, Fabian Noël
In-Reply-To: <bdf0a4b8-e414-4df7-833c-3051987c3f2b@lunn.ch>


[-- Attachment #1.1.1: Type: text/plain, Size: 3063 bytes --]

On 16/06/2026 21:34, Andrew Lunn wrote:
>> A support ticket has already been opened with Aruba, but it's unclear at
>> this stage that the problem is on their side.
> 
> How easy is it to reproduce?
Provided you have the required hardware (a PC with a NIC using the
'e1000e' driver and an Aruba AOS-CX switch such as the HPE/Aruba CX 5420
or HPE/Aruba CX 6200M; perhaps others, we only tested with those),
reproducing the issue is easy: run 'iperf3' for a few minutes, with the
'e1000e' PC acting as the iperf server. You will get a cluster of
retransmits at the start of the test, and you may get further clusters
at random intervals.

Here is an 'iperf3' output that shows the problem after only 10 seconds.

-------------------------<cut>----------------------------
Connecting to host 10.1.1.21, port 5201
[  5] local 10.1.1.61 port 55096 connected to 10.1.1.21 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   110 MBytes   923 Mbits/sec   70   1.49 MBytes
[  5]   1.00-2.00   sec   106 MBytes   891 Mbits/sec    0   1.63 MBytes
[  5]   2.00-3.00   sec   106 MBytes   891 Mbits/sec    0   1.74 MBytes
[  5]   3.00-4.00   sec   108 MBytes   902 Mbits/sec    0   1.83 MBytes
[  5]   4.00-5.00   sec   106 MBytes   891 Mbits/sec    3   1.32 MBytes
[  5]   5.00-6.00   sec   106 MBytes   891 Mbits/sec    0   1.42 MBytes
[  5]   6.00-7.00   sec   108 MBytes   902 Mbits/sec    0   1.49 MBytes
[  5]   7.00-8.00   sec   106 MBytes   891 Mbits/sec    0   1.54 MBytes
[  5]   8.00-9.00   sec   106 MBytes   891 Mbits/sec    0   1.57 MBytes
[  5]   9.00-10.00  sec   108 MBytes   902 Mbits/sec    0   1.59 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.04 GBytes   898 Mbits/sec   73             sender
[  5]   0.00-10.02  sec  1.04 GBytes   894 Mbits/sec                  receiver
-------------------------<cut>----------------------------

But you need to do this in parallel with "background network traffic" on
the PC (such as NFS). We have not yet been able to characterise the
minimum amount of traffic necessary for the problem to manifest, but we
will try to do so.

> Can you run a git bisect from the last
> known good kernel version to the first known bad version?
Not really, as there is no "known good kernel". All kernels are good 
ones, as long as older ProCurve switches are used (e.g. HPE/Aruba 
ProCurve 5406R or HPE/Aruba ProCurve 2930M-48G-PoE+). The problem only 
shows when AOS-CX switches are used, and we only started deploying those 
in production a couple of months ago.

What I can tell you is that the problem is still present in 
6.12.90+deb13.1-amd64. This is the most recent kernel we tested.

HTH

Ph. A.

-- 

*Philippe Andersson*
Unix System Administrator
IBA Particle Therapy |
Tel: +32-10-475.983
Fax: +32-10-487.707
eMail: pan@iba-group.com
<http://www.iba-worldwide.com>




^ permalink raw reply

* Re: [Bug] incompatibility between 'e1000e' and Aruba AOS-CX switches (too small inter-packet gap)
From: Andrew Lunn @ 2026-06-17 16:07 UTC (permalink / raw)
  To: Philippe Andersson; +Cc: netdev, Ludovic Calmant, Fabian Noël
In-Reply-To: <3b073409-9096-4467-9baf-11196a70c014@iba-group.com>

> > Can you run a git bisect from the last
> > known good kernel version to the first known bad version?
> Not really, as there is no "known good kernel". All kernels are good ones,
> as long as older ProCurve switches are used (e.g. HPE/Aruba ProCurve 5406R
> or HPE/Aruba ProCurve 2930M-48G-PoE+). The problem only shows when AOS-CX
> switches are used, and we only started deploying those in production a
> couple of months ago.

I was hoping you could narrow it down to one patch which caused the
issue. But if it never worked...

Then I really think you need to be talking to both vendors and try to
get them to work together to identify the problem. Maybe send a NIC to
the switch vendor, so they have all the hardware, etc.

    Andrew

^ permalink raw reply

* Re: [Bug] incompatibility between 'e1000e' and Aruba AOS-CX switches (too small inter-packet gap)
From: Philippe Andersson @ 2026-06-17 16:18 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Ludovic Calmant, Fabian Noël
In-Reply-To: <ee384978-4eca-4a49-b1e7-55be7698970f@lunn.ch>


[-- Attachment #1.1.1: Type: text/plain, Size: 1362 bytes --]

On 17/06/2026 18:07, Andrew Lunn wrote:
>>> Can you run a git bisect from the last
>>> known good kernel version to the first known bad version?
>> Not really, as there is no "known good kernel". All kernels are good ones,
>> as long as older ProCurve switches are used (e.g. HPE/Aruba ProCurve 5406R
>> or HPE/Aruba ProCurve 2930M-48G-PoE+). The problem only shows when AOS-CX
>> switches are used, and we only started deploying those in production a
>> couple of months ago.
> 
> I was hoping you could narrow it down to one patch which caused the
> issue. But if it never worked...
Sorry to disappoint ;-)

> Then I really think you need to be talking to both vendors and try to
> get them to work together to identify the problem.
This is precisely why I contacted this mailing list. My understanding 
was that this was the proper way to report a potential bug in the 
'e1000e' driver now that it has been incorporated in the Linux kernel 
tree, as per the maintainers listed in the code. But if you know 
otherwise, please tell me.

> Maybe send a NIC to
> the switch vendor, so they have all the hardware, etc.
This is already ongoing.

Ph. A.

-- 

*Philippe Andersson*
Unix System Administrator
IBA Particle Therapy |
Tel: +32-10-475.983
Fax: +32-10-487.707
eMail: pan@iba-group.com
<http://www.iba-worldwide.com>




^ permalink raw reply

* Re: [PATCH net] net/mlx5e: Use sender devcom for MPV master-up
From: manjunath.b.patil @ 2026-06-17 16:28 UTC (permalink / raw)
  To: Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky, netdev
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Patrisious Haddad, linux-rdma, linux-kernel, stable
In-Reply-To: <20260610173915.4053423-1-manjunath.b.patil@oracle.com>


On 6/10/26 10:39 AM, Manjunath Patil wrote:
> After PCIe DPC recovery, mlx5 reloads the affected functions and
> replays multiport affiliation events. In the reported failure, the
> first relevant device error was:
> 
>    pcieport 0000:10:01.1: DPC: containment event
>    pcieport 0000:10:01.1: PCIe Bus Error: severity=Uncorrected (Fatal)
>    pcieport 0000:10:01.1:    [ 5] SDES                   (First)
> 
> mlx5 recovered the PCI functions and resumed 0000:11:00.1. During
> that resume, RDMA multiport binding replayed
> MLX5_DRIVER_EVENT_AFFILIATION_DONE and mlx5e sent
> MPV_DEVCOM_MASTER_UP. The host then panicked with:
> 
>    BUG: kernel NULL pointer dereference, address: 0000000000000010
>    RIP: mlx5_devcom_comp_set_ready+0x5/0x40 [mlx5_core]
>    RDI: 0000000000000000
> 
> Call trace included:
> 
>    mlx5_devcom_comp_set_ready
>    mlx5e_devcom_event_mpv
>    mlx5_devcom_send_event
>    mlx5_ib_bind_slave_port
>    mlx5r_mp_probe
>    mlx5_pci_resume
> 
> MPV devcom registration publishes mlx5e private data to the component
> peer list before mlx5e_devcom_init_mpv() stores the returned component
> device in priv->devcom. A concurrent master-up event can therefore
> reach a peer whose private data is visible but whose priv->devcom
> backpointer is still NULL.
> 
> MPV_DEVCOM_MASTER_UP already carries the sender/master mlx5e private
> data as event_data. The ready bit is stored on the shared devcom
> component, not on an individual peer. Use the sender devcom when
> marking the MPV component ready.
> 
> This preserves the readiness transition while avoiding a NULL
> dereference of the peer devcom pointer during affiliation replay after
> PCI error recovery.
> 
> Fixes: bf11485f8419 ("net/mlx5: Register mlx5e priv to devcom in MPV mode")
> Assisted-by: Codex:gpt-5
> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
> Cc: stable@vger.kernel.org # 6.7+
> ---
Ping!

-Manjunath
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 8f2b3abe0092..f7ff20b97e8c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -211,11 +211,14 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
>   
>   static int mlx5e_devcom_event_mpv(int event, void *my_data, void *event_data)
>   {
> -	struct mlx5e_priv *slave_priv = my_data;
> +	struct mlx5e_priv *master_priv = event_data;
>   
>   	switch (event) {
>   	case MPV_DEVCOM_MASTER_UP:
> -		mlx5_devcom_comp_set_ready(slave_priv->devcom, true);
> +		if (!master_priv || !master_priv->devcom)
> +			return -EINVAL;
> +
> +		mlx5_devcom_comp_set_ready(master_priv->devcom, true);
>   		break;
>   	case MPV_DEVCOM_MASTER_DOWN:
>   		/* no need for comp set ready false since we unregister after
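
For readers following the reasoning in the quoted commit message, here is a minimal userspace sketch of the idea: the ready bit lives on the shared component, so the handler can reach it through the sender's handle carried in event_data instead of the receiving peer's handle, which may still be NULL. All `model_*` names are hypothetical stand-ins and do not correspond to the mlx5 data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_comp { bool ready; };                  /* shared component */
struct model_devcom { struct model_comp *comp; };   /* per-device handle */
struct model_priv { struct model_devcom *devcom; }; /* NULL until registered */

enum { MODEL_MASTER_UP = 1 };

/*
 * Model of the patched handler: my_data is the receiving peer, whose
 * devcom pointer may not be set yet; event_data is the sender.
 */
int model_event_mpv(int event, void *my_data, void *event_data)
{
	struct model_priv *master = event_data;

	(void)my_data; /* deliberately unused: peer handle may be NULL */

	switch (event) {
	case MODEL_MASTER_UP:
		if (!master || !master->devcom)
			return -EINVAL;
		assert(master->devcom->comp != NULL);
		/* Mark the shared component ready via the sender's handle. */
		master->devcom->comp->ready = true;
		break;
	default:
		break;
	}
	return 0;
}
```

With a peer whose handle is still NULL, the event still marks the shared component ready, which is the readiness transition the commit message says is preserved.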


^ permalink raw reply

* Re: [PATCH v4 1/3] dt-bindings: net: add Realtek RTL8125 PCIe Ethernet
From: Heiner Kallweit @ 2026-06-17 16:43 UTC (permalink / raw)
  To: ricardo, nic_swsd, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Heiko Stuebner
  Cc: Sebastian Reichel, netdev, devicetree, linux-kernel,
	linux-arm-kernel, linux-rockchip
In-Reply-To: <20260617-rk3588-dts-rtl-eth-describe-dt-alias-v4-1-2bd38922d129@pardini.net>

On 17.06.2026 14:58, Ricardo Pardini via B4 Relay wrote:
> From: Ricardo Pardini <ricardo@pardini.net>
> 
> Add a binding for fixed/soldered Realtek RTL8125 PCIe Ethernet
> controller.
> 
> The "pciVVVV,DDDD" compatibles are the Open Firmware PCI Bus Binding
> spelling, auto-derived from PCI-SIG vendor/device IDs, but they still
> need a binding when used in a board DT - analogous to "usbVVVV,PPPP"
> compatibles documented in their own bindings (e.g. microchip,lan95xx)
> so board DTs attaching properties (fixed MAC, nvmem cell, ...) to
> these PCI function nodes can be validated.
> 
> Suggested-by: Sebastian Reichel <sebastian.reichel@collabora.com>
> Signed-off-by: Ricardo Pardini <ricardo@pardini.net>
> ---
>  .../devicetree/bindings/net/realtek,rtl8125.yaml   | 43 ++++++++++++++++++++++
>  MAINTAINERS                                        |  1 +
>  2 files changed, 44 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/realtek,rtl8125.yaml b/Documentation/devicetree/bindings/net/realtek,rtl8125.yaml
> new file mode 100644
> index 0000000000000..eee13fbc1e6a6
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/realtek,rtl8125.yaml
> @@ -0,0 +1,43 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/net/realtek,rtl8125.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Realtek RTL8125 2.5 Gigabit PCIe Ethernet Controller
> +
> +maintainers:
> +  - Heiner Kallweit <hkallweit1@gmail.com>
> +
> +description:
> +  The Realtek RTL8125 is a 2.5GBASE-T Ethernet controller with a PCIe host
> +  interface.
> +
> +allOf:
> +  - $ref: ethernet-controller.yaml#
> +
> +properties:
> +  compatible:
> +    const: pci10ec,8125

IIRC we came to the conclusion that the compatible string isn't used in the
relevant code path. Then why add it here? Has this been agreed on?
If it should be added here, then an explanatory comment would be helpful.

> +
> +  reg:
> +    maxItems: 1
> +
> +required:
> +  - compatible
> +  - reg
> +
> +unevaluatedProperties: false
> +
> +examples:
> +  - |
> +    pcie {
> +        #address-cells = <3>;
> +        #size-cells = <2>;
> +
> +        ethernet@0,0 {
> +            compatible = "pci10ec,8125";
> +            reg = <0x10000 0 0 0 0>;
> +            local-mac-address = [00 00 00 00 00 00];
> +        };
> +    };
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c8d4b913f26c1..e5fbd82946aec 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -134,6 +134,7 @@ M:	Heiner Kallweit <hkallweit1@gmail.com>
>  M:	nic_swsd@realtek.com
>  L:	netdev@vger.kernel.org
>  S:	Maintained
> +F:	Documentation/devicetree/bindings/net/realtek,rtl8125.yaml
>  F:	drivers/net/ethernet/realtek/r8169*
>  
>  8250/16?50 (AND CLONE UARTS) SERIAL DRIVER
> 


^ permalink raw reply

