Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
* [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
@ 2026-06-25 17:40 Nabil S. Alramli
  2026-06-25 17:40 ` Nabil S. Alramli
  0 siblings, 1 reply; 5+ messages in thread
From: Nabil S. Alramli @ 2026-06-25 17:40 UTC (permalink / raw)
  To: saeedm, tariqt, mbloch, dtatulea
  Cc: dev, nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel

Hello mlx5 experts,

We have been experiencing frequent WARNINGs in the mlx5 driver on frag page
release and we think it could possibly be caused by a bug in mlx5. Could
you please review the attached patch and provide us your guidance on
whether or not our investigation and assumptions are valid, and if so,
would it be possible to incorporate this fix into your next release?

Best Regards,

Nabil S. Alramli (1):
  net/mlx5: RX, Fix refcount warning on frag page release

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 39 ++++++++++---------
 3 files changed, 22 insertions(+), 21 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
  2026-06-25 17:40 [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release Nabil S. Alramli
@ 2026-06-25 17:40 ` Nabil S. Alramli
  2026-06-26 13:12   ` Dragos Tatulea
  0 siblings, 1 reply; 5+ messages in thread
From: Nabil S. Alramli @ 2026-06-25 17:40 UTC (permalink / raw)
  To: saeedm, tariqt, mbloch, dtatulea
  Cc: dev, nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel

Under memory pressure, mlx5 driver has WARNING during fragmented page
release. This happens because there is a discrepency between what mlx5
thinks the page fragment counter is vs what the page_pool actually says it
is.

The cause of the issue is page allocations on concurrent cpus, which
increment the non-atomic u16 page counter mlx5e_frag_page.frags, while at
the same time the page reference counter net_iov.pp_ref_count is atomically
incremented. That sometimes leads to a difference in the counts and
therefore triggers the warning in page_pool_unref_netmem:

```
	ret = atomic_long_sub_return(nr, pp_ref_count);
	WARN_ON(ret < 0);
```

The actual stack trace looks like this:

```
WARNING: CPU: 37 PID: 447795 at include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
Hardware name: *
RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
RSP: 0018:ffffc90019814d98 EFLAGS: 00010293
RAX: 000000000000003f RBX: ffff88c0993d0a10 RCX: ffffea02424592c0
RDX: 0000000000000001 RSI: ffffea02424592c0 RDI: ffff88c090e20000
RBP: 000000000000000a R08: 0000000000001409 R09: 0000000000000006
R10: 0000000000000000 R11: ffff88c095fbc040 R12: 000000000000141f
R13: 0000000000000009 R14: ffff88c090e20000 R15: 0000000000000001
FS:  00007f34149fa6c0(0000) GS:ffff89200fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ed0265eb000 CR3: 0000005091cbe000 CR4: 0000000000350ef0
Call Trace:
 <IRQ>
 mlx5e_free_rx_wqes+0x7b/0xa0 [mlx5_core]
 mlx5e_post_rx_wqes+0x1ac/0x5a0 [mlx5_core]
 mlx5e_napi_poll+0x5e5/0x6f0 [mlx5_core]
 __napi_poll+0x2b/0x1a0
 net_rx_action+0x30e/0x370
 ? sched_clock+0x9/0x10
 ? sched_clock_cpu+0xf/0x170
 handle_softirqs+0xe2/0x2a0
 common_interrupt+0x85/0xa0
 </IRQ>
 <TASK>
 asm_common_interrupt+0x26/0x40
RIP: 0010:page_counter_uncharge+0x34/0x90
RSP: 0018:ffffc900e728bb00 EFLAGS: 00000213
RAX: ffff88aff4762000 RBX: ffff88aff4762100 RCX: 0000000000000304
RDX: 0000000000000001 RSI: 00000000004e9e1a RDI: ffff88aff4762100
RBP: 0000000000000001 R08: ffff891ea0560048 R09: 00007ffffffff000
R10: 0000000000001000 R11: ffff891ae8061b00 R12: ffffffffffffffff
R13: ffff89107fcfd4c0 R14: ffff891ae8061b00 R15: ffff892002fe1400
 uncharge_batch+0x40/0xd0
```

The fix is to use an atomic page fragment counter, so it will always match
the number of references held in the page_pool.

Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Fixes: 6f5742846053 ("net/mlx5e: RX, Enable skb page recycling through the page_pool")
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 39 ++++++++++---------
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2270e2e550dd..c164106eb85d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -568,7 +568,7 @@ struct mlx5e_icosq {
 
 struct mlx5e_frag_page {
 	netmem_ref netmem;
-	u16 frags;
+	atomic_long_t frags;
 };
 
 enum mlx5e_wqe_frag_flag {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5a46870c4b74..571a0df9f604 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -400,7 +400,7 @@ static int mlx5e_rq_alloc_mpwqe_linear_info(struct mlx5e_rq *rq, int node,
 	rq->mpwqe.linear_info = li;
 
 	/* Set to max to force allocation on first run. */
-	li->frag_page.frags = li->max_frags;
+	atomic_long_set(&li->frag_page.frags, li->max_frags);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b60aa47c75b..ee360fa0c316 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -284,7 +284,7 @@ static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
 
 	*frag_page = (struct mlx5e_frag_page) {
 		.netmem	= netmem,
-		.frags	= 0,
+		.frags	= ATOMIC_LONG_INIT(0),
 	};
 
 	return 0;
@@ -293,7 +293,7 @@ static int mlx5e_page_alloc_fragmented(struct page_pool *pp,
 static void mlx5e_page_release_fragmented(struct page_pool *pp,
 					  struct mlx5e_frag_page *frag_page)
 {
-	u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
+	u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - atomic_long_read(&frag_page->frags);
 	netmem_ref netmem = frag_page->netmem;
 
 	if (page_pool_unref_netmem(netmem, drain_count) == 0)
@@ -304,7 +304,7 @@ static int mlx5e_mpwqe_linear_page_refill(struct mlx5e_rq *rq)
 {
 	struct mlx5e_mpw_linear_info *li = rq->mpwqe.linear_info;
 
-	if (likely(li->frag_page.frags < li->max_frags))
+	if (likely(atomic_long_read(&li->frag_page.frags) < li->max_frags))
 		return 0;
 
 	if (likely(li->frag_page.netmem)) {
@@ -323,7 +323,8 @@ static void *mlx5e_mpwqe_get_linear_page_frag(struct mlx5e_rq *rq)
 	if (unlikely(mlx5e_mpwqe_linear_page_refill(rq)))
 		return NULL;
 
-	frag_offset = li->frag_page.frags << MLX5E_XDP_LOG_MAX_LINEAR_SZ;
+	frag_offset = atomic_long_read(&li->frag_page.frags) <<
+		      MLX5E_XDP_LOG_MAX_LINEAR_SZ;
 	WARN_ON(frag_offset >= BIT(rq->mpwqe.page_shift));
 
 	return netmem_address(li->frag_page.netmem) + frag_offset;
@@ -568,7 +569,7 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
 		return;
 	}
 
-	frag_page->frags++;
+	atomic_long_inc(&frag_page->frags);
 	skb_add_rx_frag_netmem(skb, next_frag, netmem,
 			       frag_offset, len, truesize);
 }
@@ -744,7 +745,7 @@ void mlx5e_mpwqe_dealloc_linear_page(struct mlx5e_rq *rq)
 	 * things in a good state for re-allocation.
 	 */
 	li->frag_page.netmem = 0;
-	li->frag_page.frags = li->max_frags;
+	atomic_long_set(&li->frag_page.frags, li->max_frags);
 }
 
 INDIRECT_CALLABLE_SCOPE bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
@@ -1615,7 +1616,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	/* queue up for recycling/reuse */
 	skb_mark_for_recycle(skb);
-	frag_page->frags++;
+	atomic_long_inc(&frag_page->frags);
 
 	return skb;
 }
@@ -1683,7 +1684,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 				struct mlx5e_wqe_frag_info *pwi;
 
 				for (pwi = head_wi; pwi < wi; pwi++)
-					pwi->frag_page->frags++;
+					atomic_long_inc(&pwi->frag_page->frags);
 			}
 			return NULL; /* page/packet was consumed by XDP */
 		}
@@ -1702,7 +1703,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL;
 
 	skb_mark_for_recycle(skb);
-	head_wi->frag_page->frags++;
+	atomic_long_inc(&head_wi->frag_page->frags);
 
 	if (xdp_buff_has_frags(&mxbuf->xdp)) {
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
@@ -1711,7 +1712,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 					  xdp_buff_get_skb_flags(&mxbuf->xdp));
 
 		for (struct mlx5e_wqe_frag_info *pwi = head_wi + 1; pwi < wi; pwi++)
-			pwi->frag_page->frags++;
+			atomic_long_inc(&pwi->frag_page->frags);
 	}
 
 	return skb;
@@ -1760,7 +1761,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
-			wi->frag_page->frags++;
+			atomic_long_inc(&wi->frag_page->frags);
 		goto wq_cyc_pop;
 	}
 
@@ -1808,7 +1809,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
-			wi->frag_page->frags++;
+			atomic_long_inc(&wi->frag_page->frags);
 		goto wq_cyc_pop;
 	}
 
@@ -2011,9 +2012,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 				struct mlx5e_frag_page *pfp;
 
 				for (pfp = head_page; pfp < frag_page; pfp++)
-					pfp->frags++;
+					atomic_long_inc(&pfp->frags);
 
-				linear_page->frags++;
+				atomic_long_inc(&linear_page->frags);
 			}
 			return NULL; /* page/packet was consumed by XDP */
 		}
@@ -2035,7 +2036,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL;
 
 		skb_mark_for_recycle(skb);
-		linear_page->frags++;
+		atomic_long_inc(&linear_page->frags);
 
 		if (xdp_buff_has_frags(&mxbuf->xdp)) {
 			struct mlx5e_frag_page *pagep;
@@ -2048,7 +2049,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 			pagep = head_page;
 			do
-				pagep->frags++;
+				atomic_long_inc(&pagep->frags);
 			while (++pagep < frag_page);
 
 			headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
@@ -2068,7 +2069,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 			pagep = frag_page - sinfo->nr_frags;
 			do
-				pagep->frags++;
+				atomic_long_inc(&pagep->frags);
 			while (++pagep < frag_page);
 		}
 		/* copy header */
@@ -2121,7 +2122,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 				 cqe_bcnt, mxbuf);
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
-				frag_page->frags++;
+				atomic_long_inc(&frag_page->frags);
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
@@ -2136,7 +2137,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	/* queue up for recycling/reuse */
 	skb_mark_for_recycle(skb);
-	frag_page->frags++;
+	atomic_long_inc(&frag_page->frags);
 
 	return skb;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
  2026-06-25 17:40 ` Nabil S. Alramli
@ 2026-06-26 13:12   ` Dragos Tatulea
  2026-06-26 18:02     ` Nabil S. Alramli
  0 siblings, 1 reply; 5+ messages in thread
From: Dragos Tatulea @ 2026-06-26 13:12 UTC (permalink / raw)
  To: Nabil S. Alramli, saeedm, tariqt, mbloch
  Cc: nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel



On 25.06.26 19:40, Nabil S. Alramli wrote:
> Under memory pressure, mlx5 driver has WARNING during fragmented page
> release. This happens because there is a discrepency between what mlx5
> thinks the page fragment counter is vs what the page_pool actually says it
> is.
> 
The mlx5 frag counter is not the same as pp_ref_count. The page gets
split into 64 parts during page allocation. The frag counter tracks how
many of those frags have been used.

> The cause of the issue is page allocations on concurrent cpus, which
> increment the non-atomic u16 page counter mlx5e_frag_page.frags, while at
> the same time the page reference counter net_iov.pp_ref_count is atomically
> incremented. That sometimes leads to a difference in the counts and
> therefore triggers the warning in page_pool_unref_netmem:
> 
page_pool page allocations must not happen in parallel on different CPUs.
Each queue has its own page_pool and allocation happens within the NAPI of
that queue which sticks to a single CPU. The release path does support
releasing on another CPU (release to ring).

How did you encounter this scenario of having parallel allocations on
different CPUs from the same page_pool?

> ```
> 	ret = atomic_long_sub_return(nr, pp_ref_count);
> 	WARN_ON(ret < 0);
> ```
> 
> The actual stack trace looks like this:
> 
> ```
> WARNING: CPU: 37 PID: 447795 at include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
> Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
> Hardware name: *
> RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
> RSP: 0018:ffffc90019814d98 EFLAGS: 00010293
> RAX: 000000000000003f RBX: ffff88c0993d0a10 RCX: ffffea02424592c0
> RDX: 0000000000000001 RSI: ffffea02424592c0 RDI: ffff88c090e20000
> RBP: 000000000000000a R08: 0000000000001409 R09: 0000000000000006
> R10: 0000000000000000 R11: ffff88c095fbc040 R12: 000000000000141f
> R13: 0000000000000009 R14: ffff88c090e20000 R15: 0000000000000001
> FS:  00007f34149fa6c0(0000) GS:ffff89200fa40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ed0265eb000 CR3: 0000005091cbe000 CR4: 0000000000350ef0
> Call Trace:
>  <IRQ>
>  mlx5e_free_rx_wqes+0x7b/0xa0 [mlx5_core]
>  mlx5e_post_rx_wqes+0x1ac/0x5a0 [mlx5_core]
>  mlx5e_napi_poll+0x5e5/0x6f0 [mlx5_core]
>  __napi_poll+0x2b/0x1a0
>  net_rx_action+0x30e/0x370
>  ? sched_clock+0x9/0x10
>  ? sched_clock_cpu+0xf/0x170
>  handle_softirqs+0xe2/0x2a0
>  common_interrupt+0x85/0xa0
>  </IRQ>
>  <TASK>
>  asm_common_interrupt+0x26/0x40
> RIP: 0010:page_counter_uncharge+0x34/0x90
> RSP: 0018:ffffc900e728bb00 EFLAGS: 00000213
> RAX: ffff88aff4762000 RBX: ffff88aff4762100 RCX: 0000000000000304
> RDX: 0000000000000001 RSI: 00000000004e9e1a RDI: ffff88aff4762100
> RBP: 0000000000000001 R08: ffff891ea0560048 R09: 00007ffffffff000
> R10: 0000000000001000 R11: ffff891ae8061b00 R12: ffffffffffffffff
> R13: ffff89107fcfd4c0 R14: ffff891ae8061b00 R15: ffff892002fe1400
>  uncharge_batch+0x40/0xd0
> ```
>
Can you provide more data on how you reproduced this? This helps to
narrow down the bug. Reproduction steps would be ideal.

> The fix is to use an atomic page fragment counter, so it will always match
> the number of references held in the page_pool.
>
This is not the right fix. The mlx5 page frag counter is not atomic
on purpose because all changes to it happen only within the NAPI
context.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
  2026-06-26 13:12   ` Dragos Tatulea
@ 2026-06-26 18:02     ` Nabil S. Alramli
  2026-06-27  7:48       ` Dragos Tatulea
  0 siblings, 1 reply; 5+ messages in thread
From: Nabil S. Alramli @ 2026-06-26 18:02 UTC (permalink / raw)
  To: Dragos Tatulea, saeedm, tariqt, mbloch
  Cc: nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel

On 6/26/26 09:12, Dragos Tatulea wrote:
> 
> 
> On 25.06.26 19:40, Nabil S. Alramli wrote:
>> Under memory pressure, mlx5 driver has WARNING during fragmented page
>> release. This happens because there is a discrepency between what mlx5
>> thinks the page fragment counter is vs what the page_pool actually says it
>> is.
>>
> The mlx5 frag counter is not the same as pp_ref_count. The page gets
> split into 64 parts during page allocation. The frag counter tracks how
> many of those frags have been used.
> 

Thank you for explaining that to me. I thought it was the same because in
mlx5e_page_release_fragmented, drain_count is computed from frag_page->frags
and then passed into page_pool_unref_page as the nr parameter which is then
compared to the refcount in the page_pool.

>> The cause of the issue is page allocations on concurrent cpus, which
>> increment the non-atomic u16 page counter mlx5e_frag_page.frags, while at
>> the same time the page reference counter net_iov.pp_ref_count is atomically
>> incremented. That sometimes leads to a difference in the counts and
>> therefore triggers the warning in page_pool_unref_netmem:
>>
> page_pool page allocations must not happen in parallel on different CPUs.
> Each queue has its own page_pool and allocation happens within the NAPI of
> that queue which sticks to a single CPU. The release path does support
> releasing on another CPU (release to ring).
> 
> How did you encounter this scenario of having parallel allocations on
> different CPUs from the same page_pool?
> 

Perhaps we didn't, I had assumed that the numbers must match. What we did
encounter is these WARNINGs, and they seemed to go away with this patch but
maybe it is a coincidence.

>> ```
>> 	ret = atomic_long_sub_return(nr, pp_ref_count);
>> 	WARN_ON(ret < 0);
>> ```
>>
>> The actual stack trace looks like this:
>>
>> ```
>> WARNING: CPU: 37 PID: 447795 at include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
>> Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
>> Hardware name: *
>> RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
>> RSP: 0018:ffffc90019814d98 EFLAGS: 00010293
>> RAX: 000000000000003f RBX: ffff88c0993d0a10 RCX: ffffea02424592c0
>> RDX: 0000000000000001 RSI: ffffea02424592c0 RDI: ffff88c090e20000
>> RBP: 000000000000000a R08: 0000000000001409 R09: 0000000000000006
>> R10: 0000000000000000 R11: ffff88c095fbc040 R12: 000000000000141f
>> R13: 0000000000000009 R14: ffff88c090e20000 R15: 0000000000000001
>> FS:  00007f34149fa6c0(0000) GS:ffff89200fa40000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007ed0265eb000 CR3: 0000005091cbe000 CR4: 0000000000350ef0
>> Call Trace:
>>  <IRQ>
>>  mlx5e_free_rx_wqes+0x7b/0xa0 [mlx5_core]
>>  mlx5e_post_rx_wqes+0x1ac/0x5a0 [mlx5_core]
>>  mlx5e_napi_poll+0x5e5/0x6f0 [mlx5_core]
>>  __napi_poll+0x2b/0x1a0
>>  net_rx_action+0x30e/0x370
>>  ? sched_clock+0x9/0x10
>>  ? sched_clock_cpu+0xf/0x170
>>  handle_softirqs+0xe2/0x2a0
>>  common_interrupt+0x85/0xa0
>>  </IRQ>
>>  <TASK>
>>  asm_common_interrupt+0x26/0x40
>> RIP: 0010:page_counter_uncharge+0x34/0x90
>> RSP: 0018:ffffc900e728bb00 EFLAGS: 00000213
>> RAX: ffff88aff4762000 RBX: ffff88aff4762100 RCX: 0000000000000304
>> RDX: 0000000000000001 RSI: 00000000004e9e1a RDI: ffff88aff4762100
>> RBP: 0000000000000001 R08: ffff891ea0560048 R09: 00007ffffffff000
>> R10: 0000000000001000 R11: ffff891ae8061b00 R12: ffffffffffffffff
>> R13: ffff89107fcfd4c0 R14: ffff891ae8061b00 R15: ffff892002fe1400
>>  uncharge_batch+0x40/0xd0
>> ```
>>
> Can you provide more data on how you reproduced this? This helps to
> narrow down the bug. Reproduction steps would be ideal.
> 

I don't have clear steps to reproduce it, we just have seen it randomly on
some servers that were under memory pressure. I will try to look into it more
and find a way to reliably reproduce it. I agree that would be ideal to find a
proper fix.

>> The fix is to use an atomic page fragment counter, so it will always match
>> the number of references held in the page_pool.
>>
> This is not the right fix. The mlx5 page frag counter is not atomic
> on purpose because all changes to it happen only within the NAPI
> context.
> 

That was a question that I had, is it ever possible for frag_page->frags to be
incremented / set outside of NAPI context? I tried to answer that by looking
at code and by tracing it but could not get a clear picture. If it's not
possible then I agree, this is not the right fix.

> Thanks,
> Dragos

Thanks again for your guidance.

Nabil S. Alramli


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release
  2026-06-26 18:02     ` Nabil S. Alramli
@ 2026-06-27  7:48       ` Dragos Tatulea
  0 siblings, 0 replies; 5+ messages in thread
From: Dragos Tatulea @ 2026-06-27  7:48 UTC (permalink / raw)
  To: Nabil S. Alramli, saeedm, tariqt, mbloch
  Cc: nalramli, leon, andrew+netdev, davem, edumazet, kuba, pabeni,
	netdev, linux-rdma, linux-kernel



On 26.06.26 20:02, Nabil S. Alramli wrote:
> On 6/26/26 09:12, Dragos Tatulea wrote:
>>
>>
>> [...]
>>> ```
>>> 	ret = atomic_long_sub_return(nr, pp_ref_count);
>>> 	WARN_ON(ret < 0);
>>> ```
>>>
>>> The actual stack trace looks like this:
>>>
>>> ```
>>> WARNING: CPU: 37 PID: 447795 at include/net/page_pool/helpers.h:277 mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
>>> Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
>>> Hardware name: *
>>> RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x51/0x60 [mlx5_core]
>>> RSP: 0018:ffffc90019814d98 EFLAGS: 00010293
>>> RAX: 000000000000003f RBX: ffff88c0993d0a10 RCX: ffffea02424592c0
>>> RDX: 0000000000000001 RSI: ffffea02424592c0 RDI: ffff88c090e20000
>>> RBP: 000000000000000a R08: 0000000000001409 R09: 0000000000000006
>>> R10: 0000000000000000 R11: ffff88c095fbc040 R12: 000000000000141f
>>> R13: 0000000000000009 R14: ffff88c090e20000 R15: 0000000000000001
>>> FS:  00007f34149fa6c0(0000) GS:ffff89200fa40000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007ed0265eb000 CR3: 0000005091cbe000 CR4: 0000000000350ef0
>>> Call Trace:
>>>  <IRQ>
>>>  mlx5e_free_rx_wqes+0x7b/0xa0 [mlx5_core]
>>>  mlx5e_post_rx_wqes+0x1ac/0x5a0 [mlx5_core]
>>>  mlx5e_napi_poll+0x5e5/0x6f0 [mlx5_core]
>>>  __napi_poll+0x2b/0x1a0
>>>  net_rx_action+0x30e/0x370
>>>  ? sched_clock+0x9/0x10
>>>  ? sched_clock_cpu+0xf/0x170
>>>  handle_softirqs+0xe2/0x2a0
>>>  common_interrupt+0x85/0xa0
>>>  </IRQ>
>>>  <TASK>
>>>  asm_common_interrupt+0x26/0x40
>>> RIP: 0010:page_counter_uncharge+0x34/0x90
>>> RSP: 0018:ffffc900e728bb00 EFLAGS: 00000213
>>> RAX: ffff88aff4762000 RBX: ffff88aff4762100 RCX: 0000000000000304
>>> RDX: 0000000000000001 RSI: 00000000004e9e1a RDI: ffff88aff4762100
>>> RBP: 0000000000000001 R08: ffff891ea0560048 R09: 00007ffffffff000
>>> R10: 0000000000001000 R11: ffff891ae8061b00 R12: ffffffffffffffff
>>> R13: ffff89107fcfd4c0 R14: ffff891ae8061b00 R15: ffff892002fe1400
>>>  uncharge_batch+0x40/0xd0
>>> ```
>>>
>> Can you provide more data on how you reproduced this? This helps to
>> narrow down the bug. Reproduction steps would be ideal.
>>
> 
> I don't have clear steps to reproduce it, we just have seen it randomly on
> some servers that were under memory pressure. I will try to look into it more
> and find a way to reliably reproduce it. I agree that would be ideal to find a
> proper fix.
> 
What NIC is this?
What MTU is being used?
Is strided rq enabled (ethtool --show-priv-flags).
Is XDP/AF_XDP used? If yes, can you provide more details?
Is HW-GRO on?

Based on those answers we can review the code path and see if there
is a case where the accounting for the fragments is not done correctly

Also, is buf_alloc_err growing during these memory pressure?

>>> The fix is to use an atomic page fragment counter, so it will always match
>>> the number of references held in the page_pool.
>>>
>> This is not the right fix. The mlx5 page frag counter is not atomic
>> on purpose because all changes to it happen only within the NAPI
>> context.
>>
> 
> That was a question that I had, is it ever possible for frag_page->frags to be
> incremented / set outside of NAPI context? I tried to answer that by looking
> at code and by tracing it but could not get a clear picture. If it's not
> possible then I agree, this is not the right fix.
> 
If that happens it is probably a bug.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-27  7:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 17:40 [mellanox/mlx5-next RFC 1/1] net/mlx5: RX, Fix refcount warning on frag page release Nabil S. Alramli
2026-06-25 17:40 ` Nabil S. Alramli
2026-06-26 13:12   ` Dragos Tatulea
2026-06-26 18:02     ` Nabil S. Alramli
2026-06-27  7:48       ` Dragos Tatulea

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox