* [PATCH] net/mlx5: fix counter TAILQ race between free and query callback
@ 2026-06-04 10:11 Laaahu
2026-06-08 12:41 ` Dariusz Sosnowski
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Laaahu @ 2026-06-04 10:11 UTC (permalink / raw)
To: dev; +Cc: stable, lilinhu
From: lilinhu <lilinhu618@gmail.com>
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no
synchronization and corrupt it: the pool-local list ends up with a
NULL head but a dangling tqh_last, and the global free list tail no
longer points to the real tail. The just-freed counter and every
counter inserted afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: lilinhu <lilinhu618@gmail.com>
---
drivers/net/mlx5/mlx5_flow.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..20aad87f5d 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9904,6 +9904,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated in the
+ * old-gen list during this round would otherwise be stranded.
+ * Move them back to the global free list to avoid a leak when
+ * queries fail persistently.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9927,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
@ 2026-06-08 12:41 ` Dariusz Sosnowski
2026-06-08 13:25 ` [PATCH v2] " Linhu Li
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Dariusz Sosnowski @ 2026-06-08 12:41 UTC (permalink / raw)
To: Laaahu; +Cc: dev, stable
Hi,
Thank you for the contribution.
On Thu, Jun 04, 2026 at 06:11:12PM +0800, Laaahu wrote:
> From: lilinhu <lilinhu618@gmail.com>
>
> flow_dv_counter_free() inserts counters into
> pool->counters[pool->query_gen] under pool->csl. Meanwhile,
> mlx5_flow_async_pool_query_handle() moves counters from
> pool->counters[query_gen ^ 1] to the global free list via
> TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
>
> The comment in flow_dv_counter_free() claims the lock is not needed
> because the query callback and the release function operate on
> different lists. That holds only if the free path always observes
> the up-to-date query_gen. It can be violated:
>
> 1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
> pool->query_gen == 0 and is about to insert into counters[0].
> 2. The free thread is preempted by the OS scheduler; it is a regular
> pthread, not pinned to a core.
> 3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
> query is sent.
> 4. Hardware completes the query and the callback runs TAILQ_CONCAT on
> counters[0] (= query_gen ^ 1).
> 5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
> concurrently with step 4 on another core.
>
> Because the two paths take different locks, TAILQ_INSERT_TAIL and
> TAILQ_CONCAT run concurrently on the same list with no
> synchronization and corrupt it: the pool-local list ends up with a
> NULL head but a dangling tqh_last, and the global free list tail no
> longer points to the real tail. The just-freed counter and every
> counter inserted afterwards become unreachable and are leaked.
>
> Non-PMD threads can be preempted for hundreds of microseconds under
> CPU pressure, which is well within the async query round-trip time,
> so the window is reachable in practice.
>
> Fix it by taking pool->csl in the query completion callback before
> operating on pool->counters[query_gen], serializing the CONCAT with
> any concurrent INSERT. The lock is taken once per pool per query
> completion in the eal-intr-thread context, not on the datapath, so
> the cost is negligible. Lock order is pool->csl then cmng->csl,
> matching all other sites.
>
> Also handle the error path: previously the counters accumulated in
> pool->counters[query_gen] were abandoned when a query failed. Move
> them back to the global free list to avoid a leak on persistent
> query failures.
>
> Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
> Cc: stable@dpdk.org
>
> Signed-off-by: lilinhu <lilinhu618@gmail.com>
Code looks good to me.
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
DPDK community uses Signed-off-by to indicate
Developer's Certificate of Origin:
https://developercertificate.org/
This requires full name in the Signed-off-by tag.
Could you please help with providing us with your full name
in English alphabet?
Best regards,
Dariusz Sosnowski
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v2] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
2026-06-08 12:41 ` Dariusz Sosnowski
@ 2026-06-08 13:25 ` Linhu Li
2026-06-08 14:11 ` Dariusz Sosnowski
2026-06-09 9:22 ` Dariusz Sosnowski
2026-06-10 6:34 ` [PATCH v3] " Linhu Li
` (3 subsequent siblings)
5 siblings, 2 replies; 10+ messages in thread
From: Linhu Li @ 2026-06-08 13:25 UTC (permalink / raw)
To: dev; +Cc: stable, dsosnowski, Linhu Li
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no
synchronization and corrupt it: the pool-local list ends up with a
NULL head but a dangling tqh_last, and the global free list tail no
longer points to the real tail. The just-freed counter and every
counter inserted afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: Linhu Li <lilinhu618@gmail.com>
---
drivers/net/mlx5/mlx5_flow.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..20aad87f5d 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9904,6 +9904,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated in the
+ * old-gen list during this round would otherwise be stranded.
+ * Move them back to the global free list to avoid a leak when
+ * queries fail persistently.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9927,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* RE: [PATCH v2] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-08 13:25 ` [PATCH v2] " Linhu Li
@ 2026-06-08 14:11 ` Dariusz Sosnowski
2026-06-09 9:22 ` Dariusz Sosnowski
1 sibling, 0 replies; 10+ messages in thread
From: Dariusz Sosnowski @ 2026-06-08 14:11 UTC (permalink / raw)
To: Linhu Li, dev@dpdk.org; +Cc: stable@dpdk.org
> -----Original Message-----
> From: Linhu Li <lilinhu618@gmail.com>
> Sent: Monday, June 8, 2026 3:26 PM
> To: dev@dpdk.org
> Cc: stable@dpdk.org; Dariusz Sosnowski <dsosnowski@nvidia.com>; Linhu Li
> <lilinhu618@gmail.com>
> Subject: [PATCH v2] net/mlx5: fix counter TAILQ race between free and query
> callback
>
> flow_dv_counter_free() inserts counters into
> pool->counters[pool->query_gen] under pool->csl. Meanwhile,
> mlx5_flow_async_pool_query_handle() moves counters from
> pool->counters[query_gen ^ 1] to the global free list via
> TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
>
> The comment in flow_dv_counter_free() claims the lock is not needed
> because the query callback and the release function operate on different lists.
> That holds only if the free path always observes the up-to-date query_gen. It
> can be violated:
>
> 1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
> pool->query_gen == 0 and is about to insert into counters[0].
> 2. The free thread is preempted by the OS scheduler; it is a regular
> pthread, not pinned to a core.
> 3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
> query is sent.
> 4. Hardware completes the query and the callback runs TAILQ_CONCAT on
> counters[0] (= query_gen ^ 1).
> 5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
> concurrently with step 4 on another core.
>
> Because the two paths take different locks, TAILQ_INSERT_TAIL and
> TAILQ_CONCAT run concurrently on the same list with no synchronization and
> corrupt it: the pool-local list ends up with a NULL head but a dangling
> tqh_last, and the global free list tail no longer points to the real tail. The just-
> freed counter and every counter inserted afterwards become unreachable
> and are leaked.
>
> Non-PMD threads can be preempted for hundreds of microseconds under
> CPU pressure, which is well within the async query round-trip time, so the
> window is reachable in practice.
>
> Fix it by taking pool->csl in the query completion callback before operating on
> pool->counters[query_gen], serializing the CONCAT with any concurrent
> INSERT. The lock is taken once per pool per query completion in the eal-intr-
> thread context, not on the datapath, so the cost is negligible. Lock order is
> pool->csl then cmng->csl, matching all other sites.
>
> Also handle the error path: previously the counters accumulated in
> pool->counters[query_gen] were abandoned when a query failed. Move
> them back to the global free list to avoid a leak on persistent query failures.
>
> Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
> Cc: stable@dpdk.org
>
> Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-08 13:25 ` [PATCH v2] " Linhu Li
2026-06-08 14:11 ` Dariusz Sosnowski
@ 2026-06-09 9:22 ` Dariusz Sosnowski
1 sibling, 0 replies; 10+ messages in thread
From: Dariusz Sosnowski @ 2026-06-09 9:22 UTC (permalink / raw)
To: Linhu Li; +Cc: dev, stable, rasland
On Mon, Jun 08, 2026 at 09:25:55PM +0800, Linhu Li wrote:
> flow_dv_counter_free() inserts counters into
> pool->counters[pool->query_gen] under pool->csl. Meanwhile,
> mlx5_flow_async_pool_query_handle() moves counters from
> pool->counters[query_gen ^ 1] to the global free list via
> TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
>
> The comment in flow_dv_counter_free() claims the lock is not needed
> because the query callback and the release function operate on
> different lists. That holds only if the free path always observes
> the up-to-date query_gen. It can be violated:
>
> 1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
> pool->query_gen == 0 and is about to insert into counters[0].
> 2. The free thread is preempted by the OS scheduler; it is a regular
> pthread, not pinned to a core.
> 3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
> query is sent.
> 4. Hardware completes the query and the callback runs TAILQ_CONCAT on
> counters[0] (= query_gen ^ 1).
> 5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
> concurrently with step 4 on another core.
>
> Because the two paths take different locks, TAILQ_INSERT_TAIL and
> TAILQ_CONCAT run concurrently on the same list with no
> synchronization and corrupt it: the pool-local list ends up with a
> NULL head but a dangling tqh_last, and the global free list tail no
> longer points to the real tail. The just-freed counter and every
> counter inserted afterwards become unreachable and are leaked.
>
> Non-PMD threads can be preempted for hundreds of microseconds under
> CPU pressure, which is well within the async query round-trip time,
> so the window is reachable in practice.
>
> Fix it by taking pool->csl in the query completion callback before
> operating on pool->counters[query_gen], serializing the CONCAT with
> any concurrent INSERT. The lock is taken once per pool per query
> completion in the eal-intr-thread context, not on the datapath, so
> the cost is negligible. Lock order is pool->csl then cmng->csl,
> matching all other sites.
>
> Also handle the error path: previously the counters accumulated in
> pool->counters[query_gen] were abandoned when a query failed. Move
> them back to the global free list to avoid a leak on persistent
> query failures.
>
> Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
> Cc: stable@dpdk.org
>
> Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v3] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
2026-06-08 12:41 ` Dariusz Sosnowski
2026-06-08 13:25 ` [PATCH v2] " Linhu Li
@ 2026-06-10 6:34 ` Linhu Li
2026-06-11 7:51 ` [PATCH v4] " Linhu Li
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Linhu Li @ 2026-06-10 6:34 UTC (permalink / raw)
To: dev; +Cc: stable, dsosnowski, Linhu Li
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on different lists.
That holds only if the free path always observes the up-to-date query_gen. It
can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no synchronization and
corrupt it: the pool-local list ends up with a NULL head but a dangling
tqh_last, and the global free list tail no longer points to the real tail. The just-
freed counter and every counter inserted afterwards become unreachable
and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time, so the
window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before operating on
pool->counters[query_gen], serializing the CONCAT with any concurrent
INSERT. The lock is taken once per pool per query completion in the eal-intr-
thread context, not on the datapath, so the cost is negligible. Lock order is
pool->csl then cmng->csl, matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent query failures.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
doc/guides/rel_notes/release_26_07.rst | 21 +++++++++++++++++
drivers/net/mlx5/mlx5_flow.c | 31 ++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b8a3e2ced9..30a9564884 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -153,6 +153,27 @@ ABI Changes
* No ABI change that would break compatibility with 25.11.
+Fixed Issues
+------------
+
+.. This section should contain fixed issues in this release. Sample format:
+
+ * **Add a title in the past tense with a full stop.**
+
+ Add a short 1-2 sentence description of the fix in the past tense.
+
+ This section is a comment. Do not overwrite or remove it.
+ Also, make sure to start the actual text at the margin.
+ =======================================================
+
+* **net/mlx5: Fixed counter TAILQ race between free and query callback.**
+
+ Fixed a race condition where concurrent counter free operations and async
+ query completions could corrupt the counter free list, causing counter leaks.
+ The issue occurred when non-PMD threads were preempted between reading
+ ``query_gen`` and inserting into the counter list.
+
+
Known Issues
------------
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..2f785d58ec 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9893,6 +9893,13 @@ void
mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
uint64_t async_id, int status)
{
+ /*
+ * Handle async counter pool query completion.
+ * query_gen is flipped each round: freed counters go into [query_gen],
+ * while this callback moves [query_gen ^ 1] to the global free list.
+ * pool->csl must be held when operating on pool->counters[] to serialize
+ * with concurrent free-path insertions.
+ */
struct mlx5_flow_counter_pool *pool =
(struct mlx5_flow_counter_pool *)(uintptr_t)async_id;
struct mlx5_counter_stats_raw *raw_to_free;
@@ -9904,6 +9911,21 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated
+ * in the old-gen list would otherwise be stranded.
+ * Move them back to the global free list. This is safe
+ * for both transient and persistent failures: the
+ * counters are still valid and can be reused.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9935,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v4] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
` (2 preceding siblings ...)
2026-06-10 6:34 ` [PATCH v3] " Linhu Li
@ 2026-06-11 7:51 ` Linhu Li
2026-06-16 8:03 ` [PATCH v5] " Linhu Li
2026-06-18 9:14 ` [PATCH v6] " Linhu Li
5 siblings, 0 replies; 10+ messages in thread
From: Linhu Li @ 2026-06-11 7:51 UTC (permalink / raw)
To: dev; +Cc: stable, dsosnowski, Linhu Li
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no synchronization
and corrupt it: the pool-local list ends up with a NULL head but a
dangling tqh_last, and the global free list tail no longer points to
the real tail. The just-freed counter and every counter inserted
afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
doc/guides/rel_notes/release_26_07.rst | 21 +++++++++++++++++
drivers/net/mlx5/mlx5_flow.c | 31 ++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b8a3e2ced9..30a9564884 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -153,6 +153,27 @@ ABI Changes
* No ABI change that would break compatibility with 25.11.
+Fixed Issues
+------------
+
+.. This section should contain fixed issues in this release. Sample format:
+
+ * **Add a title in the past tense with a full stop.**
+
+ Add a short 1-2 sentence description of the fix in the past tense.
+
+ This section is a comment. Do not overwrite or remove it.
+ Also, make sure to start the actual text at the margin.
+ =======================================================
+
+* **net/mlx5: Fixed counter TAILQ race between free and query callback.**
+
+ Fixed a race condition where concurrent counter free operations and async
+ query completions could corrupt the counter free list, causing counter leaks.
+ The issue occurred when non-PMD threads were preempted between reading
+ ``query_gen`` and inserting into the counter list.
+
+
Known Issues
------------
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..2f785d58ec 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9893,6 +9893,13 @@ void
mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
uint64_t async_id, int status)
{
+ /*
+ * Handle async counter pool query completion.
+ * query_gen is flipped each round: freed counters go into [query_gen],
+ * while this callback moves [query_gen ^ 1] to the global free list.
+ * pool->csl must be held when operating on pool->counters[] to serialize
+ * with concurrent free-path insertions.
+ */
struct mlx5_flow_counter_pool *pool =
(struct mlx5_flow_counter_pool *)(uintptr_t)async_id;
struct mlx5_counter_stats_raw *raw_to_free;
@@ -9904,6 +9911,21 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated
+ * in the old-gen list would otherwise be stranded.
+ * Move them back to the global free list. This is safe
+ * for both transient and persistent failures: the
+ * counters are still valid and can be reused.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9935,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v5] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
` (3 preceding siblings ...)
2026-06-11 7:51 ` [PATCH v4] " Linhu Li
@ 2026-06-16 8:03 ` Linhu Li
2026-06-18 9:14 ` [PATCH v6] " Linhu Li
5 siblings, 0 replies; 10+ messages in thread
From: Linhu Li @ 2026-06-16 8:03 UTC (permalink / raw)
To: dev; +Cc: stable, dsosnowski, Linhu Li
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no synchronization
and corrupt it: the pool-local list ends up with a NULL head but a
dangling tqh_last, and the global free list tail no longer points to
the real tail. The just-freed counter and every counter inserted
afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Additionally, fix a second independent race in flow_dv_counter_free():
TAILQ_INSERT_TAIL is passed &pool->counters[pool->query_gen] directly,
but the macro evaluates its head argument multiple times. Since
pool->query_gen is a volatile bit-field, if mlx5_flow_query_alarm()
increments query_gen between two evaluations of the macro, the same
insertion can operate on two different lists: the earlier steps update
counters[0] while the later steps update counters[1], leaving both
lists with inconsistent metadata and leaking the counter. Fix by
caching pool->query_gen into a local variable before calling the macro.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
v5:
- Added fix for Race 2: cache pool->query_gen into a local variable
before TAILQ_INSERT_TAIL to prevent the macro from evaluating the
volatile bit-field multiple times and crossing generation lists.
- Updated release notes: moved the fix entry under "Updated NVIDIA mlx5
driver" in New Features instead of using a separate "Fixed Issues" section.
v4:
- Fixed commit log line length over 75 characters.
v3:
- Added release notes entry.
- Added function comment in mlx5_flow_async_pool_query_handle().
- Clarified error path comment to note it is safe for transient failures.
v2:
- Fixed Signed-off-by to use full name.
doc/guides/rel_notes/release_26_07.rst | 6 +++++
drivers/net/mlx5/mlx5_flow.c | 31 ++++++++++++++++++++++++++
drivers/net/mlx5/mlx5_flow_dv.c | 12 +++++-----
3 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b8a3e2ced9..6c6f0aa696 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -90,6 +90,11 @@ New Features
* Added support for transmitting LLDP packets based on mbuf packet type.
* Implemented AVX2 context descriptor transmit paths.
+* **Updated NVIDIA mlx5 driver.**
+
+ Fixed counter free list corruption when counter free operations race with
+ asynchronous query completions.
+
* **Updated PCAP ethernet driver.**
* Added support for VLAN insertion and stripping.
@@ -153,6 +158,7 @@ ABI Changes
* No ABI change that would break compatibility with 25.11.
+
Known Issues
------------
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..2f785d58ec 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9893,6 +9893,13 @@ void
mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
uint64_t async_id, int status)
{
+ /*
+ * Handle async counter pool query completion.
+ * query_gen is flipped each round: freed counters go into [query_gen],
+ * while this callback moves [query_gen ^ 1] to the global free list.
+ * pool->csl must be held when operating on pool->counters[] to serialize
+ * with concurrent free-path insertions.
+ */
struct mlx5_flow_counter_pool *pool =
(struct mlx5_flow_counter_pool *)(uintptr_t)async_id;
struct mlx5_counter_stats_raw *raw_to_free;
@@ -9904,6 +9911,21 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated
+ * in the old-gen list would otherwise be stranded.
+ * Move them back to the global free list. This is safe
+ * for both transient and persistent failures: the
+ * counters are still valid and can be reused.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9935,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index c2a2874913..060ccdeec2 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -7129,6 +7129,7 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
struct mlx5_flow_counter_pool *pool = NULL;
struct mlx5_flow_counter *cnt;
enum mlx5_counter_type cnt_type;
+ uint32_t query_gen;
if (!counter)
return;
@@ -7153,16 +7154,17 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
cnt->pool = pool;
/*
* Put the counter back to list to be updated in none fallback mode.
- * Currently, we are using two list alternately, while one is in query,
+ * Currently, we are using two lists alternately, while one is in query,
* add the freed counter to the other list based on the pool query_gen
* value. After query finishes, add counter the list to the global
- * container counter list. The list changes while query starts. In
- * this case, lock will not be needed as query callback and release
- * function both operate with the different list.
+ * container counter list. Cache query_gen into a local variable before
+ * TAILQ_INSERT_TAIL, since the macro evaluates its head argument
+ * multiple times and pool->query_gen is a volatile bit-field.
*/
if (!priv->sh->sws_cmng.counter_fallback) {
rte_spinlock_lock(&pool->csl);
- TAILQ_INSERT_TAIL(&pool->counters[pool->query_gen], cnt, next);
+ query_gen = pool->query_gen;
+ TAILQ_INSERT_TAIL(&pool->counters[query_gen], cnt, next);
rte_spinlock_unlock(&pool->csl);
} else {
cnt->dcs_when_free = cnt->dcs_when_active;
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v6] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
` (4 preceding siblings ...)
2026-06-16 8:03 ` [PATCH v5] " Linhu Li
@ 2026-06-18 9:14 ` Linhu Li
2026-06-24 7:17 ` Raslan Darawsheh
5 siblings, 1 reply; 10+ messages in thread
From: Linhu Li @ 2026-06-18 9:14 UTC (permalink / raw)
To: dev; +Cc: stable, dsosnowski, Linhu Li
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no synchronization
and corrupt it: the pool-local list ends up with a NULL head but a
dangling tqh_last, and the global free list tail no longer points to
the real tail. The just-freed counter and every counter inserted
afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Additionally, fix a second independent race in flow_dv_counter_free():
TAILQ_INSERT_TAIL is passed &pool->counters[pool->query_gen] directly,
but the macro evaluates its head argument multiple times. Since
pool->query_gen is a volatile bit-field, if mlx5_flow_query_alarm()
increments query_gen between two evaluations of the macro, the same
insertion can operate on two different lists: the earlier steps update
counters[0] while the later steps update counters[1], leaving both
lists with inconsistent metadata and leaking the counter. Fix by
caching pool->query_gen into a local variable before calling the macro.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
v6:
- Rebased onto latest main to resolve a release notes conflict: a new
mlx5 entry was added upstream after v5, so this patch now adds its
fix as a sub-bullet under the existing "Updated NVIDIA mlx5 ethernet
driver" entry instead of a separate item.
v5:
- Added fix for Race 2: cache pool->query_gen into a local variable
before TAILQ_INSERT_TAIL to prevent the macro from evaluating the
volatile bit-field multiple times and crossing generation lists.
- Updated release notes: moved the fix entry under "Updated NVIDIA mlx5
driver" in New Features instead of using a separate "Fixed Issues" section.
v4:
- Fixed commit log line length over 75 characters.
v3:
- Added release notes entry.
- Added function comment in mlx5_flow_async_pool_query_handle().
- Clarified error path comment to note it is safe for transient failures.
v2:
- Fixed Signed-off-by to use full name.
doc/guides/rel_notes/release_26_07.rst | 2 ++
drivers/net/mlx5/mlx5_flow.c | 31 ++++++++++++++++++++++++++
drivers/net/mlx5/mlx5_flow_dv.c | 12 +++++-----
3 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 5d7aa8d1bf..cdbd28ef4f 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -139,6 +139,8 @@ New Features
* **Updated NVIDIA mlx5 ethernet driver.**
* Added support for selective Rx in scalar SPRQ Rx path.
+ * Fixed counter free list corruption when counter free operations race with
+ asynchronous query completions.
* **Updated PCAP ethernet driver.**
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index a95dd9dc94..b0eac185b5 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9893,6 +9893,13 @@ void
mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
uint64_t async_id, int status)
{
+ /*
+ * Handle async counter pool query completion.
+ * query_gen is flipped each round: freed counters go into [query_gen],
+ * while this callback moves [query_gen ^ 1] to the global free list.
+ * pool->csl must be held when operating on pool->counters[] to serialize
+ * with concurrent free-path insertions.
+ */
struct mlx5_flow_counter_pool *pool =
(struct mlx5_flow_counter_pool *)(uintptr_t)async_id;
struct mlx5_counter_stats_raw *raw_to_free;
@@ -9904,6 +9911,21 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated
+ * in the old-gen list would otherwise be stranded.
+ * Move them back to the global free list. This is safe
+ * for both transient and persistent failures: the
+ * counters are still valid and can be reused.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9935,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 307354c886..58ebcf87eb 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -7129,6 +7129,7 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
struct mlx5_flow_counter_pool *pool = NULL;
struct mlx5_flow_counter *cnt;
enum mlx5_counter_type cnt_type;
+ uint32_t query_gen;
if (!counter)
return;
@@ -7153,16 +7154,17 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
cnt->pool = pool;
/*
* Put the counter back to list to be updated in none fallback mode.
- * Currently, we are using two list alternately, while one is in query,
+ * Currently, we are using two lists alternately, while one is in query,
* add the freed counter to the other list based on the pool query_gen
* value. After query finishes, add counter the list to the global
- * container counter list. The list changes while query starts. In
- * this case, lock will not be needed as query callback and release
- * function both operate with the different list.
+ * container counter list. Cache query_gen into a local variable before
+ * TAILQ_INSERT_TAIL, since the macro evaluates its head argument
+ * multiple times and pool->query_gen is a volatile bit-field.
*/
if (!priv->sh->sws_cmng.counter_fallback) {
rte_spinlock_lock(&pool->csl);
- TAILQ_INSERT_TAIL(&pool->counters[pool->query_gen], cnt, next);
+ query_gen = pool->query_gen;
+ TAILQ_INSERT_TAIL(&pool->counters[query_gen], cnt, next);
rte_spinlock_unlock(&pool->csl);
} else {
cnt->dcs_when_free = cnt->dcs_when_active;
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v6] net/mlx5: fix counter TAILQ race between free and query callback
2026-06-18 9:14 ` [PATCH v6] " Linhu Li
@ 2026-06-24 7:17 ` Raslan Darawsheh
0 siblings, 0 replies; 10+ messages in thread
From: Raslan Darawsheh @ 2026-06-24 7:17 UTC (permalink / raw)
To: Linhu Li, dev; +Cc: stable, dsosnowski
Hi,
On 18/06/2026 12:14 PM, Linhu Li wrote:
> flow_dv_counter_free() inserts counters into
> pool->counters[pool->query_gen] under pool->csl. Meanwhile,
> mlx5_flow_async_pool_query_handle() moves counters from
> pool->counters[query_gen ^ 1] to the global free list via
> TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
>
> The comment in flow_dv_counter_free() claims the lock is not needed
> because the query callback and the release function operate on
> different lists. That holds only if the free path always observes
> the up-to-date query_gen. It can be violated:
>
> 1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
> pool->query_gen == 0 and is about to insert into counters[0].
> 2. The free thread is preempted by the OS scheduler; it is a regular
> pthread, not pinned to a core.
> 3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
> query is sent.
> 4. Hardware completes the query and the callback runs TAILQ_CONCAT on
> counters[0] (= query_gen ^ 1).
> 5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
> concurrently with step 4 on another core.
>
> Because the two paths take different locks, TAILQ_INSERT_TAIL and
> TAILQ_CONCAT run concurrently on the same list with no synchronization
> and corrupt it: the pool-local list ends up with a NULL head but a
> dangling tqh_last, and the global free list tail no longer points to
> the real tail. The just-freed counter and every counter inserted
> afterwards become unreachable and are leaked.
>
> Non-PMD threads can be preempted for hundreds of microseconds under
> CPU pressure, which is well within the async query round-trip time,
> so the window is reachable in practice.
>
> Fix it by taking pool->csl in the query completion callback before
> operating on pool->counters[query_gen], serializing the CONCAT with
> any concurrent INSERT. The lock is taken once per pool per query
> completion in the eal-intr-thread context, not on the datapath, so
> the cost is negligible. Lock order is pool->csl then cmng->csl,
> matching all other sites.
>
> Also handle the error path: previously the counters accumulated in
> pool->counters[query_gen] were abandoned when a query failed. Move
> them back to the global free list to avoid a leak on persistent
> query failures.
>
> Additionally, fix a second independent race in flow_dv_counter_free():
> TAILQ_INSERT_TAIL is passed &pool->counters[pool->query_gen] directly,
> but the macro evaluates its head argument multiple times. Since
> pool->query_gen is a volatile bit-field, if mlx5_flow_query_alarm()
> increments query_gen between two evaluations of the macro, the same
> insertion can operate on two different lists: the earlier steps update
> counters[0] while the later steps update counters[1], leaving both
> lists with inconsistent metadata and leaking the counter. Fix by
> caching pool->query_gen into a local variable before calling the macro.
>
> Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
> Cc: stable@dpdk.org
>
> Signed-off-by: Linhu Li <lilinhu618@gmail.com>
> Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-06-24 7:17 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
2026-06-08 12:41 ` Dariusz Sosnowski
2026-06-08 13:25 ` [PATCH v2] " Linhu Li
2026-06-08 14:11 ` Dariusz Sosnowski
2026-06-09 9:22 ` Dariusz Sosnowski
2026-06-10 6:34 ` [PATCH v3] " Linhu Li
2026-06-11 7:51 ` [PATCH v4] " Linhu Li
2026-06-16 8:03 ` [PATCH v5] " Linhu Li
2026-06-18 9:14 ` [PATCH v6] " Linhu Li
2026-06-24 7:17 ` Raslan Darawsheh
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox