[PATCH v5] net/mlx5: fix counter TAILQ race between free and query callback

DPDK-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Linhu Li <lilinhu618@gmail.com>
To: dev@dpdk.org
Cc: stable@dpdk.org, dsosnowski@nvidia.com, Linhu Li <lilinhu618@gmail.com>
Subject: [PATCH v5] net/mlx5: fix counter TAILQ race between free and query callback
Date: Tue, 16 Jun 2026 16:03:52 +0800	[thread overview]
Message-ID: <20260616080352.10378-1-lilinhu618@gmail.com> (raw)
In-Reply-To: <20260604101112.72177-1-lilinhu618@gmail.com>

flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.

The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:

1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
   pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
   pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
   query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
   counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
   concurrently with step 4 on another core.

Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no synchronization
and corrupt it: the pool-local list ends up with a NULL head but a
dangling tqh_last, and the global free list tail no longer points to
the real tail. The just-freed counter and every counter inserted
afterwards become unreachable and are leaked.

Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.

Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.

Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.

Additionally, fix a second independent race in flow_dv_counter_free():
TAILQ_INSERT_TAIL is passed &pool->counters[pool->query_gen] directly,
but the macro evaluates its head argument multiple times. Since
pool->query_gen is a volatile bit-field, if mlx5_flow_query_alarm()
increments query_gen between two evaluations of the macro, the same
insertion can operate on two different lists: the earlier steps update
counters[0] while the later steps update counters[1], leaving both
lists with inconsistent metadata and leaking the counter. Fix by
caching pool->query_gen into a local variable before calling the macro.

Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org

Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---

v5:
- Added fix for Race 2: cache pool->query_gen into a local variable
  before TAILQ_INSERT_TAIL to prevent the macro from evaluating the
  volatile bit-field multiple times and crossing generation lists.
- Updated release notes: moved the fix entry under "Updated NVIDIA mlx5
  driver" in New Features instead of using a separate "Fixed Issues" section.

v4:
- Fixed commit log line length over 75 characters.

v3:
- Added release notes entry.
- Added function comment in mlx5_flow_async_pool_query_handle().
- Clarified error path comment to note it is safe for transient failures.

v2:
- Fixed Signed-off-by to use full name.

 doc/guides/rel_notes/release_26_07.rst |  6 +++++
 drivers/net/mlx5/mlx5_flow.c           | 31 ++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_flow_dv.c        | 12 +++++-----
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b8a3e2ced9..6c6f0aa696 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -90,6 +90,11 @@ New Features
   * Added support for transmitting LLDP packets based on mbuf packet type.
   * Implemented AVX2 context descriptor transmit paths.

+* **Updated NVIDIA mlx5 driver.**
+
+  Fixed counter free list corruption when counter free operations race with
+  asynchronous query completions.
+
 * **Updated PCAP ethernet driver.**

   * Added support for VLAN insertion and stripping.
@@ -153,6 +158,7 @@ ABI Changes
 * No ABI change that would break compatibility with 25.11.

+
 Known Issues
 ------------

diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..2f785d58ec 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9893,6 +9893,13 @@ void
 mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
 				  uint64_t async_id, int status)
 {
+	/*
+	 * Handle async counter pool query completion.
+	 * query_gen is flipped each round: freed counters go into [query_gen],
+	 * while this callback moves [query_gen ^ 1] to the global free list.
+	 * pool->csl must be held when operating on pool->counters[] to serialize
+	 * with concurrent free-path insertions.
+	 */
 	struct mlx5_flow_counter_pool *pool =
 		(struct mlx5_flow_counter_pool *)(uintptr_t)async_id;
 	struct mlx5_counter_stats_raw *raw_to_free;
@@ -9904,6 +9911,21 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,

 	if (unlikely(status)) {
 		raw_to_free = pool->raw_hw;
+		/*
+		 * The query failed, so the freed counters accumulated
+		 * in the old-gen list would otherwise be stranded.
+		 * Move them back to the global free list. This is safe
+		 * for both transient and persistent failures: the
+		 * counters are still valid and can be reused.
+		 */
+		if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+			rte_spinlock_lock(&pool->csl);
+			rte_spinlock_lock(&cmng->csl[cnt_type]);
+			TAILQ_CONCAT(&cmng->counters[cnt_type],
+				     &pool->counters[query_gen], next);
+			rte_spinlock_unlock(&cmng->csl[cnt_type]);
+			rte_spinlock_unlock(&pool->csl);
+		}
 	} else {
 		raw_to_free = pool->raw;
 		if (pool->is_aged)
@@ -9913,11 +9935,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
 		rte_spinlock_unlock(&pool->sl);
 		/* Be sure the new raw counters data is updated in memory. */
 		rte_io_wmb();
+		/*
+		 * A counter free thread may have read a stale query_gen
+		 * before the generation was flipped and could still be
+		 * inserting into this same old-gen list. Hold pool->csl to
+		 * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+		 * avoid corrupting the list.
+		 */
 		if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+			rte_spinlock_lock(&pool->csl);
 			rte_spinlock_lock(&cmng->csl[cnt_type]);
 			TAILQ_CONCAT(&cmng->counters[cnt_type],
 				     &pool->counters[query_gen], next);
 			rte_spinlock_unlock(&cmng->csl[cnt_type]);
+			rte_spinlock_unlock(&pool->csl);
 		}
 	}
 	LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index c2a2874913..060ccdeec2 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -7129,6 +7129,7 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
 	struct mlx5_flow_counter_pool *pool = NULL;
 	struct mlx5_flow_counter *cnt;
 	enum mlx5_counter_type cnt_type;
+	uint32_t query_gen;

 	if (!counter)
 		return;
@@ -7153,16 +7154,17 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
 	cnt->pool = pool;
 	/*
 	 * Put the counter back to list to be updated in none fallback mode.
-	 * Currently, we are using two list alternately, while one is in query,
+	 * Currently, we are using two lists alternately, while one is in query,
 	 * add the freed counter to the other list based on the pool query_gen
 	 * value. After query finishes, add counter the list to the global
-	 * container counter list. The list changes while query starts. In
-	 * this case, lock will not be needed as query callback and release
-	 * function both operate with the different list.
+	 * container counter list. Cache query_gen into a local variable before
+	 * TAILQ_INSERT_TAIL, since the macro evaluates its head argument
+	 * multiple times and pool->query_gen is a volatile bit-field.
 	 */
 	if (!priv->sh->sws_cmng.counter_fallback) {
 		rte_spinlock_lock(&pool->csl);
-		TAILQ_INSERT_TAIL(&pool->counters[pool->query_gen], cnt, next);
+		query_gen = pool->query_gen;
+		TAILQ_INSERT_TAIL(&pool->counters[query_gen], cnt, next);
 		rte_spinlock_unlock(&pool->csl);
 	} else {
 		cnt->dcs_when_free = cnt->dcs_when_active;
-- 
2.39.3 (Apple Git-146)

     prev parent reply	other threads:[~2026-06-17 13:52 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-04 10:11 [PATCH] net/mlx5: fix counter TAILQ race between free and query callback Laaahu
2026-06-08 12:41 ` Dariusz Sosnowski
2026-06-08 13:25 ` [PATCH v2] " Linhu Li
2026-06-08 14:11   ` Dariusz Sosnowski
2026-06-09  9:22   ` Dariusz Sosnowski
2026-06-10  6:34 ` [PATCH v3] " Linhu Li
2026-06-11  7:51 ` [PATCH v4] " Linhu Li
2026-06-16  8:03 ` Linhu Li [this message]

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:b8a3e2ced dfblob:6c6f0aa69 dfblob:915ea29a5 dfblob:2f785d58e
dfblob:c2a287491 dfblob:060ccdeec )
 OR (
bs:"[PATCH v5] net/mlx5: fix counter TAILQ race between free and query callback" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260616080352.10378-1-lilinhu618@gmail.com \
    --to=lilinhu618@gmail.com \
    --cc=dev@dpdk.org \
    --cc=dsosnowski@nvidia.com \
    --cc=stable@dpdk.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox