[PATCH] mempool: improve cache behaviour and performance


* [PATCH] mempool: improve cache behaviour and performance
@ 2026-04-08 14:13 Morten Brørup
  2026-04-08 15:41 ` Stephen Hemminger
                   ` (9 more replies)
  0 siblings, 10 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-08 14:13 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold
that many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).
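
The arithmetic behind items 1 and 2 can be sketched in C as below. This is
only a count-based model under stated assumptions, not the library code:
every sketch_* name is hypothetical, the maximum is the default of 512
mentioned above, and only the cache-miss case (n > len) is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed build-time maximum; 512 is the default quoted in the text. */
#define SKETCH_CACHE_MAX_SIZE 512u

/* Item 1: the old flush threshold was 1.5 times the configured size. */
static uint32_t
sketch_old_flushthresh(uint32_t size)
{
	return (size * 3) / 2;
}

/*
 * Item 2, old get path: number of objects fetched into the cache on a
 * miss, or 0 if the cache is bypassed. The bypass check compared against
 * the build-time maximum, not the configured size.
 */
static uint32_t
sketch_old_get_fill(uint32_t size, uint32_t len, uint32_t n)
{
	assert(n > len);	/* only the cache-miss case is modelled */
	uint32_t remaining = n - len;

	if (remaining > SKETCH_CACHE_MAX_SIZE)
		return 0;
	return size + remaining;
}

/*
 * Patched get path: bypass the cache when the remaining part of the
 * request exceeds half the configured size, otherwise fetch size / 2.
 */
static uint32_t
sketch_new_get_fill(uint32_t size, uint32_t len, uint32_t n)
{
	assert(n > len);	/* only the cache-miss case is modelled */
	uint32_t remaining = n - len;

	if (remaining > size / 2)
		return 0;
	return size / 2;
}
```

With a configured size of 32 and an empty cache, sketch_old_get_fill(32, 0, 256)
yields the 288 objects from the example in item 2, whereas
sketch_new_get_fill(32, 0, 256) yields 0, i.e. the request goes directly to
the backend.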

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size instead of
RTE_MEMPOOL_CACHE_MAX_SIZE.

(3) was improved by flushing and replenishing the cache by half its size,
so a flush/replenish can be followed by get or put requests in any order.
This also reduced the number of objects in each flush/replenish operation.
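
The half-size flush/replenish policy described in (3) can be illustrated
with a small count-only model in C. This is a sketch under assumptions, not
the patched library code: struct sketch_cache and the sketch_* functions
are hypothetical names, only object counts are tracked (no pointers), and
the backend is assumed to never fail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical count-only model of a per-lcore mempool cache. */
struct sketch_cache {
	uint32_t size;	/* configured cache size */
	uint32_t len;	/* number of objects currently in the cache */
};

/* Put n objects; returns how many objects go to the backend. */
static uint32_t
sketch_put(struct sketch_cache *c, uint32_t n)
{
	assert(c->len <= c->size);
	if (c->len + n <= c->size) {
		/* Sufficient room in the cache. */
		c->len += n;
		return 0;
	}
	if (n <= c->size / 2) {
		/* Flush size / 2 objects, then store the new ones. */
		c->len = c->len - c->size / 2 + n;
		return c->size / 2;
	}
	/* Request too big for the bounce buffer; bypass the cache. */
	return n;
}

/* Get n objects; returns how many objects come from the backend. */
static uint32_t
sketch_get(struct sketch_cache *c, uint32_t n)
{
	assert(c->len <= c->size);
	if (n <= c->len) {
		/* The entire request is served from the cache. */
		c->len -= n;
		return 0;
	}
	/* Hand out what the cache holds first. */
	uint32_t remaining = n - c->len;
	c->len = 0;
	if (remaining > c->size / 2)
		return remaining;	/* bypass the cache */
	/* Refill with size / 2 objects and serve the rest from them. */
	c->len = c->size / 2 - remaining;
	return c->size / 2;
}
```

In this model, with size 32 and a full cache, putting 8 objects flushes 16
and leaves 24 in the cache, so the next get or put of 8 objects is served
without touching the backend. Likewise, getting 8 objects from an empty
cache fetches 16 and leaves 8 behind.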

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and was reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE.
For ABI compatibility purposes, keeping the size of the rte_mempool_cache
unchanged, a filler array (cache->unused_objs[]) was added.
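
The effect of the filler array on the structure layout can be checked with
a simplified C sketch. The two structs below are hypothetical stand-ins for
the old and new rte_mempool_cache layouts: they omit the optional statistics
fields and the structure-level cache alignment attribute, and assume a
64-byte cache line and a maximum cache size of 512.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64		/* assumed cache line size */
#define SKETCH_CACHE_MAX_SIZE 512	/* assumed build-time maximum */

/* Old layout: one object array of twice the maximum cache size. */
struct sketch_cache_old {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	alignas(SKETCH_CACHE_LINE) void *objs[SKETCH_CACHE_MAX_SIZE * 2];
};

/* New layout: object array of the maximum size plus an unused filler. */
struct sketch_cache_new {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	alignas(SKETCH_CACHE_LINE) void *objs[SKETCH_CACHE_MAX_SIZE];
	void *unused_objs[SKETCH_CACHE_MAX_SIZE];
};

/* The filler keeps both the total size and the objs offset unchanged. */
static_assert(sizeof(struct sketch_cache_old) ==
	      sizeof(struct sketch_cache_new),
	      "structure size must not change");
static_assert(offsetof(struct sketch_cache_old, objs) ==
	      offsetof(struct sketch_cache_new, objs),
	      "objs offset must not change");
```

The static assertions only show that the size and field offsets match in
this simplified model; they say nothing about how an ABI checker treats the
changed array type.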

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.

In addition to the Mempool library changes, some Intel network drivers
bypassing the Mempool API to access the mempool cache were updated
accordingly.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 drivers/net/intel/common/tx.h                 | 36 +-------
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 15 ++--
 lib/mempool/rte_mempool.c                     | 14 +--
 lib/mempool/rte_mempool.h                     | 86 ++++++++++---------
 4 files changed, 60 insertions(+), 91 deletions(-)

diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
index 283bd58d5d..c5114f756d 100644
--- a/drivers/net/intel/common/tx.h
+++ b/drivers/net/intel/common/tx.h
@@ -284,43 +284,11 @@ ci_tx_free_bufs_vec(struct ci_tx_queue *txq, ci_desc_done_fn desc_done, bool ctx
 			txq->fast_free_mp :
 			(txq->fast_free_mp = txep[0].mbuf->pool);
 
-	if (mp != NULL && (n & 31) == 0) {
-		void **cache_objs;
-		struct rte_mempool_cache *cache = rte_mempool_default_cache(mp, rte_lcore_id());
-
-		if (cache == NULL)
-			goto normal;
-
-		cache_objs = &cache->objs[cache->len];
-
-		if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
-			rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
-			goto done;
-		}
-
-		/* The cache follows the following algorithm
-		 *   1. Add the objects to the cache
-		 *   2. Anything greater than the cache min value (if it
-		 *   crosses the cache flush threshold) is flushed to the ring.
-		 */
-		/* Add elements back into the cache */
-		uint32_t copied = 0;
-		/* n is multiple of 32 */
-		while (copied < n) {
-			memcpy(&cache_objs[copied], &txep[copied], 32 * sizeof(void *));
-			copied += 32;
-		}
-		cache->len += n;
-
-		if (cache->len >= cache->flushthresh) {
-			rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-					cache->len - cache->size);
-			cache->len = cache->size;
-		}
+	if (mp != NULL) {
+		rte_mbuf_raw_free_bulk(mp, (void *)txep, n);
 		goto done;
 	}
 
-normal:
 	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
 	if (likely(m)) {
 		free[0] = m;
diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 9af275cd9d..e5eb56552f 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,14 +148,19 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2e54fc4466..dafe98d1c2 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -104,13 +104,11 @@ struct __rte_cache_aligned rte_mempool_cache {
 		uint64_t get_success_objs;  /**< Objects successfully allocated. */
 	} stats;                        /**< Statistics */
 #endif
-	/**
-	 * Cache objects
-	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
-	 */
-	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
+	/** Cache objects */
+	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE];
+	/** Unused; for ABI compatibility purposes only */
+	void *unused_objs[RTE_MEMPOOL_CACHE_MAX_SIZE];
+	/* Note: Remember to add an RTE_CACHE_GUARD here, if removing unused_objs. */
 };
 
 /**
@@ -1047,11 +1045,16 @@ rte_mempool_free(struct rte_mempool *mp);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1377,7 +1380,7 @@ rte_mempool_cache_flush(struct rte_mempool_cache *cache,
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  */
 static __rte_always_inline void
-rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_do_generic_put(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	void **cache_objs;
@@ -1390,24 +1393,26 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->size / 2);
 	} else {
-		/* The request itself is too big for the cache. */
+		/* The request itself is too big. */
 		goto driver_enqueue_stats_incremented;
 	}
 
@@ -1418,13 +1423,13 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 driver_enqueue:
 
-	/* increment stat now, adding in mempool always success */
+	/* Increment stats now, adding in mempool always succeeds. */
 	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
 
 driver_enqueue_stats_incremented:
 
-	/* push objects to the backend */
+	/* Push the objects to the backend. */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
 
@@ -1442,7 +1447,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  */
 static __rte_always_inline void
-rte_mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_generic_put(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 			unsigned int n, struct rte_mempool_cache *cache)
 {
 	rte_mempool_trace_generic_put(mp, obj_table, n, cache);
@@ -1465,7 +1470,7 @@ rte_mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
  *   The number of objects to add in the mempool from obj_table.
  */
 static __rte_always_inline void
-rte_mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_put_bulk(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 		     unsigned int n)
 {
 	struct rte_mempool_cache *cache;
@@ -1507,7 +1512,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
  *   - <0: Error; code of driver dequeue function.
  */
 static __rte_always_inline int
-rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
+rte_mempool_do_generic_get(struct rte_mempool *mp, void ** __rte_restrict obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
@@ -1524,7 +1529,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1553,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1573,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
@@ -1629,7 +1635,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
+rte_mempool_generic_get(struct rte_mempool *mp, void ** __rte_restrict obj_table,
 			unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
@@ -1663,7 +1669,7 @@ rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
+rte_mempool_get_bulk(struct rte_mempool *mp, void ** __rte_restrict obj_table, unsigned int n)
 {
 	struct rte_mempool_cache *cache;
 	cache = rte_mempool_default_cache(mp, rte_lcore_id());
@@ -1692,7 +1698,7 @@ rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_get(struct rte_mempool *mp, void **obj_p)
+rte_mempool_get(struct rte_mempool *mp, void ** __rte_restrict obj_p)
 {
 	return rte_mempool_get_bulk(mp, obj_p, 1);
 }
-- 
2.43.0



* Re: [PATCH] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
@ 2026-04-08 15:41 ` Stephen Hemminger
  2026-04-09 10:25 ` [PATCH v2] " Morten Brørup
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Stephen Hemminger @ 2026-04-08 15:41 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty

On Wed,  8 Apr 2026 14:13:15 +0000
Morten Brørup <mb@smartsharesystems.com> wrote:

> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> when getting objects from a mempool, and the mempool cache did not hold
> that many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh.
> The cache->flushthresh field is kept for API/ABI compatibility purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
> (2) was improved by generally comparing to cache->size instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE.
> 
> (3) was improved by flushing and replenishing the cache by half its size,
> so a flush/replenish can be followed by get or put requests in any order.
> This also reduced the number of objects in each flush/replenish operation.
> 
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and was reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE.
> For ABI compatibility purposes, keeping the size of the rte_mempool_cache
> unchanged, a filler array (cache->unused_objs[]) was added.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, using an effective cache
> size of 384 objects.
> 
> In addition to the Mempool library changes, some Intel network drivers
> bypassing the Mempool API to access the mempool cache were updated
> accordingly.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---

AI review had some good feedback. Mostly about adding a good release note.

Review of: [PATCH] mempool: improve cache behaviour and performance
From: Morten Brørup <mb@smartsharesystems.com>

This is a substantial and well-motivated rework of the mempool cache.
The half-size flush/refill strategy is sound and the performance data
is compelling. A few observations:

Warning:

1. drivers/net/intel/common/tx.h: The reworked fast-free path removes
the (n & 31) == 0 requirement. The old code required n to be a multiple
of 32 because it used a memcpy loop in 32-element chunks. The new
code calls rte_mbuf_raw_free_bulk() which has no such requirement, so
removing the condition is correct. However, the old code also bypassed
rte_pktmbuf_prefree_seg() for the entire batch when the cache was
available. The new code still bypasses prefree (raw_free_bulk doesn't
call it), but now does so for ANY value of n, not just multiples of 32.
Previously, counts that were not a multiple of 32 fell through to the
"normal" path which called rte_pktmbuf_prefree_seg() per mbuf. If any of
those mbufs have a refcount greater than 1 or external buffers, the old
code handled that for such batches but the new code will not. This is
gated by fast_free_mp being non-NULL (i.e. RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
is enabled), which contractually means single-pool, refcnt==1, no
external buffers, so it is functionally safe, but the behavioral change
should be called out in the commit message.

2. drivers/net/intel/idpf/idpf_common_rxtx_avx512.c: The new fallback
to idpf_singleq_rearm_common() when IDPF_RXQ_REARM_THRESH > cache->size / 2
is a correctness guard, but it means that for any mempool with
cache_size < 128, the vectorized rearm path silently degrades to the
scalar path. This is a performance cliff that applications won't expect
from reducing cache_size. Worth a comment or documentation note.

Info:

3. lib/mempool/rte_mempool.h: The __rte_restrict addition to all public
put/get API signatures is an ABI-compatible but API-visible change. The
restrict qualifier is a promise by the caller, not the callee. Callers
using the deprecated non-restrict signatures via function pointers or
wrappers will still compile, but documenting this in the release notes
would help downstream users understand the new aliasing contract.

4. lib/mempool/rte_mempool.h: In the put path flush branch, the
enqueue_bulk call now flushes size/2 objects starting at offset
len - size/2 of the cache array, rather than the full cache contents
from offset 0. Because the cache is a stack, the objects being flushed
are the most recently added ones (the top of the stack), while the older
objects at the bottom stay in the cache. This changes the access pattern
for the backend ring: previously it saw the full cache contents, now it
sees the top size/2 objects. This is fine for correctness but changes the
cache residency pattern, which is presumably the intended improvement.

5. lib/mempool/rte_mempool.c: The validation in rte_mempool_create_empty
changes from cache_size * 1.5 > n to cache_size > n. This relaxes the
constraint — pools that were previously rejected (e.g. n=100,
cache_size=70, where 70*1.5=105 > 100 failed) will now succeed. This
is a user-visible behavioral change worth noting in release notes.
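
The relaxed validation in item 5 can be checked with a small C sketch. The
two helper functions are hypothetical; they only mirror the old and new
conditions from rte_mempool_create_empty() as shown in the diff, with the
maximum cache size assumed to be the default of 512.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed build-time maximum cache size (default value). */
#define SKETCH_CACHE_MAX_SIZE 512u

/* Old rule: 1.5 times the cache size must fit in the pool of n objects. */
static bool
sketch_old_cache_size_ok(uint32_t cache_size, uint32_t n)
{
	return !(cache_size > SKETCH_CACHE_MAX_SIZE ||
		 (cache_size * 3) / 2 > n);
}

/* New rule: the cache size itself must fit in the pool of n objects. */
static bool
sketch_new_cache_size_ok(uint32_t cache_size, uint32_t n)
{
	return !(cache_size > SKETCH_CACHE_MAX_SIZE || cache_size > n);
}
```

For n=100 and cache_size=70, the old rule rejects the pool (105 > 100) and
the new rule accepts it; a cache_size above n is rejected by both.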


* [PATCH v2] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
  2026-04-08 15:41 ` Stephen Hemminger
@ 2026-04-09 10:25 ` Morten Brørup
  2026-04-09 11:05 ` [PATCH v3] " Morten Brørup
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-09 10:25 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold
that many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size instead of
RTE_MEMPOOL_CACHE_MAX_SIZE.

(3) was improved by flushing and replenishing the cache by half its size,
so a flush/refill can be followed randomly by get or put requests.
This also reduced the number of objects in each flush/refill operation.

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.

In addition to the Mempool library changes, some Intel network drivers
bypassing the Mempool API to access the mempool cache were updated
accordingly.
The Intel idpf AVX512 driver was missing some mbuf instrumentation when
bypassing the Packet Buffer (mbuf) API, so this was added.

Furthermore, the NXP dpaa and dpaa2 mempool drivers were updated
accordingly, specifically to not set the flush threshold.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(), inspired by AI review feedback.
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst          |  7 ++
 doc/guides/rel_notes/release_26_07.rst        | 18 +++++
 drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
 drivers/net/intel/common/tx.h                 | 38 +--------
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 58 ++++++++++---
 lib/mempool/rte_mempool.c                     | 14 +---
 lib/mempool/rte_mempool.h                     | 81 +++++++++++--------
 8 files changed, 123 insertions(+), 121 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 35c9b4e06c..1389e6e6b1 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -154,3 +154,10 @@ Deprecation Notices
 * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
   ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
   Those API functions are used internally by DPDK core and netvsc PMD.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 060b26ff61..ab461bc4da 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -24,6 +24,24 @@ DPDK Release 26.07
 New Features
 ------------
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  * The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate.
+
+* **Updated Intel common driver.**
+
+  * Added missing mbuf history marking to vectorized Tx path for MBUF_FAST_FREE.
+
+* **Updated Intel idpf driver.**
+
+  * Added missing mbuf history marking to AVX512 vectorized Rx path.
+
 .. This section should contain new features added in this release.
    Sample format:
 
diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
index 2f9395b3f4..2f8555a026 100644
--- a/drivers/mempool/dpaa/dpaa_mempool.c
+++ b/drivers/mempool/dpaa/dpaa_mempool.c
@@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	struct bman_pool_params params = {
 		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
 	};
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 
 	MEMPOOL_INIT_FUNC_TRACE();
 
@@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
 		   sizeof(struct dpaa_bp_info));
 	mp->pool_data = (void *)bp_info;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
-	}
 
 	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
 	return 0;
diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 02b6741853..ee001d8ce0 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	struct dpaa2_bp_info *bp_info;
 	struct dpbp_attr dpbp_attr;
 	uint32_t bpid;
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 	int ret;
 
 	avail_dpbp = dpaa2_alloc_dpbp_dev();
@@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
 
 	h_bp_list = bp_list;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
-	}
 
 	return 0;
 err4:
diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
index 283bd58d5d..044ca68e2f 100644
--- a/drivers/net/intel/common/tx.h
+++ b/drivers/net/intel/common/tx.h
@@ -284,43 +284,13 @@ ci_tx_free_bufs_vec(struct ci_tx_queue *txq, ci_desc_done_fn desc_done, bool ctx
 			txq->fast_free_mp :
 			(txq->fast_free_mp = txep[0].mbuf->pool);
 
-	if (mp != NULL && (n & 31) == 0) {
-		void **cache_objs;
-		struct rte_mempool_cache *cache = rte_mempool_default_cache(mp, rte_lcore_id());
-
-		if (cache == NULL)
-			goto normal;
-
-		cache_objs = &cache->objs[cache->len];
-
-		if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
-			rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
-			goto done;
-		}
-
-		/* The cache follows the following algorithm
-		 *   1. Add the objects to the cache
-		 *   2. Anything greater than the cache min value (if it
-		 *   crosses the cache flush threshold) is flushed to the ring.
-		 */
-		/* Add elements back into the cache */
-		uint32_t copied = 0;
-		/* n is multiple of 32 */
-		while (copied < n) {
-			memcpy(&cache_objs[copied], &txep[copied], 32 * sizeof(void *));
-			copied += 32;
-		}
-		cache->len += n;
-
-		if (cache->len >= cache->flushthresh) {
-			rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-					cache->len - cache->size);
-			cache->len = cache->size;
-		}
+	if (mp != NULL) {
+		static_assert(sizeof(*txep) == sizeof(struct rte_mbuf *),
+				"txep is not similar to an array of rte_mbuf pointers");
+		rte_mbuf_raw_free_bulk(mp, (void *)txep, n);
 		goto done;
 	}
 
-normal:
 	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
 	if (likely(m)) {
 		free[0] = m;
diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 9af275cd9d..eb7a804780 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,14 +148,20 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
+		__rte_assume(cache->len < cache->size / 2);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
@@ -221,6 +227,17 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 		_mm512_storeu_si512(RTE_CAST_PTR(void *, (rxdp + 4)), desc4_5);
 		_mm512_storeu_si512(RTE_CAST_PTR(void *, (rxdp + 6)), desc6_7);
 
+		/* Instrumentation as in rte_mbuf_raw_alloc_bulk() */
+		__rte_mbuf_raw_sanity_check_mp(rxp[0], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[1], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[2], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[3], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[4], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[5], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[6], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[7], rxq->mp);
+		rte_mbuf_history_mark_bulk(rxp, 8, RTE_MBUF_HISTORY_OP_LIB_ALLOC);
+
 		rxp += IDPF_DESCS_PER_LOOP_AVX;
 		rxdp += IDPF_DESCS_PER_LOOP_AVX;
 		cache->len -= IDPF_DESCS_PER_LOOP_AVX;
@@ -565,14 +582,20 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
+			idpf_splitq_rearm_common(rx_bufq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
+		__rte_assume(cache->len < cache->size / 2);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rx_bufq->mp, &cache->objs[cache->len], req);
+				(rx_bufq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rx_bufq->nb_rx_desc) {
@@ -585,8 +608,8 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 							 dma_addr0);
 				}
 			}
-		rte_atomic_fetch_add_explicit(&rx_bufq->rx_stats.mbuf_alloc_failed,
-				   IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
+			rte_atomic_fetch_add_explicit(&rx_bufq->rx_stats.mbuf_alloc_failed,
+					   IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
 			return;
 		}
 	}
@@ -629,6 +652,17 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 		rxdp[7].split_rd.pkt_addr =
 			_mm_cvtsi128_si64(_mm512_extracti32x4_epi32(iova_addrs_1, 3));
 
+		/* Instrumentation as in rte_mbuf_raw_alloc_bulk() */
+		__rte_mbuf_raw_sanity_check_mp(rxp[0], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[1], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[2], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[3], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[4], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[5], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[6], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[7], rxq->mp);
+		rte_mbuf_history_mark_bulk(rxp, 8, RTE_MBUF_HISTORY_OP_LIB_ALLOC);
+
 		rxp += IDPF_DESCS_PER_LOOP_AVX;
 		rxdp += IDPF_DESCS_PER_LOOP_AVX;
 		cache->len -= IDPF_DESCS_PER_LOOP_AVX;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2e54fc4466..aa2d51bbd5 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1377,7 +1384,7 @@ rte_mempool_cache_flush(struct rte_mempool_cache *cache,
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  */
 static __rte_always_inline void
-rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_do_generic_put(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	void **cache_objs;
@@ -1390,24 +1397,27 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->size / 2);
 	} else {
-		/* The request itself is too big for the cache. */
+		/* The request itself is too big. */
 		goto driver_enqueue_stats_incremented;
 	}
 
@@ -1418,13 +1428,13 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 driver_enqueue:
 
-	/* increment stat now, adding in mempool always success */
+	/* Increment stats now, adding in mempool always succeeds. */
 	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
 
 driver_enqueue_stats_incremented:
 
-	/* push objects to the backend */
+	/* Push the objects to the backend. */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
 
@@ -1442,7 +1452,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  */
 static __rte_always_inline void
-rte_mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_generic_put(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 			unsigned int n, struct rte_mempool_cache *cache)
 {
 	rte_mempool_trace_generic_put(mp, obj_table, n, cache);
@@ -1465,7 +1475,7 @@ rte_mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
  *   The number of objects to add in the mempool from obj_table.
  */
 static __rte_always_inline void
-rte_mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_put_bulk(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 		     unsigned int n)
 {
 	struct rte_mempool_cache *cache;
@@ -1507,7 +1517,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
  *   - <0: Error; code of driver dequeue function.
  */
 static __rte_always_inline int
-rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
+rte_mempool_do_generic_get(struct rte_mempool *mp, void ** __rte_restrict obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
@@ -1524,7 +1534,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1558,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1578,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
@@ -1629,7 +1640,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
+rte_mempool_generic_get(struct rte_mempool *mp, void ** __rte_restrict obj_table,
 			unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
@@ -1663,7 +1674,7 @@ rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
+rte_mempool_get_bulk(struct rte_mempool *mp, void ** __rte_restrict obj_table, unsigned int n)
 {
 	struct rte_mempool_cache *cache;
 	cache = rte_mempool_default_cache(mp, rte_lcore_id());
@@ -1692,7 +1703,7 @@ rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_get(struct rte_mempool *mp, void **obj_p)
+rte_mempool_get(struct rte_mempool *mp, void ** __rte_restrict obj_p)
 {
 	return rte_mempool_get_bulk(mp, obj_p, 1);
 }
-- 
2.43.0



* [PATCH v3] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
  2026-04-08 15:41 ` Stephen Hemminger
  2026-04-09 10:25 ` [PATCH v2] " Morten Brørup
@ 2026-04-09 11:05 ` Morten Brørup
  2026-04-15 13:40   ` Morten Brørup
  2026-04-18 11:15 ` [PATCH v4] " Morten Brørup
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-04-09 11:05 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool whose cache did not have free space
for that many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This complete cache flush meant that a subsequent get operation (of
the same number of objects) completely emptied the cache,
so the next get operation after that required replenishing the cache.

Similarly,
when getting objects from a mempool whose cache did not hold that
many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).
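
The numbers in items 1 and 2 above can be reproduced with a small
sketch of the old arithmetic. The helper names below are hypothetical
and are not part of the mempool API; CACHE_MAX_SIZE stands in for the
default RTE_MEMPOOL_CACHE_MAX_SIZE of 512 mentioned above. Only object
counts are modelled, not the objects themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the default RTE_MEMPOOL_CACHE_MAX_SIZE (assumption: 512). */
#define CACHE_MAX_SIZE 512u

/* Item 1: the old flush threshold, i.e. the effective cache capacity. */
static unsigned int old_effective_capacity(unsigned int size)
{
	return (size * 3) / 2;
}

/* Item 2: the old get path compared against the build-time maximum only. */
static bool old_get_via_cache(unsigned int remaining)
{
	return remaining <= CACHE_MAX_SIZE;
}

/* Item 2: objects fetched into the cache by an old refill. */
static unsigned int old_refill_count(unsigned int size, unsigned int remaining)
{
	assert(old_get_via_cache(remaining));
	return size + remaining;
}
```

With a configured size of 32, old_effective_capacity(32) gives 48, and
old_refill_count(32, 256) gives the 288 objects from the example in
item 2.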

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.

(3) was improved by flushing and replenishing the cache by half its size,
so a flush/refill can be followed randomly by get or put requests.
This also reduced the number of objects in each flush/refill operation.
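
As a rough illustration of (1) to (3), the patched put and get paths
can be modelled on object counts alone. This is a simplified sketch,
not the library code: struct cache_model and both functions are
hypothetical names, the backend is assumed never to fail, and the
statistics counters are left out.

```c
#include <assert.h>

/* Hypothetical model: tracks object counts only, not pointers. */
struct cache_model {
	unsigned int size;    /* configured cache size */
	unsigned int len;     /* objects currently held in the cache */
	unsigned int backend; /* objects moved to or from the backend */
};

/* Put n objects, following the three branches of the patched put path. */
static void model_put(struct cache_model *c, unsigned int n)
{
	assert(c->len <= c->size);
	if (c->len + n <= c->size) {
		/* Sufficient room in the cache for the objects. */
		c->len += n;
	} else if (n <= c->size / 2) {
		/* Flush (size / 2) objects, then add the new ones. */
		c->len = c->len - c->size / 2 + n;
		c->backend += c->size / 2;
	} else {
		/* Beyond the bounce buffer limit; bypass the cache. */
		c->backend += n;
	}
}

/* Get n objects, following the branches of the patched get path. */
static void model_get(struct cache_model *c, unsigned int n)
{
	unsigned int remaining;

	if (n <= c->len) {
		/* The entire request is satisfied from the cache. */
		c->len -= n;
		return;
	}
	/* Use the entire cache, then look at what is still missing. */
	remaining = n - c->len;
	c->len = 0;
	if (remaining > c->size / 2) {
		/* Beyond the bounce buffer limit; fetch directly. */
		c->backend += remaining;
		return;
	}
	/* Refill (size / 2) objects, then serve the remainder from them. */
	c->backend += c->size / 2;
	c->len = c->size / 2 - remaining;
}
```

Starting from a full cache of size 64, a put of 16 objects flushes 32
objects and leaves len at 48, so the following request is served from
the cache whether it is a get or a put of the same size.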

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
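
The array size argument can be checked with a small sketch of the
highest number of cache->objs[] slots each variant may touch. The
function names are hypothetical, CACHE_MAX_SIZE again stands in for the
default RTE_MEMPOOL_CACHE_MAX_SIZE, and driver-specific flush
thresholds (such as the ones formerly set by dpaa and dpaa2) are
ignored.

```c
#include <assert.h>

/* Stand-in for the default RTE_MEMPOOL_CACHE_MAX_SIZE (assumption: 512). */
#define CACHE_MAX_SIZE 512u

/* Upper bound on objs[] slots used before the patch. */
static unsigned int old_peak_slots(unsigned int size)
{
	/* Put path: the cache could grow up to the flush threshold. */
	unsigned int put_peak = (size * 3) / 2;
	/* Get path: a refill fetched size + remaining, remaining <= max. */
	unsigned int get_peak = size + CACHE_MAX_SIZE;

	assert(size <= CACHE_MAX_SIZE);
	return put_peak > get_peak ? put_peak : get_peak;
}

/* Upper bound on objs[] slots used after the patch. */
static unsigned int new_peak_slots(unsigned int size)
{
	/*
	 * Put path: the cache never grows beyond size.
	 * Get path: a refill writes only the first (size / 2) slots.
	 */
	assert(size <= CACHE_MAX_SIZE);
	return size;
}
```

For the largest permitted cache size, old_peak_slots(512) is 1024,
i.e. 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, while new_peak_slots(512) is 512.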

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.

In addition to the Mempool library changes, some Intel network drivers
bypassing the Mempool API to access the mempool cache were updated
accordingly.
The Intel idpf AVX512 driver was missing some mbuf instrumentation when
bypassing the Packet Buffer (mbuf) API, so this was added.

Furthermore, the NXP dpaa and dpaa2 mempool drivers were updated
accordingly, specifically to not set the flush threshold.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(), inspired by AI review feedback.
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst          |  7 ++
 doc/guides/rel_notes/release_26_07.rst        | 18 +++++
 drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
 drivers/net/intel/common/tx.h                 | 38 +--------
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 58 ++++++++++---
 lib/mempool/rte_mempool.c                     | 14 +---
 lib/mempool/rte_mempool.h                     | 81 +++++++++++--------
 8 files changed, 123 insertions(+), 121 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 35c9b4e06c..40760fffbb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -154,3 +154,10 @@ Deprecation Notices
 * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
   ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
   Those API functions are used internally by DPDK core and netvsc PMD.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 060b26ff61..ab461bc4da 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -24,6 +24,24 @@ DPDK Release 26.07
 New Features
 ------------
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  * The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate.
+
+* **Updated Intel common driver.**
+
+  * Added missing mbuf history marking to vectorized Tx path for MBUF_FAST_FREE.
+
+* **Updated Intel idpf driver.**
+
+  * Added missing mbuf history marking to AVX512 vectorized Rx path.
+
 .. This section should contain new features added in this release.
    Sample format:
 
diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
index 2f9395b3f4..2f8555a026 100644
--- a/drivers/mempool/dpaa/dpaa_mempool.c
+++ b/drivers/mempool/dpaa/dpaa_mempool.c
@@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	struct bman_pool_params params = {
 		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
 	};
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 
 	MEMPOOL_INIT_FUNC_TRACE();
 
@@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
 		   sizeof(struct dpaa_bp_info));
 	mp->pool_data = (void *)bp_info;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
-	}
 
 	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
 	return 0;
diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 02b6741853..ee001d8ce0 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	struct dpaa2_bp_info *bp_info;
 	struct dpbp_attr dpbp_attr;
 	uint32_t bpid;
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 	int ret;
 
 	avail_dpbp = dpaa2_alloc_dpbp_dev();
@@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
 
 	h_bp_list = bp_list;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
-	}
 
 	return 0;
 err4:
diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
index 283bd58d5d..eeb0980d40 100644
--- a/drivers/net/intel/common/tx.h
+++ b/drivers/net/intel/common/tx.h
@@ -284,43 +284,13 @@ ci_tx_free_bufs_vec(struct ci_tx_queue *txq, ci_desc_done_fn desc_done, bool ctx
 			txq->fast_free_mp :
 			(txq->fast_free_mp = txep[0].mbuf->pool);
 
-	if (mp != NULL && (n & 31) == 0) {
-		void **cache_objs;
-		struct rte_mempool_cache *cache = rte_mempool_default_cache(mp, rte_lcore_id());
-
-		if (cache == NULL)
-			goto normal;
-
-		cache_objs = &cache->objs[cache->len];
-
-		if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
-			rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
-			goto done;
-		}
-
-		/* The cache follows the following algorithm
-		 *   1. Add the objects to the cache
-		 *   2. Anything greater than the cache min value (if it
-		 *   crosses the cache flush threshold) is flushed to the ring.
-		 */
-		/* Add elements back into the cache */
-		uint32_t copied = 0;
-		/* n is multiple of 32 */
-		while (copied < n) {
-			memcpy(&cache_objs[copied], &txep[copied], 32 * sizeof(void *));
-			copied += 32;
-		}
-		cache->len += n;
-
-		if (cache->len >= cache->flushthresh) {
-			rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-					cache->len - cache->size);
-			cache->len = cache->size;
-		}
+	if (mp != NULL) {
+		static_assert(sizeof(*txep) == sizeof(struct rte_mbuf *),
+				"txep array is not similar to an array of rte_mbuf pointers");
+		rte_mbuf_raw_free_bulk(mp, (void *)txep, n);
 		goto done;
 	}
 
-normal:
 	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
 	if (likely(m)) {
 		free[0] = m;
diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 9af275cd9d..59a6c22e98 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,14 +148,20 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
+		__rte_assume(cache->len < cache->size / 2);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
@@ -221,6 +227,17 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 		_mm512_storeu_si512(RTE_CAST_PTR(void *, (rxdp + 4)), desc4_5);
 		_mm512_storeu_si512(RTE_CAST_PTR(void *, (rxdp + 6)), desc6_7);
 
+		/* Instrumentation as in rte_mbuf_raw_alloc_bulk() */
+		__rte_mbuf_raw_sanity_check_mp(rxp[0], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[1], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[2], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[3], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[4], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[5], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[6], rxq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[7], rxq->mp);
+		rte_mbuf_history_mark_bulk(rxp, 8, RTE_MBUF_HISTORY_OP_LIB_ALLOC);
+
 		rxp += IDPF_DESCS_PER_LOOP_AVX;
 		rxdp += IDPF_DESCS_PER_LOOP_AVX;
 		cache->len -= IDPF_DESCS_PER_LOOP_AVX;
@@ -565,14 +582,20 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
+			idpf_splitq_rearm_common(rx_bufq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
+		__rte_assume(cache->len < cache->size / 2);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rx_bufq->mp, &cache->objs[cache->len], req);
+				(rx_bufq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rx_bufq->nb_rx_desc) {
@@ -585,8 +608,8 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 							 dma_addr0);
 				}
 			}
-		rte_atomic_fetch_add_explicit(&rx_bufq->rx_stats.mbuf_alloc_failed,
-				   IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
+			rte_atomic_fetch_add_explicit(&rx_bufq->rx_stats.mbuf_alloc_failed,
+					   IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
 			return;
 		}
 	}
@@ -629,6 +652,17 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 		rxdp[7].split_rd.pkt_addr =
 			_mm_cvtsi128_si64(_mm512_extracti32x4_epi32(iova_addrs_1, 3));
 
+		/* Instrumentation as in rte_mbuf_raw_alloc_bulk() */
+		__rte_mbuf_raw_sanity_check_mp(rxp[0], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[1], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[2], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[3], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[4], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[5], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[6], rx_bufq->mp);
+		__rte_mbuf_raw_sanity_check_mp(rxp[7], rx_bufq->mp);
+		rte_mbuf_history_mark_bulk(rxp, 8, RTE_MBUF_HISTORY_OP_LIB_ALLOC);
+
 		rxp += IDPF_DESCS_PER_LOOP_AVX;
 		rxdp += IDPF_DESCS_PER_LOOP_AVX;
 		cache->len -= IDPF_DESCS_PER_LOOP_AVX;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2e54fc4466..aa2d51bbd5 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1377,7 +1384,7 @@ rte_mempool_cache_flush(struct rte_mempool_cache *cache,
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  */
 static __rte_always_inline void
-rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_do_generic_put(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	void **cache_objs;
@@ -1390,24 +1397,27 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->size / 2);
 	} else {
-		/* The request itself is too big for the cache. */
+		/* The request itself is too big. */
 		goto driver_enqueue_stats_incremented;
 	}
 
@@ -1418,13 +1428,13 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 driver_enqueue:
 
-	/* increment stat now, adding in mempool always success */
+	/* Increment stats now, adding in mempool always succeeds. */
 	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
 
 driver_enqueue_stats_incremented:
 
-	/* push objects to the backend */
+	/* Push the objects to the backend. */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
 
@@ -1442,7 +1452,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  */
 static __rte_always_inline void
-rte_mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_generic_put(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 			unsigned int n, struct rte_mempool_cache *cache)
 {
 	rte_mempool_trace_generic_put(mp, obj_table, n, cache);
@@ -1465,7 +1475,7 @@ rte_mempool_generic_put(struct rte_mempool *mp, void * const *obj_table,
  *   The number of objects to add in the mempool from obj_table.
  */
 static __rte_always_inline void
-rte_mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
+rte_mempool_put_bulk(struct rte_mempool *mp, void * const * __rte_restrict obj_table,
 		     unsigned int n)
 {
 	struct rte_mempool_cache *cache;
@@ -1507,7 +1517,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
  *   - <0: Error; code of driver dequeue function.
  */
 static __rte_always_inline int
-rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
+rte_mempool_do_generic_get(struct rte_mempool *mp, void ** __rte_restrict obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
@@ -1524,7 +1534,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1558,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1578,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
@@ -1629,7 +1640,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
+rte_mempool_generic_get(struct rte_mempool *mp, void ** __rte_restrict obj_table,
 			unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
@@ -1663,7 +1674,7 @@ rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
+rte_mempool_get_bulk(struct rte_mempool *mp, void ** __rte_restrict obj_table, unsigned int n)
 {
 	struct rte_mempool_cache *cache;
 	cache = rte_mempool_default_cache(mp, rte_lcore_id());
@@ -1692,7 +1703,7 @@ rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table, unsigned int n)
  *   - -ENOENT: Not enough entries in the mempool; no object is retrieved.
  */
 static __rte_always_inline int
-rte_mempool_get(struct rte_mempool *mp, void **obj_p)
+rte_mempool_get(struct rte_mempool *mp, void ** __rte_restrict obj_p)
 {
 	return rte_mempool_get_bulk(mp, obj_p, 1);
 }
-- 
2.43.0



* RE: [PATCH v3] mempool: improve cache behaviour and performance
  2026-04-09 11:05 ` [PATCH v3] " Morten Brørup
@ 2026-04-15 13:40   ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-15 13:40 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 9 April 2026 13.06
> 
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> when getting objects from a mempool, and the mempool cache did not hold so
> many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh, when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
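
A minimal sketch of the arithmetic behind (1), not code from the patch.
OLD_FLUSHTHRESH() repeats the integer expression of the removed
CALC_CACHE_FLUSHTHRESH() macro. The toy_*_size_ok() helpers are
hypothetical names that mirror only the cache size check in
rte_mempool_create_empty() before and after the change, and
TOY_CACHE_MAX_SIZE assumes the default RTE_MEMPOOL_CACHE_MAX_SIZE of 512.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed default of RTE_MEMPOOL_CACHE_MAX_SIZE. */
#define TOY_CACHE_MAX_SIZE 512

/* Same integer arithmetic as the removed CALC_CACHE_FLUSHTHRESH() macro. */
#define OLD_FLUSHTHRESH(c) (((c) * 3) / 2)

/* A requested cache size of 256 used to give an effective size of 384. */
static_assert(OLD_FLUSHTHRESH(256) == 384, "old effective size is 1.5 * size");

/* Hypothetical helper: cache size check at mempool creation, before. */
static bool
toy_old_size_ok(uint32_t cache_size, uint32_t n)
{
	return cache_size <= TOY_CACHE_MAX_SIZE &&
			OLD_FLUSHTHRESH(cache_size) <= n;
}

/* Hypothetical helper: cache size check at mempool creation, after. */
static bool
toy_new_size_ok(uint32_t cache_size, uint32_t n)
{
	return cache_size <= TOY_CACHE_MAX_SIZE && cache_size <= n;
}
```

With these definitions, a pool of n = 100 objects and cache_size = 80
fails the old check (120 > 100) and passes the new one.
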
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> 
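
The get path before and after, reduced to object counts. This is a toy
model under stated assumptions, not the library code: all names are
hypothetical, the backend is assumed never to fail, and
TOY_CACHE_MAX_SIZE assumes the default RTE_MEMPOOL_CACHE_MAX_SIZE of 512.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default of RTE_MEMPOOL_CACHE_MAX_SIZE. */
#define TOY_CACHE_MAX_SIZE 512

struct toy_get_result {
	uint32_t len;     /* objects left in the cache after the get */
	uint32_t fetched; /* objects dequeued from the backend by the get */
};

/* Get n objects from a cache holding len objects, as before the patch. */
static struct toy_get_result
toy_old_get(uint32_t size, uint32_t len, uint32_t n)
{
	struct toy_get_result r = { 0, 0 };
	uint32_t remaining;

	if (n <= len) {
		/* The entire request is served from the cache. */
		r.len = len - n;
		return r;
	}
	/* Cached objects are handed out first; the rest needs the backend. */
	remaining = n - len;
	if (remaining > TOY_CACHE_MAX_SIZE) {
		/* Bypass the cache. */
		r.fetched = remaining;
	} else {
		/* Fetch size + remaining via the cache; size objects stay. */
		r.fetched = size + remaining;
		r.len = size;
	}
	return r;
}

/* The same request, as after the patch. */
static struct toy_get_result
toy_new_get(uint32_t size, uint32_t len, uint32_t n)
{
	struct toy_get_result r = { 0, 0 };
	uint32_t remaining;

	assert(size <= TOY_CACHE_MAX_SIZE && len <= size);
	if (n <= len) {
		r.len = len - n;
		return r;
	}
	remaining = n - len;
	if (remaining > size / 2) {
		/* Exceeds the bounce buffer limit: bypass the cache. */
		r.fetched = remaining;
	} else {
		/* Fetch size / 2 into the cache; the remainder is taken from it. */
		r.fetched = size / 2;
		r.len = size / 2 - remaining;
	}
	return r;
}
```

For the example in (2), an empty cache of size 32 and a request for 256
objects: toy_old_get() fetches 288 objects through the cache, while
toy_new_get() fetches 256 objects directly and leaves the cache empty.
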
> (3) was improved by flushing and replenishing the cache by half its size,
> so a flush/refill can be followed randomly by get or put requests.
> This also reduced the number of objects in each flush/refill operation.
> 
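
The put path before and after can be sketched the same way. Again a toy
model with hypothetical names that tracks object counts only; it follows
the branch structure of rte_mempool_do_generic_put() in the diff, and
takes the old flush threshold to be 1.5 times the cache size.

```c
#include <assert.h>
#include <stdint.h>

struct toy_put_result {
	uint32_t len;     /* objects left in the cache after the put */
	uint32_t flushed; /* objects enqueued to the backend by the put */
};

/* Put n objects into a cache holding len objects, as before the patch. */
static struct toy_put_result
toy_old_put(uint32_t size, uint32_t len, uint32_t n)
{
	uint32_t flushthresh = (size * 3) / 2;
	struct toy_put_result r = { len, 0 };

	assert(len <= flushthresh);
	if (len + n <= flushthresh) {
		/* Sufficient room in the cache. */
		r.len = len + n;
	} else if (n <= flushthresh) {
		/* Complete flush, then store the new objects. */
		r.flushed = len;
		r.len = n;
	} else {
		/* Request too big for the cache: bypass it. */
		r.flushed = n;
	}
	return r;
}

/* The same request, as after the patch. */
static struct toy_put_result
toy_new_put(uint32_t size, uint32_t len, uint32_t n)
{
	struct toy_put_result r = { len, 0 };

	assert(len <= size);
	if (len + n <= size) {
		/* Sufficient room in the cache. */
		r.len = len + n;
	} else if (n <= size / 2) {
		/* Flush size / 2 objects, then store the new objects. */
		r.flushed = size / 2;
		r.len = len - size / 2 + n;
	} else {
		/* Exceeds the bounce buffer limit: bypass the cache. */
		r.flushed = n;
	}
	return r;
}
```

With size 32: the old path, full at 48 objects, flushes all 48 on a put
of 8 and is left holding 8. The new path, full at 32 objects, flushes 16
and is left holding 24, so a following get or put of 8 objects is served
from the cache.
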
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, using an effective cache
> size of 384 objects.
> 
> In addition to the Mempool library changes, some Intel network drivers
> bypassing the Mempool API to access the mempool cache were updated
> accordingly.
> The Intel idpf AVX512 driver was missing some mbuf instrumentation when
> bypassing the Packet Buffer (mbuf) API, so this was added.
> 
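
The decision made at the top of the updated idpf rearm functions can be
sketched as below. The enum and function names are hypothetical, and the
rearm threshold is passed as a parameter because the value of
IDPF_RXQ_REARM_THRESH is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

enum toy_rearm_path {
	TOY_REARM_FROM_CACHE, /* the cache already holds enough objects */
	TOY_REARM_BACKFILL,   /* fetch size / 2 objects into the cache first */
	TOY_REARM_COMMON,     /* hand over to the *_rearm_common() function */
};

/* Pick the rearm path for a cache of the given size holding len objects. */
static enum toy_rearm_path
toy_pick_rearm_path(uint32_t size, uint32_t len, uint32_t rearm_thresh)
{
	assert(len <= size);
	if (len >= rearm_thresh)
		return TOY_REARM_FROM_CACHE;
	if (rearm_thresh > size / 2)
		return TOY_REARM_COMMON;
	/*
	 * Here len < rearm_thresh <= size / 2, so adding size / 2 objects
	 * cannot exceed the cache size, and it covers the rearm request.
	 */
	assert(len + size / 2 <= size);
	assert(len + size / 2 >= rearm_thresh);
	return TOY_REARM_BACKFILL;
}
```

The two assertions before the backfill return spell out why the
__rte_assume(cache->len < cache->size / 2) in the drivers holds on that
path.
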
> Furthermore, the NXP dpaa and dpaa2 mempool drivers were updated
> accordingly, specifically to not set the flush threshold.
> 

Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")

> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> v3:
> * Fixed my copy-paste bug in idpf_splitq_rearm().
> v2:
> * Fixed issue found by abidiff:
>   Reverted cache objects array size reduction. Added a note instead.
> * Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
> * Updated idpf_splitq_rearm() like idpf_singleq_rearm().
> * Added a few more __rte_assume(), inspired by AI review feedback.
> * Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
>   flush threshold.
> * Added release notes.
> * Added deprecation notes.
> ---
>  doc/guides/rel_notes/deprecation.rst          |  7 ++
>  doc/guides/rel_notes/release_26_07.rst        | 18 +++++
>  drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
>  drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
>  drivers/net/intel/common/tx.h                 | 38 +--------
>  .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 58 ++++++++++---
>  lib/mempool/rte_mempool.c                     | 14 +---
>  lib/mempool/rte_mempool.h                     | 81 +++++++++++--------
>  8 files changed, 123 insertions(+), 121 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index 35c9b4e06c..40760fffbb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -154,3 +154,10 @@ Deprecation Notices
>  * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
>    ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
>    Those API functions are used internally by DPDK core and netvsc PMD.
> +
> +* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
> +  is obsolete, and will be removed in DPDK 26.11.
> +
> +* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
> +  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
> +  DPDK 26.11.
> diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
> index 060b26ff61..ab461bc4da 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -24,6 +24,24 @@ DPDK Release 26.07
>  New Features
>  ------------
> 
> +* **Changed effective size of mempool cache.**
> +
> +  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
> +  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
> +  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
> +
> +* **Improved mempool cache flush/refill algorithm.**
> +
> +  * The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate.
> +
> +* **Updated Intel common driver.**
> +
> +  * Added missing mbuf history marking to vectorized Tx path for MBUF_FAST_FREE.
> +
> +* **Updated Intel idpf driver.**
> +
> +  * Added missing mbuf history marking to AVX512 vectorized Rx path.
> +
>  .. This section should contain new features added in this release.
>     Sample format:
> 
> diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
> index 2f9395b3f4..2f8555a026 100644
> --- a/drivers/mempool/dpaa/dpaa_mempool.c
> +++ b/drivers/mempool/dpaa/dpaa_mempool.c
> @@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
>  	struct bman_pool_params params = {
>  		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
>  	};
> -	unsigned int lcore_id;
> -	struct rte_mempool_cache *cache;
> 
>  	MEMPOOL_INIT_FUNC_TRACE();
> 
> @@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
>  	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
>  		   sizeof(struct dpaa_bp_info));
>  	mp->pool_data = (void *)bp_info;
> -	/* Update per core mempool cache threshold to optimal value which is
> -	 * number of buffers that can be released to HW buffer pool in
> -	 * a single API call.
> -	 */
> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> -		cache = &mp->local_cache[lcore_id];
> -		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
> -			lcore_id, cache->flushthresh,
> -			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
> -		if (cache->flushthresh)
> -			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
> -	}
> 
>  	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
>  	return 0;
> diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> index 02b6741853..ee001d8ce0 100644
> --- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> +++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> @@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
>  	struct dpaa2_bp_info *bp_info;
>  	struct dpbp_attr dpbp_attr;
>  	uint32_t bpid;
> -	unsigned int lcore_id;
> -	struct rte_mempool_cache *cache;
>  	int ret;
> 
>  	avail_dpbp = dpaa2_alloc_dpbp_dev();
> @@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
>  	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
> 
>  	h_bp_list = bp_list;
> -	/* Update per core mempool cache threshold to optimal value which is
> -	 * number of buffers that can be released to HW buffer pool in
> -	 * a single API call.
> -	 */
> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> -		cache = &mp->local_cache[lcore_id];
> -		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
> -			lcore_id, cache->flushthresh,
> -			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
> -		if (cache->flushthresh)
> -			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
> -	}
> 
>  	return 0;
>  err4:
> diff --git a/drivers/net/intel/common/tx.h b/drivers/net/intel/common/tx.h
> index 283bd58d5d..eeb0980d40 100644
> --- a/drivers/net/intel/common/tx.h
> +++ b/drivers/net/intel/common/tx.h
> @@ -284,43 +284,13 @@ ci_tx_free_bufs_vec(struct ci_tx_queue *txq, ci_desc_done_fn desc_done, bool ctx
>  			txq->fast_free_mp :
>  			(txq->fast_free_mp = txep[0].mbuf->pool);
> 
> -	if (mp != NULL && (n & 31) == 0) {
> -		void **cache_objs;
> -		struct rte_mempool_cache *cache = rte_mempool_default_cache(mp, rte_lcore_id());
> -
> -		if (cache == NULL)
> -			goto normal;
> -
> -		cache_objs = &cache->objs[cache->len];
> -
> -		if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
> -			rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
> -			goto done;
> -		}
> -
> -		/* The cache follows the following algorithm
> -		 *   1. Add the objects to the cache
> -		 *   2. Anything greater than the cache min value (if it
> -		 *   crosses the cache flush threshold) is flushed to the ring.
> -		 */
> -		/* Add elements back into the cache */
> -		uint32_t copied = 0;
> -		/* n is multiple of 32 */
> -		while (copied < n) {
> -			memcpy(&cache_objs[copied], &txep[copied], 32 * sizeof(void *));
> -			copied += 32;
> -		}
> -		cache->len += n;
> -
> -		if (cache->len >= cache->flushthresh) {
> -			rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -					cache->len - cache->size);
> -			cache->len = cache->size;
> -		}
> +	if (mp != NULL) {
> +		static_assert(sizeof(*txep) == sizeof(struct rte_mbuf *),
> +				"txep array is not similar to an array of rte_mbuf pointers");
> +		rte_mbuf_raw_free_bulk(mp, (void *)txep, n);
>  		goto done;
>  	}
> 
> -normal:
>  	m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
>  	if (likely(m)) {
>  		free[0] = m;
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> index 9af275cd9d..59a6c22e98 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> @@ -148,14 +148,20 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
> 
> -		/* How many do we require i.e. number to fill the cache + the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
> +			idpf_singleq_rearm_common(rxq);
> +			return;
> +		}
> +
> +		/* Backfill the cache from the backend; fetch (size / 2) objects. */
> +		__rte_assume(cache->len < cache->size / 2);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rxq->mp, &cache->objs[cache->len], req);
> +				(rxq->mp, &cache->objs[cache->len], cache->size / 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
>  			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rxq->nb_rx_desc) {
> @@ -221,6 +227,17 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
>  		_mm512_storeu_si512(RTE_CAST_PTR(void *, (rxdp + 4)), desc4_5);
>  		_mm512_storeu_si512(RTE_CAST_PTR(void *, (rxdp + 6)), desc6_7);
> 
> +		/* Instrumentation as in rte_mbuf_raw_alloc_bulk() */
> +		__rte_mbuf_raw_sanity_check_mp(rxp[0], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[1], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[2], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[3], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[4], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[5], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[6], rxq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[7], rxq->mp);
> +		rte_mbuf_history_mark_bulk(rxp, 8, RTE_MBUF_HISTORY_OP_LIB_ALLOC);
> +
>  		rxp += IDPF_DESCS_PER_LOOP_AVX;
>  		rxdp += IDPF_DESCS_PER_LOOP_AVX;
>  		cache->len -= IDPF_DESCS_PER_LOOP_AVX;
> @@ -565,14 +582,20 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
> 
> -		/* How many do we require i.e. number to fill the cache + the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +		if (unlikely(IDPF_RXQ_REARM_THRESH > cache->size / 2)) {
> +			idpf_splitq_rearm_common(rx_bufq);
> +			return;
> +		}
> +
> +		/* Backfill the cache from the backend; fetch (size / 2) objects. */
> +		__rte_assume(cache->len < cache->size / 2);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rx_bufq->mp, &cache->objs[cache->len], req);
> +				(rx_bufq->mp, &cache->objs[cache->len], cache->size / 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
>  			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rx_bufq->nb_rx_desc) {
> @@ -585,8 +608,8 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
>  							 dma_addr0);
>  				}
>  			}
> -		rte_atomic_fetch_add_explicit(&rx_bufq->rx_stats.mbuf_alloc_failed,
> -				   IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
> +			rte_atomic_fetch_add_explicit(&rx_bufq->rx_stats.mbuf_alloc_failed,
> +					   IDPF_RXQ_REARM_THRESH, rte_memory_order_relaxed);
>  			return;
>  		}
>  	}
> @@ -629,6 +652,17 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
>  		rxdp[7].split_rd.pkt_addr =
>  			_mm_cvtsi128_si64(_mm512_extracti32x4_epi32(iova_addrs_1, 3));
> 
> +		/* Instrumentation as in rte_mbuf_raw_alloc_bulk() */
> +		__rte_mbuf_raw_sanity_check_mp(rxp[0], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[1], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[2], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[3], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[4], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[5], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[6], rx_bufq->mp);
> +		__rte_mbuf_raw_sanity_check_mp(rxp[7], rx_bufq->mp);
> +		rte_mbuf_history_mark_bulk(rxp, 8, RTE_MBUF_HISTORY_OP_LIB_ALLOC);
> +
>  		rxp += IDPF_DESCS_PER_LOOP_AVX;
>  		rxdp += IDPF_DESCS_PER_LOOP_AVX;
>  		cache->len -= IDPF_DESCS_PER_LOOP_AVX;
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 3042d94c14..805b52cc58 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -52,11 +52,6 @@ static void
>  mempool_event_callback_invoke(enum rte_mempool_event event,
>  			      struct rte_mempool *mp);
> 
> -/* Note: avoid using floating point since that compiler
> - * may not think that is constant.
> - */
> -#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
> -
>  #if defined(RTE_ARCH_X86)
>  /*
>   * return the greatest common divisor between a and b (fast algorithm)
> @@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
>  static void
>  mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
>  {
> -	/* Check that cache have enough space for flush threshold */
> -	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
> -
>  	cache->size = size;
> -	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
> +	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
>  	cache->len = 0;
>  }
> 
> @@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
> 
>  	/* asked cache too big */
>  	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
> -	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
> +	    cache_size > n) {
>  		rte_errno = EINVAL;
>  		return NULL;
>  	}
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 2e54fc4466..aa2d51bbd5 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
>   */
>  struct __rte_cache_aligned rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
> -	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
>  	uint32_t len;	      /**< Current cache count */
>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>  	uint32_t unused;
> @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
>  	/**
>  	 * Cache objects
>  	 *
> -	 * Cache is allocated to this size to allow it to overflow in certain
> -	 * cases to avoid needless emptying of cache.
> +	 * Note:
> +	 * Cache is allocated at double size for API/ABI compatibility purposes only.
> +	 * When reducing its size at an API/ABI breaking release,
> +	 * remember to add a cache guard after it.
>  	 */
>  	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
>  };
> @@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
>   *   If cache_size is non-zero, the rte_mempool library will try to
>   *   limit the accesses to the common lockless pool, by maintaining a
>   *   per-lcore object cache. This argument must be lower or equal to
> - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> + *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
>   *   The access to the per-lcore table is of course
>   *   faster than the multi-producer/consumer pool. The cache can be
>   *   disabled if the cache_size argument is set to 0; it can be useful
> to
>   *   avoid losing objects in cache.
> + *   Note:
> + *   Mempool put/get requests of more than cache_size / 2 objects may
> be
> + *   partially or fully served directly by the multi-producer/consumer
> + *   pool, to avoid the overhead of copying the objects twice (instead
> of
> + *   once) when using the cache as a bounce buffer.
>   * @param private_data_size
>   *   The size of the private data appended after the mempool
>   *   structure. This is useful for storing some private data after the
> @@ -1377,7 +1384,7 @@ rte_mempool_cache_flush(struct rte_mempool_cache
> *cache,
>   *   A pointer to a mempool cache structure. May be NULL if not
> needed.
>   */
>  static __rte_always_inline void
> -rte_mempool_do_generic_put(struct rte_mempool *mp, void * const
> *obj_table,
> +rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *
> __rte_restrict obj_table,
>  			   unsigned int n, struct rte_mempool_cache *cache)
>  {
>  	void **cache_objs;
> @@ -1390,24 +1397,27 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> 
> -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE *
> 2);
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= cache->flushthresh);
> -	if (likely(cache->len + n <= cache->flushthresh)) {
> +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->len <= cache->size);
> +	if (likely(cache->len + n <= cache->size)) {
>  		/* Sufficient room in the cache for the objects. */
>  		cache_objs = &cache->objs[cache->len];
>  		cache->len += n;
> -	} else if (n <= cache->flushthresh) {
> +	} else if (n <= cache->size / 2) {
>  		/*
> -		 * The cache is big enough for the objects, but - as
> detected by
> -		 * the comparison above - has insufficient room for them.
> -		 * Flush the cache to make room for the objects.
> +		 * The number of objects is within the cache bounce buffer
> limit,
> +		 * but - as detected by the comparison above - the cache
> has
> +		 * insufficient room for them.
> +		 * Flush the cache to the backend to make room for the
> objects;
> +		 * flush (size / 2) objects.
>  		 */
> -		cache_objs = &cache->objs[0];
> -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> -		cache->len = n;
> +		__rte_assume(cache->len > cache->size / 2);
> +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> +		cache->len = cache->len - cache->size / 2 + n;
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->size /
> 2);
>  	} else {
> -		/* The request itself is too big for the cache. */
> +		/* The request itself is too big. */
>  		goto driver_enqueue_stats_incremented;
>  	}
> 
> @@ -1418,13 +1428,13 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
> 
>  driver_enqueue:
> 
> -	/* increment stat now, adding in mempool always success */
> +	/* Increment stats now, adding in mempool always succeeds. */
>  	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
>  	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
> 
>  driver_enqueue_stats_incremented:
> 
> -	/* push objects to the backend */
> +	/* Push the objects to the backend. */
>  	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
>  }
> 
> @@ -1442,7 +1452,7 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>   *   A pointer to a mempool cache structure. May be NULL if not
> needed.
>   */
>  static __rte_always_inline void
> -rte_mempool_generic_put(struct rte_mempool *mp, void * const
> *obj_table,
> +rte_mempool_generic_put(struct rte_mempool *mp, void * const *
> __rte_restrict obj_table,
>  			unsigned int n, struct rte_mempool_cache *cache)
>  {
>  	rte_mempool_trace_generic_put(mp, obj_table, n, cache);
> @@ -1465,7 +1475,7 @@ rte_mempool_generic_put(struct rte_mempool *mp,
> void * const *obj_table,
>   *   The number of objects to add in the mempool from obj_table.
>   */
>  static __rte_always_inline void
> -rte_mempool_put_bulk(struct rte_mempool *mp, void * const *obj_table,
> +rte_mempool_put_bulk(struct rte_mempool *mp, void * const *
> __rte_restrict obj_table,
>  		     unsigned int n)
>  {
>  	struct rte_mempool_cache *cache;
> @@ -1507,7 +1517,7 @@ rte_mempool_put(struct rte_mempool *mp, void
> *obj)
>   *   - <0: Error; code of driver dequeue function.
>   */
>  static __rte_always_inline int
> -rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> +rte_mempool_do_generic_get(struct rte_mempool *mp, void **
> __rte_restrict obj_table,
>  			   unsigned int n, struct rte_mempool_cache *cache)
>  {
>  	int ret;
> @@ -1524,7 +1534,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	/* The cache is a stack, so copy will be in reverse order. */
>  	cache_objs = &cache->objs[cache->len];
> 
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
>  	if (likely(n <= cache->len)) {
>  		/* The entire request can be satisfied from the cache. */
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> @@ -1548,13 +1558,13 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
> 
> -	/* Dequeue below would overflow mem allocated for cache? */
> -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> +	/* Dequeue below would exceed the cache bounce buffer limit? */
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	if (unlikely(remaining > cache->size / 2))
>  		goto driver_dequeue;
> 
> -	/* Fill the cache from the backend; fetch size + remaining
> objects. */
> -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> -			cache->size + remaining);
> +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size /
> 2);
>  	if (unlikely(ret < 0)) {
>  		/*
>  		 * We are buffer constrained, and not able to fetch all
> that.
> @@ -1568,10 +1578,11 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> 
> -	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	cache_objs = &cache->objs[cache->size + remaining];
> -	cache->len = cache->size;
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= cache->size / 2);
> +	cache_objs = &cache->objs[cache->size / 2];
> +	cache->len = cache->size / 2 - remaining;
>  	for (index = 0; index < remaining; index++)
>  		*obj_table++ = *--cache_objs;
> 
> @@ -1629,7 +1640,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>   *   - -ENOENT: Not enough entries in the mempool; no object is
> retrieved.
>   */
>  static __rte_always_inline int
> -rte_mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> +rte_mempool_generic_get(struct rte_mempool *mp, void ** __rte_restrict
> obj_table,
>  			unsigned int n, struct rte_mempool_cache *cache)
>  {
>  	int ret;
> @@ -1663,7 +1674,7 @@ rte_mempool_generic_get(struct rte_mempool *mp,
> void **obj_table,
>   *   - -ENOENT: Not enough entries in the mempool; no object is
> retrieved.
>   */
>  static __rte_always_inline int
> -rte_mempool_get_bulk(struct rte_mempool *mp, void **obj_table,
> unsigned int n)
> +rte_mempool_get_bulk(struct rte_mempool *mp, void ** __rte_restrict
> obj_table, unsigned int n)
>  {
>  	struct rte_mempool_cache *cache;
>  	cache = rte_mempool_default_cache(mp, rte_lcore_id());
> @@ -1692,7 +1703,7 @@ rte_mempool_get_bulk(struct rte_mempool *mp, void
> **obj_table, unsigned int n)
>   *   - -ENOENT: Not enough entries in the mempool; no object is
> retrieved.
>   */
>  static __rte_always_inline int
> -rte_mempool_get(struct rte_mempool *mp, void **obj_p)
> +rte_mempool_get(struct rte_mempool *mp, void ** __rte_restrict obj_p)
>  {
>  	return rte_mempool_get_bulk(mp, obj_p, 1);
>  }
> --
> 2.43.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v4] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (2 preceding siblings ...)
  2026-04-09 11:05 ` [PATCH v3] " Morten Brørup
@ 2026-04-18 11:15 ` Morten Brørup
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-18 11:15 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold
enough objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.
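
To illustrate (1), here is a minimal standalone sketch in C. It is not DPDK
code: it only mirrors the arithmetic of the removed CALC_CACHE_FLUSHTHRESH()
macro and the cache size check in rte_mempool_create_empty(). The helper
names and the CACHE_MAX_SIZE constant are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_MAX_SIZE 512u /* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Old effective capacity: 1.5 times the requested size (integer arithmetic). */
static uint32_t old_flushthresh(uint32_t size)
{
	assert(size <= CACHE_MAX_SIZE);
	return (size * 3) / 2;
}

/* Old creation check: the flush threshold had to fit within the n pool objects. */
static bool old_cache_size_ok(uint32_t cache_size, uint32_t n)
{
	return cache_size <= CACHE_MAX_SIZE && old_flushthresh(cache_size) <= n;
}

/* New creation check: the requested size itself has to fit within n. */
static bool new_cache_size_ok(uint32_t cache_size, uint32_t n)
{
	return cache_size <= CACHE_MAX_SIZE && cache_size <= n;
}
```

For example, a requested cache size of 32 gave an old threshold of 48, so a
pool of n = 40 objects was rejected by the old check but passes the new one.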

(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.

(3) was improved by flushing and replenishing the cache by half its size,
so that after a flush/refill, either get or put requests can follow without
immediately triggering another flush/refill.
This also reduced the number of objects in each flush/refill operation.
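
The half-size flush/refill in (3) can be sketched with a small counting
model. This is an illustration only, not the patch itself: it tracks object
counts rather than object pointers, covers only requests of at most
size / 2 objects, and assumes an even cache size and a backend that never
runs dry. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: tracks object counts only, not the objects themselves. */
struct cache_model {
	uint32_t size;        /* configured cache size, assumed even */
	uint32_t len;         /* objects currently in the cache */
	uint32_t backend_ops; /* backend enqueue/dequeue calls so far */
};

/* Put n objects; the model covers only requests within the bounce buffer limit. */
static void model_put(struct cache_model *c, uint32_t n)
{
	assert(n <= c->size / 2);
	assert(c->len <= c->size);
	if (c->len + n <= c->size) {
		/* Sufficient room in the cache. */
		c->len += n;
	} else {
		/* Flush size / 2 objects to the backend, then store the new ones. */
		assert(c->len > c->size / 2);
		c->len = c->len - c->size / 2 + n;
		c->backend_ops++;
	}
}

/* Get n objects; the model assumes the backend never runs dry. */
static void model_get(struct cache_model *c, uint32_t n)
{
	assert(n <= c->size / 2);
	if (n <= c->len) {
		/* The entire request is served from the cache. */
		c->len -= n;
	} else {
		/* Use what the cache holds, refill size / 2, take the remainder. */
		uint32_t remaining = n - c->len;

		c->len = c->size / 2 - remaining;
		c->backend_ops++;
	}
}
```

With a size of 32 and a full cache, putting 8 objects flushes 16 and leaves
24 in the cache; two following gets of 8 objects each are then served from
the cache without touching the backend.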

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.

As a consequence of the improved mempool cache algorithm, some drivers
were updated accordingly:
- The Intel idpf PMD was updated regarding how much to backfill the
  mempool cache in the AVX512 code.
- The NXP dpaa and dpaa2 mempool drivers were updated to not set the
  mempool cache flush threshold; doing this no longer has any effect, and
  thus became superfluous.

Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf fast-free")
---
v4:
* Added Bugzilla ID.
* Added Fixes tag. For reference only.
* Moved fast-free related update of Intel common driver out as a separate
  patch, and depend on that patch.
* Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
  fixing an indentation and adding mbuf instrumentation.
* Omitted unrelated changes to the mempool library, specifically adding
  __rte_restrict and changing a couple of comments to proper sentences.
* Appeased checkpatches by swapping operators in a couple of comparisons.
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(). (Inspired by AI review)
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst          |  7 ++
 doc/guides/rel_notes/release_26_07.rst        | 10 +++
 drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 32 ++++++---
 lib/mempool/rte_mempool.c                     | 14 +---
 lib/mempool/rte_mempool.h                     | 65 +++++++++++--------
 7 files changed, 79 insertions(+), 77 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 35c9b4e06c..40760fffbb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -154,3 +154,10 @@ Deprecation Notices
 * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
   ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
   Those API functions are used internally by DPDK core and netvsc PMD.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 060b26ff61..67fd97fa61 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -24,6 +24,16 @@ DPDK Release 26.07
 New Features
 ------------
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  * The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate.
+
 .. This section should contain new features added in this release.
    Sample format:
 
diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
index 2f9395b3f4..2f8555a026 100644
--- a/drivers/mempool/dpaa/dpaa_mempool.c
+++ b/drivers/mempool/dpaa/dpaa_mempool.c
@@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	struct bman_pool_params params = {
 		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
 	};
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 
 	MEMPOOL_INIT_FUNC_TRACE();
 
@@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
 		   sizeof(struct dpaa_bp_info));
 	mp->pool_data = (void *)bp_info;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
-	}
 
 	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
 	return 0;
diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 02b6741853..ee001d8ce0 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	struct dpaa2_bp_info *bp_info;
 	struct dpbp_attr dpbp_attr;
 	uint32_t bpid;
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 	int ret;
 
 	avail_dpbp = dpaa2_alloc_dpbp_dev();
@@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
 
 	h_bp_list = bp_list;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
-	}
 
 	return 0;
 err4:
diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 9af275cd9d..67893f6b07 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,14 +148,20 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
+		__rte_assume(cache->len < cache->size / 2);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
@@ -565,14 +571,20 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_splitq_rearm_common(rx_bufq);
+			return;
+		}
+
+		/* Backfill the cache from the backend; fetch (size / 2) objects. */
+		__rte_assume(cache->len < cache->size / 2);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rx_bufq->mp, &cache->objs[cache->len], req);
+				(rx_bufq->mp, &cache->objs[cache->len], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
 			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rx_bufq->nb_rx_desc) {
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2e54fc4466..4ece89cb47 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
  * @param cache_size
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
- *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   per-lcore object cache. This argument must be an even number,
+ *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1390,24 +1397,27 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->size / 2);
 	} else {
-		/* The request itself is too big for the cache. */
+		/* The request itself is too big. */
 		goto driver_enqueue_stats_incremented;
 	}
 
@@ -1524,7 +1534,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1558,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1578,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (3 preceding siblings ...)
  2026-04-18 11:15 ` [PATCH v4] " Morten Brørup
@ 2026-04-19  9:55 ` Morten Brørup
  2026-04-22 12:27   ` Morten Brørup
                     ` (4 more replies)
  2026-05-26 14:00 ` [PATCH v6] " Morten Brørup
                   ` (4 subsequent siblings)
  9 siblings, 5 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-19  9:55 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.
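
The example in (2) can be checked with a minimal standalone sketch in C. It
is not DPDK code: the helper names and the CACHE_MAX_SIZE constant are
hypothetical stand-ins, and "remaining" is the part of a get request that
the cache could not serve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_MAX_SIZE 512u /* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE (default 512) */

/* Old check: go via the cache unless the build-time maximum is exceeded. */
static bool old_get_uses_bounce_buffer(uint32_t remaining)
{
	return remaining <= CACHE_MAX_SIZE;
}

/* New check: go via the cache only up to half the configured cache size. */
static bool new_get_uses_bounce_buffer(uint32_t cache_size, uint32_t remaining)
{
	return remaining <= cache_size / 2;
}

/* Objects the old code fetched into the cache when it took the cache path. */
static uint32_t old_backend_fetch(uint32_t cache_size, uint32_t remaining)
{
	assert(old_get_uses_bounce_buffer(remaining));
	return cache_size + remaining;
}
```

With a cache size of 32 and an empty cache, a get of 256 objects took the
cache path under the old check and fetched 288 objects into the cache;
under the new check it exceeds 32 / 2 = 16 and goes directly to the backend.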

3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold
enough objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).
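
To illustrate item 2, here is a minimal sketch of how many objects the old
get path requested from the backend on a cache miss, based on the lines
removed from rte_mempool_do_generic_get() by this patch. The function name
old_get_backend_fetch() and the macro TOY_CACHE_MAX_SIZE are made up for
illustration; they are not part of the mempool API.

```c
#include <assert.h>

#define TOY_CACHE_MAX_SIZE 512u /* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/*
 * Number of objects the old get path dequeued from the backend when the
 * cache could not serve "remaining" objects of a request.
 */
static unsigned int
old_get_backend_fetch(unsigned int cache_size, unsigned int remaining)
{
	assert(cache_size <= TOY_CACHE_MAX_SIZE);

	/* Old check: compared to the build time maximum, not to cache_size. */
	if (remaining > TOY_CACHE_MAX_SIZE)
		return remaining; /* Fetched directly to the destination. */

	/* Otherwise the cache was refilled and used as bounce buffer. */
	return cache_size + remaining;
}
```

With cache_size = 32 and remaining = 256, this returns 288, matching the
example in item 2; only requests above the build time limit bypassed the
cache.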

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.

(3) was improved by flushing and replenishing the cache by half its size,
so a flush/refill can be followed randomly by get or put requests.
This also reduced the number of objects in each flush/refill operation.

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
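
The half-size flush/refill policy described in (1) to (3) can be sketched
with a toy model. This is a simplified illustration, not the code in
rte_mempool.h: the toy_* names are hypothetical, objects are plain ints,
the backend only counts objects and never fails, and statistics and error
handling are omitted.

```c
#include <assert.h>
#include <string.h>

#define TOY_CACHE_MAX_SIZE 512u /* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

struct toy_cache {
	unsigned int size; /* configured cache size */
	unsigned int len;  /* objects currently held */
	int objs[TOY_CACHE_MAX_SIZE]; /* stack; index 0 holds the coldest object */
};

/* Hypothetical backend that only counts traffic and never runs dry. */
struct toy_backend {
	unsigned int enqueued;
	unsigned int dequeued;
};

static void
toy_put(struct toy_cache *c, struct toy_backend *b, const int *objs,
		unsigned int n)
{
	assert(c->size <= TOY_CACHE_MAX_SIZE && c->len <= c->size);
	if (c->len + n > c->size) {
		unsigned int half = c->size / 2;

		if (n > half) {
			/* Above the bounce buffer limit: bypass the cache. */
			b->enqueued += n;
			return;
		}
		/* Flush the (size / 2) coldest objects; slide the hot ones down. */
		b->enqueued += half;
		memmove(&c->objs[0], &c->objs[half],
				sizeof(c->objs[0]) * (c->len - half));
		c->len -= half;
	}
	/* There is room now; push the new objects on top of the stack. */
	memcpy(&c->objs[c->len], objs, sizeof(objs[0]) * n);
	c->len += n;
}

static void
toy_get(struct toy_cache *c, struct toy_backend *b, int *out, unsigned int n)
{
	unsigned int from_cache = n < c->len ? n : c->len;
	unsigned int remaining = n - from_cache;
	unsigned int half = c->size / 2;
	unsigned int i;

	/* Pop as much as possible from the top of the stack. */
	for (i = 0; i < from_cache; i++)
		*out++ = c->objs[--c->len];
	if (remaining == 0)
		return;

	/* The cache is empty at this point. */
	if (remaining > half) {
		/* Above the bounce buffer limit: fetch directly. */
		for (i = 0; i < remaining; i++)
			*out++ = (int)b->dequeued++;
		return;
	}
	/* Refill with (size / 2) objects, then serve the rest from the top. */
	for (i = 0; i < half; i++)
		c->objs[i] = (int)b->dequeued++;
	c->len = half;
	for (i = 0; i < remaining; i++)
		*out++ = c->objs[--c->len];
}
```

For example, with size = 32 and a full cache, putting 8 more objects
flushes 16 objects and leaves 24 in the cache, so the next get of up to
24 objects or put of up to 8 objects is served without touching the
backend.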

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.
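
As a rough cross-check of these figures: going from one miss per 20
objects to one miss per 48 objects reduces the miss rate by a factor of
48 / 20 = 2.4. The sketch below expresses this as expected miss counts;
the helper name expected_misses() is hypothetical, and the sample of 960
objects is an arbitrary choice (divisible by both 20 and 48), not a
measured value.

```c
#include <assert.h>

/* Expected cache misses when one in every objs_per_miss objects misses. */
static unsigned int
expected_misses(unsigned int objects, unsigned int objs_per_miss)
{
	assert(objs_per_miss != 0);
	return objects / objs_per_miss;
}
```

For 960 objects this gives 48 misses at the old rate and 20 misses at the
new rate, i.e. the ratio 48 / 20 = 2.4.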

As a consequence of the improved mempool cache algorithm, some drivers
were updated accordingly:
- The Intel idpf PMD was updated regarding how much to backfill the
  mempool cache in the AVX512 code.
- The NXP dpaa and dpaa2 mempool drivers were updated to not set the
  mempool cache flush threshold; doing this no longer has any effect, and
  thus became superfluous.

Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf fast-free")
---
v5:
* Flush the cache from the bottom, where objects are colder, and move down
  the remaining objects, which are hotter.
* In the Intel idpf PMD, move up the hot objects in the cache and refill
  with cold objects at the bottom.
v4:
* Added Bugzilla ID.
* Added Fixes tag. For reference only.
* Moved fast-free related update of Intel common driver out as a separate
  patch, and depend on that patch.
* Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
  fixing an indentation and adding mbuf instrumentation.
* Omitted unrelated changes to the mempool library, specifically adding
  __rte_restrict and changing a couple of comments to proper sentences.
* Please checkpatches by swapping operators in a couple of comparisons.
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(). (Inspired by AI review)
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst          |  7 ++
 doc/guides/rel_notes/release_26_07.rst        | 10 +++
 drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++---
 lib/mempool/rte_mempool.c                     | 14 +---
 lib/mempool/rte_mempool.h                     | 70 ++++++++++++-------
 7 files changed, 104 insertions(+), 77 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 35c9b4e06c..40760fffbb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -154,3 +154,10 @@ Deprecation Notices
 * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
   ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
   Those API functions are used internally by DPDK core and netvsc PMD.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 060b26ff61..67fd97fa61 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -24,6 +24,16 @@ DPDK Release 26.07
 New Features
 ------------
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  * The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate.
+
 .. This section should contain new features added in this release.
    Sample format:
 
diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
index 2f9395b3f4..2f8555a026 100644
--- a/drivers/mempool/dpaa/dpaa_mempool.c
+++ b/drivers/mempool/dpaa/dpaa_mempool.c
@@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	struct bman_pool_params params = {
 		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
 	};
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 
 	MEMPOOL_INIT_FUNC_TRACE();
 
@@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
 		   sizeof(struct dpaa_bp_info));
 	mp->pool_data = (void *)bp_info;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
-	}
 
 	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
 	return 0;
diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 02b6741853..ee001d8ce0 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	struct dpaa2_bp_info *bp_info;
 	struct dpbp_attr dpbp_attr;
 	uint32_t bpid;
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 	int ret;
 
 	avail_dpbp = dpaa2_alloc_dpbp_dev();
@@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
 
 	h_bp_list = bp_list;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
-	}
 
 	return 0;
 err4:
diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 9af275cd9d..dd2263b8d7 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/*
+		 * Backfill the cache from the backend;
+		 * move up the hot objects in the cache to the top half of the cache,
+		 * and fetch (size / 2) objects to the bottom of the cache.
+		 */
+		__rte_assume(cache->len < cache->size / 2);
+		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
+				sizeof(void *) * cache->len);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[0], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
+			/*
+			 * No further action is required for roll back, as the objects moved
+			 * in the cache were actually copied, and the cache remains intact.
+			 */
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
 				__m128i dma_addr0;
@@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_splitq_rearm_common(rx_bufq);
+			return;
+		}
+
+		/*
+		 * Backfill the cache from the backend;
+		 * move up the hot objects in the cache to the top half of the cache,
+		 * and fetch (size / 2) objects to the bottom of the cache.
+		 */
+		__rte_assume(cache->len < cache->size / 2);
+		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
+				sizeof(void *) * cache->len);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rx_bufq->mp, &cache->objs[cache->len], req);
+				(rx_bufq->mp, &cache->objs[0], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
+			/*
+			 * No further action is required for roll back, as the objects moved
+			 * in the cache were actually copied, and the cache remains intact.
+			 */
 			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rx_bufq->nb_rx_desc) {
 				__m128i dma_addr0;
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2e54fc4466..432c43ab15 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
  * @param cache_size
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
- *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   per-lcore object cache. This argument must be an even number,
+ *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects from the bottom of the cache, where
+		 * objects are less hot, and move down the remaining objects, which
+		 * are more hot, from the upper half of the cache.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
+		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
+				sizeof(void *) * (cache->len - cache->size / 2));
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
 	} else {
-		/* The request itself is too big for the cache. */
+		/* The request itself is too big. */
 		goto driver_enqueue_stats_incremented;
 	}
 
@@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
-- 
2.43.0



* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
@ 2026-04-22 12:27   ` Morten Brørup
  2026-04-27 15:21   ` Morten Brørup
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-22 12:27 UTC (permalink / raw)
  To: dev

Recheck-request: iol-sample-apps-testing



* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
  2026-04-22 12:27   ` Morten Brørup
@ 2026-04-27 15:21   ` Morten Brørup
  2026-04-28  7:44   ` Andrew Rybchenko
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-04-27 15:21 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena

PING for review.

IMHO, reducing the mempool cache miss rate by a factor of 2.4 is a relevant performance improvement [*].

I'd like to see this patch go into DPDK 26.07, so the expected ABI/API breaking cleanup patch can go into DPDK 26.11.

[*] referring to performance data in the patch description:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, and using an effective cache
> size of 384 objects.

-Morten

> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Sunday, 19 April 2026 11.55
> 
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-
> time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put
> into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> When getting objects from a mempool, and the mempool cache did not hold
> so
> many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh, when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility
> purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> 
> (3) was improved by flushing and replenishing the cache by half its
> size,
> so a flush/refill can be followed randomly by get or put requests.
> This also reduced the number of objects in each flush/refill operation.
> 
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, and using an effective cache
> size of 384 objects.
> 
> As a consequence of the improved mempool cache algorithm, some drivers
> were updated accordingly:
> - The Intel idpf PMD was updated regarding how much to backfill the
>   mempool cache in the AVX512 code.
> - The NXP dpaa and dpaa2 mempool drivers were updated to not set the
>   mempool cache flush threshold; doing this no longer has any effect,
> and
>   thus became superfluous.
> 
> Bugzilla ID: 1027
> Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf
> fast-free")
> ---
> v5:
> * Flush the cache from the bottom, where objects are colder, and move
> down
>   the remaining objects, which are hotter.
> * In the Intel idpf PMD, move up the hot objects in the cache and
> refill
>   with cold objects at the bottom.
> v4:
> * Added Bugzilla ID.
> * Added Fixes tag. For reference only.
> * Moved fast-free related update of Intel common driver out as a
> separate
>   patch, and depend on that patch.
> * Omitted unrelated changes to the Intel idpf AVX512 driver,
> specifically
>   fixing an indentation and adding mbuf instrumentation.
> * Omitted unrelated changes to the mempool library, specifically adding
>   __rte_restrict and changing a couple of comments to proper sentences.
> * Please checkpatches by swapping operators in a couple of comparisons.
> v3:
> * Fixed my copy-paste bug in idpf_splitq_rearm().
> v2:
> * Fixed issue found by abidiff:
>   Reverted cache objects array size reduction. Added a note instead.
> * Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
> * Updated idpf_splitq_rearm() like idpf_singleq_rearm().
> * Added a few more __rte_assume(). (Inspired by AI review)
> * Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
>   flush threshold.
> * Added release notes.
> * Added deprecation notes.
> ---
>  doc/guides/rel_notes/deprecation.rst          |  7 ++
>  doc/guides/rel_notes/release_26_07.rst        | 10 +++
>  drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
>  drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
>  .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++---
>  lib/mempool/rte_mempool.c                     | 14 +---
>  lib/mempool/rte_mempool.h                     | 70 ++++++++++++-------
>  7 files changed, 104 insertions(+), 77 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index 35c9b4e06c..40760fffbb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -154,3 +154,10 @@ Deprecation Notices
>  * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
>    ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
>    Those API functions are used internally by DPDK core and netvsc PMD.
> +
> +* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
> +  is obsolete, and will be removed in DPDK 26.11.
> +
> +* mempool: The object array in ``struct rte_mempool_cache`` is
> oversize by
> +  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
> +  DPDK 26.11.
> diff --git a/doc/guides/rel_notes/release_26_07.rst
> b/doc/guides/rel_notes/release_26_07.rst
> index 060b26ff61..67fd97fa61 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -24,6 +24,16 @@ DPDK Release 26.07
>  New Features
>  ------------
> 
> +* **Changed effective size of mempool cache.**
> +
> +  * The effective size of a mempool cache was changed to match the
> specified size at mempool creation; the effective size was previously
> 50 % larger than requested.
> +  * The ``flushthresh`` field of the ``struct rte_mempool_cache``
> became obsolete, but was kept for API/ABI compatibility purposes.
> +  * The effective size of the ``objs`` array in the ``struct
> rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but
> its size was kept for API/ABI compatibility purposes.
> +
> +* **Improved mempool cache flush/refill algorithm.**
> +
> +  * The mempool cache flush/refill algorithm was improved, to reduce
> the mempool cache miss rate.
> +
>  .. This section should contain new features added in this release.
>     Sample format:
> 
> diff --git a/drivers/mempool/dpaa/dpaa_mempool.c
> b/drivers/mempool/dpaa/dpaa_mempool.c
> index 2f9395b3f4..2f8555a026 100644
> --- a/drivers/mempool/dpaa/dpaa_mempool.c
> +++ b/drivers/mempool/dpaa/dpaa_mempool.c
> @@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
>  	struct bman_pool_params params = {
>  		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
>  	};
> -	unsigned int lcore_id;
> -	struct rte_mempool_cache *cache;
> 
>  	MEMPOOL_INIT_FUNC_TRACE();
> 
> @@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
>  	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
>  		   sizeof(struct dpaa_bp_info));
>  	mp->pool_data = (void *)bp_info;
> -	/* Update per core mempool cache threshold to optimal value which
> is
> -	 * number of buffers that can be released to HW buffer pool in
> -	 * a single API call.
> -	 */
> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> -		cache = &mp->local_cache[lcore_id];
> -		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
> -			lcore_id, cache->flushthresh,
> -			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
> -		if (cache->flushthresh)
> -			cache->flushthresh = cache->size +
> DPAA_MBUF_MAX_ACQ_REL;
> -	}
> 
>  	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
>  	return 0;
> diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> index 02b6741853..ee001d8ce0 100644
> --- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> +++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
> @@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
>  	struct dpaa2_bp_info *bp_info;
>  	struct dpbp_attr dpbp_attr;
>  	uint32_t bpid;
> -	unsigned int lcore_id;
> -	struct rte_mempool_cache *cache;
>  	int ret;
> 
>  	avail_dpbp = dpaa2_alloc_dpbp_dev();
> @@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
>  	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d",
> dpbp_attr.bpid);
> 
>  	h_bp_list = bp_list;
> -	/* Update per core mempool cache threshold to optimal value which
> is
> -	 * number of buffers that can be released to HW buffer pool in
> -	 * a single API call.
> -	 */
> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> -		cache = &mp->local_cache[lcore_id];
> -		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d ->
> %d",
> -			lcore_id, cache->flushthresh,
> -			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
> -		if (cache->flushthresh)
> -			cache->flushthresh = cache->size +
> DPAA2_MBUF_MAX_ACQ_REL;
> -	}
> 
>  	return 0;
>  err4:
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> index 9af275cd9d..dd2263b8d7 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> @@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
> 
> -		/* How many do we require i.e. number to fill the cache +
> the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> +		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> +			idpf_singleq_rearm_common(rxq);
> +			return;
> +		}
> +
> +		/*
> +		 * Backfill the cache from the backend;
> +		 * move up the hot objects in the cache to the top half of
> the cache,
> +		 * and fetch (size / 2) objects to the bottom of the cache.
> +		 */
> +		__rte_assume(cache->len < cache->size / 2);
> +		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> +				sizeof(void *) * cache->len);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rxq->mp, &cache->objs[cache->len], req);
> +				(rxq->mp, &cache->objs[0], cache->size / 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
> +			/*
> +			 * No further action is required for roll back, as
> the objects moved
> +			 * in the cache were actually copied, and the cache
> remains intact.
> +			 */
>  			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rxq->nb_rx_desc) {
>  				__m128i dma_addr0;
> @@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
> 
> -		/* How many do we require i.e. number to fill the cache +
> the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> +		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> +			idpf_splitq_rearm_common(rx_bufq);
> +			return;
> +		}
> +
> +		/*
> +		 * Backfill the cache from the backend;
> +		 * move up the hot objects in the cache to the top half of
> the cache,
> +		 * and fetch (size / 2) objects to the bottom of the cache.
> +		 */
> +		__rte_assume(cache->len < cache->size / 2);
> +		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> +				sizeof(void *) * cache->len);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rx_bufq->mp, &cache->objs[cache->len], req);
> +				(rx_bufq->mp, &cache->objs[0], cache->size /
> 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
> +			/*
> +			 * No further action is required for roll back, as
> the objects moved
> +			 * in the cache were actually copied, and the cache
> remains intact.
> +			 */
>  			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rx_bufq->nb_rx_desc) {
>  				__m128i dma_addr0;
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 3042d94c14..805b52cc58 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -52,11 +52,6 @@ static void
>  mempool_event_callback_invoke(enum rte_mempool_event event,
>  			      struct rte_mempool *mp);
> 
> -/* Note: avoid using floating point since that compiler
> - * may not think that is constant.
> - */
> -#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
> -
>  #if defined(RTE_ARCH_X86)
>  /*
>   * return the greatest common divisor between a and b (fast algorithm)
> @@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
>  static void
>  mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
>  {
> -	/* Check that cache have enough space for flush threshold */
> -	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
> -
>  	cache->size = size;
> -	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
> +	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility
> purposes only */
>  	cache->len = 0;
>  }
> 
> @@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned
> n, unsigned elt_size,
> 
>  	/* asked cache too big */
>  	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
> -	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
> +	    cache_size > n) {
>  		rte_errno = EINVAL;
>  		return NULL;
>  	}
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 2e54fc4466..432c43ab15 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
>   */
>  struct __rte_cache_aligned rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
> -	uint32_t flushthresh; /**< Threshold before we flush excess
> elements */
> +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility
> purposes only */
>  	uint32_t len;	      /**< Current cache count */
>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>  	uint32_t unused;
> @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
>  	/**
>  	 * Cache objects
>  	 *
> -	 * Cache is allocated to this size to allow it to overflow in
> certain
> -	 * cases to avoid needless emptying of cache.
> +	 * Note:
> +	 * Cache is allocated at double size for API/ABI compatibility
> purposes only.
> +	 * When reducing its size at an API/ABI breaking release,
> +	 * remember to add a cache guard after it.
>  	 */
>  	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
>  };
> @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
>   * @param cache_size
>   *   If cache_size is non-zero, the rte_mempool library will try to
>   *   limit the accesses to the common lockless pool, by maintaining a
> - *   per-lcore object cache. This argument must be lower or equal to
> - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> + *   per-lcore object cache. This argument must be an even number,
> + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
>   *   The access to the per-lcore table is of course
>   *   faster than the multi-producer/consumer pool. The cache can be
>   *   disabled if the cache_size argument is set to 0; it can be useful
> to
>   *   avoid losing objects in cache.
> + *   Note:
> + *   Mempool put/get requests of more than cache_size / 2 objects may
> be
> + *   partially or fully served directly by the multi-producer/consumer
> + *   pool, to avoid the overhead of copying the objects twice (instead
> of
> + *   once) when using the cache as a bounce buffer.
>   * @param private_data_size
>   *   The size of the private data appended after the mempool
>   *   structure. This is useful for storing some private data after the
> @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> 
> -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE *
> 2);
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= cache->flushthresh);
> -	if (likely(cache->len + n <= cache->flushthresh)) {
> +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->len <= cache->size);
> +	if (likely(cache->len + n <= cache->size)) {
>  		/* Sufficient room in the cache for the objects. */
>  		cache_objs = &cache->objs[cache->len];
>  		cache->len += n;
> -	} else if (n <= cache->flushthresh) {
> +	} else if (n <= cache->size / 2) {
>  		/*
> -		 * The cache is big enough for the objects, but - as
> detected by
> -		 * the comparison above - has insufficient room for them.
> -		 * Flush the cache to make room for the objects.
> +		 * The number of objects is within the cache bounce buffer
> limit,
> +		 * but - as detected by the comparison above - the cache
> has
> +		 * insufficient room for them.
> +		 * Flush the cache to the backend to make room for the
> objects;
> +		 * flush (size / 2) objects from the bottom of the cache,
> where
> +		 * objects are less hot, and move down the remaining
> objects, which
> +		 * are more hot, from the upper half of the cache.
>  		 */
> -		cache_objs = &cache->objs[0];
> -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> -		cache->len = n;
> +		__rte_assume(cache->len > cache->size / 2);
> +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
> +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> +				sizeof(void *) * (cache->len - cache->size / 2));
> +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> +		cache->len = cache->len - cache->size / 2 + n;
>  	} else {
> -		/* The request itself is too big for the cache. */
> +		/* The request itself is too big. */
>  		goto driver_enqueue_stats_incremented;
>  	}
> 
> @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	/* The cache is a stack, so copy will be in reverse order. */
>  	cache_objs = &cache->objs[cache->len];
> 
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
>  	if (likely(n <= cache->len)) {
>  		/* The entire request can be satisfied from the cache. */
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
> 
> -	/* Dequeue below would overflow mem allocated for cache? */
> -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> +	/* Dequeue below would exceed the cache bounce buffer limit? */
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	if (unlikely(remaining > cache->size / 2))
>  		goto driver_dequeue;
> 
> -	/* Fill the cache from the backend; fetch size + remaining
> objects. */
> -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> -			cache->size + remaining);
> +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size /
> 2);
>  	if (unlikely(ret < 0)) {
>  		/*
>  		 * We are buffer constrained, and not able to fetch all
> that.
> @@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> 
> -	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	cache_objs = &cache->objs[cache->size + remaining];
> -	cache->len = cache->size;
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= cache->size / 2);
> +	cache_objs = &cache->objs[cache->size / 2];
> +	cache->len = cache->size / 2 - remaining;
>  	for (index = 0; index < remaining; index++)
>  		*obj_table++ = *--cache_objs;
> 
> --
> 2.43.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5] mempool: improve cache behaviour and performance
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
  2026-04-22 12:27   ` Morten Brørup
  2026-04-27 15:21   ` Morten Brørup
@ 2026-04-28  7:44   ` Andrew Rybchenko
  2026-05-22 16:11   ` Bruce Richardson
  2026-05-22 16:12   ` Bruce Richardson
  4 siblings, 0 replies; 46+ messages in thread
From: Andrew Rybchenko @ 2026-04-28  7:44 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena

On 4/19/26 12:55 PM, Morten Brørup wrote:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> When getting objects from a mempool, and the mempool cache did not hold so
> many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh, when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> 
> (3) was improved by flushing and replenishing the cache by half its size,
> so a flush/refill can be followed randomly by get or put requests.
> This also reduced the number of objects in each flush/refill operation.
> 
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, and using an effective cache
> size of 384 objects.
> 
> As a consequence of the improved mempool cache algorithm, some drivers
> were updated accordingly:
> - The Intel idpf PMD was updated regarding how much to backfill the
>    mempool cache in the AVX512 code.
> - The NXP dpaa and dpaa2 mempool drivers were updated to not set the
>    mempool cache flush threshold; doing this no longer has any effect, and
>    thus became superfluous.
> 
> Bugzilla ID: 1027
> Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf fast-free")

Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5] mempool: improve cache behaviour and performance
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
                     ` (2 preceding siblings ...)
  2026-04-28  7:44   ` Andrew Rybchenko
@ 2026-05-22 16:11   ` Bruce Richardson
  2026-05-26  8:41     ` Morten Brørup
  2026-05-22 16:12   ` Bruce Richardson
  4 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2026-05-22 16:11 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 

Agree in principle with most of these changes. As we discussed at the DPDK
summit conference, the only issue I really have is with the threshold limits
here - allocating and freeing only half the cache at a time seems overly
conservative. Thinking about use-cases:

1 for apps where alloc + free (generally Rx+Tx) is on the same core(s),
  then we should run (almost) entirely out of cache.
2 for apps where we have alloc and free on different cores, then we have
  some caches always being filled and others always being emptied

For case #1, we only need worry about the thresholds for the odd case where
we have a burst that causes us to overflow our cache (and we can't increase
our cache size to cope and avoid that). Otherwise the thresholds don't
matter. However, for case #2, the thresholds are constantly involved, as we
are always going to the backing store. In this case, we really want to have the
allocs *always* fill the cache completely, and the frees completely empty
the cache.

Because of this, while we want to avoid cases where we fill the cache
completely only to have a further free causing it to be flushed, because of
case #2 we cannot be overly conservative in how much we free/empty.
Ideally, we want to fill to full less a single burst, and empty leaving
only a single burst in the cache. Unfortunately, we don't know what those
burst limits are, so we have to try and guess the best behaviour from
everything else.

All that said, comments with specific suggestions inline.

/Bruce

<snip>

> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 2e54fc4466..432c43ab15 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
>   */
>  struct __rte_cache_aligned rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
> -	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
>  	uint32_t len;	      /**< Current cache count */
>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>  	uint32_t unused;
> @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
>  	/**
>  	 * Cache objects
>  	 *
> -	 * Cache is allocated to this size to allow it to overflow in certain
> -	 * cases to avoid needless emptying of cache.
> +	 * Note:
> +	 * Cache is allocated at double size for API/ABI compatibility purposes only.
> +	 * When reducing its size at an API/ABI breaking release,
> +	 * remember to add a cache guard after it.
>  	 */
>  	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
>  };
> @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
>   * @param cache_size
>   *   If cache_size is non-zero, the rte_mempool library will try to
>   *   limit the accesses to the common lockless pool, by maintaining a
> - *   per-lcore object cache. This argument must be lower or equal to
> - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> + *   per-lcore object cache. This argument must be an even number,
> + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
>   *   The access to the per-lcore table is of course
>   *   faster than the multi-producer/consumer pool. The cache can be
>   *   disabled if the cache_size argument is set to 0; it can be useful to
>   *   avoid losing objects in cache.
> + *   Note:
> + *   Mempool put/get requests of more than cache_size / 2 objects may be
> + *   partially or fully served directly by the multi-producer/consumer
> + *   pool, to avoid the overhead of copying the objects twice (instead of
> + *   once) when using the cache as a bounce buffer.
>   * @param private_data_size
>   *   The size of the private data appended after the mempool
>   *   structure. This is useful for storing some private data after the
> @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
>  
> -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= cache->flushthresh);
> -	if (likely(cache->len + n <= cache->flushthresh)) {
> +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->len <= cache->size);
> +	if (likely(cache->len + n <= cache->size)) {
>  		/* Sufficient room in the cache for the objects. */
>  		cache_objs = &cache->objs[cache->len];
>  		cache->len += n;
> -	} else if (n <= cache->flushthresh) {
> +	} else if (n <= cache->size / 2) {
>  		/*
> -		 * The cache is big enough for the objects, but - as detected by
> -		 * the comparison above - has insufficient room for them.
> -		 * Flush the cache to make room for the objects.
> +		 * The number of objects is within the cache bounce buffer limit,
> +		 * but - as detected by the comparison above - the cache has
> +		 * insufficient room for them.
> +		 * Flush the cache to the backend to make room for the objects;
> +		 * flush (size / 2) objects from the bottom of the cache, where
> +		 * objects are less hot, and move down the remaining objects, which
> +		 * are more hot, from the upper half of the cache.
>  		 */
> -		cache_objs = &cache->objs[0];
> -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> -		cache->len = n;
> +		__rte_assume(cache->len > cache->size / 2);
> +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
> +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> +				sizeof(void *) * (cache->len - cache->size / 2));
> +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> +		cache->len = cache->len - cache->size / 2 + n;

The flushing of only half the cache I'm not so certain about. I agree that
we want to not flush to empty, but I also think that we want to do more
than a half-flush, especially since we do an enqueue to the cache
immediately afterwards. Consider the case where we have a cache size of
128, and we do an enqueue of 32, with the cache currently full. In that
case we only flush 64, reducing the cache to 64, but then immediately
bringing it back up to 96. For cases where we have pipelines with all alloc
on one core and all free on another, this half-flush would be inefficient.

Instead, I would look to have a lower target threshold post-flush, and I
would suggest 25% - taking into account the newly freed buffers. For
example:

	/* if n > our target of 1/4 full, flush everything,
	 * else flush so that we end up with 1/4 full after n added.
	 */
	flush_count = n > cache->size/4 ? cache->len :
			(cache->len + n) - cache->size/4;
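
To make the arithmetic of the two flush policies concrete, here is a
standalone sketch comparing the half-flush in the v5 patch with the 25%
target suggested above. It is not DPDK code: the helper names
(flush_count_half, flush_count_quarter, len_after_put) are hypothetical and
only model the fill-level bookkeeping, not the object copying.

```c
#include <assert.h>
#include <stdint.h>

/* Objects flushed by the v5 patch: always half the cache size. */
static uint32_t
flush_count_half(uint32_t size)
{
	return size / 2;
}

/*
 * Objects flushed with the suggested 25% target (hypothetical helper):
 * flush everything if n alone exceeds the target, otherwise flush so
 * that the cache holds size / 4 objects once the n objects are added.
 */
static uint32_t
flush_count_quarter(uint32_t len, uint32_t size, uint32_t n)
{
	/* Only reached when the put does not fit into the cache. */
	assert(len <= size && len + n > size);
	return n > size / 4 ? len : (len + n) - size / 4;
}

/* Cache fill level after flushing 'flushed' objects and adding n. */
static uint32_t
len_after_put(uint32_t len, uint32_t flushed, uint32_t n)
{
	assert(flushed <= len);
	return len - flushed + n;
}
```

With size = 128, len = 128 and n = 32, the half-flush leaves 96 objects in
the cache, whereas the 25% target flushes all 128 objects and leaves 32.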


>  	} else {
> -		/* The request itself is too big for the cache. */
> +		/* The request itself is too big. */
>  		goto driver_enqueue_stats_incremented;

I think original comment is better. The request itself is not too big for
the whole mempool, just for the cache.

>  	}
>  
> @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	/* The cache is a stack, so copy will be in reverse order. */
>  	cache_objs = &cache->objs[cache->len];
>  
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
>  	if (likely(n <= cache->len)) {
>  		/* The entire request can be satisfied from the cache. */
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
>  
> -	/* Dequeue below would overflow mem allocated for cache? */
> -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> +	/* Dequeue below would exceed the cache bounce buffer limit? */
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	if (unlikely(remaining > cache->size / 2))
>  		goto driver_dequeue;
>  
> -	/* Fill the cache from the backend; fetch size + remaining objects. */
> -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> -			cache->size + remaining);
> +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);

Again, the cache->size / 2 doesn't seem right here. We at most half-fill
the cache and then take some objects from that, meaning that we have just done
a refill of the cache but end the function with it less than half full. Since
we take from this value, I'd suggest just filling the cache completely.
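
The effect on an alloc-only lcore can be sketched with a small standalone
model. This is not DPDK code: model_get and count_refills are hypothetical
names, and "filling the cache completely" is assumed here to mean fetching
cache->size objects and then taking the remaining objects from them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of one get of n objects from a cache holding *len.
 * 'refill' is the number of objects fetched from the backend on a miss:
 * size / 2 models the v5 patch, size models a complete fill.
 * Returns 1 if the backend was accessed, 0 otherwise.
 */
static int
model_get(uint32_t *len, uint32_t n, uint32_t refill)
{
	uint32_t remaining;

	if (n <= *len) {
		*len -= n;	/* served entirely from the cache */
		return 0;
	}
	remaining = n - *len;	/* the cache is drained first */
	assert(remaining <= refill);
	*len = refill - remaining;	/* refill, then take the rest */
	return 1;
}

/* Count backend accesses for 'bursts' consecutive gets of n objects. */
static uint32_t
count_refills(uint32_t bursts, uint32_t n, uint32_t refill)
{
	uint32_t len = 0, refills = 0, i;

	for (i = 0; i < bursts; i++)
		refills += model_get(&len, n, refill);
	return refills;
}
```

For 8 bursts of 32 objects on a cache of size 128, this model counts 4
backend accesses with a refill of 64 and 2 with a refill of 128.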

>  	if (unlikely(ret < 0)) {
>  		/*
>  		 * We are buffer constrained, and not able to fetch all that.
> @@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
>  
> -	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	cache_objs = &cache->objs[cache->size + remaining];
> -	cache->len = cache->size;
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= cache->size / 2);
> +	cache_objs = &cache->objs[cache->size / 2];
> +	cache->len = cache->size / 2 - remaining;
>  	for (index = 0; index < remaining; index++)
>  		*obj_table++ = *--cache_objs;
>  
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5] mempool: improve cache behaviour and performance
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
                     ` (3 preceding siblings ...)
  2026-05-22 16:11   ` Bruce Richardson
@ 2026-05-22 16:12   ` Bruce Richardson
  2026-05-26  8:57     ` Morten Brørup
  4 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2026-05-22 16:12 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> When getting objects from a mempool, and the mempool cache did not hold so
> many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh, when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> 
> (3) was improved by flushing and replenishing the cache by half its size,
> so a flush/refill can be followed randomly by get or put requests.
> This also reduced the number of objects in each flush/refill operation.
> 
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, and using an effective cache
> size of 384 objects.
> 
> As a consequence of the improved mempool cache algorithm, some drivers
> were updated accordingly:
> - The Intel idpf PMD was updated regarding how much to backfill the
>   mempool cache in the AVX512 code.
> - The NXP dpaa and dpaa2 mempool drivers were updated to not set the
>   mempool cache flush threshold; doing this no longer has any effect, and
>   thus became superfluous.
> 
> Bugzilla ID: 1027
> Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf fast-free")
> ---
> v5:
> * Flush the cache from the bottom, where objects are colder, and move down
>   the remaining objects, which are hotter.
> * In the Intel idpf PMD, move up the hot objects in the cache and refill
>   with cold objects at the bottom.
> v4:
> * Added Bugzilla ID.
> * Added Fixes tag. For reference only.
> * Moved fast-free related update of Intel common driver out as a separate
>   patch, and depend on that patch.
> * Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
>   fixing an indentation and adding mbuf instrumentation.
> * Omitted unrelated changes to the mempool library, specifically adding
>   __rte_restrict and changing a couple of comments to proper sentences.
> * Please checkpatches by swapping operators in a couple of comparisons.
> v3:
> * Fixed my copy-paste bug in idpf_splitq_rearm().
> v2:
> * Fixed issue found by abidiff:
>   Reverted cache objects array size reduction. Added a note instead.
> * Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
> * Updated idpf_splitq_rearm() like idpf_singleq_rearm().
> * Added a few more __rte_assume(). (Inspired by AI review)
> * Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
>   flush threshold.
> * Added release notes.
> * Added deprecation notes.
> ---
>  doc/guides/rel_notes/deprecation.rst          |  7 ++
>  doc/guides/rel_notes/release_26_07.rst        | 10 +++
>  drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
>  drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
>  .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++---
>  lib/mempool/rte_mempool.c                     | 14 +---
>  lib/mempool/rte_mempool.h                     | 70 ++++++++++++-------
>  7 files changed, 104 insertions(+), 77 deletions(-)
> 
Can the idpf and dpaa changes be made in separate patches, so we can review
the mempool changes alone in a single patch? Even if the commits can't work
logically together, perhaps they can be separated for review, and then
squashed on apply?

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread

* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-22 16:11   ` Bruce Richardson
@ 2026-05-26  8:41     ` Morten Brørup
  2026-05-26  9:39       ` Bruce Richardson
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-05-26  8:41 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 22 May 2026 18.12
> 
> On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> > This patch refactors the mempool cache to eliminate some unexpected
> > behaviour and reduce the mempool cache miss rate.
> >
> 
> Agree in principle with most of these changes. As we discussed at the DPDK
> summit conference, only issue I really have is with the threshold limits
> here - allocating and freeing only half the cache at a time seems overly
> conservative. Thinking about use-cases:
> 
> 1 for apps where alloc + free (generally Rx+Tx) is on the same core(s),
>   then we should run (almost) entirely out of cache.

I strongly disagree about any goal to run the cache low.
The primary goal is to minimize the cache miss (refill and replenish) rate.

> 2 for apps where we have alloc and free on different cores, then we have
>   some caches always being filled and others always being emptied

Agree.

> 
> For case #1, we only need worry about the thresholds for the odd case
> where we have a burst that causes us to overflow our cache (and we
> can't increase our cache size to cope and avoid that). 
> Otherwise the thresholds don't matter.

It seems like you assume the application only does something like this:
Rx -> Rewrite -> Tx

In that case, the per-lcore cache only needs capacity for one burst, yes.
With my patch, the cache can be rightsized by requesting a cache size of 2 * burst size.
(Then the fill level will be either size/2 or empty, i.e. one burst or zero.
This also happens to meet your suggested goal about low fill level, which I disagree with.)
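
The rightsizing claim above can be checked with a small fill-level model.
This is only a sketch, not DPDK code: the model_* names are made up for
illustration, the model tracks only the fill level (no objects), and it
assumes every request is at most size / 2 so the bounce buffer limit is
never exceeded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fill-level model of the patched cache; not DPDK code. */
struct model_cache {
	uint32_t size;   /* configured cache size (even) */
	uint32_t len;    /* current fill level */
	uint32_t misses; /* number of backend transactions */
};

/* Put n objects; this model requires n <= size / 2. */
static void
model_put(struct model_cache *c, uint32_t n)
{
	assert(n <= c->size / 2);
	if (c->len + n <= c->size) {
		c->len += n;
	} else {
		/* Drain size / 2 objects to the backend, then add n. */
		c->misses++;
		c->len = c->len - c->size / 2 + n;
	}
}

/* Get n objects; this model requires n <= size / 2. */
static void
model_get(struct model_cache *c, uint32_t n)
{
	assert(n <= c->size / 2);
	if (n <= c->len) {
		c->len -= n;
	} else {
		/* Take what is cached, replenish size / 2, take the rest. */
		uint32_t remaining = n - c->len;

		c->misses++;
		c->len = c->size / 2 - remaining;
	}
}

/*
 * Run `rounds` iterations of Rx (get one burst) followed by Tx (put one
 * burst) with size == 2 * burst, and return the number of misses.
 * The asserts encode the claim: the fill level is always 0 or size / 2.
 */
static uint32_t
model_rx_tx(struct model_cache *c, uint32_t burst, uint32_t rounds)
{
	assert(c->size == 2 * burst);
	for (uint32_t i = 0; i < rounds; i++) {
		model_get(c, burst);
		assert(c->len == 0 || c->len == c->size / 2);
		model_put(c, burst);
		assert(c->len == 0 || c->len == c->size / 2);
	}
	return c->misses;
}
```

Starting from an empty cache with size = 64 and burst = 32, only the very
first get goes to the backend in this model; after that the cache
alternates between empty and one burst.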

However, I don't think that is a realistic use case.
Many apps do something like this:

     |-> Rewrite ->|
Rx ->|             |-> Tx
     |-> Hold      |
         Release ->|

They often hold back packets before they are transmitted.
For a simple router, when the destination IP address is not in the neighbor table, packets to that IP address are queued until ARP/ND has been resolved, and then they are dequeued and transmitted.
Or apps performing shaping or pacing, where packets are held back in queues, and dequeued at the appropriate time.
For such apps, the waves are much bigger (than the simple Rx->Rewrite->Tx use case).

With a random enqueue/dequeue pattern, replenishing/draining the cache to size/2 minimizes the probability of reaching one of its edges (empty or full), triggering a "cache miss" (refill/replenish).
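
To illustrate the argument, here is a toy simulation. It is a sketch
under strong assumptions, not a measurement: each request is a fair coin
flip between get and put (a deterministic LCG stands in for traffic),
all requests have the same size n <= size / 2, and the "full" policy
(drain to empty, replenish to full) is a hypothetical alternative used
only for comparison. The names coin() and sim_misses() are made up.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit LCG (Knuth's MMIX constants); returns the top bit as a coin flip. */
static uint32_t
coin(uint64_t *state)
{
	*state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
	return (uint32_t)(*state >> 63);
}

/*
 * Count backend transactions ("misses") over `ops` random get/put
 * requests of n objects each, with n <= size / 2.
 * half != 0: drain/replenish size / 2 objects, as in this patch.
 * half == 0: hypothetical drain-to-empty / replenish-to-full policy.
 */
static uint32_t
sim_misses(uint32_t size, uint32_t n, uint32_t ops, int half)
{
	uint64_t rng = 1;
	uint32_t len = size / 2;
	uint32_t misses = 0;

	assert(n <= size / 2);
	for (uint32_t i = 0; i < ops; i++) {
		if (coin(&rng)) {
			/* put */
			if (len + n <= size) {
				len += n;
			} else {
				misses++;
				len = (half ? len - size / 2 : 0) + n;
			}
		} else {
			/* get */
			if (n <= len) {
				len -= n;
			} else {
				uint32_t remaining = n - len;

				misses++;
				len = (half ? size / 2 : size) - remaining;
			}
		}
		assert(len <= size);
	}
	return misses;
}
```

Calling sim_misses(512, 64, 1000000, 1) and sim_misses(512, 64, 1000000, 0)
feeds both policies the same request sequence; in this model the size/2
policy ends up with fewer misses, because after each miss the fill level
restarts near the middle instead of one burst away from an edge.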

> However, for case #2, the thresholds are constantly involved as we
> always are going to backing store. In this case, we really want to have the
> allocs *always* fill the cache completely, and the frees completely empty
> the cache.

Agree.

> 
> Because of this, while we want to avoid cases where we fill the cache
> completely only to have a further free causing it to be flushed, because of
> case #2 we cannot be overly conservative in how much we free/empty.
> Ideally, we want to fill to full less a single burst, and empty leaving
> only a single burst in the cache. Unfortunately, we don't know what those
> burst limits are, so we have to try and guess the best behaviour from
> everything else.

I agree about not wanting to be overly conservative.
But in the use cases I have described for #1, I don't think a target fill level of size/2 is overly conservative.

I also acknowledge that this patch doubles the mempool cache miss rate for #2.
E.g. with a cache size of 512 and burst size of 64, the per-burst miss rate will be 64 / (1/2 * 512) = 1/4, compared to 64 / 512 = 1/8 with a full replenish/drain algorithm.

In theory, we could make it build time configurable to optimize mempools for #2.
But mempools are also used for other objects than mbufs, so that would have unwanted side effects for non-mbuf mempools.

If we went for an algorithm targeting replenish/drain at 25 % from the edges, the per-burst miss rate for #2 would be: 64 / (3/4 * 512) = 1/6.
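
The three per-burst miss rates for #2 quoted above (1/4, 1/8 and 1/6)
follow from one formula: burst size divided by the number of objects
moved per backend transaction. A small sketch, with made-up helper names
and assuming the transaction size is a simple fraction of the cache size:

```c
#include <assert.h>
#include <stdint.h>

struct fraction {
	uint32_t num;
	uint32_t den;
};

static uint32_t
gcd_u32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Per-burst miss rate for use case #2 (an lcore that only gets or only
 * puts): burst / (objects moved per backend transaction), where each
 * transaction moves size * xfer_num / xfer_den objects.
 * Assumes size is divisible by xfer_den.
 */
static struct fraction
miss_rate(uint32_t size, uint32_t burst, uint32_t xfer_num, uint32_t xfer_den)
{
	uint32_t xfer = size / xfer_den * xfer_num;
	uint32_t g = gcd_u32(burst, xfer);
	struct fraction r = { burst / g, xfer / g };

	return r;
}
```

With size = 512 and burst = 64, miss_rate(512, 64, 1, 2) is the size/2
algorithm in this patch, miss_rate(512, 64, 1, 1) is a full
replenish/drain, and miss_rate(512, 64, 3, 4) is the 25 % from the edges
variant.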

How about addressing #2 in the release notes:
We describe that the cache refill/drain algorithm has been changed to only refill/drain to 50 % of the cache size, so pipelined applications performing Rx (mempool get) and Tx (mempool put) on separate cores should configure their mbuf pools with double the cache size they previously used to achieve similar performance.

> 
> All that said, comments with specific suggestions inline.
> 
> /Bruce
> 
> <snip>
> 
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 2e54fc4466..432c43ab15 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
> >   */
> >  struct __rte_cache_aligned rte_mempool_cache {
> >  	uint32_t size;	      /**< Size of the cache */
> > -	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> > +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
> >  	uint32_t len;	      /**< Current cache count */
> >  #ifdef RTE_LIBRTE_MEMPOOL_STATS
> >  	uint32_t unused;
> > @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
> >  	/**
> >  	 * Cache objects
> >  	 *
> > -	 * Cache is allocated to this size to allow it to overflow in certain
> > -	 * cases to avoid needless emptying of cache.
> > +	 * Note:
> > +	 * Cache is allocated at double size for API/ABI compatibility purposes only.
> > +	 * When reducing its size at an API/ABI breaking release,
> > +	 * remember to add a cache guard after it.
> >  	 */
> >  	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
> >  };
> > @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
> >   * @param cache_size
> >   *   If cache_size is non-zero, the rte_mempool library will try to
> >   *   limit the accesses to the common lockless pool, by maintaining a
> > - *   per-lcore object cache. This argument must be lower or equal to
> > - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> > + *   per-lcore object cache. This argument must be an even number,
> > + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
> >   *   The access to the per-lcore table is of course
> >   *   faster than the multi-producer/consumer pool. The cache can be
> >   *   disabled if the cache_size argument is set to 0; it can be useful to
> >   *   avoid losing objects in cache.
> > + *   Note:
> > + *   Mempool put/get requests of more than cache_size / 2 objects may be
> > + *   partially or fully served directly by the multi-producer/consumer
> > + *   pool, to avoid the overhead of copying the objects twice (instead of
> > + *   once) when using the cache as a bounce buffer.
> >   * @param private_data_size
> >   *   The size of the private data appended after the mempool
> >   *   structure. This is useful for storing some private data after the
> > @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
> >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> >
> > -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > -	__rte_assume(cache->len <= cache->flushthresh);
> > -	if (likely(cache->len + n <= cache->flushthresh)) {
> > +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > +	__rte_assume(cache->len <= cache->size);
> > +	if (likely(cache->len + n <= cache->size)) {
> >  		/* Sufficient room in the cache for the objects. */
> >  		cache_objs = &cache->objs[cache->len];
> >  		cache->len += n;
> > -	} else if (n <= cache->flushthresh) {
> > +	} else if (n <= cache->size / 2) {
> >  		/*
> > -		 * The cache is big enough for the objects, but - as detected by
> > -		 * the comparison above - has insufficient room for them.
> > -		 * Flush the cache to make room for the objects.
> > +		 * The number of objects is within the cache bounce buffer limit,
> > +		 * but - as detected by the comparison above - the cache has
> > +		 * insufficient room for them.
> > +		 * Flush the cache to the backend to make room for the objects;
> > +		 * flush (size / 2) objects from the bottom of the cache, where
> > +		 * objects are less hot, and move down the remaining objects, which
> > +		 * are more hot, from the upper half of the cache.
> >  		 */
> > -		cache_objs = &cache->objs[0];
> > -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > -		cache->len = n;
> > +		__rte_assume(cache->len > cache->size / 2);
> > +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
> > +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> > +				sizeof(void *) * (cache->len - cache->size / 2));
> > +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> > +		cache->len = cache->len - cache->size / 2 + n;
> 
> The flushing of only half the cache I'm not so certain about. I agree that
> we want to not flush to empty, but I also think that we want to do more
> than a half-flush, especially since we do an enqueue to the cache
> immediately afterwards. Consider the case where we have a cache size of
> 128, and we do an enqueue of 32, with the cache currently full. In that
> case we only flush 64, reducing the cache to 64, but then immediately
> bringing it back up to 96.

I thought in depth about whether the flush/replenish sizes should consider the request size or not. (E.g. whether I should replenish size/2 or size/2+request.)
I decided not to consider the request size, for two reasons:
a) It roughly doesn't matter, especially when considering a sequence of random get/put requests.
b) The size of the backend transactions becomes fixed, which has performance benefits: With my patch, they are always size/2, so if the cache size is 2^N, the backend transactions are 2^(N-1) objects, i.e. also a power of two, and CPU cache line aligned.
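
For reference, the fixed-size half flush can be modelled on plain arrays.
This is a sketch only: ints stand in for object pointers, the toy_*
names and the array-based backend are made up, memmove() replaces
rte_memcpy(), and requests larger than size / 2 (which bypass the cache
in the patch) are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CACHE_MAX 128

struct toy_cache {
	uint32_t size;
	uint32_t len;
	int objs[TOY_CACHE_MAX];
};

/* Hypothetical backend: simply records what was flushed to it. */
struct toy_backend {
	uint32_t count;
	int objs[4 * TOY_CACHE_MAX];
};

/* Put n objects (n <= size / 2), following the patched put path. */
static void
toy_put(struct toy_cache *c, struct toy_backend *b, const int *objs, uint32_t n)
{
	uint32_t half = c->size / 2;

	assert(c->size <= TOY_CACHE_MAX && n <= half);
	if (c->len + n > c->size) {
		/* Insufficient room implies len > size / 2 here. */
		uint32_t hot = c->len - half;

		assert(b->count + half <= 4 * TOY_CACHE_MAX);
		/* Flush the colder bottom half to the backend. */
		memcpy(&b->objs[b->count], &c->objs[0], half * sizeof(int));
		b->count += half;
		/* Move the hotter upper part down to the bottom. */
		memmove(&c->objs[0], &c->objs[half], hot * sizeof(int));
		c->len = hot;
	}
	/* Append the new objects on top. */
	memcpy(&c->objs[c->len], objs, n * sizeof(int));
	c->len += n;
}
```

With size = 128, a full cache and a put of 32 objects, the backend
transaction is always 64 objects, and the cache ends at 96 with the 64
hottest old objects at the bottom and the 32 new ones on top.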

> For cases where we have pipelines with all alloc
> on one core and all free on another, this half-flush would be
> inefficient.
> 
> Instead, I would look to have a lower target threshold post-flush, and I
> would suggest 25% - taking into account the newly freed buffers.

It's not good for #1.
I agree that it is better for #2. But I don't think #2 is the likely use case.

After our discussion at the summit, I did start working on a patch targeting fill levels at 25% from the cache edges, but I don't think it's better; so I'd rather stick with a target fill level of 50%.

> For example:
> 
> 	/* if n > our target of 1/4 full, flush everything,
> 	 * else flush so that we end up with 1/4 full after n added.
> 	 */
> 	flush_count = n > cache->size/4 ? cache->len :
> 			(cache->len + n) - cache->size/4;
> 
> 
> >  	} else {
> > -		/* The request itself is too big for the cache. */
> > +		/* The request itself is too big. */
> >  		goto driver_enqueue_stats_incremented;
> 
> I think original comment is better. The request itself is not too big for
> the whole mempool, just for the cache.

Ack.

> 
> >  	}
> >
> > @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	/* The cache is a stack, so copy will be in reverse order. */
> >  	cache_objs = &cache->objs[cache->len];
> >
> > -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> >  	if (likely(n <= cache->len)) {
> >  		/* The entire request can be satisfied from the cache. */
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	/* Dequeue below would overflow mem allocated for cache? */
> > -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > +	/* Dequeue below would exceed the cache bounce buffer limit? */
> > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +	if (unlikely(remaining > cache->size / 2))
> >  		goto driver_dequeue;
> >
> > -	/* Fill the cache from the backend; fetch size + remaining objects. */
> > -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > -			cache->size + remaining);
> > +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
> 
> Again, the cache->size / 2 doesn't seem right here. We at most half-fill
> the cache and then take some objects from that, meaning that we have just
> done a re-fill of cache but end the function with it less than half full.
> Since we take from this value, I'd suggest just filling the cache completely.

The issues at the edges of the cache are symmetrical.
If we replenish the cache to full, and the next transaction is a put, the cache needs to be drained.
That's why I replenish to size/2.
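
The get side can be modelled the same way. Again a sketch with made-up
names (demo_*): ints stand in for object pointers, the backend just
hands out consecutive IDs and never fails, and requests larger than
size / 2 (which go directly to the backend in the patch) are not
modelled.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CACHE_MAX 128

struct demo_cache {
	uint32_t size;
	uint32_t len;
	int objs[DEMO_CACHE_MAX];
};

/* Hypothetical backend: hands out consecutive object IDs, never fails. */
struct demo_pool {
	int next_id;
};

static void
demo_pool_dequeue(struct demo_pool *p, int *dst, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++)
		dst[i] = p->next_id++;
}

/* Get n objects (n <= size / 2), following the patched get path. */
static void
demo_get(struct demo_cache *c, struct demo_pool *p, int *dst, uint32_t n)
{
	uint32_t half = c->size / 2;
	uint32_t from_cache = n <= c->len ? n : c->len;
	uint32_t remaining = n - from_cache;
	int *top = &c->objs[c->len];

	assert(c->size <= DEMO_CACHE_MAX && n <= half);
	/* The cache is a stack, so copy in reverse order. */
	for (uint32_t i = 0; i < from_cache; i++)
		*dst++ = *--top;
	c->len -= from_cache;
	if (remaining == 0)
		return;
	/* The cache is empty now: replenish size / 2, then take the rest. */
	demo_pool_dequeue(p, c->objs, half);
	top = &c->objs[half];
	for (uint32_t i = 0; i < remaining; i++)
		*dst++ = *--top;
	c->len = half - remaining;
}
```

After a miss the cache holds size / 2 minus the remaining part of the
request, so there is room for roughly size / 2 objects in either
direction before the next backend transaction.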

> 
> >  	if (unlikely(ret < 0)) {
> >  		/*
> >  		 * We are buffer constrained, and not able to fetch all that.
> > @@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> >
> > -	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > -	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > -	cache_objs = &cache->objs[cache->size + remaining];
> > -	cache->len = cache->size;
> > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +	__rte_assume(remaining <= cache->size / 2);
> > +	cache_objs = &cache->objs[cache->size / 2];
> > +	cache->len = cache->size / 2 - remaining;
> >  	for (index = 0; index < remaining; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > --
> > 2.43.0
> >


* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-22 16:12   ` Bruce Richardson
@ 2026-05-26  8:57     ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-26  8:57 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 22 May 2026 18.13
> 
> On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> > This patch refactors the mempool cache to eliminate some unexpected
> > behaviour and reduce the mempool cache miss rate.
> >
> > 1.
> > The actual cache size was 1.5 times the cache size specified at run-
> time
> > mempool creation.
> > This was obviously not expected by application developers.
> >
> > 2.
> > In get operations, the check for when to use the cache as bounce
> buffer
> > did not respect the run-time configured cache size,
> > but compared to the build time maximum possible cache size
> > (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> > E.g. with a configured cache size of 32 objects, getting 256 objects
> > would first fetch 32 + 256 = 288 objects into the cache,
> > and then move the 256 objects from the cache to the destination
> memory,
> > instead of fetching the 256 objects directly to the destination
> memory.
> > This had a performance cost.
> > However, this is unlikely to occur in real applications, so it is not
> > important in itself.
> >
> > 3.
> > When putting objects into a mempool, and the mempool cache did not
> have
> > free space for so many objects,
> > the cache was flushed completely, and the new objects were then put
> into
> > the cache.
> > I.e. the cache drain level was zero.
> > This (complete cache flush) meant that a subsequent get operation
> (with
> > the same number of objects) completely emptied the cache,
> > so another subsequent get operation required replenishing the cache.
> >
> > Similarly,
> > When getting objects from a mempool, and the mempool cache did not
> hold so
> > many objects,
> > the cache was replenished to cache->size + remaining objects,
> > and then (the remaining part of) the requested objects were fetched
> via
> > the cache,
> > which left the cache filled (to cache->size) at completion.
> > I.e. the cache refill level was cache->size (plus some, depending on
> > request size).
> >
> > (1) was improved by generally comparing to cache->size instead of
> > cache->flushthresh, when considering the capacity of the cache.
> > The cache->flushthresh field is kept for API/ABI compatibility
> purposes,
> > and initialized to cache->size instead of cache->size * 1.5.
> >
> > (2) was improved by generally comparing to cache->size / 2 instead of
> > RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> >
> > (3) was improved by flushing and replenishing the cache by half its
> size,
> > so a flush/refill can be followed randomly by get or put requests.
> > This also reduced the number of objects in each flush/refill
> operation.
> >
> > As a consequence of these changes, the size of the array holding the
> > objects in the cache (cache->objs[]) no longer needs to be
> > 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> > RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> >
> > Performance data:
> > With a real WAN Optimization application, where the number of
> allocated
> > packets varies (as they are held in e.g. shaper queues), the mempool
> > cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> > This was deployed in production at an ISP, and using an effective
> cache
> > size of 384 objects.
> >
> > As a consequence of the improved mempool cache algorithm, some
> drivers
> > were updated accordingly:
> > - The Intel idpf PMD was updated regarding how much to backfill the
> >   mempool cache in the AVX512 code.
> > - The NXP dpaa and dpaa2 mempool drivers were updated to not set the
> >   mempool cache flush threshold; doing this no longer has any effect,
> and
> >   thus became superfluous.
> >
> > Bugzilla ID: 1027
> > Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> > Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf
> fast-free")
> > ---
> > v5:
> > * Flush the cache from the bottom, where objects are colder, and move
> down
> >   the remaining objects, which are hotter.
> > * In the Intel idpf PMD, move up the hot objects in the cache and
> refill
> >   with cold objects at the bottom.
> > v4:
> > * Added Bugzilla ID.
> > * Added Fixes tag. For reference only.
> > * Moved fast-free related update of Intel common driver out as a
> separate
> >   patch, and depend on that patch.
> > * Omitted unrelated changes to the Intel idpf AVX512 driver,
> specifically
> >   fixing an indentation and adding mbuf instrumentation.
> > * Omitted unrelated changes to the mempool library, specifically
> adding
> >   __rte_restrict and changing a couple of comments to proper
> sentences.
> > * Please checkpatches by swapping operators in a couple of
> comparisons.
> > v3:
> > * Fixed my copy-paste bug in idpf_splitq_rearm().
> > v2:
> > * Fixed issue found by abidiff:
> >   Reverted cache objects array size reduction. Added a note instead.
> > * Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
> > * Updated idpf_splitq_rearm() like idpf_singleq_rearm().
> > * Added a few more __rte_assume(). (Inspired by AI review)
> > * Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
> >   flush threshold.
> > * Added release notes.
> > * Added deprecation notes.
> > ---
> >  doc/guides/rel_notes/deprecation.rst          |  7 ++
> >  doc/guides/rel_notes/release_26_07.rst        | 10 +++
> >  drivers/mempool/dpaa/dpaa_mempool.c           | 14 ----
> >  drivers/mempool/dpaa2/dpaa2_hw_mempool.c      | 14 ----
> >  .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++---
> >  lib/mempool/rte_mempool.c                     | 14 +---
> >  lib/mempool/rte_mempool.h                     | 70 ++++++++++++-------
> >  7 files changed, 104 insertions(+), 77 deletions(-)
> >
> Can the idpf and dpaa changes be made in separate patches, so we can review
> the mempool changes alone in a single patch? Even if the commits can't work
> logically together, perhaps they can be separated for review, and then
> squashed on apply?

Sure.

The idpf changes are required due to the driver bypassing the mempool API.
If such direct access to the mempool cache is required, it would be better to add a mempool cache zero-copy API.

The dpaa changes are simple clean-ups.
They set a variable which this mempool change makes obsolete, so no need to set it.



* Re: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-26  8:41     ` Morten Brørup
@ 2026-05-26  9:39       ` Bruce Richardson
  2026-05-26 10:37         ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2026-05-26  9:39 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

On Tue, May 26, 2026 at 10:41:44AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 22 May 2026 18.12
> > 
> > On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> > > This patch refactors the mempool cache to eliminate some unexpected
> > > behaviour and reduce the mempool cache miss rate.
> > >
> > 
> > Agree in principle with most of these changes. As we discussed at the DPDK
> > summit conference, only issue I really have is with the threshold limits
> > here - allocating and freeing only half the cache at a time seems overly
> > conservative. Thinking about use-cases:
> > 
> > 1 for apps where alloc + free (generally Rx+Tx) is on the same core(s),
> >   then we should run (almost) entirely out of cache.
> 
> I strongly disagree about any goal to run the cache low.
> The primary goal is to minimize the cache miss (refill and replenish) rate.
> 
> > 2 for apps where we have alloc and free on different cores, then we
> > have
> >   some caches always being filled and others always being emptied
> 
> Agree.
> 
> > 
> > For case #1, we only need worry about the thresholds for the odd case
> > where we have a burst that causes us to overflow our cache (and we
> > can't increase our cache size to cope and avoid that). 
> > Otherwise the thresholds don't matter.
> 
> It seems like you assume the application only does something like this:
> Rx -> Rewrite -> Tx
> 
> In that case, the per-lcore cache only needs capacity for one burst, yes.
> With my patch, the cache can be rightsized by requesting a cache size of 2 * burst size.
> (Then the fill level will be either size/2 or empty, i.e. one burst or zero.
> This also happens to meet your suggested goal about low fill level, which I disagree with.)
> 
> However, I don't think that is a realistic use case.
> Many apps do something like this:
> 
>      |-> Rewrite ->|
> Rx ->|             |-> Tx
>      |-> Hold      |
>          Release ->|
> 

I agree, that's why our cache size needs to be bigger than our burst size.

> They often hold back packets before they are transmitted.
> For a simple router, when the destination IP address is not in the neighbor table, packets to that IP address are queued until ARP/ND has been resolved, and then they are dequeued and transmitted.
> Or apps performing shaping or pacing, where packets are held back in queues, and dequeued at the appropriate time.
> For such apps, the waves are much bigger (than the simple Rx->Rewrite->Tx use case).
> 
> With a random enqueue/dequeue pattern, replenishing/draining the cache to size/2 minimizes the probability of reaching one of its edges (empty or full), triggering a "cache miss" (refill/replenish).
> 
> > However, for case #2, the thresholds are constantly involved as
> > we
> > always are going to backing store. In this case, we really want to have
> > the
> > allocs *always* fill the cache completely, and the frees completely
> > empty
> > the cache.
> 
> Agree.
> 
> > 
> > Because of this, while we want to avoid cases where we fill the cache
> > completely only to have a further free causing it to be flushed,
> > because of
> > case #2 we cannot be overly conservative in how much we free/empty.
> > Ideally, we want to fill to full less a single burst, and empty leaving
> > only a single burst in the cache. Unfortunately, we don't know what
> > those
> > burst limits are, so we have to try and guess the best behaviour from
> > everything else.
> 
> I agree about not wanting to be overly conservative.
> But in the use cases I have described for #1, I don't think a target fill level of size/2 is overly conservative.
> 
> I also acknowledge that this patch doubles the mempool cache miss rate for #2.
> E.g. with a cache size of 512 and burst size of 64, the per-burst miss rate will be 64 / (1/2 * 512) = 1/4, compared to 64 / 512 = 1/8 with a full replenish/drain algorithm.
> 
> In theory, we could make it build time configurable to optimize mempools for #2.
> But mempools are also used for other objects than mbufs, so that would have unwanted side effects for non-mbuf mempools.
> 
> If we went for an algorithm targeting replenish/drain at 25 % from the edges, the per-burst miss rate for #2 would be: 64 / (3/4 * 512) = 1/6.
> 
> How about addressing #2 in the release notes:
> We describe that the cache refill/drain algorithm has been changed to only refill/drain to 50 % of the cache size, so pipelined applications performing Rx (mempool get) and Tx (mempool put) on separate cores should configure their mbuf pools with double the cache size of what they previously were to achieve similar performance.
> 
> > 
> > All that said, comments with specific suggestions inline.
> > 
> > /Bruce
> > 
> > <snip>
> > 
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 2e54fc4466..432c43ab15 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats
> > {
> > >   */
> > >  struct __rte_cache_aligned rte_mempool_cache {
> > >  	uint32_t size;	      /**< Size of the cache */
> > > -	uint32_t flushthresh; /**< Threshold before we flush excess
> > elements */
> > > +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility
> > purposes only */
> > >  	uint32_t len;	      /**< Current cache count */
> > >  #ifdef RTE_LIBRTE_MEMPOOL_STATS
> > >  	uint32_t unused;
> > > @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
> > >  	/**
> > >  	 * Cache objects
> > >  	 *
> > > -	 * Cache is allocated to this size to allow it to overflow in
> > certain
> > > -	 * cases to avoid needless emptying of cache.
> > > +	 * Note:
> > > +	 * Cache is allocated at double size for API/ABI compatibility
> > purposes only.
> > > +	 * When reducing its size at an API/ABI breaking release,
> > > +	 * remember to add a cache guard after it.
> > >  	 */
> > >  	alignas(RTE_CACHE_LINE_SIZE) void
> > *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
> > >  };
> > > @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
> > >   * @param cache_size
> > >   *   If cache_size is non-zero, the rte_mempool library will try to
> > >   *   limit the accesses to the common lockless pool, by maintaining
> > a
> > > - *   per-lcore object cache. This argument must be lower or equal to
> > > - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> > > + *   per-lcore object cache. This argument must be an even number,
> > > + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
> > >   *   The access to the per-lcore table is of course
> > >   *   faster than the multi-producer/consumer pool. The cache can be
> > >   *   disabled if the cache_size argument is set to 0; it can be
> > useful to
> > >   *   avoid losing objects in cache.
> > > + *   Note:
> > > + *   Mempool put/get requests of more than cache_size / 2 objects
> > may be
> > > + *   partially or fully served directly by the multi-
> > producer/consumer
> > > + *   pool, to avoid the overhead of copying the objects twice
> > (instead of
> > > + *   once) when using the cache as a bounce buffer.
> > >   * @param private_data_size
> > >   *   The size of the private data appended after the mempool
> > >   *   structure. This is useful for storing some private data after
> > the
> > > @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool
> > *mp, void * const *obj_table,
> > >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > >
> > > -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE *
> > 2);
> > > -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > > -	__rte_assume(cache->len <= cache->flushthresh);
> > > -	if (likely(cache->len + n <= cache->flushthresh)) {
> > > +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > > +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > +	__rte_assume(cache->len <= cache->size);
> > > +	if (likely(cache->len + n <= cache->size)) {
> > >  		/* Sufficient room in the cache for the objects. */
> > >  		cache_objs = &cache->objs[cache->len];
> > >  		cache->len += n;
> > > -	} else if (n <= cache->flushthresh) {
> > > +	} else if (n <= cache->size / 2) {
> > >  		/*
> > > -		 * The cache is big enough for the objects, but - as
> > detected by
> > > -		 * the comparison above - has insufficient room for them.
> > > -		 * Flush the cache to make room for the objects.
> > > +		 * The number of objects is within the cache bounce buffer
> > limit,
> > > +		 * but - as detected by the comparison above - the cache
> > has
> > > +		 * insufficient room for them.
> > > +		 * Flush the cache to the backend to make room for the
> > objects;
> > > +		 * flush (size / 2) objects from the bottom of the cache,
> > where
> > > +		 * objects are less hot, and move down the remaining
> > objects, which
> > > +		 * are more hot, from the upper half of the cache.
> > >  		 */
> > > -		cache_objs = &cache->objs[0];
> > > -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > -		cache->len = n;
> > > +		__rte_assume(cache->len > cache->size / 2);
> > > +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache-
> > >size / 2);
> > > +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> > > +				sizeof(void *) * (cache->len - cache->size /
> > 2));
> > > +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> > > +		cache->len = cache->len - cache->size / 2 + n;
> > 
> > The flushing of only half the cache I'm not so certain about. I agree
> > that
> > we want to not flush to empty, but I also think that we want to do more
> > than a half-flush, especially since we do an enqueue to the cache
> > immediately afterwards. Consider the case where we have a cache size of
> > 128, and we do an enqueue of 32, with the cache currently full. In that
> > case we only flush 64, reducing the cache to 64, but then immediately
> > bringing it back up to 96.
> 
> I thought in depth about whether the flush/replenish sizes should consider the request size or not. (E.g. if I should replenish size/2 or size/2+request.)
> I decided for not considering the request size, for two reasons:
> a) It roughly doesn't matter, especially when considering a sequence of random get/put requests.
> b) The size of the backend transactions become fixed, which has performance benefits: With my patch, they are always size/2, so if the cache size is 2^N, the backend transactions are 2^N and CPU cache aligned.
> 
> > For cases where we have pipelines with all
> > alloc
> > on one core and all free on another, this half-flush would be
> > inefficient.
> > 
> > Instead, I would look to have a lower target threshold post-flush, and
> > I
> > would suggest 25% - taking into account the newly freed buffers.
> 
> It's not good for #1.
> I agree that it is better for #2. But I don't think #2 is the likely use case.
> 
> After our discussion at the summit, I did start working a patch targeting fill levels at 25% from the cache edges, but I don't think it's better; so I'd rather stick with a target fill level of 50%.
> 
> > For example:
> > 
> > 	/* if n > our target of 1/4 full, flush everything,
> > 	 * else flush so that we end up with 1/4 full after n added.
> > 	 */
> > 	flush_count = n > cache->size/4 ? cache->len :
> > 			(cache->len + n) - cache->size/4;
> > 
> > 
> > >  	} else {
> > > -		/* The request itself is too big for the cache. */
> > > +		/* The request itself is too big. */
> > >  		goto driver_enqueue_stats_incremented;
> > 
> > I think the original comment is better. The request itself is not too big for
> > the whole mempool, just for the cache.
> 
> Ack.
> 
> > 
> > >  	}
> > >
> > > @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> > >  	/* The cache is a stack, so copy will be in reverse order. */
> > >  	cache_objs = &cache->objs[cache->len];
> > >
> > > -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > > +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > >  	if (likely(n <= cache->len)) {
> > >  		/* The entire request can be satisfied from the cache. */
> > >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > > @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> > >  	for (index = 0; index < len; index++)
> > >  		*obj_table++ = *--cache_objs;
> > >
> > > -	/* Dequeue below would overflow mem allocated for cache? */
> > > -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > > +	/* Dequeue below would exceed the cache bounce buffer limit? */
> > > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > > +	if (unlikely(remaining > cache->size / 2))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Fill the cache from the backend; fetch size + remaining objects. */
> > > -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > > -			cache->size + remaining);
> > > +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> > > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
> > 
> > Again, the cache->size / 2 doesn't seem right here. We at most half-fill
> > the cache and then take some objects from that, meaning that we have just
> > done a re-fill of the cache but end the function with it less than half full.
> > Since we take from this value, I'd suggest just filling the cache completely.
> 
> The issues at the edges of the cache are symmetrical.
> If we replenish the cache to full, and the next transaction is a put, the cache needs to be drained.
> That's why I replenish to size/2.
> 

Not necessarily. After you fill the cache here (to 50%) you then take
elements from it. If we filled to 100%, then we'd only hit a flush if we
freed back more elements than we just allocated. Now, that's reasonably
likely, which is why I'm OK with not filling to 100%. However, doing a refill
as part of the op, and then leaving the cache less than half full after the
op finishes is just wasteful IMHO!

One other point. The obvious worst case scenario we want to avoid is
constant fill/empty/fill/empty sequences. However, so long as we have an
asymmetry in how we do fills and empty thresholds, a repeating sequence
should never occur, and even pairs of fill/empty should be very rare.
Consider the case, for example, where we fill to 100% but only drain to
25%. For this, we can ignore scenario #2, since we should have pretty good
cache usage. For scenario #1, of allocs/frees on a single core, randomly
interleaved, our worst case is:
1) alloc with cache empty, in which case we fill to 100%, then take the
alloc
2) have a free of a burst greater than that which we just allocated,
causing an immediate flush again.

That's pretty poor behaviour, but the thing is that after the flush we now
have the mempool cache at 25% + freed burst - so expected between 25-50% of
cache utilization. That's the sweet spot we want to target - after each
operation, the mempool cache should ideally be at 50%. Whether the next op
is an alloc or a free, our cache can handle it, and likely a couple of each
in sequence. Therefore, our possible fill/empty combinations are likely to
be rare occurrences.

[In all this, I am making the assumption that burst size is well less than
cache size. Also, similar logic would be applicable for the inverse
scenario, e.g. flush to empty (and fill burst) and fill to 75%]

Now, all said, I tend to agree that we want to leave space for a decent
size burst after a fill. That is why I think that filling to 75% is
reasonable. After an alloc that triggers a fill, I don't want the cache
less than 50% full, but not completely full so there is room for a free
without a flush, and similarly for a free that triggers a flush, the cache
should not be empty, but also should not be more than half full.
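
To make the drain side of this concrete, the flush_count expression quoted
earlier in this mail can be written as a standalone helper. This is only a
sketch: the function name and the plain unsigned parameters are hypothetical
and not part of the mempool API, and it assumes it is only called once the
put has been found not to fit in the cache.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of rte_mempool: number of objects to flush
 * from a cache holding `len` objects when a put of `n` objects does not fit,
 * targeting a fill level of size / 4 after the put.
 */
unsigned int
flush_count_quarter(unsigned int size, unsigned int len, unsigned int n)
{
	assert(len <= size);
	assert(len + n > size);	/* only used when the put does not fit */

	/* If n alone exceeds the 1/4 target, flush everything. */
	if (n > size / 4)
		return len;

	/* Otherwise flush so that len - flush + n == size / 4. */
	return (len + n) - size / 4;
}
```

With a cache size of 128, a full cache and a put of 16, this flushes 112
objects and leaves 16 + 16 = 32 objects (25%) in the cache after the put.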

One suggestion - we could always add a simple tunable that specifies the
margin, or reserved entries for alloc and free. We can then guide in the
docs that the value should be e.g. "zero for apps where alloc and free take
place on different cores. 20%-50% of cache is recommended where alloc and
free take place on the same core"

/Bruce



* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-26  9:39       ` Bruce Richardson
@ 2026-05-26 10:37         ` Morten Brørup
  2026-05-26 17:45           ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-05-26 10:37 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 26 May 2026 11.40
> 
> On Tue, May 26, 2026 at 10:41:44AM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 22 May 2026 18.12
> > >
> > > On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> > > > This patch refactors the mempool cache to eliminate some
> unexpected
> > > > behaviour and reduce the mempool cache miss rate.
> > > >
> > >
> > > Agree in principle with most of these changes. As we discussed at the DPDK
> > > summit conference, the only issue I really have is with the threshold limits
> > > here - allocating and freeing only half the cache at a time seems overly
> > > conservative. Thinking about use-cases:
> > >
> > > 1 for apps where alloc + free (generally Rx+Tx) is on the same
> core(s),
> > >   then we should run (almost) entirely out of cache.
> >
> > I strongly disagree about any goal to run the cache low.
> > The primary goal is to minimize the cache miss (refill and replenish)
> rate.
> >
> > > 2 for apps where we have alloc and free on different cores, then we
> > > have
> > >   some caches always being filled and others always being emptied
> >
> > Agree.
> >
> > >
> > > For case #1, we only need worry about the thresholds for the odd
> case
> > > where we have a burst that causes us to overflow our cache (and we
> > > can't increase our cache size to cope and avoid that).
> > > Otherwise the thresholds don't matter.
> >
> > It seems like you assume the application only does something like
> this:
> > Rx -> Rewrite -> Tx
> >
> > In that case, the per-lcore cache only needs capacity for one burst,
> yes.
> > With my patch, the cache can be rightsized by requesting a cache size
> of 2 * burst size.
> > (Then the fill level will be either size/2 or empty, i.e. one burst
> or zero.
> > This also happens to meet your suggested goal about low fill level,
> which I disagree with.)
> >
> > However, I don't think that is a realistic use case.
> > Many apps do something like this:
> >
> >      |-> Rewrite ->|
> > Rx ->|             |-> Tx
> >      |-> Hold      |
> >          Release ->|
> >
> 
> I agree, that's why our cache size needs to be bigger than our burst
> size.
> 
> > They often hold back packets before they are transmitted.
> > For a simple router, when the destination IP address is not in the
> neighbor table, packets to that IP address are queued until ARP/ND has
> been resolved, and then they are dequeued and transmitted.
> > Or apps performing shaping or pacing, where packets are held back in
> queues, and dequeued at the appropriate time.
> > For such apps, the waves are much bigger (than the simple Rx-
> >Rewrite->Tx use case).
> >
> > With a random enqueue/dequeue pattern, replenishing/draining the
> cache to size/2 minimizes the probability of reaching one of its edges
> (empty or full), triggering a "cache miss" (refill/replenish).
> >
> > > However, for case #2, the thresholds are constantly involved as
> > > we
> > > always are going to backing store. In this case, we really want to
> have
> > > the
> > > allocs *always* fill the cache completely, and the frees completely
> > > empty
> > > the cache.
> >
> > Agree.
> >
> > >
> > > Because of this, while we want to avoid cases where we fill the
> cache
> > > completely only to have a further free causing it to be flushed,
> > > because of
> > > case #2 we cannot be overly conservative in how much we free/empty.
> > > Ideally, we want to fill to full less a single burst, and empty
> leaving
> > > only a single burst in the cache. Unfortunately, we don't know what
> > > those
> > > burst limits are, so we have to try and guess the best behaviour
> from
> > > everything else.
> >
> > I agree about not wanting to be overly conservative.
> > But in the use cases I have described for #1, I don't think a target
> fill level of size/2 is overly conservative.
> >
> > I also acknowledge that this patch doubles the mempool cache miss
> rate for #2.
> > E.g. with a cache size of 512 and burst size of 64, the per-burst
> miss rate will be 64 / (1/2 * 512) = 1/4, compared to 64 / 512 = 1/8
> with a full replenish/drain algorithm.
> >
> > In theory, we could make it build time configurable to optimize
> mempools for #2.
> > But mempools are also used for other objects than mbufs, so that
> would have unwanted side effects for non-mbuf mempools.
> >
> > If we went for an algorithm targeting replenish/drain at 25 % from
> the edges, the per-burst miss rate for #2 would be: 64 / (3/4 * 512) =
> 1/6.
> >
> > How about addressing #2 in the release notes:
> > We describe that the cache refill/drain algorithm has been changed to
> only refill/drain to 50 % of the cache size, so pipelined applications
> performing Rx (mempool get) and Tx (mempool put) on separate cores
> should configure their mbuf pools with double the cache size of what
> they previously were to achieve similar performance.
> >
> > >
> > > All that said, commits with specific suggestions inline.
> > >
> > > /Bruce
> > >
> > > <snip>
> > >
> > > > diff --git a/lib/mempool/rte_mempool.h
> b/lib/mempool/rte_mempool.h
> > > > index 2e54fc4466..432c43ab15 100644
> > > > --- a/lib/mempool/rte_mempool.h
> > > > +++ b/lib/mempool/rte_mempool.h
> > > > @@ -89,7 +89,7 @@ struct __rte_cache_aligned
> rte_mempool_debug_stats
> > > {
> > > >   */
> > > >  struct __rte_cache_aligned rte_mempool_cache {
> > > >  	uint32_t size;	      /**< Size of the cache */
> > > > -	uint32_t flushthresh; /**< Threshold before we flush excess
> > > elements */
> > > > +	uint32_t flushthresh; /**< Obsolete; for API/ABI
> compatibility
> > > purposes only */
> > > >  	uint32_t len;	      /**< Current cache count */
> > > >  #ifdef RTE_LIBRTE_MEMPOOL_STATS
> > > >  	uint32_t unused;
> > > > @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache
> {
> > > >  	/**
> > > >  	 * Cache objects
> > > >  	 *
> > > > -	 * Cache is allocated to this size to allow it to overflow
> in
> > > certain
> > > > -	 * cases to avoid needless emptying of cache.
> > > > +	 * Note:
> > > > +	 * Cache is allocated at double size for API/ABI
> compatibility
> > > purposes only.
> > > > +	 * When reducing its size at an API/ABI breaking release,
> > > > +	 * remember to add a cache guard after it.
> > > >  	 */
> > > >  	alignas(RTE_CACHE_LINE_SIZE) void
> > > *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
> > > >  };
> > > > @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
> > > >   * @param cache_size
> > > >   *   If cache_size is non-zero, the rte_mempool library will try
> to
> > > >   *   limit the accesses to the common lockless pool, by
> maintaining
> > > a
> > > > - *   per-lcore object cache. This argument must be lower or
> equal to
> > > > - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> > > > + *   per-lcore object cache. This argument must be an even
> number,
> > > > + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
> > > >   *   The access to the per-lcore table is of course
> > > >   *   faster than the multi-producer/consumer pool. The cache can
> be
> > > >   *   disabled if the cache_size argument is set to 0; it can be
> > > useful to
> > > >   *   avoid losing objects in cache.
> > > > + *   Note:
> > > > + *   Mempool put/get requests of more than cache_size / 2
> objects
> > > may be
> > > > + *   partially or fully served directly by the multi-
> > > producer/consumer
> > > > + *   pool, to avoid the overhead of copying the objects twice
> > > (instead of
> > > > + *   once) when using the cache as a bounce buffer.
> > > >   * @param private_data_size
> > > >   *   The size of the private data appended after the mempool
> > > >   *   structure. This is useful for storing some private data
> after
> > > the
> > > > @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct
> rte_mempool
> > > *mp, void * const *obj_table,
> > > >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > > >  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > > >
> > > > -	__rte_assume(cache->flushthresh <=
> RTE_MEMPOOL_CACHE_MAX_SIZE *
> > > 2);
> > > > -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > > > -	__rte_assume(cache->len <= cache->flushthresh);
> > > > -	if (likely(cache->len + n <= cache->flushthresh)) {
> > > > +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> > > > +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > > +	__rte_assume(cache->len <= cache->size);
> > > > +	if (likely(cache->len + n <= cache->size)) {
> > > >  		/* Sufficient room in the cache for the objects. */
> > > >  		cache_objs = &cache->objs[cache->len];
> > > >  		cache->len += n;
> > > > -	} else if (n <= cache->flushthresh) {
> > > > +	} else if (n <= cache->size / 2) {
> > > >  		/*
> > > > -		 * The cache is big enough for the objects, but - as
> > > detected by
> > > > -		 * the comparison above - has insufficient room for
> them.
> > > > -		 * Flush the cache to make room for the objects.
> > > > +		 * The number of objects is within the cache bounce
> buffer
> > > limit,
> > > > +		 * but - as detected by the comparison above - the
> cache
> > > has
> > > > +		 * insufficient room for them.
> > > > +		 * Flush the cache to the backend to make room for
> the
> > > objects;
> > > > +		 * flush (size / 2) objects from the bottom of the
> cache,
> > > where
> > > > +		 * objects are less hot, and move down the remaining
> > > objects, which
> > > > +		 * are more hot, from the upper half of the cache.
> > > >  		 */
> > > > -		cache_objs = &cache->objs[0];
> > > > -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len);
> > > > -		cache->len = n;
> > > > +		__rte_assume(cache->len > cache->size / 2);
> > > > +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0],
> cache-
> > > >size / 2);
> > > > +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size
> / 2],
> > > > +				sizeof(void *) * (cache->len - cache-
> >size /
> > > 2));
> > > > +		cache_objs = &cache->objs[cache->len - cache->size /
> 2];
> > > > +		cache->len = cache->len - cache->size / 2 + n;
> > >
> > > The flushing of only half the cache I'm not so certain about. I
> agree
> > > that
> > > we want to not flush to empty, but I also think that we want to do
> more
> > > than a half-flush, especially since we do an enqueue to the cache
> > > immediately afterwards. Consider the case where we have a cache
> size of
> > > 128, and we do an enqueue of 32, with the cache currently full. In
> that
> > > case we only flush 64, reducing the cache to 64, but then
> immediately
> > > bringing it back up to 96.
> >
> > I thought in depth about whether the flush/replenish sizes should
> consider the request size or not. (E.g. if I should replenish size/2 or
> size/2+request.)
> > I decided for not considering the request size, for two reasons:
> > a) It roughly doesn't matter, especially when considering a sequence
> of random get/put requests.
> > b) The size of the backend transactions becomes fixed, which has performance benefits: With my patch, they are always size/2, so if the cache size is 2^N, the backend transactions are 2^(N-1), i.e. also a power of two, and CPU cache aligned.
> >
> > > For cases where we have pipelines with all
> > > alloc
> > > on one core and all free on another, this half-flush would be
> > > inefficient.
> > >
> > > Instead, I would look to have a lower target threshold post-flush,
> and
> > > I
> > > would suggest 25% - taking into account the newly freed buffers.
> >
> > It's not good for #1.
> > I agree that it is better for #2. But I don't think #2 is the likely
> use case.
> >
> > After our discussion at the summit, I did start working a patch
> targeting fill levels at 25% from the cache edges, but I don't think
> it's better; so I'd rather stick with a target fill level of 50%.
> >
> > > For example:
> > >
> > > 	/* if n > our target of 1/4 full, flush everything,
> > > 	 * else flush so that we end up with 1/4 full after n added.
> > > 	 */
> > > 	flush_count = n > cache->size/4 ? cache->len :
> > > 			(cache->len + n) - cache->size/4;
> > >
> > >
> > > >  	} else {
> > > > -		/* The request itself is too big for the cache. */
> > > > +		/* The request itself is too big. */
> > > >  		goto driver_enqueue_stats_incremented;
> > >
> > > I think original comment is better. The request itself is not too
> big
> > > for
> > > the whole mempool, just for the cache.
> >
> > Ack.
> >
> > >
> > > >  	}
> > > >
> > > > @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct
> rte_mempool
> > > *mp, void **obj_table,
> > > >  	/* The cache is a stack, so copy will be in reverse order.
> */
> > > >  	cache_objs = &cache->objs[cache->len];
> > > >
> > > > -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > > > +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > >  	if (likely(n <= cache->len)) {
> > > >  		/* The entire request can be satisfied from the
> cache. */
> > > >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk,
> 1);
> > > > @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct
> rte_mempool
> > > *mp, void **obj_table,
> > > >  	for (index = 0; index < len; index++)
> > > >  		*obj_table++ = *--cache_objs;
> > > >
> > > > -	/* Dequeue below would overflow mem allocated for cache? */
> > > > -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > > > +	/* Dequeue below would exceed the cache bounce buffer
> limit? */
> > > > +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> > > > +	if (unlikely(remaining > cache->size / 2))
> > > >  		goto driver_dequeue;
> > > >
> > > > -	/* Fill the cache from the backend; fetch size + remaining
> > > objects. */
> > > > -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > > > -			cache->size + remaining);
> > > > +	/* Fill the cache from the backend; fetch (size / 2)
> objects. */
> > > > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache-
> >size /
> > > 2);
> > >
> > > Again, the cache->size / 2 doesn't seem right here. We at most
> half-
> > > fill
> > > the cache and then take some objects from that, meaning that have
> just
> > > done
> > > a re-fill of cache but end the function with it less than half
> full.
> > > Since
> > > we take from this value, I'd suggest just filling the cache
> completely.
> >
> > The issues at the edges of the cache are symmetrical.
> > If we replenish the cache to full, and the next transaction is a put,
> the cache needs to be drained.
> > That's why I replenish to size/2.
> >
> 
> Not necessarily. After you fill the cache here (to 50%) you then take
> elements from it. If we filled to 100%, then we'd only hit a flush if
> we
> freed back more elements than we just allocated. Now, that's reasonably
> likely which is why I'm ok to not filling to 100%. However, doing a
> refill
> as part of the op, and then leaving the cache less than half full after
> the
> op finishes is just wasteful IMHO!

I'm targeting half full, regardless of whether it is pre-op or post-op.
This makes the backend transactions cache aligned in both addressing and size, which opens an opportunity to optimize the performance of the mempool drivers for this.

> 
> One other point. The obvious worst case scenario we want to avoid is
> constant fill/empty/fill/empty sequences. However, so long as we have
> an
> asymmetry in how we do fills and empty thresholds, a repeating sequence
> should never occur, and even pairs of fill/empty should be very rare.
> Consider the case, for example, where we fill to 100% but only drain to
> 25%. For this, we can ignore scenario #2, since we should have pretty
> good
> cache usage. For scenario #1, of allocs/frees on a single core,
> randomly
> interleaved, our worst case is:
> 1) alloc with cache empty, in which case we fill to 100%, then take the
> alloc
> 2) have a free of a burst greater than that which we just allocated,
> causing an immediate flush again.
> 
> That's pretty poor behaviour, but the thing is that after the flush we
> now
> have mempool at 25% + freed burst - so expected between 25-50% of cache
> utilization. That's the sweet spot we want to target - after each
> operation, the mempool cache should ideally be at 50%. Whether the next
> op
> is an alloc or a free, our cache can handle it, and likely a couple of
> each
> in sequence. Therefore, our possible fill/empty combinations are likely to
> be rare occurrences.
> 
> [In all this, I am making the assumption that burst size is well less
> than
> cache size. Also, similar logic would be applicable for the inverse
> scenario, e.g. flush to empty (and fill burst) and fill to 75%]

I'm not so sure about this assumption.
With a cache size of 512 and bursts of 64, the cache only holds 8 bursts.
50% is 4 bursts, and 25% is only 2 bursts.

Using a replenish/drain level in the middle requires 5 bursts in either direction to pass the edge (and trigger replenish/flush).
Using a replenish/drain level 25% from the edge requires only 3 bursts in the wrong direction to pass the edge (and trigger replenish/flush). That is a much higher probability with random get/put.
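
The burst counts above can be checked with a small model. This is a sketch
only: the two function names are hypothetical, and the cache is reduced to a
fill level counter, ignoring the objects themselves. Each function counts
consecutive same-direction operations up to and including the first one that
passes the edge.

```c
#include <assert.h>

/*
 * Hypothetical model: number of consecutive puts of `burst` objects,
 * starting at fill level `len`, up to and including the first put that
 * does not fit in a cache of `size` objects (and so triggers a flush).
 */
unsigned int
puts_until_flush(unsigned int size, unsigned int len, unsigned int burst)
{
	unsigned int ops = 1;

	assert(burst > 0 && burst <= size && len <= size);
	while (len + burst <= size) {
		len += burst;
		ops++;
	}
	return ops;
}

/*
 * Hypothetical model: number of consecutive gets of `burst` objects,
 * starting at fill level `len`, up to and including the first get that
 * the cache cannot satisfy (and so triggers a replenish).
 */
unsigned int
gets_until_refill(unsigned int len, unsigned int burst)
{
	unsigned int ops = 1;

	assert(burst > 0);
	while (burst <= len) {
		len -= burst;
		ops++;
	}
	return ops;
}
```

For size 512 and burst 64, both functions return 5 when starting at 256 (the
middle). Starting 25% from an edge, i.e. at 384 for puts or at 128 for gets,
they return 3.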

> 
> Now, all said, I tend to agree that we want to leave space for a decent
> size burst after a fill. That is why I think that filling to 75% is
> reasonable. After an alloc that triggers a fill, I don't want the cache
> less than 50% full, but not completely full so there is room for a free
> without a flush, and similarly for a free that triggers a flush, the
> cache
> should not be empty, but also should not be more than half full.
> 
> One suggestion - we could always add a simple tunable that specifies
> the
> margin, or reserved entries for alloc and free. We can then guide in
> the
> docs that the value should be e.g. "zero for apps where alloc and free
> take
> place on different cores. 20%-50% of cache is recommended where alloc
> and
> free take place on the same core"

Yes, a simple tunable is a really good idea.

At this point, I think we should optimize for use case #1, and go for the 50% fill level.
Then we can add a tunable to optimize for use case #2 later. I will try to come up with a draft for such a follow-up patch within the next few days.

The 50% fill level in this patch is not as bad for use case #2 (roughly doubling the burst miss rate from 1/8 to 1/4), compared to how bad the original algorithm is for use case #1 (very high miss probability - only two ops in the wrong direction - after drain/replenish).
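
For reference, the use case #2 numbers can be reproduced with a toy model of
a get-only lcore. This is a sketch under simplifying assumptions: the
function name is hypothetical, every get has the same burst size, the cache
starts empty, and a miss is modelled as topping the cache up by a fixed
number of objects before serving the request.

```c
#include <assert.h>

/*
 * Hypothetical model: count backend refills ("cache misses") for `ops`
 * consecutive gets of `burst` objects on an lcore that never puts.
 * Each miss tops the cache up by `refill` objects.
 */
unsigned int
count_refills(unsigned int refill, unsigned int burst, unsigned int ops)
{
	unsigned int len = 0;	/* the cache starts empty */
	unsigned int misses = 0;
	unsigned int i;

	assert(burst > 0 && burst <= refill);
	for (i = 0; i < ops; i++) {
		if (burst > len) {
			/* Not enough objects in the cache; go to the backend. */
			len += refill;
			misses++;
		}
		len -= burst;
	}
	return misses;
}
```

With burst 64, a refill of 256 (size/2 of a 512 cache) gives one miss per 4
gets, a refill of 512 gives one per 8, and a refill of 384 gives one per 6.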

-Morten



* [PATCH v6] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (4 preceding siblings ...)
  2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
@ 2026-05-26 14:00 ` Morten Brørup
  2026-05-26 16:00   ` Morten Brørup
  2026-05-29  8:53   ` fengchengwen
  2026-05-27 11:36 ` [PATCH v6] net/idpf: update for new mempool cache algorithm Morten Brørup
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-26 14:00 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold so
many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.

(3) was improved by flushing and replenishing the cache by half its size,
so a flush/refill can be followed randomly by get or put requests.
This also reduced the number of objects in each flush/refill operation.
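
For illustration, the put side of (3) can be modelled as below. This is a
toy sketch, not the library code: the struct, the TOY_CACHE_MAX constant and
the function name are hypothetical, objects are plain ints, the backend is
reduced to a counter, and memmove() stands in for rte_memcpy().

```c
#include <assert.h>
#include <string.h>

#define TOY_CACHE_MAX 8	/* hypothetical stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

struct toy_cache {
	unsigned int size;	/* configured cache size */
	unsigned int len;	/* current number of cached objects */
	unsigned int flushed;	/* objects handed to the imaginary backend */
	int objs[TOY_CACHE_MAX];
};

void
toy_put(struct toy_cache *c, const int *objs, unsigned int n)
{
	unsigned int half = c->size / 2;

	assert(c->size <= TOY_CACHE_MAX && c->len <= c->size);
	if (c->len + n <= c->size) {
		/* Sufficient room in the cache; nothing to flush. */
	} else if (n <= half) {
		/*
		 * Flush the colder bottom half to the backend and move the
		 * hotter remaining objects down. Here len > half always holds.
		 */
		unsigned int keep = c->len - half;

		c->flushed += half;
		memmove(&c->objs[0], &c->objs[half], sizeof(c->objs[0]) * keep);
		c->len = keep;
	} else {
		/* The request is too big for the cache; bypass it. */
		c->flushed += n;
		return;
	}
	/* Append the new objects on top of the stack. */
	memcpy(&c->objs[c->len], objs, sizeof(objs[0]) * n);
	c->len += n;
}
```

Putting 2 objects into a full toy cache of size 8 flushes the 4 bottom
objects and leaves 6 in the cache, with the two new objects on top.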

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.

Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf fast-free")
---
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
  Tests using the Intel idpf PMD in AVX512 mode may fail with this patch.
* Reverted a small code comment change. The original was better. (Bruce)
* Reverted rte_mempool_create() description requiring the cache_size to be
  an even number. There is no such requirement.
v5:
* Flush the cache from the bottom, where objects are colder, and move down
  the remaining objects, which are hotter.
* In the Intel idpf PMD, move up the hot objects in the cache and refill
  with cold objects at the bottom.
v4:
* Added Bugzilla ID.
* Added Fixes tag. For reference only.
* Moved fast-free related update of Intel common driver out as a separate
  patch, and depend on that patch.
* Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
  fixing an indentation and adding mbuf instrumentation.
* Omitted unrelated changes to the mempool library, specifically adding
  __rte_restrict and changing a couple of comments to proper sentences.
* Please checkpatches by swapping operators in a couple of comparisons.
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(). (Inspired by AI review)
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst   |  7 +++
 doc/guides/rel_notes/release_26_07.rst | 11 +++++
 lib/mempool/rte_mempool.c              | 14 +-----
 lib/mempool/rte_mempool.h              | 66 ++++++++++++++++----------
 4 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 35c9b4e06c..40760fffbb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -154,3 +154,10 @@ Deprecation Notices
 * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
   ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
   Those API functions are used internally by DPDK core and netvsc PMD.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 6f43d9b61c..3f793f504a 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -63,6 +63,17 @@ New Features
     ``rte_eal_init`` and the application is responsible for probing each device,
   * ``--auto-probing`` enables the initial bus probing, which is the current default behavior.
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate for most application types.
+  Applications where each lcore only puts or gets to a mempool, e.g. pipelined applications where ethdev Rx and Tx run on separate lcores, should adapt to the new algorithm by doubling their configured mempool cache size, to avoid doubling their mempool cache miss rate.
+
 * **Updated PCAP ethernet driver.**
 
   * Added support for VLAN insertion and stripping.
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2e54fc4466..cd0f229b59 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1390,22 +1397,30 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects from the bottom of the cache, where
+		 * objects are less hot, and move down the remaining objects, which
+		 * are more hot, from the upper half of the cache.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
+		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
+				sizeof(void *) * (cache->len - cache->size / 2));
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
 	} else {
 		/* The request itself is too big for the cache. */
 		goto driver_enqueue_stats_incremented;
@@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread
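
The put-path hunk above can be tried outside DPDK. What follows is a
minimal standalone model, not the DPDK code: struct put_cache,
put_backend_enqueue() and model_put() are hypothetical names, objects are
plain ints, and the backend only counts what it receives. The model uses
memmove(), which is defined for overlapping ranges, and makes no claim
about the copy routine used in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PUT_MODEL_MAX 16 /* hypothetical stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

struct put_cache {
	uint32_t size;            /* configured cache size */
	uint32_t len;             /* current number of cached objects */
	int objs[PUT_MODEL_MAX];  /* stack of objects; index 0 is the bottom */
};

/* Hypothetical backend: only counts how many objects it received. */
static uint32_t put_backend_count;

static void put_backend_enqueue(const int *objs, uint32_t n)
{
	(void)objs;
	put_backend_count += n;
}

/* Returns 1 if the objects went into the cache, 0 if they bypassed it. */
static int model_put(struct put_cache *c, const int *objs, uint32_t n)
{
	int *dst;

	assert(c->size <= PUT_MODEL_MAX && c->len <= c->size);

	if (c->len + n <= c->size) {
		/* Sufficient room in the cache for the objects. */
		dst = &c->objs[c->len];
		c->len += n;
	} else if (n <= c->size / 2) {
		/* Flush size / 2 objects from the bottom, keep the rest. */
		uint32_t half = c->size / 2;
		uint32_t keep = c->len - half;

		put_backend_enqueue(&c->objs[0], half);
		memmove(&c->objs[0], &c->objs[half], sizeof(c->objs[0]) * keep);
		dst = &c->objs[keep];
		c->len = keep + n;
	} else {
		/* The request exceeds the bounce buffer limit; bypass the cache. */
		put_backend_enqueue(objs, n);
		return 0;
	}
	memcpy(dst, objs, sizeof(objs[0]) * n);
	return 1;
}
```

In this model, with size = 8 and a full cache, putting 2 objects sends the
4 bottom objects to the backend and leaves 6 objects in the cache, so a
following get or put of up to 2 objects is served without touching the
backend.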

* RE: [PATCH v6] mempool: improve cache behaviour and performance
  2026-05-26 14:00 ` [PATCH v6] " Morten Brørup
@ 2026-05-26 16:00   ` Morten Brørup
  2026-06-01 13:36     ` Thomas Monjalon
  2026-05-29  8:53   ` fengchengwen
  1 sibling, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-05-26 16:00 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Tuesday, 26 May 2026 16.00
> 
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put
> into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> When getting objects from a mempool, and the mempool cache did not hold
> so
> many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh, when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility
> purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> 
> (3) was improved by flushing and replenishing the cache by half its
> size,
> so a flush/refill can be followed randomly by get or put requests.
> This also reduced the number of objects in each flush/refill operation.
> 
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, and using an effective cache
> size of 384 objects.
> 
> Bugzilla ID: 1027
> Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

I forgot to carry over an Ack from v5:
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>

> ---
> Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf
> fast-free")

This dependency seems to cause CI apply failures:
the dependency is based on an older snapshot of main,
while this patch is based on a newer snapshot of main.

> ---
> v6:
> * Moved driver changes out as separate patches, for easier review.
> (Bruce)
>   Tests using the Intel idpf PMD in AVX512 mode may fail with this
> patch.
> * Reverted a small code comment change. The original was better.
> (Bruce)
> * Reverted rte_mempool_create() description requiring the cache_size to
> be
>   an even number. There is no such requirement.
> v5:
> * Flush the cache from the bottom, where objects are colder, and move
> down
>   the remaining objects, which are hotter.
> * In the Intel idpf PMD, move up the hot objects in the cache and
> refill
>   with cold objects at the bottom.
> v4:
> * Added Bugzilla ID.
> * Added Fixes tag. For reference only.
> * Moved fast-free related update of Intel common driver out as a
> separate
>   patch, and depend on that patch.
> * Omitted unrelated changes to the Intel idpf AVX512 driver,
> specifically
>   fixing an indentation and adding mbuf instrumentation.
> * Omitted unrelated changes to the mempool library, specifically adding
>   __rte_restrict and changing a couple of comments to proper sentences.
> * Please checkpatches by swapping operators in a couple of comparisons.
> v3:
> * Fixed my copy-paste bug in idpf_splitq_rearm().
> v2:
> * Fixed issue found by abidiff:
>   Reverted cache objects array size reduction. Added a note instead.
> * Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
> * Updated idpf_splitq_rearm() like idpf_singleq_rearm().
> * Added a few more __rte_assume(). (Inspired by AI review)
> * Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
>   flush threshold.
> * Added release notes.
> * Added deprecation notes.
> ---
>  doc/guides/rel_notes/deprecation.rst   |  7 +++
>  doc/guides/rel_notes/release_26_07.rst | 11 +++++
>  lib/mempool/rte_mempool.c              | 14 +-----
>  lib/mempool/rte_mempool.h              | 66 ++++++++++++++++----------
>  4 files changed, 61 insertions(+), 37 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index 35c9b4e06c..40760fffbb 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -154,3 +154,10 @@ Deprecation Notices
>  * bus/vmbus: Starting DPDK 25.11, all the vmbus API defined in
>    ``drivers/bus/vmbus/rte_bus_vmbus.h`` will become internal to DPDK.
>    Those API functions are used internally by DPDK core and netvsc PMD.
> +
> +* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
> +  is obsolete, and will be removed in DPDK 26.11.
> +
> +* mempool: The object array in ``struct rte_mempool_cache`` is
> oversize by
> +  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
> +  DPDK 26.11.
> diff --git a/doc/guides/rel_notes/release_26_07.rst
> b/doc/guides/rel_notes/release_26_07.rst
> index 6f43d9b61c..3f793f504a 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -63,6 +63,17 @@ New Features
>      ``rte_eal_init`` and the application is responsible for probing
> each device,
>    * ``--auto-probing`` enables the initial bus probing, which is the
> current default behavior.
> 
> +* **Changed effective size of mempool cache.**
> +
> +  * The effective size of a mempool cache was changed to match the
> specified size at mempool creation; the effective size was previously
> 50 % larger than requested.
> +  * The ``flushthresh`` field of the ``struct rte_mempool_cache``
> became obsolete, but was kept for API/ABI compatibility purposes.
> +  * The effective size of the ``objs`` array in the ``struct
> rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but
> its size was kept for API/ABI compatibility purposes.
> +
> +* **Improved mempool cache flush/refill algorithm.**
> +
> +  The mempool cache flush/refill algorithm was improved, to reduce the
> mempool cache miss rate for most application types.
> +  Applications where each lcore only puts or gets to a mempool, e.g.
> pipelined applications where ethdev Rx and Tx run on separate lcores,
> should adapt to the new algorithm by doubling their configured mempool
> cache size, to avoid doubling their mempool cache miss rate.
> +
>  * **Updated PCAP ethernet driver.**
> 
>    * Added support for VLAN insertion and stripping.
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 3042d94c14..805b52cc58 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -52,11 +52,6 @@ static void
>  mempool_event_callback_invoke(enum rte_mempool_event event,
>  			      struct rte_mempool *mp);
> 
> -/* Note: avoid using floating point since that compiler
> - * may not think that is constant.
> - */
> -#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
> -
>  #if defined(RTE_ARCH_X86)
>  /*
>   * return the greatest common divisor between a and b (fast algorithm)
> @@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
>  static void
>  mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
>  {
> -	/* Check that cache have enough space for flush threshold */
> -	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
> -			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
> -
>  	cache->size = size;
> -	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
> +	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility
> purposes only */
>  	cache->len = 0;
>  }
> 
> @@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned
> n, unsigned elt_size,
> 
>  	/* asked cache too big */
>  	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
> -	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
> +	    cache_size > n) {
>  		rte_errno = EINVAL;
>  		return NULL;
>  	}
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 2e54fc4466..cd0f229b59 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
>   */
>  struct __rte_cache_aligned rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
> -	uint32_t flushthresh; /**< Threshold before we flush excess
> elements */
> +	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility
> purposes only */
>  	uint32_t len;	      /**< Current cache count */
>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>  	uint32_t unused;
> @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
>  	/**
>  	 * Cache objects
>  	 *
> -	 * Cache is allocated to this size to allow it to overflow in
> certain
> -	 * cases to avoid needless emptying of cache.
> +	 * Note:
> +	 * Cache is allocated at double size for API/ABI compatibility
> purposes only.
> +	 * When reducing its size at an API/ABI breaking release,
> +	 * remember to add a cache guard after it.
>  	 */
>  	alignas(RTE_CACHE_LINE_SIZE) void
> *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
>  };
> @@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
>   *   If cache_size is non-zero, the rte_mempool library will try to
>   *   limit the accesses to the common lockless pool, by maintaining a
>   *   per-lcore object cache. This argument must be lower or equal to
> - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> + *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
>   *   The access to the per-lcore table is of course
>   *   faster than the multi-producer/consumer pool. The cache can be
>   *   disabled if the cache_size argument is set to 0; it can be useful
> to
>   *   avoid losing objects in cache.
> + *   Note:
> + *   Mempool put/get requests of more than cache_size / 2 objects may
> be
> + *   partially or fully served directly by the multi-producer/consumer
> + *   pool, to avoid the overhead of copying the objects twice (instead
> of
> + *   once) when using the cache as a bounce buffer.
>   * @param private_data_size
>   *   The size of the private data appended after the mempool
>   *   structure. This is useful for storing some private data after the
> @@ -1390,22 +1397,30 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> 
> -	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> -	__rte_assume(cache->len <= cache->flushthresh);
> -	if (likely(cache->len + n <= cache->flushthresh)) {
> +	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +	__rte_assume(cache->len <= cache->size);
> +	if (likely(cache->len + n <= cache->size)) {
>  		/* Sufficient room in the cache for the objects. */
>  		cache_objs = &cache->objs[cache->len];
>  		cache->len += n;
> -	} else if (n <= cache->flushthresh) {
> +	} else if (n <= cache->size / 2) {
>  		/*
> -		 * The cache is big enough for the objects, but - as
> detected by
> -		 * the comparison above - has insufficient room for them.
> -		 * Flush the cache to make room for the objects.
> +		 * The number of objects is within the cache bounce buffer
> limit,
> +		 * but - as detected by the comparison above - the cache
> has
> +		 * insufficient room for them.
> +		 * Flush the cache to the backend to make room for the
> objects;
> +		 * flush (size / 2) objects from the bottom of the cache,
> where
> +		 * objects are less hot, and move down the remaining
> objects, which
> +		 * are more hot, from the upper half of the cache.
>  		 */
> -		cache_objs = &cache->objs[0];
> -		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> -		cache->len = n;
> +		__rte_assume(cache->len > cache->size / 2);
> +		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
> +		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> +				sizeof(void *) * (cache->len - cache->size / 2));
> +		cache_objs = &cache->objs[cache->len - cache->size / 2];
> +		cache->len = cache->len - cache->size / 2 + n;
>  	} else {
>  		/* The request itself is too big for the cache. */
>  		goto driver_enqueue_stats_incremented;
> @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	/* The cache is a stack, so copy will be in reverse order. */
>  	cache_objs = &cache->objs[cache->len];
> 
> -	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> +	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
>  	if (likely(n <= cache->len)) {
>  		/* The entire request can be satisfied from the cache. */
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
> 
> -	/* Dequeue below would overflow mem allocated for cache? */
> -	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> +	/* Dequeue below would exceed the cache bounce buffer limit? */
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	if (unlikely(remaining > cache->size / 2))
>  		goto driver_dequeue;
> 
> -	/* Fill the cache from the backend; fetch size + remaining
> objects. */
> -	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> -			cache->size + remaining);
> +	/* Fill the cache from the backend; fetch (size / 2) objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
>  	if (unlikely(ret < 0)) {
>  		/*
>  		 * We are buffer constrained, and not able to fetch all
> that.
> @@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
>  	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> 
> -	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> -	cache_objs = &cache->objs[cache->size + remaining];
> -	cache->len = cache->size;
> +	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> +	__rte_assume(remaining <= cache->size / 2);
> +	cache_objs = &cache->objs[cache->size / 2];
> +	cache->len = cache->size / 2 - remaining;
>  	for (index = 0; index < remaining; index++)
>  		*obj_table++ = *--cache_objs;
> 
> --
> 2.43.0


^ permalink raw reply	[flat|nested] 46+ messages in thread
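
The get-path hunks quoted above can be modelled the same way. This is a
sketch under stated assumptions, not the DPDK code: struct get_cache,
get_backend_dequeue() and model_get() are hypothetical names, the backend
hands out increasing ints and never fails, and the error handling of the
real function is left out.

```c
#include <assert.h>
#include <stdint.h>

#define GET_MODEL_MAX 16 /* hypothetical stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

struct get_cache {
	uint32_t size;            /* configured cache size */
	uint32_t len;             /* current number of cached objects */
	int objs[GET_MODEL_MAX];  /* stack of objects; index 0 is the bottom */
};

/* Hypothetical backend: hands out increasing ints and counts its calls. */
static int get_backend_next;
static uint32_t get_backend_calls;

static void get_backend_dequeue(int *objs, uint32_t n)
{
	get_backend_calls++;
	for (uint32_t i = 0; i < n; i++)
		objs[i] = get_backend_next++;
}

static void model_get(struct get_cache *c, int *out, uint32_t n)
{
	uint32_t len, remaining, i;
	int *src;

	assert(c->size <= GET_MODEL_MAX && c->len <= c->size);

	/* The cache is a stack, so objects are copied from the top down. */
	src = &c->objs[c->len];
	len = n <= c->len ? n : c->len;
	for (i = 0; i < len; i++)
		*out++ = *--src;
	c->len -= len;
	remaining = n - len;
	if (remaining == 0)
		return; /* The entire request was satisfied from the cache. */

	if (remaining > c->size / 2) {
		/* Exceeds the bounce buffer limit; fetch directly to the caller. */
		get_backend_dequeue(out, remaining);
		return;
	}

	/* The cache is empty here; refill it with size / 2 objects. */
	get_backend_dequeue(c->objs, c->size / 2);
	src = &c->objs[c->size / 2];
	c->len = c->size / 2 - remaining;
	for (i = 0; i < remaining; i++)
		*out++ = *--src;
}
```

In this model, with size = 8 and an empty cache, getting 2 objects fetches
4 from the backend and leaves 2 in the cache; the next get of 2 is served
from the cache, and a get of 5 from the empty cache goes straight to the
backend.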

* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-26 10:37         ` Morten Brørup
@ 2026-05-26 17:45           ` Morten Brørup
  2026-05-27  8:48             ` Bruce Richardson
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-05-26 17:45 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Tuesday, 26 May 2026 12.37
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 26 May 2026 11.40
> >

[...]

> > [In all this, I am making the assumption that burst size is well less
> > than
> > cache size. Also, similar logic would be applicable for the inverse
> > scenario, e.g. flush to empty (and fill burst) and fill to 75%]
> 
> I'm not so sure about this assumption.
> With a cache size of 512 and bursts of 64, the cache only holds 8
> bursts.
> 50% is 4 bursts, and 25% is only 2 bursts.
> 
> Using a replenish/drain level in the middle requires 5 bursts in either
> direction to pass the edge (and trigger replenish/flush).
> Using a replenish/drain level 25% from the edge requires only 3 bursts
> in the wrong direction to pass the edge (and trigger replenish/flush).
> Much higher probability with random get/put.
> 
> >
> > Now, all said, I tend to agree that we want to leave space for a
> decent
> > size burst after a fill. That is why I think that filling to 75% is
> > reasonable. After an alloc that triggers a fill, I don't want the
> cache
> > less than 50% full, but not completely full so there is room for a
> free
> > without a flush, and similarly for a free that triggers a flush, the
> > cache
> > should not be empty, but also should not be more than half full.
> >
> > One suggestion - we could always add a simple tunable that specifies
> > the
> > margin, or reserved entries for alloc and free. We can then guide in
> > the
> > docs that the value should be e.g. "zero for apps where alloc and
> free
> > take
> > place on different cores. 20%-50% of cache is recommended where alloc
> > and
> > free take place on the same core"
> 
> Yes, a simple tunable is a really good idea.
> 
> At this point, I think we should optimize for use case #1, and go for
> the 50% fill level.
> Then we can add a tunable to optimize for use case #2 later. I will try
> to come up with a draft for such a follow-up patch within the next few
> days.

Adding a tunable is not so simple...
The choice of mempool cache algorithm (drain/replenish to 50% vs. drain/replenish completely) should be passed via the "flags" parameter in rte_mempool_create(), but rte_pktmbuf_pool_create() is missing the "flags" parameter.
We can add it at the next ABI breaking release.
WDYT?

We should also use that change as an opportunity to move the slow path, where the request is not handled entirely by the cache, into non-inlined functions, so that the inlined functions do not grow too much when they have to support two different algorithms.

> 
> The 50% fill level in this patch is not as bad for use case #2 (roughly
> doubling the burst miss rate from 1/8 to 1/4), compared to how bad the
> original algorithm is for use case #1 (very high miss probability -
> only two ops in the wrong direction - after drain/replenish).
> 
> -Morten


^ permalink raw reply	[flat|nested] 46+ messages in thread
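
The burst arithmetic quoted in the message above (cache size 512, bursts
of 64) can be checked with two small helpers. Both function names are
hypothetical and only count how many consecutive same-direction bursts it
takes, starting from a given fill level, before a put no longer fits or a
get finds too few objects; they assume a fixed burst size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of consecutive puts of 'burst' objects, starting at 'level',
 * until a put no longer fits in a cache of 'size' objects,
 * i.e. the smallest k where level + k * burst > size.
 */
static uint32_t bursts_until_flush(uint32_t size, uint32_t level, uint32_t burst)
{
	assert(burst > 0 && level <= size);
	return (size - level) / burst + 1;
}

/*
 * Number of consecutive gets of 'burst' objects, starting at 'level',
 * until a get finds too few objects in the cache,
 * i.e. the smallest k where k * burst > level.
 */
static uint32_t bursts_until_refill(uint32_t level, uint32_t burst)
{
	assert(burst > 0);
	return level / burst + 1;
}
```

With size 512 and burst 64, a level of 256 gives 5 bursts in either
direction. A level of 384, which is 25% below the top edge, gives 3 puts
before a flush and 7 gets before a refill; the mirrored level of 128 gives
3 gets before a refill.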

* Re: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-26 17:45           ` Morten Brørup
@ 2026-05-27  8:48             ` Bruce Richardson
  2026-05-27  9:22               ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2026-05-27  8:48 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

On Tue, May 26, 2026 at 07:45:24PM +0200, Morten Brørup wrote:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Tuesday, 26 May 2026 12.37
> > 
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Tuesday, 26 May 2026 11.40
> > >
> 
> [...]
> 
> > > [In all this, I am making the assumption that burst size is well less
> > > than
> > > cache size. Also, similar logic would be applicable for the inverse
> > > scenario, e.g. flush to empty (and fill burst) and fill to 75%]
> > 
> > I'm not so sure about this assumption.
> > With a cache size of 512 and bursts of 64, the cache only holds 8
> > bursts.
> > 50% is 4 bursts, and 25% is only 2 bursts.
> > 
> > Using a replenish/drain level in the middle requires 5 bursts in either
> > direction to pass the edge (and trigger replenish/flush).
> > Using a replenish/drain level 25% from the edge requires only 3 bursts
> > in the wrong direction to pass the edge (and trigger replenish/flush).
> > Much higher probability with random get/put.
> > 
> > >
> > > Now, all said, I tend to agree that we want to leave space for a
> > decent
> > > size burst after a fill. That is why I think that filling to 75% is
> > > reasonable. After an alloc that triggers a fill, I don't want the
> > cache
> > > less than 50% full, but not completely full so there is room for a
> > free
> > > without a flush, and similarly for a free that triggers a flush, the
> > > cache
> > > should not be empty, but also should not be more than half full.
> > >
> > > One suggestion - we could always add a simple tunable that specifies
> > > the
> > > margin, or reserved entries for alloc and free. We can then guide in
> > > the
> > > docs that the value should be e.g. "zero for apps where alloc and
> > free
> > > take
> > > place on different cores. 20%-50% of cache is recommended where alloc
> > > and
> > > free take place on the same core"
> > 
> > Yes, a simple tunable is a really good idea.
> > 
> > At this point, I think we should optimize for use case #1, and go for
> > the 50% fill level.
> > Then we can add a tunable to optimize for use case #2 later. I will try
> > to come up with a draft for such a follow-up patch within the next few
> > days.
> 
> Adding a tunable is not so simple...
> The choice of mempool cache algorithm (drain/replenish to 50% vs. drain/replenish completely) should be passed via the "flags" parameter in rte_mempool_create(), but rte_pktmbuf_pool_create() is missing the "flags" parameter.
> We can add it at the next ABI breaking release.
> WDYT?
> 
I don't want this to be just a binary flag with two settings; I think it
should be an actual numeric value. Can we not use function versioning to
add the new parameter to all functions needing it, without worrying about
ABI breakage?

/Bruce

^ permalink raw reply	[flat|nested] 46+ messages in thread
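
One possible reading of the numeric margin discussed above, sketched
purely for illustration: no such tunable exists in the patch, and the
function names and the choice of expressing the margin in cache entries
are assumptions. The margin is taken as the number of entries kept free
after a refill, and kept available after a flush.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical: fill level targeted by a refill, leaving 'margin' free
 * entries so that a following put does not immediately trigger a flush.
 */
static uint32_t level_after_refill(uint32_t size, uint32_t margin)
{
	assert(margin <= size / 2);
	return size - margin;
}

/*
 * Hypothetical: fill level targeted by a flush, leaving 'margin' objects
 * so that a following get does not immediately trigger a refill.
 */
static uint32_t level_after_flush(uint32_t size, uint32_t margin)
{
	assert(margin <= size / 2);
	return margin;
}
```

Under this reading, a margin of size / 2 gives the 50% level used in the
patch, a margin of size / 4 gives the "fill to 75%" variant, and a margin
of zero gives "fill completely, flush to empty" for applications where
alloc and free take place on different cores.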

* RE: [PATCH v5] mempool: improve cache behaviour and performance
  2026-05-27  8:48             ` Bruce Richardson
@ 2026-05-27  9:22               ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-27  9:22 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 27 May 2026 10.48
> 
> On Tue, May 26, 2026 at 07:45:24PM +0200, Morten Brørup wrote:
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Tuesday, 26 May 2026 12.37
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Tuesday, 26 May 2026 11.40
> > > >
> >
> > [...]
> >
> > > > [In all this, I am making the assumption that burst size is well
> less
> > > > than
> > > > cache size. Also, similar logic would be applicable for the
> inverse
> > > > scenario, e.g. flush to empty (and fill burst) and fill to 75%]
> > >
> > > I'm not so sure about this assumption.
> > > With a cache size of 512 and a bursts of 64, the cache only holds 8
> > > bursts.
> > > 50% is 4 bursts, and 25% is only 2 bursts.
> > >
> > > Using a replenish/drain level in the middle requires 5 bursts in
> either
> > > direction to pass the edge (and trigger replenish/flush).
> > > Using a replenish/drain level 25% from the edge requires only 3
> bursts
> > > in the wrong direction to pass the edge (and trigger
> replenish/flush).
> > > Much higher probability with random get/put.
> > >
> > > >
> > > > Now, all said, I tend to agree that we want to leave space for a
> > > decent
> > > > size burst after a fill. That is why I think that filling to 75%
> is
> > > > reasonable. After an alloc that triggers a fill, I don't want the
> > > cache
> > > > less than 50% full, but not completely full so there is room for
> a
> > > free
> > > > without a flush, and similarly for a free that triggers a flush,
> the
> > > > cache
> > > > should not be empty, but also should not be more than half full.
> > > >
> > > > One suggestion - we could always add a simple tunable that
> specifies
> > > > the
> > > > margin, or reserved entries for alloc and free. We can then guide
> in
> > > > the
> > > > docs that the value should be e.g. "zero for apps where alloc and
> > > free
> > > > take
> > > > place on different cores. 20%-50% of cache is recommended where
> alloc
> > > > and
> > > > free take place on the same core"
> > >
> > > Yes, a simple tunable is a really good idea.
> > >
> > > At this point, I think we should optimize for use case #1, and go
> for
> > > the 50% fill level.
> > > Then we can add a tunable to optimize for use case #2 later. I will
> try
> > > to come up with a draft for such a follow-up patch within the next
> few
> > > days.
> >
> > Adding a tunable is not so simple...
> > The choice of mempool cache algorithm (drain/replenish to 50% vs.
> drain/replenish completely) should be passed via the "flags" parameter
> in rte_mempool_create(), but rte_pktmbuf_pool_create() is missing the
> "flags" parameter.
> > We can add it at the next ABI breaking release.
> > WDYT?
> >
> I don't want this just a binary flag with two settings, I think it
> should be an actual numeric value.

If it were plain and simple to support a numeric value, I'd do it.
But the two algorithms differ too much.
If needed, the flag can be used as an enum to support more algorithms in the future.

My WIP (not even build tested), where the cache->flags field selects between the two algorithms, looks like this:

static __rte_always_inline void
rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
			   unsigned int n, struct rte_mempool_cache *cache)
{
	void **cache_objs;

	/* No cache provided? */
	if (unlikely(cache == NULL))
		goto driver_enqueue;

	/* Increment stats now, adding in mempool always succeeds. */
	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);

	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
	__rte_assume(cache->len <= cache->size);
	if (likely(cache->len + n <= cache->size)) {
		/* Sufficient room in the cache for the objects. */
		cache_objs = &cache->objs[cache->len];
		cache->len += n;

		/* Add the objects to the cache. */
		rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);

		return;
	}

	/* Insufficient room in the cache for the objects. */

	if (cache->flags & RTE_MEMPOOL_F_CACHE_ACCESS_ONE_WAY) {
		/* The algorithm is optimized for put or get operations only. */
		/* The request itself exceeds the cache bounce buffer limit? */
		__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
		if (n > cache->size)
			goto driver_enqueue_stats_incremented;

		/* Fill the cache completely by adding the first part of the objects. */
		cache_objs = &cache->objs[cache->len];
		rte_memcpy(cache_objs, obj_table,
				sizeof(void *) * (cache->size - cache->len));
		obj_table += cache->size - cache->len;

		/* Flush the entire cache to the backend. */
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size);

		/* Add the remaining objects to the cache. */
		cache->len = n - (cache->size - cache->len);
		rte_memcpy(&cache->objs[0], obj_table, sizeof(void *) * cache->len);
	} else {
		/* The algorithm is optimized for a balanced mix of put and get operations. */
		/* The request itself exceeds the cache bounce buffer limit? */
		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
		if (n > cache->size / 2)
			goto driver_enqueue_stats_incremented;

		/*
		 * Flush part of the cache to the backend to make room for the objects;
		 * flush (size / 2) objects from the bottom of the cache, where
		 * objects are less hot, and move down the remaining objects, which
		 * are more hot, from the upper half of the cache.
		 */
		__rte_assume(cache->len > cache->size / 2);
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
				sizeof(void *) * (cache->len - cache->size / 2));
		cache_objs = &cache->objs[cache->len - cache->size / 2];
		cache->len = cache->len - cache->size / 2 + n;

		/* Add the objects to the cache. */
		rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
	}

	return;

driver_enqueue:

	/* Increment stats now, adding in mempool always succeeds. */
	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);

driver_enqueue_stats_incremented:

	/* push objects to the backend */
	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
}
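
For illustration only, the balanced (half-flush) branch can be modelled with a toy cache outside of DPDK. This is a sketch under stated assumptions, not the WIP code itself: toy_cache, toy_put(), backend_enqueue() and TOY_CACHE_SIZE are hypothetical names, the backend is just a pair of counters, and memmove() is used instead of rte_memcpy() because source and destination can overlap by one entry when the size is odd.

```c
#include <assert.h>
#include <string.h>

#define TOY_CACHE_SIZE 8	/* hypothetical; stands in for cache->size */

/* Toy stand-in for the mempool cache; not the real rte_mempool_cache. */
struct toy_cache {
	unsigned int size;
	unsigned int len;
	void *objs[TOY_CACHE_SIZE];
};

/* Fake backend: only counts what it receives. */
static unsigned int backend_objs;
static unsigned int backend_calls;

static void
backend_enqueue(void * const *objs, unsigned int n)
{
	(void)objs;
	backend_objs += n;
	backend_calls++;
}

static void
toy_put(struct toy_cache *c, void * const *obj_table, unsigned int n)
{
	if (c->len + n <= c->size) {
		/* Sufficient room in the cache for the objects. */
		memcpy(&c->objs[c->len], obj_table, sizeof(void *) * n);
		c->len += n;
		return;
	}

	if (n > c->size / 2) {
		/* The request exceeds the bounce buffer limit; bypass the cache. */
		backend_enqueue(obj_table, n);
		return;
	}

	/* Follows from the two checks above. */
	assert(c->len > c->size / 2);

	/* Flush the colder bottom half, then move the hotter objects down. */
	backend_enqueue(&c->objs[0], c->size / 2);
	memmove(&c->objs[0], &c->objs[c->size / 2],
			sizeof(void *) * (c->len - c->size / 2));
	c->len -= c->size / 2;

	/* Add the new objects on top. */
	memcpy(&c->objs[c->len], obj_table, sizeof(void *) * n);
	c->len += n;
}
```

With a size of 8 and bursts of 2, four puts fill the toy cache, and the fifth put flushes 4 objects to the backend and leaves 6 objects in the cache, so the next burst can be either a get or a put without touching the backend.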



> Can we not use function versioning to add the
> new parameter to all functions needing it, without worrying about ABI
> breakage.

The public structures will also change, so I don't think so.

> 
> /Bruce


* [PATCH v6] net/idpf: update for new mempool cache algorithm
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (5 preceding siblings ...)
  2026-05-26 14:00 ` [PATCH v6] " Morten Brørup
@ 2026-05-27 11:36 ` Morten Brørup
  2026-05-27 11:36   ` [PATCH v6] mempool/dpaa: " Morten Brørup
  2026-05-27 11:36   ` [PATCH v6] mempool/dpaa2: " Morten Brørup
  2026-06-01 16:40 ` [PATCH v7] mempool: improve cache behaviour and performance Morten Brørup
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-27 11:36 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

As a consequence of the improved mempool cache algorithm, the PMD was
updated regarding how much to backfill the mempool cache in the AVX512
code path.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
Depends-on: patch-164427 ("mempool: improve cache behaviour and performance")
---
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++++++----
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 9af275cd9d..dd2263b8d7 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/*
+		 * Backfill the cache from the backend;
+		 * move up the hot objects in the cache to the top half of the cache,
+		 * and fetch (size / 2) objects to the bottom of the cache.
+		 */
+		__rte_assume(cache->len < cache->size / 2);
+		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
+				sizeof(void *) * cache->len);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[0], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
+			/*
+			 * No further action is required for roll back, as the objects moved
+			 * in the cache were actually copied, and the cache remains intact.
+			 */
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
 				__m128i dma_addr0;
@@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_splitq_rearm_common(rx_bufq);
+			return;
+		}
+
+		/*
+		 * Backfill the cache from the backend;
+		 * move up the hot objects in the cache to the top half of the cache,
+		 * and fetch (size / 2) objects to the bottom of the cache.
+		 */
+		__rte_assume(cache->len < cache->size / 2);
+		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
+				sizeof(void *) * cache->len);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rx_bufq->mp, &cache->objs[cache->len], req);
+				(rx_bufq->mp, &cache->objs[0], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
+			/*
+			 * No further action is required for roll back, as the objects moved
+			 * in the cache were actually copied, and the cache remains intact.
+			 */
 			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rx_bufq->nb_rx_desc) {
 				__m128i dma_addr0;
-- 
2.43.0
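
For readers following the thread, the backfill step in the diff above (move the hot objects up, fetch size / 2 objects to the bottom, and leave cache->len untouched on failure) can be modelled outside of DPDK. This is a sketch with hypothetical names (toy_cache, toy_get(), backend_dequeue(), toy_pool), assuming an all-or-nothing bulk dequeue; it is not the PMD code.

```c
#include <assert.h>
#include <string.h>

#define TOY_CACHE_SIZE 8	/* hypothetical; stands in for cache->size */
#define TOY_POOL_SIZE 64	/* hypothetical backend capacity */

/* Toy stand-in for the mempool cache; not the real rte_mempool_cache. */
struct toy_cache {
	unsigned int size;
	unsigned int len;
	void *objs[TOY_CACHE_SIZE];
};

/* Fake backend: hands out pointers into a static array, all or nothing. */
static int toy_pool[TOY_POOL_SIZE];
static unsigned int toy_pool_used;

static int
backend_dequeue(void **objs, unsigned int n)
{
	unsigned int i;

	if (toy_pool_used + n > TOY_POOL_SIZE)
		return -1;
	for (i = 0; i < n; i++)
		objs[i] = &toy_pool[toy_pool_used++];
	return 0;
}

static int
toy_get(struct toy_cache *c, void **obj_table, unsigned int n)
{
	unsigned int i;

	if (n > c->len) {
		/* The request exceeds the bounce buffer limit; bypass the cache. */
		if (n > c->size / 2)
			return backend_dequeue(obj_table, n);

		/* Follows from the two checks above, so the copy cannot overlap. */
		assert(c->len < c->size / 2);

		/* Copy the hot objects up, then fetch size / 2 to the bottom. */
		memcpy(&c->objs[c->size / 2], &c->objs[0],
				sizeof(void *) * c->len);
		if (backend_dequeue(&c->objs[0], c->size / 2) != 0)
			return -1;	/* len is unchanged, so the cache is intact */
		c->len += c->size / 2;
	}

	/* Serve the request from the top of the cache, hottest first. */
	for (i = 0; i < n; i++)
		obj_table[i] = c->objs[--c->len];
	return 0;
}
```

The failure path needs no roll back in this model for the same reason given in the patch comment: the hot objects were copied, not moved, and the length was not updated.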



* [PATCH v6] mempool/dpaa: update for new mempool cache algorithm
  2026-05-27 11:36 ` [PATCH v6] net/idpf: update for new mempool cache algorithm Morten Brørup
@ 2026-05-27 11:36   ` Morten Brørup
  2026-05-27 11:36   ` [PATCH v6] mempool/dpaa2: " Morten Brørup
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-27 11:36 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

As a consequence of the improved mempool cache algorithm, the mempool
driver was updated to not modify the mempool cache's flushthresh field,
which is now obsolete, and modifying it has no effect.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
Depends-on: patch-164427 ("mempool: improve cache behaviour and performance")
---
 drivers/mempool/dpaa/dpaa_mempool.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
index 2f9395b3f4..2f8555a026 100644
--- a/drivers/mempool/dpaa/dpaa_mempool.c
+++ b/drivers/mempool/dpaa/dpaa_mempool.c
@@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	struct bman_pool_params params = {
 		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
 	};
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 
 	MEMPOOL_INIT_FUNC_TRACE();
 
@@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
 		   sizeof(struct dpaa_bp_info));
 	mp->pool_data = (void *)bp_info;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
-	}
 
 	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
 	return 0;
-- 
2.43.0



* [PATCH v6] mempool/dpaa2: update for new mempool cache algorithm
  2026-05-27 11:36 ` [PATCH v6] net/idpf: update for new mempool cache algorithm Morten Brørup
  2026-05-27 11:36   ` [PATCH v6] mempool/dpaa: " Morten Brørup
@ 2026-05-27 11:36   ` Morten Brørup
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-27 11:36 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

As a consequence of the improved mempool cache algorithm, the mempool
driver was updated to not modify the mempool cache's flushthresh field,
which is now obsolete, and modifying it has no effect.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
Depends-on: patch-164427 ("mempool: improve cache behaviour and performance")
---
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 02b6741853..ee001d8ce0 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	struct dpaa2_bp_info *bp_info;
 	struct dpbp_attr dpbp_attr;
 	uint32_t bpid;
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 	int ret;
 
 	avail_dpbp = dpaa2_alloc_dpbp_dev();
@@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
 
 	h_bp_list = bp_list;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
-	}
 
 	return 0;
 err4:
-- 
2.43.0



* Re: [PATCH v6] mempool: improve cache behaviour and performance
  2026-05-26 14:00 ` [PATCH v6] " Morten Brørup
  2026-05-26 16:00   ` Morten Brørup
@ 2026-05-29  8:53   ` fengchengwen
  2026-05-29 11:43     ` Morten Brørup
  1 sibling, 1 reply; 46+ messages in thread
From: fengchengwen @ 2026-05-29  8:53 UTC (permalink / raw)
  To: Morten Brørup, dev, Andrew Rybchenko, Bruce Richardson,
	Jingjing Wu, Praveen Shetty, Hemant Agrawal, Sachin Saxena

On 5/26/2026 10:00 PM, Morten Brørup wrote:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
> 
> 1.
> The actual cache size was 1.5 times the cache size specified at run-time
> mempool creation.
> This was obviously not expected by application developers.
> 
> 2.
> In get operations, the check for when to use the cache as bounce buffer
> did not respect the run-time configured cache size,
> but compared to the build time maximum possible cache size
> (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> E.g. with a configured cache size of 32 objects, getting 256 objects
> would first fetch 32 + 256 = 288 objects into the cache,
> and then move the 256 objects from the cache to the destination memory,
> instead of fetching the 256 objects directly to the destination memory.
> This had a performance cost.
> However, this is unlikely to occur in real applications, so it is not
> important in itself.
> 
> 3.
> When putting objects into a mempool, and the mempool cache did not have
> free space for so many objects,
> the cache was flushed completely, and the new objects were then put into
> the cache.
> I.e. the cache drain level was zero.
> This (complete cache flush) meant that a subsequent get operation (with
> the same number of objects) completely emptied the cache,
> so another subsequent get operation required replenishing the cache.
> 
> Similarly,
> When getting objects from a mempool, and the mempool cache did not hold so
> many objects,
> the cache was replenished to cache->size + remaining objects,
> and then (the remaining part of) the requested objects were fetched via
> the cache,
> which left the cache filled (to cache->size) at completion.
> I.e. the cache refill level was cache->size (plus some, depending on
> request size).
> 
> (1) was improved by generally comparing to cache->size instead of
> cache->flushthresh, when considering the capacity of the cache.
> The cache->flushthresh field is kept for API/ABI compatibility purposes,
> and initialized to cache->size instead of cache->size * 1.5.
> 
> (2) was improved by generally comparing to cache->size / 2 instead of
> RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> 
> (3) was improved by flushing and replenishing the cache by half its size,
> so a flush/refill can be followed randomly by get or put requests.
> This also reduced the number of objects in each flush/refill operation.
> 
> As a consequence of these changes, the size of the array holding the
> objects in the cache (cache->objs[]) no longer needs to be
> 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> Performance data:
> With a real WAN Optimization application, where the number of allocated
> packets varies (as they are held in e.g. shaper queues), the mempool
> cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> This was deployed in production at an ISP, and using an effective cache
> size of 384 objects.

Does the application run in RTC (run-to-completion) mode?
What about a pipeline model, where one thread receives packets from the NIC
and enqueues them to a ring, and another worker thread dequeues the packets,
processes them and then frees the mbufs?




* RE: [PATCH v6] mempool: improve cache behaviour and performance
  2026-05-29  8:53   ` fengchengwen
@ 2026-05-29 11:43     ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-05-29 11:43 UTC (permalink / raw)
  To: fengchengwen, dev, Andrew Rybchenko, Bruce Richardson,
	Jingjing Wu, Praveen Shetty, Hemant Agrawal, Sachin Saxena

> From: fengchengwen [mailto:fengchengwen@huawei.com]
> Sent: Friday, 29 May 2026 10.54
> 
> On 5/26/2026 10:00 PM, Morten Brørup wrote:
> > This patch refactors the mempool cache to eliminate some unexpected
> > behaviour and reduce the mempool cache miss rate.
> >
> > 1.
> > The actual cache size was 1.5 times the cache size specified at run-
> time
> > mempool creation.
> > This was obviously not expected by application developers.
> >
> > 2.
> > In get operations, the check for when to use the cache as bounce
> buffer
> > did not respect the run-time configured cache size,
> > but compared to the build time maximum possible cache size
> > (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> > E.g. with a configured cache size of 32 objects, getting 256 objects
> > would first fetch 32 + 256 = 288 objects into the cache,
> > and then move the 256 objects from the cache to the destination
> memory,
> > instead of fetching the 256 objects directly to the destination
> memory.
> > This had a performance cost.
> > However, this is unlikely to occur in real applications, so it is not
> > important in itself.
> >
> > 3.
> > When putting objects into a mempool, and the mempool cache did not
> have
> > free space for so many objects,
> > the cache was flushed completely, and the new objects were then put
> into
> > the cache.
> > I.e. the cache drain level was zero.
> > This (complete cache flush) meant that a subsequent get operation
> (with
> > the same number of objects) completely emptied the cache,
> > so another subsequent get operation required replenishing the cache.
> >
> > Similarly,
> > When getting objects from a mempool, and the mempool cache did not
> hold so
> > many objects,
> > the cache was replenished to cache->size + remaining objects,
> > and then (the remaining part of) the requested objects were fetched
> via
> > the cache,
> > which left the cache filled (to cache->size) at completion.
> > I.e. the cache refill level was cache->size (plus some, depending on
> > request size).
> >
> > (1) was improved by generally comparing to cache->size instead of
> > cache->flushthresh, when considering the capacity of the cache.
> > The cache->flushthresh field is kept for API/ABI compatibility
> purposes,
> > and initialized to cache->size instead of cache->size * 1.5.
> >
> > (2) was improved by generally comparing to cache->size / 2 instead of
> > RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> >
> > (3) was improved by flushing and replenishing the cache by half its
> size,
> > so a flush/refill can be followed randomly by get or put requests.
> > This also reduced the number of objects in each flush/refill
> operation.
> >
> > As a consequence of these changes, the size of the array holding the
> > objects in the cache (cache->objs[]) no longer needs to be
> > 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> > RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> >
> > Performance data:
> > With a real WAN Optimization application, where the number of
> allocated
> > packets varies (as they are held in e.g. shaper queues), the mempool
> > cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> > This was deployed in production at an ISP, and using an effective
> cache
> > size of 384 objects.
> 
> Does the application run as a RTC (run-to-complete) mode?

Yes, the application runs in RTC mode.

> How about pipeline model which NIC recv packets and enqueue ring,
> another
> work thread dequeue packets, process packets and then free packets
> mbuf?
> 

If one thread only receives packets (mempool get) and another thread only transmits/frees (mempool put), the mempool cache miss rate of each thread roughly doubles.
But the number of objects copied to/from the backend per cache miss roughly halves, to exactly size/2.
And the backend copy operations become CPU cache aligned (assuming all transactions with the backend go via the mempool cache).

The release notes mention that such pipelined applications should double their configured mempool cache size.
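
As a rough illustration of the put-only thread in such a pipeline, the half-flush behaviour can be modelled with plain counters. This is a sketch, not DPDK code: model_put_only() and struct flush_stats are hypothetical names, and the model assumes a fixed burst size of at most size / 2.

```c
#include <assert.h>

/* Result of the toy model; all names here are hypothetical. */
struct flush_stats {
	unsigned int flushes;		/* number of backend enqueue operations */
	unsigned int flushed_objs;	/* objects moved to the backend in total */
	unsigned int final_len;		/* cache fill level after the last burst */
};

/* Model a thread that only puts objects, using the half-flush algorithm. */
static struct flush_stats
model_put_only(unsigned int size, unsigned int burst, unsigned int bursts)
{
	struct flush_stats st = { 0, 0, 0 };
	unsigned int len = 0;
	unsigned int i;

	/* Larger bursts would bypass the cache entirely. */
	assert(burst <= size / 2);

	for (i = 0; i < bursts; i++) {
		if (len + burst > size) {
			/* Cache miss: flush exactly size / 2 objects. */
			st.flushes++;
			st.flushed_objs += size / 2;
			len -= size / 2;
		}
		len += burst;
	}

	st.final_len = len;
	return st;
}
```

With size = 512 and burst = 64, the first 8 bursts fill the cache; after that, every fourth burst triggers a flush of 256 objects, i.e. one backend operation per size / 2 objects put.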



* Re: [PATCH v6] mempool: improve cache behaviour and performance
  2026-05-26 16:00   ` Morten Brørup
@ 2026-06-01 13:36     ` Thomas Monjalon
  2026-06-01 13:51       ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Monjalon @ 2026-06-01 13:36 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena

26/05/2026 18:00, Morten Brørup:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Tuesday, 26 May 2026 16.00
> > 
> > This patch refactors the mempool cache to eliminate some unexpected
> > behaviour and reduce the mempool cache miss rate.
> > 
> > 1.
> > The actual cache size was 1.5 times the cache size specified at run-
> > time
> > mempool creation.
> > This was obviously not expected by application developers.
> > 
> > 2.
> > In get operations, the check for when to use the cache as bounce buffer
> > did not respect the run-time configured cache size,
> > but compared to the build time maximum possible cache size
> > (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> > E.g. with a configured cache size of 32 objects, getting 256 objects
> > would first fetch 32 + 256 = 288 objects into the cache,
> > and then move the 256 objects from the cache to the destination memory,
> > instead of fetching the 256 objects directly to the destination memory.
> > This had a performance cost.
> > However, this is unlikely to occur in real applications, so it is not
> > important in itself.
> > 
> > 3.
> > When putting objects into a mempool, and the mempool cache did not have
> > free space for so many objects,
> > the cache was flushed completely, and the new objects were then put
> > into
> > the cache.
> > I.e. the cache drain level was zero.
> > This (complete cache flush) meant that a subsequent get operation (with
> > the same number of objects) completely emptied the cache,
> > so another subsequent get operation required replenishing the cache.
> > 
> > Similarly,
> > When getting objects from a mempool, and the mempool cache did not hold
> > so
> > many objects,
> > the cache was replenished to cache->size + remaining objects,
> > and then (the remaining part of) the requested objects were fetched via
> > the cache,
> > which left the cache filled (to cache->size) at completion.
> > I.e. the cache refill level was cache->size (plus some, depending on
> > request size).
> > 
> > (1) was improved by generally comparing to cache->size instead of
> > cache->flushthresh, when considering the capacity of the cache.
> > The cache->flushthresh field is kept for API/ABI compatibility
> > purposes,
> > and initialized to cache->size instead of cache->size * 1.5.
> > 
> > (2) was improved by generally comparing to cache->size / 2 instead of
> > RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> > 
> > (3) was improved by flushing and replenishing the cache by half its
> > size,
> > so a flush/refill can be followed randomly by get or put requests.
> > This also reduced the number of objects in each flush/refill operation.
> > 
> > As a consequence of these changes, the size of the array holding the
> > objects in the cache (cache->objs[]) no longer needs to be
> > 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> > RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.

I'm not sure why we should wait?


> > Performance data:
> > With a real WAN Optimization application, where the number of allocated
> > packets varies (as they are held in e.g. shaper queues), the mempool
> > cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> > This was deployed in production at an ISP, and using an effective cache
> > size of 384 objects.
> > 
> > Bugzilla ID: 1027
> > Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Forgot carrying an Ack over from v5:
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> 
> > ---
> > Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for mbuf
> > fast-free")
> 
> This dependency seems to cause CI apply failures.
> The dependency is based on an older snapshot of main,
> and this patch is based on a new snapshot of main.

The dependency should be resolved now.
Please could you send a v7?




* RE: [PATCH v6] mempool: improve cache behaviour and performance
  2026-06-01 13:36     ` Thomas Monjalon
@ 2026-06-01 13:51       ` Morten Brørup
  2026-06-01 14:19         ` Thomas Monjalon
  0 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-06-01 13:51 UTC (permalink / raw)
  To: Thomas Monjalon, Bruce Richardson
  Cc: dev, Andrew Rybchenko, Jingjing Wu, Praveen Shetty,
	Hemant Agrawal, Sachin Saxena

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Monday, 1 June 2026 15.36
> 
> 26/05/2026 18:00, Morten Brørup:
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Tuesday, 26 May 2026 16.00
> > >
> > > This patch refactors the mempool cache to eliminate some unexpected
> > > behaviour and reduce the mempool cache miss rate.
> > >
> > > 1.
> > > The actual cache size was 1.5 times the cache size specified at
> run-
> > > time
> > > mempool creation.
> > > This was obviously not expected by application developers.
> > >
> > > 2.
> > > In get operations, the check for when to use the cache as bounce
> buffer
> > > did not respect the run-time configured cache size,
> > > but compared to the build time maximum possible cache size
> > > (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> > > E.g. with a configured cache size of 32 objects, getting 256
> objects
> > > would first fetch 32 + 256 = 288 objects into the cache,
> > > and then move the 256 objects from the cache to the destination
> memory,
> > > instead of fetching the 256 objects directly to the destination
> memory.
> > > This had a performance cost.
> > > However, this is unlikely to occur in real applications, so it is
> not
> > > important in itself.
> > >
> > > 3.
> > > When putting objects into a mempool, and the mempool cache did not
> have
> > > free space for so many objects,
> > > the cache was flushed completely, and the new objects were then put
> > > into
> > > the cache.
> > > I.e. the cache drain level was zero.
> > > This (complete cache flush) meant that a subsequent get operation
> (with
> > > the same number of objects) completely emptied the cache,
> > > so another subsequent get operation required replenishing the
> cache.
> > >
> > > Similarly,
> > > When getting objects from a mempool, and the mempool cache did not
> hold
> > > so
> > > many objects,
> > > the cache was replenished to cache->size + remaining objects,
> > > and then (the remaining part of) the requested objects were fetched
> via
> > > the cache,
> > > which left the cache filled (to cache->size) at completion.
> > > I.e. the cache refill level was cache->size (plus some, depending
> on
> > > request size).
> > >
> > > (1) was improved by generally comparing to cache->size instead of
> > > cache->flushthresh, when considering the capacity of the cache.
> > > The cache->flushthresh field is kept for API/ABI compatibility
> > > purposes,
> > > and initialized to cache->size instead of cache->size * 1.5.
> > >
> > > (2) was improved by generally comparing to cache->size / 2 instead
> of
> > > RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> > >
> > > (3) was improved by flushing and replenishing the cache by half its
> > > size,
> > > so a flush/refill can be followed randomly by get or put requests.
> > > This also reduced the number of objects in each flush/refill
> operation.
> > >
> > > As a consequence of these changes, the size of the array holding
> the
> > > objects in the cache (cache->objs[]) no longer needs to be
> > > 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> > > RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> 
> I'm not sure why waiting?

Because the rte_mempool_cache structure holding the array is part of the public API:
https://elixir.bootlin.com/dpdk/v26.03/source/lib/mempool/rte_mempool.h#L113

abidiff complained about it in v1, so I reverted the array size reduction in v2.

> 
> 
> > > Performance data:
> > > With a real WAN Optimization application, where the number of
> allocated
> > > packets varies (as they are held in e.g. shaper queues), the
> mempool
> > > cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> > > This was deployed in production at an ISP, and using an effective
> cache
> > > size of 384 objects.
> > >
> > > Bugzilla ID: 1027
> > > Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > Forgot carrying an Ack over from v5:
> > Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> >
> > > ---
> > > Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for
> mbuf
> > > fast-free")
> >
> > This dependency seems to cause CI apply failures.
> > The dependency is based on an older snapshot of main,
> > and this patch is based on a new snapshot of main.
> 
> The dependency should be resolved now.
> Please could you send a v7?
> 

I'm not 100 % sure, but I think the problem is CI only...

This patch is based on main, so it should apply as-is.

Bruce has already applied the other patch (that this one depends on) to the intel-next-net tree.
The other patch is based on an older snapshot of main, so when using "Depends-on", I guess the CI bases its series on the other patch; and then the CI fails to apply this patch (because it's based on a newer snapshot of main).



* Re: [PATCH v6] mempool: improve cache behaviour and performance
  2026-06-01 13:51       ` Morten Brørup
@ 2026-06-01 14:19         ` Thomas Monjalon
  2026-06-01 14:27           ` Morten Brørup
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Monjalon @ 2026-06-01 14:19 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Bruce Richardson, dev, Andrew Rybchenko, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena

01/06/2026 15:51, Morten Brørup:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Monday, 1 June 2026 15.36
> > 
> > 26/05/2026 18:00, Morten Brørup:
> > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > Sent: Tuesday, 26 May 2026 16.00
> > > >
> > > > This patch refactors the mempool cache to eliminate some unexpected
> > > > behaviour and reduce the mempool cache miss rate.
> > > >
> > > > 1.
> > > > The actual cache size was 1.5 times the cache size specified at
> > run-
> > > > time
> > > > mempool creation.
> > > > This was obviously not expected by application developers.
> > > >
> > > > 2.
> > > > In get operations, the check for when to use the cache as bounce
> > buffer
> > > > did not respect the run-time configured cache size,
> > > > but compared to the build time maximum possible cache size
> > > > (RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
> > > > E.g. with a configured cache size of 32 objects, getting 256
> > objects
> > > > would first fetch 32 + 256 = 288 objects into the cache,
> > > > and then move the 256 objects from the cache to the destination
> > memory,
> > > > instead of fetching the 256 objects directly to the destination
> > memory.
> > > > This had a performance cost.
> > > > However, this is unlikely to occur in real applications, so it is
> > not
> > > > important in itself.
> > > >
> > > > 3.
> > > > When putting objects into a mempool, and the mempool cache did not
> > have
> > > > free space for so many objects,
> > > > the cache was flushed completely, and the new objects were then put
> > > > into
> > > > the cache.
> > > > I.e. the cache drain level was zero.
> > > > This (complete cache flush) meant that a subsequent get operation
> > (with
> > > > the same number of objects) completely emptied the cache,
> > > > so another subsequent get operation required replenishing the
> > cache.
> > > >
> > > > Similarly,
> > > > When getting objects from a mempool, and the mempool cache did not
> > hold
> > > > so
> > > > many objects,
> > > > the cache was replenished to cache->size + remaining objects,
> > > > and then (the remaining part of) the requested objects were fetched
> > via
> > > > the cache,
> > > > which left the cache filled (to cache->size) at completion.
> > > > I.e. the cache refill level was cache->size (plus some, depending
> > on
> > > > request size).
> > > >
> > > > (1) was improved by generally comparing to cache->size instead of
> > > > cache->flushthresh, when considering the capacity of the cache.
> > > > The cache->flushthresh field is kept for API/ABI compatibility
> > > > purposes,
> > > > and initialized to cache->size instead of cache->size * 1.5.
> > > >
> > > > (2) was improved by generally comparing to cache->size / 2 instead
> > of
> > > > RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
> > > >
> > > > (3) was improved by flushing and replenishing the cache by half its
> > > > size,
> > > > so a flush/refill can be followed randomly by get or put requests.
> > > > This also reduced the number of objects in each flush/refill
> > operation.
> > > >
> > > > As a consequence of these changes, the size of the array holding
> > the
> > > > objects in the cache (cache->objs[]) no longer needs to be
> > > > 2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
> > > > RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
> > 
> > I'm not sure why waiting?
> 
> Because the rte_mempool_cache structure holding the array is part of the public API:
> https://elixir.bootlin.com/dpdk/v26.03/source/lib/mempool/rte_mempool.h#L113
> 
> abidiff complained about it in v1, so I reverted the array size reduction in v2.
> 
> > 
> > 
> > > > Performance data:
> > > > With a real WAN Optimization application, where the number of
> > allocated
> > > > packets varies (as they are held in e.g. shaper queues), the
> > mempool
> > > > cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
> > > > This was deployed in production at an ISP, and using an effective
> > cache
> > > > size of 384 objects.
> > > >
> > > > Bugzilla ID: 1027
> > > > Fixes: ea5dd2744b90 ("mempool: cache optimisations")
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > Forgot carrying an Ack over from v5:
> > > Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > >
> > > > ---
> > > > Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib for
> > mbuf
> > > > fast-free")
> > >
> > > This dependency seems to cause CI apply failures.
> > > The dependency is based on an older snapshot of main,
> > > and this patch is based on a new snapshot of main.
> > 
> > The dependency should be resolved now.
> > Please could you send a v7?
> > 
> 
> I'm not 100 % sure, but I think the problem is CI only...
> 
> This patch is based on main, so it should apply as-is.
> 
> Bruce has already applied the other patch (that this one depends on) to the intel-next-net tree.
> The other patch is based on an older snapshot of main, so when using "Depends-on", I guess the CI bases its series on the other patch; and then the CI fails to apply this patch (because it's based on a newer snapshot of main).

The patch has been in main since last week.

So you could send a new version with the ack and it will run in CI.




* RE: [PATCH v6] mempool: improve cache behaviour and performance
  2026-06-01 14:19         ` Thomas Monjalon
@ 2026-06-01 14:27           ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-01 14:27 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Bruce Richardson, dev

> > > > > Depends-on: patch-163181 ("net/intel: do not bypass mbuf lib
> for
> > > mbuf
> > > > > fast-free")
> > > >
> > > > This dependency seems to cause CI apply failures.
> > > > The dependency is based on an older snapshot of main,
> > > > and this patch is based on a new snapshot of main.
> > >
> > > The dependency should be resolved now.
> > > Please could you send a v7?
> > >
> >
> > I'm not 100 % sure, but I think the problem is CI only...
> >
> > This patch is based on main, so it should apply as-is.
> >
> > Bruce has already applied the other patch (that this one depends on)
> to the intel-next-net tree.
> > The other patch is based on an older snapshot of main, so when using
> "Depends-on", I guess the CI bases its series on the other patch; and
> then the CI fails to apply this patch (because it's based on a newer
> snapshot of main).
> 
> The patch is in main since last week.
> 
> So you could send a new version with the ack and it will run in CI.
> 

OK. Will do.



* [PATCH v7] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (6 preceding siblings ...)
  2026-05-27 11:36 ` [PATCH v6] net/idpf: update for new mempool cache algorithm Morten Brørup
@ 2026-06-01 16:40 ` Morten Brørup
  2026-06-03 15:44   ` Thomas Monjalon
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
  2026-06-04 11:48 ` [PATCH v8] mempool: improve cache behaviour and performance Morten Brørup
  9 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-06-01 16:40 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena, Thomas Monjalon
  Cc: Morten Brørup

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold so
many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
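
For illustration only, the get path described in (2) and the refill half
of (3) can be modelled roughly as below. This is a simplified sketch, not
the code from the patch: the toy_* names, the integer "objects" and the
counting backend are hypothetical stand-ins, and statistics and error
handling are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CACHE_MAX 512	/* stands in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Toy model of a per-lcore cache; objects are plain integer ids. */
struct toy_cache {
	uint32_t size;		/* configured cache size */
	uint32_t len;		/* number of objects currently held */
	uintptr_t objs[TOY_CACHE_MAX];
};

/* Hypothetical backend: hands out consecutive ids and counts accesses. */
static uintptr_t toy_next_id = 1;
static unsigned int toy_backend_calls;

static void
toy_backend_dequeue(uintptr_t *dst, uint32_t n)
{
	toy_backend_calls++;
	for (uint32_t i = 0; i < n; i++)
		dst[i] = toy_next_id++;
}

/*
 * Get n objects.
 * Returns the number of objects fetched directly from the backend,
 * i.e. without using the cache as a bounce buffer.
 */
static uint32_t
toy_get(struct toy_cache *c, uintptr_t *dst, uint32_t n)
{
	uint32_t from_cache = n <= c->len ? n : c->len;
	uint32_t remaining = n - from_cache;

	/* The cache is a stack, so serve from the top. */
	for (uint32_t i = 0; i < from_cache; i++)
		*dst++ = c->objs[--c->len];
	if (remaining == 0)
		return 0;

	/* The cache is empty here. */
	assert(c->len == 0);
	if (remaining > c->size / 2) {
		/* Exceeds the bounce buffer limit; fetch directly. */
		toy_backend_dequeue(dst, remaining);
		return remaining;
	}

	/* Refill with (size / 2) objects, then serve the rest from the top. */
	toy_backend_dequeue(c->objs, c->size / 2);
	c->len = c->size / 2;
	for (uint32_t i = 0; i < remaining; i++)
		*dst++ = c->objs[--c->len];
	return 0;
}
```

With a cache size of 32, a get of 256 objects from an empty cache goes
directly to the backend and leaves the cache empty, while a following get
of 8 objects triggers a refill of 16 objects and leaves 8 in the cache.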

(3) was improved by flushing and replenishing the cache by half its size,
so a flush/refill can be followed randomly by get or put requests.
This also reduced the number of objects in each flush/refill operation.
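
For illustration only, the flush half of (3) can be modelled roughly as
below. This is a simplified sketch, not the code from the patch: the
toy_* names and the counting backend are hypothetical, and memmove() is
used here as a conservative choice for the in-cache move, where the patch
itself uses rte_memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CACHE_MAX 512	/* stands in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Toy model of a per-lcore cache; objects are plain integer ids. */
struct toy_cache {
	uint32_t size;		/* configured cache size */
	uint32_t len;		/* number of objects currently held */
	uintptr_t objs[TOY_CACHE_MAX];
};

/* Hypothetical backend: only counts the objects returned to it. */
static uint32_t toy_backend_objs;

static void
toy_backend_enqueue(const uintptr_t *src, uint32_t n)
{
	(void)src;
	toy_backend_objs += n;
}

/* Put n objects, flushing half the cache when there is no room. */
static void
toy_put(struct toy_cache *c, const uintptr_t *src, uint32_t n)
{
	uint32_t half = c->size / 2;

	if (c->len + n <= c->size) {
		/* Sufficient room in the cache; nothing to flush. */
	} else if (n <= half) {
		/*
		 * Within the bounce buffer limit, but no room.
		 * Flush the colder bottom half to the backend,
		 * and move the hotter remaining objects down.
		 */
		assert(c->len > half);
		toy_backend_enqueue(&c->objs[0], half);
		memmove(&c->objs[0], &c->objs[half],
				sizeof(c->objs[0]) * (c->len - half));
		c->len -= half;
	} else {
		/* The request itself is too big for the cache. */
		toy_backend_enqueue(src, n);
		return;
	}

	/* Put the new objects on top of the stack. */
	memcpy(&c->objs[c->len], src, sizeof(c->objs[0]) * n);
	c->len += n;
}
```

With a cache size of 32 and a full cache, putting 8 objects flushes 16
objects and leaves 24 in the cache, so a following request of up to 8
objects in either direction can be served without touching the backend.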

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache
size of 384 objects.

Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
v7:
* Rebased. Dependency no longer required. (Thomas)
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
  Tests using the Intel idpf PMD in AVX512 mode may fail with this patch.
* Reverted a small code comment change. The original was better. (Bruce)
* Reverted rte_mempool_create() description requiring the cache_size to be
  an even number. There is no such requirement.
v5:
* Flush the cache from the bottom, where objects are colder, and move down
  the remaining objects, which are hotter.
* In the Intel idpf PMD, move up the hot objects in the cache and refill
  with cold objects at the bottom.
v4:
* Added Bugzilla ID.
* Added Fixes tag. For reference only.
* Moved fast-free related update of Intel common driver out as a separate
  patch, and depend on that patch.
* Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
  fixing an indentation and adding mbuf instrumentation.
* Omitted unrelated changes to the mempool library, specifically adding
  __rte_restrict and changing a couple of comments to proper sentences.
* Appeased checkpatches by swapping operators in a couple of comparisons.
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(). (Inspired by AI review)
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst   |  7 +++
 doc/guides/rel_notes/release_26_07.rst | 11 +++++
 lib/mempool/rte_mempool.c              | 14 +-----
 lib/mempool/rte_mempool.h              | 66 ++++++++++++++++----------
 4 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 17f90a6352..1b6cc181fb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -158,3 +158,10 @@ Deprecation Notices
 * net/iavf: The dynamic mbuf field used to detect LLDP packets on the
   transmit path in the iavf PMD will be removed in a future release.
   After removal, only packet type-based detection will be supported.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 8b4f8401e2..1b15c878f6 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -63,6 +63,17 @@ New Features
     ``rte_eal_init`` and the application is responsible for probing each device,
   * ``--auto-probing`` enables the initial bus probing, which is the current default behavior.
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate for most application types.
+  Applications where each lcore only puts or gets to a mempool, e.g. pipelined applications where ethdev Rx and Tx run on separate lcores, should adapt to the new algorithm by doubling their configured mempool cache size, to avoid doubling their mempool cache miss rate.
+
 * **Added LinkData sxe2 ethernet driver.**
 
   Added network driver for the LinkData network adapters.
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3042d94c14..805b52cc58 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 8c384d3453..26f47bf258 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1390,22 +1397,30 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects from the bottom of the cache, where
+		 * objects are less hot, and move down the remaining objects, which
+		 * are more hot, from the upper half of the cache.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
+		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
+				sizeof(void *) * (cache->len - cache->size / 2));
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
 	} else {
 		/* The request itself is too big for the cache. */
 		goto driver_enqueue_stats_incremented;
@@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
-- 
2.43.0



* [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (7 preceding siblings ...)
  2026-06-01 16:40 ` [PATCH v7] mempool: improve cache behaviour and performance Morten Brørup
@ 2026-06-01 18:36 ` Morten Brørup
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa: " Morten Brørup
                     ` (4 more replies)
  2026-06-04 11:48 ` [PATCH v8] mempool: improve cache behaviour and performance Morten Brørup
  9 siblings, 5 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-01 18:36 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

As a consequence of the improved mempool cache algorithm, the PMD was
updated regarding how much to backfill the mempool cache in the AVX512
code path.
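
For illustration only, the backfill step can be modelled roughly as
below. This is a simplified sketch, not the driver code: the toy_* names
are hypothetical, "burst" stands in for IDPF_RXQ_REARM_THRESH, and the
toy backend is assumed to be all-or-nothing and to leave the destination
untouched on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_CACHE_MAX 512	/* stands in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Toy model of a per-lcore cache; objects are plain integer ids. */
struct toy_cache {
	uint32_t size;		/* configured cache size */
	uint32_t len;		/* number of objects currently held */
	uintptr_t objs[TOY_CACHE_MAX];
};

/* Hypothetical backend with a limited stock of objects. */
static uint32_t toy_stock;
static uintptr_t toy_next_id = 1;

static int
toy_backend_dequeue(uintptr_t *dst, uint32_t n)
{
	if (n > toy_stock)
		return -1;	/* all-or-nothing; dst is not written */
	toy_stock -= n;
	for (uint32_t i = 0; i < n; i++)
		dst[i] = toy_next_id++;
	return 0;
}

/*
 * Backfill a cache holding fewer than burst objects.
 * Returns false if the caller must take the fallback path instead.
 */
static bool
toy_backfill(struct toy_cache *c, uint32_t burst)
{
	uint32_t half = c->size / 2;

	assert(c->len < burst);
	if (half < burst)
		return false;	/* exceeds the bounce buffer limit */

	/* Copy the hot objects up to the top half; len < half, no overlap. */
	memcpy(&c->objs[half], &c->objs[0], sizeof(c->objs[0]) * c->len);

	/* Fetch (size / 2) cold objects to the bottom of the cache. */
	if (toy_backend_dequeue(&c->objs[0], half) != 0)
		return false;	/* objects were copied; cache is intact */

	c->len += half;
	return true;
}
```

After a successful backfill the previously cached objects sit on top of
the freshly fetched ones, so they are the first to be handed out again.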

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v7:
* Rebased.
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
---
Depends-on: patch-164745 ("mempool: improve cache behaviour and performance")
---
 .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++++++----
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
index 8db4c64106..5788a009ab 100644
--- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
+++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
@@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_singleq_rearm_common(rxq);
+			return;
+		}
+
+		/*
+		 * Backfill the cache from the backend;
+		 * move up the hot objects in the cache to the top half of the cache,
+		 * and fetch (size / 2) objects to the bottom of the cache.
+		 */
+		__rte_assume(cache->len < cache->size / 2);
+		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
+				sizeof(void *) * cache->len);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rxq->mp, &cache->objs[cache->len], req);
+				(rxq->mp, &cache->objs[0], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
+			/*
+			 * No further action is required for roll back, as the objects moved
+			 * in the cache were actually copied, and the cache remains intact.
+			 */
 			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rxq->nb_rx_desc) {
 				__m128i dma_addr0;
@@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
 	/* Can this be satisfied from the cache? */
 	if (cache->len < IDPF_RXQ_REARM_THRESH) {
 		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
-							cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill would exceed the cache bounce buffer limit? */
+		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
+			idpf_splitq_rearm_common(rx_bufq);
+			return;
+		}
+
+		/*
+		 * Backfill the cache from the backend;
+		 * move up the hot objects in the cache to the top half of the cache,
+		 * and fetch (size / 2) objects to the bottom of the cache.
+		 */
+		__rte_assume(cache->len < cache->size / 2);
+		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
+				sizeof(void *) * cache->len);
 		int ret = rte_mempool_ops_dequeue_bulk
-				(rx_bufq->mp, &cache->objs[cache->len], req);
+				(rx_bufq->mp, &cache->objs[0], cache->size / 2);
 		if (ret == 0) {
-			cache->len += req;
+			cache->len += cache->size / 2;
 		} else {
+			/*
+			 * No further action is required for roll back, as the objects moved
+			 * in the cache were actually copied, and the cache remains intact.
+			 */
 			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
 			    rx_bufq->nb_rx_desc) {
 				__m128i dma_addr0;
-- 
2.43.0



* [PATCH v7] mempool/dpaa: update for new mempool cache algorithm
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
@ 2026-06-01 18:36   ` Morten Brørup
  2026-06-02  6:51     ` Morten Brørup
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa2: " Morten Brørup
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-06-01 18:36 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

As a consequence of the improved mempool cache algorithm, the mempool
driver was updated to not modify the mempool cache's flushthresh field,
which is now obsolete, and modifying it has no effect.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v7:
* Rebased.
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
---
Depends-on: patch-164745 ("mempool: improve cache behaviour and performance")
---
 drivers/mempool/dpaa/dpaa_mempool.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/mempool/dpaa/dpaa_mempool.c b/drivers/mempool/dpaa/dpaa_mempool.c
index 2f9395b3f4..2f8555a026 100644
--- a/drivers/mempool/dpaa/dpaa_mempool.c
+++ b/drivers/mempool/dpaa/dpaa_mempool.c
@@ -58,8 +58,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	struct bman_pool_params params = {
 		.flags = BMAN_POOL_FLAG_DYNAMIC_BPID
 	};
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 
 	MEMPOOL_INIT_FUNC_TRACE();
 
@@ -129,18 +127,6 @@ dpaa_mbuf_create_pool(struct rte_mempool *mp)
 	rte_memcpy(bp_info, (void *)&rte_dpaa_bpid_info[bpid],
 		   sizeof(struct dpaa_bp_info));
 	mp->pool_data = (void *)bp_info;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA_MBUF_MAX_ACQ_REL;
-	}
 
 	DPAA_MEMPOOL_INFO("BMAN pool created for bpid =%d", bpid);
 	return 0;
-- 
2.43.0



* [PATCH v7] mempool/dpaa2: update for new mempool cache algorithm
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa: " Morten Brørup
@ 2026-06-01 18:36   ` Morten Brørup
  2026-06-02  6:53     ` Morten Brørup
  2026-06-02  6:45   ` [PATCH v7] net/idpf: " Morten Brørup
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-06-01 18:36 UTC (permalink / raw)
  To: dev, Andrew Rybchenko, Bruce Richardson, Jingjing Wu,
	Praveen Shetty, Hemant Agrawal, Sachin Saxena
  Cc: Morten Brørup

As a consequence of the improved mempool cache algorithm, the mempool
driver was updated to not modify the mempool cache's flushthresh field,
which is now obsolete, and modifying it has no effect.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v7:
* Rebased.
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
---
Depends-on: patch-164745 ("mempool: improve cache behaviour and performance")
---
 drivers/mempool/dpaa2/dpaa2_hw_mempool.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
index 02b6741853..ee001d8ce0 100644
--- a/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
+++ b/drivers/mempool/dpaa2/dpaa2_hw_mempool.c
@@ -54,8 +54,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	struct dpaa2_bp_info *bp_info;
 	struct dpbp_attr dpbp_attr;
 	uint32_t bpid;
-	unsigned int lcore_id;
-	struct rte_mempool_cache *cache;
 	int ret;
 
 	avail_dpbp = dpaa2_alloc_dpbp_dev();
@@ -152,18 +150,6 @@ rte_hw_mbuf_create_pool(struct rte_mempool *mp)
 	DPAA2_MEMPOOL_DEBUG("BP List created for bpid =%d", dpbp_attr.bpid);
 
 	h_bp_list = bp_list;
-	/* Update per core mempool cache threshold to optimal value which is
-	 * number of buffers that can be released to HW buffer pool in
-	 * a single API call.
-	 */
-	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-		cache = &mp->local_cache[lcore_id];
-		DPAA2_MEMPOOL_DEBUG("lCore %d: cache->flushthresh %d -> %d",
-			lcore_id, cache->flushthresh,
-			(uint32_t)(cache->size + DPAA2_MBUF_MAX_ACQ_REL));
-		if (cache->flushthresh)
-			cache->flushthresh = cache->size + DPAA2_MBUF_MAX_ACQ_REL;
-	}
 
 	return 0;
 err4:
-- 
2.43.0



* RE: [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa: " Morten Brørup
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa2: " Morten Brørup
@ 2026-06-02  6:45   ` Morten Brørup
  2026-06-10 11:21   ` Morten Brørup
  2026-06-10 11:31   ` Morten Brørup
  4 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-02  6:45 UTC (permalink / raw)
  To: dev

Recheck-request: iol-unit-arm64-testing

CI subsystem log:

[176/407] Compiling C object lib/librte_cmdline.a.p/cmdline_cmdline.c.o
FAILED: lib/librte_cmdline.a.p/cmdline_cmdline.c.o
gcc-10 -Ilib/librte_cmdline.a.p -Ilib -I../lib -Ilib/cmdline -I../lib/cmdline -Ilib/eal/common -I../lib/eal/common -I. -I.. -Iconfig -I../config -Ilib/eal/include -I../lib/eal/include -Ilib/eal/linux/include -I../lib/eal/linux/include -Ilib/eal/arm/include -I../lib/eal/arm/include -I../kernel/linux -Ilib/eal -I../lib/eal -Ilib/kvargs -I../lib/kvargs -Ilib/log -I../lib/log -I../lib/metrics -Ilib/telemetry -I../lib/telemetry -Ilib/argparse -I../lib/argparse -Ilib/net -I../lib/net -Ilib/mbuf -I../lib/mbuf -Ilib/mempool -I../lib/mempool -Ilib/ring -I../lib/ring -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -Werror -std=c11 -O3 -include rte_config.h -Wvla -Wcast-qual -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpointer-arith -Wshadow -Wsign-compare -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned -Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8.2-a+crypto+sve -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -DRTE_LOG_DEFAULT_LOGTYPE=lib.cmdline -MD -MQ lib/librte_cmdline.a.p/cmdline_cmdline.c.o -MF lib/librte_cmdline.a.p/cmdline_cmdline.c.o.d -o lib/librte_cmdline.a.p/cmdline_cmdline.c.o -c ../lib/cmdline/cmdline.c
gcc-10: internal compiler error: Segmentation fault signal terminated program cc1
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.


* RE: [PATCH v7] mempool/dpaa: update for new mempool cache algorithm
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa: " Morten Brørup
@ 2026-06-02  6:51     ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-02  6:51 UTC (permalink / raw)
  To: dev

Recheck-request: iol-unit-arm64-testing

CI log:
==== Unit test summary for Red Hat Enterprise Linux 9.7 (dpdk_unit_test): ====
[...]
53/145 DPDK:fast-tests / eal_flags_file_prefix_autotest           FAIL             0.88s   (exit status 255 or signal 127 SIGinvalid)
>>> DPDK_TEST=eal_flags_file_prefix_autotest MALLOC_PERTURB_=192 /home-local/jenkins-local/jenkins-agent/workspace/Generic-Unit-Test-DPDK/dpdk/build/app/dpdk-test --file-prefix=eal_flags_file_prefix_autotest
[...]
Error (line 1407) - failed to run with --file-prefix=memtest1
Test Failed



* RE: [PATCH v7] mempool/dpaa2: update for new mempool cache algorithm
  2026-06-01 18:36   ` [PATCH v7] mempool/dpaa2: " Morten Brørup
@ 2026-06-02  6:53     ` Morten Brørup
  0 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-02  6:53 UTC (permalink / raw)
  To: dev

Recheck-request: iol-unit-arm64-testing

CI subsystem log:
==== 20 line log output for Ubuntu 24.04 (lpm_autotest): ====
[...]
[89/406] Compiling C object lib/librte_eal.a.p/eal_linux_eal_timer.c.o
FAILED: lib/librte_eal.a.p/eal_linux_eal_timer.c.o
gcc-10 -Ilib/librte_eal.a.p -Ilib -I../lib -Ilib/eal/common -I../lib/eal/common -I. -I.. -Iconfig -I../config -Ilib/eal/include -I../lib/eal/include -Ilib/eal/linux/include -I../lib/eal/linux/include -Ilib/eal/arm/include -I../lib/eal/arm/include -I../kernel/linux -Ilib/eal -I../lib/eal -Ilib/kvargs -I../lib/kvargs -Ilib/log -I../lib/log -I../lib/metrics -Ilib/telemetry -I../lib/telemetry -Ilib/argparse -I../lib/argparse -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -Werror -std=c11 -O3 -include rte_config.h -Wvla -Wcast-qual -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpointer-arith -Wshadow -Wsign-compare -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned -Wno-missing-field-initializers -D_GNU_SOURCE -fPIC -march=armv8.2-a+crypto+sve -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API '-DABI_VERSION="26.2"' -DRTE_LOG_DEFAULT_LOGTYPE=lib.eal -MD -MQ lib/librte_eal.a.p/eal_linux_eal_timer.c.o -MF lib/librte_eal.a.p/eal_linux_eal_timer.c.o.d -o lib/librte_eal.a.p/eal_linux_eal_timer.c.o -c ../lib/eal/linux/eal_timer.c
gcc-10: internal compiler error: Segmentation fault signal terminated program cc1
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.


* Re: [PATCH v7] mempool: improve cache behaviour and performance
  2026-06-01 16:40 ` [PATCH v7] mempool: improve cache behaviour and performance Morten Brørup
@ 2026-06-03 15:44   ` Thomas Monjalon
  0 siblings, 0 replies; 46+ messages in thread
From: Thomas Monjalon @ 2026-06-03 15:44 UTC (permalink / raw)
  To: Andrew Rybchenko, Bruce Richardson, Hemant Agrawal,
	Morten Brørup, Konstantin Ananyev, David Marchand,
	Robin Jarry, Jerin Jacob
  Cc: dev, Jingjing Wu, Praveen Shetty, Sachin Saxena

01/06/2026 18:40, Morten Brørup:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.

I feel we need more opinions about this behaviour change.
Bruce was suggesting some user-tuning. What do you think?
Also the decision may be easier if we have some performance benchmarks.
Who could help to show some numbers before/after in various cases please?




* [PATCH v8] mempool: improve cache behaviour and performance
  2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
                   ` (8 preceding siblings ...)
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
@ 2026-06-04 11:48 ` Morten Brørup
  2026-06-04 13:57   ` Morten Brørup
  2026-06-10 11:06   ` Thomas Monjalon
  9 siblings, 2 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-04 11:48 UTC (permalink / raw)
  To: dev, Thomas Monjalon; +Cc: Morten Brørup, Andrew Rybchenko

This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.

1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.

2.
In get operations, the check for when to use the cache as a bounce buffer
did not respect the run-time configured cache size,
but compared against the build-time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.

3.
When putting objects into a mempool, and the mempool cache did not have
free space for that many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This complete cache flush meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so the next get operation required replenishing the cache.

Similarly,
when getting objects from a mempool, and the mempool cache did not hold
that many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).

(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.

(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.

(3) was improved by flushing and replenishing the cache by half its size,
so that a flush/refill can be followed by get or put requests in any order.
This also reduced the number of objects in each flush/refill operation.

As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.

Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, using an effective cache size
of 384 objects.
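
For reference, the two approximate miss rates above convert to percentages
as follows. This is plain arithmetic on the quoted figures, not additional
measurement data.

```latex
\[
\frac{1}{20} = 5.0\,\%, \qquad
\frac{1}{48} \approx 2.08\,\%, \qquad
1 - \frac{1/48}{1/20} = 1 - \frac{20}{48} = \frac{7}{12} \approx 58\,\%
\]
```

That is, roughly 58 % fewer cache misses per object in this deployment.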

Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
v8:
* Rebased.
  CI cannot apply the patch when the release notes have changed at the tip
  of main. Hoping to win the race this time.
v7:
* Rebased. Dependency no longer required. (Thomas)
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
  Tests using the Intel idpf PMD in AVX512 mode may fail with this patch.
* Reverted a small code comment change. The original was better. (Bruce)
* Reverted rte_mempool_create() description requiring the cache_size to be
  an even number. There is no such requirement.
v5:
* Flush the cache from the bottom, where objects are colder, and move down
  the remaining objects, which are hotter.
* In the Intel idpf PMD, move up the hot objects in the cache and refill
  with cold objects at the bottom.
v4:
* Added Bugzilla ID.
* Added Fixes tag. For reference only.
* Moved fast-free related update of Intel common driver out as a separate
  patch, and depend on that patch.
* Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
  fixing an indentation and adding mbuf instrumentation.
* Omitted unrelated changes to the mempool library, specifically adding
  __rte_restrict and changing a couple of comments to proper sentences.
* Please checkpatches by swapping operators in a couple of comparisons.
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
  Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(). (Inspired by AI review)
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
  flush threshold.
* Added release notes.
* Added deprecation notes.
---
 doc/guides/rel_notes/deprecation.rst   |  7 +++
 doc/guides/rel_notes/release_26_07.rst | 11 +++++
 lib/mempool/rte_mempool.c              | 14 +-----
 lib/mempool/rte_mempool.h              | 66 ++++++++++++++++----------
 4 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 17f90a6352..1b6cc181fb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -158,3 +158,10 @@ Deprecation Notices
 * net/iavf: The dynamic mbuf field used to detect LLDP packets on the
   transmit path in the iavf PMD will be removed in a future release.
   After removal, only packet type-based detection will be supported.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+  is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+  factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+  DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index d2563ac503..b1010d238f 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -69,6 +69,17 @@ New Features
     ``rte_eal_init`` and the application is responsible for probing each device,
   * ``--auto-probing`` enables the initial bus probing, which is the current default behavior.
 
+* **Changed effective size of mempool cache.**
+
+  * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+  * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+  * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+  The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate for most application types.
+  Applications where each lcore only puts or gets to a mempool, e.g. pipelined applications where ethdev Rx and Tx run on separate lcores, should adapt to the new algorithm by doubling their configured mempool cache size, to avoid doubling their mempool cache miss rate.
+
 * **Added RISC-V vector paths.**
 
   * Increased the default SIMD bitwidth to allow using the vector extension.
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3ddf7b9c11..817e2b8dc1 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
 mempool_event_callback_invoke(enum rte_mempool_event event,
 			      struct rte_mempool *mp);
 
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
 #if defined(RTE_ARCH_X86)
 /*
  * return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
-	/* Check that cache have enough space for flush threshold */
-	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
-			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
 	cache->size = size;
-	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+	cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
 	cache->len = 0;
 }
 
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
 
 	/* asked cache too big */
 	if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
-	    CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+	    cache_size > n) {
 		rte_errno = EINVAL;
 		return NULL;
 	}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index d3f969847b..50d958c7c6 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
  */
 struct __rte_cache_aligned rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
-	uint32_t flushthresh; /**< Threshold before we flush excess elements */
+	uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
 	uint32_t len;	      /**< Current cache count */
 #ifdef RTE_LIBRTE_MEMPOOL_STATS
 	uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
 	/**
 	 * Cache objects
 	 *
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
+	 * Note:
+	 * Cache is allocated at double size for API/ABI compatibility purposes only.
+	 * When reducing its size at an API/ABI breaking release,
+	 * remember to add a cache guard after it.
 	 */
 	alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
 };
@@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
  *   If cache_size is non-zero, the rte_mempool library will try to
  *   limit the accesses to the common lockless pool, by maintaining a
  *   per-lcore object cache. This argument must be lower or equal to
- *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ *   RTE_MEMPOOL_CACHE_MAX_SIZE and n.
  *   The access to the per-lcore table is of course
  *   faster than the multi-producer/consumer pool. The cache can be
  *   disabled if the cache_size argument is set to 0; it can be useful to
  *   avoid losing objects in cache.
+ *   Note:
+ *   Mempool put/get requests of more than cache_size / 2 objects may be
+ *   partially or fully served directly by the multi-producer/consumer
+ *   pool, to avoid the overhead of copying the objects twice (instead of
+ *   once) when using the cache as a bounce buffer.
  * @param private_data_size
  *   The size of the private data appended after the mempool
  *   structure. This is useful for storing some private data after the
@@ -1419,22 +1426,30 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
 
-	__rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
-	__rte_assume(cache->len <= cache->flushthresh);
-	if (likely(cache->len + n <= cache->flushthresh)) {
+	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+	__rte_assume(cache->len <= cache->size);
+	if (likely(cache->len + n <= cache->size)) {
 		/* Sufficient room in the cache for the objects. */
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
-	} else if (n <= cache->flushthresh) {
+	} else if (n <= cache->size / 2) {
 		/*
-		 * The cache is big enough for the objects, but - as detected by
-		 * the comparison above - has insufficient room for them.
-		 * Flush the cache to make room for the objects.
+		 * The number of objects is within the cache bounce buffer limit,
+		 * but - as detected by the comparison above - the cache has
+		 * insufficient room for them.
+		 * Flush the cache to the backend to make room for the objects;
+		 * flush (size / 2) objects from the bottom of the cache, where
+		 * objects are less hot, and move down the remaining objects, which
+		 * are more hot, from the upper half of the cache.
 		 */
-		cache_objs = &cache->objs[0];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
-		cache->len = n;
+		__rte_assume(cache->len > cache->size / 2);
+		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
+		rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
+				sizeof(void *) * (cache->len - cache->size / 2));
+		cache_objs = &cache->objs[cache->len - cache->size / 2];
+		cache->len = cache->len - cache->size / 2 + n;
 	} else {
 		/* The request itself is too big for the cache. */
 		goto driver_enqueue_stats_incremented;
@@ -1553,7 +1568,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	/* The cache is a stack, so copy will be in reverse order. */
 	cache_objs = &cache->objs[cache->len];
 
-	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+	__rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
 	if (likely(n <= cache->len)) {
 		/* The entire request can be satisfied from the cache. */
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1577,13 +1592,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	/* Dequeue below would overflow mem allocated for cache? */
-	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* Dequeue below would exceed the cache bounce buffer limit? */
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	if (unlikely(remaining > cache->size / 2))
 		goto driver_dequeue;
 
-	/* Fill the cache from the backend; fetch size + remaining objects. */
-	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
-			cache->size + remaining);
+	/* Fill the cache from the backend; fetch (size / 2) objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
 	if (unlikely(ret < 0)) {
 		/*
 		 * We are buffer constrained, and not able to fetch all that.
@@ -1597,10 +1612,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
 
-	__rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
-	cache_objs = &cache->objs[cache->size + remaining];
-	cache->len = cache->size;
+	__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+	__rte_assume(remaining <= cache->size / 2);
+	cache_objs = &cache->objs[cache->size / 2];
+	cache->len = cache->size / 2 - remaining;
 	for (index = 0; index < remaining; index++)
 		*obj_table++ = *--cache_objs;
 
-- 
2.43.0



* RE: [PATCH v8] mempool: improve cache behaviour and performance
  2026-06-04 11:48 ` [PATCH v8] mempool: improve cache behaviour and performance Morten Brørup
@ 2026-06-04 13:57   ` Morten Brørup
  2026-06-10 11:06   ` Thomas Monjalon
  1 sibling, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-04 13:57 UTC (permalink / raw)
  To: dev

Recheck-request: iol-intel-Performance



* Re: [PATCH v8] mempool: improve cache behaviour and performance
  2026-06-04 11:48 ` [PATCH v8] mempool: improve cache behaviour and performance Morten Brørup
  2026-06-04 13:57   ` Morten Brørup
@ 2026-06-10 11:06   ` Thomas Monjalon
  1 sibling, 0 replies; 46+ messages in thread
From: Thomas Monjalon @ 2026-06-10 11:06 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, Andrew Rybchenko

04/06/2026 13:48, Morten Brørup:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.

Applied, thanks.





* RE: [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
                     ` (2 preceding siblings ...)
  2026-06-02  6:45   ` [PATCH v7] net/idpf: " Morten Brørup
@ 2026-06-10 11:21   ` Morten Brørup
  2026-06-10 11:31     ` Bruce Richardson
  2026-06-10 11:31   ` Morten Brørup
  4 siblings, 1 reply; 46+ messages in thread
From: Morten Brørup @ 2026-06-10 11:21 UTC (permalink / raw)
  To: Jingjing Wu, Praveen Shetty, Bruce Richardson; +Cc: dev

Intel idpf maintainers,

PING for review.

The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.

[1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba


Kind regards,
-Morten Brørup

> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Monday, 1 June 2026 20.36
> To: dev@dpdk.org; Andrew Rybchenko; Bruce Richardson; Jingjing Wu;
> Praveen Shetty; Hemant Agrawal; Sachin Saxena
> Cc: Morten Brørup
> Subject: [PATCH v7] net/idpf: update for new mempool cache algorithm
> 
> As a consequence of the improved mempool cache algorithm, the PMD was
> updated regarding how much to backfill the mempool cache in the AVX512
> code path.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> v7:
> * Rebased.
> v6:
> * Moved driver changes out as separate patches, for easier review.
> (Bruce)
> ---
> Depends-on: patch-164745 ("mempool: improve cache behaviour and
> performance")
> ---
>  .../net/intel/idpf/idpf_common_rxtx_avx512.c  | 52 +++++++++++++++----
>  1 file changed, 42 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> index 8db4c64106..5788a009ab 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> @@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
> 
> -		/* How many do we require i.e. number to fill the cache +
> the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> +		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> +			idpf_singleq_rearm_common(rxq);
> +			return;
> +		}
> +
> +		/*
> +		 * Backfill the cache from the backend;
> +		 * move up the hot objects in the cache to the top half of
> the cache,
> +		 * and fetch (size / 2) objects to the bottom of the cache.
> +		 */
> +		__rte_assume(cache->len < cache->size / 2);
> +		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> +				sizeof(void *) * cache->len);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rxq->mp, &cache->objs[cache->len], req);
> +				(rxq->mp, &cache->objs[0], cache->size / 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
> +			/*
> +			 * No further action is required for roll back, as
> the objects moved
> +			 * in the cache were actually copied, and the cache
> remains intact.
> +			 */
>  			if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rxq->nb_rx_desc) {
>  				__m128i dma_addr0;
> @@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
>  	/* Can this be satisfied from the cache? */
>  	if (cache->len < IDPF_RXQ_REARM_THRESH) {
>  		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> -							cache->len);
> 
> -		/* How many do we require i.e. number to fill the cache +
> the request */
> +		/* Backfill would exceed the cache bounce buffer limit? */
> +		__rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> +		if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> +			idpf_splitq_rearm_common(rx_bufq);
> +			return;
> +		}
> +
> +		/*
> +		 * Backfill the cache from the backend;
> +		 * move up the hot objects in the cache to the top half of
> the cache,
> +		 * and fetch (size / 2) objects to the bottom of the cache.
> +		 */
> +		__rte_assume(cache->len < cache->size / 2);
> +		rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> +				sizeof(void *) * cache->len);
>  		int ret = rte_mempool_ops_dequeue_bulk
> -				(rx_bufq->mp, &cache->objs[cache->len], req);
> +				(rx_bufq->mp, &cache->objs[0], cache->size /
> 2);
>  		if (ret == 0) {
> -			cache->len += req;
> +			cache->len += cache->size / 2;
>  		} else {
> +			/*
> +			 * No further action is required for roll back, as
> the objects moved
> +			 * in the cache were actually copied, and the cache
> remains intact.
> +			 */
>  			if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
>  			    rx_bufq->nb_rx_desc) {
>  				__m128i dma_addr0;
> --
> 2.43.0



* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-06-10 11:21   ` Morten Brørup
@ 2026-06-10 11:31     ` Bruce Richardson
  2026-06-10 12:17       ` Thomas Monjalon
  0 siblings, 1 reply; 46+ messages in thread
From: Bruce Richardson @ 2026-06-10 11:31 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Jingjing Wu, Praveen Shetty, dev

On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> Intel idpf maintainers,
> 
> PING for review.
> 
> The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> 
> [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> 
> 
> Kind regards,
> -Morten Brørup
> 
Yep. I was waiting to see what happened to the mempool patch before
considering this for next-net-intel.

/Bruce


* RE: [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
                     ` (3 preceding siblings ...)
  2026-06-10 11:21   ` Morten Brørup
@ 2026-06-10 11:31   ` Morten Brørup
  4 siblings, 0 replies; 46+ messages in thread
From: Morten Brørup @ 2026-06-10 11:31 UTC (permalink / raw)
  To: dev

Recheck-request: iol-unit-arm64-testing

Unrelated CI failure.

CI log:

==== 20 line log output for Ubuntu 20.04 (lpm_autotest): ====
[7/407] Compiling C object lib/librte_log.a.p/log_log_journal.c.o
[8/407] Linking static target lib/librte_log.a
[9/407] Generating rte_argparse_map with a custom command
[10/407] Compiling C object lib/librte_kvargs.a.p/kvargs_rte_kvargs.c.o
[11/407] Linking static target lib/librte_kvargs.a
[12/407] Compiling C object lib/librte_argparse.a.p/argparse_rte_argparse.c.o
[13/407] Generating kvargs.sym_chk with a custom command (wrapped by meson to capture output)
[14/407] Linking static target lib/librte_argparse.a
[15/407] Generating log.sym_chk with a custom command (wrapped by meson to capture output)
[16/407] Linking target lib/librte_log.so.26.2
[17/407] Compiling C object lib/librte_telemetry.a.p/telemetry_telemetry_data.c.o
[18/407] Compiling C object lib/librte_telemetry.a.p/telemetry_telemetry.c.o
[19/407] Generating rte_telemetry_map with a custom command
FAILED: lib/telemetry_exports.map
/usr/bin/python3 ../buildtools/gen-version-map.py --linker gnu --abi-version ../ABI_VERSION --output lib/telemetry_exports.map --source ../lib/telemetry/telemetry.c ../lib/telemetry/telemetry_data.c ../lib/telemetry/telemetry_legacy.c
Segmentation fault (core dumped)



* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-06-10 11:31     ` Bruce Richardson
@ 2026-06-10 12:17       ` Thomas Monjalon
  2026-06-10 12:34         ` Bruce Richardson
  0 siblings, 1 reply; 46+ messages in thread
From: Thomas Monjalon @ 2026-06-10 12:17 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson; +Cc: Jingjing Wu, Praveen Shetty, dev

10/06/2026 13:31, Bruce Richardson:
> On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> > Intel idpf maintainers,
> > 
> > PING for review.
> > 
> > The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> > 
> > [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> > 
> > 
> > Venlig hilsen / Kind regards,
> > -Morten Brørup
> > 
> Yep. I was waiting to see what happened to the mempool patch before
> considering this for next-net-intel.

I've merged it directly in main with dpaa mempool patches.




* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
  2026-06-10 12:17       ` Thomas Monjalon
@ 2026-06-10 12:34         ` Bruce Richardson
  0 siblings, 0 replies; 46+ messages in thread
From: Bruce Richardson @ 2026-06-10 12:34 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Morten Brørup, Jingjing Wu, Praveen Shetty, dev

On Wed, Jun 10, 2026 at 02:17:43PM +0200, Thomas Monjalon wrote:
> 10/06/2026 13:31, Bruce Richardson:
> > On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> > > Intel idpf maintainers,
> > > 
> > > PING for review.
> > > 
> > > The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> > > 
> > > [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> > > 
> > > 
> > > Venlig hilsen / Kind regards,
> > > -Morten Brørup
> > > 
> > Yep. I was waiting to see what happened to the mempool patch before
> > considering this for next-net-intel.
> 
> I've merged it directly in main with dpaa mempool patches.
> 
Ok, thanks.
/Bruce


Thread overview: 46+ messages
2026-04-08 14:13 [PATCH] mempool: improve cache behaviour and performance Morten Brørup
2026-04-08 15:41 ` Stephen Hemminger
2026-04-09 10:25 ` [PATCH v2] " Morten Brørup
2026-04-09 11:05 ` [PATCH v3] " Morten Brørup
2026-04-15 13:40   ` Morten Brørup
2026-04-18 11:15 ` [PATCH v4] " Morten Brørup
2026-04-19  9:55 ` [PATCH v5] " Morten Brørup
2026-04-22 12:27   ` Morten Brørup
2026-04-27 15:21   ` Morten Brørup
2026-04-28  7:44   ` Andrew Rybchenko
2026-05-22 16:11   ` Bruce Richardson
2026-05-26  8:41     ` Morten Brørup
2026-05-26  9:39       ` Bruce Richardson
2026-05-26 10:37         ` Morten Brørup
2026-05-26 17:45           ` Morten Brørup
2026-05-27  8:48             ` Bruce Richardson
2026-05-27  9:22               ` Morten Brørup
2026-05-22 16:12   ` Bruce Richardson
2026-05-26  8:57     ` Morten Brørup
2026-05-26 14:00 ` [PATCH v6] " Morten Brørup
2026-05-26 16:00   ` Morten Brørup
2026-06-01 13:36     ` Thomas Monjalon
2026-06-01 13:51       ` Morten Brørup
2026-06-01 14:19         ` Thomas Monjalon
2026-06-01 14:27           ` Morten Brørup
2026-05-29  8:53   ` fengchengwen
2026-05-29 11:43     ` Morten Brørup
2026-05-27 11:36 ` [PATCH v6] net/idpf: update for new mempool cache algorithm Morten Brørup
2026-05-27 11:36   ` [PATCH v6] mempool/dpaa: " Morten Brørup
2026-05-27 11:36   ` [PATCH v6] mempool/dpaa2: " Morten Brørup
2026-06-01 16:40 ` [PATCH v7] mempool: improve cache behaviour and performance Morten Brørup
2026-06-03 15:44   ` Thomas Monjalon
2026-06-01 18:36 ` [PATCH v7] net/idpf: update for new mempool cache algorithm Morten Brørup
2026-06-01 18:36   ` [PATCH v7] mempool/dpaa: " Morten Brørup
2026-06-02  6:51     ` Morten Brørup
2026-06-01 18:36   ` [PATCH v7] mempool/dpaa2: " Morten Brørup
2026-06-02  6:53     ` Morten Brørup
2026-06-02  6:45   ` [PATCH v7] net/idpf: " Morten Brørup
2026-06-10 11:21   ` Morten Brørup
2026-06-10 11:31     ` Bruce Richardson
2026-06-10 12:17       ` Thomas Monjalon
2026-06-10 12:34         ` Bruce Richardson
2026-06-10 11:31   ` Morten Brørup
2026-06-04 11:48 ` [PATCH v8] mempool: improve cache behaviour and performance Morten Brørup
2026-06-04 13:57   ` Morten Brørup
2026-06-10 11:06   ` Thomas Monjalon
