Subject: RE: [PATCH v5] mempool: improve cache behaviour and performance
Date: Tue, 26 May 2026 10:41:44 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35F6589A@smartserver.smartshare.dk>
References: <20260408141315.904381-1-mb@smartsharesystems.com> <20260419095526.39526-1-mb@smartsharesystems.com>
From: Morten Brørup
To: "Bruce Richardson"
Cc: , "Andrew Rybchenko" , "Jingjing Wu" , "Praveen Shetty" , "Hemant Agrawal" , "Sachin Saxena"
List-Id: DPDK patches and discussions

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 22 May 2026 18.12
>
> On Sun, Apr 19, 2026 at 09:55:26AM +0000, Morten Brørup wrote:
> > This patch refactors the mempool cache to eliminate some unexpected
> > behaviour and reduce the mempool cache miss rate.
> >
>
> Agree in principle with most of these changes. As we discussed at the DPDK
> summit conference, only issue I really have is with the threshold limits
> here - allocating and freeing only half the cache at a time seems overly
> conservative. Thinking about use-cases:
>
> 1 for apps where alloc + free (generally Rx+Tx) is on the same core(s),
> then we should run (almost) entirely out of cache.

I strongly disagree about any goal to run the cache low.
The primary goal is to minimize the cache miss (refill and replenish) rate.

> 2 for apps where we have alloc and free on different cores, then we have
> some caches always being filled and others always being emptied

Agree.

>
> For case #1, we only need worry about the thresholds for the odd case
> where we have a burst that causes us to overflow our cache (and we
> can't increase our cache size to cope and avoid that).
> Otherwise the thresholds don't matter.

It seems like you assume the application only does something like this:

Rx -> Rewrite -> Tx

In that case, the per-lcore cache only needs capacity for one burst, yes.
With my patch, the cache can be rightsized by requesting a cache size of 2 * burst size.
(Then the fill level will be either size/2 or empty, i.e. one burst or zero. This also happens to meet your suggested goal about low fill level, which I disagree with.)

However, I don't think that is a realistic use case.

Many apps do something like this:

     |-> Rewrite ->|
Rx ->|             |-> Tx
     |-> Hold      |
         Release ->|

They often hold back packets before they are transmitted.
For a simple router, when the destination IP address is not in the neighbor table, packets to that IP address are queued until ARP/ND has been resolved, and then they are dequeued and transmitted.
The same goes for apps performing shaping or pacing, where packets are held back in queues and dequeued at the appropriate time.

For such apps, the waves are much bigger (than the simple Rx->Rewrite->Tx use case).
With a random enqueue/dequeue pattern, replenishing/draining the cache to size/2 minimizes the probability of reaching one of its edges (empty or full), triggering a "cache miss" (refill/replenish).

> However, for case #2, the thresholds are constantly involved as we
> always are going to backing store. In this case, we really want to have the
> allocs *always* fill the cache completely, and the frees completely empty
> the cache.

Agree.

>
> Because of this, while we want to avoid cases where we fill the cache
> completely only to have a further free causing it to be flushed, because of
> case #2 we cannot be overly conservative in how much we free/empty.
> Ideally, we want to fill to full less a single burst, and empty leaving
> only a single burst in the cache. Unfortunately, we don't know what those
> burst limits are, so we have to try and guess the best behaviour from
> everything else.

I agree about not wanting to be overly conservative.
But in the use cases I have described for #1, I don't think a target fill level of size/2 is overly conservative.

I also acknowledge that this patch doubles the mempool cache miss rate for #2.
E.g. with a cache size of 512 and burst size of 64, the per-burst miss rate will be 64 / (1/2 * 512) = 1/4, compared to 64 / 512 = 1/8 with a full replenish/drain algorithm.

In theory, we could make it build time configurable to optimize mempools for #2.
But mempools are also used for objects other than mbufs, so that would have unwanted side effects for non-mbuf mempools.
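The miss rate arithmetic above can be checked with a minimal sketch. It assumes a purely one-directional load (case #2), where every burst moves the fill level the same way, so one backend transaction is needed each time the replenished/drained amount has been consumed. The helper name `bursts_per_miss` and its parameters are hypothetical and not part of the DPDK API.

```c
#include <assert.h>

/*
 * Hypothetical helper, not a DPDK API.
 * Returns the number of bursts served per backend transaction when each
 * transaction moves (size * num / den) objects and every burst moves
 * `burst` objects in the same direction.
 * The per-burst miss rate is 1 / (return value).
 */
static unsigned int
bursts_per_miss(unsigned int size, unsigned int num, unsigned int den,
        unsigned int burst)
{
    assert(den != 0 && burst != 0);
    return (size * num / den) / burst;
}
```

With a cache size of 512 and a burst size of 64, `bursts_per_miss(512, 1, 2, 64)` gives 4 (miss rate 1/4) for the half replenish/drain in this patch, and `bursts_per_miss(512, 1, 1, 64)` gives 8 (miss rate 1/8) for a full replenish/drain.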
If we went for an algorithm targeting replenish/drain at 25 % from the edges, the per-burst miss rate for #2 would be: 64 / (3/4 * 512) = 1/6.

How about addressing #2 in the release notes:
We describe that the cache refill/drain algorithm has been changed to only refill/drain to 50 % of the cache size, so pipelined applications performing Rx (mempool get) and Tx (mempool put) on separate cores should configure their mbuf pools with double the cache size they previously used to achieve similar performance.

>
> All that said, comments with specific suggestions inline.
>
> /Bruce
>
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 2e54fc4466..432c43ab15 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
> >   */
> >  struct __rte_cache_aligned rte_mempool_cache {
> >      uint32_t size;        /**< Size of the cache */
> > -    uint32_t flushthresh; /**< Threshold before we flush excess elements */
> > +    uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
> >      uint32_t len;         /**< Current cache count */
> >  #ifdef RTE_LIBRTE_MEMPOOL_STATS
> >      uint32_t unused;
> > @@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
> >      /**
> >       * Cache objects
> >       *
> > -     * Cache is allocated to this size to allow it to overflow in certain
> > -     * cases to avoid needless emptying of cache.
> > +     * Note:
> > +     * Cache is allocated at double size for API/ABI compatibility purposes only.
> > +     * When reducing its size at an API/ABI breaking release,
> > +     * remember to add a cache guard after it.
> >       */
> >      alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
> >  };
> > @@ -1046,12 +1048,17 @@ rte_mempool_free(struct rte_mempool *mp);
> >   * @param cache_size
> >   *   If cache_size is non-zero, the rte_mempool library will try to
> >   *   limit the accesses to the common lockless pool, by maintaining a
> > - *   per-lcore object cache. This argument must be lower or equal to
> > - *   RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
> > + *   per-lcore object cache. This argument must be an even number,
> > + *   lower or equal to RTE_MEMPOOL_CACHE_MAX_SIZE and n.
> >   *   The access to the per-lcore table is of course
> >   *   faster than the multi-producer/consumer pool. The cache can be
> >   *   disabled if the cache_size argument is set to 0; it can be useful to
> >   *   avoid losing objects in cache.
> > + *   Note:
> > + *   Mempool put/get requests of more than cache_size / 2 objects may be
> > + *   partially or fully served directly by the multi-producer/consumer
> > + *   pool, to avoid the overhead of copying the objects twice (instead of
> > + *   once) when using the cache as a bounce buffer.
> >   * @param private_data_size
> >   *   The size of the private data appended after the mempool
> >   *   structure. This is useful for storing some private data after the
> > @@ -1390,24 +1397,32 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
> >      RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> >      RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> >
> > -    __rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > -    __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > -    __rte_assume(cache->len <= cache->flushthresh);
> > -    if (likely(cache->len + n <= cache->flushthresh)) {
> > +    __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > +    __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +    __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > +    __rte_assume(cache->len <= cache->size);
> > +    if (likely(cache->len + n <= cache->size)) {
> >          /* Sufficient room in the cache for the objects. */
> >          cache_objs = &cache->objs[cache->len];
> >          cache->len += n;
> > -    } else if (n <= cache->flushthresh) {
> > +    } else if (n <= cache->size / 2) {
> >          /*
> > -         * The cache is big enough for the objects, but - as detected by
> > -         * the comparison above - has insufficient room for them.
> > -         * Flush the cache to make room for the objects.
> > +         * The number of objects is within the cache bounce buffer limit,
> > +         * but - as detected by the comparison above - the cache has
> > +         * insufficient room for them.
> > +         * Flush the cache to the backend to make room for the objects;
> > +         * flush (size / 2) objects from the bottom of the cache, where
> > +         * objects are less hot, and move down the remaining objects, which
> > +         * are more hot, from the upper half of the cache.
> >           */
> > -        cache_objs = &cache->objs[0];
> > -        rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > -        cache->len = n;
> > +        __rte_assume(cache->len > cache->size / 2);
> > +        rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
> > +        rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
> > +                sizeof(void *) * (cache->len - cache->size / 2));
> > +        cache_objs = &cache->objs[cache->len - cache->size / 2];
> > +        cache->len = cache->len - cache->size / 2 + n;
>
> The flushing of only half the cache I'm not so certain about. I agree that
> we want to not flush to empty, but I also think that we want to do more
> than a half-flush, especially since we do an enqueue to the cache
> immediately afterwards. Consider the case where we have a cache size of
> 128, and we do an enqueue of 32, with the cache currently full. In that
> case we only flush 64, reducing the cache to 64, but then immediately
> bringing it back up to 96.

I thought in depth about whether the flush/replenish sizes should consider the request size or not. (E.g. if I should replenish size/2 or size/2+request.)
I decided not to consider the request size, for two reasons:
a) It roughly doesn't matter, especially when considering a sequence of random get/put requests.
b) The size of the backend transactions becomes fixed, which has performance benefits: With my patch, they are always size/2, so if the cache size is a power of two, the backend transactions are also a power of two in size and CPU cache aligned.

> For cases where we have pipelines with all alloc
> on one core and all free on another, this half-flush would be
> inefficient.
>
> Instead, I would look to have a lower target threshold post-flush, and I
> would suggest 25% - taking into account the newly freed buffers.

It's not good for #1.
I agree that it is better for #2. But I don't think #2 is the likely use case.
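The fill levels in the 128/32 example above can be traced with a counter-only sketch of the put path in this patch. It is a simplified model, not the DPDK implementation: it tracks only the fill level, ignores the object pointers and the statistics, and assumes an even cache size. The names `model_cache` and `model_put` are hypothetical.

```c
#include <assert.h>

/* Hypothetical counter-only model of the per-lcore cache; not the DPDK struct. */
struct model_cache {
    unsigned int size; /* configured cache size, assumed even */
    unsigned int len;  /* current fill level */
};

/*
 * Model of the patched put path.
 * Returns the number of objects passed to the backend for a put of n objects.
 */
static unsigned int
model_put(struct model_cache *c, unsigned int n)
{
    assert(c->len <= c->size);

    if (c->len + n <= c->size) {
        /* Sufficient room in the cache for the objects. */
        c->len += n;
        return 0;
    }
    if (n <= c->size / 2) {
        /* Flush a fixed size/2 from the bottom, then add the request. */
        c->len = c->len - c->size / 2 + n;
        return c->size / 2;
    }
    /* The request exceeds the bounce buffer limit and bypasses the cache. */
    return n;
}
```

Starting from a full cache of size 128, `model_put(&c, 32)` returns 64 and leaves `c.len` at 96, matching the sequence described above; the backend transaction is size/2 regardless of the request size.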
After our discussion at the summit, I did start working on a patch targeting fill levels at 25% from the cache edges, but I don't think it's better; so I'd rather stick with a target fill level of 50%.

> For example:
>
> /* if n > our target of 1/4 full, flush everything,
>  * else flush so that we end up with 1/4 full after n added.
>  */
> flush_count = n > cache->size/4 ? cache->len :
>         (cache->len + n) - cache->size/4;
>
>
> >      } else {
> > -        /* The request itself is too big for the cache. */
> > +        /* The request itself is too big. */
> >          goto driver_enqueue_stats_incremented;
>
> I think original comment is better. The request itself is not too big for
> the whole mempool, just for the cache.

Ack.

>
> >      }
> >
> > @@ -1524,7 +1539,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >      /* The cache is a stack, so copy will be in reverse order. */
> >      cache_objs = &cache->objs[cache->len];
> >
> > -    __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
> > +    __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> >      if (likely(n <= cache->len)) {
> >          /* The entire request can be satisfied from the cache. */
> >          RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > @@ -1548,13 +1563,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >      for (index = 0; index < len; index++)
> >          *obj_table++ = *--cache_objs;
> >
> > -    /* Dequeue below would overflow mem allocated for cache? */
> > -    if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > +    /* Dequeue below would exceed the cache bounce buffer limit? */
> > +    __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +    if (unlikely(remaining > cache->size / 2))
> >          goto driver_dequeue;
> >
> > -    /* Fill the cache from the backend; fetch size + remaining objects. */
> > -    ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > -            cache->size + remaining);
> > +    /* Fill the cache from the backend; fetch (size / 2) objects. */
> > +    ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
>
> Again, the cache->size / 2 doesn't seem right here. We at most half-fill
> the cache and then take some objects from that, meaning that we have just done
> a re-fill of cache but end the function with it less than half full. Since
> we take from this value, I'd suggest just filling the cache completely.

The issues at the edges of the cache are symmetrical.
If we replenish the cache to full, and the next transaction is a put, the cache needs to be drained. That's why I replenish to size/2.

>
> >      if (unlikely(ret < 0)) {
> >          /*
> >           * We are buffer constrained, and not able to fetch all that.
> > @@ -1568,10 +1583,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >      RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> >      RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> >
> > -    __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > -    __rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > -    cache_objs = &cache->objs[cache->size + remaining];
> > -    cache->len = cache->size;
> > +    __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +    __rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
> > +    __rte_assume(remaining <= cache->size / 2);
> > +    cache_objs = &cache->objs[cache->size / 2];
> > +    cache->len = cache->size / 2 - remaining;
> >      for (index = 0; index < remaining; index++)
> >          *obj_table++ = *--cache_objs;
> >
> > --
> > 2.43.0
> >
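The get path in the last hunks can be modelled in the same counter-only way. This is a rough sketch, not the DPDK code: it tracks only the fill level, assumes the backend never runs dry, and assumes the cache is left empty when a request bypasses it. The names `model_cache`, `model_get` and `count_get_misses` are hypothetical.

```c
#include <assert.h>

/* Hypothetical counter-only model of the per-lcore cache; not the DPDK struct. */
struct model_cache {
    unsigned int size; /* configured cache size, assumed even */
    unsigned int len;  /* current fill level */
};

/*
 * Model of the patched get path.
 * Returns the number of objects fetched from the backend for a get of n.
 */
static unsigned int
model_get(struct model_cache *c, unsigned int n)
{
    unsigned int remaining;

    if (n <= c->len) {
        /* The entire request can be satisfied from the cache. */
        c->len -= n;
        return 0;
    }
    /* Hand out what the cache holds, then go to the backend. */
    remaining = n - c->len;
    c->len = 0;
    if (remaining > c->size / 2)
        return remaining; /* exceeds the bounce buffer limit */
    /* Fetch a fixed size/2 and serve the rest of the request from it. */
    c->len = c->size / 2 - remaining;
    return c->size / 2;
}

/* Count backend transactions for a get-only sequence of equal bursts. */
static unsigned int
count_get_misses(unsigned int size, unsigned int burst, unsigned int gets)
{
    struct model_cache c = { .size = size, .len = 0 };
    unsigned int misses = 0;

    for (unsigned int i = 0; i < gets; i++)
        if (model_get(&c, burst) != 0)
            misses++;
    return misses;
}
```

For a get-only core (case #2) with a cache size of 512 and bursts of 64, `count_get_misses(512, 64, 16)` gives 4, i.e. one backend transaction per four bursts.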