From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by smtp.lore.kernel.org (Postfix) with ESMTP id E3CE0CD4F54 for ; Wed, 27 May 2026 09:22:05 +0000 (UTC) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id D456840285; Wed, 27 May 2026 11:22:04 +0200 (CEST) Received: from dkmailrelay1.smartsharesystems.com (smartserver.smartsharesystems.com [77.243.40.215]) by mails.dpdk.org (Postfix) with ESMTP id B93DC4026C for ; Wed, 27 May 2026 11:22:03 +0200 (CEST) Received: from smartserver.smartsharesystems.com (smartserver.smartsharesys.local [192.168.4.10]) by dkmailrelay1.smartsharesystems.com (Postfix) with ESMTP id 932EA208E7; Wed, 27 May 2026 11:22:03 +0200 (CEST) Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Subject: RE: [PATCH v5] mempool: improve cache behaviour and performance Date: Wed, 27 May 2026 11:22:01 +0200 Message-ID: <98CBD80474FA8B44BF855DF32C47DC35F658A0@smartserver.smartshare.dk> In-Reply-To: X-MimeOLE: Produced By Microsoft Exchange V6.5 X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [PATCH v5] mempool: improve cache behaviour and performance Thread-Index: AdzttZUDUlh7r5ICRgm251MJ41wYYQAAxi3g References: <20260408141315.904381-1-mb@smartsharesystems.com> <20260419095526.39526-1-mb@smartsharesystems.com> <98CBD80474FA8B44BF855DF32C47DC35F6589A@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35F6589D@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35F6589F@smartserver.smartshare.dk> From: =?iso-8859-1?Q?Morten_Br=F8rup?= To: "Bruce Richardson" Cc: , "Andrew Rybchenko" , "Jingjing Wu" , "Praveen Shetty" , "Hemant Agrawal" , "Sachin Saxena" X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > From: Bruce Richardson [mailto:bruce.richardson@intel.com] > Sent: Wednesday, 27 May 2026 10.48 >=20 > On Tue, May 26, 2026 at 07:45:24PM +0200, Morten Br=F8rup wrote: > > > From: Morten Br=F8rup [mailto:mb@smartsharesystems.com] > > > Sent: Tuesday, 26 May 2026 12.37 > > > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com] > > > > Sent: Tuesday, 26 May 2026 11.40 > > > > > > > > [...] > > > > > > [In all this, I am making the assumption that burst size is well > less > > > > than > > > > cache size. Also, similar logic would be applicable for the > inverse > > > > scenario, e.g. flush to empty (and fill burst) and fill to 75%] > > > > > > I'm not so sure about this assumption. > > > With a cache size of 512 and a bursts of 64, the cache only holds = 8 > > > bursts. > > > 50% is 4 bursts, and 25% is only 2 bursts. > > > > > > Using a replenish/drain level in the middle requires 5 bursts in > either > > > direction to pass the edge (and trigger replenish/flush). > > > Using a replenish/drain level 25% from the edge requires only 3 > bursts > > > in the wrong direction to pass the edge (and trigger > replenish/flush). > > > Much higher probability with random get/put. > > > > > > > > > > > Now, all said, I tend to agree that we want to leave space for a > > > decent > > > > size burst after a fill. That is why I think that filling to 75% > is > > > > reasonable. After an alloc that triggers a fill, I don't want = the > > > cache > > > > less than 50% full, but not completely full so there is room for > a > > > free > > > > without a flush, and similarly for a free that triggers a flush, > the > > > > cache > > > > should not be empty, but also should not be more than half full. > > > > > > > > One suggestion - we could always add a simple tunable that > specifies > > > > the > > > > margin, or reserved entries for alloc and free. We can then = guide > in > > > > the > > > > docs that the value should be e.g. "zero for apps where alloc = and > > > free > > > > take > > > > place on different cores. 20%-50% of cache is recommended where > alloc > > > > and > > > > free take place on the same core" > > > > > > Yes, a simple tunable is a really good idea. > > > > > > At this point, I think we should optimize for use case #1, and go > for > > > the 50% fill level. > > > Then we can add a tunable to optimize for use case #2 later. I = will > try > > > to come up with a draft for such a follow-up patch within the next > few > > > days. > > > > Adding a tunable is not so simple... > > The choice of mempool cache algorithm (drain/replenish to 50% vs. > drain/replenish completely) should be passed via the "flags" parameter > in rte_mempool_create(), but rte_pktmbuf_pool_create() is missing the > "flags" parameter. > > We can add it at the next ABI breaking release. > > WDYT? > > > I don't want this just a binary flag with two settings, I think it > should be an actual numeric value. If it was plain and simple to support a numeric value, I'd do it. But the two algorithms differ too much. If needed, the flag can be used as an enum to support more algorithms in = the future. My WIP (not even build tested), where you can see the different = algorithms steered by the cache->flags field, looks like this: static __rte_always_inline void rte_mempool_do_generic_put(struct rte_mempool *mp, void * const = *obj_table, unsigned int n, struct rte_mempool_cache *cache) { void **cache_objs; /* No cache provided? */ if (unlikely(cache =3D=3D NULL)) goto driver_enqueue; /* Increment stats now, adding in mempool always succeeds. */ RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1); RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n); __rte_assume(cache->size <=3D RTE_MEMPOOL_CACHE_MAX_SIZE); __rte_assume(cache->len <=3D RTE_MEMPOOL_CACHE_MAX_SIZE); __rte_assume(cache->len <=3D cache->size); if (likely(cache->len + n <=3D cache->size)) { /* Sufficient room in the cache for the objects. */ cache_objs =3D &cache->objs[cache->len]; cache->len +=3D n; /* Add the objects to the cache. */ rte_memcpy(cache_objs, obj_table, sizeof(void *) * n); return; } /* Insufficient room in the cache for the objects. */ if (cache->flags & RTE_MEMPOOL_F_CACHE_ACCESS_ONE_WAY) { /* The algorithm is optimized for put or get operations only. */ /* The request itself exceeds the cache bounce buffer limit? */ __rte_assume(cache->size <=3D RTE_MEMPOOL_CACHE_MAX_SIZE); if (n > cache->size) goto driver_enqueue_stats_incremented; /* Fill the cache completely by adding the first part of the = objects. */ cache_objs =3D &cache->objs[cache->len]; rte_memcpy(cache_objs, obj_table, sizeof(void *) * (cache->size = - cache->len)); obj_table +=3D cache->size - cache->len; /* Flush the entire cache to the backend. */ rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size); /* Add the remaining objects to the cache. */ cache->len =3D n - (cache->size - cache->len); rte_memcpy(&cache->objs[0], obj_table, sizeof(void *) * = cache->len); } else { /* The algorithm is optimized for a balanced mix of put and get = operations. */ /* The request itself exceeds the cache bounce buffer limit? */ __rte_assume(cache->size / 2 <=3D RTE_MEMPOOL_CACHE_MAX_SIZE / = 2); if (n > cache->size / 2) goto driver_enqueue_stats_incremented; /* * Flush part of the cache to the backend to make room for the = objects; * flush (size / 2) objects from the bottom of the cache, where * objects are less hot, and move down the remaining objects, = which * are more hot, from the upper half of the cache. */ __rte_assume(cache->len > cache->size / 2); rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / = 2); rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2], sizeof(void *) * (cache->len - cache->size / 2)); cache_objs =3D &cache->objs[cache->len - cache->size / 2]; cache->len =3D cache->len - cache->size / 2 + n; /* Add the objects to the cache. */ rte_memcpy(cache_objs, obj_table, sizeof(void *) * n); } return; driver_enqueue: /* increment stat now, adding in mempool always success */ RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1); RTE_MEMPOOL_STAT_ADD(mp, put_objs, n); driver_enqueue_stats_incremented: /* push objects to the backend */ rte_mempool_ops_enqueue_bulk(mp, obj_table, n); } > Can we not use function versioning to add the > new parameter to all functions needing it, without worrying about ABI > breakage. The public structures will also change, so I don't think so. >=20 > /Bruce