* [PATCH v3] net/iavf: fix to consolidate link change event handling
From: Anurag Mandal @ 2026-06-10 14:12 UTC (permalink / raw)
To: dev
Cc: bruce.richardson, vladimir.medvedkin, ciara.loftus, Anurag Mandal,
stable
In-Reply-To: <20260609001022.357509-1-anurag.mandal@intel.com>
Handle link-change events through a common static function that
reads link status and speed from the advanced or legacy event fields,
as appropriate, and updates no-poll/watchdog/LSC state consistently.
Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without interrupt")
Fixes: 48de41ca11f0 ("net/avf: enable link status update")
Cc: stable@dpdk.org
Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
---
V3: Addressed Ciara Loftus's review comments
- removed two unnecessary NULL checks
V2: Addressed Ciara Loftus's review comments
- removed unnecessary NULL checks which were overly defensive checks
drivers/net/intel/iavf/iavf_vchnl.c | 115 +++++++++++++++-------------
1 file changed, 63 insertions(+), 52 deletions(-)
diff --git a/drivers/net/intel/iavf/iavf_vchnl.c b/drivers/net/intel/iavf/iavf_vchnl.c
index 0643a835d5..36b7ee9526 100644
--- a/drivers/net/intel/iavf/iavf_vchnl.c
+++ b/drivers/net/intel/iavf/iavf_vchnl.c
@@ -216,6 +216,67 @@ iavf_convert_link_speed(enum virtchnl_link_speed virt_link_speed)
return speed;
}
+/*
+ * iavf_handle_link_change_event: common handler for VIRTCHNL link change events
+ *
+ * @dev: pointer to rte_eth_dev for this VF
+ * @vpe: pointer to the virtchnl_pf_event payload received from the PF
+ *
+ * Handle PF link-change event: decode adv/legacy link info, update VF
+ * link state, sync no-poll/watchdog behavior & notify app via LSC event.
+ */
+static void
+iavf_handle_link_change_event(struct rte_eth_dev *dev,
+ struct virtchnl_pf_event *vpe)
+{
+ struct iavf_adapter *adapter =
+ IAVF_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+ struct iavf_info *vf = &adapter->vf;
+ bool adv_link_speed;
+
+ adv_link_speed = (vf->vf_res != NULL) &&
+ ((vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_ADV_LINK_SPEED) != 0);
+
+ if (adv_link_speed) {
+ vf->link_up = vpe->event_data.link_event_adv.link_status;
+ vf->link_speed = vpe->event_data.link_event_adv.link_speed;
+ } else {
+ enum virtchnl_link_speed speed;
+
+ vf->link_up = vpe->event_data.link_event.link_status;
+ speed = vpe->event_data.link_event.link_speed;
+ vf->link_speed = iavf_convert_link_speed(speed);
+ }
+
+ iavf_dev_link_update(dev, 0);
+
+ /*
+ * Update watchdog/no_poll state BEFORE notifying the application via
+ * the LSC event. Otherwise the application's link-up callback could
+ * race with stale (link-down) no_poll/watchdog state and either
+ * continue to drop traffic or trigger a spurious reset detection.
+ *
+ * Keep the watchdog enabled whenever the link cannot be trusted
+ * (link is down or a VF reset is in progress); the watchdog drives
+ * auto-reset recovery, so it must remain armed in those cases.
+ */
+ if (vf->link_up && !vf->vf_reset)
+ iavf_dev_watchdog_disable(adapter);
+ else
+ iavf_dev_watchdog_enable(adapter);
+
+ if (adapter->devargs.no_poll_on_link_down) {
+ iavf_set_no_poll(adapter, true);
+ PMD_DRV_LOG(DEBUG, "VF no poll turned %s",
+ adapter->no_poll ? "on" : "off");
+ }
+
+ iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
+
+ PMD_DRV_LOG(INFO, "Link status update:%s",
+ vf->link_up ? "up" : "down");
+}
+
/* Read data in admin queue to get msg from pf driver */
static enum iavf_aq_result
iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
@@ -253,34 +314,7 @@ iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
result = IAVF_MSG_SYS;
switch (vpe->event) {
case VIRTCHNL_EVENT_LINK_CHANGE:
- vf->link_up =
- vpe->event_data.link_event.link_status;
- if (vf->vf_res != NULL &&
- vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
- vf->link_speed =
- vpe->event_data.link_event_adv.link_speed;
- } else {
- enum virtchnl_link_speed speed;
- speed = vpe->event_data.link_event.link_speed;
- vf->link_speed = iavf_convert_link_speed(speed);
- }
- iavf_dev_link_update(vf->eth_dev, 0);
- iavf_dev_event_post(vf->eth_dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
- if (vf->link_up && !vf->vf_reset) {
- iavf_dev_watchdog_disable(adapter);
- } else {
- if (!vf->link_up)
- iavf_dev_watchdog_enable(adapter);
- }
- if (adapter->devargs.no_poll_on_link_down) {
- iavf_set_no_poll(adapter, true);
- if (adapter->no_poll)
- PMD_DRV_LOG(DEBUG, "VF no poll turned on");
- else
- PMD_DRV_LOG(DEBUG, "VF no poll turned off");
- }
- PMD_DRV_LOG(INFO, "Link status update:%s",
- vf->link_up ? "up" : "down");
+ iavf_handle_link_change_event(vf->eth_dev, vpe);
break;
case VIRTCHNL_EVENT_RESET_IMPENDING:
vf->vf_reset = true;
@@ -525,30 +559,7 @@ iavf_handle_pf_event_msg(struct rte_eth_dev *dev, uint8_t *msg,
break;
case VIRTCHNL_EVENT_LINK_CHANGE:
PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_LINK_CHANGE event");
- vf->link_up = pf_msg->event_data.link_event.link_status;
- if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
- vf->link_speed =
- pf_msg->event_data.link_event_adv.link_speed;
- } else {
- enum virtchnl_link_speed speed;
- speed = pf_msg->event_data.link_event.link_speed;
- vf->link_speed = iavf_convert_link_speed(speed);
- }
- iavf_dev_link_update(dev, 0);
- if (vf->link_up && !vf->vf_reset) {
- iavf_dev_watchdog_disable(adapter);
- } else {
- if (!vf->link_up)
- iavf_dev_watchdog_enable(adapter);
- }
- if (adapter->devargs.no_poll_on_link_down) {
- iavf_set_no_poll(adapter, true);
- if (adapter->no_poll)
- PMD_DRV_LOG(DEBUG, "VF no poll turned on");
- else
- PMD_DRV_LOG(DEBUG, "VF no poll turned off");
- }
- iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
+ iavf_handle_link_change_event(dev, pf_msg);
break;
case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_PF_DRIVER_CLOSE event");
--
2.34.1
* Re: [PATCH v1 00/20] net/sxe2: added Linkdata sxe2 ethernet driver
From: Thomas Monjalon @ 2026-06-10 14:02 UTC (permalink / raw)
To: Jie Liu; +Cc: stephen, dev
In-Reply-To: <20260610013936.3634968-1-liujie5@linkdatatechnology.com>
It should be v15.
10/06/2026 03:39, liujie5@linkdatatechnology.com:
> From: Jie Liu <liujie5@linkdatatechnology.com>
>
> This patch set implements core functionality for the SXE2 PMD,
> including basic driver framework, data path setup, and advanced
> offload features (VLAN, RSS,TM, PTP etc.).
>
> Jie Liu (20):
> net/sxe2: support AVX512 vectorized path for Rx and Tx
> net/sxe2: add AVX2 vector data path for Rx and Tx
> drivers: add supported packet types get callback
> net/sxe2: support L2 filtering and MAC config
> drivers: support RSS feature
> net/sxe2: support TM hierarchy and shaping
> net/sxe2: support IPsec inline protocol offload
> net/sxe2: support statistics and multi-process
> drivers: interrupt handling
> net/sxe2: add NEON vec Rx/Tx burst functions
> drivers: add support for VF representors
> net/sxe2: add support for custom UDP tunnel ports
> net/sxe2: support firmware version reading
> net/sxe2: implement get monitor address
> common/sxe2: add shared SFP module definitions
> net/sxe2: support SFP module info and EEPROM access
> net/sxe2: implement private dump info
> net/sxe2: add mbuf validation in Tx debug mode
> drivers: add testpmd commands for private features
There is a good chance that this commit is too controversial.
Please remove it from this series so we can discuss it separately.
Also it should be split: 1 commit per feature.
> net/sxe2: update sxe2 feature matrix docs
Please update the documentation atomically in previous commits.
When you support a new feature, the commit should include the related doc changes.
Thank you
* Re: [PATCH v6 0/7] Cuckoo hash optimization for small key sizes
From: Thomas Monjalon @ 2026-06-10 13:46 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev, vladimir.medvedkin
In-Reply-To: <20260329232409.205940-1-stephen@networkplumber.org>
> Stephen Hemminger (7):
> hash: move table of hash compare functions out of header
> hash: use static_assert
> hash: remove spurious warnings in CRC32 init
> hash: simplify key comparison across architectures
> hash: add support for common small key sizes
> app/test: convert hash test to use test suite runner
> test/hash: add test for key compare functions
It's a pity we didn't have more reviews.
Applied, thanks.
* RE: [RFC v4 0/3] lib/fastmem: fast small-object allocator
From: Konstantin Ananyev @ 2026-06-10 12:35 UTC (permalink / raw)
To: Mattias Rönnblom, dev@dpdk.org
Cc: Morten Brørup, Mattias Rönnblom, Yogaraj Baskaravel,
Stephen Hemminger, Bruce Richardson
In-Reply-To: <20260530092634.46218-1-hofors@lysator.liu.se>
Hi Mattias,
> This RFC introduces fastmem, a general-purpose small-object allocator
> for DPDK. It is intended to replace per-type mempools with a single
> allocator that handles arbitrary sizes, grows on demand, and matches
> mempool-level performance on the hot path.
As stated before, I submitted an RFC for the one we use internally:
https://patchwork.dpdk.org/project/dpdk/patch/20260610103918.96857-1-konstantin.ananyev@huawei.com/
Many things and ideas are similar, some are not.
Below I tried to summarize the main differences (as I see them).
I do understand that our use-cases and requirements are different,
but maybe we can find a blend that fits all of us.
Two other things are probably necessary to move forward:
- a unified set of stress/performance test cases that we can run
against all three: mempool/fastmem/memtank.
- some sort of guinea pig: a DPDK sub-component where this new lib
can be applied. We could try straight away with mbuf, but that's probably
quite an ambitious choice for the first integration. Again, this is just one
of the possible usage scenarios.
Let me know your thoughts here.
Thanks
Konstantin
> Motivation
> ----------
>
> DPDK applications commonly maintain many mempools — one per object
> type (connections, sessions, timers, work items). Each must be sized
> up front, wastes memory when over-provisioned, and cannot serve
> objects of a different size. Fastmem eliminates this by accepting
> arbitrary sizes at runtime, backed by a slab allocator that
> repurposes memory across size classes as demand shifts.
I agree about the first one: it is a big problem that you have to over-provision
everything with mempool these days.
As for forcing the user to explicitly create multiple pools, for me that is not such a big problem;
after all, in most cases the user knows upfront the size of the objects he needs to alloc/free.
AFAIK, the majority of SLAB-based allocators these days support both flavors:
the user can create/maintain his own SLAB for specific object types, or use generic
alloc/free, which is backed by a bunch of SLABs underneath, each serving a specific size.
Maybe we can do the same and support both too.
>
> Design
> ------
>
> Three-layer architecture:
>
> 1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
> reserved lazily (or pre-reserved for deterministic latency).
> 2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
In many cases, the user doesn't need DMA-capable memory for such objects,
so simple rte_malloc or even libc malloc might be enough.
I understand the intention - it is probably the fastest way to do things,
but I think it is way too constrained.
Maybe the best approach is to do what memtank does:
allow the user to define his own alloc/free/init callbacks; then the fastmem
approach becomes just one of the possible cases.
> The alignment enables O(1) slab lookup from any object pointer
> via bitmask — no radix tree or index structure. Slabs move
> freely between 18 power-of-2 size classes (8 B to 1 MiB).
That is a cool idea, and I thought about doing the same: limit the possible sizes
for a SLAB to power-of-two values.
But then I realized that we still need to store some extra metadata inside
the object for stats and sanity-checking. So an extra 8B for a SLAB
pointer doesn't make much difference.
But again - I think we can support both and make it configurable at creation time.
> 3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
> path). Cache misses trigger bulk transfers to/from the shared
> bin under a spinlock.
memtank lacks per-lcore caches right now, mostly due to lack of time
to implement them. It is definitely a good feature - way to go.
> Key properties:
>
> - Zero per-object metadata in the production build.
> - NUMA-aware, with per-socket bins and free-slab pools.
> - DMA-usable memory with O(1) virt-to-IOVA translation.
> - Bulk alloc/free with all-or-nothing semantics.
Personally, I don't find it very convenient.
In most of the cases we care about, we use best-effort semantics.
Again, we can probably support both, same as the rte_ring API.
> - Backing memory never returned during lifetime (slabs recycled).
For our case it is important to be able to return memory back to the system.
The memtank lib supports it (though of course some fragmentation is possible).
Again, it is much easier with separate pools.
What I really like about fastmem is that one SLAB can re-use memory from a different
one; that seems useful and might mitigate memory footprint growth to some extent.
Again, I suppose both flavors can coexist:
individual pools can grow/shrink, while fastmem (a bunch of predefined pools) will not.
> - Non-EAL threads supported (bypass cache, take bin lock).
> - Secondary process support (lazy attach, no per-lcore caches).
>
* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Bruce Richardson @ 2026-06-10 12:34 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: Morten Brørup, Jingjing Wu, Praveen Shetty, dev
In-Reply-To: <b0-eVG8rSLyIiCcsmRSskA@monjalon.net>
On Wed, Jun 10, 2026 at 02:17:43PM +0200, Thomas Monjalon wrote:
> 10/06/2026 13:31, Bruce Richardson:
> > On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> > > Intel idpf maintainers,
> > >
> > > PING for review.
> > >
> > > The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> > >
> > > [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> > >
> > >
> > > Venlig hilsen / Kind regards,
> > > -Morten Brørup
> > >
> > Yep. I was waiting to see what happened to the mempool patch before
> > considering this for next-net-intel.
>
> I've merged it directly in main with dpaa mempool patches.
>
Ok, thanks.
/Bruce
* Re: [PATCH v3] cmdline: prevent out-of-bounds read in completion buffer
From: Thomas Monjalon @ 2026-06-10 12:28 UTC (permalink / raw)
To: Daniil Iskhakov; +Cc: stable, dev, rrv, sdl.dpdk
In-Reply-To: <20260430170111.1557768-1-dish@amicon.ru>
30/04/2026 19:01, Daniil Iskhakov:
> tmp_buf is populated by the completion callback and is not guaranteed
> to be NUL-terminated.
>
> The code already accounts for this when computing tmp_size with
> strnlen(tmp_buf, sizeof(tmp_buf)). However, another loop in the same
> path still walks tmp_buf until a NUL byte is found, without checking
> the buffer limit.
>
> If the callback writes a full-sized non-NUL-terminated string, the loop
> may read past the end of tmp_buf.
>
> Fix this by computing a bounded length for each completion choice before
> printing it.
>
> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>
> Fixes: af75078fece3 ("first public release")
> Cc: stable@dpdk.org
>
> Signed-off-by: Daniil Iskhakov <dish@amicon.ru>
Applied, thanks.
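For readers following the fix quoted above, a bounded length computation can
be sketched as below. This is an illustration only, not the applied patch: the
helper name bounded_len() is hypothetical, and it uses ISO C memchr() instead
of the strnlen() call mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the string stored in buf without ever reading
 * past buf[cap - 1]. If no NUL byte is found, the length is cap.
 */
static size_t
bounded_len(const char *buf, size_t cap)
{
	const char *end;

	assert(buf != NULL);
	end = memchr(buf, '\0', cap);
	return end != NULL ? (size_t)(end - buf) : cap;
}
```

A caller would then print at most bounded_len(tmp_buf, sizeof(tmp_buf)) bytes
rather than walking the buffer until a NUL byte turns up.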
* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Thomas Monjalon @ 2026-06-10 12:17 UTC (permalink / raw)
To: Morten Brørup, Bruce Richardson; +Cc: Jingjing Wu, Praveen Shetty, dev
In-Reply-To: <ailLDte3Xo8jsUWf@bricha3-mobl1.ger.corp.intel.com>
10/06/2026 13:31, Bruce Richardson:
> On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> > Intel idpf maintainers,
> >
> > PING for review.
> >
> > The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> >
> > [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> >
> >
> > Venlig hilsen / Kind regards,
> > -Morten Brørup
> >
> Yep. I was waiting to see what happened to the mempool patch before
> considering this for next-net-intel.
I've merged it directly in main with dpaa mempool patches.
* [PATCH v2 2/2] ethdev: fix out-of-bounds write in flex item conversion
From: James Raphael Tiovalen @ 2026-06-10 11:33 UTC (permalink / raw)
To: dev
Cc: orika, thomas, andrew.rybchenko, stephen, stable,
James Raphael Tiovalen
In-Reply-To: <20260610113334.277895-1-jamestiotio@gmail.com>
rte_flow_item_flex_conv() is dispatched from rte_flow_conv_copy() to
deep-copy the variable-length pattern that follows a flex item header.
The function took no size argument at all, so the trailing rte_memcpy()
of `src->length` bytes was gated only on `buf != NULL`, violating the
documented contract that output is truncated to the caller-supplied
buffer size. A caller passing a buffer just large enough for the header
struct had adjacent memory clobbered by up to 4 GiB of pattern data,
since `src->length` is uint32_t and unbounded.
Propagate the remaining buffer size `size - sz` from
rte_flow_conv_copy() into the desc_fn callback and gate the inner
memcpy on it.
Fixes: dc4d860e8a89 ("ethdev: introduce configurable flexible item")
Cc: stable@dpdk.org
Signed-off-by: James Raphael Tiovalen <jamestiotio@gmail.com>
---
lib/ethdev/rte_flow.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/lib/ethdev/rte_flow.c b/lib/ethdev/rte_flow.c
index e534f2295b..60c9a3d06f 100644
--- a/lib/ethdev/rte_flow.c
+++ b/lib/ethdev/rte_flow.c
@@ -36,7 +36,7 @@ uint64_t rte_flow_dynf_metadata_mask;
struct rte_flow_desc_data {
const char *name;
size_t size;
- size_t (*desc_fn)(void *dst, const void *src);
+ size_t (*desc_fn)(void *dst, const void *src, size_t size);
};
/**
@@ -68,16 +68,17 @@ rte_flow_conv_copy(void *buf, const void *data, const size_t size,
if (buf != NULL)
rte_memcpy(buf, data, (size > sz ? sz : size));
if (rte_type && desc[type].desc_fn)
- sz += desc[type].desc_fn(size > 0 ? buf : NULL, data);
+ sz += desc[type].desc_fn(size > 0 ? buf : NULL, data,
+ size > sz ? size - sz : 0);
return sz;
}
static size_t
-rte_flow_item_flex_conv(void *buf, const void *data)
+rte_flow_item_flex_conv(void *buf, const void *data, size_t size)
{
struct rte_flow_item_flex *dst = buf;
const struct rte_flow_item_flex *src = data;
- if (buf) {
+ if (buf && size >= src->length) {
dst->pattern = rte_memcpy
((void *)((uintptr_t)(dst + 1)), src->pattern,
src->length);
--
2.43.0
* [PATCH v2 1/2] ethdev: fix out-of-bounds write in GENEVE option conversion
From: James Raphael Tiovalen @ 2026-06-10 11:33 UTC (permalink / raw)
To: dev
Cc: orika, thomas, andrew.rybchenko, stephen, stable,
James Raphael Tiovalen
In-Reply-To: <20260610113334.277895-1-jamestiotio@gmail.com>
rte_flow_conv_item_spec() is documented to truncate output to the
caller-supplied buffer size. For RTE_FLOW_ITEM_TYPE_GENEVE_OPT, the
deep-copy of the variable-length option data was gated on `size > 0`
instead of `size >= off + tmp`, the form used by the sibling RAW
branch. A caller passing a buffer just large enough for the header
struct had adjacent memory clobbered by up to `option_len * 4` bytes of
option payload.
Align the GENEVE_OPT guard with the RAW one.
Fixes: 841a0445442d ("ethdev: fix GENEVE option item conversion")
Cc: stable@dpdk.org
Signed-off-by: James Raphael Tiovalen <jamestiotio@gmail.com>
---
lib/ethdev/rte_flow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/ethdev/rte_flow.c b/lib/ethdev/rte_flow.c
index ec0fe08355..e534f2295b 100644
--- a/lib/ethdev/rte_flow.c
+++ b/lib/ethdev/rte_flow.c
@@ -701,7 +701,7 @@ rte_flow_conv_item_spec(void *buf, const size_t size,
src.geneve_opt = data;
dst.geneve_opt = buf;
tmp = spec.geneve_opt ? (spec.geneve_opt->option_len << 2) : 0;
- if (size > 0 && tmp > 0 && src.geneve_opt->data) {
+ if (size >= off + tmp && tmp > 0 && src.geneve_opt->data) {
deep_src = (void *)((uintptr_t)(dst.geneve_opt + 1));
dst.geneve_opt->data = rte_memcpy(deep_src,
src.geneve_opt->data,
--
2.43.0
* [PATCH v2 0/2] ethdev: fix out-of-bounds writes in rte_flow_conv()
From: James Raphael Tiovalen @ 2026-06-10 11:33 UTC (permalink / raw)
To: dev
Cc: orika, thomas, andrew.rybchenko, stephen, stable,
James Raphael Tiovalen
rte_flow_conv() is documented to truncate output to the caller-supplied
buffer size, but two paths handling variable-length trailing data
ignored that contract and copied the full payload whenever the
destination pointer was non-NULL. A caller passing a buffer just large
enough for the fixed-size header had adjacent memory clobbered:
- GENEVE_OPT: up to option_len * 4 bytes
- FLEX: up to 4 GiB, since src->length is a uint32_t and the API places
no bounds on it
Patch 1 aligns the GENEVE_OPT guard with the sibling RAW branch, which
already gates its copy on the remaining buffer size.
Patch 2 plumbs the remaining buffer size into the flex-item desc_fn
callback (which previously took no size argument at all) and gates the
inner rte_memcpy() on it.
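The contract both patches restore can be sketched as follows. This is a
simplified stand-in, not the rte_flow code: struct item and item_conv() are
hypothetical, plain memcpy() replaces rte_memcpy(), and the header is copied
only when it fits entirely (the real rte_flow_conv_copy() truncates it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical item: fixed-size header followed by variable-length data. */
struct item {
	uint32_t length;        /* number of payload bytes */
	const uint8_t *pattern; /* payload */
};

/*
 * Copy src into buf, which holds size bytes. The return value is the
 * number of bytes a complete copy needs, whether or not it fitted.
 */
static size_t
item_conv(void *buf, size_t size, const struct item *src)
{
	const size_t hdr = sizeof(*src);
	struct item *dst = buf;

	assert(src != NULL);
	if (buf != NULL && size >= hdr) {
		memcpy(dst, src, hdr);
		dst->pattern = NULL;
		/* Copy the payload only if the remaining space holds all of it. */
		if (src->length > 0 && size - hdr >= src->length)
			dst->pattern = memcpy((uint8_t *)buf + hdr,
					      src->pattern, src->length);
	}
	return hdr + src->length;
}
```

A caller can pass a NULL buffer first to learn the required size, then call
again with a buffer of that size.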
v2 fixes the merge conflict between patch 1 and the main branch.
James Raphael Tiovalen (2):
ethdev: fix out-of-bounds write in GENEVE option conversion
ethdev: fix out-of-bounds write in flex item conversion
lib/ethdev/rte_flow.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
--
2.43.0
* RE: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Morten Brørup @ 2026-06-10 11:31 UTC (permalink / raw)
To: dev
In-Reply-To: <20260601183621.252920-1-mb@smartsharesystems.com>
Recheck-request: iol-unit-arm64-testing
Unrelated CI failure.
CI log:
==== 20 line log output for Ubuntu 20.04 (lpm_autotest): ====
[7/407] Compiling C object lib/librte_log.a.p/log_log_journal.c.o
[8/407] Linking static target lib/librte_log.a
[9/407] Generating rte_argparse_map with a custom command
[10/407] Compiling C object lib/librte_kvargs.a.p/kvargs_rte_kvargs.c.o
[11/407] Linking static target lib/librte_kvargs.a
[12/407] Compiling C object lib/librte_argparse.a.p/argparse_rte_argparse.c.o
[13/407] Generating kvargs.sym_chk with a custom command (wrapped by meson to capture output)
[14/407] Linking static target lib/librte_argparse.a
[15/407] Generating log.sym_chk with a custom command (wrapped by meson to capture output)
[16/407] Linking target lib/librte_log.so.26.2
[17/407] Compiling C object lib/librte_telemetry.a.p/telemetry_telemetry_data.c.o
[18/407] Compiling C object lib/librte_telemetry.a.p/telemetry_telemetry.c.o
[19/407] Generating rte_telemetry_map with a custom command
FAILED: lib/telemetry_exports.map
/usr/bin/python3 ../buildtools/gen-version-map.py --linker gnu --abi-version ../ABI_VERSION --output lib/telemetry_exports.map --source ../lib/telemetry/telemetry.c ../lib/telemetry/telemetry_data.c ../lib/telemetry/telemetry_legacy.c
Segmentation fault (core dumped)
* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Bruce Richardson @ 2026-06-10 11:31 UTC (permalink / raw)
To: Morten Brørup; +Cc: Jingjing Wu, Praveen Shetty, dev
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35F65902@smartserver.smartshare.dk>
On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> Intel idpf maintainers,
>
> PING for review.
>
> The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
>
> [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
>
>
> Venlig hilsen / Kind regards,
> -Morten Brørup
>
Yep. I was waiting to see what happened to the mempool patch before
considering this for next-net-intel.
/Bruce
* RE: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Morten Brørup @ 2026-06-10 11:21 UTC (permalink / raw)
To: Jingjing Wu, Praveen Shetty, Bruce Richardson; +Cc: dev
In-Reply-To: <20260601183621.252920-1-mb@smartsharesystems.com>
Intel idpf maintainers,
PING for review.
The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
[1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
Venlig hilsen / Kind regards,
-Morten Brørup
> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Monday, 1 June 2026 20.36
> To: dev@dpdk.org; Andrew Rybchenko; Bruce Richardson; Jingjing Wu;
> Praveen Shetty; Hemant Agrawal; Sachin Saxena
> Cc: Morten Brørup
> Subject: [PATCH v7] net/idpf: update for new mempool cache algorithm
>
> As a consequence of the improved mempool cache algorithm, the PMD was
> updated regarding how much to backfill the mempool cache in the AVX512
> code path.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> v7:
> * Rebased.
> v6:
> * Moved driver changes out as separate patches, for easier review.
> (Bruce)
> ---
> Depends-on: patch-164745 ("mempool: improve cache behaviour and
> performance")
> ---
> .../net/intel/idpf/idpf_common_rxtx_avx512.c | 52 +++++++++++++++----
> 1 file changed, 42 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> index 8db4c64106..5788a009ab 100644
> --- a/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> +++ b/drivers/net/intel/idpf/idpf_common_rxtx_avx512.c
> @@ -148,15 +148,31 @@ idpf_singleq_rearm(struct idpf_rx_queue *rxq)
> /* Can this be satisfied from the cache? */
> if (cache->len < IDPF_RXQ_REARM_THRESH) {
> /* No. Backfill the cache first, and then fill from it */
> - uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> - cache->len);
>
> - /* How many do we require i.e. number to fill the cache +
> the request */
> + /* Backfill would exceed the cache bounce buffer limit? */
> + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> + if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> + idpf_singleq_rearm_common(rxq);
> + return;
> + }
> +
> + /*
> + * Backfill the cache from the backend;
> + * move up the hot objects in the cache to the top half of
> the cache,
> + * and fetch (size / 2) objects to the bottom of the cache.
> + */
> + __rte_assume(cache->len < cache->size / 2);
> + rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> + sizeof(void *) * cache->len);
> int ret = rte_mempool_ops_dequeue_bulk
> - (rxq->mp, &cache->objs[cache->len], req);
> + (rxq->mp, &cache->objs[0], cache->size / 2);
> if (ret == 0) {
> - cache->len += req;
> + cache->len += cache->size / 2;
> } else {
> + /*
> + * No further action is required for roll back, as
> the objects moved
> + * in the cache were actually copied, and the cache
> remains intact.
> + */
> if (rxq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
> rxq->nb_rx_desc) {
> __m128i dma_addr0;
> @@ -565,15 +581,31 @@ idpf_splitq_rearm(struct idpf_rx_queue *rx_bufq)
> /* Can this be satisfied from the cache? */
> if (cache->len < IDPF_RXQ_REARM_THRESH) {
> /* No. Backfill the cache first, and then fill from it */
> - uint32_t req = IDPF_RXQ_REARM_THRESH + (cache->size -
> - cache->len);
>
> - /* How many do we require i.e. number to fill the cache +
> the request */
> + /* Backfill would exceed the cache bounce buffer limit? */
> + __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE
> / 2);
> + if (unlikely(cache->size / 2 < IDPF_RXQ_REARM_THRESH)) {
> + idpf_splitq_rearm_common(rx_bufq);
> + return;
> + }
> +
> + /*
> + * Backfill the cache from the backend;
> + * move up the hot objects in the cache to the top half of
> the cache,
> + * and fetch (size / 2) objects to the bottom of the cache.
> + */
> + __rte_assume(cache->len < cache->size / 2);
> + rte_memcpy(&cache->objs[cache->size / 2], &cache->objs[0],
> + sizeof(void *) * cache->len);
> int ret = rte_mempool_ops_dequeue_bulk
> - (rx_bufq->mp, &cache->objs[cache->len], req);
> + (rx_bufq->mp, &cache->objs[0], cache->size /
> 2);
> if (ret == 0) {
> - cache->len += req;
> + cache->len += cache->size / 2;
> } else {
> + /*
> + * No further action is required for roll back, as
> the objects moved
> + * in the cache were actually copied, and the cache
> remains intact.
> + */
> if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
> rx_bufq->nb_rx_desc) {
> __m128i dma_addr0;
> --
> 2.43.0
* Re: [PATCH v8] mempool: improve cache behaviour and performance
From: Thomas Monjalon @ 2026-06-10 11:06 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, Andrew Rybchenko
In-Reply-To: <20260604114851.12586-1-mb@smartsharesystems.com>
04/06/2026 13:48, Morten Brørup:
> This patch refactors the mempool cache to eliminate some unexpected
> behaviour and reduce the mempool cache miss rate.
Applied, thanks.
* Re: [PATCH v2] eal: fix function versioning with LTO
From: David Marchand @ 2026-06-10 10:54 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev
In-Reply-To: <20260609135532.80396-1-stephen@networkplumber.org>
On Tue, 9 Jun 2026 at 15:56, Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> When using function versioning and building with LTO,
> GCC gets confused by the symbol versioning using __asm__.
> There are no uses of function versioning in upstream repo.
> This was found when adding additional parameter to
> rte_eth_dev_get_name_by_port.
>
> Assembler messages:
> Error: invalid attempt to declare external version name as default in symbol `rte_eth_dev_get_name_by_port@@DPDK_27'
>
> The workaround GCC 10 introduced was an additional function attribute;
> clang doesn't have or need this attribute. No need to backport this to
> LTS since there is no function versioning in those releases.
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Marchand <david.marchand@redhat.com>
Applied, thanks.
--
David Marchand
* Re: [EXTERNAL] [PATCH dpdk] graph: replace circular buffer with priority-based bitmap scheduling
From: Kiran Kumar Kokkilagadda @ 2026-06-10 10:51 UTC (permalink / raw)
To: Robin Jarry, dev@dpdk.org, Jerin Jacob, Nithin Kumar Dabilpuram,
Zhirun Yan
Cc: Vladimir Medvedkin, Christophe Fontaine, David Marchand,
Konstantin Ananyev, Maxime Leroy
In-Reply-To: <20260519101232.541102-2-rjarry@redhat.com>
From: Robin Jarry <rjarry@redhat.com>
Date: Tuesday, 19 May 2026 at 3:43 PM
To: dev@dpdk.org <dev@dpdk.org>; Jerin Jacob <jerinj@marvell.com>; Kiran Kumar Kokkilagadda <kirankumark@marvell.com>; Nithin Kumar Dabilpuram <ndabilpuram@marvell.com>; Zhirun Yan <yanzhirun_163@163.com>
Cc: Vladimir Medvedkin <vladimir.medvedkin@intel.com>; Christophe Fontaine <cfontain@redhat.com>; David Marchand <david.marchand@redhat.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>; Maxime Leroy <maxime@leroys.fr>
Subject: [EXTERNAL] [PATCH dpdk] graph: replace circular buffer with priority-based bitmap scheduling
Replace the FIFO circular buffer used to track pending nodes with
a bitmap and a priority-sorted schedule table. Each node can now have
a scheduling priority (int16_t, default 0, lower value means visited
first). Source nodes are forced to INT16_MIN so they always run first.
At graph creation time, nodes are sorted by (priority, id) and assigned
a bit position (sched_idx). During the walk, a bitmap tracks which nodes
have pending objects. Scanning from the lowest bit naturally visits
nodes in priority order.
This improves batching in fan-out-then-converge topologies. When
eth_input classifies packets to both mpls_input and ipv4_input, the old
FIFO order could process ipv4_input before mpls_input, causing
ipv4_input to be visited twice (once before and once after the MPLS
label is popped). With mpls_input at a higher priority (lower value), it
runs first and its output accumulates in ipv4_input which is then
visited only once with all packets.
The bitmap set operation is idempotent (OR on an already-set bit is
a no-op) which removes the need for the idx == 0 guards that the
circular buffer required to avoid duplicate enqueue.
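A minimal sketch of the walk described above, assuming a toy four-node graph whose pending bitmap fits in a single 64-bit word. The names `pending_set`, `enqueue`, `process_node` and `walk` are hypothetical stand-ins rather than rte_graph API, and `__builtin_ctzll` (GCC/Clang) is used here in place of `rte_ctz64`:

```c
#include <assert.h>
#include <stdint.h>

/* Toy graph: the enum value doubles as sched_idx (bit position),
 * i.e. the nodes are assumed to be already sorted by (priority, id). */
enum { ETH_INPUT, MPLS_INPUT, IPV4_INPUT, IP4_REWRITE, NB_NODES };

static uint64_t pending;      /* one word covers up to 64 nodes */
static int objs[NB_NODES];    /* objects waiting on each node */
static int visits[NB_NODES];  /* number of times each node ran */
static int batch[NB_NODES];   /* size of the last batch each node saw */

/* Idempotent: setting an already-set bit changes nothing. */
static void pending_set(int idx)
{
	assert(idx >= 0 && idx < 64);
	pending |= 1ULL << idx;
}

static void enqueue(int idx, int n)
{
	objs[idx] += n;
	pending_set(idx); /* no "idx == 0" guard needed */
}

static void process_node(int idx)
{
	int n = objs[idx];

	objs[idx] = 0;
	visits[idx]++;
	batch[idx] = n;

	switch (idx) {
	case ETH_INPUT: /* source: classify 4 packets, 2 per path */
		enqueue(MPLS_INPUT, 2);
		enqueue(IPV4_INPUT, 2);
		break;
	case MPLS_INPUT: /* pop label, hand over to IPv4 */
		enqueue(IPV4_INPUT, n);
		break;
	case IPV4_INPUT:
		enqueue(IP4_REWRITE, n);
		break;
	default: /* sink */
		break;
	}
}

static void walk(void)
{
	pending_set(ETH_INPUT); /* seed the source node */

	while (pending != 0) {
		/* the lowest set bit is the pending node that sorts first */
		int bit = __builtin_ctzll(pending);

		pending &= ~(1ULL << bit);
		process_node(bit);
	}
}
```

In this toy setup, calling `walk()` once visits `ipv4_input` a single time with all four packets, because `mpls_input` sits at a lower bit position and drains into it first.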
Suggested-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>
Signed-off-by: Robin Jarry <rjarry@redhat.com>
Cc: Christophe Fontaine <cfontain@redhat.com>
Cc: David Marchand <david.marchand@redhat.com>
Cc: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Cc: Maxime Leroy <maxime@leroys.fr>
---
doc/guides/prog_guide/graph_lib.rst | 37 +-
.../prog_guide/img/graph_mem_layout.svg | 1823 +++++++----------
lib/graph/graph.c | 19 +-
lib/graph/graph_debug.c | 12 +-
lib/graph/graph_populate.c | 117 +-
lib/graph/graph_private.h | 27 +-
lib/graph/node.c | 2 +
lib/graph/rte_graph.h | 1 +
lib/graph/rte_graph_model_mcore_dispatch.h | 34 +-
lib/graph/rte_graph_model_rtc.h | 65 +-
lib/graph/rte_graph_worker.h | 2 +-
lib/graph/rte_graph_worker_common.h | 81 +-
12 files changed, 984 insertions(+), 1236 deletions(-)
diff --git a/doc/guides/prog_guide/graph_lib.rst b/doc/guides/prog_guide/graph_lib.rst
index 8409e7666e85..9c6d8679b686 100644
--- a/doc/guides/prog_guide/graph_lib.rst
+++ b/doc/guides/prog_guide/graph_lib.rst
@@ -117,13 +117,22 @@ next_node[]:
The dynamic array to store the downstream nodes connected to this node. Downstream
node should not be current node itself or a source node.
+priority:
+^^^^^^^^^
+
+The scheduling priority of the node (``int16_t``, default 0). Nodes with lower
+priority values are visited first during the graph walk. This allows control
+over the order in which pending nodes are processed, which can improve packet
+batching in topologies where multiple paths converge on the same node.
+
Source node:
^^^^^^^^^^^^
Source nodes are static nodes created using ``RTE_NODE_REGISTER`` by passing
``flags`` as ``RTE_NODE_SOURCE_F``.
-While performing the graph walk, the ``process()`` function of all the source
-nodes will be called first. So that these nodes can be used as input nodes for a graph.
+Source nodes are automatically assigned the lowest possible priority
+(``INT16_MIN``) so that their ``process()`` function is always called first
+during the graph walk. This ensures they act as input nodes for a graph.
nb_xstats:
^^^^^^^^^^
@@ -396,12 +405,26 @@ Graph object memory layout
Understanding the memory layout helps to debug the graph library and
improve the performance if needed.
-Graph object consists of a header, circular buffer to store the pending stream
-when walking over the graph, variable-length memory to store the ``rte_node`` objects,
-and variable-length memory to store the xstat reported by each ``rte_node``.
+A graph object consists of a header, a scheduling table mapping bit positions to
+node offsets, pending and source bitmaps for tracking which nodes need
+processing, variable-length memory to store the ``rte_node`` objects, and
+variable-length memory to store the xstat reported by each ``rte_node``.
-The graph_nodes_mem_create() creates and populate this memory. The functions
-such as ``rte_graph_walk()`` and ``rte_node_enqueue_*`` use this memory
+Nodes are sorted by ``(priority, node_id)`` at graph creation time and each
+node is assigned a bit position in the pending bitmap. During the graph walk,
+the bitmap is scanned from the lowest bit, so nodes with lower priority values
+are visited first. Source nodes are always assigned the lowest priority
+(``INT16_MIN``) to ensure they run before any processing node.
+
+This priority-based ordering improves batching in fan-out-then-converge
+topologies. For example, if ``eth_input`` classifies packets to both
+``mpls_input`` and ``ipv4_input``, giving ``mpls_input`` a lower priority value
+ensures it runs first. Its output accumulates in ``ipv4_input`` which is then
+visited only once with all packets, instead of being visited twice (before and
+after MPLS label popping).
+
+The ``graph_fp_mem_create()`` function creates and populates this memory. The
+functions such as ``rte_graph_walk()`` and ``rte_node_enqueue_*`` use this memory
to enable fastpath services.
diff --git a/lib/graph/graph.c b/lib/graph/graph.c
index 6911ea8abeed..6dc1402e6bd0 100644
--- a/lib/graph/graph.c
+++ b/lib/graph/graph.c
@@ -334,20 +334,6 @@ graph_mem_fixup_secondary(struct rte_graph *graph)
return graph_mem_fixup_node_ctx(graph);
}
-static bool
-graph_src_node_avail(struct graph *graph)
-{
- struct graph_node *graph_node;
-
- STAILQ_FOREACH(graph_node, &graph->node_list, next)
- if ((graph_node->node->flags & RTE_NODE_SOURCE_F) &&
- (graph_node->node->lcore_id == RTE_MAX_LCORE ||
- graph->lcore_id == graph_node->node->lcore_id))
- return true;
-
- return false;
-}
-
RTE_EXPORT_SYMBOL(rte_graph_model_mcore_dispatch_core_bind)
int
rte_graph_model_mcore_dispatch_core_bind(rte_graph_t id, int lcore)
@@ -375,9 +361,8 @@ rte_graph_model_mcore_dispatch_core_bind(rte_graph_t id, int lcore)
graph->graph->dispatch.lcore_id = graph->lcore_id;
graph->socket = rte_lcore_to_socket_id(lcore);
- /* check the availability of source node */
- if (!graph_src_node_avail(graph))
- graph->graph->head = 0;
+ /* Rebuild source bitmap with only source nodes bound to this lcore */
+ graph_src_bitmap_rebuild(graph);
return 0;
diff --git a/lib/graph/graph_debug.c b/lib/graph/graph_debug.c
index e3b8cccdc1f0..8e99fa1b0fb8 100644
--- a/lib/graph/graph_debug.c
+++ b/lib/graph/graph_debug.c
@@ -15,8 +15,8 @@ graph_dump(FILE *f, struct graph *g)
fprintf(f, "graph <%s>\n", g->name);
fprintf(f, " id=%" PRIu32 "\n", g->id);
- fprintf(f, " cir_start=%" PRIu32 "\n", g->cir_start);
- fprintf(f, " cir_mask=%" PRIu32 "\n", g->cir_mask);
+ fprintf(f, " sched_table_off=%" PRIu32 "\n", g->sched_table_off);
+ fprintf(f, " nb_sched_words=%" PRIu16 "\n", g->nb_sched_words);
fprintf(f, " addr=%p\n", g);
fprintf(f, " graph=%p\n", g->graph);
fprintf(f, " mem_sz=%zu\n", g->mem_sz);
@@ -63,14 +63,14 @@ rte_graph_obj_dump(FILE *f, struct rte_graph *g, bool all)
fprintf(f, "graph <%s> @ %p\n", g->name, g);
fprintf(f, " id=%" PRIu32 "\n", g->id);
- fprintf(f, " head=%" PRId32 "\n", (int32_t)g->head);
- fprintf(f, " tail=%" PRId32 "\n", (int32_t)g->tail);
- fprintf(f, " cir_mask=0x%" PRIx32 "\n", g->cir_mask);
fprintf(f, " nb_nodes=%" PRId32 "\n", g->nb_nodes);
+ fprintf(f, " nb_sched_words=%" PRIu16 "\n", g->nb_sched_words);
fprintf(f, " socket=%d\n", g->socket);
fprintf(f, " fence=0x%" PRIx64 "\n", g->fence);
fprintf(f, " nodes_start=0x%" PRIx32 "\n", g->nodes_start);
- fprintf(f, " cir_start=%p\n", g->cir_start);
+ fprintf(f, " sched_table=%p\n", g->sched_table);
+ fprintf(f, " pending=%p\n", g->pending);
+ fprintf(f, " src_pending=%p\n", g->src_pending);
rte_graph_foreach_node(count, off, g, n) {
if (!all && n->idx == 0)
diff --git a/lib/graph/graph_populate.c b/lib/graph/graph_populate.c
index 026daecb2122..45bc7704fede 100644
--- a/lib/graph/graph_populate.c
+++ b/lib/graph/graph_populate.c
@@ -3,6 +3,8 @@
*/
+#include <stdlib.h>
+
#include <rte_common.h>
#include <rte_errno.h>
#include <rte_malloc.h>
@@ -15,19 +17,27 @@ static size_t
graph_fp_mem_calc_size(struct graph *graph)
{
struct graph_node *graph_node;
- rte_node_t val;
+ uint16_t nwords;
size_t sz;
/* Graph header */
sz = sizeof(struct rte_graph);
- /* Source nodes list */
- sz += sizeof(rte_graph_off_t) * graph->src_node_count;
- /* Circular buffer for pending streams of size number of nodes */
- val = rte_align32pow2(graph->node_count * sizeof(rte_graph_off_t));
- sz = RTE_ALIGN(sz, val);
- graph->cir_start = sz;
- graph->cir_mask = rte_align32pow2(graph->node_count) - 1;
- sz += val;
+
+ /* Schedule table: node offset indexed by sched_idx */
+ sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
+ graph->sched_table_off = sz;
+ sz += sizeof(rte_graph_off_t) * graph->node_count;
+
+ /* Pending and source pending bitmaps */
+ nwords = (graph->node_count + 63) / 64;
+ graph->nb_sched_words = nwords;
+ sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
+ graph->pending_off = sz;
+ sz += sizeof(uint64_t) * nwords;
+ sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
+ graph->src_pending_off = sz;
+ sz += sizeof(uint64_t) * nwords;
+
/* Fence */
sz += sizeof(RTE_GRAPH_FENCE);
sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
@@ -54,20 +64,44 @@ graph_fp_mem_calc_size(struct graph *graph)
}
static void
-graph_header_popluate(struct graph *_graph)
+graph_header_populate(struct graph *_graph)
{
struct rte_graph *graph = _graph->graph;
- graph->tail = 0;
- graph->head = (int32_t)-_graph->src_node_count;
- graph->cir_mask = _graph->cir_mask;
graph->nb_nodes = _graph->node_count;
- graph->cir_start = RTE_PTR_ADD(graph, _graph->cir_start);
+ graph->nb_sched_words = _graph->nb_sched_words;
+ graph->sched_table = RTE_PTR_ADD(graph, _graph->sched_table_off);
+ graph->pending = RTE_PTR_ADD(graph, _graph->pending_off);
+ graph->src_pending = RTE_PTR_ADD(graph, _graph->src_pending_off);
graph->nodes_start = _graph->nodes_start;
graph->socket = _graph->socket;
graph->id = _graph->id;
memcpy(graph->name, _graph->name, RTE_GRAPH_NAMESIZE);
graph->fence = RTE_GRAPH_FENCE;
+
+ memset(graph->pending, 0, sizeof(uint64_t) * _graph->nb_sched_words);
+ memset(graph->src_pending, 0, sizeof(uint64_t) * _graph->nb_sched_words);
+}
+
+static int16_t
+graph_node_effective_priority(const struct graph_node *gn)
+{
+ if (gn->node->flags & RTE_NODE_SOURCE_F)
+ return INT16_MIN;
+ return gn->node->priority;
+}
+
+static int
+graph_node_priority_cmp(const void *a, const void *b)
+{
+ const struct graph_node *const *na = a;
+ const struct graph_node *const *nb = b;
+ int16_t pa = graph_node_effective_priority(*na);
+ int16_t pb = graph_node_effective_priority(*nb);
+
+ if (pa != pb)
+ return (int)pa - (int)pb;
+ return (int)(*na)->node->id - (int)(*nb)->node->id;
}
static void
@@ -76,15 +110,26 @@ graph_nodes_populate(struct graph *_graph)
rte_graph_off_t xstat_off = _graph->xstats_start;
rte_graph_off_t off = _graph->nodes_start;
struct rte_graph *graph = _graph->graph;
- struct graph_node *graph_node;
+ struct graph_node **sorted, *graph_node;
rte_edge_t count, nb_edges;
rte_node_t pid;
+ uint32_t n;
- STAILQ_FOREACH(graph_node, &_graph->node_list, next) {
+ /* Build a sorted array of graph_node pointers by (priority, id) */
+ sorted = calloc(_graph->node_count, sizeof(*sorted));
+ RTE_VERIFY(sorted != NULL);
+ n = 0;
+ STAILQ_FOREACH(graph_node, &_graph->node_list, next)
+ sorted[n++] = graph_node;
+ qsort(sorted, n, sizeof(*sorted), graph_node_priority_cmp);
+
+ for (n = 0; n < _graph->node_count; n++) {
+ graph_node = sorted[n];
struct rte_node *node = RTE_PTR_ADD(graph, off);
memset(node, 0, sizeof(*node));
node->fence = RTE_GRAPH_FENCE;
node->off = off;
+ node->sched_idx = n;
if (graph_pcap_is_enable()) {
node->process = graph_pcap_dispatch;
node->original_process = graph_node->node->process;
@@ -123,8 +168,14 @@ graph_nodes_populate(struct graph *_graph)
off += sizeof(struct rte_node *) * nb_edges;
off = RTE_ALIGN(off, RTE_CACHE_LINE_SIZE);
node->next = off;
+
+ /* Fill the schedule table */
+ graph->sched_table[n] = node->off;
+
__rte_node_stream_alloc(graph, node);
}
+
+ free(sorted);
}
struct rte_node *
@@ -179,12 +230,11 @@ graph_node_nexts_populate(struct graph *_graph)
}
static int
-graph_src_nodes_offset_populate(struct graph *_graph)
+graph_src_bitmap_populate(struct graph *_graph)
{
struct rte_graph *graph = _graph->graph;
struct graph_node *graph_node;
struct rte_node *node;
- int32_t head = -1;
const char *name;
STAILQ_FOREACH(graph_node, &_graph->node_list, next) {
@@ -195,7 +245,7 @@ graph_src_nodes_offset_populate(struct graph *_graph)
SET_ERR_JMP(EINVAL, fail, "%s not found", name);
__rte_node_stream_alloc(graph, node);
- graph->cir_start[head--] = node->off;
+ __rte_node_pending_set(graph->src_pending, node);
}
}
@@ -204,17 +254,42 @@ graph_src_nodes_offset_populate(struct graph *_graph)
return -rte_errno;
}
+void
+graph_src_bitmap_rebuild(struct graph *_graph)
+{
+ struct rte_graph *graph = _graph->graph;
+ struct graph_node *graph_node;
+ struct rte_node *node;
+ const char *name;
+
+ memset(graph->src_pending, 0,
+ sizeof(uint64_t) * graph->nb_sched_words);
+
+ STAILQ_FOREACH(graph_node, &_graph->node_list, next) {
+ if (!(graph_node->node->flags & RTE_NODE_SOURCE_F))
+ continue;
+ if (graph_node->node->lcore_id != RTE_MAX_LCORE &&
+ graph_node->node->lcore_id != _graph->lcore_id)
+ continue;
+ name = graph_node->node->name;
+ node = graph_node_name_to_ptr(graph, name);
+ if (node == NULL)
+ continue;
+ __rte_node_pending_set(graph->src_pending, node);
+ }
+}
+
static int
graph_fp_mem_populate(struct graph *graph)
{
int rc;
- graph_header_popluate(graph);
+ graph_header_populate(graph);
if (graph_pcap_is_enable())
graph_pcap_init(graph);
graph_nodes_populate(graph);
rc = graph_node_nexts_populate(graph);
- rc |= graph_src_nodes_offset_populate(graph);
+ rc |= graph_src_bitmap_populate(graph);
return rc;
}
diff --git a/lib/graph/graph_private.h b/lib/graph/graph_private.h
index 26cdc6637192..df6f83b20261 100644
--- a/lib/graph/graph_private.h
+++ b/lib/graph/graph_private.h
@@ -49,6 +49,7 @@ struct node {
STAILQ_ENTRY(node) next; /**< Next node in the list. */
char name[RTE_NODE_NAMESIZE]; /**< Name of the node. */
uint64_t flags; /**< Node configuration flag. */
+ int16_t priority; /**< Scheduling priority. */
unsigned int lcore_id;
/**< Node runs on the Lcore ID used for mcore dispatch model. */
rte_node_process_t process; /**< Node process function. */
@@ -98,19 +99,23 @@ struct graph {
const struct rte_memzone *mz;
/**< Memzone to store graph data. */
rte_graph_off_t nodes_start;
- /**< Node memory start offset in graph reel. */
+ /**< Node memory start offset in graph memory. */
rte_graph_off_t xstats_start;
- /**< Node xstats memory start offset in graph reel. */
+ /**< Node xstats memory start offset in graph memory. */
rte_node_t src_node_count;
/**< Number of source nodes in a graph. */
struct rte_graph *graph;
/**< Pointer to graph data. */
rte_node_t node_count;
/**< Total number of nodes. */
- uint32_t cir_start;
- /**< Circular buffer start offset in graph reel. */
- uint32_t cir_mask;
- /**< Circular buffer mask for wrap around. */
+ uint32_t sched_table_off;
+ /**< Schedule table start offset in graph memory. */
+ uint32_t pending_off;
+ /**< Pending bitmap start offset in graph memory. */
+ uint32_t src_pending_off;
+ /**< Source pending bitmap start offset in graph memory. */
+ uint16_t nb_sched_words;
+ /**< Number of uint64_t words in pending bitmaps. */
rte_graph_t id;
/**< Graph identifier. */
rte_graph_t parent_id;
@@ -378,6 +383,16 @@ int graph_fp_mem_create(struct graph *graph);
*/
int graph_fp_mem_destroy(struct graph *graph);
+/**
+ * @internal
+ *
+ * Rebuild the source pending bitmap based on lcore affinity.
+ *
+ * @param graph
+ * Pointer to the internal graph object.
+ */
+void graph_src_bitmap_rebuild(struct graph *graph);
+
/* Lookup functions */
/**
* @internal
diff --git a/lib/graph/node.c b/lib/graph/node.c
index e3359fe490a5..b5599143b37b 100644
--- a/lib/graph/node.c
+++ b/lib/graph/node.c
@@ -153,6 +153,7 @@ __rte_node_register(const struct rte_node_register *reg)
if (rte_strscpy(node->name, reg->name, RTE_NODE_NAMESIZE) < 0)
goto free_xstat;
node->flags = reg->flags;
+ node->priority = reg->priority;
node->process = reg->process;
node->init = reg->init;
node->fini = reg->fini;
@@ -216,6 +217,7 @@ node_clone(struct node *node, const char *name)
/* Clone the source node */
reg->flags = node->flags;
+ reg->priority = node->priority;
reg->process = node->process;
reg->init = node->init;
reg->fini = node->fini;
diff --git a/lib/graph/rte_graph.h b/lib/graph/rte_graph.h
index 7e433f466129..6cd32ec22284 100644
--- a/lib/graph/rte_graph.h
+++ b/lib/graph/rte_graph.h
@@ -496,6 +496,7 @@ struct rte_node_register {
char name[RTE_NODE_NAMESIZE]; /**< Name of the node. */
uint64_t flags; /**< Node configuration flag. */
#define RTE_NODE_SOURCE_F (1ULL << 0) /**< Node type is source. */
+ int16_t priority; /**< Scheduling priority (lower = visited first, default 0). */
This will break the ABI. Please run the ABI check and fix whatever it reports.
rte_node_process_t process; /**< Node process function. */
rte_node_init_t init; /**< Node init function. */
rte_node_fini_t fini; /**< Node fini function. */
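A rough illustration of the layout concern with the new `priority` field in `struct rte_node_register`, assuming simplified stand-in structs. `reg_old`, `reg_new`, `process_fn` and `NAMESIZE` are hypothetical and only mirror the first few members shown in the hunk above; the real structure has more fields:

```c
#include <stddef.h>
#include <stdint.h>

#define NAMESIZE 64 /* assumption: stand-in for RTE_NODE_NAMESIZE */

/* Hypothetical callback type, only here to give the member a pointer size. */
typedef uint16_t (*process_fn)(void *objs, uint16_t nb_objs);

/* Layout before the patch (simplified). */
struct reg_old {
	char name[NAMESIZE];
	uint64_t flags;
	process_fn process;
};

/* Layout after the patch: priority inserted before process. */
struct reg_new {
	char name[NAMESIZE];
	uint64_t flags;
	int16_t priority;
	process_fn process;
};

/* Number of bytes by which the 'process' member moves. */
static size_t process_offset_shift(void)
{
	return offsetof(struct reg_new, process) -
	       offsetof(struct reg_old, process);
}
```

A non-zero shift means an application built against the old layout would store its `process` pointer where a newer library no longer looks for it. Appending the field at the end or reusing existing padding are the usual ways around this, subject to what the ABI check actually reports.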
diff --git a/lib/graph/rte_graph_model_mcore_dispatch.h b/lib/graph/rte_graph_model_mcore_dispatch.h
index f9ff3daa88ec..50a473564b56 100644
--- a/lib/graph/rte_graph_model_mcore_dispatch.h
+++ b/lib/graph/rte_graph_model_mcore_dispatch.h
@@ -77,9 +77,13 @@ int rte_graph_model_mcore_dispatch_node_lcore_affinity_set(const char *name,
unsigned int lcore_id);
/**
- * Perform graph walk on the circular buffer and invoke the process function
+ * Perform graph walk on the pending bitmap and invoke the process function
* of the nodes and collect the stats.
*
+ * Nodes are visited in scheduling order (lowest priority value first).
+ * Source nodes are seeded into the pending bitmap at the start of each walk.
+ * Nodes with different lcore affinity are dispatched to their target lcore.
+ *
* @param graph
* Graph pointer returned from rte_graph_lookup function.
*
@@ -88,20 +92,28 @@ int rte_graph_model_mcore_dispatch_node_lcore_affinity_set(const char *name,
static inline void
rte_graph_walk_mcore_dispatch(struct rte_graph *graph)
{
- const rte_graph_off_t *cir_start = graph->cir_start;
- const rte_node_t mask = graph->cir_mask;
- uint32_t head = graph->head;
+ const uint16_t nwords = graph->nb_sched_words;
struct rte_node *node;
+ uint16_t word, bit;
if (graph->dispatch.wq != NULL)
__rte_graph_mcore_dispatch_sched_wq_process(graph);
- while (likely(head != graph->tail)) {
- node = (struct rte_node *)RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
+ /* Seed pending bitmap with source nodes bound to this lcore */
+ for (word = 0; word < nwords; word++)
+ graph->pending[word] |= graph->src_pending[word];
- /* skip the src nodes which not bind with current worker */
- if ((int32_t)head < 1 && node->dispatch.lcore_id != graph->dispatch.lcore_id)
- continue;
+ for (;;) {
+ /* find first word with any pending bit */
+ for (word = 0; word < nwords; word++)
+ if (graph->pending[word])
+ break;
+ if (word == nwords)
+ break; /* no more pending nodes */
+
+ bit = rte_ctz64(graph->pending[word]);
+ graph->pending[word] &= ~(1ULL << bit);
+ node = __rte_graph_pending_node(graph, word, bit);
/* Schedule the node until all task/objs are done */
if (node->dispatch.lcore_id != RTE_MAX_LCORE &&
@@ -111,11 +123,7 @@ rte_graph_walk_mcore_dispatch(struct rte_graph *graph)
continue;
__rte_node_process(graph, node);
-
- head = likely((int32_t)head > 0) ? head & mask : head;
}
-
- graph->tail = 0;
}
#ifdef __cplusplus
diff --git a/lib/graph/rte_graph_model_rtc.h b/lib/graph/rte_graph_model_rtc.h
index 4b6236e301e3..38feb3e1ca88 100644
--- a/lib/graph/rte_graph_model_rtc.h
+++ b/lib/graph/rte_graph_model_rtc.h
@@ -6,9 +6,12 @@
#include "rte_graph_worker_common.h"
/**
- * Perform graph walk on the circular buffer and invoke the process function
+ * Perform graph walk on the pending bitmap and invoke the process function
* of the nodes and collect the stats.
*
+ * Nodes are visited in scheduling order (lowest priority value first).
+ * Source nodes are seeded into the pending bitmap at the start of each walk.
+ *
* @param graph
* Graph pointer returned from rte_graph_lookup function.
*
@@ -17,30 +20,52 @@
static inline void
rte_graph_walk_rtc(struct rte_graph *graph)
{
- const rte_graph_off_t *cir_start = graph->cir_start;
- const rte_node_t mask = graph->cir_mask;
- uint32_t head = graph->head;
+ const uint16_t nwords = graph->nb_sched_words;
struct rte_node *node;
+ uint16_t word, bit;
/*
- * Walk on the source node(s) ((cir_start - head) -> cir_start) and then
- * on the pending streams (cir_start -> (cir_start + mask) -> cir_start)
- * in a circular buffer fashion.
+ * Nodes are assigned a bit position (sched_idx) sorted by (priority,
+ * node_id) at graph creation time. Source nodes are forced to INT16_MIN
+ * priority so they always come first.
*
- * +-----+ <= cir_start - head [number of source nodes]
- * | |
- * | ... | <= source nodes
- * | |
- * +-----+ <= cir_start [head = 0] [tail = 0]
- * | |
- * | ... | <= pending streams
- * | |
- * +-----+ <= cir_start + mask
+ * sched_table[] maps bit positions to node offsets:
+ *
+ * pending[] sched_table[]
+ * +----------+ +------------------+
+ * | word 0 | ---> | src_node_0 | bit 0 (prio=INT16_MIN)
+ * | 1100...1 | | src_node_1 | bit 1 (prio=INT16_MIN)
+ * | | | mpls_input | bit 2 (prio=-10)
+ * | | | ipv4_input | bit 3 (prio=0)
+ * | | | ... |
+ * +----------+ +------------------+
+ * | word 1 | ---> | ip4_rewrite | bit 64 (prio=10)
+ * | ... | | ... |
+ * +----------+ +------------------+
+ *
+ * Walk: for each word, find lowest set bit (rte_ctz64), process that
+ * node, clear the bit, re-read the word (processing may have set new
+ * bits), repeat.
+ *
+ * After each node is processed, restart scanning from word 0 since
+ * processing may set bits in any word, including earlier ones.
*/
- while (likely(head != graph->tail)) {
- node = (struct rte_node *)RTE_PTR_ADD(graph, cir_start[(int32_t)head++]);
+
+ /* Seed pending bitmap with source nodes */
+ for (word = 0; word < nwords; word++)
+ graph->pending[word] |= graph->src_pending[word];
+
+ for (;;) {
+ /* find first word with any pending bit */
+ for (word = 0; word < nwords; word++)
+ if (graph->pending[word])
+ break;
+ if (word == nwords)
+ break; /* no more pending nodes */
+
+ bit = rte_ctz64(graph->pending[word]);
+ graph->pending[word] &= ~(1ULL << bit);
+ node = __rte_graph_pending_node(graph, word, bit);
__rte_node_process(graph, node);
- head = likely((int32_t)head > 0) ? head & mask : head;
}
- graph->tail = 0;
}
diff --git a/lib/graph/rte_graph_worker.h b/lib/graph/rte_graph_worker.h
index b0f952a82cc9..e513d7a655d9 100644
--- a/lib/graph/rte_graph_worker.h
+++ b/lib/graph/rte_graph_worker.h
@@ -14,7 +14,7 @@ extern "C" {
#endif
/**
- * Perform graph walk on the circular buffer and invoke the process function
+ * Perform graph walk on the pending bitmap and invoke the process function
* of the nodes and collect the stats.
*
* @param graph
diff --git a/lib/graph/rte_graph_worker_common.h b/lib/graph/rte_graph_worker_common.h
index 4ab53a533e4c..e52a37ce5e84 100644
--- a/lib/graph/rte_graph_worker_common.h
+++ b/lib/graph/rte_graph_worker_common.h
@@ -49,15 +49,14 @@ SLIST_HEAD(rte_graph_rq_head, rte_graph);
*/
struct __rte_cache_aligned rte_graph {
/* Fast path area. */
- uint32_t tail; /**< Tail of circular buffer. */
- uint32_t head; /**< Head of circular buffer. */
- uint32_t cir_mask; /**< Circular buffer wrap around mask. */
rte_node_t nb_nodes; /**< Number of nodes in the graph. */
- rte_graph_off_t *cir_start; /**< Pointer to circular buffer. */
rte_graph_off_t nodes_start; /**< Offset at which node memory starts. */
+ rte_graph_off_t *sched_table; /**< Node offset indexed by sched_idx. */
+ uint64_t *pending; /**< Bitmap of pending nodes. */
+ uint64_t *src_pending; /**< Bitmap of source nodes (constant). */
+ uint16_t nb_sched_words; /**< Number of uint64_t words in pending bitmaps. */
uint8_t model; /**< graph model */
- uint8_t reserved1; /**< Reserved for future use. */
- uint16_t reserved2; /**< Reserved for future use. */
+ /* 26 bytes padding */
union {
/* Fast schedule area for mcore dispatch model */
struct {
@@ -98,6 +97,7 @@ struct __rte_cache_aligned rte_node {
rte_node_t id; /**< Node identifier. */
rte_node_t parent_id; /**< Parent Node identifier. */
rte_edge_t nb_edges; /**< Number of edges from this node. */
+ uint16_t sched_idx; /**< Bit position in pending bitmap. */
uint32_t realloc_count; /**< Number of times realloced. */
char parent[RTE_NODE_NAMESIZE]; /**< Parent node name. */
@@ -132,7 +132,7 @@ struct __rte_cache_aligned rte_node {
}; /**< Node Context. */
uint16_t size; /**< Total number of objects available. */
uint16_t idx; /**< Number of objects used. */
- rte_graph_off_t off; /**< Offset of node in the graph reel. */
+ rte_graph_off_t off; /**< Offset of node in the graph memory. */
uint64_t total_cycles; /**< Cycles spent in this node. */
uint64_t total_calls; /**< Calls done to this node. */
uint64_t total_objs; /**< Objects processed by this node. */
@@ -187,12 +187,12 @@ void __rte_node_stream_alloc_size(struct rte_graph *graph,
/**
* @internal
*
- * Enqueue a given node to the tail of the graph reel.
+ * Process a node's pending objects and collect stats.
*
* @param graph
* Pointer Graph object.
* @param node
- * Pointer to node object to be enqueued.
+ * Pointer to node object to be processed.
*/
static __rte_always_inline void
__rte_node_process(struct rte_graph *graph, struct rte_node *node)
@@ -220,21 +220,42 @@ __rte_node_process(struct rte_graph *graph, struct rte_node *node)
/**
* @internal
*
- * Enqueue a given node to the tail of the graph reel.
+ * Get a pointer to a node from the scheduling table.
*
* @param graph
* Pointer Graph object.
+ * @param word
+ * Offset in the pending bitmap.
+ * @param bit
+ * Bit number.
+ *
+ * @return
+ * Pointer to the node.
+ */
+static __rte_always_inline struct rte_node *
+__rte_graph_pending_node(struct rte_graph *graph, uint16_t word, uint16_t bit)
+{
+ const uint16_t index = (word * sizeof(*graph->pending) * CHAR_BIT) + bit;
+ const rte_graph_off_t node_offset = graph->sched_table[index];
+ return RTE_PTR_ADD(graph, node_offset);
+}
+
+/**
+ * @internal
+ *
+ * Mark a node as pending in the graph scheduling bitmap.
+ *
+ * @param bitmap
+ * Either graph->pending or graph->src_pending.
* @param node
- * Pointer to node object to be enqueued.
+ * Pointer to node object to be marked pending.
*/
static __rte_always_inline void
-__rte_node_enqueue_tail_update(struct rte_graph *graph, struct rte_node *node)
+__rte_node_pending_set(uint64_t *bitmap, struct rte_node *node)
{
- uint32_t tail;
-
- tail = graph->tail;
- graph->cir_start[tail++] = node->off;
- graph->tail = tail & graph->cir_mask;
+ const uint16_t word = node->sched_idx / (sizeof(*bitmap) * CHAR_BIT);
+ const uint16_t bit = node->sched_idx % (sizeof(*bitmap) * CHAR_BIT);
+ bitmap[word] |= 1ULL << bit;
}
/**
@@ -242,8 +263,8 @@ __rte_node_enqueue_tail_update(struct rte_graph *graph, struct rte_node *node)
*
* Enqueue sequence prologue function.
*
- * Updates the node to tail of graph reel and resizes the number of objects
- * available in the stream as needed.
+ * Marks the node as pending in the scheduling bitmap and resizes the number
+ * of objects available in the stream as needed.
*
* @param graph
* Pointer to the graph object.
@@ -259,9 +280,7 @@ __rte_node_enqueue_prologue(struct rte_graph *graph, struct rte_node *node,
const uint16_t idx, const uint16_t space)
{
- /* Add to the pending stream list if the node is new */
- if (idx == 0)
- __rte_node_enqueue_tail_update(graph, node);
+ __rte_node_pending_set(graph->pending, node);
if (unlikely(node->size < (idx + space)))
__rte_node_stream_alloc_size(graph, node, node->size + space);
@@ -293,7 +312,7 @@ __rte_node_next_node_get(struct rte_node *node, rte_edge_t next)
/**
* Enqueue the objs to next node for further processing and set
- * the next node to pending state in the circular buffer.
+ * the next node to pending state in the scheduling bitmap.
*
* @param graph
* Graph pointer returned from rte_graph_lookup().
@@ -321,7 +340,7 @@ rte_node_enqueue(struct rte_graph *graph, struct rte_node *node,
/**
* Enqueue only one obj to next node for further processing and
- * set the next node to pending state in the circular buffer.
+ * set the next node to pending state in the scheduling bitmap.
*
* @param graph
* Graph pointer returned from rte_graph_lookup().
@@ -347,7 +366,7 @@ rte_node_enqueue_x1(struct rte_graph *graph, struct rte_node *node,
/**
* Enqueue only two objs to next node for further processing and
- * set the next node to pending state in the circular buffer.
+ * set the next node to pending state in the scheduling bitmap.
* Same as rte_node_enqueue_x1 but enqueue two objs.
*
* @param graph
@@ -377,7 +396,7 @@ rte_node_enqueue_x2(struct rte_graph *graph, struct rte_node *node,
/**
* Enqueue only four objs to next node for further processing and
- * set the next node to pending state in the circular buffer.
+ * set the next node to pending state in the scheduling bitmap.
* Same as rte_node_enqueue_x1 but enqueue four objs.
*
* @param graph
@@ -414,7 +433,7 @@ rte_node_enqueue_x4(struct rte_graph *graph, struct rte_node *node,
/**
* Enqueue objs to multiple next nodes for further processing and
- * set the next nodes to pending state in the circular buffer.
+ * set the next nodes to pending state in the scheduling bitmap.
* objs[i] will be enqueued to nexts[i].
*
* @param graph
@@ -472,7 +491,7 @@ rte_node_next_stream_get(struct rte_graph *graph, struct rte_node *node,
}
/**
- * Put the next stream to pending state in the circular buffer
+ * Put the next stream to pending state in the scheduling bitmap
* for further processing. Should be invoked after rte_node_next_stream_get().
*
* @param graph
@@ -495,9 +514,7 @@ rte_node_next_stream_put(struct rte_graph *graph, struct rte_node *node,
return;
node = __rte_node_next_node_get(node, next);
- if (node->idx == 0)
- __rte_node_enqueue_tail_update(graph, node);
-
+ __rte_node_pending_set(graph->pending, node);
node->idx += idx;
}
@@ -530,7 +547,7 @@ rte_node_next_stream_move(struct rte_graph *graph, struct rte_node *src,
src->objs = dobjs;
src->size = dsz;
dst->idx = src->idx;
- __rte_node_enqueue_tail_update(graph, dst);
+ __rte_node_pending_set(graph->pending, dst);
} else { /* Move the objects from src node to dst node */
rte_node_enqueue(graph, src, next, src->objs, src->idx);
}
--
2.54.0
* RE: [PATCH v5] net/iavf: fix duplicate VF reset during PF reset recovery
From: Loftus, Ciara @ 2026-06-10 10:50 UTC (permalink / raw)
To: Mandal, Anurag, dev@dpdk.org
Cc: Richardson, Bruce, Medvedkin, Vladimir, stable@dpdk.org
In-Reply-To: <20260610100704.366722-1-anurag.mandal@intel.com>
> Subject: [PATCH v5] net/iavf: fix duplicate VF reset during PF reset recovery
>
> During PF-initiated reset recovery, iavf_dev_close() sends
> an extra VIRTCHNL_OP_RESET_VF while recovery is already in progress.
> That second reset can leave PF/VF virtchnl state inconsistent and
> cause VIRTCHNL_OP_CONFIG_VSI_QUEUES to fail with ERR_PARAM after
> ToR link flap/power-cycle, leaving the VF unable to recover.
> This results in connection loss.
>
> This patch introduces a new flag, 'pf_reset_in_progress', that is
> set only when iavf_handle_hw_reset() is entered with
> vf_initiated_reset as false, and is cleared on exit.
> The close-time VF reset and related close-time virtchnl
> operations are skipped while this flag is set.
> This avoids a duplicate VF reset and keeps the normal
> behavior for application-driven close or VF-initiated reinit.
>
> Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
> Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
> Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without
> interrupt")
> Cc: stable@dpdk.org
>
> Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
Acked-by: Ciara Loftus <ciara.loftus@intel.com>
I think you may need to respin due to patch application failure.
I have some suggestions for improving the comments/release notes
that you could include in the next version. Code looks good to me.
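The flag handling described in the commit message can be sketched roughly as follows. This is a simplified model, not the driver code: `struct vf_state`, `send_reset_vf`, `dev_close` and `handle_hw_reset` are hypothetical stand-ins, and the assumption is that the recovery path reaches the close routine internally.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down VF state. */
struct vf_state {
	bool pf_reset_in_progress;
	int reset_vf_sent; /* counts VIRTCHNL_OP_RESET_VF messages */
};

static void send_reset_vf(struct vf_state *vf)
{
	vf->reset_vf_sent++;
}

/* Close path: skip the VF reset while a PF-initiated recovery is running. */
static void dev_close(struct vf_state *vf)
{
	if (vf->pf_reset_in_progress)
		return; /* the PF already reset this VF */
	send_reset_vf(vf);
}

/* Recovery path: the flag is set only for PF-initiated resets. */
static void handle_hw_reset(struct vf_state *vf, bool vf_initiated_reset)
{
	if (!vf_initiated_reset)
		vf->pf_reset_in_progress = true;

	dev_close(vf); /* assumption: recovery tears the port down via close */
	/* ... re-initialisation would follow here ... */

	vf->pf_reset_in_progress = false; /* cleared on exit */
}
```

In this model a PF-initiated recovery sends no extra reset, while an application-driven close or a VF-initiated reinit still sends exactly one.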
> ---
> V5: Addressed Ciara Loftus's comments
> - added separate flag for PF initiated reset recovery
> V4: Addressed Ciara Loftus's comments
> - split VF reset from other code changes
> V3: Addressed latest ai-code-review comments
> V2: Addressed ai-code-review comments
>
> doc/guides/rel_notes/release_26_07.rst | 3 ++
> drivers/net/intel/iavf/iavf.h | 7 +++++
> drivers/net/intel/iavf/iavf_ethdev.c | 40 +++++++++++++++-----------
> drivers/net/intel/iavf/iavf_vchnl.c | 18 ++++++++++--
> 4 files changed, 49 insertions(+), 19 deletions(-)
>
> diff --git a/doc/guides/rel_notes/release_26_07.rst
> b/doc/guides/rel_notes/release_26_07.rst
> index d2563ac503..f6899a78c3 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -95,6 +95,9 @@ New Features
>
> * Added support for transmitting LLDP packets based on mbuf packet type.
> * Implemented AVX2 context descriptor transmit paths.
> + * Prevented duplicate 'VIRTCHNL_OP_RESET_VF' during a PF-initiated
> + reset recovery, which earlier caused virtchnl state corruption
> + and connection loss after a top-of-rack (ToR) link flap/power-cycle.
I think something more concise here would be better.
eg. "Fixed duplicate send of 'VIRTCHNL_OP_RESET_VF' during PF reset
recovery which could cause virtchnl state corruption"
>
> * **Updated PCAP ethernet driver.**
>
> diff --git a/drivers/net/intel/iavf/iavf.h b/drivers/net/intel/iavf/iavf.h
> index 2615b6f034..67aacbe7a6 100644
> --- a/drivers/net/intel/iavf/iavf.h
> +++ b/drivers/net/intel/iavf/iavf.h
> @@ -292,6 +292,13 @@ struct iavf_info {
>
> bool in_reset_recovery;
>
> + /*
> + * Set only while iavf_handle_hw_reset()
> + * is processing a PF-initiated reset
> + * (vf_initiated_reset == false).
> + */
I don't think a comment is warranted here, the variable name is
self-explanatory.
> + bool pf_reset_in_progress;
> +
> uint32_t ptp_caps;
> rte_spinlock_t phc_time_aq_lock;
> };
> diff --git a/drivers/net/intel/iavf/iavf_ethdev.c
> b/drivers/net/intel/iavf/iavf_ethdev.c
> index a8031e23a5..2b6f4daa99 100644
> --- a/drivers/net/intel/iavf/iavf_ethdev.c
> +++ b/drivers/net/intel/iavf/iavf_ethdev.c
> @@ -3166,23 +3166,27 @@ iavf_dev_close(struct rte_eth_dev *dev)
>
> ret = iavf_dev_stop(dev);
>
> - /*
> - * Release redundant queue resource when close the dev
> - * so that other vfs can re-use the queues.
> - */
> - if (vf->lv_enabled) {
> - ret = iavf_request_queues(dev,
> IAVF_MAX_NUM_QUEUES_DFLT);
> - if (ret)
> - PMD_DRV_LOG(ERR, "Reset the num of queues
> failed");
> + /* Skip RESET_VF on a PF-initiated reset */
Regarding the comment above, here we're not skipping RESET_VF but rather
preventing virtchnl messages from being sent to the adminq during the
PF-initiated reset. I suggest rewording the comment to reflect that.
> + if (!vf->pf_reset_in_progress) {
>
> - vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
> - }
> + /*
> + * Release redundant queue resource when close the dev
> + * so that other vfs can re-use the queues.
> + */
> + if (vf->lv_enabled) {
> + ret = iavf_request_queues(dev,
> IAVF_MAX_NUM_QUEUES_DFLT);
> + if (ret)
> + PMD_DRV_LOG(ERR, "Reset the num of
> queues failed");
> + vf->max_rss_qregion =
* [RFC] memtank: add memtank library
From: Konstantin Ananyev @ 2026-06-10 10:39 UTC (permalink / raw)
To: dev; +Cc: hofors, mb
Introduce memtank, a highly customizable fixed-size object allocator
for DPDK applications. It offers close to mempool-level performance
on the fast path.
The main difference from mempool is the ability to grow/shrink at runtime
with user-provided grow/shrink threshold values, plus some extra
features for higher flexibility.
Key properties:
- relies on the user to provide callbacks for actual memory reservations.
The user is free to choose whatever is most suitable for their scenario,
e.g. malloc/rte_malloc/mmap/some custom memory allocator.
- user defined constructor callback for newly allocated objects.
- bulk alloc and free APIs.
- different alloc/free policies (specified by user via flags parameter):
* lightweight as possible, but can fail
* more robust, but heavyweight - causes call to user-provided backing
memory allocator.
- backing memory grows/shrinks on demand; special API extensions
let the user control grow/shrink size, frequency and when/where it
happens (DP, CP, both, etc.).
- ability to pre-allocate all objects at memtank creation time
(mempool like behavior).
- custom object size and alignment.
- per object runtime statistics and sanity-checks (boundary violation,
double free, etc.) can be enabled/disabled at memtank creation time.
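To make the threshold idea above concrete, here is a minimal toy model.
It is not the memtank implementation: all names (toy_tank, toy_grow,
toy_shrink, ...) are hypothetical, a "chunk" is simplified to a single
object, and there is no locking. It only illustrates user-supplied
backing callbacks plus min_free/max_free driven grow and shrink.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_FREE_SLOTS 64

/* Hypothetical toy model; names do not come from rte_memtank.h. */
struct toy_tank {
	void *(*alloc)(size_t sz, void *udata); /* user backing allocator */
	void (*free)(void *buf, void *udata);   /* user backing release */
	void *udata;
	size_t obj_size;
	uint32_t min_free; /* grow while the free list is below this */
	uint32_t max_free; /* shrink while the free list is above this */
	uint32_t nb_free;
	void *free_objs[TOY_MAX_FREE_SLOTS]; /* LIFO stack of free objects */
};

/* Refill the free list up to min_free; returns number of objects added. */
uint32_t
toy_grow(struct toy_tank *t)
{
	uint32_t n = 0;

	while (t->nb_free < t->min_free && t->nb_free < TOY_MAX_FREE_SLOTS) {
		void *p = t->alloc(t->obj_size, t->udata);

		if (p == NULL)
			break;
		t->free_objs[t->nb_free++] = p;
		n++;
	}
	return n;
}

/* Hand surplus objects above max_free back to the backing allocator. */
uint32_t
toy_shrink(struct toy_tank *t)
{
	uint32_t n = 0;

	while (t->nb_free > t->max_free) {
		t->free(t->free_objs[--t->nb_free], t->udata);
		n++;
	}
	return n;
}

/* Lightweight alloc: pops from the free list only, may return fewer. */
uint32_t
toy_alloc(struct toy_tank *t, void *obj[], uint32_t num)
{
	uint32_t n = 0;

	while (n < num && t->nb_free != 0)
		obj[n++] = t->free_objs[--t->nb_free];
	return n;
}

/* Free: push back on the free list, or release directly if it is full. */
void
toy_free(struct toy_tank *t, void *obj[], uint32_t num)
{
	for (uint32_t i = 0; i < num; i++) {
		if (t->nb_free < TOY_MAX_FREE_SLOTS)
			t->free_objs[t->nb_free++] = obj[i];
		else
			t->free(obj[i], t->udata);
	}
}

/* Backing callbacks built on malloc/free; udata counts live buffers. */
void *
toy_sys_alloc(size_t sz, void *udata)
{
	void *p = malloc(sz);

	if (p != NULL)
		(*(int *)udata)++;
	return p;
}

void
toy_sys_free(void *buf, void *udata)
{
	free(buf);
	(*(int *)udata)--;
}
```

In this sketch the fast path (toy_alloc/toy_free) never touches the
backing allocator unless the free list overflows; only toy_grow and
toy_shrink do, so the caller decides on which thread that cost is paid.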
Known limitations (subject to further improvement):
- scalability:
with 8 or more lcores, conventional mempool (with FIFO) starts to
outperform memtank (which uses LIFO inside).
- mempool_cache integration is not part of the library and right now
has to be implemented by the user manually on top of the memtank API.
Envisioned usage scenarios within DPDK-based apps:
various flow/session control structures (TCP PCB, CT, NAT sessions, etc.)
that need to be allocated/freed on the data path.
It can also be used for 'semi-fastpath' allocations:
TBL-8 blocks for LPM, hash buckets, etc.
The initial idea is inspired by the Linux/Solaris SLAB allocators.
It also reuses some ideas from my previous work on the TLDK project:
https://github.com/FDio/tldk
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
app/test/meson.build | 2 +
app/test/test_memtank.c | 160 +++
app/test/test_memtank_stress.c | 1075 +++++++++++++++++
doc/api/doxy-api-index.md | 1 +
doc/api/doxy-api.conf.in | 1 +
.../prog_guide/img/memtank_internal.svg | 1 +
doc/guides/prog_guide/index.rst | 1 +
doc/guides/prog_guide/memtank_lib.rst | 231 ++++
lib/memtank/memtank.c | 630 ++++++++++
lib/memtank/memtank.h | 110 ++
lib/memtank/meson.build | 18 +
lib/memtank/misc.c | 375 ++++++
lib/memtank/rte_memtank.h | 303 +++++
lib/meson.build | 1 +
14 files changed, 2909 insertions(+)
create mode 100644 app/test/test_memtank.c
create mode 100644 app/test/test_memtank_stress.c
create mode 100644 doc/guides/prog_guide/img/memtank_internal.svg
create mode 100644 doc/guides/prog_guide/memtank_lib.rst
create mode 100644 lib/memtank/memtank.c
create mode 100644 lib/memtank/memtank.h
create mode 100644 lib/memtank/meson.build
create mode 100644 lib/memtank/misc.c
create mode 100644 lib/memtank/rte_memtank.h
diff --git a/app/test/meson.build b/app/test/meson.build
index 61024125a7..d1b6203cf5 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -129,6 +129,8 @@ source_file_deps = {
'test_memory.c': [],
'test_mempool.c': [],
'test_mempool_perf.c': [],
+ 'test_memtank.c': ['memtank'],
+ 'test_memtank_stress.c': ['memtank'],
'test_memzone.c': [],
'test_meter.c': ['meter'],
'test_metrics.c': ['metrics'],
diff --git a/app/test/test_memtank.c b/app/test/test_memtank.c
new file mode 100644
index 0000000000..1cf55d65e6
--- /dev/null
+++ b/app/test/test_memtank.c
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <inttypes.h>
+
+#include <rte_memory.h>
+#include <rte_memtank.h>
+#include <rte_errno.h>
+#include "test.h"
+
+/* TEST SUITE */
+
+static int
+memtank_test_setup(void)
+{
+ return 0;
+}
+
+static void
+memtank_test_teardown(void)
+{
+}
+
+
+
+static void *
+test_alloc(size_t sz, void *p)
+{
+ RTE_SET_USED(p);
+ return malloc(sz);
+}
+
+static void
+test_free(void *buf, void *p)
+{
+ RTE_SET_USED(p);
+ free(buf);
+}
+
+static int
+test_memtank_create_invalid(void)
+{
+ struct rte_memtank_prm prm;
+ struct rte_memtank *mt;
+
+ memset(&prm, 0, sizeof(prm));
+
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.alloc = test_alloc;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.free = test_free;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.obj_align = 2;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.min_free = 2;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.max_free = 2;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.max_obj = 2;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_EQUAL(mt, NULL, "memtank create");
+ RTE_TEST_ASSERT_EQUAL(rte_errno, EINVAL, "errno EINVAL");
+
+ prm.nb_obj_chunk = 2;
+ rte_errno = 0;
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_NOT_EQUAL(mt, NULL, "memtank create");
+
+ rte_memtank_destroy(mt);
+ return TEST_SUCCESS;
+}
+
+
+static int
+test_memtank_alloc(void)
+{
+ struct rte_memtank_prm prm;
+ struct rte_memtank *mt;
+
+ memset(&prm, 0, sizeof(prm));
+ prm.alloc = test_alloc;
+ prm.free = test_free;
+ prm.obj_align = 2;
+ prm.nb_obj_chunk = 2;
+ prm.min_free = 2;
+ prm.max_free = 2;
+ prm.max_obj = 10;
+
+ mt = rte_memtank_create(&prm);
+ RTE_TEST_ASSERT_NOT_EQUAL(mt, NULL, "memtank create");
+
+ void *obj[3] = { NULL };
+ uint32_t rc;
+
+ /* min_obj is 0 so this is expected to fail */
+ rc = rte_memtank_alloc(mt, obj, 1, RTE_MTANK_ALLOC_CHUNK);
+ RTE_TEST_ASSERT_EQUAL(rc, 0, "memtank alloc chunk 0 (%u)", rc);
+
+ rc = rte_memtank_alloc(mt, obj, 1, RTE_MTANK_ALLOC_GROW);
+ RTE_TEST_ASSERT_EQUAL(rc, 1, "memtank alloc 1 (%u)", rc);
+ RTE_TEST_ASSERT_NOT_EQUAL(obj[0], NULL, "alloc obj");
+
+ rc = rte_memtank_alloc(mt, obj, 3, RTE_MTANK_ALLOC_CHUNK);
+ RTE_TEST_ASSERT_EQUAL(rc, 3, "memtank alloc 3 (%u)", rc);
+
+ /* will fail - out of free objs */
+ rc = rte_memtank_alloc(mt, obj, 1, RTE_MTANK_ALLOC_CHUNK);
+ RTE_TEST_ASSERT_EQUAL(rc, 0, "memtank alloc chunk 0 (%u)", rc);
+
+ rte_memtank_destroy(mt);
+ return TEST_SUCCESS;
+}
+
+static struct unit_test_suite memtank_testsuite = {
+ .suite_name = "memtank library test suite",
+ .setup = memtank_test_setup,
+ .teardown = memtank_test_teardown,
+ .unit_test_cases = {
+ TEST_CASE(test_memtank_alloc),
+ TEST_CASE(test_memtank_create_invalid),
+ TEST_CASES_END(), /**< NULL terminate unit test array */
+ },
+};
+
+static int
+test_memtank(void)
+{
+ return unit_test_suite_runner(&memtank_testsuite);
+}
+
+REGISTER_FAST_TEST(memtank_autotest, NOHUGE_OK, ASAN_OK, test_memtank);
diff --git a/app/test/test_memtank_stress.c b/app/test/test_memtank_stress.c
new file mode 100644
index 0000000000..67b10c6611
--- /dev/null
+++ b/app/test/test_memtank_stress.c
@@ -0,0 +1,1075 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ * Copyright(c) 2025 Huawei Technologies Co., Ltd
+ */
+
+#include <string.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_errno.h>
+#include <rte_launch.h>
+#include <rte_cycles.h>
+#include <rte_eal.h>
+#include <rte_ring.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_random.h>
+#include <rte_hexdump.h>
+#include <rte_malloc.h>
+#include <rte_memtank.h>
+
+#include "test.h"
+
+struct memstat {
+ struct {
+ rte_atomic64_t nb_call;
+ rte_atomic64_t nb_fail;
+ rte_atomic64_t sz;
+ } alloc;
+ struct {
+ rte_atomic64_t nb_call;
+ rte_atomic64_t nb_fail;
+ } free;
+ uint64_t nb_alloc_obj;
+};
+
+struct memtank_stat {
+ uint64_t nb_cycle;
+ struct {
+ uint64_t nb_call;
+ uint64_t nb_req;
+ uint64_t nb_alloc;
+ uint64_t nb_cycle;
+ uint64_t max_cycle;
+ uint64_t min_cycle;
+ } alloc;
+ struct {
+ uint64_t nb_call;
+ uint64_t nb_free;
+ uint64_t nb_cycle;
+ uint64_t max_cycle;
+ uint64_t min_cycle;
+ } free;
+ struct {
+ uint64_t nb_call;
+ uint64_t nb_chunk;
+ uint64_t nb_cycle;
+ uint64_t max_cycle;
+ uint64_t min_cycle;
+ } grow;
+ struct {
+ uint64_t nb_call;
+ uint64_t nb_chunk;
+ uint64_t nb_cycle;
+ uint64_t max_cycle;
+ uint64_t min_cycle;
+ } shrink;
+};
+
+struct master_args {
+ uint64_t run_cycles;
+ uint32_t delay_us;
+ uint32_t flags;
+};
+
+struct worker_args {
+ uint32_t max_obj;
+ uint32_t obj_size;
+ uint32_t alloc_flags;
+ uint32_t free_flags;
+ struct rte_ring *rng;
+};
+
+struct memtank_arg {
+ struct rte_memtank *mt;
+ union {
+ struct master_args master;
+ struct worker_args worker;
+ };
+ struct memtank_stat stats;
+} __rte_cache_aligned;
+
+#define BULK_NUM 32
+
+#define OBJ_SZ_MIN 1
+#define OBJ_SZ_MAX 0x100000
+#define OBJ_SZ_DEF (4 * RTE_CACHE_LINE_SIZE + 1)
+
+#define TEST_TIME 10
+#define CLEANUP_TIME 3
+
+#define FREE_THRSH_MIN 0
+#define FREE_THRSH_MAX 100
+
+enum {
+ WRK_CMD_STOP,
+ WRK_CMD_RUN,
+};
+
+enum {
+ MASTER_FLAG_GROW = 1,
+ MASTER_FLAG_SHRINK = 2,
+};
+
+enum {
+ MEM_FUNC_SYS,
+ MEM_FUNC_RTE,
+};
+
+static uint32_t wrk_cmd __rte_cache_aligned;
+
+static struct rte_memtank_prm mtnk_prm = {
+ .min_free = 4 * BULK_NUM,
+ .max_free = 32 * BULK_NUM,
+ .obj_size = OBJ_SZ_DEF,
+ .obj_align = RTE_CACHE_LINE_SIZE,
+ .nb_obj_chunk = BULK_NUM,
+ .flags = RTE_MTANK_OBJ_DBG,
+};
+
+static struct {
+ uint32_t run_time; /* test run-time in seconds */
+ uint32_t wrk_max_obj; /* max alloced objects per worker */
+ uint32_t wrk_fill_thrsh; /* wrk fill thresh % (0-100) */
+ uint32_t wrk_free_thrsh; /* wrk free thresh % (0-100) */
+ int32_t mem_func; /* memory subsystem to use for alloc/free */
+ int32_t verbose; /* verbose: print stat for each worker */
+} global_cfg = {
+ .run_time = TEST_TIME,
+ .wrk_max_obj = 2 * BULK_NUM,
+ .wrk_fill_thrsh = FREE_THRSH_MAX,
+ .wrk_free_thrsh = FREE_THRSH_MIN,
+ .mem_func = MEM_FUNC_SYS,
+ .verbose = 0,
+};
+
+static void *
+alloc_func(size_t sz)
+{
+ switch (global_cfg.mem_func) {
+ case MEM_FUNC_SYS:
+ return malloc(sz);
+ case MEM_FUNC_RTE:
+ return rte_malloc(NULL, sz, 0);
+ }
+
+ return NULL;
+}
+
+static void
+free_func(void *p)
+{
+ switch (global_cfg.mem_func) {
+ case MEM_FUNC_SYS:
+ return free(p);
+ case MEM_FUNC_RTE:
+ return rte_free(p);
+ }
+}
+
+static void *
+test_alloc1(size_t sz, void *p)
+{
+ struct memstat *ms;
+ void *buf;
+
+ ms = p;
+ buf = alloc_func(sz);
+ rte_atomic64_inc(&ms->alloc.nb_call);
+ if (buf != NULL) {
+ memset(buf, 0, sz);
+ rte_atomic64_add(&ms->alloc.sz, sz);
+ } else
+ rte_atomic64_inc(&ms->alloc.nb_fail);
+
+ return buf;
+}
+
+static void
+test_free1(void *buf, void *p)
+{
+ struct memstat *ms;
+
+ ms = p;
+
+ free_func(buf);
+ rte_atomic64_inc(&ms->free.nb_call);
+ if (buf == NULL)
+ rte_atomic64_inc(&ms->free.nb_fail);
+}
+
+static void
+global_cfg_dump(FILE *f)
+{
+ fprintf(f, "%s={\n", __func__);
+ fprintf(f, "\t.run_time=%u\n", global_cfg.run_time);
+ fprintf(f, "\t.wrk_max_obj=%u\n", global_cfg.wrk_max_obj);
+ fprintf(f, "\t.wrk_fill_thrsh=%u\n", global_cfg.wrk_fill_thrsh);
+ fprintf(f, "\t.wrk_free_thrsh=%u\n", global_cfg.wrk_free_thrsh);
+ fprintf(f, "\t.mem_func=%d\n", global_cfg.mem_func);
+ fprintf(f, "\t.verbose=%d\n", global_cfg.verbose);
+ fprintf(f, "};\n");
+}
+
+static void
+memstat_dump(FILE *f, struct memstat *ms)
+{
+ uint64_t alloc_sz, nb_alloc;
+ long double muc, mut;
+
+ nb_alloc = rte_atomic64_read(&ms->alloc.nb_call) -
+ rte_atomic64_read(&ms->alloc.nb_fail);
+ alloc_sz = (nb_alloc == 0) ? 0 : rte_atomic64_read(&ms->alloc.sz) / nb_alloc;
+ nb_alloc -= rte_atomic64_read(&ms->free.nb_call) -
+ rte_atomic64_read(&ms->free.nb_fail);
+ alloc_sz *= nb_alloc;
+ mut = (alloc_sz == 0) ? 1 :
+ (long double)ms->nb_alloc_obj * mtnk_prm.obj_size / alloc_sz;
+ muc = (alloc_sz == 0) ? 1 :
+ (long double)(ms->nb_alloc_obj + mtnk_prm.max_free) *
+ mtnk_prm.obj_size / alloc_sz;
+
+ fprintf(f, "%s(%p)={\n", __func__, ms);
+ fprintf(f, "\talloc={\n");
+ fprintf(f, "\t\tnb_call=%" PRIu64 ",\n",
+ rte_atomic64_read(&ms->alloc.nb_call));
+ fprintf(f, "\t\tnb_fail=%" PRIu64 ",\n",
+ rte_atomic64_read(&ms->alloc.nb_fail));
+ fprintf(f, "\t\tsz=%" PRIu64 ",\n",
+ rte_atomic64_read(&ms->alloc.sz));
+ fprintf(f, "\t},\n");
+ fprintf(f, "\tfree={\n");
+ fprintf(f, "\t\tnb_call=%" PRIu64 ",\n",
+ rte_atomic64_read(&ms->free.nb_call));
+ fprintf(f, "\t\tnb_fail=%" PRIu64 ",\n",
+ rte_atomic64_read(&ms->free.nb_fail));
+ fprintf(f, "\t},\n");
+ fprintf(f, "\tnb_alloc_obj=%" PRIu64 ",\n", ms->nb_alloc_obj);
+ fprintf(f, "\tnb_alloc_chunk=%" PRIu64 ",\n", nb_alloc);
+ fprintf(f, "\talloc_sz=%" PRIu64 ",\n", alloc_sz);
+ fprintf(f, "\tmem_util(total)=%.2Lf %%,\n", mut * 100);
+ fprintf(f, "\tmem_util(cached)=%.2Lf %%,\n", muc * 100);
+ fprintf(f, "};\n");
+
+}
+
+static void
+memtank_stat_reset(struct memtank_stat *ms)
+{
+ static const struct memtank_stat init_stat = {
+ .alloc.min_cycle = UINT64_MAX,
+ .free.min_cycle = UINT64_MAX,
+ .grow.min_cycle = UINT64_MAX,
+ .shrink.min_cycle = UINT64_MAX,
+ };
+
+ *ms = init_stat;
+}
+
+static void
+memtank_stat_aggr(struct memtank_stat *as, const struct memtank_stat *ms)
+{
+ if (ms->alloc.nb_call != 0) {
+ as->alloc.nb_call += ms->alloc.nb_call;
+ as->alloc.nb_req += ms->alloc.nb_req;
+ as->alloc.nb_alloc += ms->alloc.nb_alloc;
+ as->alloc.nb_cycle += ms->alloc.nb_cycle;
+ as->alloc.max_cycle = RTE_MAX(as->alloc.max_cycle,
+ ms->alloc.max_cycle);
+ as->alloc.min_cycle = RTE_MIN(as->alloc.min_cycle,
+ ms->alloc.min_cycle);
+ }
+ if (ms->free.nb_call != 0) {
+ as->free.nb_call += ms->free.nb_call;
+ as->free.nb_free += ms->free.nb_free;
+ as->free.nb_cycle += ms->free.nb_cycle;
+ as->free.max_cycle = RTE_MAX(as->free.max_cycle,
+ ms->free.max_cycle);
+ as->free.min_cycle = RTE_MIN(as->free.min_cycle,
+ ms->free.min_cycle);
+ }
+ if (ms->grow.nb_call != 0) {
+ as->grow.nb_call += ms->grow.nb_call;
+ as->grow.nb_chunk += ms->grow.nb_chunk;
+ as->grow.nb_cycle += ms->grow.nb_cycle;
+ as->grow.max_cycle = RTE_MAX(as->grow.max_cycle,
+ ms->grow.max_cycle);
+ as->grow.min_cycle = RTE_MIN(as->grow.min_cycle,
+ ms->grow.min_cycle);
+ }
+ if (ms->shrink.nb_call != 0) {
+ as->shrink.nb_call += ms->shrink.nb_call;
+ as->shrink.nb_chunk += ms->shrink.nb_chunk;
+ as->shrink.nb_cycle += ms->shrink.nb_cycle;
+ as->shrink.max_cycle = RTE_MAX(as->shrink.max_cycle,
+ ms->shrink.max_cycle);
+ as->shrink.min_cycle = RTE_MIN(as->shrink.min_cycle,
+ ms->shrink.min_cycle);
+ }
+}
+
+static void
+memtank_stat_dump(FILE *f, uint32_t lc, const struct memtank_stat *ms)
+{
+ uint64_t t;
+ long double st;
+
+ st = (long double)rte_get_timer_hz() / US_PER_S;
+
+ if (lc == UINT32_MAX)
+ fprintf(f, "%s(AGGREGATE)={\n", __func__);
+ else
+ fprintf(f, "%s(lc=%u)={\n", __func__, lc);
+
+ fprintf(f, "\tnb_cycle=%" PRIu64 ",\n", ms->nb_cycle);
+ if (ms->alloc.nb_call != 0) {
+ fprintf(f, "\talloc={\n");
+ fprintf(f, "\t\tnb_call=%" PRIu64 ",\n", ms->alloc.nb_call);
+ fprintf(f, "\t\tnb_req=%" PRIu64 ",\n", ms->alloc.nb_req);
+ fprintf(f, "\t\tnb_alloc=%" PRIu64 ",\n", ms->alloc.nb_alloc);
+ fprintf(f, "\t\tnb_cycle=%" PRIu64 ",\n", ms->alloc.nb_cycle);
+
+ t = ms->alloc.nb_req - ms->alloc.nb_alloc;
+ fprintf(f, "\t\tfailed req: %"PRIu64 "(%.2Lf %%)\n",
+ t, (long double)t * 100 / ms->alloc.nb_req);
+ fprintf(f, "\t\tcycles/alloc: %.2Lf\n",
+ (long double)ms->alloc.nb_cycle / ms->alloc.nb_alloc);
+ fprintf(f, "\t\tobj/call(avg): %.2Lf\n",
+ (long double)ms->alloc.nb_alloc / ms->alloc.nb_call);
+
+ fprintf(f, "\t\tmax cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->alloc.max_cycle,
+ (long double)ms->alloc.max_cycle / st);
+ fprintf(f, "\t\tmin cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->alloc.min_cycle,
+ (long double)ms->alloc.min_cycle / st);
+
+ fprintf(f, "\t},\n");
+ }
+ if (ms->free.nb_call != 0) {
+ fprintf(f, "\tfree={\n");
+ fprintf(f, "\t\tnb_call=%" PRIu64 ",\n", ms->free.nb_call);
+ fprintf(f, "\t\tnb_free=%" PRIu64 ",\n", ms->free.nb_free);
+ fprintf(f, "\t\tnb_cycle=%" PRIu64 ",\n", ms->free.nb_cycle);
+
+ fprintf(f, "\t\tcycles/free: %.2Lf\n",
+ (long double)ms->free.nb_cycle / ms->free.nb_free);
+ fprintf(f, "\t\tobj/call(avg): %.2Lf\n",
+ (long double)ms->free.nb_free / ms->free.nb_call);
+
+ fprintf(f, "\t\tmax cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->free.max_cycle,
+ (long double)ms->free.max_cycle / st);
+ fprintf(f, "\t\tmin cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->free.min_cycle,
+ (long double)ms->free.min_cycle / st);
+
+ fprintf(f, "\t},\n");
+ }
+ if (ms->grow.nb_call != 0) {
+ fprintf(f, "\tgrow={\n");
+ fprintf(f, "\t\tnb_call=%" PRIu64 ",\n", ms->grow.nb_call);
+ fprintf(f, "\t\tnb_chunk=%" PRIu64 ",\n", ms->grow.nb_chunk);
+ fprintf(f, "\t\tnb_cycle=%" PRIu64 ",\n", ms->grow.nb_cycle);
+
+ fprintf(f, "\t\tcycles/chunk: %.2Lf\n",
+ (long double)ms->grow.nb_cycle / ms->grow.nb_chunk);
+ fprintf(f, "\t\tobj/call(avg): %.2Lf\n",
+ (long double)ms->grow.nb_chunk / ms->grow.nb_call);
+
+ fprintf(f, "\t\tmax cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->grow.max_cycle,
+ (long double)ms->grow.max_cycle / st);
+ fprintf(f, "\t\tmin cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->grow.min_cycle,
+ (long double)ms->grow.min_cycle / st);
+
+ fprintf(f, "\t},\n");
+ }
+ if (ms->shrink.nb_call != 0) {
+ fprintf(f, "\tshrink={\n");
+ fprintf(f, "\t\tnb_call=%" PRIu64 ",\n", ms->shrink.nb_call);
+ fprintf(f, "\t\tnb_chunk=%" PRIu64 ",\n", ms->shrink.nb_chunk);
+ fprintf(f, "\t\tnb_cycle=%" PRIu64 ",\n", ms->shrink.nb_cycle);
+
+ fprintf(f, "\t\tcycles/chunk: %.2Lf\n",
+ (long double)ms->shrink.nb_cycle / ms->shrink.nb_chunk);
+ fprintf(f, "\t\tobj/call(avg): %.2Lf\n",
+ (long double)ms->shrink.nb_chunk / ms->shrink.nb_call);
+
+ fprintf(f, "\t\tmax cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->shrink.max_cycle,
+ (long double)ms->shrink.max_cycle / st);
+ fprintf(f, "\t\tmin cycles/call=%" PRIu64 "(%.2Lf usec),\n",
+ ms->shrink.min_cycle,
+ (long double)ms->shrink.min_cycle / st);
+
+ fprintf(f, "\t},\n");
+ }
+ fprintf(f, "};\n");
+}
+
+static int32_t
+check_fill_objs(void *obj[], uint32_t sz, uint32_t num,
+ uint8_t check, uint8_t fill)
+{
+ uint32_t i;
+ uint8_t buf[sz];
+
+ static rte_spinlock_t dump_lock;
+
+ memset(buf, check, sz);
+
+ for (i = 0; i != num; i++) {
+ if (memcmp(buf, obj[i], sz) != 0) {
+ rte_spinlock_lock(&dump_lock);
+ printf("%s(%u, %u, %hu, %hu) failed at %u-th iter, "
+ "offending object: %p\n",
+ __func__, sz, num, check, fill, i, obj[i]);
+ rte_memdump(stdout, "expected", buf, sz);
+ rte_memdump(stdout, "result", obj[i], sz);
+ rte_spinlock_unlock(&dump_lock);
+ return -EINVAL;
+ }
+ memset(obj[i], fill, sz);
+ }
+ return 0;
+}
+
+static void
+destroy_worker_ring(struct worker_args *wa)
+{
+ free(wa->rng);
+ wa->rng = NULL;
+}
+
+static int
+create_worker_ring(struct worker_args *wa, uint32_t lc)
+{
+ int32_t rc;
+ size_t sz;
+ struct rte_ring *ring;
+
+ sz = rte_ring_get_memsize(wa->max_obj);
+ ring = aligned_alloc(alignof(typeof(*ring)), sz);
+ if (ring == NULL) {
+ printf("%s(%u): alloc(%zu) for FIFO with %u elems failed\n",
+ __func__, lc, sz, wa->max_obj);
+ return -ENOMEM;
+ }
+ rc = rte_ring_init(ring, "", wa->max_obj,
+ RING_F_SP_ENQ | RING_F_SC_DEQ);
+ if (rc != 0) {
+ printf("%s(%u): rte_ring_init(%p, %u) failed, error: %d(%s)\n",
+ __func__, lc, ring, wa->max_obj,
+ rc, strerror(-rc));
+ free(ring);
+ return rc;
+ }
+
+ wa->rng = ring;
+ return rc;
+}
+
+static int
+test_worker_cleanup(void *arg)
+{
+ void *obj[BULK_NUM];
+ int32_t rc;
+ uint32_t lc, n, num;
+ struct memtank_arg *ma;
+ struct rte_ring *ring;
+
+ ma = arg;
+ ring = ma->worker.rng;
+ lc = rte_lcore_id();
+
+ rc = 0;
+ for (n = rte_ring_count(ring); rc == 0 && n != 0; n -= num) {
+
+ num = rte_rand() % RTE_DIM(obj);
+ num = RTE_MIN(num, n);
+
+ if (num != 0) {
+ /* retrieve objects to free */
+ rte_ring_dequeue_bulk(ring, obj, num, NULL);
+
+ /* check and fill contents of freeing objects */
+ rc = check_fill_objs(obj, ma->worker.obj_size, num,
+ lc, 0);
+ if (rc == 0) {
+ rte_memtank_free(ma->mt, obj, num,
+ ma->worker.free_flags);
+ ma->stats.free.nb_free += num;
+ }
+ }
+ }
+
+ return rc;
+}
+
+static int
+test_memtank_worker(void *arg)
+{
+ int32_t rc = 0;
+ uint32_t lc, n, num, tfl, tfr;
+ uint64_t cl, tm0, tm1;
+ struct memtank_arg *ma;
+ struct rte_ring *ring;
+ void *obj[BULK_NUM];
+
+ ma = arg;
+ lc = rte_lcore_id();
+
+ /* calculate fill threshold */
+ tfl = (FREE_THRSH_MAX - global_cfg.wrk_fill_thrsh) *
+ ma->worker.max_obj / FREE_THRSH_MAX;
+
+ /* calculate free threshold */
+ tfr = ma->worker.max_obj * global_cfg.wrk_free_thrsh / FREE_THRSH_MAX;
+
+ ring = ma->worker.rng;
+
+ while (wrk_cmd != WRK_CMD_RUN) {
+ rte_smp_rmb();
+ rte_pause();
+ }
+
+ cl = rte_rdtsc_precise();
+
+ do {
+ num = rte_rand() % RTE_DIM(obj);
+ n = rte_ring_free_count(ring);
+ num = (n >= tfl) ? RTE_MIN(num, n) : 0;
+
+ /* perform alloc*/
+ if (num != 0) {
+ tm0 = rte_rdtsc_precise();
+ n = rte_memtank_alloc(ma->mt, obj, num,
+ ma->worker.alloc_flags);
+ tm1 = rte_rdtsc_precise();
+
+ /* check and fill contents of allocated objects */
+ rc = check_fill_objs(obj, ma->worker.obj_size, n,
+ 0, lc);
+ if (rc != 0)
+ break;
+
+ tm1 = tm1 - tm0;
+
+ /* collect alloc stat */
+ ma->stats.alloc.nb_call++;
+ ma->stats.alloc.nb_req += num;
+ ma->stats.alloc.nb_alloc += n;
+ ma->stats.alloc.nb_cycle += tm1;
+ ma->stats.alloc.max_cycle =
+ RTE_MAX(ma->stats.alloc.max_cycle, tm1);
+ ma->stats.alloc.min_cycle =
+ RTE_MIN(ma->stats.alloc.min_cycle, tm1);
+
+ /* store allocated objects */
+ rte_ring_enqueue_bulk(ring, obj, n, NULL);
+ }
+
+ /* get some objects to free */
+ num = rte_rand() % RTE_DIM(obj);
+ n = rte_ring_count(ring);
+ num = (n >= tfr) ? RTE_MIN(num, n) : 0;
+
+ /* perform free*/
+ if (num != 0) {
+
+ /* retrieve objects to free */
+ rte_ring_dequeue_bulk(ring, obj, num, NULL);
+
+ /* check and fill contents of freeing objects */
+ rc = check_fill_objs(obj, ma->worker.obj_size, num,
+ lc, 0);
+ if (rc != 0)
+ break;
+
+ tm0 = rte_rdtsc_precise();
+ rte_memtank_free(ma->mt, obj, num,
+ ma->worker.free_flags);
+ tm1 = rte_rdtsc_precise();
+
+ tm1 = tm1 - tm0;
+
+ /* collect free stat */
+ ma->stats.free.nb_call++;
+ ma->stats.free.nb_free += num;
+ ma->stats.free.nb_cycle += tm1;
+ ma->stats.free.max_cycle =
+ RTE_MAX(ma->stats.free.max_cycle, tm1);
+ ma->stats.free.min_cycle =
+ RTE_MIN(ma->stats.free.min_cycle, tm1);
+ }
+
+ rte_smp_mb();
+ } while (wrk_cmd == WRK_CMD_RUN);
+
+ ma->stats.nb_cycle = rte_rdtsc_precise() - cl;
+
+ return rc;
+}
+
+static int
+test_memtank_master(void *arg)
+{
+ struct memtank_arg *ma;
+ uint64_t cl, tm0, tm1, tm2;
+ uint32_t i, n;
+
+ ma = (struct memtank_arg *)arg;
+
+ for (cl = 0, i = 0; cl < ma->master.run_cycles;
+ cl += tm2 - tm0, i++) {
+
+ tm0 = rte_rdtsc_precise();
+
+ if (ma->master.flags & MASTER_FLAG_SHRINK) {
+
+ n = rte_memtank_shrink(ma->mt);
+ tm1 = rte_rdtsc_precise();
+ ma->stats.shrink.nb_call++;
+ ma->stats.shrink.nb_chunk += n;
+ tm1 = tm1 - tm0;
+
+ if (n != 0) {
+ ma->stats.shrink.nb_cycle += tm1;
+ ma->stats.shrink.max_cycle =
+ RTE_MAX(ma->stats.shrink.max_cycle,
+ tm1);
+ ma->stats.shrink.min_cycle =
+ RTE_MIN(ma->stats.shrink.min_cycle,
+ tm1);
+ }
+ }
+
+ if (ma->master.flags & MASTER_FLAG_GROW) {
+
+ tm1 = rte_rdtsc_precise();
+ n = rte_memtank_grow(ma->mt);
+ tm2 = rte_rdtsc_precise();
+ ma->stats.grow.nb_call++;
+ ma->stats.grow.nb_chunk += n;
+ tm2 = tm2 - tm1;
+
+ if (n != 0) {
+ ma->stats.grow.nb_cycle += tm2;
+ ma->stats.grow.max_cycle =
+ RTE_MAX(ma->stats.grow.max_cycle,
+ tm2);
+ ma->stats.grow.min_cycle =
+ RTE_MIN(ma->stats.grow.min_cycle,
+ tm2);
+ }
+ }
+
+ wrk_cmd = WRK_CMD_RUN;
+ rte_smp_mb();
+
+ rte_delay_us(ma->master.delay_us);
+ tm2 = rte_rdtsc_precise();
+ }
+
+ ma->stats.nb_cycle = cl;
+
+ rte_smp_mb();
+ wrk_cmd = WRK_CMD_STOP;
+
+ return 0;
+}
+
+static int
+fill_worker_args(struct worker_args *wa, uint32_t alloc_flags,
+ uint32_t free_flags, uint32_t lc)
+{
+ wa->max_obj = global_cfg.wrk_max_obj;
+ wa->obj_size = mtnk_prm.obj_size;
+ wa->alloc_flags = alloc_flags;
+ wa->free_flags = free_flags;
+
+ return create_worker_ring(wa, lc);
+}
+
+static void
+fill_master_args(struct master_args *ma, uint32_t flags)
+{
+ uint64_t tm;
+
+ tm = global_cfg.run_time * rte_get_timer_hz();
+
+ ma->run_cycles = tm;
+ ma->delay_us = US_PER_S / MS_PER_S;
+ ma->flags = flags;
+}
+
+static int
+test_memtank_cleanup(struct rte_memtank *mt, struct memstat *ms,
+ struct memtank_arg arg[], const char *tname)
+{
+ int32_t rc;
+ uint32_t lc;
+
+ printf("%s(%s)\n", __func__, tname);
+
+ RTE_LCORE_FOREACH_WORKER(lc)
+ rte_eal_remote_launch(test_worker_cleanup, &arg[lc], lc);
+
+ /* launch on master */
+ lc = rte_lcore_id();
+ arg[lc].master.run_cycles = CLEANUP_TIME * rte_get_timer_hz();
+ test_memtank_master(&arg[lc]);
+
+ rc = 0;
+ ms->nb_alloc_obj = 0;
+ RTE_LCORE_FOREACH_WORKER(lc) {
+ rc |= rte_eal_wait_lcore(lc);
+ ms->nb_alloc_obj += arg[lc].stats.alloc.nb_alloc -
+ arg[lc].stats.free.nb_free;
+ }
+
+ rte_memtank_dump(stdout, mt, RTE_MTANK_DUMP_STAT);
+
+ memstat_dump(stdout, ms);
+ rc = rte_memtank_sanity_check(mt, 0);
+
+ return rc;
+}
+
+/*
+ * alloc/free by worker threads.
+ * grow/shrink by master
+ */
+static int
+test_memtank_mt(const char *tname, uint32_t alloc_flags, uint32_t free_flags)
+{
+ int32_t rc;
+ uint32_t lc;
+ struct rte_memtank *mt;
+ struct rte_memtank_prm prm;
+ struct memstat ms;
+ struct memtank_stat wrk_stats;
+ struct memtank_arg arg[RTE_MAX_LCORE];
+
+ printf("%s(%s) start\n", __func__, tname);
+
+ memset(&prm, 0, sizeof(prm));
+ memset(&ms, 0, sizeof(ms));
+
+ prm = mtnk_prm;
+ prm.alloc = test_alloc1;
+ prm.free = test_free1;
+ prm.udata = &ms;
+
+ mt = rte_memtank_create(&prm);
+ if (mt == NULL) {
+ printf("%s(%s): memtank_create() failed\n", __func__, tname);
+ return -ENOMEM;
+ }
+
+ /* dump initial memory stats */
+ memstat_dump(stdout, &ms);
+
+ rc = 0;
+ memset(arg, 0, sizeof(arg));
+
+ /* prepare args on all workers */
+ RTE_LCORE_FOREACH_WORKER(lc) {
+ arg[lc].mt = mt;
+ rc = fill_worker_args(&arg[lc].worker, alloc_flags,
+ free_flags, lc);
+ if (rc != 0)
+ break;
+ memtank_stat_reset(&arg[lc].stats);
+ }
+
+ if (rc != 0) {
+ rte_memtank_destroy(mt);
+ return rc;
+ }
+
+ /* launch on all workers */
+ RTE_LCORE_FOREACH_WORKER(lc)
+ rte_eal_remote_launch(test_memtank_worker, &arg[lc], lc);
+
+ /* launch on master */
+ lc = rte_lcore_id();
+ arg[lc].mt = mt;
+ fill_master_args(&arg[lc].master,
+ MASTER_FLAG_GROW | MASTER_FLAG_SHRINK);
+ test_memtank_master(&arg[lc]);
+
+ /* wait for workers and collect stats. */
+
+ memtank_stat_reset(&wrk_stats);
+
+ rc = 0;
+ RTE_LCORE_FOREACH_WORKER(lc) {
+ rc |= rte_eal_wait_lcore(lc);
+ if (global_cfg.verbose != 0)
+ memtank_stat_dump(stdout, lc, &arg[lc].stats);
+ memtank_stat_aggr(&wrk_stats, &arg[lc].stats);
+ ms.nb_alloc_obj += arg[lc].stats.alloc.nb_alloc -
+ arg[lc].stats.free.nb_free;
+ }
+
+ memtank_stat_dump(stdout, UINT32_MAX, &wrk_stats);
+
+ lc = rte_lcore_id();
+ memtank_stat_dump(stdout, lc, &arg[lc].stats);
+ rte_memtank_dump(stdout, mt, RTE_MTANK_DUMP_STAT);
+
+ memstat_dump(stdout, &ms);
+ rc |= rte_memtank_sanity_check(mt, 0);
+
+ /* run cleanup on all worker cores */
+ if (rc == 0)
+ rc = test_memtank_cleanup(mt, &ms, arg, tname);
+
+ RTE_LCORE_FOREACH_WORKER(lc)
+ destroy_worker_ring(&arg[lc].worker);
+
+ rte_memtank_destroy(mt);
+ return rc;
+}
+
+/*
+ * alloc/free by worker threads.
+ * grow/shrink by master
+ */
+static int
+test_memtank_mt1(const char *tname)
+{
+ return test_memtank_mt(tname, 0, 0);
+}
+
+/*
+ * alloc/free with grow/shrink by worker threads.
+ * master does nothing
+ */
+static int
+test_memtank_mt2(const char *tname)
+{
+ const uint32_t alloc_flags = RTE_MTANK_ALLOC_CHUNK |
+ RTE_MTANK_ALLOC_GROW;
+ const uint32_t free_flags = RTE_MTANK_FREE_SHRINK;
+
+ return test_memtank_mt(tname, alloc_flags, free_flags);
+}
+
+static int
+parse_uint_val(const char *str, uint32_t *val, uint32_t min, uint32_t max)
+{
+ unsigned long v;
+ char *end;
+
+ errno = 0;
+ v = strtoul(str, &end, 0);
+ if (errno != 0 || end[0] != 0 || v < min || v > max)
+ return -EINVAL;
+
+ val[0] = v;
+ return 0;
+}
+
+static int
+parse_mem_str(const char *str)
+{
+ uint32_t i;
+
+ static const struct {
+ const char *name;
+ int32_t val;
+ } name2val[] = {
+ {
+ .name = "sys",
+ .val = MEM_FUNC_SYS,
+ },
+ {
+ .name = "rte",
+ .val = MEM_FUNC_RTE,
+ },
+ };
+
+ for (i = 0; i != RTE_DIM(name2val); i++) {
+ if (strcmp(str, name2val[i].name) == 0)
+ return name2val[i].val;
+ }
+ return -EINVAL;
+}
+
+/* update global values based on provided user input */
+static int
+update_global_cfg(void)
+{
+ mtnk_prm.max_obj = global_cfg.wrk_max_obj * rte_lcore_count();
+ return 0;
+}
+
+static int
+parse_opt(int argc, char * const argv[])
+{
+ int32_t opt, rc;
+ uint32_t v;
+
+ rc = 0;
+ optind = 0;
+ optarg = NULL;
+
+ while (rc >= 0 && (opt = getopt(argc, argv, "a:F:f:M:m:s:t:w:v")) != EOF) {
+ switch (opt) {
+ case 'a':
+ rc = parse_mem_str(optarg);
+ if (rc >= 0)
+ global_cfg.mem_func = rc;
+ break;
+ case 'F':
+ rc = parse_uint_val(optarg, &v, FREE_THRSH_MIN,
+ FREE_THRSH_MAX);
+ if (rc == 0)
+ global_cfg.wrk_fill_thrsh = v;
+ break;
+ case 'f':
+ rc = parse_uint_val(optarg, &v, FREE_THRSH_MIN,
+ FREE_THRSH_MAX);
+ if (rc == 0)
+ global_cfg.wrk_free_thrsh = v;
+ break;
+ case 'M':
+ rc = parse_uint_val(optarg, &v, 0, UINT32_MAX);
+ if (rc == 0)
+ mtnk_prm.max_free = v;
+ break;
+ case 'm':
+ rc = parse_uint_val(optarg, &v, 0, UINT32_MAX);
+ if (rc == 0)
+ mtnk_prm.min_free = v;
+ break;
+ case 's':
+ rc = parse_uint_val(optarg, &v, OBJ_SZ_MIN,
+ OBJ_SZ_MAX);
+ if (rc == 0)
+ mtnk_prm.obj_size = v;
+ break;
+ case 't':
+ rc = parse_uint_val(optarg, &v, 0, UINT32_MAX);
+ if (rc == 0)
+ global_cfg.run_time = v;
+ break;
+ case 'w':
+ rc = parse_uint_val(optarg, &v, 0, UINT32_MAX);
+ if (rc == 0)
+ global_cfg.wrk_max_obj = v;
+ break;
+ case 'v':
+ global_cfg.verbose = 1;
+ break;
+ default:
+ rc = -EINVAL;
+ }
+ }
+
+ if (rc < 0)
+ printf("%s: invalid value: \"%s\" for option: \'%c\'\n",
+ __func__, optarg, opt);
+
+ return rc;
+}
+
+static int
+run_memtank_stress(int argc, char *argv[])
+{
+ int32_t rc;
+ uint32_t i, k;
+
+ const struct {
+ const char *name;
+ int (*func)(const char *tname);
+ } tests[] = {
+ {
+ .name = "MT1-WRK_ALLOC_FREE-MST_GROW_SHRINK",
+ .func = test_memtank_mt1,
+ },
+ {
+ .name = "MT2-WRK_ALLOC+GROW_FREE+SHRINK",
+ .func = test_memtank_mt2,
+ },
+ };
+
+ rc = parse_opt(argc, argv);
+ if (rc < 0) {
+ printf("%s: parse_opt failed with error code: %d\n",
+ __func__, rc);
+ return rc;
+ }
+
+ /* update global values based on provided user input */
+ rc = update_global_cfg();
+ if (rc < 0)
+ return rc;
+
+ global_cfg_dump(stdout);
+
+ for (i = 0, k = 0; i != RTE_DIM(tests); i++) {
+
+ printf("TEST %s START\n", tests[i].name);
+
+ rc = tests[i].func(tests[i].name);
+ k += (rc == 0);
+
+ if (rc != 0)
+ printf("TEST %s FAILED\n", tests[i].name);
+ else
+ printf("TEST %s OK\n", tests[i].name);
+ }
+
+ printf("Number of tests:\t%u\nSuccess:\t%u\nFailed:\t%u\n",
+ i, k, i - k);
+ return (k != i);
+}
+
+static int
+test_memtank_stress(void)
+{
+ int32_t rc;
+ uint32_t i;
+ const char *val;
+ char *sp, *str, *argv[16];
+ char buf[0x100];
+
+ static const char *dlm = " \t\n";
+ static const char *evar = "DPDK_MEMTANK_STRESS_TEST_PARAM";
+ static const char *eval = "";
+
+ val = getenv(evar);
+ if (val == NULL)
+ val = eval;
+
+ snprintf(buf, sizeof(buf), "%s", val);
+
+ for (i = 0, str = buf; i != RTE_DIM(argv); str = NULL, i++) {
+ argv[i] = strtok_r(str, dlm, &sp);
+ if (argv[i] == NULL)
+ break;
+ }
+
+ if (i == RTE_DIM(argv)) {
+ printf("invalid value: \"%s\" for $(%s)\n", val, evar);
+ return -EINVAL;
+ }
+
+ rc = run_memtank_stress(i, argv);
+ return rc;
+}
+
+REGISTER_STRESS_TEST(memtank_stress_autotest, test_memtank_stress);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 9296042119..68203b9d7b 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -71,6 +71,7 @@ The public API headers are grouped by topics:
[mempool](@ref rte_mempool.h),
[malloc](@ref rte_malloc.h),
+ [memtank](@ref rte_memtank.h),
[memcpy](@ref rte_memcpy.h)
- **timers**:
[cycles](@ref rte_cycles.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index bedd944681..042482fe16 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -59,6 +59,7 @@ INPUT = @TOPDIR@/doc/api/doxy-api-index.md \
@TOPDIR@/lib/mbuf \
@TOPDIR@/lib/member \
@TOPDIR@/lib/mempool \
+ @TOPDIR@/lib/memtank \
@TOPDIR@/lib/meter \
@TOPDIR@/lib/metrics \
@TOPDIR@/lib/mldev \
diff --git a/doc/guides/prog_guide/img/memtank_internal.svg b/doc/guides/prog_guide/img/memtank_internal.svg
new file mode 100644
index 0000000000..0f13242ed5
--- /dev/null
+++ b/doc/guides/prog_guide/img/memtank_internal.svg
@@ -0,0 +1 @@
+<svg width="1280" height="720" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" overflow="hidden"><defs><clipPath id="clip0"><rect x="0" y="0" width="1280" height="720"/></clipPath></defs><g clip-path="url(#clip0)"><rect x="0" y="0" width="1280" height="720" fill="#FFFFFF"/><text font-family="Calibri Light,Calibri Light_MSFontService,sans-serif" font-weight="300" font-size="59" transform="translate(110.933 111)">memtank<tspan font-size="59" x="236.9" y="0">(internal structure)</tspan></text><rect x="392.5" y="170.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="392.5" y="186.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="392.5" y="202.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="392.5" y="218.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="392.5" y="233.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="392.5" y="249.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="392.5" y="265.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="392.5" y="281.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><text font-family="Calibri,Calibri_MSFontService,sans-serif" font-weight="400" font-size="16" transform="translate(205.267 215)">min_free<tspan font-size="16" x="62.62" y="0">(grow thresh) </tspan><tspan font-family="Wingdings,Wingdings_MSFontService,sans-serif" font-size="16" x="154" y="0"></tspan><tspan font-size="16" x="-3.66667" y="81">max_free</tspan><tspan font-size="16" x="61.4733" y="81">(shrink thresh) </tspan><tspan 
font-family="Wingdings,Wingdings_MSFontService,sans-serif" font-size="16" x="159.593" y="81"></tspan><tspan font-style="italic" font-size="24" x="370.867" y="-3">memtank_alloc</tspan><tspan font-style="italic" font-size="24" x="520.973" y="-3">(</tspan><tspan font-style="italic" font-size="24" x="528.307" y="-3">mtnk</tspan><tspan font-style="italic" font-size="24" x="578.24" y="-3">, </tspan><tspan font-style="italic" font-size="24" x="589.74" y="-3">obj</tspan><tspan font-style="italic" font-size="24" x="620.073" y="-3">[], </tspan><tspan font-style="italic" font-size="24" x="646.24" y="-3">num</tspan><tspan font-style="italic" font-size="24" x="689.907" y="-3">, flags) </tspan></text><path d="M488.5 207.167 561.833 207.167 561.833 207.833 488.5 207.833ZM560.5 203.5 568.5 207.5 560.5 211.5Z" fill="#203864"/><text font-family="Calibri,Calibri_MSFontService,sans-serif" font-style="italic" font-weight="400" font-size="24" transform="translate(577.2 274)">memtank_free(<tspan font-size="24" x="150.273" y="0">mtnk</tspan><tspan font-size="24" x="200.207" y="0">, </tspan>obj<tspan font-size="24" x="242.04" y="0">[], </tspan><tspan font-size="24" x="268.207" y="0">num</tspan><tspan font-size="24" x="311.873" y="0">, flags) </tspan></text><path d="M496.167 269.167 569.5 269.167 569.5 269.833 496.167 269.833ZM497.5 273.5 489.5 269.5 497.5 265.5Z" fill="#5B9BD5"/><rect x="90" y="353" width="162" height="29" fill="#E7E6E6"/><text font-family="Calibri,Calibri_MSFontService,sans-serif" font-weight="400" font-size="16" transform="translate(99.9518 373)">mchunks<tspan font-size="16" x="58.1867" y="0">[USED]</tspan></text><path d="M264.5 365.5 876.43 365.5 876.43 403.05 877.055 403.05" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M847.5 383.5 910.397 383.5" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M865.5 402.5 888.672 402.5" stroke="#000000" 
stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M858.5 393.5 898.224 393.5" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><rect x="325.5" y="385.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="325.5" y="401.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="325.5" y="417.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="325.5" y="433.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="505.5" y="383.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="399.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="415.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="431.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="686.5" y="385.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="686.5" y="401.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="686.5" y="417.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><rect x="686.5" y="433.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#FFFFFF"/><path d="M368.5 365.5 368.5 385.369" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M548.5 365.5 548.5 384.022" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" 
fill-rule="evenodd"/><path d="M728.5 365.5 728.917 385.369" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><rect x="90" y="478" width="162" height="29" fill="#E7E6E6"/><text font-family="Calibri,Calibri_MSFontService,sans-serif" font-weight="400" font-size="16" transform="translate(99.5345 498)">mchunks<tspan font-size="16" x="58.1867" y="0">[FULL]</tspan></text><path d="M264.5 490.5 729.18 490.5 729.18 526.369 729.059 526.369" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M699.5 507.5 762.397 507.5" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M718.5 526.5 741.672 526.5" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M710.5 517.5 750.224 517.5" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><rect x="325.5" y="510.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="325.5" y="526.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="325.5" y="542.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="325.5" y="558.5" width="85" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="508.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="524.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="540.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><rect x="505.5" y="556.5" width="86" height="16" stroke="#2F528F" stroke-width="1.33333" stroke-miterlimit="8" fill="#D6DCE5"/><path 
d="M367.5 490.5 367.5 510.369" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><path d="M548.5 490.5 548.5 509.022" stroke="#000000" stroke-width="0.666667" stroke-miterlimit="8" fill="none" fill-rule="evenodd"/><text font-family="Calibri,Calibri_MSFontService,sans-serif" font-style="italic" font-weight="400" font-size="24" transform="translate(178.705 644)">memtank_grow<tspan font-size="24" x="154.32" y="0">(</tspan><tspan font-size="24" x="161.653" y="0">mtnk</tspan><tspan font-size="24" x="211.587" y="0">) </tspan><tspan font-size="24" x="303.585" y="0">memtank_shrink</tspan>(<tspan font-size="24" x="473.858" y="0">mtnk</tspan><tspan font-size="24" x="523.792" y="0">) </tspan></text><path d="M0.333333-8.08095e-07 0.333422 36.6397-0.333245 36.6397-0.333333 8.08095e-07ZM4.00009 35.3063 0.000104987 43.3064-3.99991 35.3064Z" fill="#203864" transform="matrix(1 0 0 -1 367.5 620.806)"/><path d="M548.833 577.5 548.833 615.486 548.167 615.486 548.167 577.5ZM552.5 614.153 548.5 622.153 544.5 614.153Z" fill="#5B9BD5"/><path d="M416.833 310.167 416.833 360.426 416.167 360.426 416.167 310.167ZM412.5 311.5 416.5 303.5 420.5 311.5Z" fill="#203864"/><path d="M0.333346 6.66666 0.333438 56.926-0.333228 56.926-0.333321 6.66666ZM-3.99999 8.00001 0 0 4.00001 7.99999Z" fill="#5B9BD5" transform="matrix(1 0 0 -1 450.5 360.426)"/><text fill="#203864" font-family="Calibri,Calibri_MSFontService,sans-serif" font-style="italic" font-weight="400" font-size="16" transform="translate(313.161 342)">ALLOC_CHUNK <tspan font-size="16" x="-51.8536" y="264">ALLOC_GROW </tspan><tspan fill="#4472C4" font-size="16" x="244.219" y="260">FREE_SHRINK </tspan></text><rect x="90" y="161" width="162" height="29" fill="#E7E6E6"/><text font-family="Calibri,Calibri_MSFontService,sans-serif" font-weight="400" font-size="16" transform="translate(99.5345 181)">LIFO queue </text></g></svg>
\ No newline at end of file
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index e6f24945b0..834266f8de 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -26,6 +26,7 @@ Memory Management
lcore_var
mempool_lib
+ memtank_lib
mbuf_lib
multi_proc_support
diff --git a/doc/guides/prog_guide/memtank_lib.rst b/doc/guides/prog_guide/memtank_lib.rst
new file mode 100644
index 0000000000..51fe822572
--- /dev/null
+++ b/doc/guides/prog_guide/memtank_lib.rst
@@ -0,0 +1,231 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+ Copyright(c) 2026 Huawei Technologies Co., Ltd
+
+Memtank Library
+===============
+
+The memtank library is a fixed-size object allocator for DPDK applications.
+Like mempool, it allocates and frees objects of a fixed size in bulk in a
+lightweight manner.
+In addition, it can grow and shrink dynamically, and it provides extra
+API for higher flexibility:
+
+* manual grow()/shrink() functions
+
+* different alloc/free policies
+ (can be specified by user via flags parameter):
+
+ * as lightweight as possible, but can fail
+
+ * more robust, but heavyweight - might call the user-provided backing
+ memory allocator.
+
+* user-provided callbacks for actual system-wide memory reservations.
+ The user is free to choose whatever is most suitable for their scenario,
+ e.g. malloc/rte_malloc/mmap/some custom allocator.
+
+* user defined constructor callback for newly allocated objects.
+
+* built-in per-object runtime verification (boundary violation, double free,
+ etc.), controlled by flags at memtank creation time.
+
+
+Memtank usage scenarios
+-----------------------
+
+Use memtank when:
+
+* Relatively small objects of the same size are allocated and freed on the
+ data path with low, predictable latency requirements.
+
+* Pre-allocating memory for the maximum possible number of objects is not
+ feasible.
+
+
+Internals
+---------
+
+Internally each memtank consists of:
+
+* Fixed size LIFO queue that serves as a pool of free objects for fast
+ allocation/deallocation. Its internals and behavior are very similar
+ to the current mempool LIFO driver.
+
+* Two lists (USED, FULL) of memchunks. A memchunk is an analog of a slab:
+ for performance reasons memtank tries to allocate memory in relatively
+ big chunks (memchunks) and then splits each memchunk into dozens (or
+ hundreds) of objects. Objects from memchunks are used to populate the pool
+ of free objects (see above).
+
+* Each memchunk consists of some metadata plus an array of free objects (LIFO)
+ that belong to that chunk. As soon as the number of free objects in the chunk
+ becomes equal to the total number of objects, it is considered ``FULL``
+ and can be shrunk, i.e. released back to the memory subsystem.
+
+* Each object within memtank is a properly aligned and initialized data buffer
+ that will be provided to the user followed by the metadata that is used
+ to determine which memchunk it belongs to plus some extra fields used for
+ statistics collection and runtime verification. Total size of the metadata
+ for each object: 32B.
+
+There are two user defined thresholds that control memtank grow/shrink
+behavior:
+
+* ``min_free`` - grow threshold. This value controls two things: when it is
+ time to request more memory from the underlying memory subsystem and how
+ much memory has to be requested/released in one go.
+
+* ``max_free`` - shrink threshold. This value determines when it is OK for
+ memtank to start releasing unused memory back to the underlying memory
+ subsystem.
+
+The user can also define a grow limit: ``max_obj`` - the maximum possible number
+of objects that a given memtank can contain. By setting all these three
+parameters to the same value, memtank behaves like mempool with LIFO driver.
+
+.. _figure_memtank-internals:
+
+.. figure:: img/memtank_internal.svg
+
+
+Brief API description
+---------------------
+
+* ``rte_memtank_create()``/``rte_memtank_destroy()`` are responsible for
+ creating/destroying the memtank.
+
+* ``rte_memtank_alloc()``/``rte_memtank_free()`` - perform object
+ allocation/deallocation from/to the memtank. Note that both of them
+ operate in bulk and accept an extra flags parameter that lets the user
+ specify the exact behavior.
+
+* ``rte_memtank_chunk_alloc()``/``rte_memtank_chunk_free()`` also perform
+ allocation/deallocation from/to the memtank, but they bypass the pool of
+ free objects and allocate/free objects straight from/to the memchunks.
+
+* ``rte_memtank_grow()``/``rte_memtank_shrink()`` are intended to
+ explicitly reserve/release memory from/to the underlying memory subsystem
+ and add/remove objects to/from the tank. Possible usage scenarios: a
+ house-keeping task, or even a data-path thread during idle periods.
+
+* ``rte_memtank_dump()``/``rte_memtank_sanity_check()`` - miscellaneous
+ API for statistics/internal dumping and sanity checking.
+
+
+All public API functions except ``rte_memtank_destroy()`` are MT safe and
+can be called concurrently from different threads.
+
+Object allocation
+~~~~~~~~~~~~~~~~~
+
+By default ``rte_memtank_alloc()`` first tries to get objects from the free
+objects pool. If there are not enough free objects in the pool, the behavior
+depends on the flag values the user provided:
+
+* none - alloc() simply returns the objects it obtained from the pool.
+
+* ``RTE_MTANK_ALLOC_CHUNK`` - alloc() will try to get the remaining objects
+ from already allocated memchunks.
+
+* If the already allocated memchunks also don't contain enough
+ free objects and ``RTE_MTANK_ALLOC_GROW`` is specified, then it will try
+ to perform a ``grow`` operation by allocating extra memory from the
+ underlying memory subsystem and creating new memchunks to satisfy the user
+ request.
+
+In the last two cases, it will try to refill the free pool up to the
+``min_free`` threshold value.
+
+Object de-allocation
+~~~~~~~~~~~~~~~~~~~~
+
+In reverse, ``rte_memtank_free()`` first tries to put objects back into
+the free pool. If there is not enough room, it puts the remaining
+objects into the memchunks they belong to. After that, if
+``RTE_MTANK_FREE_SHRINK`` is specified, it starts a ``shrink`` operation
+to return unused memchunks back to the memory subsystem.
+
+
+Grow/Shrink
+~~~~~~~~~~~
+
+Apart from invoking ``grow``/``shrink`` implicitly (via alloc/free flags)
+there is an API for explicit invocation:
+
+* ``rte_memtank_grow(struct rte_memtank *)`` - if the number of objects in
+ the free pool drops below the ``min_free`` threshold, it requests the next
+ memory region from the underlying memory subsystem, creates new memchunks
+ from it and populates the pool.
+
+* ``rte_memtank_shrink(struct rte_memtank *)`` - if the total number of free
+ objects in the tank exceeds the ``max_free`` threshold, it de-allocates
+ unused memchunks back to the underlying memory subsystem.
+
+
+Create/Destroy
+~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct user_defined_type;
+
+ /*
+ * User defined callbacks to reserve/release memory from/to backing
+ * memory subsystem.
+ */
+
+ static void *
+ user_defined_alloc(size_t sz, void *udata)
+ {
+ RTE_SET_USED(udata);
+ return rte_malloc(NULL, sz, 0);
+ }
+
+ static void
+ user_defined_free(void *buf, void *udata)
+ {
+ RTE_SET_USED(udata);
+ rte_free(buf);
+ }
+
+ /*
+ * When the user needs a new memtank, they fill the memtank param structure
+ * and call rte_memtank_create():
+ */
+ static struct rte_memtank_prm prm = {
+ /* min number of free objs in the pool (grow threshold). */
+ .min_free = 1024,
+ /* max number of free objs (shrink threshold) */
+ .max_free = 1024 * 1024,
+ .obj_size = sizeof(struct user_defined_type),
+ .obj_align = alignof(struct user_defined_type),
+ .nb_obj_chunk = 2 * 1024,
+ /* enable obj runtime verify and stats collection */
+ .flags = RTE_MTANK_OBJ_DBG,
+ /* user defined callbacks to reserve/release actual memory */
+ .alloc = user_defined_alloc,
+ .free = user_defined_free,
+ };
+
+ struct rte_memtank *mt = rte_memtank_create(&prm);
+
+ ....
+
+ /* no more objects from the memtank are in use */
+ rte_memtank_destroy(mt);
+
+
+Known limitations (subject for further improvements):
+-----------------------------------------------------
+
+* scalability:
+ with 8+ lcores a conventional mempool (with FIFO) starts to outperform
+ memtank (which by default uses LIFO inside).
+
+* mempool_cache integration is not part of the library and right now
+ has to be implemented manually by the user on top of the memtank API.
+
+* As the pool of free objects might contain objects from different memchunks,
+ it can prevent some memchunks from being deallocated and returned to
+ the memory subsystem.
+
diff --git a/lib/memtank/memtank.c b/lib/memtank/memtank.c
new file mode 100644
index 0000000000..2c6ec948a5
--- /dev/null
+++ b/lib/memtank/memtank.c
@@ -0,0 +1,630 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ * Copyright(c) 2025 Huawei Technologies Co., Ltd
+ */
+
+#include "memtank.h"
+#include <rte_bitops.h>
+#include <rte_errno.h>
+#include <eal_export.h>
+
+#define MEMTANK_OBJ_BULK 0x100
+#define MEMTANK_CHUNK_BULK 0x100
+
+#define ALIGN_MUL_CEIL(v, mul) \
+ ((typeof(v))(((uint64_t)(v) + (mul) - 1) / (mul)))
+
+
+static inline size_t
+memtank_meta_size(uint32_t nb_free)
+{
+ size_t sz;
+ static const struct rte_memtank *mt;
+
+ sz = sizeof(*mt) + nb_free * sizeof(mt->mtf.free[0]);
+ sz = RTE_ALIGN_CEIL(sz, alignof(typeof(*mt)));
+ return sz;
+}
+
+static inline size_t
+memchunk_meta_size(uint32_t nb_obj)
+{
+ size_t sz;
+ static const struct memchunk *ch;
+
+ sz = sizeof(*ch) + nb_obj * sizeof(ch->free[0]);
+ sz = RTE_ALIGN_CEIL(sz, alignof(typeof(*ch)));
+ return sz;
+}
+
+static inline size_t
+memobj_size(uint32_t obj_size, uint32_t obj_align)
+{
+ size_t sz;
+ static const struct memobj *obj;
+
+ sz = sizeof(*obj) + obj_size;
+ sz = RTE_ALIGN_CEIL(sz, obj_align);
+ return sz;
+}
+
+static inline size_t
+memchunk_size(uint32_t nb_obj, uint32_t obj_size, uint32_t obj_align)
+{
+ size_t algn, sz;
+ static const struct memchunk *ch;
+
+ algn = RTE_MAX(alignof(typeof(*ch)), obj_align);
+ sz = memchunk_meta_size(nb_obj);
+ sz += nb_obj * memobj_size(obj_size, obj_align);
+ sz = RTE_ALIGN_CEIL(sz + algn - 1, algn);
+ return sz;
+}
+
+static void
+init_chunk(struct rte_memtank *mt, struct memchunk *ch)
+{
+ uint32_t i, n, sz;
+ uintptr_t p;
+ struct memobj *obj;
+
+ const struct memobj cobj = {
+ .red_zone1 = RED_ZONE_V1,
+ .chunk = ch,
+ .red_zone2 = RED_ZONE_V2,
+ };
+
+ n = mt->prm.nb_obj_chunk;
+ sz = mt->obj_size;
+
+ /* get start of memobj array */
+ p = (uintptr_t)ch + memchunk_meta_size(n);
+ p = RTE_ALIGN_CEIL(p, mt->prm.obj_align);
+
+ for (i = 0; i != n; i++) {
+ obj = obj_pub_full(p, sz);
+ obj[0] = cobj;
+ ch->free[i] = (void *)p;
+ p += sz;
+ }
+
+ ch->nb_total = n;
+ ch->nb_free = n;
+
+ if (mt->prm.init != NULL)
+ mt->prm.init(ch->free, n, mt->prm.udata);
+}
+
+static __rte_always_inline void
+copy_objs(void *dst[], void * const src[], uint32_t num)
+{
+ memcpy(dst, src, num * sizeof(dst[0]));
+}
+
+static inline uint32_t
+get_free(struct memtank_free *t, void *obj[], uint32_t num)
+{
+ uint32_t len, n;
+
+ rte_spinlock_lock(&t->lock);
+
+ len = t->nb_free;
+ n = RTE_MIN(num, len);
+ len -= n;
+ copy_objs(obj, t->free + len, n);
+ t->nb_free = len;
+
+ rte_spinlock_unlock(&t->lock);
+ return n;
+}
+
+static inline uint32_t
+put_free(struct memtank_free *t, void * const obj[], uint32_t num)
+{
+ uint32_t len, n;
+
+ rte_spinlock_lock(&t->lock);
+
+ len = t->nb_free;
+ n = t->max_free - len;
+ n = RTE_MIN(num, n);
+ copy_objs(t->free + len, obj, n);
+ t->nb_free = len + n;
+
+ rte_spinlock_unlock(&t->lock);
+ return n;
+}
+
+static inline void
+fill_free(struct rte_memtank *mt, uint32_t num, uint32_t flags)
+{
+ uint32_t i, l, k, n;
+ void *free[MEMTANK_OBJ_BULK];
+
+ for (i = 0; i != num; i += n) {
+ /* how many objects we need to add into @free */
+ n = RTE_MIN(num - i, RTE_DIM(free));
+ k = rte_memtank_chunk_alloc(mt, free, n, flags);
+ l = put_free(&mt->mtf, free, k);
+
+ /* @free is full, return allocated objects back to chunks */
+ if (l != k)
+ rte_memtank_chunk_free(mt, free + l, k - l, 0);
+
+ /* either free is full, or chunks are empty */
+ if (l != n)
+ break;
+ }
+}
+
+static void
+put_chunk(struct rte_memtank *mt, struct memchunk *ch, void * const obj[],
+ uint32_t num)
+{
+ uint32_t k, n;
+ struct mchunk_list *ls;
+
+ /* chunk should be in the *used* list */
+ k = MC_USED;
+ ls = &mt->chl[k];
+ rte_spinlock_lock(&ls->lock);
+
+ n = ch->nb_free;
+ RTE_ASSERT(n + num <= ch->nb_total);
+
+ copy_objs(ch->free + n, obj, num);
+ ch->nb_free = n + num;
+
+ /* chunk is full now */
+ if (ch->nb_free == ch->nb_total) {
+ TAILQ_REMOVE(&ls->chunk, ch, link);
+ k = MC_FULL;
+ /* chunk is not empty anymore, move it to the head */
+ } else if (n == 0) {
+ TAILQ_REMOVE(&ls->chunk, ch, link);
+ TAILQ_INSERT_HEAD(&ls->chunk, ch, link);
+ }
+
+ rte_spinlock_unlock(&ls->lock);
+
+ /* insert this chunk into the *full* list */
+ if (k == MC_FULL) {
+ ls = &mt->chl[k];
+ rte_spinlock_lock(&ls->lock);
+ TAILQ_INSERT_HEAD(&ls->chunk, ch, link);
+ rte_spinlock_unlock(&ls->lock);
+ }
+}
+
+static inline uint32_t
+_shrink_chunk(struct rte_memtank *mt, struct memchunk *ch[MEMTANK_CHUNK_BULK],
+ uint32_t num)
+{
+ uint32_t i, k;
+ struct mchunk_list *ls;
+
+ ls = &mt->chl[MC_FULL];
+ rte_spinlock_lock(&ls->lock);
+
+ for (k = 0; k != num; k++) {
+ ch[k] = TAILQ_LAST(&ls->chunk, mchunk_head);
+ if (ch[k] == NULL)
+ break;
+ TAILQ_REMOVE(&ls->chunk, ch[k], link);
+ }
+
+ rte_spinlock_unlock(&ls->lock);
+
+ rte_atomic_fetch_sub_explicit(&mt->nb_chunks, k,
+ rte_memory_order_acq_rel);
+
+ for (i = 0; i != k; i++)
+ mt->prm.free(ch[i]->raw, mt->prm.udata);
+
+ return k;
+}
+
+
+static uint32_t
+shrink_chunk(struct rte_memtank *mt, uint32_t num)
+{
+ uint32_t i, k, n;
+ struct memchunk *ch[MEMTANK_CHUNK_BULK];
+
+ k = 0;
+ n = 0;
+ for (i = 0; i != num && n == k; i += k) {
+ n = RTE_MIN(num - i, RTE_DIM(ch));
+ k = _shrink_chunk(mt, ch, n);
+ }
+
+ return i;
+}
+
+static struct memchunk *
+alloc_chunk(struct rte_memtank *mt)
+{
+ void *p;
+ struct memchunk *ch;
+
+ p = mt->prm.alloc(mt->chunk_size, mt->prm.udata);
+ if (p == NULL)
+ return NULL;
+ ch = RTE_PTR_ALIGN_CEIL(p, alignof(typeof(*ch)));
+ ch->raw = p;
+ return ch;
+}
+
+/* Determine by how many chunks we can actually grow */
+static inline uint32_t
+grow_num(struct rte_memtank *mt, uint32_t num)
+{
+ uint32_t k, n, max;
+
+ max = mt->max_chunk;
+ n = num + rte_atomic_fetch_add_explicit(&mt->nb_chunks, num,
+ rte_memory_order_acq_rel);
+
+ if (n <= max)
+ return num;
+
+ k = n - max;
+ return (k >= num) ? 0 : num - k;
+}
+
+static uint32_t
+grow_chunk(struct rte_memtank *mt, uint32_t num)
+{
+ uint32_t k, n;
+ struct mchunk_list *fls;
+ struct mchunk_head ls;
+ struct memchunk *ch;
+
+ /* check can we grow further */
+ k = grow_num(mt, num);
+
+ TAILQ_INIT(&ls);
+
+ for (n = 0; n != k; n++) {
+ ch = alloc_chunk(mt);
+ if (ch == NULL)
+ break;
+ init_chunk(mt, ch);
+ TAILQ_INSERT_HEAD(&ls, ch, link);
+ }
+
+ if (n != 0) {
+ fls = &mt->chl[MC_FULL];
+ rte_spinlock_lock(&fls->lock);
+ TAILQ_CONCAT(&fls->chunk, &ls, link);
+ rte_spinlock_unlock(&fls->lock);
+ }
+
+ if (n != num)
+ rte_atomic_fetch_sub_explicit(&mt->nb_chunks, num - n,
+ rte_memory_order_acq_rel);
+
+ return n;
+}
+
+static void
+obj_dbg_alloc(struct rte_memtank *mt, void * const obj[], uint32_t nb_obj)
+{
+ uint32_t i, sz;
+ struct memobj *po;
+
+ sz = mt->obj_size;
+ for (i = 0; i != nb_obj; i++) {
+ po = obj_pub_full((uintptr_t)obj[i], sz);
+ RTE_VERIFY(memobj_verify(po, 0) == 0);
+ po->dbg.nb_alloc++;
+ }
+}
+
+static void
+obj_dbg_free(struct rte_memtank *mt, void * const obj[], uint32_t nb_obj)
+{
+ uint32_t i, sz;
+ struct memobj *po;
+
+ sz = mt->obj_size;
+ for (i = 0; i != nb_obj; i++) {
+ po = obj_pub_full((uintptr_t)obj[i], sz);
+ RTE_VERIFY(memobj_verify(po, 1) == 0);
+ po->dbg.nb_free++;
+ }
+}
+
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_chunk_free, 26.11)
+void
+rte_memtank_chunk_free(struct rte_memtank *mt, void * const obj[],
+ uint32_t nb_obj, uint32_t flags)
+{
+ size_t csz;
+ uint32_t i, j, k, osz;
+ struct memobj *mo;
+ struct memchunk *ch;
+
+ csz = mt->chunk_size;
+ osz = mt->obj_size;
+
+ if (mt->flags & RTE_MTANK_OBJ_DBG)
+ obj_dbg_free(mt, obj, nb_obj);
+
+ k = 0;
+ for (i = 0; i != nb_obj; i = j) {
+
+ mo = obj_pub_full((uintptr_t)obj[i], osz);
+ ch = mo->chunk;
+
+ /* find number of consecutive objs from the same chunk */
+ for (j = i + 1; j != nb_obj; j++) {
+ if (obj_check_chunk((uintptr_t)obj[j], osz,
+ (uintptr_t)ch, csz) != 0)
+ break;
+ RTE_ASSERT(ch ==
+ obj_pub_full((uintptr_t)obj[j], osz)->chunk);
+ }
+
+ put_chunk(mt, ch, obj + i, j - i);
+ k++;
+ }
+
+ if (flags & RTE_MTANK_FREE_SHRINK)
+ shrink_chunk(mt, k);
+}
+
+static uint32_t
+get_chunk(struct mchunk_list *ls, struct mchunk_head *els,
+ struct mchunk_head *uls, void *obj[], uint32_t nb_obj)
+{
+ uint32_t l, k, n;
+ struct memchunk *ch, *nch;
+
+ rte_spinlock_lock(&ls->lock);
+
+ n = 0;
+ for (ch = TAILQ_FIRST(&ls->chunk);
+ n != nb_obj && ch != NULL && ch->nb_free != 0;
+ ch = nch, n += k) {
+
+ k = RTE_MIN(nb_obj - n, ch->nb_free);
+ l = ch->nb_free - k;
+ copy_objs(obj + n, ch->free + l, k);
+ ch->nb_free = l;
+
+ nch = TAILQ_NEXT(ch, link);
+
+ /* chunk is empty now */
+ if (l == 0) {
+ TAILQ_REMOVE(&ls->chunk, ch, link);
+ TAILQ_INSERT_TAIL(els, ch, link);
+ } else if (uls != NULL) {
+ TAILQ_REMOVE(&ls->chunk, ch, link);
+ TAILQ_INSERT_HEAD(uls, ch, link);
+ }
+ }
+
+ rte_spinlock_unlock(&ls->lock);
+ return n;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_chunk_alloc, 26.11)
+uint32_t
+rte_memtank_chunk_alloc(struct rte_memtank *mt, void *obj[], uint32_t nb_obj,
+ uint32_t flags)
+{
+ uint32_t k, n;
+ struct mchunk_head els, uls;
+
+ /* walk through the *used* list first */
+ n = get_chunk(&mt->chl[MC_USED], &mt->chl[MC_USED].chunk, NULL,
+ obj, nb_obj);
+
+ if (n != nb_obj) {
+
+ TAILQ_INIT(&els);
+ TAILQ_INIT(&uls);
+
+ /* walk through the *full* list */
+ n += get_chunk(&mt->chl[MC_FULL], &els, &uls,
+ obj + n, nb_obj - n);
+
+ if (n != nb_obj && (flags & RTE_MTANK_ALLOC_GROW) != 0) {
+
+ /*
+ * Try to allocate extra memchunks.
+ * Note that in rare cases under really high load,
+ * when the number of allocated chunks is close to
+ * the max allowed limit and multiple threads call
+ * grow_chunk() simultaneously, it can fail for
+ * some of them, leading to a failure to allocate
+ * new elements.
+ */
+ k = ALIGN_MUL_CEIL(nb_obj - n,
+ mt->prm.nb_obj_chunk);
+ k = grow_chunk(mt, k);
+
+ /* walk through the *full* list again */
+ if (k != 0)
+ n += get_chunk(&mt->chl[MC_FULL], &els, &uls,
+ obj + n, nb_obj - n);
+ }
+
+ /* concatenate with *used* list our temporary lists */
+ rte_spinlock_lock(&mt->chl[MC_USED].lock);
+
+ /* put new non-empty elems at head of the *used* list */
+ TAILQ_CONCAT(&uls, &mt->chl[MC_USED].chunk, link);
+ TAILQ_CONCAT(&mt->chl[MC_USED].chunk, &uls, link);
+
+ /* put new empty elems at tail of the *used* list */
+ TAILQ_CONCAT(&mt->chl[MC_USED].chunk, &els, link);
+
+ rte_spinlock_unlock(&mt->chl[MC_USED].lock);
+ }
+
+ if (mt->flags & RTE_MTANK_OBJ_DBG)
+ obj_dbg_alloc(mt, obj, n);
+
+ return n;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_grow, 26.11)
+int
+rte_memtank_grow(struct rte_memtank *mt)
+{
+ uint32_t k, n, num;
+ struct memtank_free *t;
+
+ t = &mt->mtf;
+
+ /* how many objects we are below the grow threshold */
+ k = t->min_free - t->nb_free;
+ if ((int32_t)k <= 0)
+ return 0;
+
+ num = ALIGN_MUL_CEIL(k, mt->prm.nb_obj_chunk);
+
+ /* try to grow and refill the *free* */
+ n = grow_chunk(mt, num);
+ if (n != 0)
+ fill_free(mt, k, 0);
+
+ return n;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_shrink, 26.11)
+int
+rte_memtank_shrink(struct rte_memtank *mt)
+{
+ uint32_t n;
+ struct memtank_free *t;
+
+ t = &mt->mtf;
+
+ /* how many chunks we need to shrink */
+ if (t->nb_free < t->max_free)
+ return 0;
+
+ /* how many chunks we need to free */
+ n = ALIGN_MUL_CEIL(t->min_free, mt->prm.nb_obj_chunk);
+
+ /* free up to *n* chunks */
+ return shrink_chunk(mt, n);
+}
+
+static int
+check_param(const struct rte_memtank_prm *prm)
+{
+ if (prm->alloc == NULL || prm->free == NULL ||
+ prm->min_free > prm->max_free ||
+ prm->max_free > prm->max_obj ||
+ rte_is_power_of_2(prm->obj_align) == 0 ||
+ prm->min_free == 0 ||
+ prm->nb_obj_chunk == 0)
+ return -EINVAL;
+ return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_create, 26.11)
+struct rte_memtank *
+rte_memtank_create(const struct rte_memtank_prm *prm)
+{
+ int32_t rc;
+ size_t sz;
+ void *p;
+ struct rte_memtank *mt;
+
+ rc = check_param(prm);
+ if (rc != 0) {
+ rte_errno = -rc;
+ return NULL;
+ }
+
+ sz = memtank_meta_size(prm->max_free);
+ p = prm->alloc(sz, prm->udata);
+ if (p == NULL) {
+ rte_errno = ENOMEM;
+ return NULL;
+ }
+
+ mt = RTE_PTR_ALIGN_CEIL(p, alignof(typeof(*mt)));
+
+ memset(mt, 0, sizeof(*mt));
+ mt->prm = *prm;
+
+ mt->raw = p;
+ mt->chunk_size = memchunk_size(prm->nb_obj_chunk, prm->obj_size,
+ prm->obj_align);
+ mt->obj_size = memobj_size(prm->obj_size, prm->obj_align);
+ mt->max_chunk = ALIGN_MUL_CEIL(prm->max_obj, prm->nb_obj_chunk);
+ mt->flags = prm->flags;
+
+ mt->mtf.min_free = prm->min_free;
+ mt->mtf.max_free = prm->max_free;
+
+ TAILQ_INIT(&mt->chl[MC_FULL].chunk);
+ TAILQ_INIT(&mt->chl[MC_USED].chunk);
+
+ return mt;
+}
+
+static void
+free_mchunk_list(struct rte_memtank *mt, struct mchunk_list *ls)
+{
+ struct memchunk *ch;
+
+ for (ch = TAILQ_FIRST(&ls->chunk); ch != NULL;
+ ch = TAILQ_FIRST(&ls->chunk)) {
+ TAILQ_REMOVE(&ls->chunk, ch, link);
+ mt->prm.free(ch->raw, mt->prm.udata);
+ }
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_destroy, 26.11)
+void
+rte_memtank_destroy(struct rte_memtank *mt)
+{
+ if (mt != NULL) {
+ free_mchunk_list(mt, &mt->chl[MC_FULL]);
+ free_mchunk_list(mt, &mt->chl[MC_USED]);
+ mt->prm.free(mt->raw, mt->prm.udata);
+ }
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_alloc, 26.11)
+uint32_t
+rte_memtank_alloc(struct rte_memtank *mt, void *obj[], uint32_t num,
+ uint32_t flags)
+{
+ uint32_t n;
+ struct memtank_free *t;
+
+ t = &mt->mtf;
+ n = get_free(t, obj, num);
+
+ /* not enough free objects, try to allocate via memchunks */
+ if (n != num && flags != 0) {
+ n += rte_memtank_chunk_alloc(mt, obj + n, num - n, flags);
+
+ /* refill *free* tank */
+ if (n == num)
+ fill_free(mt, t->min_free, flags);
+ }
+
+ return n;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_free, 26.11)
+void
+rte_memtank_free(struct rte_memtank *t, void * const obj[], uint32_t num,
+ uint32_t flags)
+{
+ uint32_t n;
+
+ n = put_free(&t->mtf, obj, num);
+ if (n != num)
+ rte_memtank_chunk_free(t, obj + n, num - n, flags);
+}
diff --git a/lib/memtank/memtank.h b/lib/memtank/memtank.h
new file mode 100644
index 0000000000..872f2f7def
--- /dev/null
+++ b/lib/memtank/memtank.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ * Copyright(c) 2025 Huawei Technologies Co., Ltd
+ */
+
+#ifndef _MEMTANK_H_
+#define _MEMTANK_H_
+
+#include <rte_memtank.h>
+#include <rte_atomic.h>
+#include <rte_spinlock.h>
+#include <rte_log.h>
+#include <stdalign.h>
+#include <errno.h>
+
+extern int memtank_logtype;
+#define RTE_LOGTYPE_MTANK memtank_logtype
+#define MTANK_LOG(level, ...) \
+ RTE_LOG_LINE(level, MTANK, "" __VA_ARGS__)
+
+struct memobj {
+ uint64_t red_zone1;
+ struct memchunk *chunk; /* ptr to the chunk it belongs to */
+ struct {
+ uint32_t nb_alloc;
+ uint32_t nb_free;
+ } dbg;
+ uint64_t red_zone2;
+};
+
+#define RED_ZONE_V1 UINT64_C(0xBADECAFEBADECAFE)
+#define RED_ZONE_V2 UINT64_C(0xDEADBEEFDEADBEEF)
+
+struct memchunk {
+ TAILQ_ENTRY(memchunk) link; /* link to the next chunk in the tank */
+ void *raw; /* un-aligned ptr returned by alloc() */
+ uint32_t nb_total; /* total number of objects in the chunk */
+ uint32_t nb_free; /* number of free object in the chunk */
+ void *free[]; /* array of free objects */
+} __rte_cache_aligned;
+
+
+TAILQ_HEAD(mchunk_head, memchunk);
+
+struct mchunk_list {
+ rte_spinlock_t lock;
+ struct mchunk_head chunk; /* list of chunks */
+} __rte_cache_aligned;
+
+enum {
+ MC_FULL, /* all memchunk objs are free */
+ MC_USED, /* some of memchunk objs are allocated */
+ MC_NUM,
+};
+
+struct memtank_free {
+ rte_spinlock_t lock;
+ uint32_t min_free;
+ uint32_t max_free;
+ uint32_t nb_free;
+ void *free[];
+} __rte_cache_aligned;
+
+struct rte_memtank {
+ /* user provided data */
+ struct rte_memtank_prm prm;
+
+ /* run-time data */
+ void *raw; /* un-aligned ptr returned by alloc() */
+ size_t chunk_size; /* full size of each memchunk */
+ uint32_t obj_size; /* full size of each memobj */
+ uint32_t max_chunk; /* max allowed number of chunks */
+ uint32_t flags; /* behavior flags */
+ RTE_ATOMIC(uint32_t) nb_chunks; /* number of allocated chunks */
+ struct mchunk_list chl[MC_NUM]; /* lists of memchunks */
+ struct memtank_free mtf; /* cached free objects */
+};
+
+/**
+ * Obtain pointer to internal memobj struct from public one
+ */
+static inline struct memobj *
+obj_pub_full(uintptr_t p, uint32_t obj_sz)
+{
+ uintptr_t v;
+
+ v = p + obj_sz - sizeof(struct memobj);
+ return (struct memobj *)v;
+}
+
+/**
+ * Fast check: does the given object belong to that memchunk.
+ * Returns zero, if object is within the chunk, non-zero value otherwise.
+ */
+static inline int
+obj_check_chunk(uintptr_t obj, size_t obj_sz, uintptr_t chn, size_t chn_sz)
+{
+ return (obj <= chn || obj + obj_sz > chn + chn_sz);
+}
+
+static inline int
+memobj_verify(const struct memobj *mo, uint32_t finc)
+{
+ if (mo->red_zone1 != RED_ZONE_V1 || mo->red_zone2 != RED_ZONE_V2 ||
+ mo->dbg.nb_alloc != mo->dbg.nb_free + finc)
+ return -EINVAL;
+ return 0;
+}
+
+#endif /* _MEMTANK_H_ */
diff --git a/lib/memtank/meson.build b/lib/memtank/meson.build
new file mode 100644
index 0000000000..a4c54c09bd
--- /dev/null
+++ b/lib/memtank/meson.build
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2017 Intel Corporation
+
+extra_flags = []
+
+foreach flag: extra_flags
+ if cc.has_argument(flag)
+ cflags += flag
+ endif
+endforeach
+
+sources = files('memtank.c',
+ 'misc.c',
+)
+headers = files(
+ 'rte_memtank.h',
+)
+deps += ['ring', 'telemetry']
diff --git a/lib/memtank/misc.c b/lib/memtank/misc.c
new file mode 100644
index 0000000000..526bbbbcf1
--- /dev/null
+++ b/lib/memtank/misc.c
@@ -0,0 +1,375 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ * Copyright(c) 2025 Huawei Technologies Co., Ltd
+ */
+
+#include "memtank.h"
+#include <inttypes.h>
+#include <stdlib.h>
+#include <eal_export.h>
+
+#define CHUNK_OBJ_LT_NUM 4
+
+struct mchunk_stat {
+ uint32_t nb_empty;
+ uint32_t nb_full;
+ struct {
+ uint32_t nb_chunk;
+ uint32_t nb_obj;
+ struct {
+ uint32_t val;
+ uint32_t num;
+ } chunk_obj_lt[CHUNK_OBJ_LT_NUM];
+ } used;
+};
+
+struct mfree_stat {
+ uint32_t nb_chunk;
+ struct mchunk_stat chunk;
+};
+
+RTE_LOG_REGISTER_DEFAULT(memtank_logtype, INFO);
+
+static void
+mchunk_stat_dump(FILE *f, const struct mchunk_stat *st)
+{
+ uint32_t i;
+
+ fprintf(f, "\t\tstat={\n");
+ fprintf(f, "\t\t\tnb_empty=%u,\n", st->nb_empty);
+ fprintf(f, "\t\t\tnb_full=%u,\n", st->nb_full);
+ fprintf(f, "\t\t\tused={\n");
+ fprintf(f, "\t\t\t\tnb_chunk=%u,\n", st->used.nb_chunk);
+ fprintf(f, "\t\t\t\tnb_obj=%u,\n", st->used.nb_obj);
+
+ for (i = 0; i != RTE_DIM(st->used.chunk_obj_lt); i++) {
+ if (st->used.chunk_obj_lt[i].num != 0)
+ fprintf(f, "\t\t\t\tnb_chunk_obj_lt_%u=%u,\n",
+ st->used.chunk_obj_lt[i].val,
+ st->used.chunk_obj_lt[i].num);
+ }
+
+ fprintf(f, "\t\t\t},\n");
+ fprintf(f, "\t\t},\n");
+}
+
+static void
+mchunk_stat_init(struct mchunk_stat *st, uint32_t nb_obj_chunk)
+{
+ uint32_t i;
+
+ memset(st, 0, sizeof(*st));
+ for (i = 0; i != RTE_DIM(st->used.chunk_obj_lt); i++) {
+ st->used.chunk_obj_lt[i].val = (i + 1) * nb_obj_chunk /
+ RTE_DIM(st->used.chunk_obj_lt);
+ }
+}
+
+static void
+mchunk_stat_collect(struct mchunk_stat *st, const struct memchunk *ch)
+{
+ uint32_t i, n;
+
+ n = ch->nb_total - ch->nb_free;
+
+ if (ch->nb_free == 0)
+ st->nb_empty++;
+ else if (n == 0)
+ st->nb_full++;
+ else {
+ st->used.nb_chunk++;
+ st->used.nb_obj += n;
+
+ for (i = 0; i != RTE_DIM(st->used.chunk_obj_lt); i++) {
+ if (n < st->used.chunk_obj_lt[i].val) {
+ st->used.chunk_obj_lt[i].num++;
+ break;
+ }
+ }
+ }
+}
+
+static void
+mchunk_list_dump(FILE *f, struct rte_memtank *mt, uint32_t idx, uint32_t flags)
+{
+ struct mchunk_list *ls;
+ const struct memchunk *ch;
+ struct mchunk_stat mcs;
+
+ ls = &mt->chl[idx];
+ mchunk_stat_init(&mcs, mt->prm.nb_obj_chunk);
+
+ rte_spinlock_lock(&ls->lock);
+
+ for (ch = TAILQ_FIRST(&ls->chunk); ch != NULL;
+ ch = TAILQ_NEXT(ch, link)) {
+
+ /* collect chunk stats */
+ if (flags & RTE_MTANK_DUMP_CHUNK_STAT)
+ mchunk_stat_collect(&mcs, ch);
+
+ /* dump chunk metadata */
+ if (flags & RTE_MTANK_DUMP_CHUNK) {
+ fprintf(f, "\t\tmemchunk@%p={\n", ch);
+ fprintf(f, "\t\t\traw=%p,\n", ch->raw);
+ fprintf(f, "\t\t\tnb_total=%u,\n", ch->nb_total);
+ fprintf(f, "\t\t\tnb_free=%u,\n", ch->nb_free);
+ fprintf(f, "\t\t},\n");
+ }
+ }
+
+ rte_spinlock_unlock(&ls->lock);
+
+ /* print chunk stats */
+ if (flags & RTE_MTANK_DUMP_CHUNK_STAT)
+ mchunk_stat_dump(f, &mcs);
+}
+
+static void
+mfree_stat_init(struct mfree_stat *st, uint32_t nb_obj_chunk)
+{
+ st->nb_chunk = 0;
+ mchunk_stat_init(&st->chunk, nb_obj_chunk);
+}
+
+static int
+ptr_cmp(const void *p1, const void *p2)
+{
+ uintptr_t rc, v1, v2;
+
+ v1 = *(const uintptr_t *)p1;
+ v2 = *(const uintptr_t *)p2;
+ rc = v1 - v2;
+ return (rc > v1) ? -1 : ((rc > 0) ? 1 : 0);
+}
+
+static void
+mfree_stat_collect(struct mfree_stat *st, struct rte_memtank *mt)
+{
+ uint32_t i, j, n, sz;
+ uintptr_t *p;
+ const struct memobj *mo;
+
+ sz = mt->obj_size;
+
+ p = malloc(mt->mtf.max_free * sizeof(*p));
+ if (p == NULL)
+ return;
+
+ /**
+ * grab free lock and keep it till we analyze related memchunks,
+ * to make sure none of these memchunks will be freed until
+ * we are finished.
+ */
+ rte_spinlock_lock(&mt->mtf.lock);
+
+ /* collect chunks for all objects in free[] */
+ n = mt->mtf.nb_free;
+ memcpy(p, mt->mtf.free, n * sizeof(*p));
+ for (i = 0; i != n; i++) {
+ mo = obj_pub_full(p[i], sz);
+ p[i] = (uintptr_t)mo->chunk;
+ }
+
+ /* sort chunk pointers */
+ qsort(p, n, sizeof(*p), ptr_cmp);
+
+ /* for each chunk collect stats */
+ for (i = 0; i != n; i = j) {
+
+ RTE_ASSERT(st->nb_chunk < mt->max_chunk);
+ st->nb_chunk++;
+ mchunk_stat_collect(&st->chunk, (const struct memchunk *)p[i]);
+ for (j = i + 1; j != n && p[i] == p[j]; j++)
+ ;
+ }
+
+ rte_spinlock_unlock(&mt->mtf.lock);
+ free(p);
+}
+
+static void
+mfree_stat_dump(FILE *f, const struct mfree_stat *st)
+{
+ fprintf(f, "\tfree_stat={\n");
+ fprintf(f, "\t\tnb_chunk=%u,\n", st->nb_chunk);
+ mchunk_stat_dump(f, &st->chunk);
+ fprintf(f, "\t},\n");
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_dump, 26.11)
+void
+rte_memtank_dump(FILE *f, struct rte_memtank *mt, uint32_t flags)
+{
+ uint32_t n;
+
+ if (f == NULL || mt == NULL)
+ return;
+
+ fprintf(f, "rte_memtank@%p={\n", mt);
+ fprintf(f, "\tmin_free=%u,\n", mt->mtf.min_free);
+ fprintf(f, "\tmax_free=%u,\n", mt->mtf.max_free);
+ fprintf(f, "\tnb_free=%u,\n", mt->mtf.nb_free);
+ fprintf(f, "\tchunk_size=%zu,\n", mt->chunk_size);
+ fprintf(f, "\tobj_size=%u,\n", mt->obj_size);
+ fprintf(f, "\tmax_chunk=%u,\n", mt->max_chunk);
+ fprintf(f, "\tflags=%#x,\n", mt->flags);
+ n = rte_atomic_load_explicit(&mt->nb_chunks, rte_memory_order_relaxed);
+ fprintf(f, "\tnb_chunks=%u,\n", n);
+
+ if (flags & RTE_MTANK_DUMP_FREE_STAT) {
+ struct mfree_stat mfs;
+ mfree_stat_init(&mfs, mt->prm.nb_obj_chunk);
+ mfree_stat_collect(&mfs, mt);
+ mfree_stat_dump(f, &mfs);
+ }
+
+ if (flags & (RTE_MTANK_DUMP_CHUNK | RTE_MTANK_DUMP_CHUNK_STAT)) {
+
+ fprintf(f, "\t[FULL]={\n");
+ mchunk_list_dump(f, mt, MC_FULL, flags);
+ fprintf(f, "\t},\n");
+
+ fprintf(f, "\t[USED]={\n");
+ mchunk_list_dump(f, mt, MC_USED, flags);
+ fprintf(f, "\t},\n");
+ }
+ fprintf(f, "};\n");
+}
+
+static int
+mobj_bulk_check(const char *fname, const struct rte_memtank *mt,
+ const uintptr_t p[], uint32_t num, uint32_t fmsk)
+{
+ int32_t ret;
+ uintptr_t align;
+ uint32_t i, k, sz;
+ const struct memobj *mo;
+
+ k = ((mt->flags & RTE_MTANK_OBJ_DBG) != 0) & fmsk;
+ sz = mt->obj_size;
+ align = mt->prm.obj_align - 1;
+
+ ret = 0;
+ for (i = 0; i != num; i++) {
+
+ if (p[i] == (uintptr_t)NULL) {
+ ret--;
+ MTANK_LOG(ERR,
+ "%s(mt=%p, %p[%u]): NULL object",
+ fname, mt, p, i);
+ } else if ((p[i] & align) != 0) {
+ ret--;
+ MTANK_LOG(ERR,
+ "%s(mt=%p, %p[%u]): object %#zx violates "
+ "expected alignment %#zx",
+ fname, mt, p, i, p[i], align);
+ } else {
+ mo = obj_pub_full(p[i], sz);
+ if (memobj_verify(mo, k) != 0) {
+ ret--;
+ MTANK_LOG(ERR,
+ "%s(mt=%p, %p[%u]): "
+ "invalid object header @%#zx={"
+ "red_zone1=%#" PRIx64 ","
+ "dbg={nb_alloc=%u,nb_free=%u},"
+ "red_zone2=%#" PRIx64
+ "}",
+ fname, mt, p, i, p[i],
+ mo->red_zone1,
+ mo->dbg.nb_alloc, mo->dbg.nb_free,
+ mo->red_zone2);
+ }
+ }
+ }
+
+ return ret;
+}
+
+/* grab free lock and check objects in free[] */
+static int
+mfree_check(struct rte_memtank *mt)
+{
+ int32_t rc;
+
+ rte_spinlock_lock(&mt->mtf.lock);
+ rc = mobj_bulk_check(__func__, mt, (const uintptr_t *)mt->mtf.free,
+ mt->mtf.nb_free, 1);
+ rte_spinlock_unlock(&mt->mtf.lock);
+ return rc;
+}
+
+static int
+mchunk_check(const struct rte_memtank *mt, const struct memchunk *mc,
+ uint32_t tc)
+{
+ int32_t n, rc;
+
+ rc = 0;
+ n = mc->nb_total - mc->nb_free;
+
+ rc -= (mc->nb_total != mt->prm.nb_obj_chunk);
+ rc -= (tc == MC_FULL) ? (n != 0) : (n <= 0);
+ rc -= (RTE_PTR_ALIGN_CEIL(mc->raw, alignof(typeof(*mc))) != mc);
+
+ if (rc != 0)
+ MTANK_LOG(ERR, "%s(mt=%p, tc=%u): invalid memchunk @%p={"
+ "raw=%p, nb_total=%u, nb_free=%u}",
+ __func__, mt, tc, mc,
+ mc->raw, mc->nb_total, mc->nb_free);
+
+ rc += mobj_bulk_check(__func__, mt, (const uintptr_t *)mc->free,
+ mc->nb_free, 0);
+ return rc;
+}
+
+static int
+mchunk_list_check(struct rte_memtank *mt, uint32_t tc, uint32_t *nb_chunk)
+{
+ int32_t rc;
+ uint32_t n;
+ struct mchunk_list *ls;
+ const struct memchunk *ch;
+
+ ls = &mt->chl[tc];
+ rte_spinlock_lock(&ls->lock);
+
+ rc = 0;
+ for (n = 0, ch = TAILQ_FIRST(&ls->chunk); ch != NULL;
+ ch = TAILQ_NEXT(ch, link), n++)
+ rc += mchunk_check(mt, ch, tc);
+
+ rte_spinlock_unlock(&ls->lock);
+
+ *nb_chunk = n;
+ return rc;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_memtank_sanity_check, 26.11)
+int
+rte_memtank_sanity_check(struct rte_memtank *mt, int32_t ct)
+{
+ int32_t rc;
+ uint32_t n, nf, nu;
+
+ rc = mfree_check(mt);
+
+ nf = 0;
+ nu = 0;
+ rc += mchunk_list_check(mt, MC_FULL, &nf);
+ rc += mchunk_list_check(mt, MC_USED, &nu);
+
+ /*
+ * if some other threads concurrently do alloc/free/grow/shrink
+ * these numbers may still not match.
+ */
+ n = rte_atomic_load_explicit(&mt->nb_chunks, rte_memory_order_relaxed);
+ if (nf + nu != n && ct == 0) {
+ MTANK_LOG(ERR,
+ "%s(mt=%p) nb_chunks: expected=%u, full=%u, used=%u",
+ __func__, mt, n, nf, nu);
+ rc--;
+ }
+
+ return rc;
+}
diff --git a/lib/memtank/rte_memtank.h b/lib/memtank/rte_memtank.h
new file mode 100644
index 0000000000..7359b39840
--- /dev/null
+++ b/lib/memtank/rte_memtank.h
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ * Copyright(c) 2025 Huawei Technologies Co., Ltd
+ */
+
+#ifndef _RTE_MEMTANK_H_
+#define _RTE_MEMTANK_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <stdio.h>
+
+/**
+ * @file
+ * RTE memtank
+ *
+ * Same as mempool, it allows to alloc/free objects of fixed size
+ * in a lightweight manner (probably not as lightweight as mempool,
+ * but hopefully close enough).
+ * In addition, it can grow/shrink dynamically and provides
+ * additional API for higher flexibility:
+ * - manual grow()/shrink() functions
+ * - different alloc/free policies
+ * (can be specified by user via flags parameter).
+ *
+ * Internally it consists of:
+ * - LIFO queue (fast allocator/deallocator)
+ * - lists of memchunks (FULL, USED).
+ *
+ * For performance reasons memtank tries to allocate memory in
+ * relatively big chunks (memchunks) and then split each memchunk
+ * in dozens (or hundreds) of objects.
+ * There are two thresholds:
+ * - min_free (grow threshold)
+ * - max_free (shrink threshold)
+ */
+
+struct rte_memtank;
+
+/** generic memtank behavior flags */
+enum {
+ /** Enable obj debugging */
+ RTE_MTANK_OBJ_DBG = 1,
+};
+
+struct rte_memtank_prm {
+ /** min number of free objs in the ring (grow threshold). */
+ uint32_t min_free;
+ uint32_t max_free; /**< max number of free objs (shrink threshold) */
+ uint32_t max_obj; /**< max number of objs (grow limit) */
+ uint32_t obj_size; /**< size of each mem object */
+ uint32_t obj_align; /**< alignment of each mem object */
+ uint32_t nb_obj_chunk; /**< number of objects per chunk */
+ uint32_t flags; /**< behavior flags */
+ /** user provided function to alloc chunk of memory */
+ void * (*alloc)(size_t len, void *udata);
+ /** user provided function to free chunk of memory */
+ void (*free)(void *mem, void *udata);
+ /** user provided function to initialize an object */
+ void (*init)(void *obj[], uint32_t num, void *udata);
+ void *udata; /**< opaque user data for alloc/free/init */
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Allocate and initialize new memtank instance, based on the
+ * parameters provided. Note that it uses user-provided *alloc()* function
+ * to allocate space for the memtank metadata.
+ * @param prm
+ * Parameters used to create and initialise new memtank.
+ * @return
+ * - Pointer to new memtank instance created, if operation completed
+ * successfully.
+ * - NULL on error with rte_errno set appropriately.
+ */
+__rte_experimental
+struct rte_memtank *
+rte_memtank_create(const struct rte_memtank_prm *prm);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Destroy the memtank and free all memory referenced by the memtank.
+ * The objects must not be used by other cores as they will be freed.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ */
+__rte_experimental
+void
+rte_memtank_destroy(struct rte_memtank *t);
+
+
+/** alloc flags */
+enum {
+ RTE_MTANK_ALLOC_CHUNK = 1,
+ /** Allocate extra memchunks if needed */
+ RTE_MTANK_ALLOC_GROW = 2,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Allocate up to requested number of objects from the memtank.
+ * Note that depending on *alloc* behavior (flags) some new memory chunks
+ * can be allocated from the underlying memory subsystem.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @param obj
+ * An array of void * pointers (objects) that will be filled.
+ * @param num
+ * Number of objects to allocate from the memtank.
+ * @param flags
+ * Flags that control allocation behavior.
+ * @return
+ * Number of allocated objects.
+ */
+__rte_experimental
+uint32_t
+rte_memtank_alloc(struct rte_memtank *t, void *obj[], uint32_t num,
+ uint32_t flags);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Allocate up to requested number of objects from the memtank.
+ * Note that this function bypasses *free* cache(s) and tries to allocate
+ * objects straight from the memory chunks.
+ * Note that depending on *alloc* behavior (flags) some new memory chunks
+ * can be allocated from the underlying memory subsystem.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @param obj
+ * An array of void * pointers (objects) that will be filled.
+ * @param nb_obj
+ * Number of objects to allocate from the memtank.
+ * @param flags
+ * Flags that control allocation behavior.
+ * @return
+ * Number of allocated objects.
+ */
+__rte_experimental
+uint32_t
+rte_memtank_chunk_alloc(struct rte_memtank *t, void *obj[], uint32_t nb_obj,
+ uint32_t flags);
+
+/** free flags */
+enum {
+ /** Free unneeded chunk of memory */
+ RTE_MTANK_FREE_SHRINK = 1,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Free (put) provided objects back to the memtank.
+ * Note that depending on *free* behavior (flags) some memory chunks can be
+ * returned (freed) to the underlying memory subsystem.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @param obj
+ * An array of object pointers to be freed.
+ * @param num
+ * Number of objects to free.
+ * @param flags
+ * Flags that control free behavior.
+ */
+__rte_experimental
+void
+rte_memtank_free(struct rte_memtank *t, void * const obj[], uint32_t num,
+ uint32_t flags);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Free (put) provided objects back to the memtank.
+ * Note that this function bypasses *free* cache(s) and tries to put
+ * objects straight to the memory chunks.
+ * Note that depending on *free* behavior (flags) some memory chunks can be
+ * returned (freed) to the underlying memory subsystem.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @param obj
+ * An array of object pointers to be freed.
+ * @param nb_obj
+ * Number of objects to free.
+ * @param flags
+ * Flags that control free behavior.
+ */
+__rte_experimental
+void
+rte_memtank_chunk_free(struct rte_memtank *t, void * const obj[],
+ uint32_t nb_obj, uint32_t flags);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check whether the number of objects in *free* cache is below memtank
+ * grow threshold (min_free). If yes, then try to allocate memory for new
+ * objects from the underlying memory subsystem.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @return
+ * Number of newly allocated memory chunks.
+ */
+__rte_experimental
+int
+rte_memtank_grow(struct rte_memtank *t);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check whether the number of objects in *free* cache has reached memtank
+ * shrink threshold (max_free). If yes, then try to return excess memory to
+ * the underlying memory subsystem.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @return
+ * Number of freed memory chunks.
+ */
+__rte_experimental
+int
+rte_memtank_shrink(struct rte_memtank *t);
+
+/** dump flags */
+enum {
+ RTE_MTANK_DUMP_FREE_STAT = 1,
+ RTE_MTANK_DUMP_CHUNK_STAT = 2,
+ RTE_MTANK_DUMP_CHUNK = 4,
+ /* first not used power of two */
+ RTE_MTANK_DUMP_END = 8,
+
+ /** dump all stats */
+ RTE_MTANK_DUMP_STAT =
+ (RTE_MTANK_DUMP_FREE_STAT | RTE_MTANK_DUMP_CHUNK_STAT),
+ /** dump everything */
+ RTE_MTANK_DUMP_ALL = RTE_MTANK_DUMP_END - 1,
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Dump information about the memtank to the file.
+ * Note that depending on the *flags* value it might grab some internal
+ * locks, and might affect performance of other threads that
+ * concurrently use the same memtank.
+ *
+ * @param f
+ * A pointer to the file.
+ * @param t
+ * A pointer to the memtank instance.
+ * @param flags
+ * Flags that control dump behavior.
+ */
+__rte_experimental
+void
+rte_memtank_dump(FILE *f, struct rte_memtank *t, uint32_t flags);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check the consistency of the given memtank instance.
+ * Dumps error messages to the RTE log subsystem, if some inconsistency
+ * is detected.
+ *
+ * @param t
+ * A pointer to the memtank instance.
+ * @param ct
+ * Value greater than zero, if some other threads concurrently use
+ * that memtank.
+ * @return
+ * Zero on success, or negative value otherwise.
+ */
+__rte_experimental
+int
+rte_memtank_sanity_check(struct rte_memtank *t, int32_t ct);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMTANK_H_ */
diff --git a/lib/meson.build b/lib/meson.build
index af5c160cb8..3b13dfee6c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -19,6 +19,7 @@ libraries = [
'ring',
'rcu', # rcu depends on ring
'mempool',
+ 'memtank',
'mbuf',
'net',
'meter',
--
2.51.0
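
A side note on mfree_stat_collect() in misc.c above: it maps every cached
free object to its owning chunk, sorts the chunk addresses with ptr_cmp()
and then walks runs of equal values so that each chunk is visited once.
The standalone sketch below isolates that sort-and-group step. The
comparator mirrors the one in the patch; count_distinct_chunks() is a
hypothetical helper written for illustration only and is not part of the
proposed library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Three-way compare of two pointer values stored as uintptr_t.
 * The unsigned difference wraps around (and so becomes greater
 * than v1) exactly when v1 < v2, so no signed overflow is involved.
 */
static int
ptr_cmp(const void *p1, const void *p2)
{
	uintptr_t rc, v1, v2;

	v1 = *(const uintptr_t *)p1;
	v2 = *(const uintptr_t *)p2;
	rc = v1 - v2;
	return (rc > v1) ? -1 : ((rc > 0) ? 1 : 0);
}

/*
 * Hypothetical helper (an assumption, not in the patch):
 * sort chunk addresses in place, then count runs of equal values.
 * Each run corresponds to one distinct chunk.
 */
static uint32_t
count_distinct_chunks(uintptr_t p[], uint32_t n)
{
	uint32_t i, j, k;

	qsort(p, n, sizeof(p[0]), ptr_cmp);

	k = 0;
	for (i = 0; i != n; i = j) {
		k++;
		/* skip the remaining entries of the same chunk */
		for (j = i + 1; j != n && p[i] == p[j]; j++)
			;
	}
	return k;
}
```

Sorting first keeps the grouping pass linear and needs no extra lookup
structure, at the cost of one temporary array sized by max_free.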
^ permalink raw reply related
* [PATCH v5] net/iavf: fix duplicate VF reset during PF reset recovery
From: Anurag Mandal @ 2026-06-10 10:07 UTC (permalink / raw)
To: dev
Cc: bruce.richardson, vladimir.medvedkin, ciara.loftus, Anurag Mandal,
stable
In-Reply-To: <20260605202911.314359-1-anurag.mandal@intel.com>
During PF-initiated reset recovery, iavf_dev_close() sends
an extra VIRTCHNL_OP_RESET_VF while recovery is already in progress.
That second reset can leave PF/VF virtchnl state inconsistent and
cause VIRTCHNL_OP_CONFIG_VSI_QUEUES to fail with ERR_PARAM after
ToR link flap/power-cycle, leaving the VF unable to recover.
This results in connection loss.
This patch introduces a new flag, 'pf_reset_in_progress', that is
set only while iavf_handle_hw_reset() runs with vf_initiated_reset
set to false, and is cleared on exit.
The close-time VF reset and the related close-time virtchnl
operations are skipped while a PF-triggered reset recovery is in
progress. This avoids a duplicate VF reset and keeps the normal
behavior for application-driven close or VF-initiated reinit.
Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without interrupt")
Cc: stable@dpdk.org
Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
---
V5: Addressed Ciara Loftus's comments
- added separate flag for PF initiated reset recovery
V4: Addressed Ciara Loftus's comments
- split VF reset from other code changes
V3: Addressed latest ai-code-review comments
V2: Addressed ai-code-review comments
doc/guides/rel_notes/release_26_07.rst | 3 ++
drivers/net/intel/iavf/iavf.h | 7 +++++
drivers/net/intel/iavf/iavf_ethdev.c | 40 +++++++++++++++-----------
drivers/net/intel/iavf/iavf_vchnl.c | 18 ++++++++++--
4 files changed, 49 insertions(+), 19 deletions(-)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index d2563ac503..f6899a78c3 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -95,6 +95,9 @@ New Features
* Added support for transmitting LLDP packets based on mbuf packet type.
* Implemented AVX2 context descriptor transmit paths.
+ * Prevented duplicate 'VIRTCHNL_OP_RESET_VF' during a PF-initiated
+ reset recovery, which earlier caused virtchnl state corruption
+ and connection loss after a top-of-rack (ToR) link flap/power-cycle.
* **Updated PCAP ethernet driver.**
diff --git a/drivers/net/intel/iavf/iavf.h b/drivers/net/intel/iavf/iavf.h
index 2615b6f034..67aacbe7a6 100644
--- a/drivers/net/intel/iavf/iavf.h
+++ b/drivers/net/intel/iavf/iavf.h
@@ -292,6 +292,13 @@ struct iavf_info {
bool in_reset_recovery;
+ /*
+ * Set only while iavf_handle_hw_reset()
+ * is processing a PF-initiated reset
+ * (vf_initiated_reset == false).
+ */
+ bool pf_reset_in_progress;
+
uint32_t ptp_caps;
rte_spinlock_t phc_time_aq_lock;
};
diff --git a/drivers/net/intel/iavf/iavf_ethdev.c b/drivers/net/intel/iavf/iavf_ethdev.c
index a8031e23a5..2b6f4daa99 100644
--- a/drivers/net/intel/iavf/iavf_ethdev.c
+++ b/drivers/net/intel/iavf/iavf_ethdev.c
@@ -3166,23 +3166,27 @@ iavf_dev_close(struct rte_eth_dev *dev)
ret = iavf_dev_stop(dev);
- /*
- * Release redundant queue resource when close the dev
- * so that other vfs can re-use the queues.
- */
- if (vf->lv_enabled) {
- ret = iavf_request_queues(dev, IAVF_MAX_NUM_QUEUES_DFLT);
- if (ret)
- PMD_DRV_LOG(ERR, "Reset the num of queues failed");
+ /* Skip close-time virtchnl requests on a PF-initiated reset */
+ if (!vf->pf_reset_in_progress) {
- vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
- }
+ /*
+ * Release redundant queue resource when close the dev
+ * so that other vfs can re-use the queues.
+ */
+ if (vf->lv_enabled) {
+ ret = iavf_request_queues(dev, IAVF_MAX_NUM_QUEUES_DFLT);
+ if (ret)
+ PMD_DRV_LOG(ERR, "Reset the num of queues failed");
+ vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
+ }
- /* Disable promiscuous mode before resetting the VF. This is to avoid
- * potential issues when the PF is bound to the kernel driver.
- */
- if (vf->promisc_unicast_enabled || vf->promisc_multicast_enabled)
- iavf_config_promisc(adapter, false, false);
+ /*
+ * Disable promiscuous mode before resetting the VF. This is to avoid
+ * potential issues when the PF is bound to the kernel driver.
+ */
+ if (vf->promisc_unicast_enabled || vf->promisc_multicast_enabled)
+ iavf_config_promisc(adapter, false, false);
+ }
adapter->closed = true;
@@ -3195,7 +3199,9 @@ iavf_dev_close(struct rte_eth_dev *dev)
iavf_flow_flush(dev, NULL);
iavf_flow_uninit(adapter);
- iavf_vf_reset(hw);
+ /* Skip RESET_VF on a PF-initiated reset */
+ if (!vf->pf_reset_in_progress)
+ iavf_vf_reset(hw);
vf->aq_intr_enabled = false;
iavf_shutdown_adminq(hw);
if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_WB_ON_ITR) {
@@ -3368,6 +3374,7 @@ iavf_handle_hw_reset(struct rte_eth_dev *dev, bool vf_initiated_reset)
}
vf->in_reset_recovery = true;
+ vf->pf_reset_in_progress = !vf_initiated_reset;
iavf_set_no_poll(adapter, false);
/* Call the pre reset callback */
@@ -3418,6 +3425,7 @@ iavf_handle_hw_reset(struct rte_eth_dev *dev, bool vf_initiated_reset)
vf->post_reset_cb(dev->data->port_id, ret, vf->post_reset_cb_arg);
vf->in_reset_recovery = false;
+ vf->pf_reset_in_progress = false;
iavf_set_no_poll(adapter, false);
return;
diff --git a/drivers/net/intel/iavf/iavf_vchnl.c b/drivers/net/intel/iavf/iavf_vchnl.c
index 94ccfb5d6e..cf3513ef94 100644
--- a/drivers/net/intel/iavf/iavf_vchnl.c
+++ b/drivers/net/intel/iavf/iavf_vchnl.c
@@ -283,9 +283,21 @@ iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
vf->link_up ? "up" : "down");
break;
case VIRTCHNL_EVENT_RESET_IMPENDING:
- vf->vf_reset = true;
- iavf_set_no_poll(adapter, false);
- PMD_DRV_LOG(INFO, "VF is resetting");
+ /*
+ * Force link down on impending reset to drop
+ * the cached link-up state; a fresh LSC up
+ * event will be re-issued by the PF once the
+ * VF is reinitialised.
+ */
+ vf->link_up = false;
+ if (!vf->vf_reset) {
+ vf->vf_reset = true;
+ iavf_set_no_poll(adapter, false);
+ iavf_dev_event_post(vf->eth_dev,
+ RTE_ETH_EVENT_INTR_RESET,
+ NULL, 0);
+ }
+ PMD_DRV_LOG(DEBUG, "VF is resetting");
break;
case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
vf->dev_closed = true;
--
2.34.1
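
The RESET_IMPENDING handling in the hunk above can be read as an idempotent state update: the cached link state is always dropped, but the reset notification is posted only once per reset cycle. A minimal standalone sketch of that rule follows. The `demo_vf` struct, `demo_on_reset_impending()` and the `reset_events` counter are hypothetical names used for illustration only; the counter stands in for the `RTE_ETH_EVENT_INTR_RESET` post and nothing here is driver code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VF state touched by the hunk above. */
struct demo_vf {
	bool link_up;
	bool vf_reset;
	unsigned int reset_events; /* counts posted reset notifications */
};

/* Model of one RESET_IMPENDING event arriving from the PF. */
static void
demo_on_reset_impending(struct demo_vf *vf)
{
	/* Always drop the cached link-up state. */
	vf->link_up = false;

	/* Only the first event of a reset cycle marks the reset and notifies. */
	if (!vf->vf_reset) {
		vf->vf_reset = true;
		vf->reset_events++;
	}
}
```

Calling `demo_on_reset_impending()` twice in a row leaves `reset_events` at 1, which is the de-duplication the `if (!vf->vf_reset)` guard is there to provide.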
^ permalink raw reply related
* RE: [PATCH v2] net/iavf: fix to consolidate link change event handling
From: Loftus, Ciara @ 2026-06-10 9:43 UTC (permalink / raw)
To: Mandal, Anurag, dev@dpdk.org
Cc: Richardson, Bruce, Medvedkin, Vladimir, stable@dpdk.org
In-Reply-To: <20260609173822.364452-1-anurag.mandal@intel.com>
> Subject: [PATCH v2] net/iavf: fix to consolidate link change event handling
>
[snip]
> +
> /* Read data in admin queue to get msg from pf driver */
> static enum iavf_aq_result
> iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
> @@ -249,38 +310,15 @@ iavf_read_msg_from_pf(struct iavf_adapter
> *adapter, uint16_t buf_len,
> if (opcode == VIRTCHNL_OP_EVENT) {
> struct virtchnl_pf_event *vpe =
> (struct virtchnl_pf_event *)event.msg_buf;
> + if (vpe == NULL) {
> + PMD_DRV_LOG(ERR, "Invalid PF event message");
> + return IAVF_MSG_ERR;
> + }
This check can be removed.
iavf_read_msg_from_pf is called from iavf_wait_for_msg, which performs a
NULL check on the same location (args->out_buffer) before passing it to
iavf_read_msg_from_pf.
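
To illustrate the reasoning, here is a minimal standalone sketch of validating a buffer once at the caller so the callee does not need to repeat the check. The `demo_wait_for_msg()` and `demo_read_msg()` names are hypothetical and only mirror the call relationship described above; they are not the driver functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical callee: its contract is that buf is non-NULL,
 * so it does not re-check the pointer.
 */
static int
demo_read_msg(const uint8_t *buf)
{
	return buf[0]; /* first byte stands in for the event code */
}

/* Hypothetical caller: validates the buffer once, at the boundary. */
static int
demo_wait_for_msg(const uint8_t *out_buffer)
{
	if (out_buffer == NULL)
		return -1;
	return demo_read_msg(out_buffer);
}
```

A second NULL check inside the callee would be unreachable on this path, which is why it can be dropped.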
>
> result = IAVF_MSG_SYS;
> switch (vpe->event) {
> case VIRTCHNL_EVENT_LINK_CHANGE:
> - vf->link_up =
> - vpe->event_data.link_event.link_status;
> - if (vf->vf_res != NULL &&
> - vf->vf_res->vf_cap_flags &
> VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
> - vf->link_speed =
> - vpe-
> >event_data.link_event_adv.link_speed;
> - } else {
> - enum virtchnl_link_speed speed;
> - speed = vpe-
> >event_data.link_event.link_speed;
> - vf->link_speed =
> iavf_convert_link_speed(speed);
> - }
> - iavf_dev_link_update(vf->eth_dev, 0);
> - iavf_dev_event_post(vf->eth_dev,
> RTE_ETH_EVENT_INTR_LSC, NULL, 0);
> - if (vf->link_up && !vf->vf_reset) {
> - iavf_dev_watchdog_disable(adapter);
> - } else {
> - if (!vf->link_up)
> - iavf_dev_watchdog_enable(adapter);
> - }
> - if (adapter->devargs.no_poll_on_link_down) {
> - iavf_set_no_poll(adapter, true);
> - if (adapter->no_poll)
> - PMD_DRV_LOG(DEBUG, "VF no poll
> turned on");
> - else
> - PMD_DRV_LOG(DEBUG, "VF no poll
> turned off");
> - }
> - PMD_DRV_LOG(INFO, "Link status update:%s",
> - vf->link_up ? "up" : "down");
> + iavf_handle_link_change_event(vf->eth_dev, vpe);
> break;
> case VIRTCHNL_EVENT_RESET_IMPENDING:
> vf->vf_reset = true;
> @@ -505,6 +543,12 @@ iavf_handle_pf_event_msg(struct rte_eth_dev *dev,
> uint8_t *msg,
> PMD_DRV_LOG(DEBUG, "Error event");
> return;
> }
> +
> + if (pf_msg == NULL) {
> + PMD_DRV_LOG(ERR, "Invalid PF event message");
> + return;
> + }
This too can be removed.
pf_msg resolves to vf->aq_resp which is a fixed buffer allocated at
driver init time. It cannot be NULL here.
With those two changes I think the patch will be good to go.
> +
> switch (pf_msg->event) {
> case VIRTCHNL_EVENT_RESET_IMPENDING:
> PMD_DRV_LOG(DEBUG,
> "VIRTCHNL_EVENT_RESET_IMPENDING event");
> @@ -518,30 +562,7 @@ iavf_handle_pf_event_msg(struct rte_eth_dev *dev,
> uint8_t *msg,
> break;
> case VIRTCHNL_EVENT_LINK_CHANGE:
> PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_LINK_CHANGE
> event");
> - vf->link_up = pf_msg->event_data.link_event.link_status;
> - if (vf->vf_res->vf_cap_flags &
> VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
> - vf->link_speed =
> - pf_msg-
> >event_data.link_event_adv.link_speed;
> - } else {
> - enum virtchnl_link_speed speed;
> - speed = pf_msg->event_data.link_event.link_speed;
> - vf->link_speed = iavf_convert_link_speed(speed);
> - }
> - iavf_dev_link_update(dev, 0);
> - if (vf->link_up && !vf->vf_reset) {
> - iavf_dev_watchdog_disable(adapter);
> - } else {
> - if (!vf->link_up)
> - iavf_dev_watchdog_enable(adapter);
> - }
> - if (adapter->devargs.no_poll_on_link_down) {
> - iavf_set_no_poll(adapter, true);
> - if (adapter->no_poll)
> - PMD_DRV_LOG(DEBUG, "VF no poll turned
> on");
> - else
> - PMD_DRV_LOG(DEBUG, "VF no poll turned
> off");
> - }
> - iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL,
> 0);
> + iavf_handle_link_change_event(dev, pf_msg);
> break;
> case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
> PMD_DRV_LOG(DEBUG,
> "VIRTCHNL_EVENT_PF_DRIVER_CLOSE event");
> --
> 2.34.1
^ permalink raw reply
* [PATCH 1/2] eal/pflock: add API to downgrade from wr to rd lock
From: Eimear Morrissey @ 2026-06-10 9:11 UTC (permalink / raw)
To: dev; +Cc: Konstantin Ananyev
In-Reply-To: <20260610091147.88412-1-eimear.morrissey@huawei.com>
From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Add a new API that allows the caller to downgrade from wrlock
to rdlock. Note that the caller is expected to hold the wrlock before
calling this function.
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
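For reviewers, a minimal standalone sketch of how the downgrade moves the lock counters. It models the lock with C11 atomics instead of the rte_* wrappers; the `model_*` names are hypothetical, the `MODEL_*` constants are assumed to mirror the layout in rte_pflock.h, and the busy-wait loops are simplified. It is an illustration under those assumptions, not the EAL implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative constants, assumed to mirror the rte_pflock.h layout. */
#define MODEL_PRES 0x2u    /* writer-present bit in rd.in */
#define MODEL_PHID 0x1u    /* phase-id bit in rd.in */
#define MODEL_LSB  0xFFF0u /* mask that keeps only the reader count */
#define MODEL_RINC 0x100u  /* reader increment */

struct model_pflock {
	struct {
		_Atomic uint16_t in;
		_Atomic uint16_t out;
	} rd, wr;
};

static void
model_write_lock(struct model_pflock *pf)
{
	uint16_t ticket, w;

	/* Take a writer ticket and wait for our turn among writers. */
	ticket = atomic_fetch_add_explicit(&pf->wr.in, 1, memory_order_relaxed);
	while (atomic_load_explicit(&pf->wr.out, memory_order_acquire) != ticket)
		;

	/* Announce the writer to readers, then wait for earlier readers. */
	w = MODEL_PRES | (ticket & MODEL_PHID);
	ticket = atomic_fetch_add_explicit(&pf->rd.in, w, memory_order_relaxed);
	while (atomic_load_explicit(&pf->rd.out, memory_order_acquire) != ticket)
		;
}

static void
model_write_downgrade(struct model_pflock *pf)
{
	/* Register the caller as a reader while writers are still excluded. */
	atomic_fetch_add_explicit(&pf->rd.in, MODEL_RINC, memory_order_acq_rel);
	/* Clear the writer bits so blocked readers can enter the read phase. */
	atomic_fetch_and_explicit(&pf->rd.in, MODEL_LSB, memory_order_release);
	/* Let the next writer take its turn. */
	atomic_fetch_add_explicit(&pf->wr.out, 1, memory_order_release);
}

static void
model_read_unlock(struct model_pflock *pf)
{
	atomic_fetch_add_explicit(&pf->rd.out, MODEL_RINC, memory_order_release);
}
```

With the real API the intended sequence is rte_pflock_write_lock(), update the data, rte_pflock_write_downgrade(), read the data, then rte_pflock_read_unlock(). The lock must be released with the read-side unlock after a downgrade, since the write side has already been handed on.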
lib/eal/include/rte_pflock.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/lib/eal/include/rte_pflock.h b/lib/eal/include/rte_pflock.h
index 6797ce5920..ed5255b3b5 100644
--- a/lib/eal/include/rte_pflock.h
+++ b/lib/eal/include/rte_pflock.h
@@ -179,6 +179,27 @@ rte_pflock_write_unlock(rte_pflock_t *pf)
rte_atomic_fetch_add_explicit(&pf->wr.out, 1, rte_memory_order_release);
}
+/**
+ * Release a pflock held for writing, while keeping lock for reading.
+ *
+ * @param pf
+ * A pointer to a pflock structure.
+ */
+static inline void
+rte_pflock_write_downgrade(rte_pflock_t *pf)
+{
+ /* Migrate from write phase to read phase. */
+ rte_atomic_fetch_add_explicit(&pf->rd.in, RTE_PFLOCK_RINC,
+ rte_memory_order_acq_rel);
+ rte_atomic_fetch_and_explicit(&pf->rd.in, RTE_PFLOCK_LSB,
+ rte_memory_order_release);
+
+ /* Allow other writers to continue. */
+ rte_atomic_fetch_add_explicit(&pf->wr.out, 1,
+ rte_memory_order_release);
+}
+
+
#ifdef __cplusplus
}
#endif
--
2.51.0
^ permalink raw reply related
* [PATCH 2/2] app/test: add stress tests for rwlock and pflock
From: Eimear Morrissey @ 2026-06-10 9:11 UTC (permalink / raw)
To: dev
In-Reply-To: <20260610091147.88412-1-eimear.morrissey@huawei.com>
Add stress tests for pflock. Since the logic is generic enough for
rwlock, run them against rwlock too.
Signed-off-by: Eimear Morrissey <eimear.morrissey@huawei.com>
---
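The harness picks the lock implementation through a table of function pointers, with the downgrade hook left unset for locks that do not provide one. A minimal standalone sketch of that pattern follows. All `demo_*` names and the flag-based fake lock are hypothetical stand-ins for illustration; this is not the code added by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock: it only records what is held. */
struct demo_lock {
	int write_held;
	int read_held;
};

struct demo_ops {
	const char *name;
	void (*write_lock)(struct demo_lock *l);
	void (*write_unlock)(struct demo_lock *l);
	void (*read_unlock)(struct demo_lock *l);
	void (*write_downgrade)(struct demo_lock *l); /* optional, may be NULL */
};

static void demo_write_lock(struct demo_lock *l) { l->write_held = 1; }
static void demo_write_unlock(struct demo_lock *l) { l->write_held = 0; }
static void demo_read_unlock(struct demo_lock *l) { l->read_held = 0; }

static void
demo_downgrade(struct demo_lock *l)
{
	l->write_held = 0;
	l->read_held = 1;
}

static const struct demo_ops demo_with_downgrade = {
	.name = "with_downgrade",
	.write_lock = demo_write_lock,
	.write_unlock = demo_write_unlock,
	.read_unlock = demo_read_unlock,
	.write_downgrade = demo_downgrade,
};

static const struct demo_ops demo_plain = {
	.name = "plain",
	.write_lock = demo_write_lock,
	.write_unlock = demo_write_unlock,
	.read_unlock = demo_read_unlock,
	/* .write_downgrade left NULL: this lock has no downgrade */
};

/* One writer iteration; returns 1 if the downgrade path was taken. */
static int
demo_writer_step(const struct demo_ops *ops, struct demo_lock *l)
{
	ops->write_lock(l);
	if (ops->write_downgrade != NULL) {
		ops->write_downgrade(l);
		ops->read_unlock(l);
		return 1;
	}
	ops->write_unlock(l);
	return 0;
}
```

In this sketch the writer falls back to a plain write unlock when no downgrade hook is present, so the lock is released on every path regardless of which table is passed in.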
app/test/meson.build | 2 +
app/test/test_pflock_stress.c | 76 ++++++
app/test/test_rwlock_stress.c | 59 +++++
app/test/test_rwlock_stress_impl.h | 393 +++++++++++++++++++++++++++++
4 files changed, 530 insertions(+)
create mode 100644 app/test/test_pflock_stress.c
create mode 100644 app/test/test_rwlock_stress.c
create mode 100644 app/test/test_rwlock_stress_impl.h
diff --git a/app/test/meson.build b/app/test/meson.build
index 61024125a7..f85ad617ce 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -140,6 +140,7 @@ source_file_deps = {
'test_pdump.c': ['pdump'] + sample_packet_forward_deps,
'test_per_lcore.c': [],
'test_pflock.c': [],
+ 'test_pflock_stress.c': [],
'test_pie.c': ['sched'],
'test_pmd_af_packet.c': ['net_af_packet', 'ethdev', 'bus_vdev'],
'test_pmd_pcap.c': ['net_pcap', 'ethdev', 'bus_vdev'] + packet_burst_generator_deps,
@@ -178,6 +179,7 @@ source_file_deps = {
'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
'test_ring_stress.c': ['ptr_compress'],
'test_rwlock.c': [],
+ 'test_rwlock_stress.c': [],
'test_sched.c': ['net', 'sched'],
'test_security.c': ['net', 'security'],
'test_security_inline_macsec.c': ['ethdev', 'security'],
diff --git a/app/test/test_pflock_stress.c b/app/test/test_pflock_stress.c
new file mode 100644
index 0000000000..cafc5defba
--- /dev/null
+++ b/app/test/test_pflock_stress.c
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Huawei Technologies Co., Ltd
+ */
+
+#include "test_rwlock_stress_impl.h"
+
+/* Pflock operation implementations */
+static void
+pflock_init_fn(struct rwlock_stress_lock *lock)
+{
+ rte_pflock_init(&lock->lock.pflock);
+}
+
+static void
+pflock_read_lock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_pflock_read_lock(&lock->lock.pflock);
+}
+
+static void
+pflock_read_unlock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_pflock_read_unlock(&lock->lock.pflock);
+}
+
+static void
+pflock_write_lock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_pflock_write_lock(&lock->lock.pflock);
+}
+
+static void
+pflock_write_unlock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_pflock_write_unlock(&lock->lock.pflock);
+}
+
+static void
+pflock_write_downgrade_fn(struct rwlock_stress_lock *lock)
+{
+ rte_pflock_write_downgrade(&lock->lock.pflock);
+}
+
+/* Pflock operations table */
+static const struct rwlock_ops pflock_ops = {
+ .name = "pflock",
+ .init = pflock_init_fn,
+ .read_lock = pflock_read_lock_fn,
+ .read_unlock = pflock_read_unlock_fn,
+ .write_lock = pflock_write_lock_fn,
+ .write_unlock = pflock_write_unlock_fn,
+ .write_downgrade = pflock_write_downgrade_fn,
+};
+
+static const struct test_descriptor pflock_specific_tests[] = {
+{
+ .name = "write_downgrade",
+ .num_readers_pct = 50,
+ .reader_delay_us = 0,
+ .writer_delay_us = 0,
+ .flags = DOWNGRADE_TEST,
+ },
+};
+
+static int
+run_pflock_tests(void)
+{
+ int ret = 0;
+ ret |= run_test_suite("PFLOCK Common Stress Tests", &pflock_ops,
+ tests, RTE_DIM(tests));
+ ret |= run_test_suite("PFLOCK Specific Stress Tests", &pflock_ops,
+ pflock_specific_tests, RTE_DIM(pflock_specific_tests));
+ return ret ? -1 : 0;
+}
+
+REGISTER_STRESS_TEST(pflock_stress_autotest, run_pflock_tests);
diff --git a/app/test/test_rwlock_stress.c b/app/test/test_rwlock_stress.c
new file mode 100644
index 0000000000..5d151f3f8f
--- /dev/null
+++ b/app/test/test_rwlock_stress.c
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Huawei Technologies Co., Ltd
+ */
+
+#include "test_rwlock_stress_impl.h"
+
+/* RWLock operation implementations */
+static void
+rwlock_init_fn(struct rwlock_stress_lock *lock)
+{
+ rte_rwlock_init(&lock->lock.rwlock);
+}
+
+static void
+rwlock_read_lock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_rwlock_read_lock(&lock->lock.rwlock);
+}
+
+static void
+rwlock_read_unlock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_rwlock_read_unlock(&lock->lock.rwlock);
+}
+
+static void
+rwlock_write_lock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_rwlock_write_lock(&lock->lock.rwlock);
+}
+
+static void
+rwlock_write_unlock_fn(struct rwlock_stress_lock *lock)
+{
+ rte_rwlock_write_unlock(&lock->lock.rwlock);
+}
+
+/* RWLock operations table */
+static const struct rwlock_ops rwlock_ops = {
+ .name = "rwlock",
+ .init = rwlock_init_fn,
+ .read_lock = rwlock_read_lock_fn,
+ .read_unlock = rwlock_read_unlock_fn,
+ .write_lock = rwlock_write_lock_fn,
+ .write_unlock = rwlock_write_unlock_fn,
+};
+
+static int
+run_rwlock_tests(void)
+{
+ int ret = 0;
+
+ ret |= run_test_suite("RWLOCK Stress Tests", &rwlock_ops, tests,
+ RTE_DIM(tests));
+
+ return ret ? -1 : 0;
+}
+
+REGISTER_STRESS_TEST(rwlock_stress_autotest, run_rwlock_tests);
diff --git a/app/test/test_rwlock_stress_impl.h b/app/test/test_rwlock_stress_impl.h
new file mode 100644
index 0000000000..d28ccd76e0
--- /dev/null
+++ b/app/test/test_rwlock_stress_impl.h
@@ -0,0 +1,393 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Huawei Technologies Co., Ltd
+ */
+
+#ifndef _TEST_RWLOCK_STRESS_H_
+#define _TEST_RWLOCK_STRESS_H_
+
+/**
+ * Generic reader-writer lock stress test.
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <stdbool.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_launch.h>
+#include <rte_per_lcore.h>
+#include <rte_malloc.h>
+#include <rte_pflock.h>
+#include <rte_random.h>
+#include <rte_rwlock.h>
+
+#include "test.h"
+
+#define TEST_DURATION_SEC 5
+#define COUNTER_ARRAY_SIZE 1024
+#define DOWNGRADE_TEST 0x1 /* Will attempt to downgrade from write to read lock */
+#define DYNAMIC_ROLES 0x2 /* Threads can switch between reader/writer roles */
+
+struct rwlock_stress_lock;
+
+/**
+ * Lock operations interface.
+ */
+struct rwlock_ops {
+ const char *name;
+
+ void (*init)(struct rwlock_stress_lock *lock);
+ void (*read_lock)(struct rwlock_stress_lock *lock);
+ void (*read_unlock)(struct rwlock_stress_lock *lock);
+ void (*write_lock)(struct rwlock_stress_lock *lock);
+ void (*write_unlock)(struct rwlock_stress_lock *lock);
+ void (*write_downgrade)(struct rwlock_stress_lock *lock);
+};
+
+/**
+ * Generic lock structure.
+ */
+struct rwlock_stress_lock {
+ const struct rwlock_ops *ops;
+
+ union {
+ struct rte_pflock pflock;
+ rte_rwlock_t rwlock;
+ } lock;
+};
+
+/**
+ * Per-lcore statistics
+ */
+struct lcore_stats {
+ uint64_t reader_ops;
+ uint64_t writer_ops;
+ uint64_t local_counter;
+ uint64_t reader_errors;
+ uint64_t writer_errors;
+ uint64_t acquire_time;
+} __rte_cache_aligned;
+
+/**
+ * Test controls
+ */
+struct test_descriptor {
+ const char *name;
+ uint32_t num_readers_pct; /* Percentage of workers as readers (0-100) */
+ uint32_t reader_delay_us; /* Microseconds to delay in reader */
+ uint32_t writer_delay_us; /* Microseconds to delay in writer */
+ uint32_t flags; /* Specialist test behaviour */
+};
+
+/**
+ * Shared test state.
+ */
+struct rwlock_test_shared {
+ struct rwlock_stress_lock lock;
+ volatile uint64_t counter;
+ volatile uint64_t counter_array[COUNTER_ARRAY_SIZE];
+ volatile bool stop;
+ uint32_t num_readers;
+ uint32_t num_writers;
+ const struct test_descriptor *test;
+ struct lcore_stats stats[RTE_MAX_LCORE];
+} __rte_cache_aligned;
+
+/* Test descriptors array */
+static const struct test_descriptor tests[] = {
+ {
+ .name = "basic_reader_writer",
+ .num_readers_pct = 75,
+ .reader_delay_us = 0,
+ .writer_delay_us = 0,
+ },
+ {
+ .name = "long_hold",
+ .num_readers_pct = 67,
+ .reader_delay_us = 100,
+ .writer_delay_us = 100,
+ },
+ {
+ .name = "rapid_acquire_release",
+ .num_readers_pct = 67,
+ .reader_delay_us = 0,
+ .writer_delay_us = 0,
+ },
+ {
+ .name = "dynamic_roles",
+ .num_readers_pct = 75,
+ .reader_delay_us = 0,
+ .writer_delay_us = 0,
+ .flags = DYNAMIC_ROLES,
+ },
+};
+
+static inline bool
+should_be_writer(uint32_t num_readers, uint32_t flags)
+{
+ uint32_t total_lcores = rte_lcore_count();
+ if (total_lcores <= 1)
+ return true;
+
+ if (flags & DYNAMIC_ROLES) {
+ uint32_t readers_pct = (num_readers * 100) / (total_lcores - 1);
+ return (rte_rand_max(100) >= readers_pct);
+ }
+
+ unsigned int idx = rte_lcore_index(rte_lcore_id()) - 1;
+ return idx >= num_readers;
+}
+
+static void
+handle_error(struct rwlock_test_shared *s, unsigned int lcore_id,
+ bool write_lock, const char *func, int line)
+{
+ s->stop = true;
+ if (write_lock) {
+ s->stats[lcore_id].writer_errors++;
+ s->lock.ops->write_unlock(&s->lock);
+ } else {
+ s->stats[lcore_id].reader_errors++;
+ s->lock.ops->read_unlock(&s->lock); /* callers return without unlocking */
+ }
+ printf("ERROR: lcore:%u: %s:%d early termination\n", lcore_id, func, line);
+}
+
+static int
+handle_writer_work(struct rwlock_test_shared *s, unsigned int lcore_id,
+ const struct test_descriptor *test, uint64_t delta)
+{
+ s->lock.ops->write_lock(&s->lock);
+ uint64_t old_val = s->counter;
+ s->counter += delta;
+ s->stats[lcore_id].local_counter += delta;
+
+ /* Verify increment was atomic */
+ if (s->counter != old_val + delta) {
+ handle_error(s, lcore_id, true, __func__, __LINE__);
+ return -1;
+ }
+
+ /* Update all array elements */
+ for (uint32_t i = 0; i < COUNTER_ARRAY_SIZE; i++) {
+ s->counter_array[i] += delta;
+ if (s->counter_array[i] != s->counter) {
+ handle_error(s, lcore_id, true, __func__, __LINE__);
+ return -1;
+ }
+ }
+
+ if (test->flags & DOWNGRADE_TEST) {
+ /* Downgrade to read lock */
+ if (s->lock.ops->write_downgrade) {
+ s->lock.ops->write_downgrade(&s->lock);
+ /* Verify array consistency under read lock */
+ for (uint32_t i = 0; i < COUNTER_ARRAY_SIZE; i++) {
+ if (s->counter_array[i] != s->counter) {
+ handle_error(s, lcore_id, false, __func__, __LINE__);
+ return -1;
+ }
+ }
+ s->lock.ops->read_unlock(&s->lock);
+ }
+ } else {
+ if (test->writer_delay_us > 0)
+ rte_delay_us_sleep(test->writer_delay_us);
+ s->lock.ops->write_unlock(&s->lock);
+ }
+ s->stats[lcore_id].writer_ops++;
+ return 0;
+}
+
+static int
+handle_reader_work(struct rwlock_test_shared *s, unsigned int lcore_id,
+ const struct test_descriptor *test)
+{
+ uint64_t local_counter;
+
+ s->lock.ops->read_lock(&s->lock);
+ local_counter = s->counter;
+
+ /* Verify array consistency */
+ for (uint32_t i = 0; i < COUNTER_ARRAY_SIZE; i++) {
+ if (s->counter_array[i] != local_counter) {
+ handle_error(s, lcore_id, false, __func__, __LINE__);
+ return -1;
+ }
+ }
+
+ if (test->reader_delay_us > 0)
+ rte_delay_us_sleep(test->reader_delay_us);
+
+ /* Verify counter didn't change during read */
+ if (s->counter != local_counter) {
+ handle_error(s, lcore_id, false, __func__, __LINE__);
+ return -1;
+ }
+
+ s->lock.ops->read_unlock(&s->lock);
+ s->stats[lcore_id].reader_ops++;
+ return 0;
+}
+
+static int
+lcore_function(void *arg)
+{
+ struct rwlock_test_shared *s = arg;
+ unsigned int lcore_id = rte_lcore_id();
+ bool is_writer = should_be_writer(s->num_readers, s->test->flags);
+ const struct test_descriptor *test = s->test;
+
+ while (!s->stop) {
+ uint64_t start = rte_get_timer_cycles();
+ uint64_t delta = (rte_rand() % 64) + 1;
+ int ret;
+
+ if (is_writer)
+ ret = handle_writer_work(s, lcore_id, test, delta);
+ else
+ ret = handle_reader_work(s, lcore_id, test);
+
+ if (ret < 0)
+ continue;
+
+ /* Record max acquire time */
+ uint64_t wait_time = rte_get_timer_cycles() - start;
+ if (wait_time > s->stats[lcore_id].acquire_time)
+ s->stats[lcore_id].acquire_time = wait_time;
+ }
+
+ return 0;
+}
+
+static int
+verify(struct rwlock_test_shared *s)
+{
+ int ret = 0;
+ unsigned int lcore_id;
+ uint64_t total_reader_errors = 0;
+ uint64_t total_writer_errors = 0;
+ uint64_t sum_local_counters = 0;
+
+ /* Calculate errors and counters */
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ total_reader_errors += s->stats[lcore_id].reader_errors;
+ total_writer_errors += s->stats[lcore_id].writer_errors;
+ sum_local_counters += s->stats[lcore_id].local_counter;
+ }
+
+ /* Verify sum of per-lcore counters matches the shared counter */
+ if (s->counter != sum_local_counters) {
+ printf(" FAILED: shared counter=%" PRIu64
+ " sum of local counters=%" PRIu64 "\n",
+ s->counter, sum_local_counters);
+ ret = -1;
+ }
+
+ if (total_reader_errors) {
+ printf(" FAILED: reader errors=%" PRIu64 "\n",
+ total_reader_errors);
+ ret = -1;
+ }
+
+ if (total_writer_errors) {
+ printf(" FAILED: writer errors =%" PRIu64 "\n",
+ total_writer_errors);
+ ret = -1;
+ }
+
+ /* Verify array consistency */
+ for (uint32_t i = 0; i < COUNTER_ARRAY_SIZE; i++) {
+ if (s->counter_array[i] != s->counter) {
+ printf(" FAILED: counter_array[%u]=%" PRIu64 " counter=%" PRIu64 "\n",
+ i, s->counter_array[i], s->counter);
+ ret = -1;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+static int
+test_rwlock_stress_impl(const struct rwlock_ops *ops,
+ const struct test_descriptor *ind_test)
+{
+ struct rwlock_test_shared shared = {0};
+ uint64_t start_time, end_time;
+ uint64_t total_reader_ops = 0;
+ uint64_t total_writer_ops = 0;
+ uint64_t max_acquire_time = 0;
+ unsigned int lcore_id;
+ int ret = 0;
+
+ shared.lock.ops = ops;
+ shared.lock.ops->init(&shared.lock);
+ shared.test = ind_test;
+ shared.num_readers = (ind_test->num_readers_pct * (rte_lcore_count() - 1)) / 100;
+ shared.num_writers = (rte_lcore_count() - 1) - shared.num_readers;
+
+ printf(" %u readers, %u writers\n", shared.num_readers, shared.num_writers);
+
+ /* Launch workers */
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ rte_eal_remote_launch(lcore_function, &shared, lcore_id);
+ }
+
+ /* Run test for duration */
+ start_time = rte_get_timer_cycles();
+ rte_delay_ms(TEST_DURATION_SEC * 1000);
+
+ /* Stop workers and collect stats */
+ shared.stop = true;
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ rte_eal_wait_lcore(lcore_id);
+ if (shared.stats[lcore_id].acquire_time > max_acquire_time)
+ max_acquire_time = shared.stats[lcore_id].acquire_time;
+ total_reader_ops += shared.stats[lcore_id].reader_ops;
+ total_writer_ops += shared.stats[lcore_id].writer_ops;
+ }
+ end_time = rte_get_timer_cycles();
+
+ printf(" %"PRIu64" reader ops, %"PRIu64" writer ops, "
+ "total time: %.2f seconds\n",
+ total_reader_ops, total_writer_ops,
+ (double)(end_time - start_time) / rte_get_timer_hz());
+
+ ret = verify(&shared);
+ if (ret == 0) {
+ uint64_t hz = rte_get_timer_hz();
+ printf(" PASSED: All checks passed (max wait: %.2f us)\n",
+ (double)max_acquire_time * 1000000 / hz);
+ }
+ return ret;
+}
+
+/**
+ * Run a test suite with the given title and tests
+ */
+static int
+run_test_suite(const char *title, const struct rwlock_ops *ops,
+ const struct test_descriptor suite[], uint32_t count)
+{
+ uint32_t failed = 0;
+
+ printf("%s\n===================\n\n", title);
+ for (uint32_t i = 0; i < count; i++) {
+ printf("Test %u/%u: %s\n", i + 1, count, suite[i].name);
+ if (test_rwlock_stress_impl(ops, &suite[i]) < 0)
+ failed++;
+ printf("\n");
+ }
+ printf("===================\n");
+ printf("Results: %u/%u passed, %u failed\n", count - failed, count, failed);
+
+ return failed ? -1 : 0;
+}
+
+#endif /* _TEST_RWLOCK_STRESS_H_ */
--
2.51.0
^ permalink raw reply related
* [PATCH 0/2] Pflock downgrade & stress tests for pflock/rwlock libraries
From: Eimear Morrissey @ 2026-06-10 9:11 UTC (permalink / raw)
To: dev
Add a new downgrade API for pflock. Add stress tests for it and,
by extension, for the rest of the pflock/rwlock libraries.
Eimear Morrissey (1):
app/test: add stress tests for rwlock and pflock
Konstantin Ananyev (1):
eal/pflock: add API to downgrade from wr to rd lock
app/test/meson.build | 2 +
app/test/test_pflock_stress.c | 76 ++++++
app/test/test_rwlock_stress.c | 59 +++++
app/test/test_rwlock_stress_impl.h | 393 +++++++++++++++++++++++++++++
lib/eal/include/rte_pflock.h | 21 ++
5 files changed, 551 insertions(+)
create mode 100644 app/test/test_pflock_stress.c
create mode 100644 app/test/test_rwlock_stress.c
create mode 100644 app/test/test_rwlock_stress_impl.h
--
2.51.0
^ permalink raw reply
* Re: [PATCH 1/3] net/iavf: downgrade opcode 0 ARQ log to debug
From: Bruce Richardson @ 2026-06-10 8:26 UTC (permalink / raw)
To: Loftus, Ciara; +Cc: dev@dpdk.org, Talluri, ChaitanyababuX
In-Reply-To: <IA4PR11MB927828E81A5F1EC7C81BABF48E1A2@IA4PR11MB9278.namprd11.prod.outlook.com>
On Wed, Jun 10, 2026 at 09:13:21AM +0100, Loftus, Ciara wrote:
> > Subject: Re: [PATCH 1/3] net/iavf: downgrade opcode 0 ARQ log to debug
> >
> > On Mon, Jun 08, 2026 at 02:55:16PM +0000, Ciara Loftus wrote:
> > > From: Talluri Chaitanyababu <chaitanyababux.talluri@intel.com>
> > >
> > > After admin queue reinitialisation, completions from uninitialised
> > > ARQ ring descriptor memory may arrive before any real PF response.
> > > These carry opcode 0 (`VIRTCHNL_OP_UNKNOWN`) and trigger a WARNING
> > > log on every poll iteration, flooding the log during reset recovery.
> > >
> > > Treat opcode 0 as a distinct case and log it at DEBUG level, while
> > > retaining WARNING for genuine opcode mismatches.
> > >
> > > Signed-off-by: Talluri Chaitanyababu <chaitanyababux.talluri@intel.com>
> > > ---
> > > drivers/net/intel/iavf/iavf_vchnl.c | 11 +++++++++--
> > > 1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > Should this be backported as a bugfix?
>
> The issue has been present since the driver was added, but only
> triggered in practice more recently with the addition of the reset
> logic.
> I think this tag can be added:
>
> Fixes: 2d23ed74c079 ("net/iavf: enable iavf PMD")
> Cc: stable@dpdk.org
>
> Let me know if you need a respin.
Sorry, I didn't wait for your reply before applying the patch, and it has
already been pulled to main. Since the issue has only recently started to
appear, we are probably OK without a backport. If you think it needs one, I
suggest sending an extra copy of this patch to stable, so that it can be
added to the list there.
/Bruce
^ permalink raw reply
* RE: [PATCH 1/3] net/iavf: downgrade opcode 0 ARQ log to debug
From: Loftus, Ciara @ 2026-06-10 8:13 UTC (permalink / raw)
To: Richardson, Bruce; +Cc: dev@dpdk.org, Talluri, ChaitanyababuX
In-Reply-To: <aigjEm-RskU5s45v@bricha3-mobl1.ger.corp.intel.com>
> Subject: Re: [PATCH 1/3] net/iavf: downgrade opcode 0 ARQ log to debug
>
> On Mon, Jun 08, 2026 at 02:55:16PM +0000, Ciara Loftus wrote:
> > From: Talluri Chaitanyababu <chaitanyababux.talluri@intel.com>
> >
> > After admin queue reinitialisation, completions from uninitialised
> > ARQ ring descriptor memory may arrive before any real PF response.
> > These carry opcode 0 (`VIRTCHNL_OP_UNKNOWN`) and trigger a WARNING
> > log on every poll iteration, flooding the log during reset recovery.
> >
> > Treat opcode 0 as a distinct case and log it at DEBUG level, while
> > retaining WARNING for genuine opcode mismatches.
> >
> > Signed-off-by: Talluri Chaitanyababu <chaitanyababux.talluri@intel.com>
> > ---
> > drivers/net/intel/iavf/iavf_vchnl.c | 11 +++++++++--
> > 1 file changed, 9 insertions(+), 2 deletions(-)
> >
> Should this be backported as a bugfix?
The issue has been present since the driver was added, but only
triggered in practice more recently with the addition of the reset
logic.
I think this tag can be added:
Fixes: 2d23ed74c079 ("net/iavf: enable iavf PMD")
Cc: stable@dpdk.org
Let me know if you need a respin.
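
For reference, the logging rule from the commit message can be sketched as below. This is an illustrative standalone model, not the driver code: `arq_log_level()`, the `ARQ_LOG_*` values and the `DEMO_OP_UNKNOWN` constant are hypothetical names. Only the rule itself (opcode 0 logs at DEBUG, any other mismatch stays at WARNING) is taken from the patch description.

```c
#include <assert.h>
#include <stdint.h>

enum arq_log {
	ARQ_LOG_NONE,    /* opcode matches, nothing to report */
	ARQ_LOG_DEBUG,   /* opcode 0 from uninitialised descriptor memory */
	ARQ_LOG_WARNING  /* genuine opcode mismatch */
};

#define DEMO_OP_UNKNOWN 0u /* opcode 0, as named in the commit message */

/* Decide the log level for a received opcode against the expected one. */
static enum arq_log
arq_log_level(uint32_t received, uint32_t expected)
{
	if (received == expected)
		return ARQ_LOG_NONE;
	if (received == DEMO_OP_UNKNOWN)
		return ARQ_LOG_DEBUG;
	return ARQ_LOG_WARNING;
}
```

Checking for opcode 0 before the generic mismatch case is what keeps the poll loop from emitting a WARNING on every iteration during reset recovery.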
^ permalink raw reply